Friday, 28 November 2014

Scraping XML Tables with R

A couple of my good friends also recently started a sports analytics blog. We’ve decided to collaborate on a couple of studies revolving around NBA data found at www.basketball-reference.com. This will be the first part of that project!

Data scientists need data. The internet has lots of data. How can I get that data into R? Scrape it!

People have been scraping websites for as long as there have been websites. It’s gotten pretty easy using R/Python/whatever other tool you want to use. This post shows how to use R to scrape the demographic information for all NBA and ABA players listed at www.basketball-reference.com.

Here’s the code:

###### Settings

library(XML)

 ###### URLs

url<-paste0("http://www.basketball-reference.com/players/",letters,"/")

len<-length(url)

 ###### Reading data

tbl<-readHTMLTable(url[1])[[1]]

 for (i in 2:len)

    {tbl<-rbind(tbl,readHTMLTable(url[i])[[1]])}

 ###### Formatting data

colnames(tbl)<-c("Name","StartYear","EndYear","Position","Height","Weight","BirthDate","College")

tbl$BirthDate<-as.Date(tbl$BirthDate[1],format="%B %d, %Y")

Created by Pretty R at inside-R.org

And here’s the result:Result

Source: http://www.r-bloggers.com/scraping-xml-tables-with-r/

Wednesday, 26 November 2014

Data Mining KNN Classifier

Q1   

Suppose a data analyst working for an insurance company was asked to build a predictive model for predicting weather a customer will buy a mobile home insurance policy. S/he tried kNN classifier with different number of neighbours (k=1,2,3,4,5). S/he got the following F-scores measured on the training data: (1.0; 0.92; 0.90; 0.85; 0.82). Based on that the analyst decided to deploy kNN with k=1. Was it a good choice? How would you select an optimal number of neighbours in this case?

1 Answer

It is not a good idea to select a parameter of a prediction algorithm using the whole training set as the result will be biased towards this particular training set and has no information about generalization performance (i.e. performance towards unseen cases). You should apply a cross-validation technique e.g. 10-fold cross-validation to select the best K (i.e. K with largest F-value) within a range. This involves splitting your training data in 10 equal parts retain 9 parts for training and 1 for validation. Iterate such that each part has been left out for validation. If you take enough folds this will allow you as well to obtain statistics of the F-value and then you can test whether these values for different K values are statistically significant.

See e.g. also: http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_knn_training_crossvalidation.htm

The subtlety here however is that there is likely a dependency between the number of data points for prediction and the K-value. So If you apply cross-validation you use 9/10 of the training set for training...Not sure whether any research has been performed on this and how to correct for that in the final training set. Anyway most software packages just use the abovementioned techniques e.g. see SPSS in the link. A solution is to use leave-one-out cross-validation (each data samples is left out once for testing) in that case you have N-1 training samples(the original training set has N).

Source:http://stackoverflow.com/questions/21121509/data-mining-knn-classifier?rq=1

Sunday, 23 November 2014

A Content Marketer's Guide to Data Scraping

As digital marketers, big data should be what we use to inform a lot of the decisions we make. Using intelligence to understand what works within your industry is absolutely crucial within content campaigns, but it blows my mind to know that so many businesses aren't focusing on it.

One reason I often hear from businesses is that they don't have the budget to invest in complex and expensive tools that can feed in reams of data to them. That said, you don't always need to invest in expensive tools to gather valuable intelligence — this is where data scraping comes in.

Just so you understand, here's a very brief overview of what data scraping is from Wikipedia:

    "Data scraping is a technique in which a computer program extracts data from human-readable output coming from another program."

Essentially, it involves crawling through a web page and gathering nuggets of information that you can use for your analysis. For example, you could search through a site like Search Engine Land and scrape the author names of each of the posts that have been published, and then you could correlate this to social share data to find who the top performing authors are on that website.

Hopefully, you can start to see how this data can be valuable. What's more, it doesn't require any coding knowledge — if you're able to follow my simple instructions, you can start gathering information that will inform your content campaigns. I've recently used this research to help me get a post published on the front page of BuzzFeed, getting viewed over 100,000 times and channeling a huge amount of traffic through to my blog.

Disclaimer: One thing that I really need to stress before you read on is the fact that scraping a website may breach its terms of service. You should ensure that this isn't the case before carrying out any scraping activities. For example, Twitter completely prohibits the scraping of information on their site. This is from their Terms of Service:

    "crawling the Services is permissible if done in accordance with the provisions of the robots.txt file, however, scraping the Services without the prior consent of Twitter is expressly prohibited"

Google similarly forbids the scraping of content from their web properties:

    Google's Terms of Service do not allow the sending of automated queries of any sort to our system without express permission in advance from Google.

So be careful, kids.
Content analysis

Mastering the basics of data scraping will open up a whole new world of possibilities for content analysis. I'd advise any content marketer (or at least a member of their team) to get clued up on this.

Before I get started on the specific examples, you'll need to ensure that you have Microsoft Excel on your computer (everyone should have Excel!) and also the SEO Tools plugin for Excel (free download here). I put together a full tutorial on using the SEO tools plugin that you may also be interested in.

Alongside this, you'll want a web crawling tool like Screaming Frog's SEO Spider or Xenu Link Sleuth (both have free options). Once you've got these set up, you'll be able to do everything that I outline below.

So here are some ways in which you can use scraping to analyse content and how this can be applied into your content marketing campaigns:

1. Finding the different authors of a blog

Analysing big publications and blogs to find who the influential authors are can give you some really valuable data. Once you have a list of all the authors on a blog, you can find out which of those have created content that has performed well on social media, had a lot of engagement within the comments and also gather extra stats around their social following, etc.

I use this information on a daily basis to build relationships with influential writers and get my content placed on top tier websites. Here's how you can do it:

Step 1: Gather a list of the URLs from the domain you're analysing using Screaming Frog's SEO Spider. Simply add the root domain into Screaming Frog's interface and hit start (if you haven't used this tool before, you can check out my tutorial here).

Once the tool has finished gathering all the URLs (this can take a little while for big websites), simply export them all to an Excel spreadsheet.

Step 2: Open up Google Chrome and navigate to one of the article pages of the domain you're analysing and find where they mention the author's name (this is usually within an author bio section or underneath the post title). Once you've found this, right-click their name and select inspect element (this will bring up the Chrome developer console).

Within the developer console, the line of code associated to the author's name that you selected will be highlighted (see the below image). All you need to do now is right-click on the highlighted line of code and press Copy XPath.

For the Search Engine Land website, the following code would be copied:

//*[@id="leftCol"]/div[2]/p/span/a

This may not make any sense to you at this stage, but bear with me and you'll see how it works.

Step 3: Go back to your spreadsheet of URLs and get rid of all the extra information that Screaming Frog gives you, leaving just the list of raw URLs – add these to the first column (column A) of your worksheet.

Step 4: In cell B2, add the following formula:

=XPathOnUrl(A2,"//*[@id='leftCol']/div[2]/p/span/a")

Just to break this formula down for you, the function XPathOnUrl allows you to use the XPath code directly within (this is with the SEO Tools plugin installed; it won't work without this). The first element of the function specifies which URL we are going to scrape. In this instance I've selected cell A2, which contains a URL from the crawl I did within Screaming Frog (alternatively, you could just type the URL, making sure that you wrap it within quotation marks).

Finally, the last part of the function is our XPath code that we gathered. One thing to note is that you have to remove the quotation marks from the code and replace them with apostrophes. In this example, I'm referring to the "leftCol" section, which I've changed to ‘leftCol' — if you don't do this, Excel won't read the formula correctly.

Once you press enter, there may be a couple of seconds delay whilst the SEO Tools plugin crawls the page, then it will return a result. It's worth mentioning that within the example I've given above, we're looking for author names on article pages, so if I try to run this on a URL that isn't an article (e.g. the homepage) I will get an error.

For those interested, the XPath code itself works by starting at the top of the code of the URL specified and following the instructions outlined to find on-page elements and return results. So, for the following code:

//*[@id='leftCol']/div[2]/p/span/a

We're telling it to look for any element (//*) that has an id of leftCol (@id='leftCol') and then go down to the second div tag after this (div[2]), followed by a p tag, a span tag and finally, an a tag (/p/span/a). The result returned should be the text within this a tag.

Don't worry if you don't understand this, but if you do, it will help you to create your own XPath. For example, if you wanted to grab the output of an a tag that has rel=author attached to it (another great way of finding page authors), then you could use some XPath that looked a little something like this:

//a[@rel='author']

As a full formula within Excel it would look something like this:

=XPathOnUrl(A2,"//a[@rel='author']")

Once you've created the formula, you can drag it down and apply it to a large number of URLs all at once. This is a huge time-saver as you'd have to manually go through each website and copy/paste each author to get the same results without scraping – I don't need to explain how long this would take.

Now that I've explained the basics, I'll show you some other ways in which scraping can be used…

2. Finding extra details around page authors

So, we've found a list of author names, which is great, but to really get some more insight into the authors we will need more data. Again, this can often be scraped from the website you're analysing.

Most blogs/publications that list the names of the article author will actually have individual author pages. Again, using Search Engine Land as an example, if you click my name at the top of this post you will be taken to a page that has more details on me, including my Twitter profile, Google+ profile and LinkedIn profile. This is the kind of data that I'd want to gather because it gives me a point of contact for the author I'm looking to get in touch with.

Here's how you can do it.

Step 1: First we need to get the author profile URLs so that we can scrape the extra details off of them. To do this, you can use the same approach to find the author's name, with just a little addition to the formula:

=XPathOnUrl(A2,"//a[@rel='author']", <strong>"href"</strong>)

The addition of the "href" part of the formula will extract the output of the href attribute of the atag. In Lehman terms, it will find the hyperlink attached to the author name and return that URL as a result.

Step 2: Now that we have the author profile page URLs, you can go on and gather the social media profiles. Instead of scraping the article URLs, we'll be using the profile URLs.

So, like last time, we need to find the XPath code to gather the Twitter, Google+ and LinkedIn links. To do this, open up Google Chrome and navigate to one of the author profile pages, right-click on the Twitter link and select Inspect Element.

Once you've done this, hover over the highlighted line of code within Chrome's developer tools, right-click and select Copy XPath.

Step 3: Finally, open up your Excel spreadsheet and add in the following formula (using the XPath that you've copied over):

=XPathOnUrl(C2,"//*[@id='leftCol']/div[2]/p/a[2]", "href")

Remember that this is the code for scraping Search Engine Land, so if you're doing this on a different website, it will almost certainly be different. One important thing to highlight here is that I've selected cell C2 here, which contains the URL of the author profile page and not just the article page. As well as this, you'll notice that I've included "href" at the end because we want the actual Twitter profile URL and not just the words ‘Twitter'.

You can now repeat this same process to get the Google+ and LinkedIn profile URLs and add it to your spreadsheet. Hopefully you're starting to see the value in this, and how it can be used to gather a lot of intelligence that can be used for all kinds of online activity, not least your SEO and social media campaigns.

3. Gathering the follower counts across social networks

Now that we have the author's social media accounts, it makes sense to get their follower counts so that they can be ranked based on influence within the spreadsheet.

Here are the final XPath formulae that you can plug straight into Excel for each network to get their follower counts. All you'll need to do is replace the text INSERT SOCIAL PROFILE URL with the cell reference to the Google+/LinkedIn URL:

Google+:

=XPathOnUrl(<strong>INSERTGOOGLEPROFILEURL</strong>,"//span[@class='BOfSxb']")

LinkedIn:

=XPathOnUrl(<strong>INSERTLINKEDINURL</strong>,"//dd[@class='overview-connections']/p/strong")

4. Scraping page titles

Once you've got a list of URLs, you're going to want to get an idea of what the content is actually about. Using this quick bit of XPath against any URL will display the title of the page:

=XPathOnUrl(A2,"//title")

To be fair, if you're using the SEO Tools plugin for Excel then you can just use the built-in feature to scrape page titles, but it's always handy to know how to do it manually!

A nice extra touch for analysis is to look at the number of words used within the page titles. To do this, use the following formula:

=CountWords(A2)

From this you can get an understanding of what the optimum title length of a post within a website is. This is really handy if you're pitching an article to a specific publication. If you make the post the best possible fit for the site and back up your decisions with historical data, you stand a much better chance of success.

Taking this a step further, you can gather the social shares for each URL using the following functions:

Twitter:

=TwitterCount(<strong>INSERTURLHERE</strong>)

Facebook:

=FacebookLikes(<strong>INSERTURLHERE</strong>)

Google+:

=GooglePlusCount(<strong>INSERTURLHERE</strong>)

Note: You can also use a tool like URL Profiler to pull in this data, which is much better for large data sets. The tool also helps you to gather large chunks of data from other social networks, link data sources like Ahrefs, Majestic SEO and Moz, which is awesome.

If you want to get even more social stats then you can use the SharedCount API, and this is how you go about doing it…

Firstly, create a new column in your Excel spreadsheet and add the following formula (where A2 is the URL of the webpage you want to gather social stats for):

=CONCATENATE("http://api.sharedcount.com/?url=",A2)

You should now have a cell that contains your webpage URL prefixed with the SharedCount API URL. This is what we will use to gather social stats. Now here's the Excel formula to use for each network (where B2 is the cell that contaiins the formula above):

StumbleUpon:

=JsonPathOnUrl(B2,"StumbleUpon")

Reddit:

=JsonPathOnUrl(B2,"Reddit")

Delicious:

=JsonPathOnUrl(B2,"Delicious")

Digg:

=JsonPathOnUrl(B2,"Diggs")

Pinterest:

=JsonPathOnUrl(B2,"Pinterest")

LinkedIn:

=JsonPathOnUrl(B2,"Linkedin")

Facebook Shares:

=JsonPathOnUrl(B2,"Facebook.share_count")

Facebook Comments:

=JsonPathOnUrl(B2,"Facebook.comment_count")

Once you have this data, you can start looking much deeper into the elements of a successful post. Here's an example of a chart that I created around a large sample of articles that I analysed within Upworthy.com.

The chart looks at the average number of social shares that an article on Upworthy receives vs the number of words within its title. This is invaluable data that can be used across a whole host of different on-page elements to get the perfect article template for the site you're pitching to.

See, big data is useful!

5. Date/time the post was published

Along with analysing the details of headlines that are working within a site, you may want to look at the optimal posting times for best results. This is something that I regularly do within my blogs to ensure that I'm getting the best possible return from the time I spend writing.

Every site is different, which makes it very difficult for an automated, one-size-fits-all tool to gather this information. Some sites will have this data within the <head> section of their webpages, but others will display it directly under the article headline. Again, Search Engine Land is a perfect example of a website doing this…

So here's how you can scrape this information from the articles on Search Engine Land:

=XPathOnUrl(<strong>INSERTARTICLEURL</strong>,"//*[@class='dateline']/text()")

Now you've got the date and time of the post. You may want to trim this down and reformat it for your data analysis, but you've got it all in Excel so that should be pretty easy.

Extra reading

Data scraping is seriously powerful, and once you've had a bit of a play around with it you'll also realise that it's not that complicated. The examples that I've given are just a starting point but once you get your creative head on, you'll soon start to see the opportunities that arise from this intelligence.

Here's some extra reading that you might find useful:

    http://findmyblogway.com/scraping-communities-with-xpath/

    http://builtvisible.com/data-entry-is-a-waste-of-time/

    http://www.seotakeaways.com/data-scraping-guide-for-seo/

    http://okdork.com/2014/04/30/the-step-by-step-guide-to-10x-growth-for-any-blog/

TL;DR

    Start using actual data to inform your content campaigns instead of going on your gut feeling.

    Gather intelligence around specific domains you want to target for content placement and create the perfect post for their audience.

    Get clued up on XPath and JSON through using the SEO Tools plugin for Excel.

    Spend more time analysing what content will get you results as opposed to what sites will give you links!

    Check the website's ToS before scraping.

Source:http://moz.com/blog/a-content-marketers-guide-to-data-scraping

Wednesday, 19 November 2014

NHL ending dry scraping of ice before overtime

TORONTO (AP) — The NHL will no longer dry scrape the ice before overtime.

Instituted this season in an effort to reduce the number of shootouts, the dry scraping will stop after Friday's games.

The general managers decided at their meeting Tuesday to make the change after the league talked to the players' union the past few days.

Beginning Saturday, ice crews around the league will again shovel the ice after regulation as they did in previous years. The GMs said the dry scrape was causing too much of a delay. Director of hockey operations Colin Campbell said the delays were lasting from more than four minutes to almost seven.

The dry scrape initially had been approved in hopes of reducing shootouts by improving scoring chances without unduly slowing play by recoating the ice.

The GMs also discussed expanded video review, including goaltender interference, and the possibility of three-on-three overtime. The American Hockey League is experimenting with the three-on-three format this season.

This annual meeting the day after the Hockey Hall of Fame induction usually doesn't produce actual changes, with the dry scrape providing an exception.

The main purpose is to set up the March meeting in Boca Raton, Florida, where these items will be further addressed.

Source:http://missoulian.com/sports/hockey/nhl-ending-dry-scraping-of-ice-before-overtime/article_3dd5473c-6102-5800-99f7-2c98be0f99ad.html

Web Scraping for SEO with these Open-Source Scrapers

When conducting Search Engine Optimization (SEO), we’re required to scrape websites for data, our campaigns, and reports for our clients. At the lowest level we utilize scraping to keep track of rankings on search engines like Google, Bing, and Yahoo, even keep a track of links on websites to know when it’s completed its lifespan. Then we’ve used them to help us aggregate data from APIs, RSS feeds, and websites to conduct some of our data mining to find patterns to help us become more competitive. 

So scraping is a function majority of companies (SEOmoz, Raventools, and Google) have to do to either save money, protect intellectual property, track trends, etc… Businesses can find infinite uses with scraping tools, it just depends if you’re an printed circuit board manufacturer looking for ideas on your e-mail marketing campaign or a Orange County based business trying to keep an eye out on the competition. which is why we’ve created a comprehensive list of open source scrapers out there to help all the businesses out there. Just keep in mind we haven’t used all of them!

Words of caution, web scrapers require knowledge specific to the language such as PHP & cURL. Take into considerations issues like cookie management, fault tolerance, organizing the data properly, not crashing the website being scraped, and making sure the website doesn’t prohibit scraping.

If you’re ready, here’s the list…

Erlang

    eBot

Java

    Heritrix
    Nutch
    Piggy Bank
    WebSPHINX
    WebHarvest

PHP

    PHPCrawl
    Snoopy
    SpiderMonkey

Python

    BeautifulSoap
    HarvestMan
    Scrape.py
    Scrapemark
    Scrapy **
    Mechanize

Ruby

    Anemone
    scRUBYt

We’ll come back and update this list as we encounter more! If you would like to submit a solution we missed, feel free. Also we’re looking for guides related to each of these, so if you know of any or would be interested in guesting blogging about one, let us know!

Source:http://www.annexcore.com/blog/web-scraping-for-seo-with-these-open-source-scrapers/

Monday, 17 November 2014

How to scrape data without coding? A step by step tutorial on import.io

Import.io (pronounced import-eye-oh) lets you scrape data from any website into a searchable database. It is perfect for gathering, aggregating and analysing data from websites without the need for coding skills. As Sally Hadadi, from Import.io, told Journalism.co.uk: the idea is to “democratise” data. “We want journalists to get the best information possible to encourage and enhance unique, powerful pieces of work and generally make their research much easier.” Different uses for journalists, supplemented by case studies, can be found here.

A beginner’s guide

After downloading and opening import.io browser, copy the URL of the page you want to scrape into the import.io browser. I decided to scrape the search results website of orphanages in London:

001 Orphanages in London

After opening the website, press the tiny pink button in top right corner of the browser and follow up with “Let’s get cracking!” in the bottom right menu which has just appeared.

Then, choose the type of scraping you want to perform. In my case, it’s a Crawler (we’ll be getting data from multiple similar pages on the same site):

crawler

And confirm the URL of the website you want to scrape by clicking “I’m there”.

As advised, choose “Detect optimal settings” and confirm the following:

data

In the menu “Rows per page” select the format in which data appears on the website, whether it is “single” or “multiple”. I’m opting for the multiple as my URL is a listing of multiple search results:multiple

Now, the time has come to “train your rows” i.e. mark which part of the website you are interested in scraping. Hover over an entire “entry” or “paragraph”:hover over entry

…and he entry will be highlighted in pink or blue. Press “Train rows”.

train rows

Repeat the operation with the next entry/paragraph so that the scraper gets the hang of the pattern of your selections. Two examples should suffice. Scroll down to the bottom of your website to make sure that all entries until the last one are selected (=highlighted in pink or blue alternately).

If it is, press “I’ve got all 50 rows” (the number depends on how many rows you have selected).

Now it’s time to focus on particular chunks of data you would like to extract. My entries consist of a name of the orphanage, address, phone number and a short description so I will extract all those to separate columns. Let’s start by adding a column “name”:

add column

Next, highlight the name of the first orphanage in the list and press “Train”.

highlighttrain

Your table should automatically fill in with names of all orphanages in the list:table name

If it didn’t, try tweaking your selection a bit. Then add another column “address” and extract the address of the orphanage by highlighting the two lines of addresses and “training” the rows.

Repeat the operation for a “phone number” and “description”. Your table should end up looking like this:table final

*Before passing on to the next column it is worth to check that all the rows have filled up. If not, highlighting and training of the individual elements might be necessary.

Once you’ve grabbed all that you need, click “I’ve got what I need”. The menu will now ask you if you want to scrape more pages. In this case, the search yielded two pages of search results so I will add another page. In order to this this, go back to your website in you regular browser, choose page 2 (or any next one) of your search results and copy the URL. Paste it into the import.io browser and confirm by clicking “I’m there”:

i'm there

The scraper should automatically fill in your table for page 2. Click “I’ve got all 45 rows” and “I’ve got what I needed”.

You need to add at least 5 pages, which is a bit frustrating with a smaller data set like this one. The way around it is to add page 2 a couple of times and delete the unnecessary rows in the final table.

Once the cheating is done, click “I’m done training!” and “Upload to import.io”.

upload

Give the name to your Crawler, e.g. “Orphanages in London” and wait for import.io to upload your data. Then, run crawler:run crawler

Make sure that the page depth is 10 and that click “Go”. If you’re scraping a huge dataset with several pages of search results, you can copy your URLs to Excel, highlight them and drag down with a black cross (bottom right of the cell) to obtain a comprehensive list. Paste it into the “Where to start?” window and press “Go”.go

crawlingAfter the crawling is complete, you can download you data in EXCEL, HTML, JSON or CSV.dataset

As a result, we obtain a data set which can be easily turned into a map of orphanages in London, e.g. using Google Fusion Tables.

Source:http://www.interhacktives.com/2014/03/06/scrape-data-without-coding-step-step-tutorial-import-io/

Sunday, 16 November 2014

Is Web Scraping Legal?

Web scraping might be one of the best ways to aggregate content from across the internet, but it comes with a caveat: It’s also one of the hardest tools to parse from a legal standpoint.

For the uninitiated, web scraping is a process whereby an automated piece of software extracts data from a website by “scraping” through the site’s many pages. While search engines like Google and Bing do a similar task when they index web pages, scraping engines take the process a step further and convert the information into a format which can be easily transferred over to a database or spreadsheet.

It’s also important to note that a web scraper is not the same as an API. While a company might provide an API to allow other systems to interact with its data, the quality and quantity of data available through APIs is typically lower than what is made available through web scraping. In addition, web scrapers provide more up-to-date information than APIs and are much easier to customize from a structural standpoint.

The applications of this “scraped” information are widespread. A journalist like Nate Silver might use scrapers to monitor baseball statistics and create numerical evidence for a new sports story he’s working on. Similarly, an eCommerce business might bulk scrape product titles, prices, and SKUs from other sites in order to further analyze them.

Legality of Web ScrapingWhile web scraping is an undoubtedly powerful tool, it’s still undergoing growing pains when it comes to legal matters. Because the scraping process appropriates pre-existing content from across the web, there are all kinds of ethical and legal quandaries that confront businesses who hope to do leverage scrapers for their own processes.

In this “wild west” environment, where the legal implications of web scraping are in a constant state of flux, it helps to get a foothold on where the legal needle currently falls. The following timeline outlines some of the biggest cases involving web scrapers in the United States, and allows us to achieve a greater understanding on the precedents that surround the court rulings.

Terms of Use Tug-of-War—2000-2009

For years after they first came into use, web scrapers went largely unchallenged from a legal standpoint. In 2000, however, the use of scrapers came under heavy and consistent fire when eBay fired the first shot against an auction data aggregator called Bidder’s Edge. In this very early case, eBay argued that Bidder’s Edge was using scrapers in a way that violated Trespass to Chattels doctrine. While the lawsuit was settled out of court, the judge upheld eBay’s original injunction, stating that heavy bot traffic could very well disrupt eBay’s service.

Then in 2003’s Intel Corp. v. Hamidi, the California Supreme court overturned the basis of eBay v. Bidder’s Edge, ruling that Trespass to Chattels could not extend to the context of computers if no actual damage to personal property occurred.

So in terms of legal action against web scraping, Tresspass to Chattels no longer applied, and things were back to square one. This began a period in which the courts consistently rejected Terms of Service as a valid means of prohibiting scrapers, including cases like Perfect 10 v. Google, and Cvent v. Eventbrite.

The Takeaway: The earliest cases against scrapers hinged on Trespass to Chattels law, and were successful. However, that doctrine is no longer a valid approach.

Facebook Web Scraping2009—Facebook Steps In

In 2009, Facebook turned the tides of the web scraping war when Power.com, a site which aggregated multiple social networks into one centralized site, included Facebook in their service. Because Power.com was scraping Facebook’s content instead of adhering to their established standards, Facebook sued Power on grounds of copyright infringement.

In denying Power.com’s motion to dismiss the case, the Judge ruled that scraping can constitute copying, however momentary that copying may be. And because Facebook’s Terms of Service don’t allow for scraping, that act of copying constituted an infringement on Facebook’s copyright. With this decision, the waters regarding the legality of web scrapers began to shift in favor of the content creators.

The Takeaway: Even if a web scraper ignores infringing content on its way to freely-usable content, it might qualify as copyright infringement by virtue of having technically “copied” the infringing content first.

2011-2014— U.S. v Auernheimer

In 2010, hacker Andrew “Weev” Auernheimer found a security flaw in AT&T’s website, which would display the email addresses of users who visited the site via their iPads. By exploiting the flaw using some simple scripts and a scraper, Auernheimer was able to gather thousands of emails from the AT&T site.

Although these email addresses were publicly available, Auernheimer’s exploit led to his 2012 conviction, where he was charged with identity fraud and conspiracy to access a computer without authorization.

Data ScrapingEarlier this year, the court vacated Auernheimer’s conviction, ruling that the trial’s New Jersey venue was improper. But even though the case turned out to be mostly inconclusive, the court noted the fact that there was no evidence to show that “any password gate or code-based barrier was breached.” This seems to leave room for the web scraping of publicly-available personal information, although it’s still very much open to interpretation and not set in stone.

The Takeaway: Using a web scraper to aggregate sensitive personal information can lead to a conviction, even if that information was technically available to the public. While there is hope in the court’s observation that no passwords or barriers were broken to retrieve this information, the waters here are still very volatile.

2013—Associated Press vs. Meltwater

Meltwater is a software company whose “Global Media Monitoring” product uses scrapers to aggregate news stories for paying clients. The Associated Press took issue with Meltwater’s scraping of their original stories, some of which had been copyrighted. In 2012, AP filed suit against Meltwater for copy infringement and hot news misappropriation.

While it’s already been established that facts cannot be copyrighted, the court decided that the AP’s copyrighted articles—and more specifically, the way in which the facts within those articles were arranged—were not fair game for copying. On top of this, Meltwater’s use of the articles failed to meet the established fair use standards, and could not be defended on that front either.

The Takeaway: Fair use is limited when it comes to web scrapers, and copyrighted content is not always open to be scraped.

~~

By closely observing the outcomes of previous rulings, you’ll find that there are a few guidelines that a scraper should attempt to adhere to:

    Content being scraped is not copyright protected
    The act of scraping does not burden the services of the site being scraped
    The scraper does not violate the Terms of Use of the site being scraped
    The scraper does not gather sensitive user information
    The scraped content adheres to fair use standards

While all of these guidelines are important to understand before using scrapers, there are other ways to acclimate to the legal nuances. In many cases, you’ll find that a simple conversation with a business software developer or consultant will lead to some satisfying conclusions: Odds are, they’ve used scrapers in the past and can shed light on any snags they’ve hit in the process. And of course, talking with a lawyer is always an ideal course of action when treading into questionable legal territory.

Source:http://blog.icreon.us/2014/09/12/web-scraping-and-you-a-legal-primer-for-one-of-its-most-useful-tools/

Friday, 14 November 2014

Interactive Crawls for Scraping AJAX Pages on the Web

Crawling pages on the web has become an everyday affair for most enterprises. Too often do we come across offline businesses as well who’d like data gathered from the web for internal analyses. All this eventually to serve customers faster and better. At times, when the crawl job is high-end cum high-scale, businesses also consider DaaS providers to supplement their efforts.

However, the web landscape too has evolved with newer technologies that provide fancy experiences to web users. AJAX elements are one such common aid that leave even the DaaS providers perplexed. They come in various forms from a user’s point of view-

1. Load more results on the same page

2. Filter results based on various selection criteria

3. Submit forms, etc.

When crawling a non-AJAX page, simple GET requests do the job. However, AJAX pages work with POST requests that are not easy to trace for a normal bot.

Difference between GET request and POST request- Scraping

GET vs. POST

At PromptCloud, from our experience with a number of AJAX sites on the web, we’ve crossed the tech barrier. Below is a quick review about the challenges that come with AJAX crawling and its indicative solutions-

1. Javascript Emulations- A bot essentially emulates human browsing to fetch pages. When this needs to be done for Javascript components on a page, it gets tricky. Headless browser, which emulates human interaction with a web page without an interface, is the current approach. These browsers click on various elements/ dropdown lists that are embedded within Javascript code and capture responses to be transferred to programs. Which headless browser to pick depends on what fits well into your current stack.

2. Fetch Bandwidths- Unlike GET requests which complete pretty quickly, POST requests take quite a bit of time due to the number of events involved per fetch. Hence a good amount of bandwidth needs to be allocated in order to receive the response. For the same reason, wait times need to be taken care of too else you might end up with incomplete responses.

3. .NET Architectures- This is a more complex scenario related to maintaining the View State. Most of the postbacks come with an event and its validation. The bot needs to track the view state and pass validations for the event to occur so that the code can be executed and results captured. This is achieved by adopting a mechanism to restore states if things break midway.

4. Page Encoding- Request and response headers need to be taken care of on AJAX pages. The request needs to be sent in the exact format as expected by the server (Content-type or media type, accept fields, etc.) and similarly responses need to be parsed based on the content-type.

A Use Case

One of our clients who is into sale of event tickets at discounted rates had us crawl one of the ticketing sites on the web weekly; one of the most complex AJAX crawling we’ve dealt with so far. For the data that was to be extracted, multiple AJAX fetches were needed depending on the selections made. Requests had to be made for a combination of items from the dropdown box. These came with cookies and session IDs. To add to the challenge the site was extremely dynamic and changed its structure every week making it difficult for us to follow what data was where on the page.

We developed an AJAX crawler specific to this site to take care of all the dynamics. Response times were taken care of so that we didn’t miss any relevant information. We included an ML component to improve the crawler which is now pretty stable irrespective of changes on the site.

Overall, AJAX crawling requires more compute power in addition to the tech expertise. And because there’s no uniformity on the web, there’s always a new challenge to overcome in this landscape. It wouldn’t be an overrating if we said we’ve done a good job at that so far and have developed the knack :)

Reach out to us for any kind of web scraping/ crawling- either AJAX or not. We’ll take care of the complexities.

Source: https://www.promptcloud.com/blog/web-scraping-interactive-ajax-crawls/

Wednesday, 12 November 2014

Web scraping services-importance of scraped data

Web scraping services are provided by computer software which extracts the required facts from the website. Web scraping services mainly aims at converting unstructured data collected from the websites into structured data which can be stockpiled and scrutinized in a centralized databank. Therefore, web scraping services have a direct influence on the outcome of the reason as to why the data collected in necessary.

It is not very easy to scrap data from different websites due to the terms of service in place. So, the there are some legalities that have been improvised to protect altering the personal information on different websites. These ‘rules’ must be followed to the letter and to some extent have limited web scraping services.

Owing to the high demand for web scraping, various firms have been set up to provide the efficient and reliable guidelines on web scraping services so that the information acquired is correct and conforms to the security requirements. The firms have also improvised different software that makes web scraping services much easier.

Importance of web scraping services

Definitely, web scraping services have gone a long way in provision of very useful information to various organizations. But business companies are the ones that benefit more from web scraping services. Some of the benefits associated with web scraping services are:

    Helps the firms to easily send notifications to their customers including price changes, promotions, introduction of a new product into the market. Etc.
    It enables firms to compare their product prices with those of their competitors
    It helps the meteorologists to monitor weather changes thus being able to focus weather conditions more efficiently
    It also assists researchers with extensive information about peoples’ habits among many others.
    It has also promoted e-commerce and e-banking services where the rates of stock exchange, banks’ interest rates, etc. are updated automatically on the customer’s catalog.

Advantages of web scraping services

The following are some of the advantages of using web scraping services

    Automation of the data

    Web scraping can retrieve both static and dynamic web pages

    Page contents of various websites can be transformed

    It allows formulation of vertical aggregation platforms thus even complicated data can still be extracted from different websites.

    Web scraping programs recognize semantic annotation

    All the required data can be retrieved from their websites

    The data collected is accurate and reliable

Web scraping services mainly aims at collecting, storing and analyzing data. The data analysis is facilitated by various web scrapers that can extract any information and transform it into useful and easy forms to interpret.

Challenges facing web scraping

    High volume of web scraping can cause regulatory damage to the pages

    Scale of measure; the scales of the web scraper can differ with the units of measure of the source file thus making it somewhat hard for the interpretation of the data

    Level of source complexity; if the information being extracted is very complicated, web scraping will also be paralyzed.

It is clear that besides web scraping providing useful data and information, it experiences a number of challenges. The good thing is that the web scraping services providers are always improvising techniques to ensure that the information gathered is accurate, timely, reliable and treated with the highest levels of confidentiality.

Source: http://www.loginworks.com/blogs/web-scraping-blogs/191-web-scraping-services-importance-of-scraped-data/

Monday, 10 November 2014

How to scrape Amazon with WebDriver in Java

Here is a real-world example of using Selenium WebDriver for scraping.
This short program is written in Java and scrapes book title and author from the Amazon webstore.
This code scrapes only one page, but you can easily make it scraping all the pages by adding a couple of lines.

You can download the souce here.

import java.io.*;
import java.util.*;
import java.util.regex.*;

import org.openqa.selenium.*;
import org.openqa.selenium.firefox.FirefoxDriver;


public class FetchAllBooks {

    public static void main(String[] args) throws IOException {

        WebDriver driver = new FirefoxDriver();
      

driver.navigate().to("http://www.amazon.com/tag/center%20right?ref_=tag_dpp_cust_itdp_s_t&sto

re=1");

        List<WebElement> allAuthors =  driver.findElements(By.className("tgProductAuthor"));
        List<WebElement> allTitles =  driver.findElements(By.className("tgProductTitleText"));
        int i=0;
        String fileText = "";

        for (WebElement author : allAuthors){
            String authorName = author.getText();
            String Url = (String)((JavascriptExecutor)driver).executeScript("return

arguments[0].innerHTML;", allTitles.get(i++));
            final Pattern pattern = Pattern.compile("title=(.+?)>");
            final Matcher matcher = pattern.matcher(Url);
            matcher.find();
            String title = matcher.group(1);
            fileText = fileText+authorName+","+title+"\n";
        }

        Writer writer = new BufferedWriter(new OutputStreamWriter(new

FileOutputStream("books.csv"), "utf-8"));
        writer.write(fileText);
        writer.close();

        driver.close();
    }
}

Source: http://scraping.pro/scraping-amazon-webdriver-java/

Saturday, 8 November 2014

Web Scraping Enters Politics

Web scraping is becoming an essential tool in gaining an edge over everything about just anything. This is proven by international news on US political campaigns, specifically by identifying wealthy donors. As is commonly known, election campaigns should follow a rule regarding the use of a certain limited amount of money for the expenses of each candidate. Being so, much of the campaign activities must be paid by supporters and sponsors.

It is not a surprise then that even politics is lured to make use of the dynamic and ever growing data mining processes. Once again, web mining has proven to be an essential component of almost all levels of human existence, the society, and the world as a whole. It proves its extraordinary capacity to dig precious information to reach the much aspired for goals of every individual.

Mining for personal information
The CBC News online very recently disclosed that the US Republican presidential candidate Mitt Romney has used data mining in order to identify rich donors. It is reported that the act of getting personal information such as the buying history and church attendance were vital in this incident. Through this information, the party was able to identify prospective rich donors and indeed tap them. As a businessman himself, Romney knows exactly how to fish and where the fat fish are. Moreover, what is unique about the identified donors is that they have never been donating before.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-enters-politics/

Wednesday, 5 November 2014

Web Scraping: The Invaluable Decision Making Tool

Business decisions are mandatory in any company. They reflect and directly influence about the future of the company. It is important to realize that decisions must be made in any business situation. The generation of new ideas calls for new actions. This in turn calls for decisions. Decisions can only be made when there is adequate information or data regarding the problem and the cause of action to be taken. Web scraping offers the best opportunity in getting the required information that will enable the management make a wise and sound decision.

Therefore web scraping is an important part in generation of the practical interpretations for the business decision making process. Since businesses take many courses of actions the following areas call for adequate web scraping in order to make outstanding decisions.

1. Suppliers. Whether you are running an offline business there is need to get information regarding your suppliers. In this case there are two situations. The first situation is about your current suppliers and the second situation is about the possibility of acquiring new suppliers. By web scraping you has the opportunity to gather about your suppliers. You need to know other business they are supplying to and the kind of discounts and prices they offer to them. Another important aspect about consumers is to determine the periods when they have surplus and therefore be able to determine the purchasing prices.

Web scraping can provide new information concerning new suppliers. This will make a cutting edge in the purchasing sector. You can get new suppliers that have reasonable prices. This will go a long way in ensuring a profitable business. Therefore web scraping is an integral process that should be taken first before making a vital decision concerning suppliers.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-invaluable-decision-making-tool/