Alexander The Great

November 17, 2008

Is Scraping Data Wrong?

Filed under: Ethics,Modern Life — alexanderthegreatest @ 9:23 pm

In 1492, the Old and New Worlds clashed in a very dramatic way. Volumes have been written about what followed, so I won’t cover this here. A similar, if less vital, clash is happening today. You’ve heard about California’s “water wars,” but you may not have heard of the scraper wars.

For better or worse (and I’d say it’s for the better), the world wide web is an open technology. This is partly true in the sense that Apache’s code can be downloaded and modified, but the more important sense is that a client makes a request to a server, and the server sends down a response. Login prompts, captcha forms, and payment walls abound, but they are far from the default. It’s difficult (or at least cumbersome) to program a web server to refuse to serve a particular client. Prejudice is not in the nature of the web.

People with something to say get on their soapbox and say it, and what they say becomes available to the world by default. The assumption is that it will be consumed by a target market (some restricted demographic of humans using web browsers), but every computer can make HTTP calls and do as it will with the replies it receives. And therein lies the problem.
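To make that concrete, here is a minimal sketch in Python of a program (rather than a person) asking for a page and doing something with the reply, in this case pulling out the links. The names `fetch`, `extract_links`, and the `example-bot/0.1` user agent are my own inventions for illustration, and the code uses only the standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collects the target of every <a href="..."> tag it sees."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs of all links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def fetch(url, user_agent="example-bot/0.1"):
    """Request a page the same way a browser would and return its text."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `extract_links(fetch(url), url)` on a page you are allowed to read gives the list of places a spider could visit next. Nothing in the exchange tells the server whether a human or a script is on the other end, beyond whatever the client chooses to say about itself.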

You may have found this site through Google’s web search. If not, you’ve surely found another that way. Google uses a “robot”, a wanderer, a spider that crawls the web, copying down what it finds and indexing it for search. Many people object that this violates copyright law, but both the searcher and the found benefit from the arrangement, so almost nobody (apart from book authors) complains about Google. When others do the same thing, though, people assume nefarious trickery. At best it is taken for spam, which typically seems to mean plagiarism: a person steals someone else’s informational work to profit from cheap advertisements.
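The “indexing it for search” part is less mysterious than it sounds. What follows is a toy sketch of an inverted index, which maps each word to the pages that contain it. It is in no way a description of how Google actually works; the function names and the sample URLs are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build a word -> set-of-URLs index from a {url: page_text} mapping."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs whose text contains every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

Feed `build_index` the text a spider has copied down and `search` answers queries against it. Everything a search engine adds on top (ranking, freshness, scale) is refinement of this same copy-then-index arrangement.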

Mashups, though, are the definition of “web 2.0”. Ouseful discovered how to use Google Spreadsheets to translate “foreign” HTML to RSS, and then Yahoo for geocoding. Arachnode is an open source (for SQL Server and C#!) home scraping platform. Whether people like it or not, the ability to make use of this great federated data store we call the web is being brought down to the lowly masses. Democracy over data is the future, and it’s in everybody’s best interest to learn to deal with it. (Like the Census.)
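For a sense of what translating HTML to RSS involves, here is a rough sketch in plain Python. To be clear, this is not Ouseful’s spreadsheet technique and it has nothing to do with Arachnode’s code. It simply assumes that each link on a page should become one feed item, and `html_to_rss` is a name I made up.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class AnchorParser(HTMLParser):
    """Collects (link text, href) pairs for every <a> tag with visible text."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            if title:
                self.items.append((title, self._href))
            self._href = None


def html_to_rss(html, feed_title, feed_link):
    """Turn the links in an HTML string into a bare-bones RSS 2.0 document."""
    parser = AnchorParser()
    parser.feed(html)
    parser.close()

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Feed scraped from " + feed_link
    for title, href in parser.items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = href
    return ET.tostring(rss, encoding="unicode")
```

A real page would need something more selective than “every link”, such as only the links inside a particular list or heading, but the shape of the job is the same: read someone else’s markup, keep the parts you care about, and write them back out in a format your own tools understand.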


2 Comments »

  1. A new version of arachnode.net is due very soon and has been updated since you posted about arachnode.

    Comment by arachnode.net — February 25, 2009 @ 9:52 pm | Reply

  2. Here is a post regarding techniques for ‘Scraping your way to RSS feeds’ albeit in a non-programmatic (layman) way:

    http://technosiastic.wordpress.com/2009/04/08/scraping-your-way-to-rss-feeds/

    Comment by Shahriar Hyder — April 16, 2009 @ 6:35 am | Reply

