Alexander The Great

November 17, 2008

Is Scraping Data Wrong?

Filed under: Ethics,Modern Life — alexanderthegreatest @ 9:23 pm

In 1492, the Old and New Worlds clashed in a very dramatic way. Volumes have been written about what followed, so I won’t cover this here. A similar, if less vital, clash is happening today. You’ve heard about California’s “water wars,” but you may not have heard of the scraper wars.

For better or worse (and I’d say it’s for the better), the world wide web is an open technology. This is partly true in the sense that Apache’s code can be downloaded and modified, but the more important sense is the one in which a client makes a request to the server, and the server sends down a response. Login prompts, captcha forms, and payment walls abound, but they are far from the default. It’s difficult (or at least cumbersome) to program a web server to refuse to serve a client. Prejudice is not the nature of the web.

People with something to say get on their soapbox and say it. What they say becomes available to the world by default. It’s assumed that it will be consumed by a target market (some restricted demographic of humans using web browsers), but all computers have the ability to make HTTP calls and do as they will with the replies they receive. And therein lies the problem.
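To make the point concrete, here is a minimal sketch in Python using only the standard library. It starts a throwaway server on the local machine and fetches the same page twice, once claiming to be a browser and once admitting to being a script. The handler name `SoapBoxHandler`, the function `fetch_from_local_server`, and both User-Agent strings are made up for illustration; a real site may well be configured differently.

```python
import http.server
import threading
import urllib.request


class SoapBoxHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler: serves the same page to every client."""

    def do_GET(self):
        body = b"<html><body><p>Something to say.</p></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_from_local_server(user_agent):
    """Start a local server on a free port, fetch "/" once, return (status, body)."""
    server = http.server.HTTPServer(("127.0.0.1", 0), SoapBoxHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        # An empty ProxyHandler makes sure the request goes straight to localhost.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=5) as response:
            return response.status, response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_from_local_server("Mozilla/5.0 (a browser)"))
    print(fetch_from_local_server("my-little-script/0.1"))
```

Nothing in the handler looks at who is asking, so both clients get an identical reply. Telling them apart would take extra code that the server author has to go out of their way to write.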

You may have found this site through Google’s web search; if not, surely you’ve found another this way. Google uses a “robot”, a wanderer, a spider that crawls the web, copying down what it finds and indexing it for search. One could object that this violates copyright law, but both the searcher and the found benefit from the arrangement, so nobody (apart from book authors) complains about Google. When others do the same thing, though, people assume nefarious trickery. At best, they assume spam, which typically seems to mean plagiarism, where a person steals someone else’s informational work to profit from cheap advertisements.
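The core of such a spider is less mysterious than it sounds: read a page, note the links, and queue them up to visit next. The sketch below shows only the link-gathering step, using Python’s standard library. It is not how Google does it, and the names `LinkCollector` and `extract_links` are invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the target of every <a href="..."> on a page as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they were found on.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs linked from the given HTML, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = '<p><a href="/about">About</a> <a href="http://other.example/x">X</a></p>'
    print(extract_links(page, "http://example.com/blog/"))
```

A crawler repeats this for every URL it collects. The code is the same whether the person running it builds a search index or republishes the text under their own name, which is why the technique alone says little about intent.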

Mashups, though, are the definition of “web 2.0”. Ouseful discovered how to use Google Spreadsheets to translate “foreign” HTML to RSS, and then Yahoo for geocoding. Arachnode is an open source (for SQL Server and C#!) home scraping platform. Whether people like it or not, the ability to make use of this great federated data store we call the web is being brought down to the lowly masses. Democracy over data is the future, and it’s in everybody’s best interest to learn to deal with it. (Like the Census.)
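For a flavour of what translating “foreign” HTML to RSS involves, here is a rough sketch in plain Python rather than Google Spreadsheets. It assumes, purely for illustration, that each entry on the page is a link inside an `<h2>`; real pages vary, and `HeadlineParser` and `html_to_rss` are hypothetical names, not part of any tool mentioned above.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collects (title, href) pairs from links nested inside <h2> elements."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_h2 = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
        elif tag == "a" and self._in_h2:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "h2":
            self._in_h2 = False


def html_to_rss(html, feed_title, feed_link):
    """Build a bare-bones RSS 2.0 document from the headlines found in the HTML."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Scraped from " + feed_link
    for title, href in parser.items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = href
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    page = '<h2><a href="http://example.com/1">First post</a></h2>'
    print(html_to_rss(page, "Example feed", "http://example.com/"))
```

The output can be handed to any feed reader, which is the whole appeal of a mashup: data published for one purpose gets reshaped for another without the publisher lifting a finger.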


13 Comments »

  1. A new version of arachnode.net is due very soon and has been updated since you posted about arachnode.

    Comment by arachnode.net — February 25, 2009 @ 9:52 pm | Reply

  2. Here is a post regarding techniques for ‘Scraping your way to RSS feeds’ albeit in a non-programmatic (layman) way:

    http://technosiastic.wordpress.com/2009/04/08/scraping-your-way-to-rss-feeds/

    Comment by Shahriar Hyder — April 16, 2009 @ 6:35 am | Reply

  3. Way cool! Some extremely valid points! I appreciate you penning this post plus the rest of the website is also really good.

    Comment by Olivia — December 8, 2013 @ 2:00 pm | Reply

  4. continuously i used to read smaller articles or reviews which as well clear their
    motive, and that is also happening with this paragraph which I am reading
    at this place.

    Comment by massage therapist Pasadena tx — December 15, 2013 @ 9:14 am | Reply

  5. Valuable information. Fortunate me I discovered your website accidentally, and
    I’m surprised why this twist of fate didn’t happened earlier!
    I bookmarked it.

    Comment by make autopilot money — December 19, 2013 @ 12:02 am | Reply

  6. Attractive element of content. I simply stumbled upon your site and in
    accession capital to say that I acquire actually enjoyed account your weblog
    posts. Anyway I’ll be subscribing on your feeds or even I achievement you get right
    of entry to persistently fast.

    Comment by amazon.com — January 17, 2014 @ 2:30 pm | Reply

  7. Hi Dear, are you in fact visiting this website regularly, if so afterward you will absolutely take nice experience.

    Comment by Pam — January 18, 2014 @ 3:57 am | Reply

  8. Remarkable! Its really awesome article, I have got much clear idea on the topic of from this post.

    Comment by hack — February 11, 2014 @ 1:42 am | Reply

  9. Just want to say your article is as amazing.
    The clarity in your post is just cool and i could assume you’re
    an expert on this subject. Well with your permission let me to grab your RSS feed to keep updated with forthcoming post.
    Thanks a million and please keep up the enjoyable work.

    Comment by Travel — March 1, 2014 @ 2:02 am | Reply

  10. Thank you a bunch for sharing this with all people you actually recognise
    what you are talking about! Bookmarked. Please additionally discuss with my site =).
    We may have a hyperlink exchange arrangement between us

    Comment by baby quotes — March 2, 2014 @ 1:29 am | Reply

  11. What’s Happening i am new to this, I stumbled
    upon this I have discovered It positively helpful and it has helped me out
    loads. I am hoping to contribute & help different customers like
    its helped me. Great job.

    Comment by https://play.google.com/store/apps/details?id=com.respectapps.motherdaughterquotes — March 7, 2014 @ 10:21 pm | Reply

  12. Hurrah! In the end I got a blog from where I can genuinely take useful information concerning my study and knowledge.

    Comment by www.youtube.com — September 10, 2014 @ 6:07 pm | Reply

  13. 28 oz Delmonte Chunky Diced Zesty Chili Style tomatoes or
    equivalent. Power outlets make sure that you are getting problem free experience even with your old laptops because
    they may turn dark without power outlets. It was
    well known that oats were grown, stored, milled and packaged with wheat products are were typically
    very contaminated with gluten.

    Comment by Gluten Free Society continue reading this.. — September 20, 2014 @ 11:41 pm | Reply

