Alexander The Great

December 8, 2008

Html Agility Pack

Filed under: Programming — alexanderthegreatest @ 10:42 pm

Has anyone else used this? I’m teetering between its given name and Html Agitation Pack. Written by an MSDN guru and later moved to CodePlex, this is the XmlDocument of the web.

An aside: people who use open source languages like Perl and Ruby will be shocked to know how difficult it is for Microsoft developers to work with html programmatically. We’re able to consume xml very quickly, so long as it’s well formed, but any error in the markup renders the whole document unreadable. Microsoft’s design goal was to never guess at the developer’s intent, so anything the least bit ambiguous is an exception. Agility Pack is an open source library for parsing html and making the guesses MS was unwilling to make outside IE.
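To make the strict-versus-forgiving difference concrete, here's a rough sketch in Python rather than .NET. It is only an analogy: the standard library's `xml.etree.ElementTree` stands in for a strict XML parser and `html.parser.HTMLParser` stands in for a forgiving one. The sample markup and the helper names (`parse_strict`, `parse_lenient`, `TextCollector`) are made up for illustration and have nothing to do with Agility Pack's own API.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: an unquoted attribute and an unclosed <b>,
# the kind of thing real pages are full of.
BROKEN = "<html><body><p class=intro>Hello <b>world</p></body></html>"


def parse_strict(markup):
    """Return True if the markup is well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # A strict parser refuses the whole document on the first error.
        return False


class TextCollector(HTMLParser):
    """Collects text nodes and ignores unbalanced or sloppy tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parse_lenient(markup):
    """Return the text content of the markup, however mangled the tags are."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)
```

Calling `parse_strict(BROKEN)` rejects the page outright, while `parse_lenient(BROKEN)` still hands back the text. That second behaviour is roughly the kind of guessing a forgiving html parser has to do.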

I’m finding it slow. The software has trouble with certain encodings and, worse, it throws stack overflow exceptions! That suggests it leans too heavily on recursion. Generally a loop (sometimes with a stack or a queue) will fix the problem, but the offending code is very hard to track down in such a large code base.
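Here's the loop-with-a-stack idea as a minimal Python sketch. I haven't traced the library's source, so this is not its actual code: `Node`, `inner_text_recursive`, `inner_text_iterative` and `make_chain` are hypothetical stand-ins for a DOM node and a text-gathering routine. One difference to keep in mind is that Python raises a catchable `RecursionError` at its recursion limit, whereas a real stack overflow in .NET takes the whole process down.

```python
class Node:
    """Hypothetical stand-in for a DOM node: some text plus child nodes."""

    def __init__(self, text="", children=None):
        self.text = text
        self.children = children or []


def inner_text_recursive(node):
    # One call frame per level of nesting, so very deep trees exhaust the stack.
    return node.text + "".join(inner_text_recursive(c) for c in node.children)


def inner_text_iterative(root):
    # Same pre-order walk, but the pending nodes live in a list on the heap.
    parts = []
    stack = [root]
    while stack:
        node = stack.pop()
        parts.append(node.text)
        # Push children in reverse so the leftmost child is popped first.
        stack.extend(reversed(node.children))
    return "".join(parts)


def make_chain(depth):
    """Build a pathological tree: a single chain of `depth` nested nodes."""
    root = Node("x")
    current = root
    for _ in range(depth):
        child = Node("x")
        current.children.append(child)
        current = child
    return root
```

On a small tree both functions return the same string. On something like `make_chain(50_000)` the recursive version gives up, while the iterative one just uses more heap.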

Still, this hasn’t stopped others from finding creative uses for the library. The page localizer is a fascinating example. And here’s a converter, allowing LINQ over web pages!


2 Comments »

  1. Hi,

    Usually, encoding problems come from a misuse of the library, which is poorly documented :). I would be interested to know about the StackOverflow exceptions though…

    Simon.

    Comment by Simon Mourier — December 8, 2008 @ 11:28 pm | Reply

  2. Hi, Simon

    I’ll have to do some more digging to try and reproduce specific instances of what I was complaining about. I believe all the StackOverflow exceptions are from InnerText (which is being called on the document’s root node). This is intermittent, and only happens on very large pages – obviously. If I can get you a specific URL that can reliably cause this exception, I’ll be in touch. 😀

    In all fairness, I should thank you for producing (and releasing!) such an in-depth product. It’s unfair to expect perfection and 100% coverage when many pages on the web are horribly mangled. And I’d be unable to do any automated web programming without your generous help.

    Comment by alexanderthegreatest — December 9, 2008 @ 9:47 pm | Reply

