So you thought you might like to parse HTML

A friend of mine wanted to get the price of a specific product from a price-comparison site programmatically, so he asked me how he could do that.
Now, this price-comparison service doesn’t have an API, so in order to get the price from a page, he needs to write a script that can do the following:
1. Request, and get the page for the specific product.
2. Parse the received HTML page, and understand it.
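
For the curious, here is a rough sketch of those two steps. It is written in Python rather than C#, only because Python's standard library is enough to show the idea without installing anything. Note that its html.parser module reports tags as events instead of building a tree of objects the way HtmlAgilityPack does. The names fetch_page, extract_price and PriceFinder, and the <span class="price"> markup, are made up for this illustration; a real site will mark up its prices differently.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class PriceFinder(HTMLParser):
    """Collects the text inside the first <span class="price"> element.

    The tag name and class are assumptions made for this example.
    """

    def __init__(self):
        super().__init__()
        self._inside = False  # True while we are inside the price span
        self.price = None     # text of the first price found, if any

    def handle_starttag(self, tag, attrs):
        is_price_span = tag == "span" and dict(attrs).get("class") == "price"
        if is_price_span and self.price is None:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False

    def handle_data(self, data):
        if self._inside and self.price is None:
            self.price = data.strip()


def fetch_page(url):
    """Step 1: request the product page and return its HTML as text."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_price(html):
    """Step 2: parse the HTML and return the price text, or None."""
    finder = PriceFinder()
    finder.feed(html)
    return finder.price


# A hard-coded page stands in for a real download here.
sample = '<html><body><h1>Widget</h1><span class="price"> $19.99 </span></body></html>'
print(extract_price(sample))  # prints $19.99
```

With a real product page you would call extract_price(fetch_page(url)), after adjusting the tag and class to whatever the site actually uses.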

Seems to be a doable task, but the problem is that the only programming he has ever done is a bit of C++. So I compiled a checklist for him of the things he has to learn in order to implement such a script. Of course, many alternatives exist (Python, Java, PHP, etc.), so the following list simply reflects my .NET and C# background.
So here is the checklist:

1. Learn the basics of C# (the programming language in which to implement the script)
http://www.ssw.uni-linz.ac.at/Teaching/Lectures/CSharp/Tutorial/ (2 short pdf files)
http://www.functionx.com/csharp/index.htm (full tutorial)

2. Download and install Visual C# 2008 Express Edition for free (the programming environment in which to work on the implementation)
http://www.microsoft.com/express/download/#webInstall

3. Download and learn HtmlAgilityPack, a library that lets you easily parse HTML pages into nice C# objects and collections (the .NET libraries don’t have this feature built in)
http://www.codeplex.com/htmlagilitypack/Release/ProjectReleases.aspx?ReleaseId=272

4. Use the “DOM Inspector” tool, which is built into Firefox. It parses HTML pages and shows the structure of the page as a tree of tags. It will be very useful when implementing the script.
Just download Firefox; the tool comes with it.

Now, all that is left for my friend is to find lots of free time, and a big pot of coffee.

Update:
5. Forget about the DOM Inspector. What you really need is Firebug. It’s a great Firefox extension that lets you view and edit HTML, CSS, and JavaScript right inside Firefox. You can point to an element on your current page (link, image, text) and it shows you its location inside the HTML source code. It’s magic, really.
It has many features; I haven’t had time to check them all out yet.
Firebug for Firefox 2.0

4 Replies

  • Python’s philosophy is “batteries included”. It doesn’t support HTML parsing out of the box?

  • Well, obviously not. You can’t include all the batteries… you don’t want to carry around 100 kg of batteries 😉

    Useful stuff that is included: HTTP client library, URL encoding, character set codecs, 3 different APIs for XML handling.

    Sure, C# has all this stuff too, but Python is way more fun!
