Get to Know a Gem: Hpricot

Posted by Robby Russell Tue, 13 Feb 2007 14:48:00 GMT

In the first post of this new series, Get to Know a Gem, we’re going to take a look at Hpricot.

What is Hpricot?

WhyTheLuckyStiff released Hpricot in July of 2006 in an effort to bring fast HTML parsing to the masses. It’s unknown what prompted it, but my guess would be that Why is secretly scraping all the pages on the internet that archive the future. For speed, Why wrote the Hpricot scanner in C, which makes it much faster than the other HTML parsing options available in Ruby.

Installation

Installation is, as with most gems, very simple.


$ sudo gem install hpricot
Password:
Need to update 23 gems from http://gems.rubyforge.org
.......................
complete
Select which gem to install for your platform (powerpc-darwin8.7.0)
 1. hpricot 0.5 (ruby)
 2. hpricot 0.5 (mswin32)
 3. hpricot 0.4 (mswin32)
 4. hpricot 0.4 (ruby)
 5. Cancel installation
> 1

Great, let’s now play with it!

Usage

In this first example, we’re going to use Hpricot to parse a web page through the Open-URI library. For this, we’ll need to require a few libs.


require 'rubygems'
require 'hpricot'
require 'open-uri'

Now that we have the libraries loaded, we can create a new Hpricot object. In this example, we’ll load the PLANET ARGON About page.

# Open the PLANET ARGON about page
page = Hpricot( open( 'http://www.planetargon.com/about.html' ) )    

Great, let’s have some parsing fun. Let’s search for every div with a class name of team. Hpricot will return an array of the elements that match your search, so we can ask it how many it found.


page.search( "//div[@class='team']" ).size 
=> 7    

Great, this is a good sign that I need to add several people to the website. :-)
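
To see what that expression is matching, here is a minimal sketch that runs the same XPath against a small made-up snippet. It uses REXML, which is bundled with standard Ruby installs, rather than Hpricot, so that it runs offline. The markup, SAMPLE_PAGE, and team_divs are illustrative assumptions, not the real page or part of Hpricot. Hpricot’s selectors are only XPath-like, so treat this as an approximation of what it does.

```ruby
require 'rexml/document'

# Made-up, well-formed stand-in for the real page (REXML needs valid XML).
SAMPLE_PAGE = <<~MARKUP
  <html>
    <body>
      <div class="team">
        <div class="team_name"><strong>Robby Russell</strong>, Founder</div>
      </div>
      <div class="team">
        <div class="team_name"><strong>Allison Beckwith</strong></div>
      </div>
      <div class="sidebar"><strong>Not a team member</strong></div>
    </body>
  </html>
MARKUP

# Hypothetical helper: returns every <div>, at any depth, whose class
# attribute is exactly "team".
def team_divs(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, "//div[@class='team']")
end

puts team_divs(SAMPLE_PAGE).size  # prints 2
```

The leading // matches a div at any depth, and [@class='team'] keeps only those whose class attribute is exactly team. That is why the nested team_name divs and the sidebar div are not counted.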

If we want to peek at the first instance of this class, we can do:


page.search( "//div[@class='team']" ).first
=> {elem <div class="team"> "\n" {elem <div class="team_name"> {elem <strong> "Robby Russell" </strong>} ", Founder &#38; Executive Director" </div>}    ....SNIP

You’ll notice that there is a <strong> element within the results. We can search deeper into the tree to reach it.


page.search( "//div[@class='team']" ).first.search( "//strong" )
=> #<Hpricot::Elements[{elem <strong> "Robby Russell" </strong>}]>

Hpricot provides a method named inner_html, which will return the contents within the element.


page.search( "//div[@class='team']" ).first.search( "//strong" ).inner_html
=> "Robby Russell" 

Let’s now iterate through each of the elements and output all of the team member names.


# search for each team member div and iterate through them
page.search( "//div[@class='team']" ).each do |team|
  puts team.search( "//strong").inner_html
end    

Robby Russell
Allison Beckwith
Brian Ford
Nicole Fritz
Alain Bloch
Audrey Eschright
Gary Blessington
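
One thing to keep in mind if you carry these queries over to another parser: in the loop above, Hpricot scopes "//strong" to each team div, whereas in standard XPath a leading // is anchored at the document root. The sketch below shows the relative form, ".//strong", using REXML instead of Hpricot so that it runs offline. TEAM_MARKUP and team_names are made-up names for illustration, and the markup is a stand-in for the real page.

```ruby
require 'rexml/document'

# Made-up, well-formed stand-in for the real page.
TEAM_MARKUP = <<~MARKUP
  <html><body>
    <div class="team">
      <div class="team_name"><strong>Robby Russell</strong>, Founder</div>
    </div>
    <div class="team">
      <div class="team_name"><strong>Allison Beckwith</strong></div>
    </div>
  </body></html>
MARKUP

# Hypothetical helper: collects the text of the first <strong> found
# inside each team div.
def team_names(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, "//div[@class='team']").map do |team|
    # ".//strong" is relative to this team div, so each div yields its own name.
    REXML::XPath.first(team, ".//strong").text
  end
end

puts team_names(TEAM_MARKUP)
# prints:
# Robby Russell
# Allison Beckwith
```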

So, there you have it: a quick and basic introduction to using Hpricot for parsing HTML content. Hpricot can also parse XML, and its searches accept CSS-style selectors as well as XPath-style ones. For more examples, please visit the HpricotBasics page.

Final Thoughts

I’m going to guess that Why built this for hoodwink.d, which I’ve been a regular user of for a long time. I hadn’t spent much time with XPath syntax before, and playing around with Hpricot has given me a much better understanding of it.

As mentioned at the beginning of this post, I am going to make Get to Know a Gem a regular feature on my blog. If you know of a lesser-known gem that needs some attention, please send me a suggestion.

Until next time…

Comments

  1. Raymond Brigleb Tue, 13 Feb 2007 16:06:52 GMT

    Hey, nice write-up, Robby. Thanks!

    Is this by any chance a tribute to “Better Know a District?”

  2. Robby Russell Tue, 13 Feb 2007 17:10:37 GMT

    Raymond,

    Is this by any chance a tribute to “Better Know a District?”

    Of course! :-)

  3. Josh Tue, 13 Feb 2007 18:34:19 GMT

    Sweet, I didn’t know about hpricot. Thanks for providing a nice introduction!

  4. John Douthat Tue, 13 Feb 2007 19:25:26 GMT

    you can even shorten up the last example with a terser query:

    page.search( "//div[@class='team']//strong//text()" ).each { |name| puts name }

  5. John Douthat Tue, 13 Feb 2007 19:38:22 GMT

    actually, if we’re going for terseness, this one works too:

    page.search("div.team strong//text()").each { |name| puts name }

    _why’s infinite ingenuity allows one to arbitrarily switch between xpath(ish) and css(ish) selector syntax.

    I also recommend looking at WWW::Mechanize (http://rubyforge.org/projects/mechanize/) if you’re building a screen scraper. It handles things like redirects and cookies, and integrates Hpricot to make scraping a breeze.

  6. Steven A Bristol Thu, 15 Feb 2007 00:47:45 GMT

    If you like hpricot, check out the really cool assert_elements plugin (written by the excellent Yehuda Katz) which uses hpricot to aid in testing!

  7. pedro mg Thu, 15 Feb 2007 03:15:10 GMT

    Oh man, hope spammers don’t put hands on material like this. HTML fast parsing ? Hide those emails from html source, run :)

    Cool gem, tx Robby

  8. Raghu Thu, 14 Jun 2007 05:15:27 GMT

    Can we login, I mean submit a for ?

  9. Raghu Thu, 14 Jun 2007 05:15:47 GMT

    Can we login, I mean submit a form ?

  12. Rahul Fri, 20 Mar 2009 19:17:19 GMT

    The above example does not work for me, when I try to execute it on command prompt, I do not get any output? What can be wrong??

    Here is my file:-

    require 'rubygems'
    require 'hpricot'
    require 'open-uri'

    page = Hpricot( open( 'http://www.planetargon.com/about.html' ) )

    page.search( "//div[@class='team']" ).each do |team|
      puts team.search( "//strong").inner_html
    end
