# Get to Know a Gem: Hpricot

In this first installment of a new series, Get to Know a Gem, we’re going to take a look at Hpricot.

## What is Hpricot?

WhyTheLuckyStiff released Hpricot in July of 2006 in an effort to bring fast HTML parsing to the masses. It’s currently unknown what prompted it, but my guess would be that Why is secretly scraping all the pages on the internet that archive the future. To speed things up, Why wrote the Hpricot scanner in C, which makes it much faster than the other HTML parsing options available in Ruby.

## Installation

Installation is, as with most gems, very simple.

```shell
$ sudo gem install hpricot
Password:
Need to update 23 gems from http://gems.rubyforge.org
.......................
complete
Select which gem to install for your platform (powerpc-darwin8.7.0)
 1. hpricot 0.5 (ruby)
 2. hpricot 0.5 (mswin32)
 3. hpricot 0.4 (mswin32)
 4. hpricot 0.4 (ruby)
 5. Cancel installation
> 1
```

Great, let's now play with it!

## Usage

In this first example, we're going to use Hpricot to parse a web page
through the Open-URI library. For this, we'll need to require a few
libs.
```ruby
require 'rubygems'
require 'hpricot'
require 'open-uri'
```

Now that we have the libraries loaded, we can create a new Hpricot
document. In this example, we'll load the [PLANET ARGON About
page](http://www.planetargon.com/about.html).
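```ruby
# Open the PLANET ARGON about page and parse it with Hpricot.
# Note: on Ruby 3.0 and later, open-uri requires URI.open here instead of open.
page = Hpricot( open( 'http://www.planetargon.com/about.html' ) )
```
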
Great, let's have some parsing fun. Let's search for every `div`
with a class name of `team`. Hpricot will return an array of
elements that match your search.
```ruby
page.search( "//div[@class='team']" ).size 
=> 7
```

Great, this is a good sign that I need to add several people to the
website. :-)

If we want to peek at the first instance of this class, we can do:

```ruby
page.search( "//div[@class='team']" ).first
=> {elem <div class="team"> "
" {elem <div class="team_name"> {elem <strong> "Robby Russell" </strong>} ", Founder &#38; Executive Director" </div>}   ....SNIP
```

You'll notice that there is a `<strong>` element within the results.
We can search deeper into this tree to pull it out.
```ruby
page.search( "//div[@class='team']" ).first.search( "//strong" )
=> #<Hpricot::Elements[{elem <strong> "Robby Russell" </strong>}]>
```

Hpricot provides a method named `inner_html`, which will return the
contents within the element.
```ruby
page.search( "//div[@class='team']" ).first.search( "//strong" ).inner_html
=> "Robby Russell"
```

Let's now iterate through each of the elements and output all of the
team member names.
```ruby
# search for each team member div and iterate through them
page.search( "//div[@class='team']" ).each do |team|
  puts team.search( "//strong" ).inner_html
end
```

This prints:

```
Robby Russell
Allison Beckwith
Brian Ford
Nicole Fritz
Alain Bloch
Audrey Eschright
Gary Blessington
```
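
If you'd like to experiment with the same XPath query without depending on a live page, here's a minimal sketch. A few assumptions to be loud about: it uses REXML from Ruby's standard library as a stand-in rather than Hpricot, and the `TEAM_HTML` markup is made up to mimic the structure of the team listing, so it is not the real page. One difference worth knowing: in standard XPath, `//strong` searches from the document root, so the sketch uses `.//strong` to stay inside each team `div`.

```ruby
require 'rexml/document'

# Hypothetical, well-formed markup that mimics the team listing.
# This is NOT the real PLANET ARGON page.
TEAM_HTML = <<~HTML
  <div id="content">
    <div class="team">
      <div class="team_name"><strong>Robby Russell</strong>, Founder</div>
    </div>
    <div class="team">
      <div class="team_name"><strong>Allison Beckwith</strong></div>
    </div>
    <div class="sidebar"><strong>Not a team member</strong></div>
  </div>
HTML

doc = REXML::Document.new(TEAM_HTML)

# Same XPath expression as in the Hpricot examples:
# every div whose class attribute is exactly "team".
team_divs = REXML::XPath.match(doc, "//div[@class='team']")

# './/strong' limits the search to descendants of the current team div.
names = team_divs.map { |team| REXML::XPath.first(team, ".//strong").text }

puts team_divs.size
puts names
```

Because the markup lives in the script, you can tweak the class names or nesting and rerun it to see how the XPath expression responds.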

So, there you have it: a quick and basic introduction to using Hpricot for parsing HTML content. Hpricot can also parse XML, and it lets you search documents with CSS selectors as well as XPath. For more examples, please visit the HpricotBasics page.

## Final Thoughts

I’m going to guess that Why built this for hoodwink.d, which I’ve been a regular user of for a long time. I hadn’t spent much time with the XPath syntax before, and playing around with Hpricot has given me a much better understanding of it.

As mentioned at the beginning of this post, I am going to make Get to Know a Gem a regular feature on my blog. If you know of a lesser-known gem that needs some attention, please send me a suggestion.

Until next time…

Hi, I'm Robby.