Get to Know a Gem: Hpricot
This post kicks off a new series, Get to Know a Gem, with a look at Hpricot.
What is Hpricot?
WhyTheLuckyStiff released Hpricot in July of 2006 in an effort to bring fast HTML parsing to the masses. It’s currently unknown what prompted it, but my guess would be that Why is secretly scraping all the pages on the internet that archive the future. To speed things up, Why wrote the Hpricot scanner in C, which makes it much faster than the other HTML parsing options available in Ruby.
Installation
Installation is, as with most gems, very simple.
$ sudo gem install hpricot
Password:
Need to update 23 gems from http://gems.rubyforge.org
.......................
complete
Select which gem to install for your platform (powerpc-darwin8.7.0)
1. hpricot 0.5 (ruby)
2. hpricot 0.5 (mswin32)
3. hpricot 0.4 (mswin32)
4. hpricot 0.4 (ruby)
5. Cancel installation
> 1
Great, let’s now play with it!
Usage
In this first example, we’re going to use Hpricot to parse a web page through the Open-URI library. For this, we’ll need to require a few libs.
require 'rubygems'
require 'hpricot'
require 'open-uri'
Now that we have the libraries loaded, we can create a new Hpricot object. In this example, we’ll load the PLANET ARGON About page.
# Open the PLANET ARGON about page
page = Hpricot( open( 'http://www.planetargon.com/about.html' ) )
Great, let’s have some parsing fun. Let’s search for every div with a class name of team. Hpricot will return an array of the elements that match your search, so we can ask it how many it found.
page.search( "//div[@class='team']" ).size
=> 7
Great, this is a good sign that I need to add several people to the website. :-)
If we want to peek at the first instance of this class, we can do:
page.search( "//div[@class='team']" ).first
=> {elem <div class="team"> "\n" {elem <div class="team_name"> {elem <strong> "Robby Russell" </strong>} ", Founder & Executive Director" </div>} ....SNIP
You’ll notice that there is a <strong> element within the results, so we can search deeper into this tree.
page.search( "//div[@class='team']" ).first.search( "//strong" )
=> #<Hpricot::Elements[{elem <strong> "Robby Russell" </strong>}]>
Hpricot provides a method named inner_html, which will return the contents within the element.
page.search( "//div[@class='team']" ).first.search( "//strong" ).inner_html
=> "Robby Russell"
Let’s now iterate through each of the elements and output all of the team member names.
# search for each team member div and iterate through them
page.search( "//div[@class='team']" ).each do |team|
  puts team.search( "//strong" ).inner_html
end
Robby Russell
Allison Beckwith
Brian Ford
Nicole Fritz
Alain Bloch
Audrey Eschright
Gary Blessington
So, there you have it: a quick and basic introduction to using Hpricot for parsing HTML content. Hpricot can also parse XML, and its searches accept CSS selectors as well as XPath expressions. For more examples, please visit the HpricotBasics page.
Final Thoughts
I’m going to guess that Why built this for hoodwink.d, which I’ve been a regular user of for a long time. I hadn’t spent much time with XPath syntax before, and playing around with Hpricot has given me a much better understanding of it.
As mentioned at the beginning of this post, I am going to make Get to Know a Gem a regular feature on my blog. If you know of a lesser-known gem that needs some attention, please send me a suggestion.
Until next time…






Hey, nice write-up, Robby. Thanks!
Is this by any chance a tribute to “Better Know a District?”
Raymond,
Of course! :-)
Sweet, I didn’t know about hpricot. Thanks for providing a nice introduction!
you can even shorten up the last example with a terser query:
page.search( "//div[@class='team']//strong//text()" ).each { |name| puts name }
actually, if we’re going for terseness, this one works too:
page.search( "div.team strong//text()" ).each { |name| puts name }
_why’s infinite ingenuity allows one to arbitrarily switch between xpath(ish) and css(ish) selector syntax.
I also recommend looking at WWW::Mechanize (http://rubyforge.org/projects/mechanize/) if you’re building a screen scraper. It handles things like redirects and cookies, and integrates Hpricot to make scraping a breeze.
If you like hpricot, check out the really cool assert_elements plugin (written by the excellent Yehuda Katz) which uses hpricot to aid in testing!
Oh man, hope spammers don’t put hands on material like this. HTML fast parsing ? Hide those emails from html source, run :)
Cool gem, tx Robby
Can we login, I mean submit a form ?
The above example does not work for me, when I try to execute it on command prompt, I do not get any output? What can be wrong??
Here is my file:-
require 'rubygems'
require 'hpricot'
require 'open-uri'
page = Hpricot( open( 'http://www.planetargon.com/about.html' ) )
page.search( "//div[@class='team']" ).each do |team|
  puts team.search( "//strong" ).inner_html
end
Hey Robby, I’m trying to link my Rails app to my MySQL db, and when I type “rake db:migrate”, the terminal answers:
dhcp-28-114:depot gg$ rake db:migrate
(in /Users/gg/work/depot)
/Users/gg/.gem/ruby/1.8/gems/hpricot-0.8.2/lib/fast_xs.bundle: [BUG] Segmentation fault
ruby 1.8.7 (2009-06-12 patchlevel 174) [i686-darwin10.0.0]
Abort trap
I guess there’s something wrong with Hpricot, but I don’t know how to fix it… do you know what I should do? PS: I’m on Snow Leopard.
Thanks for your help
Guillaume