HTML Parsing with Hpricot
DESCRIPTION
We can use Hpricot to parse virtually any website. The slides show some techniques for parsing a site by tags, element IDs, and XPath.
TRANSCRIPT
What’s Special?
Ruby !!!
So … Let’s See !
• Dynamic
• Easy to Learn
• Easy to maintain and grow
• Convenient Short-Cuts
Ex: Str = "Linux Creative Group"
Str_join = Str.split(" ").join("+")
• Transparent, code faster
• Few Syntax Errors, Fewer Bugs
• It’s Fun
Ruby Gems
• Package Management System for Ruby Applications and Libraries
• Resolve Dependencies.
• Provides Central Repository of Software.
• One Command Rules:
  - gem install <gem_name>
• Can Have your Own Local Gem Server
  - gem install <gem_name> --source <gem_server_ip_and_port>
Hpricot makes it easy to Parse
Hpricot
• Pull information from virtually any website.
• Search by Element ID, Tags, CSS Selectors.
• Parse HTML including broken HTML
• Update HTML
• Use this data anywhere and any way you want!
• Parse by XPath for directly parsing an element.
• Let’s see…. How it works.
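As a rough illustration of the search styles listed above, the sketch below runs Hpricot over a small inline snippet. The markup, the `forecast` id, the `country` class, and the `search_examples` method name are all made up for this example, and it assumes the hpricot gem is installed.

```ruby
begin
  require 'hpricot'   # assumes the gem is installed: gem install hpricot
rescue LoadError
  warn 'hpricot not found; install it with: gem install hpricot'
end

# Made-up sample markup, not taken from any real site. The <td> and <tr>
# tags are deliberately left unclosed to mimic sloppy HTML.
SAMPLE_HTML = <<-EOS
<html><body>
<div id="forecast"><p>Sunny spells</p></div>
<table>
<tr><td class="country"><a href="/a.htm">Country A</a>
<tr><td class="country"><a href="/b.htm">Country B</a>
</table>
</body></html>
EOS

# Runs three search styles over an HTML string and collects the results.
def search_examples(html)
  doc = Hpricot(html)
  {
    :by_id  => doc.at('#forecast').inner_text.strip,         # element ID
    :by_tag => doc.search('a').map { |a| a.inner_text },     # tag name
    :by_css => (doc / 'td.country a').map { |a| a['href'] }  # CSS selector
  }
end

# Only run the demo when the gem actually loaded.
p search_examples(SAMPLE_HTML) if defined?(Hpricot)
```

Here `at` returns the first match, while `search` and its `/` shorthand return every match.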
Let’s Parse A Badly Designed Site !!
• http://www.worldweather.org
• It’s a site that provides weather information for different locations across the globe.
• In the main page they have a badly nested table structure !!
• An ideal Web-Developer could have put them nicely in divs with meaningful IDs.
• But let’s face the truth and parse the Country Names and their URLs.
Easy Steps – 1. Open The Site
Easy Steps – 2. Inspect With Firebug
Easy Steps – 3. Copy X-Path of the Element
Easy Steps – 4. Parse By X-Path Using Hpricot
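The four steps might come together roughly as follows. The XPath string here is only a stand-in, since the one copied in the slides is not part of this transcript; substitute the path Firebug gives you. One thing to watch for: browsers typically add `tbody` elements to the DOM that are absent from the page source, so a copied path may need its `tbody` steps removed before Hpricot can match it. The method and constant names are invented for this sketch, and it assumes the hpricot gem is installed.

```ruby
require 'open-uri'
begin
  require 'hpricot'   # assumes the gem is installed: gem install hpricot
rescue LoadError
  warn 'hpricot not found; install it with: gem install hpricot'
end

SITE_URL = 'http://www.worldweather.org'   # site named in the slides

# Placeholder XPath for illustration only; the real one comes from
# Firebug's "Copy XPath" on the element you want.
EXAMPLE_XPATH = '/html/body/table/tr/td/a'

# Step 1: download the page and parse it into an Hpricot document.
def fetch_doc(url)
  Hpricot(URI.parse(url).read)
end

# Step 4: return the elements that the XPath selects.
def elements_at(doc, xpath)
  doc.search(xpath)
end
```

Calling `elements_at(fetch_doc(SITE_URL), your_xpath)` would then return the matching elements; `inner_text` and `['href']` give each link's label and target.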
Use some Logic & You’ll Get
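The result shown on this slide is not part of the transcript. As one guess at the kind of logic involved, the sketch below takes pairs of link text and `href`, such as those read from the matched elements, and builds a table of country name to absolute URL. The sample pairs and the `country_table` method name are invented for illustration.

```ruby
require 'uri'

BASE_URL = 'http://www.worldweather.org/'   # site named in the slides

# Turns [name, href] pairs into { name => absolute URL }, skipping
# entries with a blank name or a missing href.
def country_table(pairs, base_url = BASE_URL)
  table = {}
  pairs.each do |name, href|
    name = name.to_s.strip
    next if name.empty? || href.nil?
    table[name] = URI.join(base_url, href).to_s
  end
  table
end

# Invented sample data, standing in for values read from the page.
sample = [['Country A', '/a.htm'], ['  Country B ', 'b/index.htm'], ['', '/skip.htm']]
country_table(sample).each { |name, url| puts "#{name}: #{url}" }
```

`URI.join` resolves both root-relative and plain relative links against the base, so the table holds full URLs either way.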
Just Try it Out
Questions?
References
• Ruby Programming Language: http://www.ruby-lang.org/en/
• Hpricot: http://code.whytheluckystiff.net/hpricot/
• X-Path: http://en.wikipedia.org/wiki/XPath
• Firebug: http://getfirebug.com/
Thanks