HTML Parsing With Hpricot

Linux Creative Group Hpricot Dig The Impossible With Ruby By: Subhransu Behera [email protected]

Upload: subhransu-behera

Post on 13-May-2015


DESCRIPTION

We can use Hpricot to parse virtually any website. These slides show some techniques for parsing a site by tag, element ID, and XPath.

TRANSCRIPT

Page 1: HTML Parsing With Hpricot

Linux Creative Group

Hpricot – Dig The Impossible With Ruby

By: Subhransu Behera [email protected]

Page 2: HTML Parsing With Hpricot

What’s Special?

Ruby !!!

Page 3: HTML Parsing With Hpricot

So … Let’s See!

•  Dynamic
•  Easy to learn
•  Easy to maintain and grow
•  Convenient shortcuts. Ex:
   Str = "Linux Creative Group"
   Str_join = Str.split(" ").join("+")
•  Transparent, code faster
•  Few syntax errors, fewer bugs
•  It’s fun
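The shortcut on this slide can be tried directly in irb. This is a minimal sketch; the lowercase variable names are an illustrative choice (the slide's `Str` also works, but Ruby treats capitalized names as constants).

```ruby
# Turn a phrase into a "+"-separated string, as in the slide's example.
str = "Linux Creative Group"

# split(" ") breaks the string on whitespace; join("+") glues the words back together.
str_join = str.split(" ").join("+")

puts str_join  # prints: Linux+Creative+Group
```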

Page 4: HTML Parsing With Hpricot

Ruby Gems

•  Package management system for Ruby applications and libraries
•  Resolves dependencies.
•  Provides a central repository of software.
•  One command rules:
   - gem install <gem_name>
•  Can have your own local gem server:
   - gem install <gem_name> --source <gem_server_ip_and_port>

Page 5: HTML Parsing With Hpricot

Hpricot makes it easy to Parse

Page 6: HTML Parsing With Hpricot

Hpricot

•  Pull information from virtually any website.
•  Search by element ID, tags, CSS selectors.
•  Parse HTML, including broken HTML.
•  Update HTML.
•  Use this data anywhere and any way you want!
•  Parse by XPath for directly parsing an element.
•  Let’s see … how it works.

Page 7: HTML Parsing With Hpricot

Let’s Parse A Badly Designed Site !!

•  http://www.worldweather.org
•  It’s a site that provides weather information for different locations across the globe.
•  In the main page they have a badly nested table structure!!
•  An ideal web developer could have put them nicely in divs with meaningful IDs.
•  But let’s face the truth and parse the country names and their URLs.

Page 8: HTML Parsing With Hpricot

Easy Steps – 1. Open The Site

Page 9: HTML Parsing With Hpricot

Easy Steps – 2. Inspect With Firebug

Page 10: HTML Parsing With Hpricot

Easy Steps – 3. Copy X-Path of the Element

Page 11: HTML Parsing With Hpricot

Easy Steps – 4. Parse By X-Path Using Hpricot

Page 12: HTML Parsing With Hpricot

Use some Logic & You’ll Get

Page 13: HTML Parsing With Hpricot

Just Try it Out

Questions?

Page 14: HTML Parsing With Hpricot

References

•  Ruby Programming Language: http://www.ruby-lang.org/en/
•  Hpricot: http://code.whytheluckystiff.net/hpricot/
•  XPath: http://en.wikipedia.org/wiki/XPath
•  Firebug: http://getfirebug.com/

Page 15: HTML Parsing With Hpricot

Thanks