HTML Parsing with Hpricot

Linux Creative Group Hpricot – Dig The Impossible With Ruby By: Subhransu Behera [email protected]

Upload: subhransu-behera

Post on 13-May-2015

DESCRIPTION

Hpricot can parse virtually any website. These slides show techniques for parsing a site by tag, element ID, and XPath.

TRANSCRIPT

What’s Special?

Ruby !!!

So … Let’s See!

• Dynamic
• Easy to learn
• Easy to maintain and grow
• Convenient shortcuts. Ex:
    Str = "Linux Creative Group"
    Str_join = Str.split(" ").join("+")
• Transparent, code faster
• Few syntax errors, fewer bugs
• It’s fun
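The shortcut on this slide can be tried directly. This is a minimal sketch, assuming the separator passed to split on the original slide was a single space (the extracted text lost its whitespace); the lowercase variable names are a choice made here, not taken from the slide.

```ruby
# Split a phrase on spaces, then glue the words back together with "+".
str = "Linux Creative Group"
str_join = str.split(" ").join("+")

puts str_join  # prints "Linux+Creative+Group"
```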

Ruby Gems

• Package management system for Ruby applications and libraries
• Resolves dependencies.
• Provides a central repository of software.
• One command rules:
  - gem install <gem_name>
• Can have your own local gem server:
  - gem install <gem_name> --source <gem_server_ip_and_port>

Hpricot makes it easy to Parse

Hpricot

• Pull information from virtually any website.
• Search by element ID, tags, CSS selectors.
• Parse HTML, including broken HTML
• Update HTML
• Use this data anywhere and any way you want!
• Parse by XPath for directly parsing an element.
• Let’s see…. How it works.

Let’s Parse A Badly Designed Site !!

• http://www.worldweather.org
• It’s a site that provides weather information for different locations across the globe.
• On the main page they have a badly nested table structure!!
• An ideal web developer would have put them nicely in divs with meaningful IDs.
• But let’s face the truth and parse the country names and their URLs.

Easy Steps – 1. Open The Site

Easy Steps – 2. Inspect With Firebug

Easy Steps – 3. Copy X-Path of the Element

Easy Steps – 4. Parse By X-Path Using Hpricot

Use some Logic & You’ll Get

Just Try it Out

Questions?

References

• Ruby Programming Language: http://www.ruby-lang.org/en/
• Hpricot: http://code.whytheluckystiff.net/hpricot/
• X-Path: http://en.wikipedia.org/wiki/XPath
• Firebug: http://getfirebug.com/

Thanks
