Transcript
Page 1: London XQuery Meetup: Querying the World (Web Scraping)

XQuery: Querying the World (formerly known as Web Scraping)

Dennis Knochenwefel <[email protected]>

Page 2: London XQuery Meetup: Querying the World (Web Scraping)

Evolution: Web Scraping

Page 3: London XQuery Meetup: Querying the World (Web Scraping)

source: http://www.bradino.com/php/screen-scraping/

PHP (2007)

$url = "http://www.nfl.com/teams/sandiegochargers/roster?team=SD";
$raw = file_get_contents($url);
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));
$start = strpos($content,'<table cellpadding="2" class="standard_table"');
$end = strpos($content,'</table>',$start) + 8;
$table = substr($content,$start,$end-$start);
preg_match_all("|<tr(.*)</tr>|U",$table,$rows);
foreach ($rows[0] as $row){
    if ((strpos($row,'<th')===false)){
        preg_match_all("|<td(.*)</td>|U",$row,$cells);
        $number = strip_tags($cells[0][0]);
        $name = strip_tags($cells[0][1]);
        $position = strip_tags($cells[0][2]);
        echo "{$position} - {$name} - Number {$number} <br>\n";
    }
}

Page 4: London XQuery Meetup: Querying the World (Web Scraping)

$url="http://www.rtu.ac.in/results/reformat.php";
$post="rollnumber=08epccs060&filename=fetchmodulesem_4_btech410m.php&button=Submit";

$ch=curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_POST,1);
curl_setopt($ch,CURLOPT_POSTFIELDS,$post);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
$content=curl_exec($ch);
curl_close($ch);

$totalPath="html/body/table[4]/tbody/tr[3]/td[4]";
$page=new DOMDocument();
$xpath=new DOMXPath($page);
$page->loadHTML($content);
$page->saveHTML();  // this shows the page contents
$total=$xpath->query($totalPath);
echo $total->length;    // shows 0
echo $total->item(0)->nodeValue;   // shows nothing

source: http://stackoverflow.com/questions/6283361/unable-to-get-table-data-from-a-html-page

PHP (June 2011)
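
The June 2011 snippet above reports zero matches. As a hedged sketch, and not the original poster's fix, here is the same kind of positional lookup run against a small made-up table. Two choices are worth noting: the DOMXPath object is created only after the markup has been loaded, and the path has no `tbody` step, because browsers add `tbody` to their own DOM while the raw markup used here has none. The HTML string, the path, and the variable names are illustrative assumptions.

```php
<?php
// Made-up markup standing in for a fetched results page (assumption).
$content = '<html><body><table>'
    . '<tr><th>Subject</th><th>Marks</th></tr>'
    . '<tr><td>Maths</td><td>71</td></tr>'
    . '<tr><td>Physics</td><td>64</td></tr>'
    . '</table></body></html>';

$page = new DOMDocument();

// Real-world HTML is rarely well-formed; collect parser warnings
// instead of printing them.
libxml_use_internal_errors(true);
$page->loadHTML($content);
libxml_clear_errors();

// Build the XPath object once the document has been loaded.
$xpath = new DOMXPath($page);

// No tbody step: the raw markup above does not contain one.
$cells = $xpath->query('/html/body/table[1]/tr[3]/td[2]');

// Guard against an empty node list before reading the value.
$total = $cells->length > 0 ? trim($cells->item(0)->nodeValue) : null;
echo $total === null ? "no match\n" : $total . "\n";
```

Even when it works, the lookup depends on the exact position of a table, row, and cell, so any layout change on the site breaks it.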

Page 5: London XQuery Meetup: Querying the World (Web Scraping)

XQuery

Page 6: London XQuery Meetup: Querying the World (Web Scraping)

Real World Example

Page 7: London XQuery Meetup: Querying the World (Web Scraping)

awesome site

awesome data

no API

Page 8: London XQuery Meetup: Querying the World (Web Scraping)

Deal with sessions

Page 9: London XQuery Meetup: Querying the World (Web Scraping)

Need to emulate setting options
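
Keeping a session and emulating the options a user would set in the page's form can both be sketched with PHP and cURL, the same tooling as the earlier snippets. This is a minimal sketch under assumptions: the helper `fetchWithSession`, the URLs, and the field names are hypothetical and are not taken from the site shown in the talk.

```php
<?php
// Hypothetical helper: send a request (POST when form fields are given)
// while sharing one cookie file across calls, so the server treats the
// calls as a single session.
function fetchWithSession(string $url, array $fields, string $cookieFile): string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // send stored cookies
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // store received cookies

    if ($fields !== []) {
        // Emulate submitting the HTML form with the chosen options.
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
    }

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException("Request to {$url} failed: " . curl_error($ch));
    }

    // The cookie jar is written when the handle is released on return.
    return $body;
}

// Usage sketch (hypothetical URLs and field names; not executed here):
// $jar  = tempnam(sys_get_temp_dir(), 'jar');
// fetchWithSession('http://example.com/start', [], $jar);  // obtain the session cookie
// $html = fetchWithSession('http://example.com/report', ['year' => '2011'], $jar);
```

The first call exists only to obtain the session cookie; the second reuses it and posts the option values that a user would otherwise pick in the form.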

Page 10: London XQuery Meetup: Querying the World (Web Scraping)

Different Notions: Publisher <=> Consumer

Page 11: London XQuery Meetup: Querying the World (Web Scraping)

Website App

CSV! HTML! XLS! Zip!

JSON? XML?

Page 12: London XQuery Meetup: Querying the World (Web Scraping)

Website App

CSV! HTML! XLS! Zip!

JSON? XML?

Session!

Stateless

REST

API ?

Page 13: London XQuery Meetup: Querying the World (Web Scraping)

Website App

CSV! HTML! XLS! Zip!

JSON? XML?

Session!

Stateless

REST

API ?

Customize with URL Params

HTML Forms

Page 15: London XQuery Meetup: Querying the World (Web Scraping)

Website App

CSV! HTML! XLS! Zip! Session!

HTML Forms

HTML !

Session!

HTML Forms

XQuery !

Page 16: London XQuery Meetup: Querying the World (Web Scraping)

Summary

Page 17: London XQuery Meetup: Querying the World (Web Scraping)

XQuery Web Data Processing

A browser can do it?

XQuery can do it!

Session handling

Forms

Page 18: London XQuery Meetup: Querying the World (Web Scraping)

Result: http://www.unemployment.by/country

