When RSS Fails: Web Scraping with HTTP
DESCRIPTION
A brief introduction to the HTTP protocol for use in web scraping, best practices, and availability of PHP-based HTTP client libraries.
TRANSCRIPT
When RSS Fails: Web Scraping with HTTP
Matthew Turland
Senior Consultant
Blue Parabola LLC
February 27, 2009
What is Web Scraping?
A 2 Step Process
Its Goal: Data
Obtain It
Transform It
Automate It
Step 1: Retrieval
The Client
The Server
The Request
The Response
Or In Your Case
Step 2: Analysis
Locate Desired Data
Extract It
Use It
2 Step Process
Step 1: Retrieval
GET /some/resource ...
HTTP/1.1 200 OK ...
Resource with data you want
Step 2: Analysis
Raw resource
Usable data
So To Recap
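The two steps in the recap can be sketched in a few lines. This is only an illustration in Python, not one of the PHP libraries covered later, and it makes some assumptions: retrieval is simulated with a hard-coded document (`RAW_RESOURCE`), and the markup, the `item` class name, and the `ItemExtractor` helper are all hypothetical.

```python
from html.parser import HTMLParser

# Step 1 (retrieval) is simulated here: this hard-coded string stands in
# for the body of a real HTTP response. The markup is made up.
RAW_RESOURCE = """<html><body>
<h1>Example listing</h1>
<ul>
  <li class="item">Alpha</li>
  <li class="item">Beta</li>
  <li class="other">Ignored</li>
</ul>
</body></html>"""


class ItemExtractor(HTMLParser):
    """Step 2 (analysis): locate <li class="item"> elements and extract their text."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "li" and ("class", "item") in attrs:
            self._in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False

    def handle_data(self, data):
        if self._in_item and data.strip():
            self.items.append(data.strip())


def analyze(raw):
    """Turn the raw resource into usable data (a list of strings)."""
    parser = ItemExtractor()
    parser.feed(raw)
    return parser.items


usable_data = analyze(RAW_RESOURCE)
```

Note how tightly the extraction is coupled to the markup: renaming the `item` class on the server would silently produce an empty list.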
Data Mining
Focus in data mining Focus in web scraping
Consuming Web Services
Web service data formats Web scraping data formats
How Is It Different?
System integration
Crawlers and indexers
Integration testing
What Is It Used For?
Disadvantages
One small change to markup...
... may break your application.
Or in modern terms...
Reverse Engineering Required
Multiple Requests
No Nice Neat Data Package
Quite the Opposite, In Fact
Use one like this:
To do this:
Know enough HTTP to...
PEAR::HTTP_Client, pecl_http, Zend_Http_Client
Learn to use and troubleshoot one like this:
Or roll your own!
cURL, Filesystem + Streams
Know enough HTTP to...
GET /wiki/Main_Page HTTP/1.1
Host: en.wikipedia.org
method or operation
URI address for the desired resource
protocol version in use by the client
header name header value
request line
header
more headers follow...
Let's GET Started
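To make the anatomy of a request concrete, here is a minimal sketch that assembles the GET request shown above and then splits its request line back into method, URI, and protocol version. It is written in Python for illustration only; `build_request` and `parse_request_line` are hypothetical helper names, not part of any library named in this talk.

```python
def build_request(method, uri, headers, version="HTTP/1.1"):
    """Assemble a request: request line, header lines, then a blank line."""
    lines = [f"{method} {uri} {version}"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # Lines end with CRLF; an empty line marks the end of the headers.
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_request_line(raw):
    """Split the first line into its three space-separated parts."""
    request_line = raw.split("\r\n", 1)[0]
    method, uri, version = request_line.split(" ")
    return {"method": method, "uri": uri, "version": version}


raw_request = build_request("GET", "/wiki/Main_Page", {"Host": "en.wikipedia.org"})
request_line = parse_request_line(raw_request)
```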
1. Uniquely identifies a resource
2. Indicates how to locate a resource
3. Does both and is thus human-usable.
URI
URL
More info in RFC 3986 Sections 1.1.3 and 1.2.2
URI vs URL
In principle: "Let's do this by the book."
GET
In reality: "'Safe operation'? Whatever."
GET
Warning about GET
http://en.wikipedia.org/w/index.php?title=Query_string&action=edit
URL
Query String
Question mark to separate the resource address and query string
Equal signs to separate parameter names and respective values
Ampersands to separate parameter name-value pairs
Parameter
Value
Query Strings
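The three separators can be seen at work by taking the Wikipedia URL above apart, first by hand and then with a standard library parser. This is a Python sketch for illustration; the variable names are arbitrary.

```python
from urllib.parse import urlsplit, parse_qsl

url = "http://en.wikipedia.org/w/index.php?title=Query_string&action=edit"

# By hand: "?" splits the resource address from the query string,
# "&" splits the name-value pairs, and "=" splits each name from its value.
address, _, query_string = url.partition("?")
pairs = [pair.split("=", 1) for pair in query_string.split("&")]
manual = {name: value for name, value in pairs}

# The same result using the standard library.
parts = urlsplit(url)
parsed = dict(parse_qsl(parts.query))
```

The manual version is only safe for values that need no decoding; the library version also handles URL-encoded characters.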
Parameter    Value
first        this is a field
second       is it clear enough (already)?
Query String
first=this+is+a+field&second=is+it+clear+enough+%28already%29%3F
Also called percent encoding.
parse_str, urlencode, urldecode: Handy PHP URL functions
$_SERVER['QUERY_STRING'] / http_build_query($_GET)
More info on URL encoding in RFC 3986 Section 2.1
URL Encoding
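As a rough illustration of URL encoding, the sketch below encodes the two example fields and decodes them again. It uses Python's `urllib.parse` rather than PHP; treating `urlencode` as an analogue of `http_build_query`, `quote_plus`/`unquote_plus` as analogues of `urlencode`/`urldecode`, and `parse_qs` as an analogue of `parse_str` is an approximation, since the functions differ in detail.

```python
from urllib.parse import urlencode, quote_plus, unquote_plus, parse_qs

fields = {
    "first": "this is a field",
    "second": "is it clear enough (already)?",
}

# Build a query string: spaces become "+", reserved characters become %XX.
query_string = urlencode(fields)

# Encode and decode a single value.
encoded_value = quote_plus("(already)?")
decoded_value = unquote_plus(encoded_value)

# Parse the query string back into names and (lists of) values.
round_trip = parse_qs(query_string)
```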
Most Common HTTP Operations
1. GET
2. POST
...
POST /w/index.php
/new/resource
-or-
/updated/resource
GET /some/resource HTTP/1.1
Header: Value
...
(request body: none)
POST /some/resource HTTP/1.1
Header: Value

request body
POST Requests
POST /w/index.php?title=Wikipedia:Sandbox HTTP/1.1
Content-Type: application/x-www-form-urlencoded

wpStarttime=20080719022313&wpEdittime=20080719022100...
Blank line separates request headers and body
Content type for data submitted via HTML form (multipart/form-data for file uploads)
Request body... look familiar?
Note: Most browsers have a query string length limit. Lowest known common denominator: IE7, strlen(entire URL) <= 2,048 bytes. This limit is not standardized. It applies to query strings, but not request bodies.
POST Request Example
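A form-encoded POST request like the one above can be assembled as follows. This is a Python sketch under a few assumptions: `build_post` is a hypothetical helper, the `Host` and `Content-Length` headers are additions that do not appear on the slide, and only the two form fields visible in the example are sent (the slide elides the rest).

```python
from urllib.parse import urlencode


def build_post(uri, host, fields):
    """Build a POST request whose body is a URL-encoded form submission."""
    body = urlencode(fields)
    head = "\r\n".join([
        f"POST {uri} HTTP/1.1",
        f"Host: {host}",  # assumption: not shown on the slide
        "Content-Type: application/x-www-form-urlencoded",
        f"Content-Length: {len(body.encode('ascii'))}",  # assumption: not shown on the slide
    ])
    # A blank line separates the request headers from the request body.
    return head + "\r\n\r\n" + body


request = build_post(
    "/w/index.php?title=Wikipedia:Sandbox",
    "en.wikipedia.org",
    {"wpStarttime": "20080719022313", "wpEdittime": "20080719022100"},
)
head, body = request.split("\r\n\r\n", 1)
```

The body has the same format as a query string, which is why the same encoding functions apply to both.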
HEAD /wiki/Main_Page HTTP/1.1
Host: en.wikipedia.org
Same as GET with two exceptions:
1
HTTP/1.1 200 OK
Header: Value
2 No response body
HEAD vs GET
Headers
Body
Sometimes headers are all you want
?
HEAD Request
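The difference between HEAD and GET can be observed without touching a real site. The sketch below, in Python, starts a throwaway server on the loopback interface and sends it both kinds of request; the handler, the page body, and the `fetch` helper are all made up for the demonstration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"<html><body>Main Page</body></html>"  # made-up page content


class Handler(BaseHTTPRequestHandler):
    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()

    def do_GET(self):
        self._send_headers()
        self.wfile.write(BODY)

    def do_HEAD(self):
        # Same status line and headers as GET, but no response body.
        self._send_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch(method, port, path="/wiki/Main_Page"):
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request(method, path)
    response = conn.getresponse()
    result = (response.status, response.getheader("Content-Length"), response.read())
    conn.close()
    return result


server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: let the OS pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

head_result = fetch("HEAD", port)
get_result = fetch("GET", port)

server.shutdown()
server.server_close()
```

Both responses carry the same `Content-Length` header, so a HEAD request is enough to learn the size of a resource without downloading it.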
HTTP/1.0 200 OK
Server: Apache
X-Powered-By: PHP/5.2.5
...
[body]
Lowest protocol version required to process the response
Response status code
Response status description
Status line
Same header format as requests, but different headers are used (see RFC 2616 Section 14)
Responses
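For troubleshooting it helps to be able to take a raw response apart. The following Python sketch splits the example response into status line, headers, and body; `parse_response` is a hypothetical helper and ignores details a real client must handle, such as repeated headers and chunked bodies.

```python
def parse_response(raw):
    """Split a raw response into protocol version, status, headers, and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    # The status description may contain spaces, so split at most twice.
    version, code, description = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return {
        "version": version,
        "status": int(code),
        "description": description,
        "headers": headers,
        "body": body,
    }


raw_response = (
    "HTTP/1.0 200 OK\r\n"
    "Server: Apache\r\n"
    "X-Powered-By: PHP/5.2.5\r\n"
    "\r\n"
    "[body]"
)
parsed_response = parse_response(raw_response)
```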
1xx Informational: Request received, continuing process.
2xx Success: Request received, understood, and accepted.
3xx Redirection: Client must take additional action to complete the request.
4xx Client Error: Request is malformed or could not be fulfilled.
5xx Server Error: Request was valid, but the server failed to process it.
See RFC 2616 Section 10 for more info.
Response Status Codes
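Because the first digit determines the class, a scraper can decide how to react to a response it has never seen before. A minimal Python sketch, with `status_class` as a hypothetical helper name:

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code):
    """Map a status code to its class using the first of its three digits."""
    return STATUS_CLASSES.get(code // 100, "Unknown")
```

For example, `status_class(404)` gives `"Client Error"`, which suggests fixing the request, while `status_class(503)` gives `"Server Error"`, which suggests retrying later.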
Set-Cookie
Cookie
See RFC 2109 or RFC 2965 for more info.
Location (watch out for infinite loops!)
Last-Modified
If-Modified-Since
OR
ETag
If-None-Match
304 Not Modified
Headers
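Two of these headers lend themselves to a short sketch: following `Location` without getting caught in a redirect loop, and choosing which conditional header to send so the server can answer `304 Not Modified`. The Python below simulates the server with a dictionary; `follow_redirects`, `conditional_headers`, the hop limit, the paths, and the ETag value are all illustrative assumptions.

```python
REDIRECT_CODES = (301, 302, 303, 307)


def follow_redirects(start, responses, max_hops=5):
    """Follow Location headers through a simulated server.

    `responses` maps each URL to a (status, headers) tuple. Raises
    RuntimeError on a loop or when more than `max_hops` URLs are visited.
    """
    seen = []
    url = start
    while True:
        if url in seen:
            raise RuntimeError("redirect loop detected at " + url)
        if len(seen) >= max_hops:
            raise RuntimeError("too many redirects")
        seen.append(url)
        status, headers = responses[url]
        if status in REDIRECT_CODES and "Location" in headers:
            url = headers["Location"]
            continue
        return url, status


def conditional_headers(cached):
    """Pick If-None-Match (ETag) or If-Modified-Since (Last-Modified)."""
    if cached.get("etag"):
        return {"If-None-Match": cached["etag"]}
    if cached.get("last_modified"):
        return {"If-Modified-Since": cached["last_modified"]}
    return {}


simulated = {
    "/old": (302, {"Location": "/new"}),
    "/new": (200, {}),
    "/ping": (302, {"Location": "/pong"}),
    "/pong": (302, {"Location": "/ping"}),
}
final = follow_redirects("/old", simulated)
```

When the server replies with `304 Not Modified` to a conditional request, the cached copy can be reused and no body needs to be transferred.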
WWW-Authenticate
Authorization
User-Agent
200 OK / 403 Forbidden
See RFC 2617 for more info.
User-Agent:
Some servers perform user agent sniffing
Some clients perform user agent spoofing
More Headers
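As a sketch of how these request headers are put together, the Python below builds an `Authorization` header for the Basic scheme described in RFC 2617 and pairs it with a `User-Agent` value. The credentials are the example pair from the RFC; `basic_auth_header` and the `ExampleScraper/0.1` user agent string are hypothetical.

```python
import base64


def basic_auth_header(username, password):
    """Basic scheme: base64 of "username:password", prefixed with "Basic "."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


request_headers = {
    "Authorization": basic_auth_header("Aladdin", "open sesame"),
    # Hypothetical identifier; a server that sniffs user agents may respond
    # differently depending on this value.
    "User-Agent": "ExampleScraper/0.1",
}
```

Base64 is an encoding, not encryption, so Basic credentials are readable by anyone who can see the request.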
Best Practices
Simulate User Behavior
Minimize Requests
Batch Jobs, Non-Peak Hours
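One way to act on these practices is to wrap whatever function performs retrieval so that repeated URLs are served from a cache and consecutive requests are spaced out. The Python sketch below is an assumption-laden illustration: `PoliteFetcher`, its one-second default delay, and the injected `fetch`, `sleep`, and `clock` callables are not taken from the talk.

```python
import time


class PoliteFetcher:
    """Cache responses and pause between requests to the server."""

    def __init__(self, fetch, min_delay=1.0, sleep=time.sleep, clock=time.monotonic):
        self._fetch = fetch          # callable that actually retrieves a URL
        self._min_delay = min_delay  # minimum seconds between real requests
        self._sleep = sleep
        self._clock = clock
        self._cache = {}
        self._last_request = None
        self.requests_made = 0

    def get(self, url):
        # Minimize requests: never fetch the same URL twice.
        if url in self._cache:
            return self._cache[url]
        # Simulate user behavior: wait if the previous request was too recent.
        if self._last_request is not None:
            wait = self._min_delay - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        self._last_request = self._clock()
        self.requests_made += 1
        self._cache[url] = self._fetch(url)
        return self._cache[url]
```

Passing `sleep` and `clock` in as parameters keeps the class testable without real delays; in a batch job run during non-peak hours the defaults would be used.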
Questions?
No heckling... OK, maybe just a little.
I generally blog about my experiences with web scraping
and PHP at http://ishouldbecoding.com.
</shameless_plug>
Thanks for coming!