by robert vesco - meetupfiles.meetup.com/1503964/2010-06-24_webscrapeintro.pdf · check to make...
TRANSCRIPT
-
Introduction to Webscraping with R
By Robert Vesco
>> For Access to R Code, Please Open this Presentation in Dedicated PDF Application and Click on Pin
-
Outline
1 Why Use R for Webscraping
2 Why XML, XPATH Approach
3 The Basics of Webscraping
4 R Example
5 RCurl
6 Practical Advice
7 References
-
Why Use R for Webscraping
No external languages or scripts needed
Makes workflow more efficient
Easier to share with othe[R] colleagues
Can accomplish most webscraping needs quickly and efficiently
-
Why XML, XPATH Approach
Faster than using regular expressions
More robust
Nearly all languages now support XPATH approach
HTML code in the wild is getting better all the time, which makes XPATH more reliable
Can and should be used with regular expressions
-
HTML - The Code Behind the Web
(HTML listing; most of the tags were lost in extraction. Recoverable fragments:)
<li id=member_4403063>
  ...
  Robert V
  ...
  <li>
  Joined: April 12, 2010
  ...
  1) I'm a love[r] not a hate[r]
  2) ...
-
Using XPATH to Select Items
We want to select just my name:
.//*[@class='memName']
or use a fuller path:
.//div/div/div/ etc. ... /a[@class='memName']
Both will pull out Robert V from this code:
<a class="memName" ...>Robert V</a>
BUT, more importantly, the above query will also pull out every name if you're looking at all the code on the page!
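To see that last point in action, here is a minimal sketch using the XML package. The HTML string is made up for illustration (only the class name memName comes from the slide), and htmlParse() is used as a shorthand for parsing text into a queryable document.

```r
library(XML)  # assumes the XML package is installed

# Hypothetical markup, loosely modelled on a member list with two entries
chrHtml <- paste(
  "<html><body><ul>",
  "<li><a class='memName' href='#'>Robert V</a></li>",
  "<li><a class='memName' href='#'>Arun</a></li>",
  "</ul></body></html>",
  sep = "")

doc <- htmlParse(chrHtml, asText = TRUE)

# The class-based query matches every element with class 'memName',
# so one query returns all the names on the page
vNames <- xpathSApply(doc, "//*[@class='memName']", xmlValue)
print(vNames)
```

The same query therefore works unchanged whether the page lists one member or twenty.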
-
XPATH Tutorials
http://www.zvon.org/xxl/XPathTutorial/General/examples.html
http://www.w3schools.com/xpath/
-
Important XML Functions in R
htmlTreeParse() parses a file, cleans malformed HTML, and makes it available for querying
web <- htmlTreeParse(...)
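A minimal sketch of that cleanup step, assuming the XML package is installed. The HTML fragment is invented and deliberately leaves its <li> tags unclosed; the argument choices (asText, useInternalNodes, a silent error handler) are one reasonable setup, not the only one.

```r
library(XML)  # assumes the XML package is installed

# Invented fragment with unclosed <li> tags and no <html>/<body> wrapper
chrDirty <- "<ul><li>Robert V<li>Arun</ul>"

# asText = TRUE: parse the string itself rather than treating it as a file name
# error = function(...) {}: silence parser complaints about the dirty HTML
web <- htmlTreeParse(chrDirty, asText = TRUE, useInternalNodes = TRUE,
                     error = function(...) {})

# The parser has closed the <li> tags, so each item is its own node
vItems <- xpathSApply(web, "//li", xmlValue)   # values as a vector
nNodes <- length(getNodeSet(web, "//li"))      # the nodes themselves
print(vItems)
print(nNodes)
```

xpathSApply() is handy when you want values straight away; getNodeSet() keeps the nodes so you can run further queries on each one.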
-
Some Tools - Firebug + FireXPATH
Allows you to select items on a webpage and inspect their underlying tags. Also allows you to query your XPATH to see what it will select!
-
Some Tools - Selector Gadget
Allows you to select multiple elements on a screen
Useful for very complicated layouts
-
General Steps
Select and download the pages you want
Query the document
Select which items you want and save to dataframe
Repeat
-
Find Which Pages Have the Info You Need + Download
# Seq variables
cStartSeq <- 0
cEndSeq <- 100
cStepSeq <- 20
# Link variables
chrURLPrefix <- "http://www.meetup.com/RusersDC/members/?offset="
chrURLSuffix <- "&desc=1&sort=chapter_member.atime"
# Files will be read from and saved to this path. Make sure it exists on your computer
# or change the path to wherever you saved this script!
chrSetDir = "/R/R Meetup/"
# setwd(chrSetDir)  # commented out for Sweave
chrDir <- paste(chrSetDir, "RawData/", sep = "")
# Check to see if the folder exists, else create it.
if ("RawData" %in% dir(chrSetDir) == FALSE) { dir.create(chrDir) }
for (w in seq(cStartSeq, cEndSeq, cStepSeq))
{
  # Create the URL of the page to download
  url <- paste(chrURLPrefix, w, chrURLSuffix, sep = "")
  # Create a file name for the URL. Important because URLs sometimes contain
  # characters or have lengths that are illegal for file systems.
  urlName <- paste(chrDir, w, ".html", sep = "")
  # Without error catching the script will crash. Websites frequently time out!
  err <- try(download.file(url, destfile = urlName, quiet = TRUE), silent = TRUE)
  if (inherits(err, "try-error"))
  {
    # You may be hitting the server too hard, so back off and try again later.
    Sys.sleep(5)  # in seconds, adjust as necessary
    try(download.file(url, destfile = urlName, quiet = TRUE), silent = TRUE)
  }
}
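The download loop retries once after a fixed pause. If you want more than one retry, one option is to wrap the idea in a small helper. This is only a sketch: fetchWithRetry, nTries, waitSecs, and fetchFun are names invented here, and the growing wait between attempts is an assumption about what a polite back-off might look like.

```r
# Hypothetical helper: try a download up to nTries times, waiting longer
# after each failure. fetchFun defaults to download.file but can be swapped
# out (for example, for a fake function when testing without a network).
fetchWithRetry <- function(url, destfile, nTries = 3, waitSecs = 5,
                           fetchFun = download.file) {
  for (k in seq_len(nTries)) {
    ok <- tryCatch({
      fetchFun(url, destfile = destfile, quiet = TRUE)
      TRUE
    }, error = function(e) FALSE)
    if (ok) return(TRUE)
    # Back off before the next attempt: waitSecs, then 2 * waitSecs, ...
    if (k < nTries) Sys.sleep(waitSecs * k)
  }
  FALSE  # every attempt failed
}

# Possible use inside a download loop (not run here):
# if (!fetchWithRetry(url, urlName)) message("Giving up on ", url)
```

Returning TRUE or FALSE instead of stopping lets the loop carry on with the next page and leaves a record of which pages to revisit.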
-
Process Pages and Extract Data
require(XML)
require(xtable)
# Put the files in the RawData folder into a vector and get its length
vFiles <- list.files(chrDir)
iLenFilesList <- length(vFiles)
# Create a list to store the data frames
ls <- list()
for (i in 1:iLenFilesList)
{
  # Each i will pull a different file
  url <- vFiles[i]
  chrFileUrl <- paste(chrDir, url, sep = "")
  # This function works on dirty HTML, adding closing tags and such.
  web <- htmlTreeParse(chrFileUrl, error = function(...) {}, useInternalNodes = TRUE,
                       encoding = "UTF-8", trim = TRUE)
  # Use a vectorized function to get the names
  vNames <- xpathSApply(web, "//*[@class='memName']", xmlValue)
  # Same as above, but use a regex to clean up a bit
  vDates <- gsub("Joined:|\r\n", "", xpathSApply(web, "//*/ul[@class='D_less memStats']/li", xmlValue))
  # Since not every person has a quote, we break the problem into parts, first getting chunks of code
  vQuote2 <- getNodeSet(web, "//*/div[@class='D_title']")
  # Now we look for quotes. Notice ".//": this means subquery. IMPORTANT!
  vQuote3 <- sapply(vQuote2, function(x) xpathSApply(x, ".//p[@class='D_less']", xmlValue))
  # We get list() for nodes with no quotes. Replace list() with NA
  # (the contents of the [ ] character class below were lost in extraction)
  vQuote4 <- gsub('\r|\n|[ ]', "", sapply(vQuote3, function(x) ifelse(is.list(x), NA, x)))
  # Add the df to the list. This is OK for small scrapes, but for larger ones you need to write to file or db
  ls[[i]] <- data.frame(Name = vNames, Date = vDates, Quote = vQuote4, stringsAsFactors = FALSE)
}
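The ".//" point is easy to miss, so here is a small sketch of the difference, assuming the XML package is installed. The markup is invented (two member blocks, only the first with a quote) and only borrows the class names used in the query strings.

```r
library(XML)  # assumes the XML package is installed

# Invented page: member A has a quote, member B does not
chrHtml <- paste(
  "<html><body>",
  "<div class='D_title'><a class='memName'>A</a><p class='D_less'>Hello</p></div>",
  "<div class='D_title'><a class='memName'>B</a></div>",
  "</body></html>",
  sep = "")

doc <- htmlParse(chrHtml, asText = TRUE)
nodes <- getNodeSet(doc, "//div[@class='D_title']")

# Relative query (".//"): searches only inside each member's node,
# so a member without a quote yields NA
vRel <- sapply(nodes, function(x) {
  q <- xpathSApply(x, ".//p[@class='D_less']", xmlValue)
  if (length(q) == 0) NA_character_ else q[1]
})

# Absolute query ("//"): starts from the document root even though a node
# is passed in, so both members appear to have a quote
nAbs <- sapply(nodes, function(x) {
  length(xpathSApply(x, "//p[@class='D_less']", xmlValue))
})

print(vRel)
print(nAbs)
```

With the relative form the result lines up one-to-one with the member nodes, which is what makes it safe to put names, dates, and quotes side by side in a data frame.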
-
Combine Data from Each Loop
# Combine the data frames
df <- do.call(rbind, ls)
# Sample output for LaTeX
library(Hmisc)
df2 <- df[1:3, ]
latex(df2, file = '', col.just = c('l', 'l', 'p{2in}'))

df2  Name          Date            Quote
1    JOEL ROBERTS  June 9, 2010    My name is Joel Roberts. I am a friend of Bryan Stroube who told me about the DC useR Group. I work with several systems that use XML for information exchange between dissimilar computer / software systems.
2    Arun          May 1, 2010     NA
3    Travis M      April 16, 2010  NA
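For anyone new to the do.call(rbind, ...) idiom, here is a tiny base R sketch with made-up data. The list and column values are placeholders; the point is only that a list of same-shaped data frames collapses into one.

```r
# Made-up per-page results, one data frame per scraped page
lsPages <- list(
  data.frame(Name = c("A", "B"), Quote = c("hi", NA), stringsAsFactors = FALSE),
  data.frame(Name = "C", Quote = NA_character_, stringsAsFactors = FALSE)
)

# do.call passes every element of the list to rbind as a separate argument,
# equivalent to rbind(lsPages[[1]], lsPages[[2]])
dfAll <- do.call(rbind, lsPages)
print(dfAll)
```

This works as long as every data frame has the same column names, which is why it helps to fill missing values with NA rather than dropping the column.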
-
RCurl Package
More flexible; allows one to modify headers, referers, etc.
Goes beyond http: https, ftp, sftp, scp, ldap, etc.
Comes from the C library libcurl, so it is fast, extensive, and actively developed
Can use persistent connections and cookies, and can process requests as they come in rather than sequentially
-
Some RCurl Examples
getURL() allows the use of https, whereas built-in R functions do not (recent R versions may be different; the internet2 method must have a valid cert)
library(RCurl)
# Certificate checks are switched off in this call
txt = getURL("https://www.twitter.com", ssl.verifyhost = 0, ssl.verifypeer = 0)
getCurlHandle(): handles allow persistent connections and settings to be used across repeated calls to the same server. This is similar to passing a list of arguments, but potentially with better network connectivity.
# cookie is assumed to hold a cookie string defined elsewhere
curl = getCurlHandle(cookie = cookie, useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
txt = getURL("http://www.meetup.com/RusersDC/members/", curl = curl, .opts = list(verbose = TRUE))
txt2 = getURL("http://www.meetup.com/RusersDC/members/", curl = curl, .opts = list(verbose = TRUE))
-
Warning and Practical Advice
Be nice! Go easy on the servers, especially if you're using RCurl
Check to make sure the website does not prohibit scraping. See if they already have an API. Otherwise, send an email to the site owners.
If a login is required, then you need to use the RCurl package.
Always use a proxy. If for nothing else, you don't want your home address to accidentally get blocked (e.g., Google's automated query blocking).
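One way to go easy on a server is to enforce a minimum gap between requests. The sketch below is base R only; makeThrottle and minGapSecs are names invented for this illustration, and the right gap depends entirely on the site you are scraping.

```r
# Hypothetical helper: returns a function that, each time it is called,
# sleeps just long enough to keep calls at least minGapSecs apart
makeThrottle <- function(minGapSecs) {
  lastCall <- NULL
  function() {
    if (!is.null(lastCall)) {
      waited <- as.numeric(difftime(Sys.time(), lastCall, units = "secs"))
      if (waited < minGapSecs) Sys.sleep(minGapSecs - waited)
    }
    lastCall <<- Sys.time()  # remember when this call was let through
    invisible(NULL)
  }
}

# Possible use inside a download loop (not run here):
# throttle <- makeThrottle(2)
# for (u in vUrls) { throttle(); download.file(u, ...) }
```

Because the helper measures time since the last request, slow pages do not add an extra delay on top of the time already spent downloading.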
-
Warning and Practical Advice
Consider setting up your old computer or laptop to run the scraping. This frees up your main computer. Set it to email you if it crashes.
Get familiar with error catching and debugging
Even if you try to make your code robust, large jobs will require several retweakings of the code because, ex ante, you don't know all the possible permutations. Your code will become more generalized with more exception handling.
XML packages have slightly different interpretations. Hence going from one language to the next may require slightly different queries
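As a starting point for the error-catching advice, here is a base R sketch of a loop that records failures and keeps going, so a large job can be inspected and re-run on just the pages that broke. processAll, vInputs, and processFun are invented names; the stand-in function at the bottom only simulates a page that fails to parse.

```r
# Hypothetical helper: apply processFun to each input, collect the results,
# and remember which inputs raised an error instead of crashing the job.
# Note: an input whose processFun legitimately returns NULL is also
# counted as failed in this simple version.
processAll <- function(vInputs, processFun) {
  lsOut <- list()
  vFailed <- character(0)
  for (x in vInputs) {
    res <- tryCatch(processFun(x), error = function(e) {
      message("Failed on ", x, ": ", conditionMessage(e))
      NULL
    })
    if (is.null(res)) vFailed <- c(vFailed, x) else lsOut[[x]] <- res
  }
  list(results = lsOut, failed = vFailed)
}

# Stand-in for a real parsing function: fails on one particular input
demoFun <- function(x) if (x == "page2.html") stop("unexpected layout") else toupper(x)
out <- processAll(c("page1.html", "page2.html", "page3.html"), demoFun)
print(out$failed)
```

The failed vector tells you which files to open in the browser while you rework the query.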
-
References
R code used in this presentation: click on the pin
Tutorials: http://www.zvon.org/xxl/XPathTutorial/General/examples.html
  http://www.w3schools.com/xpath/
Firebug: http://getfirebug.com/
  https://addons.mozilla.org/en-US/firefox/addon/11900/
SelectorGadget: http://www.selectorgadget.com/
XML package docs: http://www.omegahat.org/RSXML/
RCurl package: http://www.omegahat.org/RCurl/
# Used R 2.9
# Used Frank Harrell's Sweavel.sty file for code in presentation
# Written by Robert Vesco for R-meetup 2010-06-24
# Purpose: An example to illustrate basic webscraping with R
###############################################################################
# Create loop to download all the relevant pages.
# We want to do this to minimize timeout errors and keep evidence
###############################################################################
# Seq Variables
cStartSeq