R and Collecting Internet Data

Luis F. Campos
Department of Statistics, University of California, Berkeley
February 11, 2011 - 4-5 PM - 1011 Evans Hall

Outline

1 Motivation
  Internet Movie Database (IMDb)
  Music Databases
  Yahoo! Sports
  Main example: Superbowl - Play by Play Data

2 Tools we need to learn
  XML/HTML
  XML Path Query Language (XPath QL)
  R Programming Language
  R Package: XML
  Regular Expressions (regex)
  Demonstration/Resources

Internet Movie Database (IMDb)

Music Databases

Yahoo! Sports

Main example: Superbowl - Play by Play Data

What is actually going on in your browser?

XML/HTML

What is XML?

- XML stands for eXtensible Markup Language
  - markup languages: provide annotations for structure (marks) that are distinguishable from text (examples: LaTeX, HTML, PostScript)
  - extensible: you can create your own marks
- In XML, the structure markers are angle brackets: < >
- HTML can be treated as a close relative of XML (strictly, both derive from SGML), so the same tree tools apply to both
- HTML (HyperText Markup Language) is currently the predominant markup language for web pages

What is HTML? Simply, a set of predetermined structural markers.

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <html>
      <head>
        <title>The document title</title>
      </head>
      <body>
        <h1>Main heading</h1>
        <p>A paragraph.</p>
        <a href = "www.stat.berkeley.edu">Statistics Website</a>
      </body>
    </html>

Another useful way to view an HTML document is as a tree.

- We call the boxes nodes and the arrows edges
- An edge goes from a to b if b is nested directly inside a:

    <a>
      <b></b>
    </a>

- Note: there is a unique path from the root node to any given node.
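The tree view can be checked directly in R. The sketch below is only an illustration: it assumes the XML package is installed, re-types the example page from the earlier slide as a compact string, and uses variable names (html, root_children, path_to_a) that are made up here.

```r
library(XML)

# The example page from the slide, written compactly (no whitespace between tags).
html <- paste0(
  "<html><head><title>The document title</title></head>",
  "<body><h1>Main heading</h1><p>A paragraph.</p>",
  "<a href='www.stat.berkeley.edu'>Statistics Website</a></body></html>"
)
doc <- htmlParse(html, asText = TRUE)

# The root node of the tree
root <- xmlRoot(doc)
root_name <- xmlName(root)

# Edges leaving the root: one per directly nested node
root_children <- unname(sapply(xmlChildren(root), xmlName))

# The unique path from the root down to the anchor node,
# collected by walking up the chain of parents
a_node <- getNodeSet(doc, "//a")[[1]]
path_to_a <- unlist(xmlAncestors(a_node, xmlName))

print(root_children)
print(path_to_a)
```

Printing path_to_a lists the node names from the root down to the anchor, which is the path an XPath expression has to spell out.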

Each node can have the following elements associated with it:

- a required name for the node, not necessarily unique
- any number of attributes
- optional text

    <a href = "www.stat.berkeley.edu">Statistics Website</a>

- node name: a
- attribute: href with value "www.stat.berkeley.edu"
- text: "Statistics Website"
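As a rough sketch, the three parts of the anchor node can be read back with accessor functions from the XML package (assumed installed). The variable names are illustrative only.

```r
library(XML)

# Parse just the anchor from the slide; the parser wraps it in html/body for us.
doc <- htmlParse('<a href="www.stat.berkeley.edu">Statistics Website</a>',
                 asText = TRUE)
node <- getNodeSet(doc, "//a")[[1]]

node_name  <- xmlName(node)             # the node name
node_attrs <- xmlAttrs(node)            # all attributes, as a named character vector
node_href  <- xmlGetAttr(node, "href")  # one attribute, looked up by name
node_text  <- xmlValue(node)            # the text inside the node

print(c(name = node_name, href = node_href, text = node_text))
```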

XML Path Query Language (XPath QL)

XPath is a query language that allows you to retrieve specific nodes in an XML/HTML tree. Some of the symbols we can use to represent a path on the tree:

    /         selects from the root node
    //        selects from anywhere in the tree
    .         selects the current node
    ..        selects the parent of the current node
    @         selects attributes
    nodename  finds location(s) of the named node
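One way to get a feel for these symbols is to try each on a tiny page. The sketch below assumes the XML package is installed; the page and all variable names are invented for illustration.

```r
library(XML)

html <- paste0(
  "<html><head><title>Demo</title></head>",
  "<body><div id='main'><a href='stat.berkeley.edu'>stats</a></div>",
  "<p>No links here.</p></body></html>"
)
doc <- htmlParse(html, asText = TRUE)

# "/" starts at the root: an absolute path, one step per edge
abs_path <- xpathSApply(doc, "/html/body/div/a", xmlValue)

# "//" searches anywhere in the tree
anywhere <- xpathSApply(doc, "//a", xmlValue)

# "@" selects attributes: the href of every anchor
hrefs <- as.character(xpathSApply(doc, "//a/@href"))

# ".." steps up to the parent: the node that contains the anchor
parent_name <- xpathSApply(doc, "//a/..", xmlName)

# "." is the current node: a query run relative to a node we already hold
div <- getNodeSet(doc, "//div")[[1]]
inside_div <- xpathSApply(div, "./a", xmlValue)
```

The absolute path and the "//" shortcut reach the same node here because the page has a single anchor; on a real page "//a" usually returns many more.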

To reach the only anchor node in the tree (here nested as html > body > div > a), we can use:

    "/html/body/div/a" or "//a"

If we had <a href = "stat.berkeley.edu">stats</a>:

    "//a[@href = 'stat.berkeley.edu']" or "//a[text() = 'stats']"
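The bracketed conditions matter once a page has more than one anchor. A small sketch, assuming the XML package is installed; the second link (math.berkeley.edu) is a made-up addition so that there is something to filter out.

```r
library(XML)

html <- paste0(
  "<html><body><div>",
  "<a href='stat.berkeley.edu'>stats</a>",
  "<a href='math.berkeley.edu'>math</a>",  # hypothetical second link
  "</div></body></html>"
)
doc <- htmlParse(html, asText = TRUE)

# Without a condition, "//a" matches both anchors
all_links <- xpathSApply(doc, "//a", xmlValue)

# Keep only the anchor whose href attribute has a given value
by_href <- xpathSApply(doc, "//a[@href = 'stat.berkeley.edu']", xmlValue)

# Keep only the anchor with a given text, then read its href
by_text <- xpathSApply(doc, "//a[text() = 'stats']", xmlGetAttr, "href")
```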

R Programming Language

Why R, in general?

- Plenty of packages! So you can extend R to do whatever you want.
- It's free! So you can spend money on other things.

Why R, for collecting internet data?

- Collecting data for statistical purposes <-> R is made for statistics
- Implementations of XPath and regular expressions are fairly intuitive
- There are other ways, but they follow similar procedures

R: briefly

- R is an interpreted language with a command line: it evaluates the commands you type
- We'll be using functions provided by packages (XML) and the base package. Example of calling a function:

    x <- fun(unnamed, arg = named)

- It's usually a good idea to name all named arguments when calling a function (otherwise R will just match them in order)
- Notice that if fun has any output it will be stored in x
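A small base-R sketch of why naming arguments helps. The function fun below is hypothetical, defined here only to stand in for the one on the slide; subtraction is used because the result changes when the arguments are swapped.

```r
# A hypothetical two-argument function standing in for the slide's `fun`.
fun <- function(first, arg = 10) {
  first - arg
}

x_named      <- fun(1, arg = 3)          # first = 1, arg = 3  -> -2
x_positional <- fun(1, 3)                # same values, matched by position -> -2
x_reordered  <- fun(arg = 3, first = 1)  # names make the order irrelevant -> -2
x_swapped    <- fun(3, 1)                # unnamed and swapped: 3 - 1 -> 2
x_default    <- fun(1)                   # arg falls back to its default: 1 - 10 -> -9
```

In each case the output of the call is stored in the variable on the left of the arrow.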

R Package: XML

Created by Prof. Duncan Temple Lang. This package is a collection of functions that will enable us to work with XML/HTML tree structures. Some of the functions we will use are:

- htmlTreeParse: gets the specified html page and creates an internal tree structure. We can then use XPath to traverse the tree

    > library(XML)
    > url = "http://en.wikipedia.org/wiki/Main_Page"
    > doc = htmlTreeParse(url, useInternalNodes = TRUE)

- getNodeSet: takes the parsed document and XPath instructions about the node we're looking for:

    > x1 = getNodeSet(doc, "//div[@class = 'bd']")
    > x2 = getNodeSet(doc, "//a")
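The same two calls can be tried without a network connection by handing the parser a string. This is a sketch under assumptions: the XML package is installed, and the page below is a small made-up stand-in for a downloaded one (its class names and links are invented).

```r
library(XML)

# A small hypothetical page standing in for a downloaded one.
page <- paste0(
  "<html><body>",
  "<div class='bd'><a href='/wiki/R'>R</a></div>",
  "<div class='ft'><a href='/wiki/XML'>XML</a></div>",
  "</body></html>"
)

# asText = TRUE says that `page` is the HTML itself, not a URL or file name.
doc <- htmlTreeParse(page, asText = TRUE, useInternalNodes = TRUE)

x1 <- getNodeSet(doc, "//div[@class = 'bd']")  # only the div with class "bd"
x2 <- getNodeSet(doc, "//a")                   # every anchor node

n_bd    <- length(x1)
n_links <- length(x2)

# getNodeSet returns a list of nodes, so sapply can pull a value out of each
link_targets <- sapply(x2, xmlGetAttr, "href")
```

Setting useInternalNodes = TRUE is what makes the XPath functions usable on the result.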

- x1 will contain the location of all the nodes named "div" that have a class attribute with value "bd"
- x2 will contain locations of ALL the anchor nodes (the nodes that contain links)! Which may or may not be useful (unless you're Google).
- xmlParent: gets location of the parent of the specified node:

    > xmlParent(x1[[1]])

- xmlChildren, xmlAncestors: get all children or ancestors of a given node.
- xmlValue: gets the text of a given node.

Page 70: R and Collecting Internet Data - University of California, Berkeley

Motivation Tools we need to learn

R Package: XML

One shortcut: if you know you want to get a table element from an HTML file

a table is a very specific HTML element!

readHTMLTable will retrieve all table nodes and transform them into a workable form for you!

So, we're done, right?!

Correct! No, not everything is stored in a table!

This is why we learned all this!
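A small sketch of the shortcut, assuming the XML package is installed. The table contents and column names are invented for illustration; header = TRUE asks readHTMLTable to use the first row as column names.

```r
library(XML)

# Hypothetical page containing a single scoring table (made-up values).
html <- '<html><body>
  <table>
    <tr><th>Quarter</th><th>Team</th><th>Points</th></tr>
    <tr><td>1</td><td>Home</td><td>14</td></tr>
    <tr><td>2</td><td>Away</td><td>10</td></tr>
  </table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# One data frame per <table> node found in the document.
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

scores <- tables[[1]]
scores$Points <- as.numeric(scores$Points)  # cells may arrive as text
```

Anything on the page that is not inside a table node still has to be reached with getNodeSet and XPath.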

Regular Expressions (regex)

Provide a framework for finding specific text in a body of text, or for matching against a list of possible texts.

We can specify the string we are looking for with the following operators:

OR operator - | - ex: "gray|grey" finds all instances of either "grey" or "gray"
Grouping (parentheses) - ( ) - ex: "gr(a|e)y". Same statement as above
Quantifiers (number of), given after the character or group:

  question mark - ? - zero or one - ex: "colou?r" finds "color" or "colour"
  asterisk - * - zero or more - ex: "ab*c" finds "ac", "abc", "abbc", ...
  plus sign - + - one or more - ex: "ab+c" finds all from above except "ac"
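The operators above can be tried directly in base R. This is a sketch using grepl (the logical-valued relative of grep); the word list is made up, and the ^ and $ anchors are an addition not covered on the slide, used here so that the whole word must match.

```r
# Hypothetical candidate strings.
words <- c("gray", "grey", "griy", "color", "colour", "ac", "abc", "abbc")

either  <- grepl("gray|grey", words)  # OR operator
grouped <- grepl("gr(a|e)y", words)   # grouping, same matches as above

optional  <- words[grepl("^colou?r$", words)]  # "u" zero or one time
zero_more <- words[grepl("^ab*c$", words)]     # "b" zero or more times
one_more  <- words[grepl("^ab+c$", words)]     # "b" one or more times
```

Without the anchors, a pattern matches anywhere inside a string, so "ab*c" would also match a longer word that merely contains "ac".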

POSIX Expressions:

. - matches ANY single character - ex: "a.c" finds "aac", "abc", "acc", ...
[ ] - matches a single character from the brackets - ex: "gr[ea]y" finds "grey" or "gray"

These are different from parentheses since ranges can be specified in square brackets: a-z means any lowercase letter.

There are a ton more. See reference.
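A short base R sketch of the dot, bracket and range forms. The candidate strings are invented, and the ^ and $ anchors are again an addition so that only whole strings match.

```r
# Hypothetical candidate strings.
candidates <- c("aac", "abc", "acc", "ac", "a1c", "grey", "gray", "groy")

dot_hits     <- candidates[grepl("^a.c$", candidates)]      # any one character in the middle
bracket_hits <- candidates[grepl("^gr[ea]y$", candidates)]  # "e" or "a" in the middle
range_hits   <- candidates[grepl("^a[a-z]c$", candidates)]  # one lowercase letter in the middle

# Ranges also work for digits: drop every digit from a string.
no_digits <- gsub("[0-9]", "", "abc-123 xyz")
```

The dot accepts "a1c" while the range a-z does not, which is the practical difference between "any character" and "any lowercase letter".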

Regular Expressions in R are used in the following functions:

grep finds a pattern in a list of candidates:
> grep("abc", c("abcdsf", "fabcasda", "cba"))
[1] 1 2

gsub replaces a pattern with another in a list of candidates:
> gsub("abc", "CANDY", c("abcdsf", "fabca", "cba"))
[1] "CANDYdsf" "fCANDYa" "cba"

strsplit splits each candidate at the matches of a pattern:
> strsplit(c("abcdsf", "fabcasda", "cba"), "abc")
[[1]]
[1] "" "dsf"

[[2]]
[1] "f" "asda"

[[3]]
[1] "cba"
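The three functions become more useful once the pattern is a real regular expression rather than a fixed string. The sketch below uses made-up play-by-play style lines with an assumed ";" field separator; the "\\1" backreference in gsub (which reuses the text matched by the parenthesised group) is an addition not covered on the slides.

```r
# Hypothetical play-by-play lines: clock ; down and distance ; description.
plays <- c("15:00 ; 1st and 10 ; pass complete for 9 yards",
           "14:21 ; 2nd and 1 ; rush for 3 yards",
           "13:40 ; 1st and 10 ; pass incomplete")

# Split on a semicolon with any amount of surrounding spaces.
fields <- strsplit(plays, " *; *")
clock  <- sapply(fields, `[`, 1)
desc   <- sapply(fields, `[`, 3)

# Which descriptions report a gain of some number of yards?
gain_idx <- grep("for [0-9]+ yards", desc)

# Keep only the digits captured by the parentheses.
yards <- as.numeric(gsub(".*for ([0-9]+) yards.*", "\\1", desc[gain_idx]))
```

Text pulled out of a page with xmlValue typically ends up as strings like these, which is where grep, gsub and strsplit take over from XPath.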

Demonstration/Resources

We'll go through a quick Demo! (time permitting)

Resources:

R: http://cran.r-project.org/
R::XML: http://cran.r-project.org/web/packages/XML/index.html
Duncan Temple Lang: http://www.stat.ucdavis.edu/~duncan/
XPath Tutorial: http://www.w3schools.com/xpath/default.asp
RegEx: http://www.regular-expressions.info/reference.html, Wikipedia
This Presentation: http://www.stat.berkeley.edu/~luis/
