Data from the Web stat 579 Heike Hofmann

Outline

• HTML & XML source code

• package XML

Data on Websites

• Midterm elections 2014, House: http://elections.nytimes.com/2014/results/house

• Health data from a government website: http://205.207.175.93/HDI/TableViewer/tableView.aspx?ReportId=74

• Weather updates, e.g. www.kcci.com

• Weather data, e.g. www.wunderground.com/history/airport/KAMW/2013/11/13/DailyHistory.html? and www.wunderground.com/history/airport/KAMW/2014/11/12/DailyHistory.html?

• Sports data, e.g. http://www.baseball-reference.com/teams/SFG/2014.shtml

Getting Data

• Single data file: copy & paste into spreadsheet software

• Multiple data files: use a parser, e.g. an R script

R Package “XML”

• allows parsing of HTML and XML

• install.packages("XML")

• e.g. team batting statistics for the San Francisco Giants:
  url <- "http://www.baseball-reference.com/teams/SFG/2014.shtml"
  tables <- readHTMLTable(url)
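
The following is a minimal sketch of what readHTMLTable returns, assuming the XML package is installed. To stay independent of the live website, it parses a small made-up HTML page instead of a URL; the table id, player names and numbers are invented for illustration and are not real statistics.

```r
library(XML)

# Hypothetical stand-in for a downloaded page (invented values)
html <- paste0(
  "<html><body>",
  "<table id='team_batting'>",
  "<thead><tr><th>Name</th><th>AB</th></tr></thead>",
  "<tbody>",
  "<tr><td>Player A</td><td>500</td></tr>",
  "<tr><td>Player B</td><td>450</td></tr>",
  "</tbody></table>",
  "</body></html>"
)

# Parse the text, then extract every <table> as a data frame
doc <- htmlParse(html, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# The result is a list with one data frame per table on the page
batting <- tables[[1]]

# Cells arrive as text, so numeric columns need an explicit conversion
batting$AB <- as.numeric(batting$AB)
str(batting)
```

With a real page, the same call usually returns several tables, so it helps to look at length(tables) and names(tables) before picking one.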

Working with lists

• Lists are the most general data type in R: they can contain anything

• Use double brackets to access the content of one list element: content of list item i: [[i]]

• Use single brackets to subset a list (the result is another list), i.e. [i:j]; [i] is therefore a list with just one list element

• ldply(list, function) from the plyr package applies the function to each list element and combines the results into a data frame
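
The bracket rules above can be sketched as follows. The list x and the two small season data frames are made up for illustration; the ldply call only runs if the plyr package happens to be installed.

```r
# A list can mix types: a vector, a string, a data frame
x <- list(a = 1:3, b = "text", c = data.frame(u = 1:2, v = c("p", "q")))

first_content <- x[[1]]  # double brackets: the element itself (an integer vector)
first_sublist <- x[1]    # single brackets: a list holding just that element
two_items     <- x[2:3]  # single brackets with a range: a list of two elements

# Hypothetical per-season tables, e.g. as returned by readHTMLTable
seasons <- list(
  "2013" = data.frame(Name = c("Player A", "Player B"), AB = c(10, 20)),
  "2014" = data.frame(Name = "Player C", AB = 30)
)

# ldply applies the function to each element and stacks the results;
# for a named list it also adds an .id column holding the element names
if (requireNamespace("plyr", quietly = TRUE)) {
  combined <- plyr::ldply(seasons, function(d) d)
  print(combined)
}
```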

Your turn

• Find your favorite baseball team (if you don’t have a favorite, use the one that most closely matches your initials)

• Use the readHTMLTable function to get the batting statistics for the last three seasons (you might need to find the correct URL first)

• Introduce a new variable called ‘Handedness’ into the data set that has the values ‘L’, ‘R’, or ‘B’ for the batting handedness (see the website for how this is encoded).

package lubridate

• lubridate makes it easier to work with dates

• as.Date converts the most commonly used date formats in character variables to dates

• in case of ambiguity, use the parameter ‘format’: e.g. 6/3/2012 could be interpreted as the 6th of March (Europe) instead of June 3rd

• accessor functions: month(), wday(), year(), …

• comparisons: date > as.Date("2012-10-05")
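
A short sketch of the points above, using the ambiguous date from the slide. The variable names are illustrative, and the accessor calls only run if lubridate is installed.

```r
# The same string read under two conventions
us <- as.Date("6/3/2012", format = "%m/%d/%Y")  # month/day/year: June 3rd
eu <- as.Date("6/3/2012", format = "%d/%m/%Y")  # day/month/year: 6th of March

# Dates can be compared directly
after_cutoff <- us > as.Date("2012-10-05")

# Accessor functions from lubridate
if (requireNamespace("lubridate", quietly = TRUE)) {
  print(lubridate::month(us))
  print(lubridate::year(us))
  print(lubridate::wday(us, label = TRUE))
}
```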

beyond readHTMLTable?

• go directly to HTML source code

<html>
<head> <title>stat579. intro statistical computing. iastate.</title> <link rel="stylesheet" type="text/css" href="http://www.public.iastate.edu/~hofmann/style.css"></head>
<body><div id="navbar"> <p><a href="/~stat579/index.html">&#8594;home</a></p> </div> <div id="wrap">
<h1>stat 579</h1><h2 class="subhead">Introduction to Statistical Computing</h2>
<p><strong>Fall 2009</strong><br />
<p>Section A Thursday 12:10&#8211;2:00. Carver 205</p><p>Section B Tuesday 12:10&#8211;2:00. Snedecor 3121</p>
<p>Heike Hofmann, <a href="mailto:[email protected]">[email protected]</a>.<br /> Office hours by arrangement. Snedecor 2413</p><p>TA: Ai-Ling Teh, <a href="mailto:[email protected]">[email protected]</a>.<br /> Office hour Monday 2-3, Snedecor 3404.</p>
<h2>Syllabus</h2>
<p><a href="syllabus.html">Course syllabus</a> describing objectives, software used, topics and assessment.</p>
<h2>Lectures and timetable</h2>
<table width="100%"> <tr> <th width="15%">Date</th> <th>Lecture and Resources</th> <th>Homework</th> </tr>
 <tr class="topic first"><td colspan="4">R Basics &amp; Setting up the Working Environment</td></tr> <tr> <td>Aug 25/27</td> <td><a href="lectures/01-introduction.html">Numeric and Visual Summaries</a> (<a href="http://connect.extension.iastate.edu/p31588910/">movie</a> <a href="http://lasonline.iastate.edu/stat579/media/Stat_579_8-28-2009.mov">movie.mov</a>) </td> </tr>
 <tr> <td>Sep 1/3</td> <td><a href="lectures/02-working-directories.html">Working Directories, Data frames &amp; subsets, logical operators</a> (<a href="http://connect.extension.iastate.edu/p45341752/">movie</a> <a href="http://lasonline.iastate.edu/stat579/media/Stat_579_9-3-2009.mov">movie.mov</a>)</td> <td><a href="homework/week1.html">Week 2</a>, due Sep 8/10.</td> </tr>

• http://www.public.iastate.edu/~hofmann/stat579/

• HTML is text with tags, i.e. structural elements

• headers, images, links, tables: e.g. <title>Statistics 579</title>

Viewing the Source

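HTML is a tree

[Figure: an HTML document drawn as a tree. The root node (xmlRoot) branches into head, body and footer; further down, a table node has tr children, each with td children. xmlChildren steps from a node to its children; xmlValue extracts the content of a node.]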
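
The tree view can be tried out on a tiny page. This is a sketch under stated assumptions: the XML package is installed, and the HTML string is a made-up miniature loosely modelled on the course page, not the real source.

```r
library(XML)

# Hypothetical miniature page: a title plus a two-row table
html <- paste0(
  "<html><head><title>Statistics 579</title></head>",
  "<body><table>",
  "<tr><td>Aug 25/27</td><td>Numeric and Visual Summaries</td></tr>",
  "<tr><td>Sep 1/3</td><td>Working Directories</td></tr>",
  "</table></body></html>"
)
doc <- htmlParse(html, asText = TRUE)

# xmlRoot: the top node of the tree (the <html> element)
root <- xmlRoot(doc)

# xmlChildren: the nodes one level down
top_level <- names(xmlChildren(root))

# xmlValue: the text content of a node
page_title <- xmlValue(root[["head"]][["title"]])

# Collect all table rows, then read the text of each row's children (the cells)
rows  <- getNodeSet(doc, "//table//tr")
cells <- lapply(rows, function(tr) unname(sapply(xmlChildren(tr), xmlValue)))
print(cells)
```

getNodeSet takes an XPath expression, which avoids stepping through every level by hand when the nodes of interest sit deep in the tree.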

Your turn

• The website http://www.google.org/flutrends/us/#US shows the Google trends for flu cases across the US

• Read the data into R (without the help of a text editor)

• Extract the data for the last month for all states.