Big Data and Programming 4 February 2015

Upload: roxanne-carter

Post on 17-Dec-2015



Today’s Agenda
A Short Introduction to Big Data
A Big Data Project: People In Motion
Next week:
  Meet Monday here at 2:30 for ca. 60-75 minutes
  Meet Wednesday ca. 2:30-4:30 in Library 034a (north stairs, go to basement)

Data Deluge
Bit, byte, kilobyte (kB), megabyte (MB), gigabyte, terabyte, petabyte, exabyte, zettabyte...
Library of Congress = 200 terabytes
“Transferring ‘Libraries of Congress’ of Data”
IP traffic is around 667 exabytes
It’s a deluge...
“Big Data”: too large for current software to handle
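To put the two figures above on one scale, here is a small worked conversion. It assumes decimal (SI) prefixes, so 1 terabyte = 10^12 bytes and 1 exabyte = 10^18 bytes; the names `UNITS` and `to_bytes` are illustrative only.

```python
# Decimal (SI) byte units. This is an assumption: binary units (2**40 bytes
# per terabyte, and so on) would give slightly different results.
UNITS = {
    "kB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
    "ZB": 10**21,
}


def to_bytes(value, unit):
    """Convert a quantity such as (200, "TB") to a number of bytes."""
    return value * UNITS[unit]


library_of_congress = to_bytes(200, "TB")  # figure from the slide
ip_traffic = to_bytes(667, "EB")           # figure from the slide

libraries = ip_traffic / library_of_congress
print(f"{libraries:,.0f} Libraries of Congress")  # prints "3,335,000 Libraries of Congress"
```

On these figures, the IP traffic quoted is a little over three million Libraries of Congress.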

Don’t be intimidated
Not all DH sources (yet)

Big Data for History
Tools for journalists, literature scholars and others
Where does history fit in?
Graham, Milligan, & Weingart: “Will Big Data have a revolutionary impact on the epistemological foundation of history?”
Will it get us closer to the past?

Networks
A whole world of fun!

Visualization is also a whole new world
See: David McCandless, “The Beauty of Data Visualization”
What does it tell us?

New approaches: Crowdsourcing
An “online, distributed problem-solving and production model.”
Examples:
  Wikipedia
  reCAPTCHA (Luis von Ahn)
  Others...

A Database for Your Project?
Think about how you might use a database, but perhaps not too big!
Databases can be very small and still be DH-worthy
Are there public docs out there that you can digest?
Resources:
  Programming Historian
  MS Excel (spreadsheet), Access (relational database), Google Refine

People in Motion: Longitudinal Data from the Canadian Census
A Big Data Project at the University of Guelph

‘Unbiased’ links connecting individuals/households over several census years

A comprehensive infrastructure of longitudinal data

What we are working towards
[Diagram: Canadian censuses of 1851, 1871, 1881, 1891, 1901, 1906, 1911 and 1916, plus the US censuses of 1880 and 1900]

Stage 1: 1871 to 1881
[Diagram: 100% of 1871 Census and 100% of 1881 Census, joined by Automatic Linking; record counts shown: 4,277,807 and 3,601,663]
Partners and collaborators: FamilySearch (Church of Latter Day Saints), Minnesota Population Center, Université de Montréal, Université Laval/CIEQ, University of Alberta

Teaching a Computer to be a genealogist
Training with existing manually-created (True) links:
  Ontario Industrial Proprietors – 8429 links
  Logan Township – 1760 links
  St. James Church, Toronto – 232 links
  Quebec City Boys – 1403 links
Bias concerns: think of any?
[Map labels: Logan Twp, Guelph]

Attributes for Automatic Linking
Last Name – string
First Name – string
Gender – binary
Birthplace – code
Age – number
Marital status – single, married, divorced, widowed, unknown
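The attribute list above maps naturally onto a small record type. The following is only a sketch: the class name `CensusRecord`, the field names, the "M"/"F" gender values and the string birthplace code are assumptions for illustration, not the project's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# The five marital-status categories named on the slide.
MARITAL_STATUSES = {"single", "married", "divorced", "widowed", "unknown"}


@dataclass(frozen=True)
class CensusRecord:
    """One person in one census year (hypothetical layout)."""
    last_name: str
    first_name: str
    gender: str            # binary on the slide; "M"/"F" is assumed here
    birthplace: str        # a code; the real coding scheme is not given
    age: Optional[float]   # in years; None if the entry is unreadable
    marital_status: str = "unknown"

    def __post_init__(self):
        # Reject values outside the categories listed on the slide.
        if self.marital_status not in MARITAL_STATUSES:
            raise ValueError(f"unexpected marital status: {self.marital_status!r}")


# Invented example record, for illustration only.
example = CensusRecord("SMITH", "JOHN", "M", "ON", 30.0, "married")
```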

Automatic Linkage
The challenges:
1) Identify the same person
2) Deal with attribute characteristics
3) Manage computational expense
The system:

Data Cleaning and Standardization
Cleaning
  Names – remove non-alphanumeric characters; remove titles
  Age – transform non-numerical representations to corresponding numbers (e.g. 3 months)
  All attributes – deal with English/French notations (e.g. days/jours, married/mariee)
Standardization
  Birthplace codes and granularity
  Marital status
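The cleaning steps above could look roughly like the sketch below. The title list, the unit table and the French marital-status terms other than "mariee" are assumptions chosen for illustration; the project's real rules are not given in the slides. Ages are expressed in years, so "3 months" becomes 0.25.

```python
import re
import unicodedata

# Illustrative lists only; the project's actual lists are not in the slides.
TITLES = {"mr", "mrs", "miss", "dr", "rev"}
UNITS_PER_YEAR = {
    "": 1, "year": 1, "years": 1, "an": 1, "ans": 1,
    "month": 12, "months": 12, "mois": 12,
    "day": 365, "days": 365, "jour": 365, "jours": 365,
}
MARITAL_STATUS = {
    "single": "single", "celibataire": "single",
    "married": "married", "marie": "married", "mariee": "married",
    "widowed": "widowed", "veuf": "widowed", "veuve": "widowed",
    "divorced": "divorced", "divorce": "divorced", "divorcee": "divorced",
}


def strip_accents(text):
    """Replace accented letters with their unaccented base letters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_name(raw):
    """Remove non-alphanumeric characters and titles, then upper-case."""
    kept = re.sub(r"[^A-Za-z0-9 ]", "", strip_accents(raw))
    words = [w for w in kept.split() if w.lower() not in TITLES]
    return " ".join(words).upper()


def parse_age(raw):
    """Return the age in years, or None if the entry cannot be parsed."""
    match = re.fullmatch(r"(\d+)\s*([a-z]*)", strip_accents(raw).strip().lower())
    if match is None:
        return None
    per_year = UNITS_PER_YEAR.get(match.group(2))
    if per_year is None:
        return None
    return int(match.group(1)) / per_year


def standardize_marital_status(raw):
    """Map English or French notations to one of the standard categories."""
    return MARITAL_STATUS.get(strip_accents(raw).strip().lower(), "unknown")
```

For example, `clean_name("Mrs. Mary-Ann O'Brien")` gives `"MARYANN OBRIEN"`, and `parse_age("3 mois")` gives `0.25`.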

Computational Expense
Very expensive to compare all the possible pairs of records
Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census)
Run-time estimate: ((3.5M x 4M) record pairs x 2 attributes being compared) / (4M comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) = 40.5 days. (Big Data)
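The run-time estimate can be checked step by step. The sketch below simply evaluates the expression from the slide with its stated inputs; the variable names are illustrative.

```python
# Inputs as stated on the slide.
records_1871 = 3_500_000
records_1881 = 4_000_000
attributes_compared = 2
comparisons_per_second = 4_000_000

# Every 1871 record paired with every 1881 record.
record_pairs = records_1871 * records_1881

seconds = record_pairs * attributes_compared / comparisons_per_second
days = seconds / 60 / 60 / 24

print(f"{record_pairs:.1e} record pairs")  # prints "1.4e+13 record pairs"
print(f"{days:.1f} days")                  # prints "81.0 days"
```

Evaluated as written, the expression comes to about 81 days. The slide's figure of 40.5 days is what the same inputs give with one attribute comparison per pair instead of two. Either way, comparing every possible pair takes weeks to months on a single processor.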

Managing Computational Expense
Blocking
  By first letter of last name
  By birthplace
Using HPC
  Running the system on multiple processors in parallel
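Blocking means comparing only records that share a cheap-to-compute key, rather than every possible pair. The sketch below combines both keys named above (first letter of last name and birthplace) into one block key; that combination, the dictionary record layout and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict


def block_key(record):
    """Block on (first letter of last name, birthplace code)."""
    return (record["last_name"][:1].upper(), record["birthplace"])


def candidate_pairs(census_a, census_b):
    """Yield only the record pairs whose block keys are equal."""
    blocks_b = defaultdict(list)
    for rec_b in census_b:
        blocks_b[block_key(rec_b)].append(rec_b)
    for rec_a in census_a:
        for rec_b in blocks_b.get(block_key(rec_a), []):
            yield rec_a, rec_b


# Invented sample records, for illustration only.
census_1871 = [
    {"last_name": "Smith", "birthplace": "ON"},
    {"last_name": "Scott", "birthplace": "ON"},
    {"last_name": "Tremblay", "birthplace": "QC"},
]
census_1881 = [
    {"last_name": "Smyth", "birthplace": "ON"},
    {"last_name": "Tremblay", "birthplace": "QC"},
    {"last_name": "Taylor", "birthplace": "ON"},
    {"last_name": "Stewart", "birthplace": "QC"},
]

pairs = list(candidate_pairs(census_1871, census_1881))
print(len(pairs), "candidate pairs instead of", len(census_1871) * len(census_1881))
# prints "3 candidate pairs instead of 12"
```

Because each block can be processed without reference to any other, blocks are also a natural unit of work to spread across multiple processors.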

Record Comparison
Comparing strings
  String measures: first letter, “edit distance”, sound
Age
  +/- 2 years
Required exact matches
  Gender
  Birthplace
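As an illustration of these comparison rules, the sketch below computes the edit (Levenshtein) distance between two names and applies the age tolerance and exact-match requirements. It is a simple rule-based stand-in, not the project's trained linker: the name-distance threshold, the ten-year gap between censuses and the record layout are assumptions, and the sound-based measure is left out.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def is_candidate_match(rec_a, rec_b, years_between=10,
                       age_tolerance=2, max_name_distance=2):
    """Apply the comparison rules; thresholds here are assumed values."""
    # Required exact matches: gender and birthplace.
    if rec_a["gender"] != rec_b["gender"]:
        return False
    if rec_a["birthplace"] != rec_b["birthplace"]:
        return False
    # Age should have advanced by the census gap, give or take the tolerance.
    if abs((rec_b["age"] - rec_a["age"]) - years_between) > age_tolerance:
        return False
    # String measures: same first letter and a small edit distance.
    for field in ("last_name", "first_name"):
        if rec_a[field][:1] != rec_b[field][:1]:
            return False
        if edit_distance(rec_a[field], rec_b[field]) > max_name_distance:
            return False
    return True


# Invented records, for illustration only.
person_1871 = {"last_name": "SMITH", "first_name": "JOHN",
               "gender": "M", "birthplace": "ON", "age": 30}
person_1881 = {"last_name": "SMYTH", "first_name": "JOHN",
               "gender": "M", "birthplace": "ON", "age": 41}

print(edit_distance("SMITH", "SMYTH"))               # prints 1
print(is_candidate_match(person_1871, person_1881))  # prints True
```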

Linkage Results
1871-81-91-1901
Over 500,000 links…
About 20%

Coding Workshop
Go to http://www.codecademy.com/learn
Scroll down to “Goals”
Pick one of the three activities:
  Animate your Name
  About You
  Sun, Earth and Code
After 30 minutes, be prepared to present!