

Fran Berman, Data and Society, CSCI 4370/6370

Data and Society Lecture 2: Big Data Applications

1/27/17

Big data, Data and Commerce, Data and the Election


Announcements 1/27

• Reminders from last time:

– Please sign attendance sheet each time you are here (your participation grade depends partly on attendance).

– If you decide to drop the class, please let me know ([email protected]) and I will let in someone from the waiting list.

– Office Hours: Friday 1-2 (AE 218) or by appointment (send email to [email protected])

• Wednesday class on February 1 starts at 8:00 a.m.

• Op-Ed draft due February 10 – instructions in Lecture 1

• Discussion article for next week (Friday, February 3). Please read:

– “I had my DNA Picture Taken with Varying Results”, New York Times, http://www.nytimes.com/2013/12/31/science/i-had-my-dna-picture-taken-with-varying-results.html


Course schedule

Wednesday Section | Friday Lecture, First Half of Class | Friday Lecture, Second Half of Class | Assignments
January 18: NO class | January 20, L1: Class Intro + Logistics / Survey / Digital Data in the 21st Century | Presentation Model / Op-Ed Instructions |
January 25: NO class | January 27, L2: Big data applications / Data and the election; Data and Target; Discussion | 4 presentations |
February 1: 6 presentations | February 3, L3: Data and Health / PDB, Precision Medicine; Discussion | 4 presentations |
February 8: NO class | February 10, L4: Data and Science / Earthquakes, LHC; Paper Instructions | 4 presentations | Op-Ed Draft Due
February 15: 6 presentations | February 17, L5: Data Cyberinfrastructure; Discussion | 4 presentations | Op-Ed Draft Back
February 22: 6 presentations | February 24, L6: Data Stewardship and Data Preservation; Discussion | 4 presentations | Op-Ed Final Due
March 1: NO class | March 3: NO class | |
March 8: 6 presentations | March 10, L7: Data Futures – Internet of Things; Discussion | 4 presentations | Paper Draft Due
March 15: Spring Break | March 17: Spring Break | |
March 22: NO class | March 24, L8: Data rights and policy / U.S. and EU; Discussion | 4 presentations |
March 29: 6 presentations | March 31: Op-Ed Pecha-Kucha | | Paper Draft Back
April 5: NO class | April 7: NO class | |
April 12: 4 presentations | April 14: Hilary Mason Guest Lecture | 4 presentations | Final Paper Due
April 19: 4 presentations | April 21, L9: Data and Ethics; Discussion | 4 presentations |
April 26: 6 presentations | April 28: Paper Pecha-Kucha | |


Today (1/27/17)

• Lecture 2: Big Data Applications

• Discussion

• Break

• 4 Student Presentations


Lecture 2: Big Data Applications


What is big data?

• Wikipedia: “Broad term for data sets so large or complex that traditional data processing applications are inadequate.”

• McKinsey: “Datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze”

• O’Reilly Radar: “Data that exceeds the processing capacity of conventional database systems. The data that is too big, moves too fast, or doesn’t fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it.”


What does big data tell us?

• Big data is often noisy, dynamic, heterogeneous, inter-related and untrustworthy. Why do we find it useful?

– General statistics obtained from frequent patterns and correlation analysis can disclose more reliable hidden patterns and knowledge

– Interconnected big data forms large heterogeneous information networks, with which information redundancy can be explored to compensate for missing data, cross check conflicting cases, validate trustworthy relationships, disclose inherent clusters, and uncover hidden relationships and models.


Big data visualization from the Cooper Hewitt Design Museum (Thanks to Sarah Schattschneider!)

• “Flight Patterns” by Aaron Koblin: https://www.youtube.com/watch?v=ttH7sQ48n5k

• From https://collection.cooperhewitt.org/objects/68743525/: “Flight Patterns is a data visualization project that traces domestic airline traffic during a single 24-hour period over North America. Flight paths, using datasets provided by the Federal Aviation Administration, are rendered as arced trajectories. The result is a stunning visual animation that elegantly renders air traffic data as cartography.”


About Big Data [Strata]

• Value of big data: analytical use, enabling new products

• Ways that big data impacts infrastructure

– Volume: big data calls for scalable storage and a distributed approach to querying

– Velocity: big data infrastructure must adapt to the speed of the input and the need for quick analysis and turnaround. Need for stream processing technologies

– Variety: Source data often “messy”, non-homogeneous, unstructured. Infrastructure must organize and find meaning from it.


Big Data – Potential for Innovation

[Figure: diagram relating “Things You Know” and “Things You Don’t Know” to “Questions You’re Asking” and “Questions You Haven’t Thought Of”. Regions are labeled Conventional Data Analytics, Data Acquisition, Data-enabled Exploration, and BIG DATA.]

Image adapted from NIST. Original credit: Jason Kolb, Applied Data Labs; modified from the original at: www.applieddatalabs.com/content/new-reality-business-intelligence-and-big-data


How is big data useful in industry?

• (Big) data is used by virtually every industry to boost and improve production

• Big data contributing to new ways of creating value:

– Creating transparency

– Enabling experimentation to discover needs, expose variability and improve performance

– Segmenting populations to customize actions

– Replacing / supporting human decision making with automated algorithms

– Supporting new business models, products, services

• Big data becoming a competitive advantage and means of industry growth

• Big data enabling substantial growth in productivity and customer satisfaction.

• Big data enabling new insights and discoveries


Big data can mean big profits


McKinsey’s take on Big Data (circa 2011)


Capitalizing on big data – easy or hard?

Roadblocks to capturing the value of big data

• Need for data policy – privacy, security, intellectual property and liability (ownership, rights, fair use, etc.)

• Need for new and evolving systems, technologies and techniques for managing and leveraging big data

• Need for new practice, policy and infrastructure to gain/provide access to data

• Need to evolve / change industry structure and culture


Beware of too much inference from Big Data! Correlation vs. Causation

• Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together. [http://whatis.techtarget.com/definition/correlation]

• Causation, or causality, is the capacity of one variable to influence another. The first variable may bring the second into existence or may cause the incidence of the second variable to fluctuate.

• Causation is often confused with correlation, which indicates the extent to which two variables tend to increase or decrease in parallel. However, correlation by itself does not imply causation. There may be a third factor, for example, that is responsible for the fluctuations in both variables. [http://whatis.techtarget.com/definition/causation]
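The "third factor" point above can be illustrated with a small simulation. This is only a sketch: the variable names (temperature, ice cream sales, sunburn cases) and all coefficients are invented for illustration and do not come from the lecture. Neither derived variable influences the other, yet they correlate strongly because both depend on the same hidden driver.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical third factor (daily temperature) that drives both variables.
temperature = [rng.uniform(0, 35) for _ in range(1000)]

# Neither of these depends on the other; both depend only on temperature plus noise.
ice_cream_sales = [2.0 * t + rng.gauss(0, 5) for t in temperature]
sunburn_cases = [0.5 * t + rng.gauss(0, 2) for t in temperature]

r = pearson(ice_cream_sales, sunburn_cases)
print(f"correlation(ice cream sales, sunburn cases) = {r:.2f}")
```

A high value of `r` here says nothing about ice cream causing sunburn; removing the shared dependence on `temperature` from the simulation would make the correlation disappear.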

Correlations from Spurious correlations: http://www.tylervigen.com/spurious-correlations


Data-Driven Commerce


Predictive Analytics

• Retailers highly interested in the buying habits of their customers: what you like, what you need, which coupons will help draw you to their store, etc.

• Retailers also use highly sophisticated models of human behavior: buying behavior, formation of habits, etc. to help determine how to best draw customers

• Many retailers, including Target, are hiring statisticians, mathematicians and data scientists to improve the bottom line through strategic marketing


Predictive Analytics at Target

• Target develops profile of customer information for each customer

– Information indexed by a unique guest ID number: credit card information, name, email address, purchases, demographic information as available, etc.

– Information is collected by Target or bought from other sources (information available includes ethnicity, job history, magazines you read, if you’ve declared bankruptcy or gotten divorced, what kinds of topics you talk about online, etc.)

• Retailers know that at major life events, old routines fall apart and usual brand loyalties and buying habits are in flux: graduating from college, birth of a child, moving to a new area / town, etc.

• Target wanted to focus on the life event of having a child

– New parents will develop new buying routines for diapers, toys, lotion, baby food, clothes, etc.

– If Target can change the buying habits of new parents before the birth of the baby, they are pre-competitive and can win big


Marketing to Pregnant Women

• Target statistician Andrew Pole analyzed data from customers who had signed up for Target’s baby registry

• Analyses identified ~25 products that, when analyzed together, contributed to a “pregnancy prediction” score (e.g. unscented lotion, vitamin supplements, etc.). Score also estimated due date.

• Target used the pregnancy prediction score and estimated due date to identify which Target customers to send baby product coupons to, which coupons to send, and when

• Anecdote:


Minimizing the “creepiness factor”

• Behavioral research and data analysis helping drive much more in-depth predictive analytics

• Combining prediction and analysis with marketing infrastructure:

• Target had the capacity to send customers customized ad books. Once it is determined that they are potentially pregnant, seemingly random pregnancy and baby products can be included with other ads that accurately target the consumer.

• Company began to mix baby products with other things (e.g. lawn mowers, wineglasses, etc.)

• Customers found this less creepy and used the baby coupons


Personalized marketing

• Soon after the new ad campaign, Target’s “Mom and Baby” sales greatly increased; Target’s overall revenues grew from $44B in 2002 to $67B in 2010

• Similar data mining approach being used in many, many stores and businesses: department stores, Facebook, Google, etc.

• Key privacy issues remain, and your rights within the burgeoning market for data about you are yet to be sorted out.


21 Things “Big Data” Knows about You (Forbes), part 1

http://www.forbes.com/sites/bernardmarr/2016/03/08/21-scary-things-big-data-knows-about-you/#23aec89b66a7

1. Your browser knows what you’ve searched for.

2. Google also knows your age and gender — even if you never told them. They make a pretty comprehensive ads profile of you, including a list of your interests (which you can edit) to decide what kinds of ads to show you.

3. Facebook knows when your relationship is going south. Based on activities and status updates on Facebook, the company can predict (with scary accuracy) whether or not your relationship is going to last.

4. Google knows where you’ve travelled, especially if you have an Android phone.

5. And the police know where you’re driving right now — at least in the U.K., where closed circuit televisions (CCTV) are ubiquitous. Police have access to data from thousands of networked cameras across the country, which scan license plates and take photographs of each car and their driver. In the U.S., many cities have traffic cameras that can be used similarly.

6. Your phone also knows how fast you were going when you were traveling. (Be glad they don’t share that information with the police!)

7. Your phone has also probably deduced where you live and work.

8. The Internet knows where your cat lives, using the hidden metadata about the geographic location where a photo was taken, which we share when we publish photos of our cats on sites like Instagram and other social media networks.

9. Your credit card company knows what you buy. Of course your credit card company knows what you buy and where, but this has raised concerns that what you buy and where you shop might impact your credit score. They can use your purchasing data to decide if you’re a credit risk.

10. Your grocery store knows what brands you like. For every point a grocery store or pharmacy doles out, they’re collecting mountains of data about your purchasing habits and preferences. The chains are using the data to serve up personalized experiences when you visit their websites, personalized coupon offers, and more.


21 Things “Big Data” Knows about You (Forbes), part 2

http://www.forbes.com/sites/bernardmarr/2016/03/08/21-scary-things-big-data-knows-about-you/#23aec89b66a7

11. HR knows when you’re going to quit your job. An HR software company called Workday is testing out an algorithm that analyzes text in documents and can predict, from that information, which employees are likely to leave the company.

12. Target knows if you’re pregnant. (Sometimes even before your family does.)

13. YouTube knows what videos you’ve been watching. And even what you’ve searched for on YouTube.

14. Amazon knows what you like to read, Netflix knows what you like to watch. Even your public library knows what kinds of media you like to consume.

15. Apple and Google know what you ask Siri and Cortana.

16. Your child’s Barbie doll is also telling Mattel what she and your child talk about.

17. Police departments in some major cities, including Chicago and Kansas City, know you’re going to commit a crime – before you do it.

18. Your auto insurance company knows when and where you drive – and they may penalize you for it, even if you’ve never filed a claim.

19. Data brokers can help unscrupulous companies identify vulnerable consumers. For example, they may identify a population as a “credit-crunched city family” and then direct advertisements at you for payday loans.

20. Facebook knows how intelligent you are, how satisfied you are with your life, and whether you are emotionally stable or not – simply based on a big data analysis of the ‘likes’ you have clicked.

21. Your apps may have access to a lot of your personal data. Angry Birds gets access to your contact list in your phone and your physical location. Bejeweled wants to know your phone number. Some apps even access your microphone to record what’s going on around you while you use them.


Big Data and the 2016 Election


What Happened? Why were most predictions off?

• Were the models wrong?

• Was the data wrong?

• Were the samples wrong?

• Were the interpretations wrong?

• Was voter behavior just one of the low probability outcomes?

Map from http://www.270towin.com/2016_Election/interactive_map


How people voted: Exit Polls and Election Results

Characteristic: Breakdown(s)

• Age
– 18-29: 55% for Clinton, 37% for Trump
– 30-44: 50% for Clinton, 42% for Trump
– 45+: 53% for Trump
– 45-64: 44% for Clinton
– 65+: 45% for Clinton

• Gender
– 54% of women voted for Clinton; 42% of women voted for Trump
– 53% of men voted for Trump; 41% of men voted for Clinton

• Ethnicity
– White voters: 58% for Trump, 37% for Clinton
– Black voters: 88% for Clinton; 8% for Trump
– Hispanic and Asian voters: 65% for Clinton; 29% for Trump

• Education
– College grads: 49% for Clinton
– Postgrads: 58% for Clinton
– High school or less: 51% for Trump
– Some college / Associate degree: 52% for Trump

• Religion
– Catholic: 52% for Trump
– Protestants / Christians: 58% for Trump
– Jewish: 71% for Clinton
– Other: 62% for Clinton
– No religion: 68% for Clinton

• Income
– Under $30K: 53% for Clinton
– $30K-$49.99K: 51% for Clinton
– $50K-$99.99K: 50% for Trump
– $100K-$199.99K: 48% for Trump
– $250K+: 48% for Trump

• Locale (Urban vs. Rural)
– Cities with > 50K residents: 59% for Clinton, 35% for Trump
– Rural areas: 62% for Trump, 34% for Clinton
– Suburbs: 50% for Trump, 45% for Clinton


Voter stats

All Americans: 320,000,000+

Voting age population: 251,107,404 (78.5%)

Eligible voters: 231,556,622 (72.4%)

Registered voters: ~200,000,000 (62.5%)

Voters: 138,884,643 (43.4%)

Statistics from: http://heavy.com/news/2016/11/eligible-voter-turnout-for-2016-data-hillary-clinton-donald-trump-republican-democrat-popular-vote-registered-results/
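The percentages on this slide are all shares of the total population. A short sketch reproduces them from the slide's own counts and also computes turnout among eligible voters, a ratio the slide does not state. The 320,000,000 base and the 200,000,000 registered-voter count are the slide's rounded figures, so results are approximate.

```python
# Figures as given on the slide (population and registered voters are rounded).
population = 320_000_000
counts = {
    "Voting age population": 251_107_404,
    "Eligible voters": 231_556_622,
    "Registered voters": 200_000_000,
    "Voters": 138_884_643,
}

def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

# Shares of the whole population, as shown on the slide.
for label, n in counts.items():
    print(f"{label}: {share(n, population):.1f}% of all Americans")

# Turnout relative to people who could actually vote (not stated on the slide).
turnout_of_eligible = share(counts["Voters"], counts["Eligible voters"])
print(f"Turnout among eligible voters: {turnout_of_eligible:.1f}%")
```

Measured against eligible voters rather than all Americans, turnout comes to roughly 60%, which shows how much the choice of denominator changes the headline number.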


What Happened? Why were most predictions off?

• Were the models wrong?

• Were the interpretations wrong?

• Was the data wrong?

• Were the samples wrong?

• Was voter behavior just one of the low probability outcomes?

Map from http://www.270towin.com/2016_Election/interactive_map


Model and Interpretation Accuracy

Many challenges in modeling and interpretation:

• Raw polling data supplemented by estimates on how many people will vote and what undecided voters will do

• Historical inferences about past patterns of turnout, demographics, economic conditions and party loyalty may not be accurate for present day

• If a poll shows that a candidate “wins” by a small margin that is within the margin of error, it is risky to interpret this as a “win”
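The margin-of-error caution can be made concrete with the standard normal approximation for a simple random sample. This is a textbook sketch, not the method of any pollster named in the lecture; the poll numbers (1,000 respondents, 47% vs. 45%) are hypothetical.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a share p from a simple random sample of size n."""
    return z * math.sqrt(p * (1.0 - p) / n)

def lead_is_within_error(p_a, p_b, n, z=1.96):
    """True if the gap between two candidates in one poll is smaller than the
    approximate 95% margin of error of that gap."""
    diff = p_a - p_b
    # Standard error of the difference of two shares drawn from the same sample.
    se_diff = math.sqrt((p_a + p_b - diff ** 2) / n)
    return abs(diff) < z * se_diff

# Hypothetical poll: 1,000 respondents, candidate A at 47%, candidate B at 45%.
moe = margin_of_error(0.47, 1000)
too_close = lead_is_within_error(0.47, 0.45, 1000)
print(f"margin of error on A's share: +/- {100 * moe:.1f} points")
print(f"2-point lead within the margin of error: {too_close}")
```

Note that the uncertainty on the gap between two candidates is larger than the margin of error usually quoted for a single candidate's share, which makes small leads even less conclusive.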

From: https://projects.fivethirtyeight.com/2016-election-forecast/

From: http://www.latimes.com/politics/la-na-pol-usc-latimes-poll-20161108-story.html


Data Integrity – Was poll data accurate?

• Many suspected that people lied about voting for Trump

• Trafalgar Group’s approach to improving data accuracy -- Adjust numbers to account for people’s hesitance to admit a Trump vote

– Used automated (robotic) calls, with which Trump voters seemed more comfortable

– Added a “neighbor” question -- Who do you think your neighbors will vote for? – and checked to see if the numbers were different

– Created a demographic of people who had not voted in 6+ years but planned to vote for Trump

• Trafalgar predicted a Trump win in Pennsylvania and Michigan (but not all states)


Sampling Accuracy

Figure from http://www.forbes.com/sites/startswithabang/2016/11/09/the-science-of-error-how-polling-botched-the-2016-election/#75748a437da8

Key sampling questions

• How representative is the sample of the population?

• How biased are the sampling vehicles – land lines, human interviews, tweets, non-self screening respondents, etc.?

• How representative is the sample of turnout? For eligible voters? For eligible voters who actually vote?

• How accurate is the data (are people lying)?

• How big is the sample / what is the margin of error?
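The question about biased sampling vehicles can be sketched with simple arithmetic: if one group of voters is easier to reach than another, the poll's expected result drifts away from the true population value, no matter how large the sample is. All numbers below are hypothetical and chosen only to make the effect visible.

```python
def expected_poll_share(groups):
    """Expected share for candidate A among respondents when groups respond at different rates.

    Each group is a tuple: (population_fraction, support_for_A, response_rate).
    """
    responders = sum(frac * rate for frac, _, rate in groups)
    supporters = sum(frac * support * rate for frac, support, rate in groups)
    return supporters / responders

def true_share(groups):
    """Share for candidate A in the whole population."""
    return sum(frac * support for frac, support, _ in groups)

# Hypothetical electorate (illustrative numbers, not from any real poll).
groups = [
    (0.5, 0.60, 0.10),  # group that the sampling vehicle reaches easily
    (0.5, 0.40, 0.05),  # group that responds half as often
]

polled = expected_poll_share(groups)
actual = true_share(groups)
print(f"expected poll result: {100 * polled:.1f}% for A")
print(f"true population share: {100 * actual:.1f}% for A")
```

In this made-up case the race is an exact tie, but the poll is expected to show candidate A ahead by several points, and collecting more responses through the same vehicle would not fix it.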


Interpretations: What happened at 538?

• Nate Silver: Statistician who founded and runs FiveThirtyEight (“538”), formerly hosted by the NY Times

• 538 model consistently favored Clinton, but near the election the model with adjusted polls increasingly favored a Trump win

• Model used derived “unskewed” polling data:

– Adjusted polls subjected to set of assumptions, run through a regression analysis to produce adjusted numbers

• With this approach, 538 predicted that Trump would win in Florida and Clinton would win the election

“We strongly disagree with the idea that there was a massive polling error. Instead, there was a modest polling error, well in line with historical polling errors, but even a modest error was enough to provide for plenty of paths to victory for Trump. We think people should have been better prepared for it. There was widespread complacency about Clinton’s chances in a way that wasn’t justified by a careful analysis of the data and the uncertainties surrounding it.”

Nate Silver


Who made correct predictions?

• Investor’s Business Daily (IBD/TIPP poll)

Predicted: Trump would win by 1.6%

Approach:

– Start with a random sample from the public; adjust for census statistics and age, gender and religion; look at registered and likely voters; adjust for party registration and enthusiasm

– Poll made more calls to smartphones than landlines.

– Respondents represented a wide range of people in the country (including a representative sample of types of phones used)

– Poll questioned respondents about enthusiasm and factored this into results

• USC / LA Times poll (USC economics prof Arie Kapteyn)

Predicted: Trump would win by 3%

Approach:

– Pollsters sought to balance both big groups (e.g. men and women) and smaller groups (e.g. young minority voters)

– Weighting of responses in polls used to make them more fully representative. Sample includes representation of demographic statistics including race, gender, age.
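The weighting idea mentioned for these polls can be sketched as simple post-stratification: compute support within each demographic group, then combine the groups in proportion to their share of the target population rather than their share of the sample. This is a generic illustration, not the actual procedure of IBD/TIPP or USC / LA Times; the groups "A" and "B", the sample, and the population shares are all hypothetical.

```python
def weighted_share(sample, population_shares):
    """Post-stratified estimate of support for a candidate.

    sample: list of (group, supports_candidate) tuples, one per respondent.
    population_shares: dict mapping group -> fraction of the target population.
    Assumes every group in population_shares appears in the sample.
    """
    counts, supporters = {}, {}
    for group, supports in sample:
        counts[group] = counts.get(group, 0) + 1
        supporters[group] = supporters.get(group, 0) + (1 if supports else 0)
    # Weight each group's observed support rate by its population share.
    return sum(population_shares[g] * supporters[g] / counts[g] for g in counts)

# Hypothetical sample of 100 that over-represents group "A" (80 respondents).
sample = (
    [("A", True)] * 48 + [("A", False)] * 32
    + [("B", True)] * 6 + [("B", False)] * 14
)

raw = sum(1 for _, supports in sample if supports) / len(sample)
weighted = weighted_share(sample, {"A": 0.5, "B": 0.5})
print(f"unweighted support: {100 * raw:.0f}%")
print(f"weighted support:   {100 * weighted:.0f}%")
```

The unweighted result suggests a majority, while the weighted estimate falls below 50% once the under-sampled group counts for its real share. The catch is that small groups carry large weights, so a handful of unusual respondents can move the result noticeably.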


Who made correct predictions?

• PrimaryModel.com / Helmut Norpoth (Stony Brook University political science professor)

Predicted: Trump would win against Clinton with 87% certainty.

– Predicted last 5 presidential elections correctly; predicted the results of every presidential election except 1 in the last 104 years.

Approach:

– Uses primaries rather than polls to predict outcomes.

– Takes “swing of the electoral pendulum” into consideration (Republicans favored after two Democratic terms)

• Allan Lichtman / American University historian

Predicted: Trump wins

Approach:

– Developed 13 T/F keys that predict election outcome. True favors incumbent party. If 6+ are false, change is predicted.

– Has worked in every election for the last 30 years.

• Lichtman’s Keys:

1. Party Mandate: After the midterm elections, the incumbent party holds more seats in the U.S. House of Representatives than after the previous midterm elections.

2. Contest: There is no serious contest for the incumbent party nomination.

3. Incumbency: The incumbent party candidate is the sitting president.

4. Third party: There is no significant third party or independent campaign.

5. Short-term economy: The economy is not in recession during the election campaign.

6. Long-term economy: Real per capita economic growth during the term equals or exceeds mean growth during the previous two terms.

7. Policy change: The incumbent administration effects major changes in national policy.

8. Social unrest: There is no sustained social unrest during the term.

9. Scandal: The incumbent administration is untainted by major scandal.

10. Foreign/military failure: The incumbent administration suffers no major failure in foreign or military affairs.

11. Foreign/military success: The incumbent administration achieves a major success in foreign or military affairs.

12. Incumbent charisma: The incumbent party candidate is charismatic or a national hero.

13. Challenger charisma: The challenging party candidate is not charismatic or a national hero.
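The decision rule stated on this slide (True favors the incumbent party; six or more False keys predict change) is simple enough to write down directly. The sketch below encodes only that rule. The example answers are placeholders and are not Lichtman's actual assessments for any election.

```python
def predict_from_keys(keys):
    """Apply the slide's rule to 13 true/false keys.

    True favors the incumbent party; six or more False keys predict a change of party.
    """
    if len(keys) != 13:
        raise ValueError("expected 13 true/false keys")
    false_count = sum(1 for key in keys if not key)
    return "change" if false_count >= 6 else "incumbent party"

# Hypothetical answers for illustration only (7 True, 6 False).
example = [True] * 7 + [False] * 6
outcome = predict_from_keys(example)
print(outcome)  # six False keys reach the threshold, so this prints "change"
```

Unlike the poll-based approaches above, this rule uses no survey data at all; every input is a judgment call about the incumbent party's term, which is where bias can enter.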


Bottom Line

• Big data is a great tool but it is not a guarantee of outcomes

• Predictions and estimations are not a guarantee of outcomes

• Garbage in, garbage out

• Bias can be built into the system at many places: data collection, data sampling, models, interpretation, etc.

• There is no magic bullet …


Lecture 2 Sources 1

• “Big data: The next frontier for innovation, competition and productivity”, Report from the McKinsey Global Institute, http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation

• “What is big data?” O’Reilly Radar, http://radar.oreilly.com/2012/01/what-is-big-data.html

• “How Target figured out a teen girl was pregnant before her father did,” Forbes, http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/#5df063ed34c6

• “How Companies Learn your secrets”, The New York Times, http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewanted=all&_r=0

• “Exit polls and election results – what we learned”, The Guardian, https://www.theguardian.com/us-news/2016/nov/12/exit-polls-election-results-what-we-learned

• “The Science of Error: How Polling Botched The 2016 Election”, Forbes, http://www.forbes.com/sites/startswithabang/2016/11/09/the-science-of-error-how-polling-botched-the-2016-election/#75748a437da8

• “The trouble is not with polling but with the limits to human interpretation of data,” Quartz, http://qz.com/832908/confirmation-bias-is-why-we-couldnt-predict-a-trump-victory/

Lecture 2 Sources 2

• “There are Many Ways to Map Election Results. We’ve Tried Most of Them.”, NY Times, http://www.nytimes.com/interactive/2016/11/01/upshot/many-ways-to-map-election-results.html?_r=0

• “Trump’s win isn’t the death of data – it was flawed all along,” Wired, https://www.wired.com/2016/11/trumps-win-isnt-death-data-flawed-along/

• “2016 Election Oracles: These People Predicted Trump Would Win”, Heavy, http://heavy.com/news/2016/11/2016-final-election-results-predictions-helmut-norpoth-abramowitz-michael-moore-nate-silver-vote-count-turn-out-electoral-college-maps-donald-trump-hillary-clinton-polls-forecasting-pennsylvania-michi/

• “No, one 19-year-old Trump supporter probably isn’t distorting the polling averages all himself”, LA Times, http://www.latimes.com/politics/la-na-pol-daybreak-poll-questions-20161013-snap-story.html

• “How IBD Accurately Gauged Voter Enthusiasm and Got the Polls Right,” Townhall, http://townhall.com/tipsheet/cortneyobrien/2016/11/11/how-ibd-got-the-polls-right-n2244109

• “Why 538 Gave Trump a Better Chance than Almost Anyone Else”, FiveThirtyEight, http://fivethirtyeight.com/features/why-fivethirtyeight-gave-trump-a-better-chance-than-almost-anyone-else/

Limitations of big data (from “Eight (No, Nine!) Problems with Big Data”, NY Times):

1. “… although big data is very good at detecting correlations, …, it never tells us which correlations are meaningful”

2. “… big data can work well as an adjunct to scientific inquiry but rarely succeeds as a wholesale replacement.”

3. “… many tools that are based on big data can be easily gamed.”

4. “… even when the results of a big data analysis aren’t intentionally gamed, they often turn out to be less robust than they initially seem.”

5. “… whenever the source of information for a big data analysis is itself a product of big data, opportunities for vicious cycles abound [echo chamber effect].”

6. “… risk of too many correlations.”

7. “… big data is prone to giving scientific-sounding solutions to hopelessly imprecise questions.”

8. “… big data is at its best when analyzing things that are extremely common, but often falls short when analyzing things that are less common.”

9. “… the hype.”
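
To make item 6 concrete, here is a minimal sketch of how strong-looking correlations appear by chance when many variables are compared. The data are purely random and unrelated by construction; the sizes (50 series of 20 points), the 0.5 threshold, and the function names are assumptions chosen for illustration.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def count_chance_correlations(num_series=50, length=20, threshold=0.5, seed=0):
    """Count pairs of independent random series whose |r| exceeds threshold."""
    rng = random.Random(seed)
    # Every series is independent noise, so any correlation found is spurious.
    series = [[rng.gauss(0, 1) for _ in range(length)] for _ in range(num_series)]
    pairs = list(combinations(series, 2))
    strong = sum(1 for a, b in pairs if abs(pearson(a, b)) > threshold)
    return strong, len(pairs)


strong, total = count_chance_correlations()
print(f"{strong} of {total} pairs of unrelated series have |r| > 0.5")
```

The more variables a data set contains, the more pairs there are to compare, so the count of chance correlations grows even though none of them is meaningful.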

Break

Presentation Articles for February 1

• February 1:

– “Racial Bias in Everything: Airbnb Edition”, Washington Post, https://www.washingtonpost.com/news/wonk/wp/2015/12/12/racial-bias-in-everything-airbnb-edition/?utm_term=.4fcc0ce20771 [Yarden N]

– “Google Flu Trends: The Limits of Big Data”, New York Times, http://bits.blogs.nytimes.com/2014/03/28/google-flu-trends-the-limits-of-big-data/ [Mary L]

– “Data the Driving Force behind AC50 Designs”, Royal Gazette, http://www.royalgazette.com/oracle-team-usa/article/20170107/data-driving-force-behind-ac50-designs [Joe C]

– “Falcons, Drones, Data: A Winery Battles Climate Change”, NY Times, http://www.nytimes.com/2017/01/05/business/california-wine-climate-change.html [Eryka G]

– “Uber’s Mildly Helpful Data Could Help Cities Fix Streets”, Wired, https://www.wired.com/2017/01/uber-movement-traffic-data-tool/ [David K]

– “The Shazam Effect”, The Atlantic, http://www.theatlantic.com/magazine/archive/2014/12/the-shazam-effect/382237/ [Rob R]

Presentation Articles for February 3

• February 3:

– “The 21st Century Cures Act: FDA Reforms Aim to Spur Innovation in the Pharmaceutical, Medical Device and Health Research Sectors”, Lexology, http://www.lexology.com/library/detail.aspx?g=fa622c15-f2c4-4397-9d64-6cd341fcaf3f [Andrea L]

– “Four Steps to Precision Public Health”, Nature, http://www.nature.com/news/four-steps-to-precision-public-health-1.21089 [Kusuma B]

– “Can IBM’s Watson do it all?”, Fast Company, https://www.fastcompany.com/3065339/mind-and-machine/can-ibms-watson-do-it-all [Erica B]

– “mHealth’s Year in Review: From Texting to Wearables to Telehealth’s Tricks (and Treats)”, mHealth Intelligence, http://mhealthintelligence.com/news/mhealths-year-in-review-from-texting-to-wearables-to-telehealths-tricks-and [Tim T]

Presentation Articles for February 10

• February 10:

– “Crowdsourcing: For the Birds”, NY Times, http://www.nytimes.com/2013/08/20/science/earth/crowdsourcing-for-the-birds.html?pagewanted=2&contentCollection=Science&action=click&region=EndOfArticle&module=RelatedCoverage&pgtype=article [Dan S]

– “Digital Keys for Unlocking the Humanities’ Riches”, New York Times, http://www.nytimes.com/2010/11/17/arts/17digital.html?pagewanted=all&_r=0 [Bobby M]

– “African Elephant Numbers Plummet 30 Percent, Landmark Study Finds,” National Geographic, http://news.nationalgeographic.com/2016/08/wildlife-african-elephants-population-decrease-great-elephant-census/ [Deborah A]

– “Astronomers Characterize Wolf 1061 Planetary System”, Sci News, http://www.sci-news.com/astronomy/wolf-1061-planetary-system-04552.html [Eric L]

Class next Wednesday. Class next Friday.

• Next Friday: Data and Health; Discussion

• Read for February 3 Discussion:

– “I had my DNA Picture Taken with Varying Results”, New York Times, http://www.nytimes.com/2013/12/31/science/i-had-my-dna-picture-taken-with-varying-results.html

Presentation Articles for January 27

• January 27:

– “Giving Viewers What They Want,” NY Times, http://www.nytimes.com/2013/02/25/business/media/for-house-of-cards-using-big-data-to-guarantee-its-popularity.html?pagewanted=all&_r=1& [Molly R]

– “How Big Data is going to help feed 9 billion people by 2050”, TechRepublic, http://www.techrepublic.com/article/how-big-data-is-going-to-help-feed-9-billion-people-by-2050/ [Max L]

– “How Pro Sports Teams are Using Big Data to Draft Better Players”, Financial Post, http://business.financialpost.com/executive/c-suite/pro-sports-teams-turning-to-data-anlaytics-to-fill-seats?__lsa=88c9-3dab [Rob B]

– “Where you live can have a lot to say about your health,” Washington Post, https://www.washingtonpost.com/national/health-science/where-you-live-can-have-a-lot-to-say-about-your-health/2016/12/30/6d94c510-cc73-11e6-a747-d03044780a02_story.html?utm_term=.689dd5bb5168 [Harrison L]