statistical paradises and paradoxes in big data (i): law ... · menu 1 xiao-li meng department of...

Post on 28-Jul-2020

3 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Menu 1

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Statistical Paradises and Paradoxes in Big Data (I):

Law of Large Populations, Big DataParadox, and the 2016 US Election

Xiao-Li MengDepartment of Statistics, Harvard University

Meng (2018). Annals of Applied Statistics, No. 2, 685-726.https://statistics.fas.harvard.edu/people/xiao-li-meng

Many thanks to Stephen Ansolabehere and ShiroKuriwaki for the CCES (Cooperative CongressionalElection Study) data and analysis on 2016 US election.

Menu 1

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Statistical Paradises and Paradoxes in Big Data (I):

Law of Large Populations, Big DataParadox, and the 2016 US Election

Xiao-Li MengDepartment of Statistics, Harvard University

Meng (2018). Annals of Applied Statistics, No. 2, 685-726.https://statistics.fas.harvard.edu/people/xiao-li-meng

Many thanks to Stephen Ansolabehere and ShiroKuriwaki for the CCES (Cooperative CongressionalElection Study) data and analysis on 2016 US election.

Menu 1

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Statistical Paradises and Paradoxes in Big Data (I):

Law of Large Populations, Big DataParadox, and the 2016 US Election

Xiao-Li MengDepartment of Statistics, Harvard University

Meng (2018). Annals of Applied Statistics, No. 2, 685-726.https://statistics.fas.harvard.edu/people/xiao-li-meng

Many thanks to Stephen Ansolabehere and ShiroKuriwaki for the CCES (Cooperative CongressionalElection Study) data and analysis on 2016 US election.

Menu 2

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Motivating questions

We know that a 5% random sample is better than a 5%non-random sample in measurable ways (e.g., bias,predictive power).

But is an 80% non-random sample “better” than a5% random sample in measurable terms? 90%?95%? 99%? (Jeremy Wu of US Census Bureau, 2012,Seminar at Harvard Statistics)

“Which one should we trust more: a 1% survey with60% response rate or a non-probabilistic datasetcovering 80% of the population?” (Keiding and Louis,2015, Joint Statistical Meetings; and JRSSB, 2016)

Menu 2

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Motivating questions

We know that a 5% random sample is better than a 5%non-random sample in measurable ways (e.g., bias,predictive power).

But is an 80% non-random sample “better” than a5% random sample in measurable terms? 90%?95%? 99%? (Jeremy Wu of US Census Bureau, 2012,Seminar at Harvard Statistics)

“Which one should we trust more: a 1% survey with60% response rate or a non-probabilistic datasetcovering 80% of the population?” (Keiding and Louis,2015, Joint Statistical Meetings; and JRSSB, 2016)

Menu 2

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Motivating questions

We know that a 5% random sample is better than a 5%non-random sample in measurable ways (e.g., bias,predictive power).

But is an 80% non-random sample “better” than a5% random sample in measurable terms? 90%?95%? 99%? (Jeremy Wu of US Census Bureau, 2012,Seminar at Harvard Statistics)

“Which one should we trust more: a 1% survey with60% response rate or a non-probabilistic datasetcovering 80% of the population?” (Keiding and Louis,2015, Joint Statistical Meetings; and JRSSB, 2016)

Menu 3

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Bit of History: Theory and Practice

Law of Large Numbers:Jakob Bernoulli (1713)

Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√

n: n − sample size

Survey Sampling:

Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway

Landmark paper: JerzyNeyman (1934)

The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)

First implementation inUS Census: 1940 led byMorris Hansen

Menu 3

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Bit of History: Theory and Practice

Law of Large Numbers:Jakob Bernoulli (1713)

Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√

n: n − sample size

Survey Sampling:

Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway

Landmark paper: JerzyNeyman (1934)

The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)

First implementation inUS Census: 1940 led byMorris Hansen

Menu 3

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Bit of History: Theory and Practice

Law of Large Numbers:Jakob Bernoulli (1713)

Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√

n: n − sample size

Survey Sampling:

Graunt (1662); Laplace (1882)

The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway

Landmark paper: JerzyNeyman (1934)

The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)

First implementation inUS Census: 1940 led byMorris Hansen

Menu 3

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Bit of History: Theory and Practice

Law of Large Numbers:Jakob Bernoulli (1713)

Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√

n: n − sample size

Survey Sampling:

Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway

Landmark paper: JerzyNeyman (1934)

The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)

First implementation inUS Census: 1940 led byMorris Hansen

Menu 3

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Bit of History: Theory and Practice

Law of Large Numbers:Jakob Bernoulli (1713)

Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√

n: n − sample size

Survey Sampling:

Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway

Landmark paper: JerzyNeyman (1934)

The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)

First implementation inUS Census: 1940 led byMorris Hansen

Menu 3

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Bit of History: Theory and Practice

Law of Large Numbers:Jakob Bernoulli (1713)

Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√

n: n − sample size

Survey Sampling:

Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway

Landmark paper: JerzyNeyman (1934)

The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)

First implementation inUS Census: 1940 led byMorris Hansen

Menu 3

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Bit of History: Theory and Practice

Law of Large Numbers:Jakob Bernoulli (1713)

Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√

n: n − sample size

Survey Sampling:

Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway

Landmark paper: JerzyNeyman (1934)

The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)

First implementation inUS Census: 1940 led byMorris Hansen

Menu 3

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Bit of History: Theory and Practice

Law of Large Numbers:Jakob Bernoulli (1713)

Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√

n: n − sample size

Survey Sampling:

Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway

Landmark paper: JerzyNeyman (1934)

The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)

First implementation inUS Census: 1940 led byMorris Hansen

Menu 3

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Bit of History: Theory and Practice

Law of Large Numbers:Jakob Bernoulli (1713)

Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√

n: n − sample size

Survey Sampling:

Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway

Landmark paper: JerzyNeyman (1934)

The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)

First implementation inUS Census: 1940 led byMorris Hansen

Menu 4

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

When/why can we ignore the population size?

Think about tastingsoup ...

Stir it well, then afew bits aresufficient regardlessof the size of thecontainer!

⇐⇒

Menu 4

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

When/why can we ignore the population size?

Think about tastingsoup ...

Stir it well, then afew bits aresufficient regardlessof the size of thecontainer!

⇐⇒

Menu 4

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

When/why can we ignore the population size?

Think about tastingsoup ...

Stir it well, then afew bits aresufficient regardlessof the size of thecontainer!

⇐⇒

Menu 4

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

When/why can we ignore the population size?

Think about tastingsoup ...

Stir it well, then afew bits aresufficient regardlessof the size of thecontainer!

⇐⇒

Menu 4

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

When/why can we ignore the population size?

Think about tastingsoup ...

Stir it well, then afew bits aresufficient regardlessof the size of thecontainer!

⇐⇒

Menu 5

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

2016 US Presidential Election

n: number of respondents to an election survey

N: number of (actual) voters in US

Xj = 1: plan to vote for Trump; Xj = 0 otherwise

Rj = 1: report (honestly) voting plan; Rj = 0 otherwise

Estimatinng Trump’s share: µN

= Ave(Xj) by sample average:

µ̂n =R1X1 + . . .+ RNXN

n=

Ave(RjXj)

Ave(Rj)

Actual estimation error

µ̂n − µN=

Ave(RjXj)

Ave(Rj)− Ave(Xj)

=

[Ave(RjXj)− Ave(Rj)Ave(Xj)

σRσ

X

]× σ

R

Ave(Rj)× σ

X

Menu 5

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

2016 US Presidential Election

n: number of respondents to an election survey

N: number of (actual) voters in US

Xj = 1: plan to vote for Trump; Xj = 0 otherwise

Rj = 1: report (honestly) voting plan; Rj = 0 otherwise

Estimatinng Trump’s share: µN

= Ave(Xj) by sample average:

µ̂n =R1X1 + . . .+ RNXN

n=

Ave(RjXj)

Ave(Rj)

Actual estimation error

µ̂n − µN=

Ave(RjXj)

Ave(Rj)− Ave(Xj)

=

[Ave(RjXj)− Ave(Rj)Ave(Xj)

σRσ

X

]× σ

R

Ave(Rj)× σ

X

Menu 5

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

2016 US Presidential Election

n: number of respondents to an election survey

N: number of (actual) voters in US

Xj = 1: plan to vote for Trump; Xj = 0 otherwise

Rj = 1: report (honestly) voting plan; Rj = 0 otherwise

Estimatinng Trump’s share: µN

= Ave(Xj) by sample average:

µ̂n =R1X1 + . . .+ RNXN

n=

Ave(RjXj)

Ave(Rj)

Actual estimation error

µ̂n − µN=

Ave(RjXj)

Ave(Rj)− Ave(Xj)

=

[Ave(RjXj)− Ave(Rj)Ave(Xj)

σRσ

X

]× σ

R

Ave(Rj)× σ

X

Menu 5

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

2016 US Presidential Election

n: number of respondents to an election survey

N: number of (actual) voters in US

Xj = 1: plan to vote for Trump; Xj = 0 otherwise

Rj = 1: report (honestly) voting plan; Rj = 0 otherwise

Estimatinng Trump’s share: µN

= Ave(Xj) by sample average:

µ̂n =R1X1 + . . .+ RNXN

n=

Ave(RjXj)

Ave(Rj)

Actual estimation error

µ̂n − µN=

Ave(RjXj)

Ave(Rj)− Ave(Xj)

=

[Ave(RjXj)− Ave(Rj)Ave(Xj)

σRσ

X

]× σ

R

Ave(Rj)× σ

X

Menu 5

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

2016 US Presidential Election

n: number of respondents to an election survey

N: number of (actual) voters in US

Xj = 1: plan to vote for Trump; Xj = 0 otherwise

Rj = 1: report (honestly) voting plan; Rj = 0 otherwise

Estimatinng Trump’s share: µN

= Ave(Xj) by sample average:

µ̂n =R1X1 + . . .+ RNXN

n=

Ave(RjXj)

Ave(Rj)

Actual estimation error

µ̂n − µN=

Ave(RjXj)

Ave(Rj)− Ave(Xj)

=

[Ave(RjXj)− Ave(Rj)Ave(Xj)

σRσ

X

]× σ

R

Ave(Rj)× σ

X

Menu 5

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

2016 US Presidential Election

n: number of respondents to an election survey

N: number of (actual) voters in US

Xj = 1: plan to vote for Trump; Xj = 0 otherwise

Rj = 1: report (honestly) voting plan; Rj = 0 otherwise

Estimatinng Trump’s share: µN

= Ave(Xj) by sample average:

µ̂n =R1X1 + . . .+ RNXN

n=

Ave(RjXj)

Ave(Rj)

Actual estimation error

µ̂n − µN=

Ave(RjXj)

Ave(Rj)− Ave(Xj)

=

[Ave(RjXj)− Ave(Rj)Ave(Xj)

σRσ

X

]× σ

R

Ave(Rj)× σ

X

Menu 5

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

2016 US Presidential Election

n: number of respondents to an election survey

N: number of (actual) voters in US

Xj = 1: plan to vote for Trump; Xj = 0 otherwise

Rj = 1: report (honestly) voting plan; Rj = 0 otherwise

Estimatinng Trump’s share: µN

= Ave(Xj) by sample average:

µ̂n =R1X1 + . . .+ RNXN

n=

Ave(RjXj)

Ave(Rj)

Actual estimation error

µ̂n − µN=

Ave(RjXj)

Ave(Rj)− Ave(Xj)

=

[Ave(RjXj)− Ave(Rj)Ave(Xj)

σRσ

X

]× σ

R

Ave(Rj)× σ

X

Menu 6

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Data quality, quantity, and uncertainty

Because σ2R = f (1− f ), f = Ave{Rj} = n

N , we have

Error = ρ̂R,X︸︷︷︸

Data Quality

×

√N − n

n︸ ︷︷ ︸Data Quantity

× σX︸︷︷︸

Problem Difficulty

Menu 6

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Data quality, quantity, and uncertainty

Because σ2R = f (1− f ), f = Ave{Rj} = n

N , we have

Error = ρ̂R,X︸︷︷︸

Data Quality

×√

N − n

n︸ ︷︷ ︸Data Quantity

×

σX︸︷︷︸

Problem Difficulty

Menu 6

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Data quality, quantity, and uncertainty

Because σ2R = f (1− f ), f = Ave{Rj} = n

N , we have

Error = ρ̂R,X︸︷︷︸

Data Quality

×√

N − n

n︸ ︷︷ ︸Data Quantity

× σX︸︷︷︸

Problem Difficulty

Menu 7

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Data Defect Index (d.d.i.)

Mean Squared Error (MSE)

MSE(µ̂n) = ER(ρ̂2)× N − n

n× σ2

X

Data Defect Index (d.d.i): DI = ER(ρ̂2)

For Simple Random Sample (SRS): DI = (N − 1)−1

For probabilistic samples in general: DI ∝ N−1

Deep trouble when DI does not vanish with N−1;

or equivalently when ρ̂ does not vanish with N−1/2 ...

Menu 7

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Data Defect Index (d.d.i.)

Mean Squared Error (MSE)

MSE(µ̂n) = ER(ρ̂2)× N − n

n× σ2

X

Data Defect Index (d.d.i): DI = ER(ρ̂2)

For Simple Random Sample (SRS): DI = (N − 1)−1

For probabilistic samples in general: DI ∝ N−1

Deep trouble when DI does not vanish with N−1;

or equivalently when ρ̂ does not vanish with N−1/2 ...

Menu 7

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Data Defect Index (d.d.i.)

Mean Squared Error (MSE)

MSE(µ̂n) = ER(ρ̂2)× N − n

n× σ2

X

Data Defect Index (d.d.i): DI = ER(ρ̂2)

For Simple Random Sample (SRS): DI = (N − 1)−1

For probabilistic samples in general: DI ∝ N−1

Deep trouble when DI does not vanish with N−1;

or equivalently when ρ̂ does not vanish with N−1/2 ...

Menu 7

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Data Defect Index (d.d.i.)

Mean Squared Error (MSE)

MSE(µ̂n) = ER(ρ̂2)× N − n

n× σ2

X

Data Defect Index (d.d.i): DI = ER(ρ̂2)

For Simple Random Sample (SRS): DI = (N − 1)−1

For probabilistic samples in general: DI ∝ N−1

Deep trouble when DI does not vanish with N−1;

or equivalently when ρ̂ does not vanish with N−1/2 ...

Menu 7

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Data Defect Index (d.d.i.)

Mean Squared Error (MSE)

MSE(µ̂n) = ER(ρ̂2)× N − n

n× σ2

X

Data Defect Index (d.d.i): DI = ER(ρ̂2)

For Simple Random Sample (SRS): DI = (N − 1)−1

For probabilistic samples in general: DI ∝ N−1

Deep trouble when DI does not vanish with N−1;

or equivalently when ρ̂ does not vanish with N−1/2 ...

Menu 8

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Law of Large Populations (LLP)

If ρ = ER(ρ̂) 6= 0, then on average, the relative error ↑√N:

Actual Error

Benchmark SRS Standard Error=√N − 1ρ̂

The (lack-of) design effect (Deff)

Deff =MSE

Benchmark SRS MSE= (N − 1)DI

Paradigm shift for “Big Data”:

Fromσ√n︸︷︷︸

random error

to ρ̂√N︸ ︷︷ ︸

relative systemtic bias

Menu 8

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Law of Large Populations (LLP)

If ρ = ER(ρ̂) 6= 0, then on average, the relative error ↑√N:

Actual Error

Benchmark SRS Standard Error=√N − 1ρ̂

The (lack-of) design effect (Deff)

Deff =MSE

Benchmark SRS MSE= (N − 1)DI

Paradigm shift for “Big Data”:

Fromσ√n︸︷︷︸

random error

to ρ̂√N︸ ︷︷ ︸

relative systemtic bias

Menu 8

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Law of Large Populations (LLP)

If ρ = ER(ρ̂) 6= 0, then on average, the relative error ↑√N:

Actual Error

Benchmark SRS Standard Error=√N − 1ρ̂

The (lack-of) design effect (Deff)

Deff =MSE

Benchmark SRS MSE= (N − 1)DI

Paradigm shift for “Big Data”:

Fromσ√n︸︷︷︸

random error

to ρ̂√N︸ ︷︷ ︸

relative systemtic bias

Menu 9

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Effective Sample Size

The Effective Sample Size neff of a “Big Data” set

Equate its MSE to that from a SRS with size neff :

DI

[N − n

n

]σ2 =

1

N − 1

[N − neff

neff

]σ2

What matters is the relative size f = n/N

neff =n

1 + (1− f )[(N − 1)DI − 1]≈ f

1− f

1

ρ̂2.

Menu 9

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Effective Sample Size

The Effective Sample Size neff of a “Big Data” set

Equate its MSE to that from a SRS with size neff :

DI

[N − n

n

]σ2 =

1

N − 1

[N − neff

neff

]σ2

What matters is the relative size f = n/N

neff =n

1 + (1− f )[(N − 1)DI − 1]≈ f

1− f

1

ρ̂2.

Menu 9

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Effective Sample Size

The Effective Sample Size neff of a “Big Data” set

Equate its MSE to that from a SRS with size neff :

DI

[N − n

n

]σ2 =

1

N − 1

[N − neff

neff

]σ2

What matters is the relative size f = n/N

neff =n

1 + (1− f )[(N − 1)DI − 1]≈ f

1− f

1

ρ̂2.

Menu 10

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Gaining 2020 Vision: Assessing the behavioral ρ̂using validated voter counts (≈ 35, 000)

CCES: Cooperative Congressional Election Study(Conducted by Stephen Ansolabehere, Brian Schaffner, Sam Luks, Douglas Rivers

on Oct 4 - Nov 6, 2016 (YouGov); Analysis assisted by Shiro Kuriwaki)

Poll underestimated

Clinton support

Poll overestimated

Clinton support

0%

50%

100%

0% 50% 100%

Final Clinton Popular Vote Share

Val

idat

ed V

oter

Pol

l Est

imat

e,C

linto

n S

uppo

rt

Root Mean Squared Error: 0.07

Reasonable predictions forClinton’s Vote Share

Poll underestimated

Trump support

Poll overestimated

Trump support

0%

50%

100%

0% 50% 100%

Final Trump Popular Vote Share

Val

idat

ed V

oter

Pol

l Est

imat

e,Tr

ump

Sup

port

Root Mean Squared Error: 0.15

Serious underestimation ofTrump’s Vote Share

Menu 10

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Gaining 2020 Vision: Assessing the behavioral ρ̂using validated voter counts (≈ 35, 000)

CCES: Cooperative Congressional Election Study(Conducted by Stephen Ansolabehere, Brian Schaffner, Sam Luks, Douglas Rivers

on Oct 4 - Nov 6, 2016 (YouGov); Analysis assisted by Shiro Kuriwaki)

Poll underestimated

Clinton support

Poll overestimated

Clinton support

0%

50%

100%

0% 50% 100%

Final Clinton Popular Vote Share

Val

idat

ed V

oter

Pol

l Est

imat

e,C

linto

n S

uppo

rt

Root Mean Squared Error: 0.07

Reasonable predictions forClinton’s Vote Share

Poll underestimated

Trump support

Poll overestimated

Trump support

0%

50%

100%

0% 50% 100%

Final Trump Popular Vote Share

Val

idat

ed V

oter

Pol

l Est

imat

e,Tr

ump

Sup

port

Root Mean Squared Error: 0.15

Serious underestimation ofTrump’s Vote Share

Menu 10

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Gaining 2020 Vision: Assessing the behavioral ρ̂using validated voter counts (≈ 35, 000)

CCES: Cooperative Congressional Election Study(Conducted by Stephen Ansolabehere, Brian Schaffner, Sam Luks, Douglas Rivers

on Oct 4 - Nov 6, 2016 (YouGov); Analysis assisted by Shiro Kuriwaki)

Poll underestimated

Clinton support

Poll overestimated

Clinton support

0%

50%

100%

0% 50% 100%

Final Clinton Popular Vote Share

Val

idat

ed V

oter

Pol

l Est

imat

e,C

linto

n S

uppo

rt

Root Mean Squared Error: 0.07

Reasonable predictions forClinton’s Vote Share

Poll underestimated

Trump support

Poll overestimated

Trump support

0%

50%

100%

0% 50% 100%

Final Trump Popular Vote Share

Val

idat

ed V

oter

Pol

l Est

imat

e,Tr

ump

Sup

port

Root Mean Squared Error: 0.15

Serious underestimation ofTrump’s Vote Share

Menu 11

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Assessing ρ̂ using state-level data

Let µN

be the true share, and µ̂n the estimated share. Then

ρ̂ =µ̂n − µN√

N−nn σ2

, & σ2 = µN

(1− µN

)

−0.00021 ± 0.00061

0

4

8

12

−0.010 −0.005 0.000 0.005 0.010

Clinton ρN

Cou

nt

Clinton: ρ̂ ≈ −0.0002± 0.0006

−0.0045 ± 0.00056

0

4

8

12

−0.010 −0.005 0.000 0.005 0.010

Trump ρN

Cou

nt

Trump: ρ̂ ≈ −0.0045± 0.0006

Menu 11

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Assessing ρ̂ using state-level data

Let µN

be the true share, and µ̂n the estimated share. Then

ρ̂ =µ̂n − µN√

N−nn σ2

, & σ2 = µN

(1− µN

)

−0.00021 ± 0.00061

0

4

8

12

−0.010 −0.005 0.000 0.005 0.010

Clinton ρN

Cou

nt

Clinton: ρ̂ ≈ −0.0002± 0.0006

−0.0045 ± 0.00056

0

4

8

12

−0.010 −0.005 0.000 0.005 0.010

Trump ρN

Cou

nt

Trump: ρ̂ ≈ −0.0045± 0.0006

Menu 11

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Assessing ρ̂ using state-level data

Let µN

be the true share, and µ̂n the estimated share. Then

ρ̂ =µ̂n − µN√

N−nn σ2

, & σ2 = µN

(1− µN

)

−0.00021 ± 0.00061

0

4

8

12

−0.010 −0.005 0.000 0.005 0.010

Clinton ρN

Cou

nt

Clinton: ρ̂ ≈ −0.0002± 0.0006

−0.0045 ± 0.00056

0

4

8

12

−0.010 −0.005 0.000 0.005 0.010

Trump ρNC

ount

Trump: ρ̂ ≈ −0.0045± 0.0006

Menu 12

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

What’s the implication of ρ̂ = −0.005?

Many (major) survey results published before Nov 8, 2016;

Roughly amounts to 1% of eligible voters: n ≈ 2, 300, 000;

Equivalent to 2,300 surveys of 1,000 respondents each.

When ρ̂ = −0.005 = −1/200,DI = 1/40000, and hence

neff =f

1− f

1

DI=

1

99× 40000 ≈ 404!

A 99.98% reduction in n, caused by ρ̂ = −0.005.

Butterfly Effect due to Law of Large Populations (LLP)

Relative Error =√

N− 1ρ̂

Menu 12

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

What’s the implication of ρ̂ = −0.005?

Many (major) survey results published before Nov 8, 2016;

Roughly amounts to 1% of eligible voters: n ≈ 2, 300, 000;

Equivalent to 2,300 surveys of 1,000 respondents each.

When ρ̂ = −0.005 = −1/200,DI = 1/40000, and hence

neff =f

1− f

1

DI=

1

99× 40000 ≈ 404!

A 99.98% reduction in n, caused by ρ̂ = −0.005.

Butterfly Effect due to Law of Large Populations (LLP)

Relative Error =√

N− 1ρ̂

Menu 12

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

What’s the implication of ρ̂ = −0.005?

Many (major) survey results published before Nov 8, 2016;

Roughly amounts to 1% of eligible voters: n ≈ 2, 300, 000;

Equivalent to 2,300 surveys of 1,000 respondents each.

When ρ̂ = −0.005 = −1/200,DI = 1/40000, and hence

neff =f

1− f

1

DI=

1

99× 40000 ≈ 404!

A 99.98% reduction in n, caused by ρ̂ = −0.005.

Butterfly Effect due to Law of Large Populations (LLP)

Relative Error =√

N− 1ρ̂

Menu 12

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

What’s the implication of ρ̂ = −0.005?

Many (major) survey results published before Nov 8, 2016;

Roughly amounts to 1% of eligible voters: n ≈ 2, 300, 000;

Equivalent to 2,300 surveys of 1,000 respondents each.

When ρ̂ = −0.005 = −1/200,DI = 1/40000, and hence

neff =f

1− f

1

DI=

1

99× 40000 ≈ 404!

A 99.98% reduction in n, caused by ρ̂ = −0.005.

Butterfly Effect due to Law of Large Populations (LLP)

Relative Error =√

N− 1ρ̂

Menu 12

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

What’s the implication of ρ̂ = −0.005?

Many (major) survey results published before Nov 8, 2016;

Roughly amounts to 1% of eligible voters: n ≈ 2, 300, 000;

Equivalent to 2,300 surveys of 1,000 respondents each.

When ρ̂ = −0.005 = −1/200,DI = 1/40000, and hence

neff =f

1− f

1

DI=

1

99× 40000 ≈ 404!

A 99.98% reduction in n, caused by ρ̂ = −0.005.

Butterfly Effect due to Law of Large Populations (LLP)

Relative Error =√

N− 1ρ̂

Menu 13

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Visualizing LLP: Actual Coverage for Clinton

AL

AK

AZ

AR

CA

CO

CTDEDC

FL

GA

HI

ID

IL

IN

IA

KS

KY

LA

ME

MDMA

MI

MN

MSMO

MT

NE

NV

NH

NJNM

NY

NCND

OH

OK

OR

PA

RISC

SDTN

TX

UT

VT VA

WA

WVWI

WY

−10

−5

−2

0

2

5

5.5 6.0 6.5 7.0

log10 (Total Voters)

Clin

ton

Zn

Menu 14

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Visualizing LLP: Actual Coverage for Trump

AL

AKAZ

AR

CACOCT

DE

DC

FL

GA

HI

ID

IL

IN

IA

KS KY

LAME MDMA

MI

MN

MS

MO

MT

NE

NV

NH

NJ

NM

NY

NC

ND

OH

OK

OR

PA

RI

SC

SD

TN

TX

UT

VT

VA

WAWV

WI

WY

−10

−5

−2

0

2

5

5.5 6.0 6.5 7.0

log10 (Total Voters)

Tru

mp

Zn

Menu 15

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

The Big Data Paradox:

If we do not pay attention to data quality, then

The bigger the data,

the surer we fool ourselves.

Menu 16

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Lessons Learned ...

Lesson 1: What matters most is the quality, not thequantity.

Lesson 2: Don’t ignore seemingly tiny probabilisticdatasets when combining data sources.

Lesson 3: Watch the relative size, not the absolutesize.

Lesson 4: Classical theory is BIG for “big data”, aslong as we let it go outside the classical box.

Menu 16

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Lessons Learned ...

Lesson 1: What matters most is the quality, not thequantity.

Lesson 2: Don’t ignore seemingly tiny probabilisticdatasets when combining data sources.

Lesson 3: Watch the relative size, not the absolutesize.

Lesson 4: Classical theory is BIG for “big data”, aslong as we let it go outside the classical box.

Menu 16

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Lessons Learned ...

Lesson 1: What matters most is the quality, not thequantity.

Lesson 2: Don’t ignore seemingly tiny probabilisticdatasets when combining data sources.

Lesson 3: Watch the relative size, not the absolutesize.

Lesson 4: Classical theory is BIG for “big data”, aslong as we let it go outside the classical box.

Menu 16

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Lessons Learned ...

Lesson 1: What matters most is the quality, not thequantity.

Lesson 2: Don’t ignore seemingly tiny probabilisticdatasets when combining data sources.

Lesson 3: Watch the relative size, not the absolutesize.

Lesson 4: Classical theory is BIG for “big data”, aslong as we let it go outside the classical box.

Menu 17

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

In case you are kind enough to invite me again ...

The sequel: Meng (2018/9)

Statistical Paradises and Paradoxes in Big Data (II):Multi-resolution Inference, Simpson’s

Paradox, and Individualized Treatments

Menu 17

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

In case you are kind enough to invite me again ...

The sequel: Meng (2018/9)

Statistical Paradises and Paradoxes in Big Data (II):Multi-resolution Inference, Simpson’s

Paradox, and Individualized Treatments

top related