
Big Data in the Social Science Research and Libraries

Micah Altman

Director of Research

MIT Libraries

Prepared for

Invited Seminar Series

Harbin Engineering University

September 2015

Big Data in the Social Science and Library Research

Roadmap

What the @#%&! is “big data”?

Two examples of big data in social & health sciences

Representative Challenges

Potential roles for libraries

Big Data Challenges

Acquisition

Retention

Analysis

Access

Credits & Disclaimers

DISCLAIMER: These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored, previously published work) my collaborators.

Secondary disclaimer:

“It’s tough to make predictions, especially about the future!”

-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.

Collaborators & Co-Conspirators

Workshop Series Co-Organizers

– U.S. Census Bureau: Cavan Capps, Ron Prevost

Research Support

Supported by the U.S. Census Bureau


Related Work

Main Project: Census-MIT Big Data Workshop Series

projects.informatics.mit.edu/bigdataworkshops

Related publications:

(Reprints available from: informatics.mit.edu )

Altman, M., D. O’Brien, S. Vadhan, A. Wood. 2014. “Big Data Study: Request for Information.”

Altman M, Wood A, O'Brien D, Gasser M., Vadhan, S. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Journal of Technology Law. Forthcoming.

Altman M, McDonald MP. 2014. Public Participation GIS: The Case of Redistricting. Proceedings of the 47th Annual Hawaii International Conference on System Sciences.

Workshops Series: Big Data and Official Statistics

Acquisition Challenges: Using New Forms of Information for Official Economic Statistics [August 3-4]

Privacy Challenges: Location Confidentiality and Official Surveys [October 5-6]

Inference Challenges: Transparency and Inference [December 7-8]

Expected outcomes:

Workshop reports (September, October, December)

Integrated white paper (February)

Identifying new opportunities for statistical agencies

Inform the Census Big Data Research Program.

projects.informatics.mit.edu/bigdataworkshops

“More” is the trend in research and education

More stuff & people…

… Forms of Evidence

… Collaborators

… Data

… Publishing, and Filtering

… Learners

… Access

… Evaluation

But declining response to traditional surveys…

[Figure: contact rate, cooperation rate, and response rate, plotted on a 0 to 100 scale for 1997, 2000, 2003, 2006, 2009, and 2012]

Source: Kohut, Andrew, Scott Keeter, Carroll Doherty, Michael Dimock, and Leah Christian. "Assessing the representativeness of public opinion surveys." Pew Research Center, Washington, DC (2012).

Trends and Challenges

Trends
- Increasingly data-driven economy
- Individuals are increasingly mobile
- Technology changes data uses
- Stakeholder expectations are changing

The next generation of social science research
- Utilize broad sources of information
- Increase granularity, detail, and timeliness
- Reduce cost & burden
- Maintain confidentiality and security

Multi-disciplinary challenges: Computation, Statistics, Informatics, Social Science, Policy

What the @#%&! is Big Data?

Small, Big, Massive & Ginormous

Data Characteristics: the k “V’s” of big data

Volume Velocity Variety + Veracity + Variability + …

“Big” is in the use, not just the data

Going big can change analysis strategy
- Model-based vs. data-based analysis
- Designed vs. “found” data
- Causal inference vs. descriptive/predictive (forecasting) inference

Scaling challenges for standard algorithms / software implementations
- “in core” vs. “out-of-core” implementations
- O(N^2) vs. O(log n) complexity
- Static vs. streaming algorithms
- Serial vs. massively parallel

Going big can change data management
- Provenance
- Data access and sharing
- Data privacy
- Change management
- Continuous integration
- Accommodating variety – semantics, quality
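To illustrate the "static vs. streaming" and "in core vs. out-of-core" distinction, here is a minimal sketch of a streaming computation: Welford's online algorithm for the mean and variance, which sees each value once and never holds the full dataset in memory. The class name `RunningStats`, the generator `stream_of_values`, and the sample values are hypothetical and chosen only for illustration; they are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Welford's online algorithm: one pass over the data, constant memory."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; undefined for fewer than two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def stream_of_values():
    """Hypothetical data source that yields one record at a time."""
    for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        yield x


stats = RunningStats()
for value in stream_of_values():
    stats.update(value)

print(stats.n, stats.mean, stats.variance)
```

A static implementation would load every value into a list before computing anything; the streaming version gives the same summary while the data source is still producing records.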

“Big” is in the use, not just the data

Data becomes “big” when volume, variety, etc. exceed limits of traditional methods and practices:
- Implementation Limits – Performance Challenges
- Analysis Limits – Inferential Challenges

But also:
- Data Management (workflows, systems, standards)
- Data Governance (policies, licenses)
- Data Controls (privacy, security)

Three examples (Good Cop, Bad Cop, No cop?)

Using Text Analysis to Discover U.S. Debate Strategies

More Information

• Grimmer, Justin, and Gary King. "General purpose computer-assisted clustering and conceptualization." Proceedings of the National Academy of Sciences 108.7 (2011): 2643-2650.

• Discover new communication pattern

• Use exaggerated language to put them down or devalue their ideas

Data Source - Social Media Messages

Data: Structure - Network, Unstructured Text, Structured metadata

Unit of Observation - Individuals; Interactions

Collection Design - Pure observational

Desired Inferences
- Causal inference – what censorship strategies cause observed reaction
- Inference to Population Frame

Performance challenges
- High volume
- Complex network structure
- Scaling bespoke algorithms
- Sparsity
- Systematic and sparse metadata

Management Challenges
- License
- Replication
- Revision Control

Inferential Challenges
- Measurement error – extracting topics from text

Using Google Searches to Forecast Disease Outbreaks


More Information

• Ginsberg, Jeremy, et al. "Detecting influenza epidemics using search engine query data." Nature 457.7232 (2009): 1012-1014.

• Lazer, David, et al. "The parable of Google Flu: traps in big data analysis." Science 343.14 March (2014).

“Big data hubris” is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis.

Data Source - Google search queries

Data: Structure - Quasi-tabular, structured metadata and unstructured text

Unit of Observation - Interactions with a system

Collection Design - Pure observational

Desired Inferences
- Predictive inference
-- where will flu clusters appear next
-- short-term (nearcasting)
-- small-area (fine spatial granularity)
- Inference to general population

Performance challenges
- Streaming algorithms

Management Challenges
- Replication
- Transparency
- Variability

Inferential Challenges
- External Validity
- Measurement error – extracting topics from text
- Overfitting
- Sampling
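The "Overfitting" challenge listed above can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reconstruction of the Google Flu Trends method: every series here is pure random noise, and the sizes (`N_TRAIN`, `N_TEST`, `N_CANDIDATES`) and the seed are arbitrary, hypothetical choices. Picking the candidate predictor that best matches the target in a short training window produces a strong in-sample correlation that does not carry over to new data.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(2015)          # fixed seed so the run is repeatable
N_TRAIN, N_TEST, N_CANDIDATES = 20, 20, 2000
TOTAL = N_TRAIN + N_TEST

# The "outcome" and every candidate "predictor" are independent noise.
target = [rng.gauss(0, 1) for _ in range(TOTAL)]
candidates = [[rng.gauss(0, 1) for _ in range(TOTAL)]
              for _ in range(N_CANDIDATES)]


def train_corr(series):
    """Absolute correlation with the target over the training window only."""
    return abs(pearson(series[:N_TRAIN], target[:N_TRAIN]))


# Select the single best-looking predictor using training data alone.
best = max(candidates, key=train_corr)
in_sample = train_corr(best)
out_of_sample = abs(pearson(best[N_TRAIN:], target[N_TRAIN:]))

print(f"in-sample |r| = {in_sample:.2f}, out-of-sample |r| = {out_of_sample:.2f}")
```

With thousands of candidates and only a handful of observations, some unrelated series will always look predictive, which is one reason held-out validation matters when predictors are selected from very large pools.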

Using Google Adwords for Ad Targeting (Big Data but Not Social Science)


More Information

• Gross, Margaret S., et al. "Using Google AdWords for international multilingual recruitment to health research websites." Journal of medical Internet research 16.1 (2014).

Visible:
- impressions
- click-through
- conversions
- $$
- demographics

Hidden:
- selection criteria
- coverage of population

[Diagram labels: User X advertisement interactions; Ad-pool bids; User signals; Site quality signals]

Data Source - Google search queries

Data: Structure - Visible – tabular

Unit of Observation - Interactions with a system

Collection Design - Pure observational

Desired Inferences
- Ad conversions
- $$
- Representativeness of subjects
- Subject conversion decision

Performance challenges
- Emergent behavior

Management Challenges
- Cost

Inferential Challenges
- Sampling

Comparing Cases: Chinese Censorship vs. Flu Prediction

Data Source
- Chinese Censorship: Social media messages
- Flu Prediction: Google search queries

Data: Structure
- Chinese Censorship: Network, unstructured text, structured metadata
- Flu Prediction: Quasi-tabular, structured metadata and unstructured text

Unit of Observation
- Chinese Censorship: Individuals; interactions
- Flu Prediction: Interactions with a system

Collection Design
- Chinese Censorship: Pure observational
- Flu Prediction: Pure observational

Desired Inferences
- Chinese Censorship: Causal inference (what censorship strategies cause observed reaction); inference to population frame
- Flu Prediction: Predictive inference (where will flu clusters appear next; short-term nearcasting; small-area, fine spatial granularity); inference to general population

Performance challenges
- Chinese Censorship: High volume; complex network structure; scaling bespoke algorithms; sparsity; systematic and sparse metadata
- Flu Prediction: Streaming algorithms

Management Challenges
- Chinese Censorship: License; replication; revision control
- Flu Prediction: Replication; transparency; variability

Inferential Challenges
- Chinese Censorship: Measurement error (extracting topics from text)
- Flu Prediction: External validity; measurement error (extracting topics from text); overfitting; sampling

How is Big Data Different for Social Science?

Social Science Goal / What do we need to know about data?

Inference that is valid beyond sample
- Characteristics of sampling distribution
- Characteristics of sampling frame
- (In)dependence of samples

Involves phenomena that are challenging to measure
- Reliability, comparability of measures
- External validity of measurement

Involves predicting long-term causes and effects
- (In)dependence of treatments (selection, contagion)
- Causal structure

Scientific data requires explicit documentation of provenance, quality, comparability

Why is dealing with big data hard?

Big Data Challenges

Acquisition
- Sources
- Incentives
- Quality
- Provenance

Retention
- Change Management
- Integration
- Security
- Storage

Analysis
- Bias
- Causation
- Computation
- Visualization

Access
- Transparency
- Reproducibility
- Durable Access (Preservation)
- Confidentiality

Acquisition Challenges: Quality, Provenance, Sources

Some Sources of Economic Information

Smartphones – GPS, imagery, voice, telecom, proximity

Vehicle systems

IoT (internet of things) – smart light bulbs, thermostats, fire alarms

Transactions – online, internal

Search behavior – search engine queries

Social media – Twitter, Facebook, LinkedIn

Imagery – satellite, thermal, video

Source Characteristics

Unit of Observation
- Location, virtual service, communication network, individual

Context
- Behavior, transaction, environment, statement

Measure characteristics
- Measure scale
- Measure structure
- Accuracy, precision

Frame & Sample characteristics

Analysis Challenges: Bias, Computation, Causation, Integration

Model of The World, Grad School Theory Version

λβ

Parameters

Model of The World, Quantitative Postdoc Version

[Diagram labels: Target Population; Frame; Selection; Super Population; Laws (structures); λβ (generates); Parameters]

Some Potential Sources of Analysis Error

[Diagram labels: Target Population; Frame; Selection; Super Population; Laws (structures); λβ (generates); Parameters]

• Selection bias

• Frame uncertainty

• Measurement error

• Unknown measurement semantics

• Non-independence of measures

• Non-independence of samples

• Model uncertainty

• Unknown causal structure

• Shift in measurements, samples, frames
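Selection bias, the first error source listed above, is worth a concrete illustration because more data does not remove it. The simulation below is a sketch with entirely hypothetical numbers (population size, inclusion probabilities, sample size, seed): a very large "found" sample whose inclusion depends on the measured value is compared with a small simple random sample.

```python
import random


def mean(values):
    return sum(values) / len(values)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population of 100,000 units with a numeric attribute.
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]
true_mean = mean(population)


def included(x):
    """"Found" data: units with high values are three times as likely to appear."""
    return rng.random() < (0.9 if x > 50.0 else 0.3)


found = [x for x in population if included(x)]   # large, self-selected sample
designed = rng.sample(population, 500)           # small simple random sample

found_error = abs(mean(found) - true_mean)
designed_error = abs(mean(designed) - true_mean)

print(f"found:    n={len(found):6d}  error={found_error:.2f}")
print(f"designed: n={len(designed):6d}  error={designed_error:.2f}")
```

The self-selected sample is far larger yet lands further from the population mean, because its error comes from the selection mechanism rather than from sampling variability.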

Ensuring Repeatability & Transparency

[Diagram labels]

Theory (Rules, Entities, Concepts)

Algorithm (Protocol, Operationalization)

Implementation (Software, Coding Rules, Instrumentation)

Execution (Deployment, House Survey Style, Equipment Setting)

Algorithms (Protocol, Operationalization)

Implementations (Software, Coding Rules, Instrumentation Design)

Executions (Deployment, House Survey Style, Operating System, Hardware, Starting Values, PRNG seeds)

Structure; Formats; Versions/Revisions; Selections; Integrations; Instantiations (copies)

Execution Context (weather, compiler, operating system, system load)
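The execution layer above names starting values and PRNG seeds as details that affect repeatability. As a minimal sketch of recording them, the function below runs a small bootstrap and returns the estimate together with the seed and basic platform information. The function name `run_analysis`, the record's field names, and the data values are hypothetical, chosen for illustration only.

```python
import platform
import random
import sys


def run_analysis(data, seed, replicates=200):
    """Bootstrap standard error of the mean, returned with execution details."""
    rng = random.Random(seed)  # private generator, so the seed fully determines the draws
    n = len(data)
    means = []
    for _ in range(replicates):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    grand = sum(means) / len(means)
    std_err = (sum((m - grand) ** 2 for m in means) / (len(means) - 1)) ** 0.5
    return {
        "estimate": std_err,
        "seed": seed,
        "replicates": replicates,
        # Generator behavior can differ across interpreter versions, so record them.
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
    }


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical measurements
first = run_analysis(data, seed=12345)
second = run_analysis(data, seed=12345)

print(first)
print("repeatable:", first["estimate"] == second["estimate"])
```

Rerunning with the recorded seed on the same interpreter reproduces the estimate exactly; without the seed, a reviewer could only check that results agree within simulation error.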

Access Challenge: Data Repeatability, Transparency, Preservation

Many Initiatives to Improve Scientific Reliability

- Retraction monitoring
- Data citation
- Clinical trial preregistration
- Registered replication
- Open data
- Badges

Some Types of Reproducibility Issues

• Fraud
• Misconduct
• Negligence
• Bit Rot
• Versioning problem
• Replication
• Reproduction
• Extension
• Result Validation
• Fact Checking
• Calibration, Extension, Reuse
• Under-reporting
• Data Dredging
• Multiple Comparisons, P-Hacking
• Sensitivity, Robustness
• Reliability
• Generalizability
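The "Multiple Comparisons" entry in the list above can be made concrete with a short calculation. The sketch below assumes independent tests that are all truly null, which is a simplification; the function names are hypothetical. It computes the probability of at least one false positive across m tests, and the per-test threshold given by the standard Bonferroni correction.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """P(at least one false positive) across m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test level that keeps the family-wise rate at or below alpha."""
    return alpha / m


m = 20
uncorrected = familywise_error_rate(0.05, m)
corrected = familywise_error_rate(bonferroni_threshold(0.05, m), m)

print(f"{m} tests at 0.05 each: family-wise rate = {uncorrected:.3f}")
print(f"{m} tests at {bonferroni_threshold(0.05, m):.4f} each: family-wise rate = {corrected:.3f}")
```

Twenty uncorrected tests give roughly a 64% chance of at least one spurious "finding", which is why unreported searches over many specifications undermine reproducibility.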


Identify Properties Supporting Reproducibility Claim

Reproducibility Related Issue Related informatics claims

Label: Validation, Fact Checking

Reproducibility Issue: Variance of estimates given data identifier & analysis algorithm

Reproducibility Claim: Variance of estimates given data identifier & analysis algorithm is known & correctly represented.

Use Case: Post-publication reviewer wants to establish that published claims correspond to analysis method performed…

Potential supporting informational claims

1. Instance of data retrieved via identifier is semantically equivalent to instance of data used to support published claim

2. Analysis algorithm is robust to choice of reasonable alternative implementation

3. Implementation of algorithm is robust to reasonable choice of execution details and context

4. Published direct claims about data are semantically equivalent to subset of claims produced by authors’ previous application of analysis

5. …

Potential information systems properties supporting claims

1a. Detailed provenance history for data from collection through analysis and deposition

1b. Automatic replication of direct data claims from deposited source

1c. Cryptographic evidence (e.g. cryptographically signed {analysis output including cryptographic hash of data} & {cryptographic hash of data retrieved via identifier}) …

2a. Standard implementation, subject to community review

2b. Report of results of application of implementation on standard testbed

2c. Availability of implementation for inspection …

3. …
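Property 1c above relies on cryptographic hashes of the data. The sketch below shows only the hashing half, using SHA-256 from Python's standard `hashlib`; the byte strings are hypothetical stand-ins for a deposited file and later retrievals, and the signing step mentioned in 1c is not shown because it would additionally require key management.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Hex digest that changes if any byte of the payload changes."""
    return hashlib.sha256(payload).hexdigest()


# Hypothetical file contents: what the authors deposited, and two later retrievals.
deposited = b"id,age\n1,34\n2,51\n"
retrieved_same = b"id,age\n1,34\n2,51\n"
retrieved_changed = b"id,age\n1,34\n2,15\n"   # two digits transposed

# Digest recorded alongside the published analysis output.
recorded_digest = sha256_hex(deposited)


def matches_record(payload: bytes, recorded: str) -> bool:
    return sha256_hex(payload) == recorded


print(matches_record(retrieved_same, recorded_digest))
print(matches_record(retrieved_changed, recorded_digest))
```

A matching digest shows the retrieved bytes are identical to the deposited ones, which is sufficient for claim 1 but stricter than it needs to be: a reformatted file could be semantically equivalent and still fail this check.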

Entities, and Relationships, and Straw Models (oh my!)

[Diagram nodes: ‘The World’; ‘Actors’ (people); ‘Theory’ (ideas); ‘Documents’; ‘Methods’; ‘Data’; Analysis (output)]

[Diagram edge labels: affect decisions of; interact/intervene/simulate; select and apply; select, design, perform; create and apply; apply over; observe, edit]

Documents (compendia, fairy tales)

‘We applied a general linear model’

‘We conjecture kids will choose candy’

‘δ = 2.3 * √Ω’

‘Chewing gum tastes great’ (Altman, et al. 2013)

Assertions about other entities

Logical Claims

Theorem 1…. Lemma 1.1

Speculations, Commentary

Thanks to my dog for his support…

References, Citation

U49845.1 GI:1293613
doi:10.1002/0470841559.ch1
orcid:0000-0001-7382-6960

Internal Meta-Information

Title: XXXX

Publication Operations (manuscripts, proofreading, translation, …)

People (their Relationships & Actions)

Identity: Who is the actor?

Relationship (or action): What did the actor do, or how are they related?

Relationship Context (time, duration, intent)


Reproducibility claims are not necessarily direct claims about “the world”…

1. What claims about information are implied by reproducibility claims/issues?

2. What properties of information and information flow are related to those claims?

3. What would possible changes to information processing and flow yield? (And how much would they cost?)

4. What confidentiality risks do these changes create?

Access Challenge: Data Confidentiality, Security

Durable, Long-Term Access

• Why durable access?
- The rule of law requires maintaining authentic public records
- Scientific advances rely on a cumulative, traceable evidence base
- Art, history, culture require durable access to national heritage information
- Our nation needs durable access to a strategic information reserve
- Humanity needs durable long-term access to information in order to communicate to future generations

• Big data challenges to durability
- Velocity – information is updated, sometimes overwritten
- Many sources are commercial/private – not routinely archived, preserved
- Modeling future value of information
- Maintaining privacy and confidentiality


Big data challenges…

Anonymization can completely destroy utility
- The “Netflix Problem”: large, sparse datasets that overlap can be probabilistically linked [Narayanan and Shmatikov 2008]

Observable Behavior Leaves Unique “Fingerprints”
- The “GIS” problem: fine geo-spatial-temporal data is impossible to mask when correlated with external data [Zimmerman 2008]

Big Data can be Rich, Messy & Surprising
- The “Facebook Problem”: possible to identify masked network data, if only a few nodes are controlled [Backstrom et al. 2007]
- The “Blog Problem”: pseudonymous communication can be linked through textual analysis [Novak et al. 2004]

Source: [Calberese 2008; Real Time Rome Project 2007]
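To show why overlapping sparse datasets are risky, here is a toy linkage sketch. It is not the Narayanan and Shmatikov algorithm, which uses weighted similarity scores and tolerates noisy dates and ratings; it simply counts exact overlaps. All identifiers, items, and dates are made up for illustration.

```python
# "Anonymized" release: pseudonymous IDs with (item, date) records. Hypothetical data.
anonymized = {
    "user_A": {("film_17", "2006-03-01"), ("film_42", "2006-03-04"),
               ("film_99", "2006-04-11"), ("film_03", "2006-05-20")},
    "user_B": {("film_17", "2006-03-02"), ("film_08", "2006-03-09"),
               ("film_55", "2006-04-01")},
    "user_C": {("film_42", "2006-03-04"), ("film_08", "2006-03-10"),
               ("film_71", "2006-06-15")},
}

# Auxiliary public information: a few reviews posted under real names. Hypothetical data.
public = {
    "alice": {("film_17", "2006-03-01"), ("film_99", "2006-04-11")},
    "bob": {("film_08", "2006-03-09"), ("film_55", "2006-04-01")},
    "carol": {("film_42", "2006-03-04")},
}


def best_match(known_records, release):
    """Return the pseudonym with the most overlapping records, or None on a tie."""
    scored = sorted(((len(known_records & records), pseudonym)
                     for pseudonym, records in release.items()), reverse=True)
    (top_score, top_id), (runner_up_score, _) = scored[0], scored[1]
    return top_id if top_score > runner_up_score else None


links = {name: best_match(records, anonymized) for name, records in public.items()}
print(links)
```

Two known records are enough to single out a pseudonym here, while one record shared by two users is not; in sparse real-world data, a small number of records is often unique.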

Little Data in a Big World

The “Favorite Ice Cream” problem -- public information that is not risky can help us learn information that is risky

The “Doesn’t Stay in Vegas” problem -- information shared locally can be found anywhere

The “Unintended Algorithmic Discrimination” problem -- algorithms are often not transparent, and can amplify human biases

Algorithmic Discrimination

Big data algorithms
- Often lack transparency
- Can be changed without notice
- Depend on users’ interactions with the system

Discrimination can occur when:
- Decisions are made on correlations, without examining causal relationships
- Algorithms concentrate many small, implicit decisions by individuals

Emerging Confidentiality Approaches for Big Data

Controlled remote access
- Varies from remote access to all data and output to human vetting of output
- Restrictions on use, easier to enforce
- Advantages: auditable, potential to impose human review, potential to limit analysis
- Disadvantages: complex to implement, slow

Model servers
- Mediated remote access – analysis limited to designated models
- Differential privacy methods can be used to formally guarantee confidentiality of some models
- Advantages: faster, no human in loop
- Disadvantages: limited set of models currently supported; complex to implement

Experimental approaches
- Personal Data Stores
- Automatic Data Auditing and Accountability
- Multi-party computing
- Functional encryption
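As a rough illustration of the differential privacy methods mentioned under model servers, the sketch below applies the Laplace mechanism to a counting query. It is a teaching sketch, not a hardened implementation: the function names, the data, and the choice of epsilon are hypothetical, and production systems need additional care with floating-point noise generation and privacy budget accounting.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) draw: the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Epsilon-differentially-private count via the Laplace mechanism.

    Adding or removing one record changes a count by at most 1 (sensitivity 1),
    so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def over_40(age):
    return age > 40


rng = random.Random(7)                                  # seeded only for repeatability
ages = [23, 35, 41, 52, 67, 29, 48, 71, 33, 58]         # hypothetical confidential data

released = private_count(ages, over_40, epsilon=0.5, rng=rng)
print(f"released count: {released:.2f}")
```

A model server would return only the noisy value. Smaller epsilon means more noise and stronger protection, and repeated queries consume the privacy budget, which is part of why only a limited set of models is supported this way.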

Categorizing Challenges

Implementation – Performance Challenges

Systems challenges
- Exceed capacity of locally managed storage
- Location and migration of data becomes critical for performance
- Standard backup, recovery and data integrity mechanisms ineffective
- Communication bandwidth

Algorithmic challenges
- “in core” vs. “out-of-core” implementations
- O(N^2) vs. O(log n) complexity
- Static vs. streaming algorithms
- Serial vs. massively parallel
- Distributed – shared-nothing algorithms

Analysis methods – Inferential Challenges
- Sources: Designed vs. “found” data
- Model-based vs. data-based analysis
- Causal inference vs. descriptive/predictive (forecasting) inference

Data Management & Workflow
- Provenance
- Data quality
- Change management
- Continuous integration
- Accommodating variety – semantics, quality
- Transparency and reproducibility
- Privacy
- Security

Data Governance and Policy
- Standards
- Incentives
- Certifications
- Regulation
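The "distributed, shared-nothing" item under algorithmic challenges can be sketched in a few lines. In this hypothetical example each partition computes a small summary using only its own data, and the summaries are merged afterwards; the partitions here are just slices of a list in one process, standing in for separate machines.

```python
from functools import reduce


def partial_stats(chunk):
    """Summary computed locally on one partition: (count, sum)."""
    return (len(chunk), sum(chunk))


def merge(a, b):
    """Combine two partial summaries; associative, so merge order does not matter."""
    return (a[0] + b[0], a[1] + b[1])


def partition(values, k):
    """Split values into k disjoint partitions (round-robin)."""
    return [values[i::k] for i in range(k)]


values = [float(v) for v in range(1, 101)]   # hypothetical data: 1.0 .. 100.0
parts = partition(values, 4)

# Each partition is summarized independently, then the summaries are merged.
count, total = reduce(merge, map(partial_stats, parts))
distributed_mean = total / count

print(count, distributed_mean)
```

Only the two-number summaries cross partition boundaries, which is the property that lets this pattern scale when communication bandwidth is the bottleneck. Statistics that cannot be expressed as mergeable summaries, such as exact medians, need different algorithms.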

Access Challenge: Data Governance

Preliminary Observations from First Census-MIT Workshop

Topic: Sources of Economic Big Data

Use Case: Commodity Flow Survey

Observations:

Different classes of decisions require different sources of data:
- E.g. much designed survey data contributes baseline data for decisions about infrastructure and strategic planning
- Transaction-based big data could contribute frequency and granularity of estimates

In big data, data sources are stakeholders
- Businesses need to react quickly and predict the future – and need frequently updated detailed data
- Critical to provide a value proposition to business
- Critical to develop a trust relationship
- Critical to develop appropriate privacy, security and licensing terms

Some potential sources
- ERP and DRP operations data
- EDI
- Mobile phone traffic data

Some Non-Technical Questions About Sources

With big data, critical to identify…

● Who are the key stakeholders in big data source, and what are the key stakeholder incentives?

○ What key decisions does this information support for stakeholders? What are the gaps in data from the stakeholder perspective?

○ What are barriers associated with new sources of information?
  ○ Legal barriers – privacy, licenses
  ○ Economic barriers – value of data, analytics
  ○ Social/trust barriers

Potential Roles for Libraries

Potential Roles -- Infrastructure

Dissemination
- Discovery – catalog range of new statistics/indicators, sources
- Selection based on quality attributes
- Guide proper use

Durability – Data Stewardship
- Ensure long-term accessibility of big data
- Manage provenance, versioning
- Provide transparency of new indicators/statistics

Security & Confidentiality
- Libraries could be a trusted and accountable 3rd party
- Store and integrate data from multiple sources
- Could develop expert implementation of privacy best practices

Potential Roles -- Leadership

Advocacy
- Advocate for quality, transparency, replication, durable access

Standardization
- Develop new methods for big data management
- Identify “best practices” for replication, transparency, long-term access
- Standardize licenses for replication, reuse, preservation

Why Should Libraries Engage in Big Data?

Big data sources cross disciplines – libraries are multidisciplinary

Big data challenges include discovery, information management, information quality, governance – libraries are engaged in these standards and practices

Big data challenges privacy and security of users – libraries are trusted

Research requires durable access to a big evidence base – libraries take the long-term view for stewardship

Additional References

● Einav, Liran, and Jonathan Levin. "Economics in the age of big data." Science 346.6210 (2014): 1243089. http://www.sciencemag.org/content/346/6210/1243089.short

● Varian, Hal R. "Big data: New tricks for econometrics." The Journal of Economic Perspectives 28.2 (2014): 3-27. http://people.ischool.berkeley.edu/~hal/Papers/2013/ml.pdf

● Reimsbach-Kounatze, C. (2015), “The Proliferation of “Big Data” and Implications for Official Statistics and Statistical Agencies: A Preliminary Analysis”, OECD Digital Economy Papers, No. 245, OECD Publishing. http://dx.doi.org/10.1787/5js7t9wqzvg8-en

● Kriger, David S., et al. Freight Transportation Surveys. Vol. 410. Transportation Research Board, 2011. http://www.nap.edu/catalog/13627/nchrp-synthesis-410-freight-transportation-surveys

Questions?

E-mail: [email protected]

Web: informatics.mit.edu

Creative Commons License

This work, Managing Confidential Information in Research, by Micah Altman (http://redistricting.info), is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.