Big Data in the Social Science Research and Libraries
Micah Altman
Director of Research
MIT Libraries
Prepared for
Invited Seminar Series
Harbin Engineering University
September 2015
Big Data in the Social Science and Library Research
Roadmap
What the @#%&! is “big data”?
Two examples of big data in social & health sciences
Representative Challenges
Potential roles for libraries
Big Data Challenges
Acquisition
Retention
Analysis
Access
DISCLAIMER: These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.
Secondary disclaimer:
“It’s tough to make predictions, especially about the future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan
Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.
Collaborators & Co-Conspirators
Workshop Series Co-Organizers (U.S. Census Bureau): Cavan Capps, Ron Prevost
Research Support: Supported by the U.S. Census Bureau
Related Work
Main Project: Census-MIT Big Data Workshop Series
projects.informatics.mit.edu/bigdataworkshops
Related publications:
(Reprints available from: informatics.mit.edu )
Altman, M., D. O’Brien, S. Vadhan, A. Wood. 2014. “Big Data Study: Request for Information.”
Altman M, Wood A, O'Brien D, Gasser M., Vadhan, S. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Journal of Technology Law. Forthcoming.
Altman M, McDonald MP. 2014. Public Participation GIS: The Case of Redistricting. Proceedings of the 47th Annual Hawaii International Conference on Systems Science.
Workshops Series: Big Data and Official Statistics
Acquisition Challenges: Using New Forms of Information for Official Economic Statistics [August 3-4]
Privacy Challenges: Location Confidentiality and Official Surveys [October 5-6]
Inference Challenges: Transparency and Inference [December 7-8]
Expected outcomes:
Workshop reports (September, October, December)
Integrated white paper (February)
Identifying new opportunities for statistical agencies
Inform the Census Big Data Research Program.
projects.informatics.mit.edu/bigdataworkshops
More stuff & people…
… Forms of Evidence
… Collaborators
… Data
… Publishing, and Filtering
… Learners
… Access
… Evaluation
But declining response to traditional surveys…
[Figure: contact rate, cooperation rate, and response rate (vertical axis 0 to 100) by year, 1997 to 2012]
Source: Kohut, Andrew, Scott Keeter, Carroll Doherty, Michael Dimock, and Leah Christian. "Assessing the representativeness of public opinion surveys." Pew Research Center, Washington, DC (2012).
Trends and Challenges
Trends
• Increasingly data-driven economy
• Individuals are increasingly mobile
• Technology changes data uses
• Stakeholder expectations are changing
The next generation of social science research
• Utilize broad sources of information
• Increase granularity, detail, and timeliness
• Reduce cost & burden
• Maintain confidentiality and security
Multi-disciplinary challenges: Computation, Statistics, Informatics, Social Science, Policy
Small, Big, Massive & Ginormous
Data Characteristics: the k “V’s” of big data
Volume Velocity Variety + Veracity + Variability + …
“Big” is in the use, not just the data
Going big can change analysis strategy
• Model-based vs. data-based analysis
• Designed vs. “found” data
• Causal inference vs. descriptive/predictive (forecasting) inference
• Scaling challenges for standard algorithms / software implementations
  • “in core” vs. “out-of-core” implementations
  • O(N^2) vs. O(log n) complexity
  • Static vs. streaming algorithms
  • Serial vs. massively parallel
Going big can change data management
• Provenance
• Data access and sharing
• Data privacy
• Change management
• Continuous integration
• Accommodating variety – semantics, quality
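The contrast between static and streaming algorithms on this slide can be illustrated with a small sketch. It is not from the talk: it assumes the statistic of interest is a mean and variance, and uses Welford's one-pass method so that only three numbers are held in memory regardless of how many values arrive. The class name `RunningStats` and the generator `stream_of_values` are hypothetical names introduced for illustration.

```python
class RunningStats:
    """One-pass (streaming) mean and variance using Welford's method."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        # Fold one new observation into the running summary.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance; undefined for fewer than two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def stream_of_values():
    # Hypothetical stand-in for a source too large to load at once.
    for i in range(1, 1001):
        yield float(i)


stats = RunningStats()
for value in stream_of_values():
    stats.update(value)
```

A static ("in core") version would first collect every value in a list and then compute the same statistics; the streaming version never stores the data, which is the property that matters once volume exceeds memory.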
“Big” is in the use, not just the data
Data becomes “big” when volume, variety, etc. exceed limits of traditional methods and practices:
• Implementation Limits – Performance Challenges
• Analysis Limits – Inferential Challenges
But also:
• Data Management (workflows, systems, standards)
• Data Governance (policies, licenses)
• Data Controls (privacy, security)
Using Text Analysis to Discover U.S. Debate Strategies
More Information
• Grimmer, Justin, and Gary King. "General purpose computer-assisted clustering and conceptualization." Proceedings of the National Academy of Sciences 108.7 (2011): 2643-2650.
• Discover new communication pattern
• Use exaggerated language to put them down or devalue their ideas
Data Source - Social Media Messages
Data: Structure - Network, Unstructured Text, Structured metadata
Unit of Observation - Individuals; Interactions
Collection Design - Pure observational
Desired Inferences - Causal inference – what censorship strategies cause observed reaction; Inference to Population Frame
Performance challenges - High volume; Complex network structure; Scaling bespoke algorithms; Sparsity; Systematic and sparse metadata
Management Challenges - License; Replication; Revision Control
Inferential Challenges - Measurement error – extracting topics from text
Using Google Searches to Forecast Disease Outbreaks
More Information
• Ginsberg, Jeremy, et al. "Detecting influenza epidemics using search engine query data." Nature 457.7232 (2009): 1012-1014.
• Lazer, David, et al. "The parable of Google Flu: traps in big data analysis." Science 343.6176 (2014): 1203-1205.
“Big data hubris” is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis.
Data Source - Google search queries
Data: Structure - Quasi-tabular, structured metadata and unstructured text
Unit of Observation - Interactions with a system
Collection Design - Pure observational
Desired Inferences - Predictive inference: where will flu clusters appear next; short-term (nearcasting); small-area (fine spatial granularity); Inference to general population
Performance challenges - Streaming algorithms
Management Challenges - Replication; Transparency; Variability
Inferential Challenges - External Validity; Measurement error – extracting topics from text; Overfitting; Sampling
Using Google Adwords for Ad Targeting (Big Data but Not Social Science)
More Information
• Gross, Margaret S., et al. "Using Google AdWords for international multilingual recruitment to health research websites." Journal of medical Internet research 16.1 (2014).
Visible: impressions; click-through; conversions; $$; demographics
Hidden: selection criteria; coverage of population
[Diagram: user × advertisement interactions; ad-pool bids; user signals; site quality signals]
Data Source - Google search queries
Data: Structure - Visible – tabular
Unit of Observation - Interactions with a system
Collection Design - Pure observational
Desired Inferences - Ad conversions; $$; Representativeness of subjects; Subject conversion decision
Performance challenges - Emergent behavior
Management Challenges - Cost
Inferential Challenges - Sampling
Comparing Cases: Chinese Censorship vs. Flu Prediction
Data Source
• Chinese Censorship: Social Media Messages
• Flu Prediction: Google search queries
Data: Structure
• Chinese Censorship: Network, Unstructured Text, Structured metadata
• Flu Prediction: Quasi-tabular, structured metadata and unstructured text
Unit of Observation
• Chinese Censorship: Individuals; Interactions
• Flu Prediction: Interactions with a system
Collection Design
• Chinese Censorship: Pure observational
• Flu Prediction: Pure observational
Desired Inferences
• Chinese Censorship: Causal inference – what censorship strategies cause observed reaction; Inference to Population Frame
• Flu Prediction: Predictive inference: where will flu clusters appear next; short-term (nearcasting); small-area (fine spatial granularity); Inference to general population
Performance challenges
• Chinese Censorship: High volume; Complex network structure; Scaling bespoke algorithms; Sparsity; Systematic and sparse metadata
• Flu Prediction: Streaming algorithms
Management Challenges
• Chinese Censorship: License; Replication; Revision Control
• Flu Prediction: Replication; Transparency; Variability
Inferential Challenges
• Chinese Censorship: Measurement error – extracting topics from text
• Flu Prediction: External Validity; Measurement error – extracting topics from text; Overfitting; Sampling
How is Big Data Different for Social Science?
Social Science Goal: What do we need to know about data?
Inference that is valid beyond sample
- Characteristics of sampling distribution
- Characteristics of sampling frame
- (In)dependence of samples
Involves phenomena that are challenging to measure
- Reliability, comparability of measures
- External validity of measurement
Involves predicting long-term causes and effects
- (In)dependence of treatments (selection, contagion)
- Causal structure
Scientific data requires explicit documentation of provenance, quality, comparability
Big Data Challenges
Acquisition
• Sources
• Incentives
• Quality
• Provenance
Retention
• Change Management
• Integration
• Security
• Storage
Analysis
• Bias
• Causation
• Computation
• Visualization
Access
• Transparency
• Reproducibility
• Durable Access (Preservation)
• Confidentiality
Acquisition Challenges:
Quality, Provenance, Sources
Some Sources of Economic Information
Smartphones – GPS, Imagery, Voice, Telecom, Proximity
Vehicle systems
IoT (internet of things) – smart light bulbs, thermostats, fire alarms
Transactions – online, internal
Search behavior – search engine queries
Social media – Twitter, Facebook, LinkedIn
Imagery – satellite, thermal, video
…
Source Characteristics
Unit of Observation
• Location, virtual service, communication network, individual
Context
• Behavior, transaction, environment, statement
Measure characteristics
• Measure scale
• Measure structure
• Accuracy, precision
Frame & Sample characteristics
Analysis Challenges:
Bias, Computation, Causation, Integration
Model of The World, Grad School Theory Version
λβ
Parameters
Model of The World, Quantitative Postdoc Version
Target Population
Frame
Selection
Super Population
Laws
(structures)
λβ
(generates)
Parameters
Some Potential Sources of Analysis Error
Target Population
Frame
Selection
Super Population
Laws
(structures)
λβ
(generates)
Parameters
• Selection bias
• Frame uncertainty
• Measurement error
• Unknown measurement semantics
• Non-independence of measures
• Non-independence of samples
• Model uncertainty
• Unknown causal structure
• Shift in measurements, samples, frames
Ensuring Repeatability & Transparency
[Diagram: layers of an analysis and properties of the data it uses]
Theory (Rules, Entities, Concepts)
Algorithms (Protocol, Operationalization)
Implementations (Software, Coding Rules, Instrumentation Design)
Executions (Deployment, House Survey Style, Equipment Setting, Operating System, Hardware, Starting Values, PRNG seeds)
Execution Context (weather, compiler, operating system, system load)
Data: Structure; Formats; Versions/Revisions; Selections; Integrations; Instantiations (copies)
Access Challenge:
Data Repeatability, Transparency, Preservation
Many Initiatives to Improve Scientific Reliability
• Retraction monitoring
• Data citation
• Clinical trial preregistration
• Registered replication
• Open data
• Badges
Some Types of Reproducibility Issues
• Fraud
• Misconduct
• Negligence
• Bit Rot
• Versioning problem
• Replication
• Reproduction
• Extension
• Result Validation
• Fact Checking
• Calibration, Extension, Reuse
• Under-reporting
• Data Dredging
• Multiple Comparisons, P-Hacking
• Sensitivity, Robustness
• Reliability
• Generalizability
Identify Properties Supporting Reproducibility Claim
Reproducibility Related Issue / Related informatics claims
Label: Validation, Fact Checking
Reproducibility Issue: Variance of estimates given data identifier & analysis algorithm
Reproducibility Claim: Variance of estimates given data identifier & analysis algorithm is known & correctly represented.
Use Case: Post-publication reviewer wants to establish that published claims correspond to analysis method performed…
Potential supporting informational claims
1. Instance of data retrieved via identifier is semantically equivalent to instance of data used to support published claim
2. Analysis algorithm is robust to choice of reasonable alternative implementation
3. Implementation of algorithm is robust to reasonable choice of execution details and context
4. Published direct claims about data are semantically equivalent to subset of claims produced by authors' previous application of analysis
5. …
Potential information systems properties supporting claims
1a. Detailed provenance history for data from collection through analysis and deposition
1b. Automatic replication of direct data claims from deposited source
1c. Cryptographic evidence (e.g. cryptographically signed {analysis output including cryptographic hash of data} & {cryptographic hash of data retrieved via identifier}…
2a. Standard implementation, subject to community review
2b. Report of results of application of implementation on standard testbed
2c. Availability of implementation for inspection…
3. …
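Property 1c above can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any system from the talk: it assumes the author publishes a SHA-256 digest of the deposited data alongside the analysis output, and that a reviewer later recomputes the digest on the copy retrieved via the identifier. The function names and the sample bytes are hypothetical. Note that a matching hash shows byte-for-byte identity, which is stronger than the semantic equivalence named in claim 1, and that the digital signature step is omitted here.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    # Hex-encoded SHA-256 digest of the raw bytes.
    return hashlib.sha256(data).hexdigest()


def matches_published_hash(retrieved: bytes, published_hash: str) -> bool:
    # True only if the retrieved copy is byte-identical to what was hashed.
    return sha256_hex(retrieved) == published_hash


# Hypothetical deposited dataset and the digest published with the analysis.
deposited = b"id,income\n1,52000\n2,61000\n"
published_hash = sha256_hex(deposited)

# Two hypothetical copies retrieved later via the data identifier.
same_copy = b"id,income\n1,52000\n2,61000\n"
altered_copy = b"id,income\n1,52000\n2,61001\n"

same_ok = matches_published_hash(same_copy, published_hash)
altered_ok = matches_published_hash(altered_copy, published_hash)
```

Here `same_ok` is true and `altered_ok` is false: a single changed digit in the data produces a different digest.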
Entities, and Relationships, and Straw Models (oh my!)
[Diagram: entities and the relationships among them]
Entities: ‘The World’; ‘Actors’ (people); ‘Theory’ (ideas); ‘Documents’; ‘Methods’; ‘Data’; Analysis (output)
Relationships: (affect decisions of); (interact/intervene/simulate); (select and apply); (select, design, perform); (create and apply); (apply over); (observe, edit)
Documents (compendia, fairy tales)
Example statements: ‘We applied a general linear model’; ‘We conjecture kids will choose candy’; ‘δ = 2.3 * √Ω’; ‘Chewing gum tastes great’ (Altman, et al. 2013)
Assertions about other entities
Logical Claims: Theorem 1…. Lemma 1.1
Speculations, Commentary: Thanks to my dog for his support…
References, Citation: U49845.1 GI:1293613; doi:10.1002/0470841559.ch1; orcid:0000-0001-7382-6960
Internal Meta-Information: Title: XXXX
Publication Operations (manuscripts, proofreading, translation, …)
People (their Relationships & Actions)
Identity – Who is the actor?
Relationship (or action) – What did the actor do, or how are they related?
Relationship Context (time, duration, intent)
Reproducibility claims are not necessarily direct claims about “the world”…
1. What claims about information are implied by reproducibility claims/issues?
2. What properties of information and information flow are related to those claims?
3. What would possible changes to information processing and flow yield? (And how much would they cost?)
4. What confidentiality risks do these changes create?
Access Challenge:
Data Confidentiality,
Security
Durable, Long-Term Access
• Why durable access?
  • The rule of law requires maintaining authentic public records
  • Scientific advances rely on a cumulative, traceable evidence base
  • Art, history, culture require durable access to national heritage information
  • Our nation needs durable access to a strategic information reserve
  • Humanity needs durable long-term access to information in order to communicate to future generations
• Big data challenges to durability
  • Velocity – information is updated, sometimes overwritten
  • Many sources are commercial/private – not routinely archived, preserved
  • Modeling future value of information
  • Maintaining privacy and confidentiality
Big data challenges…
Anonymization can completely destroy utility
• The “Netflix Problem”: large, sparse datasets that overlap can be probabilistically linked [Narayanan and Shmatikov 2008]
Observable Behavior Leaves Unique “Fingerprints”
• The “GIS” Problem: fine geo-spatial-temporal data are impossible to mask when correlated with external data [Zimmerman 2008]
Big Data can be Rich, Messy & Surprising
• The “Facebook Problem”: possible to identify masked network data if only a few nodes are controlled [Backstrom et al. 2007]
• The “Blog Problem”: pseudonymous communication can be linked through textual analysis [Novak et al. 2004]
Source: [Calabrese 2008; Real Time Rome Project 2007]
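The linkage idea behind the “Netflix Problem” can be shown with a toy sketch. This is not the method of the cited paper: it is a deliberately simple exact-match overlap score, and all records, names, and function names are invented for illustration. The point is only that when records are sparse, agreement on a few items can single out one candidate.

```python
def overlap_score(anon_record, public_profile):
    """Count the items on which two sparse records agree exactly."""
    return sum(
        1
        for item, value in anon_record.items()
        if public_profile.get(item) == value
    )


def best_match(anon_record, public_profiles):
    """Return (name, score) for the top candidate and for the runner-up."""
    scores = {
        name: overlap_score(anon_record, profile)
        for name, profile in public_profiles.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[0], ranked[1]


# Invented "anonymized" record: item -> rating, with no name attached.
anon_record = {"film_a": 5, "film_q": 2, "film_z": 4}

# Invented public profiles with names attached.
public_profiles = {
    "alice": {"film_a": 5, "film_z": 4, "film_b": 3},
    "bob": {"film_a": 3, "film_c": 1},
    "carol": {"film_q": 1},
}

top, runner_up = best_match(anon_record, public_profiles)
```

A large gap between the top score and the runner-up is what makes a link credible; a real attack would also weight rare items more heavily and tolerate small mismatches.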
Little Data in a Big World
• The “Favorite Ice Cream” problem -- public information that is not risky can help us learn information that is risky
• The “Doesn’t Stay in Vegas” problem -- information shared locally can be found anywhere
• The “Unintended Algorithmic Discrimination” problem -- algorithms are often not transparent, and can amplify human biases
Algorithmic Discrimination
Big data algorithms
• Often lack transparency
• Can be changed without notice
• Depend on users' interactions with the system
Discrimination can occur when:
• Decisions are made on correlations, without examining causal relationships
• Algorithms concentrate many small, implicit decisions by individuals
Emerging Confidentiality Approaches for Big Data
Controlled remote access
• Varies from remote access to all data and output to human vetting of output
• Restrictions on use, easier to enforce
• Advantages: auditable, potential to impose human review, potential to limit analysis
• Disadvantages: complex to implement, slow
Model servers
• Mediated remote access – analysis limited to designated models
• Differential privacy methods can be used to formally guarantee confidentiality of some models
• Advantages: faster, no human in loop
• Disadvantages: limited set of models currently supported; complex to implement
Experimental approaches
• Personal Data Stores
• Automatic Data Auditing and Accountability
• Multi-party computing
• Functional encryption
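As a concrete instance of the differential privacy methods mentioned under model servers, the sketch below applies the standard Laplace mechanism to a counting query. It is an assumption-laden illustration, not a system described in the talk: the function names, the sample ages, and the choice of epsilon are invented, and sampling Laplace noise with ordinary floating-point arithmetic is not safe for production use.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    # A count changes by at most 1 when one record is added or removed,
    # so its sensitivity is 1 and the noise scale is 1 / epsilon.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)                  # fixed seed only for repeatability
ages = [23, 35, 41, 29, 67, 52, 38, 44]  # invented records

# Noisy answer to "how many people are 40 or older?" (true answer: 4).
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
```

Smaller epsilon means more noise and stronger protection; the server returns only `noisy`, never the underlying records.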
Categorizing Challenges
Implementation – Performance Challenges
Systems challenges
• Exceed capacity of locally managed storage
• Location and migration of data becomes critical for performance
• Standard backup, recovery and data integrity mechanisms ineffective
• Communication bandwidth
Algorithmic challenges
• “in core” vs. “out-of-core” implementations
• O(N^2) vs. O(log n) complexity
• Static vs. streaming algorithms
• Serial vs. massively parallel
• Distributed – shared-nothing algorithms
Analysis methods – Inferential Challenges
• Sources: Designed vs. “found” data
• Model-based vs. data-based analysis
• Causal inference vs. descriptive/predictive (forecasting) inference
Data Management & Workflow
• Provenance
• Data quality
• Change management
• Continuous integration
• Accommodating variety – semantics, quality
• Transparency and reproducibility
• Privacy
• Security
Data Governance and Policy
• Standards
• Incentives
• Certifications
• Regulation
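The “distributed, shared-nothing” item above can be illustrated with mergeable summaries: each partition is summarized on its own, and only the small summaries are combined. The sketch assumes the target is a mean and variance and uses the standard pairwise combination formula for (count, mean, sum of squared deviations); the function names and the sample partitions are invented, and every partition is assumed non-empty.

```python
from functools import reduce


def summarize(partition):
    """Local summary (n, mean, m2) computed from one partition only."""
    n = len(partition)
    mean = sum(partition) / n
    m2 = sum((x - mean) ** 2 for x in partition)
    return n, mean, m2


def merge(a, b):
    """Combine two summaries without revisiting the underlying data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


# Invented data split across three workers that share nothing.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]

n, mean, m2 = reduce(merge, (summarize(p) for p in partitions))
variance = m2 / (n - 1)  # sample variance of the pooled data
```

Because `merge` is associative, the summaries can be combined in any order or in a tree, which is what lets the work be spread across machines.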
Preliminary Observations from First Census-MIT Workshop
Topic: Sources of Economic Big Data
Use Case: Commodity Flow Survey
Observations:
• Different classes of decisions require different sources of data:
  • E.g. much designed survey data contributes baseline data for decisions about infrastructure and strategic planning
  • Transaction-based big data could contribute frequency and granularity of estimates
• In big data, data sources are stakeholders
  • Businesses need to react quickly and predict the future, and need frequently updated detailed data
  • Critical to provide a value proposition to business
  • Critical to develop a trust relationship
  • Critical to develop appropriate privacy, security and licensing terms
• Some potential sources
  • ERP and DRP operations data
  • EDI
  • Mobile Phone Traffic Data
Some Non-Technical Questions About Sources
With big data, critical to identify…
● Who are the key stakeholders in the big data source, and what are the key stakeholder incentives?
○ What key decisions does this information support for stakeholders? What are the gaps in data from the stakeholder perspective?
○ What are barriers associated with new sources of information?
○ Legal barriers – privacy, licenses
○ Economic barriers – value of data, analytics
○ Social/trust barriers
Potential Roles -- Infrastructure
Dissemination
• Discovery – catalog range of new statistics/indicators, sources
• Selection based on quality attributes
• Guide proper use
Durability – Data Stewardship
• Ensure long-term accessibility of big data
• Manage provenance, versioning
• Provide transparency of new indicators/statistics
Security & Confidentiality
• Libraries could be a trusted and accountable third party
• Store and integrate data from multiple sources
• Could develop expert implementation of privacy best practices
Potential Roles - Leadership
Advocacy
• Advocate for quality, transparency, replication, durable access
Standardization
• Develop new methods for big data management
• Identify “best practices” for replication, transparency, long-term access
• Standardize licenses for replication, reuse, preservation
Why Should Libraries Engage in Big Data?
Big data sources cross disciplines – libraries are multidisciplinary
Big data challenges include discovery, information management, information quality, governance – libraries are engaged in these standards and practices
Big data challenges privacy and security of users – libraries are trusted
Research requires durable access to a big evidence base – libraries take the long-term view for stewardship
Additional References
● Einav, Liran, and Jonathan Levin. "Economics in the age of big data." Science 346.6210 (2014): 1243089. http://www.sciencemag.org/content/346/6210/1243089.short
● Varian, Hal R. "Big data: New tricks for econometrics." The Journal of Economic Perspectives 28.2 (2014): 3-27. http://people.ischool.berkeley.edu/~hal/Papers/2013/ml.pdf
● Reimsbach-Kounatze, C. (2015), “The Proliferation of “Big Data” and Implications for Official Statistics and Statistical Agencies: A Preliminary Analysis”, OECD Digital Economy Papers, No. 245, OECD Publishing. http://dx.doi.org/10.1787/5js7t9wqzvg8-en
● Kriger, David S., et al. Freight Transportation Surveys. Vol. 410. Transportation Research Board, 2011. http://www.nap.edu/catalog/13627/nchrp-synthesis-410-freight-transportation-surveys
Questions?
E-mail: [email protected]
Web: informatics.mit.edu
Creative Commons License
This work, Managing Confidential information in research, by Micah Altman (http://redistricting.info), is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.