AAPOR - Comparing found data from social media and made data from surveys
DESCRIPTION
This presentation was for the 2014 AAPOR conference, and deals with specific components of how "big data" from social media is different from data acquired through surveys.
TRANSCRIPT
"When Are Big Data Methods Trustworthy for Social Measurement?"
Cliff Lampe (@clifflampe), Josh Pasek, Lauren Guggenheim, Fred Conrad
University of Michigan
Michael Schober
The New School for Social Research
Presenting on “Big Data”
• Cliff Lampe
– University of Michigan School of Information
– Social Scientist who uses some Big Data techniques
– NOT A REAL DATA SCIENTIST
– Background in survey research
Mostly publish in Computer Science conferences
CHI – Computer Human Interaction
KDD – Knowledge Discovery and Data Mining
WSDM – Web Search and Data Mining
Ironically Data-Free Presentation
Today we are presenting on methodological issues of Big Social Data and surveys. Not presenting new data.
First we describe Big Data and Big Social Data as terms.
Then we describe methodological considerations at the intersection of surveys and Big Social Data
There have been many hyperbolic claims about Big Data
Is Big Data going to replace other forms of social measurement, or is it too flawed to survive? (Hint: neither.)
What is Big Data?
Big Data started in the physical sciences
Big Data is increasingly being applied to social science questions
What counts as “big”?
LHC: 0.001% of sensors lead to 25 petabytes annually.
Wikipedia: 17 terabytes
Twitter: ~10 GB/day
How many observations needed to count as “big”?
Note: 100 million records not all that big.
Almost nobody who uses these techniques would use the term “big data”. Similar to surveys vs. polls.
Big Data is shorthand for a variety of techniques that include:
- Data capture
- Data storage
- Data analytics
- Search and retrieval
Challenges in “Big Data”
Capture
Curation
Storage
Search
Sharing
Transfer
Analysis
Visualization
Related terms:
Computational social science, data science, information access and retrieval, Web-scale data, data mining, machine learning, non-reactive data
Big Social Data: large data sets about humans that are collected from social interactions captured online, primarily in social media sites.
What are the characteristics of surveys and Big Social Data that define when they are complementary, supplementary, or orthogonal?
Bob Groves
“Three Eras of Survey Research”
Mick Couper
“Is the Sky Falling? New Technology, Changing Media, and the Future of Surveys”
Survey Research
80+ years of research and practice
Sampling procedures
Question design
Estimating precision of statistics
Practices in reducing survey error
Attempt to represent the population of interest with a sample
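As one illustration of "estimating precision of statistics", the sketch below computes the textbook normal-approximation margin of error for a proportion. It assumes a simple random sample and a 95% confidence level (z = 1.96); the function name `margin_of_error` and the example sample sizes are hypothetical and do not come from the presentation.

```python
import math


def margin_of_error(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval
    for a proportion estimated from a simple random sample of size n."""
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must be a proportion between 0 and 1")
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)


if __name__ == "__main__":
    # Illustrative sample sizes only; p_hat = 0.5 is the worst case.
    for n in (100, 1_000, 10_000):
        print(f"n={n:>6}: +/- {margin_of_error(0.5, n):.4f}")
```

Under these assumptions, quadrupling the sample size only halves the margin of error, which is one reason survey designers balance sample size against cost.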
Research Questions
• Do we see big social data and survey data telling us the same things about society? When and why might this happen?
• How do survey data and big social data compare on important dimensions?
• In what ways are the two fundamentally different from each other?
• How are their uses different from one another?
Highlighting 3 Areas of Concern
How participants understand the activity of responding or posting
Different motivations and communicative dynamics
Nature of the data
Different structure, users, and data properties
Practical, ethical, and analytic considerations
Participants’ Understanding
Participants’ Understanding
– Posting initiative or motivation
– Informed consent
– Ability to opt out
– Prior considerations
– User identity
– Perceived audience and social desirability
– Time pressure/synchrony
– Respondent burden
Participants’ Understanding
• Nature of perceived audience
– Survey: Interviewer, Organization, others in HH
– BSD: Groups of friends, acquaintances, public
• Social Desirability
– Survey: Avoid negative evaluations from researcher
– BSD: Manage impressions for their audience
• Scale of data
• Face threatening topics
Participants’ Understanding
• Identity of user
– Survey: Kept anonymous
– BSD: User-created persona. Multiple users on a single account, multiple accounts for one user, corporate users, etc.
• Prior Considerations
– Survey: May not have thought about issue
– BSD: Have thought about it, maybe not deeply
• Being asked vs caring to post
Nature of the Data
Nature of the Data
– Population coverage
– Sampled units
– Sampling
– Sample size
– Temporal properties
– Relevance to research topic
– Granularity of possible analyses
– Data structure
– Auxiliary information
Nature of the Data
• Sampling
– Surveys: Representative of population of interest (via probability sampling)
– BSD: Users/messages not the full population. User accounts are not always users. Frequency of posting among users varies
• Sample Size
– Surveys: Balance between large enough to make inference and low cost
– BSD: More users and posts than surveys. Limited by access/storage.
• Can size help overcome sampling/representativeness problems?
• The aggregation of social media does not necessarily map onto the collection of individual users in survey research
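The question of whether size can overcome representativeness problems can be explored with a toy simulation. The sketch below is not from the presentation: every number in it (population size, true opinion share, posting rates) is an invented assumption, and `simulate` is a hypothetical helper. It compares a large self-selected set of "posters" with a small simple random sample drawn from the same synthetic population.

```python
import random


def simulate(pop_size=200_000, true_share=0.40,
             post_rate_yes=0.30, post_rate_no=0.10,
             survey_n=1_000, seed=0):
    """Compare a large self-selected sample with a small random sample.

    All parameter values are illustrative assumptions, not real data.
    """
    rng = random.Random(seed)
    # Synthetic population: True means the person holds the opinion.
    population = [rng.random() < true_share for _ in range(pop_size)]

    # "Found" data: people decide for themselves whether to post, and
    # holders of the opinion are assumed to post more often.
    posters = [
        holds for holds in population
        if rng.random() < (post_rate_yes if holds else post_rate_no)
    ]

    # "Made" data: a simple random sample, as in a probability survey.
    survey = rng.sample(population, survey_n)

    return {
        "truth": sum(population) / pop_size,
        "found_n": len(posters),
        "found_estimate": sum(posters) / len(posters),
        "survey_n": survey_n,
        "survey_estimate": sum(survey) / survey_n,
    }


if __name__ == "__main__":
    for key, value in simulate().items():
        print(f"{key}: {value}")
```

In this toy setup the self-selected sample is far larger than the survey yet stays well away from the true share, because posting is correlated with the opinion being measured. Real social media data may behave differently; the sketch only shows that size alone does not remove selection bias.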
Nature of the Data
• Temporal properties:
– Surveys: Memory retrieval, measurement at discrete moments
– BSD: Posting on recent events, continuously
• Auxiliary data:
– Surveys: Paradata (# calls, behavior during interview)
– BSD: Geolocation, system activity, profile info
Practical, Ethical, and Analytic Considerations
Practical, Ethical, and Analytic Considerations
– Established research communities
– Consent to research/IRB
– Perception of research among public
– Costs to researchers
– Data ownership
– Adjustments for non-representativeness
– Stability of data source and adjustments
– Updating models in changing environment
– Users and impact
Practical Considerations
• Adjustments for non-representativeness
– Surveys: Well developed, weighting
– BSD: No standard use, depends on style of analysis, may not be done if using certain techniques
• Ethical issues
– Surveys: Explicit consent, regulated by government/IRB
– BSD: Unaware of terms in user agreement, inconsistently regulated by IRBs
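For readers unfamiliar with survey weighting, the sketch below shows one simple form of it, post-stratification on a single grouping variable. It is a minimal illustration under stated assumptions, not the presenters' procedure: the function `poststratify`, the group labels, and the population shares in the usage note are all hypothetical.

```python
from collections import Counter


def poststratify(respondents, population_shares):
    """Weighted mean of a 0/1 outcome after post-stratification.

    respondents: list of (group, value) pairs from the sample.
    population_shares: dict mapping group -> known population share.
    Each respondent is weighted by population share / sample share
    of their group, so over-represented groups count for less.
    """
    counts = Counter(group for group, _ in respondents)
    if set(counts) != set(population_shares):
        raise ValueError("sample groups and population groups must match")
    n = len(respondents)
    weights = {g: population_shares[g] / (counts[g] / n) for g in counts}
    weighted_sum = sum(weights[g] * value for g, value in respondents)
    total_weight = sum(weights[g] for g, _ in respondents)
    return weighted_sum / total_weight
```

As a usage example with invented numbers: if a sample has 80 "young" respondents (60 answering yes) and 20 "old" respondents (5 answering yes), the unweighted estimate is 0.65. With assumed population shares of 0.5 for each group, `poststratify` returns 0.5. Weighting of this kind needs known population totals for the grouping variable, which are often unavailable for social media accounts.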
Practical Considerations
• Perception of research/Legitimacy
– Surveys: fatigue, falling response rates, confusion about legitimacy
– BSD: not considered while posting, but concerns over surveillance
YOU’RE SLOW AND EXPENSIVE!
YOU AREN’T REPRESENTATIVE!
Conclusion
We need to stop arguing about the wrong things.
We need a systematic agenda of research looking at the intersection of surveys and Big Social Data.
[email protected]
[email protected]
@clifflampe