Secondary data analysis with digital trace data
![Page 1: Secondary data analysis with digital trace data](https://reader035.vdocuments.us/reader035/viewer/2022081403/555583b5b4c9055f5f8b53b8/html5/thumbnails/1.jpg)
Secondary data analysis with digital trace data
Examples from FLOSS research
Andrea Wiggins
13 July 2011
![Page 2: Secondary data analysis with digital trace data](https://reader035.vdocuments.us/reader035/viewer/2022081403/555583b5b4c9055f5f8b53b8/html5/thumbnails/2.jpg)
Secondary Data Analysis
• Uses existing data produced or collected by someone else, usually for a different purpose
• Databases
• Repositories
• Surveys
• Emails
• Social networks
![Page 3: Secondary data analysis with digital trace data](https://reader035.vdocuments.us/reader035/viewer/2022081403/555583b5b4c9055f5f8b53b8/html5/thumbnails/3.jpg)
Digital Trace Data
• Records of activity (trace data) undertaken through an online information system (thus digital)
• Increasingly common in studies of online phenomena
• Large volumes of available data
• Can be complete: a census, not a sample
• May be more reliably recorded than other data
![Page 4: Secondary data analysis with digital trace data](https://reader035.vdocuments.us/reader035/viewer/2022081403/555583b5b4c9055f5f8b53b8/html5/thumbnails/4.jpg)
Characteristics
1. Found data (not produced for research)
2. Event-based data (not summary data)
3. Events occur over time, so it is longitudinal data
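A minimal sketch of what event-based, longitudinal records can look like, and how summary data can be derived from them (the reverse is not possible). The `TraceEvent` type, its field names, and the sample events are all hypothetical illustrations, not taken from any real system.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TraceEvent:
    """One recorded action. All field names are hypothetical."""
    actor: str
    action: str
    timestamp: datetime


# Made-up events standing in for records left behind by an online system.
events = [
    TraceEvent("alice", "post_message", datetime(2005, 1, 3, 9, 15)),
    TraceEvent("bob", "close_bug", datetime(2005, 1, 20, 14, 2)),
    TraceEvent("alice", "close_bug", datetime(2005, 2, 1, 11, 40)),
]


def events_per_month(events):
    """Collapse raw events into (year, month) counts, i.e. summary data."""
    return Counter((e.timestamp.year, e.timestamp.month) for e in events)


monthly = events_per_month(events)
```

Keeping the raw events means the same data can later be re-aggregated by week, by actor, or by action type.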
![Page 5: Secondary data analysis with digital trace data](https://reader035.vdocuments.us/reader035/viewer/2022081403/555583b5b4c9055f5f8b53b8/html5/thumbnails/5.jpg)
Requirements
• Understand the original data source
• How it was collected, potential problems
• Limitations of the sample
• What the data describe
• Match with appropriate analysis methods and measures
• New types of data may require new measures
• Theoretical coherence is very important
![Page 6: Secondary data analysis with digital trace data](https://reader035.vdocuments.us/reader035/viewer/2022081403/555583b5b4c9055f5f8b53b8/html5/thumbnails/6.jpg)
Advantages
• Data may be “complete”
• Usually no response bias (exception: cookies)
• May cover long periods of time and large groups
• Multiple different data types, but mostly textual
• Data are often easy to acquire
• APIs or scraping web pages (with caution)
• Databases, archives, or repositories of research data
• But remember: you usually get what you pay for!
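One way to exercise the caution mentioned for scraping is to check a site's `robots.txt` rules before requesting pages. The sketch below uses Python's standard `urllib.robotparser` on an in-memory rules file, so it runs offline; the rules text, the `research-bot` user agent, and the paths are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed_paths(robots_lines, user_agent, paths):
    """Return the subset of paths that the given robots.txt lines permit."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules supplied as a list of lines
    return [p for p in paths if parser.can_fetch(user_agent, p)]


# Hypothetical robots.txt content; a real one would be fetched from the site.
robots_txt = """\
User-agent: *
Disallow: /private/
""".splitlines()

ok = allowed_paths(
    robots_txt,
    "research-bot",  # hypothetical user agent name
    ["/projects/1", "/private/members"],
)
```

Respecting these rules does not replace reading the site's terms of use or rate-limiting requests.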
![Page 7: Secondary data analysis with digital trace data](https://reader035.vdocuments.us/reader035/viewer/2022081403/555583b5b4c9055f5f8b53b8/html5/thumbnails/7.jpg)
Disadvantages
• Often difficult to know limitations of data
• Data may be poorly documented
• Original creator may not be available for comment
• Volume of data can be overwhelming
• Sampling strategies needed, e.g., temporal, random
• Substantial time required for data preparation: 90% of effort
• Exceptions are everywhere and will break analyses, but can only be discovered through trial and error
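The two sampling strategies named above can be sketched as follows. This is an illustration under simple assumptions: events are `(timestamp, payload)` tuples, and the function names and sample data are hypothetical.

```python
import random
from datetime import datetime


def temporal_sample(events, start, end):
    """Keep events whose timestamp falls in the half-open range [start, end)."""
    return [e for e in events if start <= e[0] < end]


def random_sample(events, k, seed=0):
    """Draw up to k events without replacement; a fixed seed keeps it repeatable."""
    rng = random.Random(seed)
    return rng.sample(events, min(k, len(events)))


# Made-up data: one event on the first day of each month of 2005.
events = [(datetime(2005, month, 1), f"msg-{month}") for month in range(1, 13)]

first_quarter = temporal_sample(events, datetime(2005, 1, 1), datetime(2005, 4, 1))
subset = random_sample(events, 5)
```

Recording the seed and the window boundaries alongside the results makes the sample reproducible.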
![Page 8: Secondary data analysis with digital trace data](https://reader035.vdocuments.us/reader035/viewer/2022081403/555583b5b4c9055f5f8b53b8/html5/thumbnails/8.jpg)
Example: Email Networks
• Data source: email listservs for FLOSS projects
• Analysis approach: create social networks
• Within discussion threads, individuals are nodes, and links are reply-to messages
• Some conceptual issues for interpretation, choice of measures
• Technical challenges
• Temporal aggregation
• Identity resolution
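The network construction described above (individuals as nodes, reply-to messages as links) can be sketched with a small identity-resolution step. The message fields, the `ALIASES` table, and the addresses are hypothetical; real identity resolution usually needs far more than a lookup table.

```python
from collections import Counter

# Hypothetical alias table mapping several addresses to one person.
ALIASES = {
    "jsmith@example.org": "jsmith",
    "john.smith@example.com": "jsmith",
}


def resolve(address):
    """Normalize an address and map known aliases to a single identity."""
    addr = address.strip().lower()
    return ALIASES.get(addr, addr)


def reply_edges(messages):
    """Count directed (replier, original sender) links within threads."""
    sender_of = {m["id"]: resolve(m["sender"]) for m in messages}
    edges = Counter()
    for m in messages:
        parent = m.get("in_reply_to")
        if parent in sender_of:
            src, dst = sender_of[m["id"]], sender_of[parent]
            if src != dst:  # ignore replies to oneself
                edges[(src, dst)] += 1
    return edges


# Made-up thread: the first, third, and fourth messages come from one person.
messages = [
    {"id": "m1", "sender": "JSmith@Example.org", "in_reply_to": None},
    {"id": "m2", "sender": "ann@example.org", "in_reply_to": "m1"},
    {"id": "m3", "sender": "john.smith@example.com", "in_reply_to": "m2"},
    {"id": "m4", "sender": "jsmith@example.org", "in_reply_to": "m3"},
]
edges = reply_edges(messages)
```

Without the alias table, the same person would appear as three separate nodes and the self-reply in `m4` would be counted as a link.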
![Page 9: Secondary data analysis with digital trace data](https://reader035.vdocuments.us/reader035/viewer/2022081403/555583b5b4c9055f5f8b53b8/html5/thumbnails/9.jpg)
Temporal Aggregation
Figures from Howison et al., 2006
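A minimal sketch of temporal aggregation, assuming interactions are `(timestamp, sender, recipient)` tuples that get grouped into fixed-length windows, with one network per window. The function name, window lengths, and events are illustrative assumptions and are not taken from the cited figures.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def windowed_edges(events, start, window):
    """Group (timestamp, src, dst) events into consecutive windows.

    Returns {window_index: set of (src, dst) edges}; events before
    `start` are skipped.
    """
    buckets = defaultdict(set)
    for ts, src, dst in events:
        if ts < start:
            continue
        index = (ts - start) // window  # timedelta // timedelta gives an int
        buckets[index].add((src, dst))
    return dict(buckets)


# Made-up interactions spread over roughly three months.
events = [
    (datetime(2005, 1, 5), "a", "b"),
    (datetime(2005, 1, 25), "b", "c"),
    (datetime(2005, 2, 10), "c", "a"),
    (datetime(2005, 3, 15), "a", "b"),
]
start = datetime(2005, 1, 1)

by_30_days = windowed_edges(events, start, timedelta(days=30))
by_90_days = windowed_edges(events, start, timedelta(days=90))
```

The same events yield three sparse networks with 30-day windows and one denser network with a 90-day window, which is why the choice of window affects every network measure computed afterwards.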
![Page 10: Secondary data analysis with digital trace data](https://reader035.vdocuments.us/reader035/viewer/2022081403/555583b5b4c9055f5f8b53b8/html5/thumbnails/10.jpg)
Network Workflow
![Page 11: Secondary data analysis with digital trace data](https://reader035.vdocuments.us/reader035/viewer/2022081403/555583b5b4c9055f5f8b53b8/html5/thumbnails/11.jpg)
Network Results
Cleaning up before shutting down
• Observed anomalous patterns in trackers for both projects: periodic centralization spikes
• A single user makes batch bug closings (up to 279!)
– Fire’s (feature request) tracker housekeeping appears to be preparation for project closure
– Gaim’s tracker housekeeping was more regular and repeated
• Different levels of correlation between venues, suggesting different types of interactions
• User venues more decentralized than developer venues, reflecting greater number of participants
• Overall trend toward decentralization could be result of different influences
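To illustrate why batch activity by one user shows up as a centralization spike, here is a sketch of Freeman-style outdegree centralization on a directed edge list. The slides do not state which centralization measure was used, so treat this choice, the function name, and the toy networks as assumptions.

```python
def outdegree_centralization(edges):
    """Freeman-style outdegree centralization for a directed graph.

    Sums the gap between the largest outdegree and every node's outdegree,
    divided by (n - 1)**2, the value reached by a star in which one node
    links to all others. Duplicate edges and self-loops are ignored.
    """
    unique = {(s, d) for s, d in edges if s != d}
    nodes = {n for edge in unique for n in edge}
    n = len(nodes)
    if n < 2:
        return 0.0
    out = {node: 0 for node in nodes}
    for src, _dst in unique:
        out[src] += 1
    top = max(out.values())
    return sum(top - d for d in out.values()) / ((n - 1) ** 2)


# One user acting on items from four others: a star-shaped network.
batch_closing = [("admin", f"user{i}") for i in range(1, 5)]
# Three users responding to each other in a ring.
ring = [("a", "b"), ("b", "c"), ("c", "a")]

star_score = outdegree_centralization(batch_closing)
ring_score = outdegree_centralization(ring)
```

Computed per time window, a measure like this jumps toward 1 in any window dominated by a single user's housekeeping.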
![Page 12: Secondary data analysis with digital trace data](https://reader035.vdocuments.us/reader035/viewer/2022081403/555583b5b4c9055f5f8b53b8/html5/thumbnails/12.jpg)
Example: Classification
• Replication of success-tragedy classification
• Classification criteria originally drawn from interviews with community members
• Data extracted from repositories
• Technical challenges
• Merging data from two repositories
• Processing large volume of data in multiple steps
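A sketch of the merging step, assuming each repository export is a mapping from project name to a record and that names match after simple normalization. The function name, field names, and sample records are hypothetical; real repositories often need more careful matching.

```python
def merge_projects(repo_a, repo_b):
    """Join two {project_name: record} mappings on a normalized name.

    Returns (merged, only_in_a, only_in_b) so that unmatched projects
    are reported rather than silently dropped.
    """
    def key(name):
        return name.strip().lower()

    a = {key(k): v for k, v in repo_a.items()}
    b = {key(k): v for k, v in repo_b.items()}
    merged, only_a, only_b = {}, [], []
    for name in sorted(a.keys() | b.keys()):
        if name in a and name in b:
            merged[name] = {**a[name], **b[name]}
        elif name in a:
            only_a.append(name)
        else:
            only_b.append(name)
    return merged, only_a, only_b


# Made-up records from two hypothetical repositories.
repo_a = {"ProjectX": {"downloads": 120}, "ProjectY": {"downloads": 7}}
repo_b = {"projectx ": {"founded": "2003-05-01"}, "ProjectZ": {"founded": "2004-02-10"}}

merged, only_a, only_b = merge_projects(repo_a, repo_b)
```

Inspecting the two unmatched lists is a cheap way to find the exceptions that would otherwise break a later processing step.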
![Page 13: Secondary data analysis with digital trace data](https://reader035.vdocuments.us/reader035/viewer/2022081403/555583b5b4c9055f5f8b53b8/html5/thumbnails/13.jpg)
Variables
• Inputs: project names and 5 threshold values for classification tests, e.g. number of downloads
• Project statistics retrieved from repositories
• Founding date
• Data collection date
• Dates for all releases
• Number of downloads
• URL
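The variables listed above can be held in one record per project, from which threshold tests read derived quantities. The sketch below is an illustration only: the `ProjectStats` type, the derived measures, the threshold value, and the sample project are hypothetical and do not reproduce the actual classification tests.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ProjectStats:
    """Per-project statistics. Field names are hypothetical."""
    name: str
    founded: date
    collected: date
    releases: list = field(default_factory=list)
    downloads: int = 0
    url: str = ""

    def age_days(self):
        """Days between founding and data collection."""
        return (self.collected - self.founded).days

    def days_since_last_release(self):
        """Days from the latest release to data collection, or None."""
        if not self.releases:
            return None
        return (self.collected - max(self.releases)).days


def passes_download_test(stats, min_downloads):
    """Example threshold test; min_downloads is an input parameter."""
    return stats.downloads >= min_downloads


# Made-up project for illustration.
project = ProjectStats(
    name="example-project",
    founded=date(2004, 1, 1),
    collected=date(2006, 1, 1),
    releases=[date(2004, 6, 1), date(2005, 1, 1)],
    downloads=25,
    url="https://example.org/example-project",
)
```

Passing thresholds in as parameters, rather than hard-coding them, keeps a replication adjustable when the original criteria are revised.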
![Page 14: Secondary data analysis with digital trace data](https://reader035.vdocuments.us/reader035/viewer/2022081403/555583b5b4c9055f5f8b53b8/html5/thumbnails/14.jpg)
Classification workflow
![Page 15: Secondary data analysis with digital trace data](https://reader035.vdocuments.us/reader035/viewer/2022081403/555583b5b4c9055f5f8b53b8/html5/thumbnails/15.jpg)
Classification Results
| Class | Original | Our results | Difference |
|---|---|---|---|
| unclassifiable | 3 186 | 3 296 | +110 |
| II | 13 342 (12%) | 16 252 (14%) | +2 910 (+2%) |
| IG | 10 711 (10%) | 12 991 (11%) | +2 280 (+1%) |
| TI | 37 320 (35%) | 36 507 (31%) | -813 (-4%) |
| TG | 30 592 (28%) | 32 642 (28%) | +2 050 (0%) |
| SG | 15 782 (15%) | 16 045 (14%) | +263 (-1%) |
| other | 8 422 | 0 | |
| Total | 119 355 | 117 733 | |
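
The count differences and totals in the table can be rechecked with a few lines of arithmetic. The counts below are copied from the table; the variable names are only for illustration, and the percentage columns are not recomputed because the slide does not state their denominators.

```python
# Project counts per class, copied from the results table.
original = {
    "unclassifiable": 3186, "II": 13342, "IG": 10711,
    "TI": 37320, "TG": 30592, "SG": 15782, "other": 8422,
}
ours = {
    "unclassifiable": 3296, "II": 16252, "IG": 12991,
    "TI": 36507, "TG": 32642, "SG": 16045, "other": 0,
}

# Difference = replication count minus original count, per class.
differences = {cls: ours[cls] - original[cls] for cls in original}

total_original = sum(original.values())
total_ours = sum(ours.values())
```

The recomputed totals and per-class differences agree with the table, and the two totals differ by 1 622 projects.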
![Page 16: Secondary data analysis with digital trace data](https://reader035.vdocuments.us/reader035/viewer/2022081403/555583b5b4c9055f5f8b53b8/html5/thumbnails/16.jpg)
Thanks!
• Questions?