Internet Jones and the Raiders of the Lost Trackers: An Archaeological Study of Web Tracking from 1996 to 2016
Adam Lerner*, Anna Kornfeld Simpson*, Tadayoshi Kohno, Franziska Roesner
*Joint First Authors in Alphabetical Order
Third Party Web Tracking
1. Third party web tracking is important
2. We should be studying web tracking over time
What is Third Party Web Tracking?
tracker.com
I know that user "d04874f3" has visited these sites:
● news.google.com
● weather.com
● twitch.tv
● hulu.com
● imdb.com
● foxnews.com
● zappos.com
● honda.com
● jetblue.com
Being watched has chilling effects
Views and edits of privacy-related Wikipedia articles reduced after leaks of NSA surveillance activities*
People donate more for communal coffee when a picture of watching eyes is posted near the coffee**
*Penney, Jon. "Chilling Effects: Online Surveillance and Wikipedia Use." Berkeley Technology Law Journal (2016).
**Bateson M, Nettle D, Roberts G. Cues of being watched enhance cooperation in a real-world setting. Biology Letters. 2006;2(3):412-414. doi:10.1098/rsbl.2006.0509.
We should study web tracking over time
Timeline, 1996 to 2016 (2005 and 2009 marked). Prior measurement studies:
Krishnamurthy and Wills '09
Mayer and Mitchell '12
Roesner, Kohno, and Wetherall '12
Eubank et al. '13
Gomer et al. '13
Englehardt et al. '15
?
Time Travel
The Wayback Machine
~11 petabytes of web sites archived!
Includes JavaScript! HTTP headers! CSS! Images! Hosted on web servers!
The Wayback Machine enables us to perform retrospective web measurements
This Talk
A 20 year longitudinal study of third party web tracking, performed using archival web data
Abstract
Insight: Web archives exist (e.g., Wayback Machine)
Challenges: Web archives are sometimes 1. incomplete & 2. inconsistent
Tool: TrackingExcavator
Validation: Comparison of archival measurements to ground truth
Measurement Results: A picture of web tracking over the last 20 years
Background
A first-party is a domain to which a user goes intentionally, by typing a URL or clicking a link.
A third-party is a domain whose content is embedded in a first-party web page.
How does Third Party Web Tracking Work?
If a third party is able to link together a subset of a person's browsing history, we call this ability 3rd Party Web Tracking.
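The first-party and third-party definitions above can be sketched in code. This is a minimal illustration, not part of the talk: the helper names `site_of` and `is_third_party` are hypothetical, and the "last two host labels" shortcut is an assumption that real tools replace with the Public Suffix List.

```python
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Approximate the registered domain as the last two host labels.

    Assumption for illustration only: this shortcut mislabels hosts such as
    example.co.uk, which is why real tools consult the Public Suffix List.
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.lower().split(".")[-2:])


def is_third_party(page_url: str, request_url: str) -> bool:
    """A request is third-party when its domain differs from the page's domain."""
    return site_of(page_url) != site_of(request_url)


page = "http://www.theonion.com/"
print(is_third_party(page, "http://ads.tracker.com/ad.js"))         # True
print(is_third_party(page, "http://static.theonion.com/logo.png"))  # False
```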
1. The user visits theonion.com, which embeds an ad and a logo from tracker.com.
2. tracker.com answers the request: "Set this cookie: id=789". It starts a record: Browsing profile for user 789: theonion.com
3. The browser stores the cookie: tracker.com:id=789
4. The user visits cnn.com, which also embeds an ad and a logo from tracker.com.
5. The browser's request to tracker.com says: "My cookie is: id=789"
6. tracker.com updates its record: Browsing profile for user 789: theonion.com, cnn.com
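The cookie exchange walked through above can be simulated in a few lines. This is a sketch under stated assumptions: the `Tracker` and `Browser` classes are hypothetical stand-ins for tracker.com and the user's browser, and the starting id 789 is taken from the slides purely for illustration.

```python
import itertools
from collections import defaultdict


class Tracker:
    """Hypothetical third party (tracker.com in the slides)."""

    def __init__(self, first_id=789):
        self._next_id = itertools.count(first_id)
        self.profiles = defaultdict(list)  # cookie id -> first parties seen

    def handle_request(self, embedding_site, cookie=None):
        """Serve the embedded ad and return the cookie the browser should hold."""
        if cookie is None:
            cookie = str(next(self._next_id))  # "Set this cookie: id=789"
        self.profiles[cookie].append(embedding_site)
        return cookie


class Browser:
    """Hypothetical browser that keeps one id cookie per third-party domain."""

    def __init__(self):
        self.cookie_jar = {}

    def visit(self, site, tracker, tracker_domain="tracker.com"):
        sent = self.cookie_jar.get(tracker_domain)  # "My cookie is: id=789"
        self.cookie_jar[tracker_domain] = tracker.handle_request(site, sent)


tracker, browser = Tracker(), Browser()
browser.visit("theonion.com", tracker)
browser.visit("cnn.com", tracker)
print(dict(tracker.profiles))  # {'789': ['theonion.com', 'cnn.com']}
```

Because the browser returns the same cookie on every page that embeds tracker.com, the tracker can join both visits under one identifier.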
Contributions
● TrackingExcavator, our Tool
● Challenges of Retrospective Measurement
● Validation of Archival Data vs. Ground Truth
● Measurement Results: A 20 Year Picture of Web Tracking
Our Tool, TrackingExcavator, and the Challenges of Retrospective Measurements
TrackingExcavator
1. Input Munging
2. Automatic Browsing
3. Data Analysis & Visualization
What Makes this Hard?
● Cookies don't work in the archive
● Incomplete archives
● Inconsistent archives ("anachronisms")
Cookies don't work in the archive
Set-cookie: id=789
becomes...
X-Archive-Orig-Set-cookie: id=789
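Since the archive renames the header, the browser never stores the cookie, but the original value is still visible in the response. A minimal sketch of recovering it follows. The function name `simulate_cookies` and the header-list input format are assumptions for illustration, not TrackingExcavator's actual implementation.

```python
ARCHIVE_PREFIX = "x-archive-orig-"


def simulate_cookies(response_headers):
    """Collect name=value pairs from archived Set-Cookie headers.

    response_headers: list of (name, value) tuples as received from the archive.
    Cookie attributes (Path, Expires, ...) are ignored in this sketch.
    """
    jar = {}
    for name, value in response_headers:
        if name.lower() == ARCHIVE_PREFIX + "set-cookie":
            pair = value.split(";", 1)[0]       # drop attributes after the first ';'
            key, _, val = pair.partition("=")
            jar[key.strip()] = val.strip()
    return jar


headers = [
    ("Content-Type", "text/html"),
    ("X-Archive-Orig-Set-cookie", "id=789; Path=/"),
]
print(simulate_cookies(headers))  # {'id': '789'}
```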
TrackingExcavator
Cookie simulation
Set-cookie: id=789
Set-cookie: oatmeal=raisin
Set-cookie: chocolate=chip
Incomplete Archives
Fortunately, we can still count requests to missing resources!
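One way to read this point: a request is evidence of a third party even when the archive has no copy of the resource. The sketch below counts requests per host regardless of HTTP status. The log format and the name `count_requests_by_host` are assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_requests_by_host(request_log):
    """request_log: iterable of (url, http_status) pairs.

    Every request is counted, including those the archive answered with 404.
    """
    counts = Counter()
    for url, _status in request_log:
        counts[urlsplit(url).hostname] += 1
    return counts


log = [
    ("http://tracker.com/ad.js", 200),
    ("http://tracker.com/pixel.gif", 404),      # not archived, still observed
    ("http://other-tracker.net/beacon", 404),   # not archived, still observed
]
print(count_requests_by_host(log))
```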
Robots.txt files
http://doubleclick.net/robots.txt:
User-agent: *
Disallow: /search
Allow: /search/about
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Allow: /?hl=*&gws_rd=ssl$
Disallow: /?hl=*&*&gws_rd=ssl
Allow: /?gws_rd=ssl$
Allow: /?pt1=true$
Disallow: /imgres...
The Wayback Machine respects these preferences retroactively!
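To illustrate how a robots.txt rule can hide a resource, the sketch below checks a URL against a rule set with Python's standard `urllib.robotparser`. The rule set, the `tracker.example` host, and the `archive-bot` user agent are all hypothetical. The standard-library parser may not interpret the wildcard patterns shown in the slide the way a crawler does, so the example sticks to a plain prefix rule.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, agent: str = "archive-bot") -> bool:
    """Return True if the given robots.txt text permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical rule set: everything under /ads/ is off limits to all agents.
robots_txt = """User-agent: *
Disallow: /ads/
"""

print(allowed_by_robots(robots_txt, "http://tracker.example/ads/pixel.gif"))  # False
print(allowed_by_robots(robots_txt, "http://tracker.example/index.html"))     # True
```

Applied retroactively, a rule like this would make older snapshots under /ads/ unavailable as well, which is one source of the incompleteness described above.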
Inconsistent Archives
From https://web.archive.org/web/20151205072445/http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
[Screenshot with parts labeled 2008 and 2012]
Block non-archival web requests
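A minimal sketch of the blocking decision, assuming the crawler can inspect each outgoing request URL before it is sent. The function name `should_block` is hypothetical, and how the check is hooked into the browser is left out.

```python
from urllib.parse import urlsplit

ARCHIVE_HOST = "web.archive.org"


def should_block(request_url: str) -> bool:
    """Block any request that would leave the archive and hit the live web."""
    host = urlsplit(request_url).hostname or ""
    return host != ARCHIVE_HOST


print(should_block("https://web.archive.org/web/20080115000000/http://cnn.com/"))  # False
print(should_block("http://tracker.com/ad.js"))                                    # True
```

Blocking these requests keeps live, present-day content from being mixed into a measurement of an archived page.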
Timestamp filtering
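Wayback Machine URLs carry a 14-digit capture timestamp, so anachronistic resources can be filtered by comparing that timestamp with the page's. The sketch below assumes a 180-day window, which is a hypothetical threshold chosen for illustration. The function names and example URLs are also hypothetical.

```python
import re
from datetime import datetime, timedelta

# web.archive.org/web/<YYYYMMDDhhmmss>[modifier such as im_ or js_]/<original URL>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})[a-z_]*/(.+)$")


def archived_time(url):
    """Return the capture time encoded in a Wayback URL, or None if absent."""
    match = WAYBACK_RE.match(url)
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S") if match else None


def within_window(page_url, resource_url, window=timedelta(days=180)):
    """Keep a resource only if it was captured close in time to its page."""
    page_t, res_t = archived_time(page_url), archived_time(resource_url)
    if page_t is None or res_t is None:
        return False
    return abs(res_t - page_t) <= window


page = "https://web.archive.org/web/20080115000000/http://example.com/"
near = "https://web.archive.org/web/20080120000000im_/http://tracker.example/ad.gif"
far = "https://web.archive.org/web/20120601000000im_/http://tracker.example/ad.gif"
print(within_window(page, near))  # True
print(within_window(page, far))   # False
```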
Contributions
✓ TrackingExcavator, our Tool
✓ Challenges of Retrospective Measurement
● Validation of Archival Data vs. Ground Truth
● Measurement Results: A 20 Year Picture of Web Tracking
Validation of Archival Data vs. Ground Truth
UW NSDI 2012 paper* (and follow-up measurements) provided a taxonomy of web tracking and ground truth data over the last 5 years
*Roesner, Franziska, Tadayoshi Kohno, and David Wetherall. "Detecting and defending against third-party tracking on the web." Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2012.
Archival trends reflect ground truth trends
Contributions
✓ TrackingExcavator, our Tool
✓ Challenges of Retrospective Measurement
✓ Validation of Archival Data vs. Ground Truth
● Measurement Results: A 20 Year Picture of Web Tracking
Measurement Results
"Web Tracking has Increased Over Time"
1. Number of trackers
2. Type of tracking employed
3. Coverage of top trackers
The power of individual trackers has grown over time. Individual companies know more about your browsing history than they did before.
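One way to quantify how much a single tracker can see is its coverage. The sketch below assumes coverage means the fraction of measured sites on which the tracker appears. That definition, the function name, and the observation data are all hypothetical and are not results from the talk.

```python
def coverage(site_to_trackers, tracker):
    """Fraction of measured sites on which `tracker` appears."""
    if not site_to_trackers:
        return 0.0
    hits = sum(1 for trackers in site_to_trackers.values() if tracker in trackers)
    return hits / len(site_to_trackers)


# Hypothetical observations: first-party site -> third parties seen on it.
observations = {
    "site-a.example": {"google-analytics.com", "tracker.com"},
    "site-b.example": {"google-analytics.com"},
    "site-c.example": set(),
    "site-d.example": {"tracker.com"},
}
print(coverage(observations, "google-analytics.com"))  # 0.5
```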
google-analytics.com
Individual companies' choices are important to your privacy
Transparency is important
The "Worst" Days of the Web
https://www.engadget.com/2014/08/14/the-creator-of-the-pop-up-ad-says-sorry/, accessed Aug 3 2016
Popups enable stronger tracking
2004: Internet Explorer starts popup-blocking by default
2003 - 2004
30 third-party popups (top 500 sites)
The choices of browser manufacturers, even those unrelated to tracking, significantly influence how trackers behave
We should study web tracking over time
The Future
This Work
Adam, Anna, Yoshi, & Franzi say: "Thanks!"
Insight: Web archives exist (Wayback Machine)
Challenges: Web Archives are sometimes 1. incomplete & 2. inconsistent
Tool: TrackingExcavator
Validation: Comparison of archival measurements to ground truth
Measurement Results: A 20 year picture of third party web tracking
https://trackingexcavator.cs.washington.edu