Post on 20-Aug-2020

Internet Jones and the Raiders of the Lost Trackers: An Archaeological Study of Web Tracking from 1996 to 2016

Adam Lerner*, Anna Kornfeld Simpson*, Tadayoshi Kohno, Franziska Roesner

*Joint First Authors in Alphabetical Order

Third Party Web Tracking

1. Third party web tracking is important
2. We should be studying web tracking over time

What is Third Party Web Tracking?

tracker.com: I know that user "d04874f3" has visited these sites:

● news.google.com
● weather.com
● twitch.tv
● hulu.com
● imdb.com
● foxnews.com
● zappos.com
● honda.com
● jetblue.com

Being watched has chilling effects

Views and edits of privacy-related Wikipedia articles reduced after leaks of NSA surveillance activities*

People donate more for communal coffee when a picture of watching eyes is posted near the coffee**

*Penney, Jon. "Chilling Effects: Online Surveillance and Wikipedia Use." Berkeley Technology Law Journal (2016).

**Bateson M, Nettle D, Roberts G. Cues of being watched enhance cooperation in a real-world setting. Biology Letters. 2006;2(3):412-414. doi:10.1098/rsbl.2006.0509.

We should study web tracking over time

Timeline, 1996 to 2016 (axis marks: 1996, 2005, 2009, 2016)

● Krishnamurthy and Wills '09
● Mayer and Mitchell '12
● Roesner, Kohno, and Wetherall '12
● Eubank et al. '13
● Gomer et al. '13
● Englehardt et al. '15

?

Time Travel

The Wayback Machine

~11 petabytes of web sites archived!

Includes JavaScript! HTTP headers! CSS! Images! Hosted on web servers!

The Wayback Machine enables us to perform retrospective web measurements

This Talk

A 20 year longitudinal study of third party web tracking, performed using archival web data

Abstract

Insight: Web archives exist (e.g., Wayback Machine)

Challenges: Web archives are sometimes 1. incomplete & 2. inconsistent

Tool: TrackingExcavator

Validation: Comparison of archival measurements to ground truth

Measurement Results: A picture of web tracking over the last 20 years

Background

How does Third Party Web Tracking Work?

A first party is a domain to which a user goes intentionally, by typing a URL or clicking a link.

A third party is a domain whose content is embedded in a first-party web page.

If a third party is able to link together a subset of a person's browsing history, we call this ability Third Party Web Tracking.

Example: tracker.com serves an ad and a logo that are embedded in first-party pages.

1. The user visits theonion.com, which embeds the ad and logo from tracker.com.
2. tracker.com responds: "Set this cookie: id=789" and starts a browsing profile for user 789: theonion.com
3. The browser stores the cookie tracker.com:id=789.
4. The user visits cnn.com, which also embeds the ad and logo from tracker.com.
5. The browser tells tracker.com: "My cookie is: id=789"
6. tracker.com updates the browsing profile for user 789: theonion.com, cnn.com
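The cookie exchange in this example can be sketched as a small simulation. This is only an illustration of the mechanism, not code from the talk: the `Tracker` and `Browser` classes and their methods are hypothetical names, and real browsers and trackers handle many more cases (expiry, paths, multiple cookies).

```python
import itertools


class Tracker:
    """Toy third-party tracker that links page visits through a cookie ID."""

    def __init__(self, first_id=789):
        self._next_id = itertools.count(first_id)
        self.profiles = {}  # cookie ID -> list of first-party sites seen

    def handle_request(self, first_party, cookie_id=None):
        """Serve embedded content; return the cookie ID the browser should hold."""
        if cookie_id is None:
            cookie_id = next(self._next_id)  # "Set this cookie: id=789"
        self.profiles.setdefault(cookie_id, []).append(first_party)
        return cookie_id


class Browser:
    """Toy browser that stores one cookie per third-party domain."""

    def __init__(self):
        self.cookie_jar = {}  # third-party domain -> cookie ID

    def visit(self, first_party, tracker_domain, tracker):
        sent = self.cookie_jar.get(tracker_domain)  # "My cookie is: id=789"
        self.cookie_jar[tracker_domain] = tracker.handle_request(first_party, sent)


tracker = Tracker()
browser = Browser()
browser.visit("theonion.com", "tracker.com", tracker)
browser.visit("cnn.com", "tracker.com", tracker)
print(tracker.profiles)  # {789: ['theonion.com', 'cnn.com']}
```

Because the browser returns the same cookie on the second visit, the tracker can append cnn.com to the profile it already holds for user 789.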

Contributions

● TrackingExcavator, our Tool
● Challenges of Retrospective Measurement
● Validation of Archival Data vs. Ground Truth
● Measurement Results: A 20 Year Picture of Web Tracking

Our Tool, TrackingExcavator, and the Challenges of Retrospective Measurements

TrackingExcavator

1. Input Munging
2. Automatic Browsing
3. Data Analysis & Visualization

What Makes this Hard?

● Cookies don't work in the archive
● Incomplete archives
● Inconsistent archives ("anachronisms")

Cookies don't work in the archive

Set-cookie: id=789

becomes...

X-Archive-Orig-Set-cookie: id=789

TrackingExcavator

Cookie simulation
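One way to picture cookie simulation is to read the renamed `X-Archive-Orig-Set-cookie` headers and rebuild the cookies the original response would have set. The sketch below is an assumption about how this could be done, not TrackingExcavator's actual code; `simulate_cookies` and the sample header list are hypothetical.

```python
from http.cookies import SimpleCookie

ARCHIVE_PREFIX = "x-archive-orig-"


def simulate_cookies(archived_headers):
    """Return the cookies the original response would have set.

    archived_headers is a list of (name, value) pairs as served by the archive.
    """
    jar = {}
    for name, value in archived_headers:
        # Header names are case-insensitive, so compare in lowercase.
        if name.lower() == ARCHIVE_PREFIX + "set-cookie":
            parsed = SimpleCookie()
            parsed.load(value)
            for key, morsel in parsed.items():
                jar[key] = morsel.value
    return jar


# Hypothetical headers from one archived response.
headers = [
    ("Content-Type", "text/html"),
    ("X-Archive-Orig-Set-cookie", "id=789"),
    ("X-Archive-Orig-Set-cookie", "oatmeal=raisin; Path=/"),
]
print(simulate_cookies(headers))  # {'id': '789', 'oatmeal': 'raisin'}
```

The browser itself never stores these cookies, because it only acts on a real `Set-cookie` header; the simulation recovers what would have been stored on the live web.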

What Makes this Hard?

● Cookies are weird in archival environments
● Incomplete archives
● Inconsistent archives ("anachronisms")

Set-cookie: id=789

Set-cookie: oatmeal=raisin

Set-cookie: chocolate=chip

Incomplete Archives

Fortunately, we can still count requests to missing resources!
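The point is that a request still reveals which third party a page tried to contact, even when the archive returns nothing for it. A minimal sketch, assuming a request log of (URL, status) pairs; the function name and log format are hypothetical, and comparing hostnames is a simplification of proper registered-domain matching.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_third_party_requests(first_party_host, requests):
    """Count requests per third-party host, whatever the response status.

    requests is a list of (url, status) pairs; status may be 404 when the
    archive never captured the resource.
    """
    counts = Counter()
    for url, _status in requests:
        host = urlsplit(url).hostname or ""
        same_site = host == first_party_host or host.endswith("." + first_party_host)
        if not same_site:
            counts[host] += 1  # counted even if the resource is missing
    return counts


# Hypothetical request log for one archived page load.
log = [
    ("http://theonion.com/index.html", 200),
    ("http://static.theonion.com/site.css", 200),
    ("http://tracker.com/ad.js", 404),
    ("http://tracker.com/logo.png", 200),
]
counts = count_third_party_requests("theonion.com", log)
print(counts)  # Counter({'tracker.com': 2})
```

The missing `ad.js` cannot run, so anything it would have loaded in turn is lost, but the request itself is still counted.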

Robots.txt files

http://doubleclick.net/robots.txt:

User-agent: *
Disallow: /search
Allow: /search/about
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Allow: /?hl=*&gws_rd=ssl$
Disallow: /?hl=*&*&gws_rd=ssl
Allow: /?gws_rd=ssl$
Allow: /?pt1=true$
Disallow: /imgres
...

The Wayback Machine respects these preferences retroactively!
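To see what such a rule means for a single URL, Python's standard `urllib.robotparser` can evaluate a set of robots.txt lines. This is a sketch with hypothetical rules and URLs; it says nothing about how the Wayback Machine implements its exclusion, and the standard-library parser does simple prefix matching, so it does not interpret the `*` and `$` patterns in the file shown above.

```python
from urllib.robotparser import RobotFileParser


def excluded_by_robots(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt rules disallow fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return not parser.can_fetch(user_agent, url)


# Hypothetical rules, simpler than the file on the slide.
rules = ["User-agent: *", "Disallow: /search"]
print(excluded_by_robots(rules, "http://example.com/search?q=cookies"))  # True
print(excluded_by_robots(rules, "http://example.com/about"))  # False
```

If a site adds a Disallow rule today, resources under that path become unavailable in the archive for past years as well, which is one source of incomplete archives.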

Inconsistent Archives

From https://web.archive.org/web/20151205072445/http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html

[Screenshot labels: 2008, 2012]

Block non-archival web requests
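A minimal sketch of the idea: while browsing an archived page, allow only requests that stay inside the archive and block anything that would reach the live web. The host set and function names are assumptions for illustration, not the tool's real configuration.

```python
from urllib.parse import urlsplit

ARCHIVE_HOSTS = {"web.archive.org"}  # assumption: the only archival host in use


def is_archival_request(url):
    """Return True if the request stays inside the archive."""
    host = urlsplit(url).hostname or ""
    return host in ARCHIVE_HOSTS


def filter_requests(urls):
    """Split requests into those to allow and those to block."""
    allowed = [u for u in urls if is_archival_request(u)]
    blocked = [u for u in urls if not is_archival_request(u)]
    return allowed, blocked


urls = [
    "https://web.archive.org/web/20151205072445/http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html",
    "http://tracker.com/ad.js",  # hypothetical request that escaped to the live web
]
allowed, blocked = filter_requests(urls)
print(blocked)  # ['http://tracker.com/ad.js']
```

Blocking these requests keeps present-day content from mixing into a measurement of an archived page.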

Timestamp filtering
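Wayback Machine URLs carry a 14-digit capture timestamp, so one way to filter anachronisms is to drop resources captured too far from the date being studied. The sketch below assumes a 180-day window purely for illustration; the window size and the function names are not taken from the talk.

```python
import re
from datetime import datetime, timedelta

WAYBACK_TIMESTAMP = re.compile(r"/web/(\d{14})")


def capture_time(archive_url):
    """Extract the 14-digit capture timestamp from a Wayback Machine URL."""
    match = WAYBACK_TIMESTAMP.search(archive_url)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


def within_window(archive_url, target, window):
    """Return True if the capture time is within window of target."""
    captured = capture_time(archive_url)
    return captured is not None and abs(captured - target) <= window


target = datetime(2008, 6, 1)
window = timedelta(days=180)  # assumed window, for illustration only
same_era = "https://web.archive.org/web/20080615000000/http://example.com/"
zombie = "https://web.archive.org/web/20120301000000/http://example.com/ad.js"
print(within_window(same_era, target, window))  # True
print(within_window(zombie, target, window))  # False
```

A resource captured in 2012 that appears inside a 2008 page is the kind of inconsistency this filter is meant to exclude.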

Contributions

✓ TrackingExcavator, our Tool
✓ Challenges of Retrospective Measurement
● Validation of Archival Data vs. Ground Truth
● Measurement Results: A 20 Year Picture of Web Tracking

Validation of Archival Data

Validation of Archival Data vs. Ground Truth

UW NSDI 2012 paper* (and follow-up measurements) provided a taxonomy of web tracking and ground truth data over the last 5 years

*Roesner, Franziska, Tadayoshi Kohno, and David Wetherall. "Detecting and defending against third-party tracking on the web." Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2012.

Archival trends reflect ground truth trends

Contributions

✓ TrackingExcavator, our Tool
✓ Challenges of Retrospective Measurement
✓ Validation of Archival Data vs. Ground Truth
● Measurement Results: A 20 Year Picture of Web Tracking

Measurement Results

"Web Tracking has Increased Over Time"

1. Number of trackers
2. Type of tracking employed
3. Coverage of top trackers

The power of individual trackers has grown over time. Individual companies know more about your browsing history than they did before.

google-analytics.com

Individual companies' choices are important to your privacy

Transparency is important

The "Worst" Days of the Web

https://www.engadget.com/2014/08/14/the-creator-of-the-pop-up-ad-says-sorry/, accessed Aug 3 2016

Popups enable stronger tracking

2004: Internet Explorer starts popup-blocking by default

2003 - 2004

30 third-party popups (top 500 sites)

The choices of browser manufacturers, even those unrelated to tracking, significantly influence the way trackers behave

We should study web tracking over time

Timeline, 1996 to 2016 (axis marks: 1996, 2005, 2009, 2016)

● Krishnamurthy and Wills '09
● Mayer and Mitchell '12
● Roesner, Kohno, and Wetherall '12
● Eubank et al. '13
● Gomer et al. '13
● Englehardt et al. '15

The Future

This Work

Adam, Anna, Yoshi, & Franzi say: "Thanks!"

Insight: Web archives exist (Wayback Machine)

Challenges: Web Archives are sometimes 1. incomplete & 2. inconsistent

Tool: TrackingExcavator

Validation: Comparison of archival measurements to ground truth

Measurement Results: A 20 year picture of third party web tracking

https://trackingexcavator.cs.washington.edu