Statistical Identification of Encrypted Web-Browsing Traffic. Qixiang Sun (Stanford University); Daniel R. Simon, Yi-Min Wang, Wilf Russell, Venkata N. Padmanabhan, Lili Qiu (Microsoft Research)

Post on 25-Feb-2016

Page 1: Statistical Identification of Encrypted Web-Browsing Traffic

Statistical Identification of Encrypted Web-Browsing Traffic

Qixiang Sun, Stanford University

Daniel R. Simon, Yi-Min Wang, Wilf Russell, Venkata N. Padmanabhan, Lili Qiu, Microsoft Research

Page 2: Statistical Identification of Encrypted Web-Browsing Traffic

Outline

• Motivation & Problem
• Intuition
• Hypothetical Attacker
• Attacker’s Success Rate
• Countermeasures
• Conclusion

Page 3: Statistical Identification of Encrypted Web-Browsing Traffic

Anonymous Web Browsing

• Protect personal information from Attacker’s Inference
  – Medical (Online support group)
  – Questionable Activities

• Question: Is this REALLY anonymous?

[Figure: browsing traffic relayed through a chain of routers R1, R2, R3, R4]

Page 4: Statistical Identification of Encrypted Web-Browsing Traffic

What’s Different?

In anonymous Web browsing:

– The chain of routers is used for both sending and receiving data.
  Consequence: HTTP requests and responses can be linked!

– The target Web pages are publicly accessible.
  Consequence: Responses are known!

Implication: The first link/router is an exploitable weakness.

Page 5: Statistical Identification of Encrypted Web-Browsing Traffic

What Information is Available?

[Figure: HTTP GET requests and responses passing between the browser and the first router, R1, of the chain R1, R2, R3, R4]

• Number of objects

• Object sizes

• Ordering of the objects

• Delay between packets

Page 6: Statistical Identification of Encrypted Web-Browsing Traffic

Intuition

• Number of objects and object sizes are sufficient to identify a Web page!

– On average, a Web page has 11 objects with each object yielding 8.4 bits of information

$8.4 \times 11 - \log_2(11!) \approx 67$ bits $\approx 10^{20}$ possibilities!!

– Currently, there are about $10^9$ Web pages
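
The estimate on this slide can be checked with a few lines of arithmetic. The sketch below assumes that the log2(11!) term discounts the ordering of the 11 objects, since the pattern is treated as an unordered collection of sizes; the function name `pattern_entropy_bits` is illustrative and not from the talk.

```python
import math


def pattern_entropy_bits(num_objects: int, bits_per_object: float) -> float:
    """Bits of information in an unordered collection of object sizes.

    Each object contributes bits_per_object bits; log2(n!) bits are
    subtracted because the n objects are counted without regard to order.
    """
    return num_objects * bits_per_object - math.log2(math.factorial(num_objects))


# Figures quoted on the slide: 11 objects per page, 8.4 bits per object.
bits = pattern_entropy_bits(11, 8.4)
possibilities = 2 ** bits

print(f"{bits:.1f} bits, about {possibilities:.1e} possible patterns")
print(f"ratio to 1e9 Web pages: about {possibilities / 1e9:.1e}")
```

The result lands a little above 67 bits, which is on the order of 10^20 patterns, many orders of magnitude more than the roughly 10^9 pages cited.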

Page 7: Statistical Identification of Encrypted Web-Browsing Traffic

A Hypothetical Attacker

[Figure: block diagram of the attacker. Components shown: list of target sensitive-site URLs; programmatic access to URL & traffic recording; traffic pattern construction & database update; traffic pattern database; history; similarity scores calculation; decision module (negative / positive); R1; traffic recording & pattern construction; traffic pattern; browser]

Page 8: Statistical Identification of Encrypted Web-Browsing Traffic

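Guts of the Pattern Matching

• Given two multisets of object sizes $S_1$ and $S_2$:

$$\mathrm{Sim}(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$$

• Decision module uses an absolute threshold.

[Figure: the traffic pattern and the traffic pattern database feed the similarity scores calculation, which feeds the decision module]

For example, with $S_1 = \{3\,\text{KB}, 3\,\text{KB}, 5\,\text{KB}\}$ and $S_2 = \{3\,\text{KB}, 5\,\text{KB}, 5\,\text{KB}\}$:

$$\mathrm{Sim}(S_1, S_2) = \frac{|\{3\,\text{KB}, 5\,\text{KB}\}|}{|\{3\,\text{KB}, 3\,\text{KB}, 5\,\text{KB}, 5\,\text{KB}\}|} = \frac{2}{4} = 0.5$$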
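
A minimal sketch of this similarity score, assuming multiset intersection takes the minimum count of each size and multiset union takes the maximum, which is what the worked example implies. The function name `similarity` and the convention of returning 0.0 for two empty patterns are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable


def similarity(s1: Iterable[int], s2: Iterable[int]) -> float:
    """Multiset Jaccard score: |S1 intersect S2| / |S1 union S2|."""
    c1, c2 = Counter(s1), Counter(s2)
    intersection = sum((c1 & c2).values())  # minimum count per size
    union = sum((c1 | c2).values())         # maximum count per size
    # Assumed convention: two empty patterns score 0.0 rather than raising.
    return intersection / union if union else 0.0


# The example from the slide, with sizes in KB.
score = similarity([3, 3, 5], [3, 5, 5])
print(score)
```

`Counter` supports `&` and `|` directly as multiset intersection and union, so no manual counting is needed.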

Page 9: Statistical Identification of Encrypted Web-Browsing Traffic

Experiment Setup

• Approximately 100,000 Web pages in total (URLs obtained from the Open Directory Project).

• The hypothetical attacker chooses about 2200 pages as target pages.

• Goal: Can these 2200 pages be identified without causing many false positives?

Page 10: Statistical Identification of Encrypted Web-Browsing Traffic

What is a Success and Failure?

• Successful Identification:
  – A target page passes the similarity threshold and is not confused with other pages in the target set.

• False Positive:
  – A non-target page is incorrectly identified as one of the target pages.

• Potential False Positive:
  – A page passes the similarity threshold when compared with a single selected target page.
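
These definitions can be sketched as two small functions. This is an illustration under assumptions rather than the authors' implementation: "passes the threshold" is taken to mean a score greater than or equal to the threshold, patterns are plain lists of object sizes, and the names `matches` and `potential_false_positives` and the toy data are hypothetical.

```python
from collections import Counter


def sim(s1, s2):
    """Multiset Jaccard similarity between two lists of object sizes."""
    c1, c2 = Counter(s1), Counter(s2)
    union = sum((c1 | c2).values())
    return sum((c1 & c2).values()) / union if union else 0.0


def matches(observed, targets, threshold):
    """Names of target pages whose stored pattern passes the threshold.

    Exactly one name suggests an unambiguous identification; more than one
    means the page is confused with other pages in the target set.
    """
    return [name for name, pattern in targets.items()
            if sim(observed, pattern) >= threshold]


def potential_false_positives(target_pattern, other_patterns, threshold):
    """Count the other pages that pass the threshold against one target."""
    return sum(1 for p in other_patterns if sim(target_pattern, p) >= threshold)


# Toy data (hypothetical sizes in KB).
targets = {"page_a": [3, 3, 5], "page_b": [7, 9, 12, 12]}
hits = matches([3, 5, 5], targets, threshold=0.5)
k = potential_false_positives(targets["page_a"],
                              [[3, 5, 5], [1, 2], [3, 3, 5, 8]],
                              threshold=0.5)
print(hits, k)
```

In the vocabulary of the later slides, a target page with `k` potential false positives is K-identifiable with K = `k`, so a 0-identifiable page is one that no other page resembles.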

Page 11: Statistical Identification of Encrypted Web-Browsing Traffic

Attacker’s Success Rate

• A threshold of 0.5 is sufficient.

[Figure: % of pages (0 to 90) versus absolute threshold (0.2 to 0.9). Series: identification rate (2191 target pages); actual false-positive rate (98496 non-target pages). Annotations on the plot: 80.4%, 2.1%, "Is this small enough?"]

Page 12: Statistical Identification of Encrypted Web-Browsing Traffic

A Detailed Look Inside

• False-positives are NOT generated uniformly!

[Figure: % of target pages (70 to 100) versus # of potential false positives (0 to 1200). Annotations on the plot: 0-identifiable pages; HTTP 404s; common-looking pages]

Page 13: Statistical Identification of Encrypted Web-Browsing Traffic

Dynamism in Web Pages

• Most pages are relatively static

One-day-old pattern database is sufficient

[Figure: % of target pages (0 to 100) versus self-similarity score (0 to 1)]

Page 14: Statistical Identification of Encrypted Web-Browsing Traffic

Countermeasures

• Padding
  – Individual objects
  – Add random-sized objects

• Morphing
  – Pipelining the HTTP GET requests
  – Pre-fetching

• Mimicking
  – Common templates or Web-hosting services

Page 15: Statistical Identification of Encrypted Web-Browsing Traffic

Padding Object Size

• Linear – Nearest multiple of padding size
• Exponential – Nearest power of 2

[Figure: % of 0-identifiable pages (0 to 60) versus minimum object size (128 to 16384). Series: linear padding; exponential padding]
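
The two padding rules can be sketched as follows. The sketch assumes that padding always rounds up, since an object cannot be made smaller, and that exponential padding starts from a minimum size such as the 128 to 16384 values on the plot axis; the slide itself only says "nearest". The function names and default values are illustrative.

```python
def pad_linear(size: int, padding_size: int) -> int:
    """Round size up to the next multiple of padding_size (assumed round-up)."""
    if size <= 0:
        return padding_size
    return -(-size // padding_size) * padding_size  # ceiling division


def pad_exponential(size: int, minimum: int = 128) -> int:
    """Round size up to minimum * 2**k for the smallest k that fits."""
    padded = minimum
    while padded < size:
        padded *= 2
    return padded


# Object sizes in bytes (hypothetical values).
for size in (100, 300, 5000):
    print(size, pad_linear(size, 128), pad_exponential(size, 128))
```

Linear padding keeps the overhead below one padding unit per object, while exponential padding collapses sizes into far fewer buckets at the cost of up to nearly doubling an object.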

Page 16: Statistical Identification of Encrypted Web-Browsing Traffic

Padding Random Objects

[Figure: % of 0-identifiable pages (0 to 45) versus absolute threshold (0.3 to 0.7). Series: multiple of 10]

Page 17: Statistical Identification of Encrypted Web-Browsing Traffic

Two-chunk Pipelining

• Approximately 36% of the target pages are 0-identifiable.

– Very close to the theoretical limit of 1/e (assuming traffic patterns are random)

• Implication: Can harness the total entropy in the Web page traffic patterns.
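
One way a 1/e limit can arise is a balls-in-bins argument. The slides do not spell out the model, so the following is an assumption made for illustration: if n pages each land independently and uniformly on one of n equally likely patterns, the chance that a given page shares its pattern with no other page is (1 - 1/n)^(n-1), which tends to 1/e as n grows.

```python
import math


def alone_fraction(n: int) -> float:
    """Probability that a page is alone in its bin when n pages are
    thrown uniformly at random into n bins (assumed model)."""
    return (1 - 1 / n) ** (n - 1)


for n in (10, 100, 2191, 100_000):
    print(n, round(alone_fraction(n), 4))

print("1/e =", round(1 / math.e, 4))
```

Under this assumed model the fraction settles near 0.368, which is consistent with the roughly 36% of 0-identifiable pages reported on the slide.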

Page 18: Statistical Identification of Encrypted Web-Browsing Traffic

One-chunk Pipelining

[Figure: % of K-identifiable pages (0 to 12) versus K, the number of potential false positives (0 to 12)]

Page 19: Statistical Identification of Encrypted Web-Browsing Traffic

Conclusion

• Encrypted Web browsing can be identified by the target page’s “unique” traffic pattern.

Page 20: Statistical Identification of Encrypted Web-Browsing Traffic
Page 21: Statistical Identification of Encrypted Web-Browsing Traffic

[Figure: Linear Padding. % of identifiable sites (0 to 70) versus padding bucket size. Series: 0-identifiable; 1-identifiable; 2-identifiable]

Page 22: Statistical Identification of Encrypted Web-Browsing Traffic

[Figure: Exponential Padding. % of identifiable sites (0 to 40) versus minimum padding size (128 to 16384). Series: 0-identifiable; 1-identifiable; 2-identifiable]

Page 23: Statistical Identification of Encrypted Web-Browsing Traffic

Pad Random Objects

[Figure: % of identifiable sites (0 to 45) versus absolute threshold (0.3 to 0.7). Series: multiple of 10; multiple of 15; multiple of 20]