statistical identification of encrypted web-browsing traffic

Post on 25-Feb-2016



Statistical Identification of Encrypted Web-Browsing Traffic

Qixiang Sun

Stanford University

Daniel R. Simon, Yi-Min Wang, Wilf Russell, Venkata N. Padmanabhan, Lili Qiu

Microsoft Research

Outline

• Motivation & Problem

• Intuition

• Hypothetical Attacker

• Attacker’s Success Rate

• Countermeasures

• Conclusion

Anonymous Web Browsing

• Protect personal information from Attacker’s Inference

– Medical (Online support group)

– Questionable Activities

• Question: Is this REALLY anonymous?

[Figure: chain of routers R1, R2, R3, R4]

What’s Different?

In anonymous Web browsing:

– The chain of routers is used for both sending and receiving data

Can link HTTP requests and responses!

– The target Web pages are publicly accessible

Responses are known!

Implication: The first link/router is an exploitable weakness.

What Information is Available?

[Figure: HTTP GET requests and responses exchanged between the browser and the 1st router]

• Number of objects

• Object sizes

• Ordering of the objects

• Delay between packets


Intuition

• Number of objects and object sizes are sufficient to identify a Web page!

– On average, a Web page has 11 objects with each object yielding 8.4 bits of information

$8.4 \times 11 - \log_2(11!) \approx 67$ bits $\approx 10^{20}$ possibilities!!

– Currently, there are about $10^9$ Web pages
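The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the reading that log2(11!) is subtracted because the order of the objects is ignored is an assumption, since the slide does not state the reason.

```python
import math

avg_objects = 11        # average number of objects per page (from the slide)
bits_per_object = 8.4   # information carried by each object size (from the slide)

# Assumption: the log2(11!) term discounts the orderings of the 11 objects,
# because only the multiset of sizes is used.
bits = avg_objects * bits_per_object - math.log2(math.factorial(avg_objects))
possibilities = 2 ** bits

print(f"{bits:.1f} bits, about {possibilities:.1e} distinguishable patterns")
```

The result is far larger than the roughly 10^9 pages the slide cites, which is the point of the comparison.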

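A Hypothetical Attacker

[Figure: attacker architecture. A list of target sensitive-site URLs feeds programmatic access to each URL with traffic recording, followed by traffic pattern construction and database update into a traffic pattern database (with history). At router R1, traffic recording and pattern construction turn the browser's traffic into a traffic pattern. Similarity scores calculation compares that pattern with the database, and a decision module outputs negative or positive.]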

Guts of the Pattern Matching

• Given two multisets of object sizes S1 and S2:

$\mathrm{Sim}(S_1, S_2) = \dfrac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$

• Decision module uses an absolute threshold.

[Figure: traffic pattern and traffic pattern database, similarity scores calculation, decision module]

For example: S1 = {3KB, 3KB, 5KB}, S2 = {3KB, 5KB, 5KB}

$\mathrm{Sim}(S_1, S_2) = \dfrac{|\{3\text{KB}, 5\text{KB}\}|}{|\{3\text{KB}, 3\text{KB}, 5\text{KB}, 5\text{KB}\}|} = 0.5$
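The similarity score on this slide is the size of the multiset intersection divided by the size of the multiset union. A minimal Python sketch follows, assuming object sizes are given as plain numbers in KB; the function name `sim` and the choice to return 0.0 for two empty multisets are illustrative, not taken from the slides.

```python
from collections import Counter

def sim(s1, s2):
    """Multiset similarity: |S1 ∩ S2| / |S1 ∪ S2|."""
    c1, c2 = Counter(s1), Counter(s2)
    inter = sum((c1 & c2).values())   # & keeps the minimum count per size
    union = sum((c1 | c2).values())   # | keeps the maximum count per size
    return inter / union if union else 0.0

s1 = [3, 3, 5]   # {3KB, 3KB, 5KB}
s2 = [3, 5, 5]   # {3KB, 5KB, 5KB}
print(sim(s1, s2))  # 0.5, matching the worked example
```

`Counter` is used because its `&` and `|` operators already implement multiset intersection and union.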

Experiment Setup

• Approximately 100,000 Web pages in total (URLs obtained from the Open Directory Project).

• The hypothetical attacker chooses about 2200 pages as target pages.

• Goal: Can these 2200 pages be identified without causing many false positives?

What is a Success and Failure?

• Successful Identification:

– A target page passes the similarity threshold and is not confused with other pages in the target set.

• False Positive:

– A non-target page is incorrectly identified as one of the target pages.

• Potential False Positive:

– A page passes the similarity threshold when compared with a single selected target page.
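These definitions can be made concrete with a small sketch of a threshold-based decision step. Everything named here is hypothetical (`matching_targets`, `potential_false_positives`, the toy page names and sizes); the similarity function is the multiset intersection-over-union described in the deck, and the default threshold of 0.5 is used only as an example value.

```python
from collections import Counter

def sim(s1, s2):
    """Multiset similarity: |S1 ∩ S2| / |S1 ∪ S2|."""
    c1, c2 = Counter(s1), Counter(s2)
    union = sum((c1 | c2).values())
    return sum((c1 & c2).values()) / union if union else 0.0

def matching_targets(observed, target_db, threshold=0.5):
    """Names of target pages whose stored pattern passes the threshold."""
    return sorted(name for name, pattern in target_db.items()
                  if sim(observed, pattern) >= threshold)

def potential_false_positives(target_pattern, other_pages, threshold=0.5):
    """Count other pages that pass the threshold against one target page."""
    return sum(1 for pattern in other_pages
               if sim(pattern, target_pattern) >= threshold)

# Toy data (object sizes in KB), invented for illustration only.
target_db = {"page_a": [3, 3, 5], "page_b": [10, 20, 40]}
matches = matching_targets([3, 5, 5], target_db)
pfp = potential_false_positives([3, 3, 5], [[3, 5, 5], [7, 8], [3, 3, 5, 9]])
print(matches, pfp)
```

An observation that matches exactly one target counts as an identification in this sketch; one that matches several targets would be "confused with other pages in the target set".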

Attacker’s Success Rate

• A threshold of 0.5 is sufficient.

[Figure: % of pages (0 to 90) versus absolute threshold (0.2 to 0.9). Series: identification rate (2191 target pages) and actual false-positive rate (98496 non-target pages). Annotated values: 80.4% and 2.1%.]

Is this small enough?

A Detailed Look Inside

• False-positives are NOT generated uniformly!

[Figure: % of target pages (70 to 100) versus # of potential false positives (0 to 1200). Annotations: 0-identifiable pages; HTTP 404s; common-looking pages.]

Dynamism in Web Pages

• Most pages are relatively static

– A one-day-old pattern database is sufficient

[Figure: % of target pages (0 to 100) versus self-similarity score (0 to 1)]

Countermeasures

• Padding

– Individual objects

– Add random-sized objects

• Morphing

– Pipelining the HTTP GET requests

– Pre-fetching

• Mimicking

– Common templates or Web-hosting services

Padding Object Size

• Linear – Nearest multiple of padding size

• Exponential – Nearest power of 2

[Figure: % of 0-identifiable pages (0 to 60) versus minimum object size (128 to 16384). Series: linear padding and exponential padding.]
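The two padding rules named on this slide can be sketched as follows. This is an illustration under stated assumptions: sizes are in bytes, "nearest" is read as rounding up (padding can only add bytes), the minimum size is itself a power of two as on the chart axis, and the function names are hypothetical.

```python
def pad_linear(size, block):
    """Round size up to the next multiple of block."""
    return -(-size // block) * block   # ceiling division, then scale back

def pad_exponential(size, minimum=128):
    """Round size up to the next power of two, never below minimum.

    Assumes minimum is itself a power of two.
    """
    padded = minimum
    while padded < size:
        padded *= 2
    return padded

for size in (100, 1000, 1025):
    print(size, pad_linear(size, 512), pad_exponential(size))
```

Linear padding hides less but adds a bounded overhead per object, while exponential padding collapses many more sizes into each bucket at the cost of up to doubling an object.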

Padding Random Objects

[Figure: % of 0-identifiable pages (0 to 45) versus absolute threshold (0.3 to 0.7). Series shown: multiple of 10.]

Two-chunk Pipelining

• Approximately 36% of the target pages are 0-identifiable.

– Very close to the theoretical limit of 1/e (assuming traffic patterns are random)

• Implication: Can harness the total entropy in the Web page traffic patterns.
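The slide does not show how the 1/e limit is obtained. One simple model that yields it, offered here purely as an assumption, is this: if N pages each fall uniformly at random into one of N equally likely traffic patterns, a given page shares its pattern with no other page with probability (1 - 1/N)^(N-1), which tends to 1/e ≈ 0.368 as N grows. The function name and the equal numbers of pages and patterns are hypothetical choices.

```python
import math

def prob_unique(n_pages, n_patterns):
    """Probability that none of the other pages lands on a given page's pattern,
    assuming each page picks a pattern uniformly at random (an assumption)."""
    return (1 - 1 / n_patterns) ** (n_pages - 1)

p = prob_unique(100_000, 100_000)
print(f"{p:.4f} vs 1/e = {1 / math.e:.4f}")
```

Under this model the observed figure of about 36% would sit just below the limiting value.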

One-chunk Pipelining

[Figure: % of K-identifiable pages (0 to 12) versus K, the number of potential false positives (0 to 12)]

Conclusion

• Encrypted Web browsing can be identified by the target page’s “unique” traffic pattern.

Linear Padding

[Figure: % of identifiable sites (0 to 70) versus padding bucket size. Series: 0-identifiable, 1-identifiable, 2-identifiable.]

Exponential Padding

[Figure: % of identifiable sites (0 to 40) versus minimum padding size (128 to 16384). Series: 0-identifiable, 1-identifiable, 2-identifiable.]

Pad Random Objects

[Figure: % of identifiable sites (0 to 45) versus absolute threshold (0.3 to 0.7). Series: multiple of 10, multiple of 15, multiple of 20.]
