Transcript
Page 1: New Directions in Data Mining

New Directions in Data Mining

Ramakrishnan Srikant

Page 2: New Directions in Data Mining

Directions

• Electronic Commerce & Web– Catalog Integration (WWW 2001, with R. Agrawal)

– Searching with Numbers (WWW 2002, with R. Agrawal)

• Security

• Privacy

Page 3: New Directions in Data Mining

Catalog Integration

• B2B electronics portal: 2000 categories, 200K datasheets

a

ICs

LogicMem.DSP

fec db

ICs

Cat 2Cat 1

yx z w

Master Catalog New Catalog

Page 4: New Directions in Data Mining

Catalog Integration (cont.)

• After integration:

ICs

LogicMem.DSP

a fec db yx z

Page 5: New Directions in Data Mining

Goal

• Use affinity information in new catalog.– Products in same category are similar.

• Accuracy boost depends on match between two categorizations.

Page 6: New Directions in Data Mining

Intuition behind Algorithm

Standard

Algorithm

Enhanced

Algorithm

Computer Peripheral

Digital Camera

P1 20% 80%P2 40% 60%P3 60% 40%

Computer Peripheral

Digital Camera

P1 15% 85%P2 30% 70%P3 45% 55%

Page 7: New Directions in Data Mining

Problem Statement

• Given – master categorization M: categories C1, C2, …, Cn

• set of documents in each category

– new categorization N: set of categories S1, S2, …, Sn

• set of documents in each category

• Standard Alg: Compute Pr(Ci | d)

• Enhanced Alg: Compute Pr(Ci | d, S)

Page 8: New Directions in Data Mining

Enhanced Naïve Bayes classifier

• Use tuning set to determine w.– Defaults to standard Naïve Bayes if w = 0.

• Only affects classification of borderline documents.

j jj

iii CSC

CSCSC

)) in be topredicted in docs(#|(|

) in be topredicted in docs(#||)|Pr(

w

w

)|Pr(

)|Pr()|Pr(),|Pr(

Sd

CdSCSdC ii

i

Page 9: New Directions in Data Mining

Yahoo & Google

• 5 slices of the hierarchy: Autos, Movies, Outdoors, Photography, Software– Typical match: 69%, 15%, 3%, 3%, 1%, ….

• Merging Yahoo into Google– 30% fewer errors (14.1% absolute difference in accuracy)

• Merging Google into Yahoo– 26% fewer errors (14.3% absolute difference)

• Open Problems: SVM, Decision Tree, ...

Page 10: New Directions in Data Mining

Directions

• Electronic Commerce & Web– Catalog Integration (WWW 2001, with R. Agrawal)

– Searching with Numbers (WWW 2002, with R. Agrawal)

• Security

• Privacy

Page 11: New Directions in Data Mining

Data Extraction is hard

• Synonyms for attribute names and units.– "lb" and "pounds", but no "lbs" or

"pound".

• Attribute names are often missing.– No "Speed", just "MHz Pentium III"

– No "Memory", just "MB SDRAM"

• 850 MHz Intel Pentium III

• 192 MB RAM

• 15 GB Hard Disk

• DVD Recorder: Included;

• Windows Me

• 14.1 inch diplay

• 8.0 pounds

Page 12: New Directions in Data Mining

Searching with Numbers

IBM Thinkpad750 Mhz Pentium 3, 196 Mb RDRAM, ...

Dell Computer 700 MHz Celeron,

256 Mb SDRAM, ...

Catalog Database

800 200

800 200 3 lb

IBM Thinkpad (750 Mhz, 196 Mb)

...Dell (700 Mhz, 256

Mb)

Page 13: New Directions in Data Mining

Why does it work?

• Conjecture: If we get a close match on numbers, it is likely that we have correctly matched attribute names.

• Non-overlapping attributes: – Memory: 64 - 512 Mb, Disk: 10 - 40 Gb

• Correlations:– Memory: 64 - 512 Mb, Disk: 10 - 100 Gb still fine.

• Screen Shot

Page 14: New Directions in Data Mining

Empirical Results

1 2 3 4 5

Query Size

0

20

40

60

80

100

Acc

ura

cy

DRAMLCDProc

TransAutoCredit

GlassHousingWine

Records Fields

DRAM 3,800 10

LCD 1,700 12

Proc 1,100 12

Trans 22,273 24

Auto 205 16

Credit 666 6

Glass 214 10

Housing 506 14

Wine 179 14

Page 15: New Directions in Data Mining

Incorporating Hints

• Use simple data extraction techniques to get hints,

• Names/Units in query matched against Hints.

• Open Problem: Rethink data extraction in this context.

256 MB SDRAM memory

Unit Hint:MB

Attribute Hint:SDRAM, Memory

Page 16: New Directions in Data Mining

Other Problems

• P. Domingos & M. Richardson, Mining the Network Value of Customers, KDD 2001

• Polling over the Web

Page 17: New Directions in Data Mining

Directions

• Electronic Commerce & Web

• Security

• Privacy

Page 18: New Directions in Data Mining

Security

Credit AgenciesCriminal

Records

Demo-graphic

Birth Marriage

Phone

Email

"Frequent Traveler" Rating Model

State

Local

Page 19: New Directions in Data Mining

Some Hard Problems

• Past may be a poor predictor of future– Abrupt changes

• Reliability and quality of data– Wrong training examples

• Simultaneous mining over multiple data types

• Richer patterns

Page 20: New Directions in Data Mining

Directions

• Electronic Commerce & Web

• Security

• Privacy– Motivation

– Randomization Approach

– Crytographic Approach

– Challenges

Page 21: New Directions in Data Mining

Growing Privacy Concerns

• Popular Press:

– Economist: The End of Privacy (May 99)

– Time: The Death of Privacy (Aug 97)

• Govt. directives/commissions:

– European directive on privacy protection (Oct 98)

– Canadian Personal Information Protection Act (Jan 2001)

• Special issue on internet privacy, CACM, Feb 99

• S. Garfinkel, "Database Nation: The Death of Privacy in 21st Century", O' Reilly, Jan 2000

Page 22: New Directions in Data Mining

Privacy Concerns (2)

• Surveys of web users– 17% privacy fundamentalists, 56% pragmatic majority, 27%

marginally concerned (Understanding net users' attitude about online privacy, April 99)

– 82% said having privacy policy would matter (Freebies & Privacy: What net users think, July 99)

Page 23: New Directions in Data Mining

Technical Question

• Fear:– "Join" (record overlay) was the original sin.

– Data mining: new, powerful adversary?

• The primary task in data mining: development of models about aggregated data.

• Can we develop accurate models without access to precise information in individual data records?

Page 24: New Directions in Data Mining

Directions

• Electronic Commerce & Web

• Security

• Privacy– Motivation

– Randomization Approach

• R. Agrawal and R. Srikant, “Privacy Preserving Data Mining”, SIGMOD 2000.

– Crytographic Approach

– Challenges

Page 25: New Directions in Data Mining

Web Demographics

• Volvo S40 website targets people in 20s– Are visitors in their 20s or 40s?

– Which demographic groups like/dislike the website?

Page 26: New Directions in Data Mining

Randomization Approach Overview

50 | 40K | ... 30 | 70K | ... ...

...

Randomizer Randomizer

Reconstructdistribution

of Age

Reconstructdistributionof Salary

Data MiningAlgorithms

Model

65 | 20K | ... 25 | 60K | ... ...

Page 27: New Directions in Data Mining

Reconstruction Problem

• Original values x1, x2, ..., xn

– from probability distribution X (unknown)

• To hide these values, we use y1, y2, ..., yn

– from probability distribution Y

• Given

– x1+y1, x2+y2, ..., xn+yn

– the probability distribution of Y

Estimate the probability distribution of X.

Page 28: New Directions in Data Mining

Intuition (Reconstruct single point)

• Use Bayes' rule for density functions

10 90Age

V

Original distribution for Age

Probabilistic estimate of original value of V

Page 29: New Directions in Data Mining

Intuition (Reconstruct single point)

Original Distribution for Age

Probabilistic estimate of original value of V

10 90Age

V

• Use Bayes' rule for density functions

Page 30: New Directions in Data Mining

Reconstructing the Distribution

• Combine estimates of where point came from for all the points:– Gives estimate of original distribution.

10 90Age

Page 31: New Directions in Data Mining

Reconstruction: Bootstrapping

fX0 := Uniform distribution

j := 0 // Iteration number repeat

fXj+1(a) := (Bayes' rule)

j := j+1 until (stopping criterion met)

• Converges to maximum likelihood estimate.– D. Agrawal & C.C. Aggarwal, PODS 2001.

n

ij

XiiY

jXiiY

afayxf

afayxf

n 1 )())((

)())((1

Page 32: New Directions in Data Mining

Seems to work well!

0

200

400

600

800

1000

1200

20 60

Age

Nu

mb

er o

f P

eop

le

OriginalRandomizedReconstructed

Page 33: New Directions in Data Mining

Recap: Why is privacy preserved?

• Cannot reconstruct individual values accurately.

• Can only reconstruct distributions.

Page 34: New Directions in Data Mining

Classification

• Naïve Bayes– Assumes independence between attributes.

• Decision Tree– Correlations are weakened by randomization, not destroyed.

Page 35: New Directions in Data Mining

Algorithms

• “Global” Algorithm– Reconstruct for each attribute once at the beginning

• “By Class” Algorithm– For each attribute, first split by class, then reconstruct separately

for each class.

• Associate each randomized value with a reconstructed interval & run decision-tree classifier.– See SIGMOD 2000 paper for details.

Page 36: New Directions in Data Mining

Experimental Methodology

• Compare accuracy against– Original: unperturbed data without randomization.

– Randomized: perturbed data but without making any corrections for randomization.

• Test data not randomized.

• Synthetic data benchmark from [AGI+92].

• Training set of 100,000 records, split equally between the two classes.

Page 37: New Directions in Data Mining

Synthetic Data Functions

• F3 ((age < 40) and

(((elevel in [0..1]) and (25K <= salary <= 75K)) or

((elevel in [2..3]) and (50K <= salary <= 100K))) or

((40 <= age < 60) and ...

• F4 (0.67 x (salary+commission) - 0.2 x loan - 10K) > 0

Page 38: New Directions in Data Mining

Quantifying Privacy

• Add a random value between -30 and +30 to age.

• If randomized value is 60– know with 90% confidence that age is between 33 and 87.

• Interval width amount of privacy.– Example: (Interval Width : 54) / (Range of Age: 100) 54%

randomization level @ 90% confidence

Page 39: New Directions in Data Mining

Acceptable loss in accuracy

100% Randomization Level

40

50

60

70

80

90

100

Fn 1 Fn 2 Fn 3 Fn 4 Fn 5

Acc

urac

y OriginalRandomizedByClassGlobal

Page 40: New Directions in Data Mining

Accuracy vs. Randomization Level

Fn 3

40

50

60

70

80

90

100

10 20 40 60 80 100 150 200

Randomization Level

Acc

ura

cy OriginalRandomizedByClass

Page 41: New Directions in Data Mining

Directions

• Electronic Commerce & Web

• Security

• Privacy– Motivation

– Randomization Approach

– Crytographic Approach• Y. Lindell and B. Pinkas, “Privacy Preserving Data Mining”,

Crypto 2000.

– Challenges

Page 42: New Directions in Data Mining

Inter-Enterprise Data Mining

• Problem: Two parties owning confidential databases wish to build a decision-tree classifier on the union of their databases, without revealing any unnecessary information.

• Horizontally partitioned.– Records (users) split across companies.

– Example: Credit card fraud detection model.

• Vertically partitioned.– Attributes split across companies.

– Example: Associations across websites.

Page 43: New Directions in Data Mining

Cryptographic Adversaries

• Malicious adversary: can alter its input, e.g., define input to be the empty database.

• Semi-honest (or passive) adversary: Correctly follows the protocol specification, yet attempts to learn additional information by analyzing the messages.

Page 44: New Directions in Data Mining

Yao's two-party protocol

• Party 1 with input x

• Party 2 with input y

• Wish to compute f(x,y) without revealing x,y.

• Yao, “How to generate and exchange secrets”, FOCS 1986.

Page 45: New Directions in Data Mining

Private Distributed ID3

• Key problem: find attribute with highest information gain.

• We can then split on this attribute and recurse.– Assumption: Numeric values are discretized, with n-way split.

Page 46: New Directions in Data Mining

Information Gain

• Let– T = set of records (dataset),

– T(ci) = set of records in class i,

– T(ci,aj) = set of records in class i with value(A) = aj.

– Entropy(T) =

– Gain(T,A) = Entropy(T) -

• Need to compute– j i |T(aj, ci)| log |T(aj, ci)|

– j |T(aj)| log |T(aj)|.

||

|)(|log

||

|)(|

T

cT

T

cT i

i

i

))aEntropy(T(||

|)(|j j

j

T

aT

Page 47: New Directions in Data Mining

Selecting the Split Attribute

• Given v1 known to party 1 and v2 known to party 2, compute (v1 + v2) log (v1 + v2) and output random shares.– Party 1 gets Answer - – Party 2 gets , where is a random number

• Given random shares for each attribute, use Yao's protocol to compute information gain.

Page 48: New Directions in Data Mining

Summary (Cryptographic Approach)

• Solves different problem (vs. randomization)– Efficient with semi-honest adversary and small number of parties.

– Gives the same solution as the non-privacy-preserving computation (unlike randomization).

– Will not scale to individual user data.

• Can we extend the approach to other data mining problems?– J. Vaidya and C.W. Clifton, “Privacy Preserving Association Rule

Mining in Vertically Partitioned Data”, KDD 2002

Page 49: New Directions in Data Mining

Directions

• Electronic Commerce

• Security

• Privacy– Motivation

– Randomization Approach

– Crytographic Approach

– Challenges

Page 50: New Directions in Data Mining

Challenges

• Application: Privacy-Sensitive Security Profiling

• Privacy Breaches

• Clustering & Associations

Page 51: New Directions in Data Mining

Privacy-sensitive Security Profiling

• Heterogeneous, distributed data.

• New domains: text, graph

Credit AgenciesCriminal

Records

Demo-graphic

Birth Marriage

Phone

Email

"Frequent Traveler" Rating Model

State

Local

Page 52: New Directions in Data Mining

Potential Privacy Breaches

• Distribution is a spike.– Example: Everyone is of age 40.

• Some randomized values are only possible from a given range.– Example: Add U[-50,+50] to age and get 125 True age is 75.

– Not an issue with Gaussian.

Page 53: New Directions in Data Mining

Potential Privacy Breaches (2)

• Most randomized values in a given interval come from a given interval.– Example: 60% of the people whose randomized value is in

[120,130] have their true age in [70,80].

– Implication: Higher levels of randomization will be required.

• Correlations can make previous effect worse.– Example: 80% of the people whose randomized value of age is in

[120,130] and whose randomized value of income is [...] have their true age in [70,80].

• Challenge: How do you limit privacy breaches?

Page 54: New Directions in Data Mining

Clustering

• Classification: ByClass partitioned the data by class & then reconstructed attributes.– Assumption: Attributes independent given class attribute.

• Clustering: Don’t know the class label.– Assumption: Attributes independent.

• Global (latter assumption) does much worse than ByClass.

• Can we reconstruct a set of attributes together?– Amount of data needed increases exponentially with number of

attributes.

Page 55: New Directions in Data Mining

Associations• Very strong correlations Privacy breaches major issue.• Strawman Algorithm: Replace 80% of the items with other randomly

selected items.– 10 million transactions, 3 items/transaction, 1000 items– <a, b, c> has 1% support = 100,000 transactions– <a, b>, <b, c>, <a, c> each have 2% support

• 3% combined support excluding <a, b, c>– Probability of retaining pattern = 0.23 = 0.8%

• 800 occurrences of <a, b, c> retained.– Probability of generating pattern = 0.8 * 0.001 = 0.08%

• 240 occurrences of <a, b, c> generated by replacing one item.– Estimate with 75% confidence that pattern was originally present!

• Ack: Alexandre Evfimievski

Page 56: New Directions in Data Mining

Associations (cont.)

• "Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest?" ... "He grows a forest to hide it in." -- G.K. Chesterton

• A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke, “Privacy Preserving Mining of Association Rules”, KDD 2002.

• S. Rizvi, J. Haritsa, “Privacy-Preserving Association Rule Mining”, VLDB 2002.

Page 57: New Directions in Data Mining

Summary (Privacy)

• Have your cake and mine it too!– Preserve privacy at the individual level, but still build accurate

models.

• Challenges– Privacy Breaches, Security Applications, Clustering &

Associations

• Opportunities– Web Demographics, Inter-Enterprise Data Mining, Security

Applications

Page 58: New Directions in Data Mining

Slides available from ...

www.almaden.ibm.com/cs/people/srikant/talks.html

Page 59: New Directions in Data Mining

Backup

Page 60: New Directions in Data Mining

Classification Example

Age Salary Repeat Visitor?

23 50K Repeat

17 30K Repeat

43 40K Repeat

68 50K Single

32 70K Single

20 20K Repeat

Age < 25

Salary < 50K

Repeat

Repeat

Single

Yes

Yes

No

No

Page 61: New Directions in Data Mining

Work in Statistical Databases

• Provide statistical information without compromising sensitive information about individuals (surveys: AW89, Sho82)

• Techniques– Query Restriction

– Data Perturbation

• Negative Results: cannot give high quality statistics and simultaneously prevent partial disclosure of individual information [AW89]

Page 62: New Directions in Data Mining

Statistical Databases: Techniques

• Query Restriction– restrict the size of query result (e.g. FEL72, DDS79)– control overlap among successive queries (e.g. DJL79)– suppress small data cells (e.g. CO82)

• Output Perturbation– sample result of query (e.g. Den80)– add noise to query result (e.g. Bec80)

• Data Perturbation– replace db with sample (e.g. LST83, LCL85, Rei84)– swap values between records (e.g. Den82)– add noise to values (e.g. TYW84, War65)

Page 63: New Directions in Data Mining

Statistical Databases: Comparison

• We do not assume original data is aggregated into a single database.

• Concept of reconstructing original distribution.– Adding noise to data values problematic without such

reconstruction.


Top Related