Synthesizing High-Frequency Rules from Different Data Sources

Xindong Wu and Shichao Zhang. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 15, NO. 2, MARCH/APRIL 2003

Page 1: Synthesizing High-Frequency Rules from Different Data Sources

1

Synthesizing High-Frequency Rules from Different Data Sources

Xindong Wu and Shichao Zhang

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 15, NO. 2, MARCH/APRIL 2003

Page 2: Synthesizing High-Frequency Rules from Different Data Sources

2

Pre-work

Knowledge management

Knowledge discovery

Data mining

Data warehouse

Page 3: Synthesizing High-Frequency Rules from Different Data Sources

3

Knowledge Management

Building a data warehouse through knowledge management

Page 4: Synthesizing High-Frequency Rules from Different Data Sources

4

Knowledge Discovery and Data Mining

Data mining is a tool for knowledge discovery

Page 5: Synthesizing High-Frequency Rules from Different Data Sources

5

Why data mining

Simon

Commodities

Supermarket

If a supermarket manager, Simon, wants to arrange these commodities in the supermarket, how should he arrange them to bring in more revenue and convenience?

If a customer buys milk, then he is likely to buy bread, so...

Page 6: Synthesizing High-Frequency Rules from Different Data Sources

6

Why data mining

Simon

Before long, if Simon wants to send advertising letters to customers, accounting for their individual differences becomes an important task.

Mary always buys diapers and milk powder; she may have a baby, so ….

Page 7: Synthesizing High-Frequency Rules from Different Data Sources

7

The role of Data mining

Preprocess data

Useful patterns

Knowledge and strategy

Page 8: Synthesizing High-Frequency Rules from Different Data Sources

8

Mining association rules

Bread

Milk

IF bread is bought then milk is bought

Page 9: Synthesizing High-Frequency Rules from Different Data Sources

9

Mining steps

Step 1: define minsup and minconf

ex: minsup=50%, minconf=50%

Step 2: find large itemsets

Step 3: generate association rules
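The three mining steps can be sketched in a few lines of Python. This is only an illustration: the transactions are invented, the brute-force enumeration stands in for whatever itemset miner is actually used, and the names `transactions`, `support`, `large`, and `rules` are hypothetical.

```python
from itertools import combinations

# Hypothetical transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk", "eggs"},
]

MINSUP = 0.5   # Step 1: minimum support
MINCONF = 0.5  # Step 1: minimum confidence


def support(itemset):
    """Fraction of transactions that contain every item of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


# Step 2: find large itemsets by brute-force enumeration (fine for tiny data).
items = sorted(set().union(*transactions))
large = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        s = support(frozenset(combo))
        if s >= MINSUP:
            large[frozenset(combo)] = s

# Step 3: generate rules X -> Y from every large itemset with two or more items.
rules = []
for itemset, supp in large.items():
    for k in range(1, len(itemset)):
        for combo in combinations(sorted(itemset), k):
            lhs = frozenset(combo)
            conf = supp / support(lhs)
            if conf >= MINCONF:
                rules.append((set(lhs), set(itemset - lhs), supp, conf))

for lhs, rhs, supp, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), f"supp={supp:.2f} conf={conf:.2f}")
```

With these made-up transactions, {bread, milk} and {eggs, milk} are the only large itemsets with more than one item, so all generated rules come from those two.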

Page 10: Synthesizing High-Frequency Rules from Different Data Sources

10

Example

Large itemsets

Page 11: Synthesizing High-Frequency Rules from Different Data Sources

11

Outline

Introduction

Weights of Data Sources

Rule Selection

Synthesizing High-Frequency Rules Algorithm

Relative Synthesizing Model

Experiments

Conclusion

Page 12: Synthesizing High-Frequency Rules from Different Data Sources

12

Introduction Framework

DB1, DB2, ..., DBn

RD1, RD2, ..., RDn (rules from each database, e.g. AB→C, A→D, B→E)

GRB

Synthesizing High-Frequency Rules

• Weighting

• Ranking

Page 13: Synthesizing High-Frequency Rules from Different Data Sources

13

Weights of Data Sources

Definition

Di : data source
Si : set of association rules from Di
Ri : association rule

3 Steps

Step 1 : union of all Si
Step 2 : assigning each Ri a weight
Step 3 : assigning each Di a weight & normalization
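Written as formulas, Steps 2 and 3 appear to be the following. This is a reconstruction inferred from the arithmetic of the worked example in this deck, not a quotation of the paper; the symbols $w_{R_i}$ and $w_{D_i}$ are notation chosen here, and $\mathrm{Num}(R)$ is the number of data sources that report rule $R$.

```latex
w_{R_i} = \frac{\mathrm{Num}(R_i)}{\sum_{j} \mathrm{Num}(R_j)},
\qquad
w_{D_i} = \frac{\sum_{R_k \in S_i} \mathrm{Num}(R_k)\, w_{R_k}}
               {\sum_{j} \sum_{R_h \in S_j} \mathrm{Num}(R_h)\, w_{R_h}}
```

The sum in the first denominator runs over the distinct rules in the union of all $S_i$; the second denominator is the normalization that makes the source weights add up to 1.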

Page 14: Synthesizing High-Frequency Rules from Different Data Sources

14

Example

3 Data Sources (minsupp=0.2, minconf=0.3)

S1
AB→C with supp=0.4, conf=0.72
A→D with supp=0.3, conf=0.64
B→E with supp=0.34, conf=0.7

S2
B→C with supp=0.45, conf=0.87
A→D with supp=0.36, conf=0.7
B→E with supp=0.4, conf=0.6

S3
AB→C with supp=0.5, conf=0.82
A→D with supp=0.25, conf=0.62

Page 15: Synthesizing High-Frequency Rules from Different Data Sources

15

Step 1

Union of all Si

S’ = {S1, S2, S3}

R1 : AB→C    S1, S3        2 times
R2 : A→D     S1, S2, S3    3 times
R3 : B→E     S1, S2        2 times
R4 : B→C     S2            1 time


Page 16: Synthesizing High-Frequency Rules from Different Data Sources

16

Step 2 Assigning each Ri a weight

WR1 = 2 / (2 + 3 + 2 + 1) = 0.25

WR2 = 3 / (2 + 3 + 2 + 1) = 0.375

WR3 = 2 / (2 + 3 + 2 + 1) = 0.25

WR4 = 1 / (2 + 3 + 2 + 1) = 0.125

Page 17: Synthesizing High-Frequency Rules from Different Data Sources

17

Step 3

Assigning each Di a weight

Ri          WRi     Time   Si
R1 : AB→C   0.25    2      S1, S3
R2 : A→D    0.375   3      S1, S2, S3
R3 : B→E    0.25    2      S1, S2
R4 : B→C    0.125   1      S2

WD1 = 2*0.25+3*0.375+2*0.25=2.125
WD2 = 1*0.125+2*0.25+3*0.375=2
WD3 = 2*0.25+3*0.375=1.625

Normalization

WD1 = 2.125/(2.125+2+1.625)=0.3695
WD2 = 2/(2.125+2+1.625)=0.348
WD3 = 1.625/(2.125+2+1.625)=0.2825

Page 18: Synthesizing High-Frequency Rules from Different Data Sources

18

Why Rule Selection ?

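Goal : extracting high-frequency rules

Low-frequency rules are treated as noise

Solution

If Num(Ri) / n is below a given threshold, where n is the number of data sources and Num(Ri) is the frequency of Ri (the number of data sources that report it),

then rule Ri is wiped out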

Page 19: Synthesizing High-Frequency Rules from Different Data Sources

19

Rule Selection

Example : 10 Data Sources

D1~D9 : {R1 : X→Y}

D10 : {R1 : X→Y, R2 : X1→Y1, …, R11 : X10→Y10}

Let the selection threshold be 0.8

Num(R1) / 10 = 10/10 = 1 > 0.8 : keep

Num(R2~11) / 10 = 1/10 = 0.1 < 0.8 : wiped out

After rule selection, D1~D10 : {R1 : X→Y}

WR1 : 10/10 = 1

WD1~10 : 10*1 / (10*10*1) = 0.1, that is, Num(R1)*WR1 divided by n*Num(R1)*WR1

Page 20: Synthesizing High-Frequency Rules from Different Data Sources

20

Comparison

Without Rules Selection
WD1~9 = 0.099
WD10 = 0.109

With Rules Selection
WD1~10 = 0.1

Weight errors from the high-frequency rules' point of view
D1~9 : |0.1-0.099| = 0.001
D10 : |0.1-0.109| = 0.009

Total Error = 0.01
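The two sets of weights in this comparison can be checked with a small Python sketch of the 10-source example. The helper names `normalized_source_weights` and `select_rules` are hypothetical, and a rule is kept when Num(R)/n is at least the threshold, which is one reading of the selection condition.

```python
from collections import Counter


def normalized_source_weights(rule_sets):
    """Source weights derived from rule frequencies, normalized to sum to 1."""
    num = Counter(r for rules in rule_sets for r in rules)
    total = sum(num.values())
    # Num(R) * W(R) with W(R) = Num(R) / total, summed over each source's rules.
    raw = [sum(num[r] * num[r] / total for r in rules) for rules in rule_sets]
    norm = sum(raw)
    return [w / norm for w in raw]


def select_rules(rule_sets, threshold):
    """Wipe out every rule with Num(R) / n below the threshold."""
    n = len(rule_sets)
    num = Counter(r for rules in rule_sets for r in rules)
    return [[r for r in rules if num[r] / n >= threshold] for rules in rule_sets]


# D1..D9 report only R1; D10 reports R1 plus ten rules seen nowhere else.
sources = [["R1"] for _ in range(9)]
sources.append(["R1"] + [f"R{k}" for k in range(2, 12)])

without = normalized_source_weights(sources)
with_sel = normalized_source_weights(select_rules(sources, 0.8))

print(round(without[0], 3), round(without[9], 3))    # 0.099 0.109
print(round(with_sel[0], 3), round(with_sel[9], 3))  # 0.1 0.1
```

Without selection, D10 gains weight only because it reports ten rules that no other source supports; selection removes that advantage.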

Page 21: Synthesizing High-Frequency Rules from Different Data Sources

21

Synthesizing High-Frequency Rules Algorithm

5 Steps

Step 1 : Rules Selection

Step 2 : Weights of Data Sources
  Step 2.1 : union of all Si
  Step 2.2 : assigning each Ri a weight
  Step 2.3 : assigning each Di a weight & normalization

Step 3 : computing supp & conf of each Ri

Step 4 : ranking all rules by support

Step 5 : output the High-Frequency Rules
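Step 3 can be read as a weighted sum over the data sources that report a rule. The formulas below are a reconstruction from the arithmetic in the worked example of this deck, with $w_{D_i}$ standing for the normalized weight of data source $D_i$ and $\mathrm{supp}_i$, $\mathrm{conf}_i$ for the values reported in $S_i$; the notation is chosen here and is not taken from the slides.

```latex
\mathrm{supp}_w(R) = \sum_{i \,:\, R \in S_i} w_{D_i}\, \mathrm{supp}_i(R),
\qquad
\mathrm{conf}_w(R) = \sum_{i \,:\, R \in S_i} w_{D_i}\, \mathrm{conf}_i(R)
```

A source that does not report $R$ contributes nothing to either sum.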

Page 22: Synthesizing High-Frequency Rules from Different Data Sources

22

An Example

3 Data Sources

selection threshold = 0.4, minsupp=0.2, minconf=0.3

S1
1. AB→C with supp=0.4, conf=0.72
2. A→D with supp=0.3, conf=0.64
3. B→E with supp=0.34, conf=0.7

S2
1. B→C with supp=0.45, conf=0.87
2. A→D with supp=0.36, conf=0.7
3. B→E with supp=0.4, conf=0.6

S3
1. AB→C with supp=0.5, conf=0.82
2. A→D with supp=0.25, conf=0.62

Page 23: Synthesizing High-Frequency Rules from Different Data Sources

23

Step 1

Rules Selection

R1 : AB→C    S1, S3        2 times    Num(R1) / 3 = 0.66 > 0.4 : keep
R2 : A→D     S1, S2, S3    3 times    Num(R2) / 3 = 1 > 0.4 : keep
R3 : B→E     S1, S2        2 times    Num(R3) / 3 = 0.66 > 0.4 : keep
R4 : B→C     S2            1 time     Num(R4) / 3 = 0.33 < 0.4 : wiped out

Page 24: Synthesizing High-Frequency Rules from Different Data Sources

24

Step 2 : Weights of Data Sources

Weights of Ri

WR1 = 2 / (2 + 3 + 2) = 0.29
WR2 = 3 / (2 + 3 + 2) = 0.42
WR3 = 2 / (2 + 3 + 2) = 0.29

Ri          WRi    Time   Si
R1 : AB→C   0.29   2      S1, S3
R2 : A→D    0.42   3      S1, S2, S3
R3 : B→E    0.29   2      S1, S2

Weight of Di

WD1 = 2*0.29+3*0.42+2*0.29=2.42
WD2 = 3*0.42+2*0.29=1.84
WD3 = 2*0.29+3*0.42=1.84

Normalization

WD1 = 2.42/(2.42+1.84+1.84)=0.396
WD2 = 1.84/(2.42+1.84+1.84)=0.302
WD3 = 1.84/(2.42+1.84+1.84)=0.302

Page 25: Synthesizing High-Frequency Rules from Different Data Sources

25

Step 3 Computing supp & conf of each Ri

WD1 = 0.396, WD2 = 0.302, WD3 = 0.302

Rule sets after rule selection

S1
1. AB→C with supp=0.4, conf=0.72
2. A→D with supp=0.3, conf=0.64
3. B→E with supp=0.34, conf=0.7

S2
2. A→D with supp=0.36, conf=0.7
3. B→E with supp=0.4, conf=0.6

S3
1. AB→C with supp=0.5, conf=0.82
2. A→D with supp=0.25, conf=0.62

Support

AB→C : 0.396*0.4+0.302*0.5=0.3094
A→D : 0.396*0.3+0.302*0.36+0.302*0.25=0.303
B→E : 0.396*0.34+0.302*0.4=0.255

Confidence

AB→C : 0.396*0.72+0.302*0.82=0.532
A→D : 0.396*0.64+0.302*0.7+0.302*0.62=0.652
B→E : 0.396*0.7+0.302*0.6=0.458

Page 26: Synthesizing High-Frequency Rules from Different Data Sources

26

Step 4 & Step 5

Ranking all rules by support & output

minsupp=0.2, minconf=0.3

Ranking
1. AB→C (0.3094)
2. A→D (0.303)
3. B→E (0.255)

Output : 3 rules (supp, conf)
AB→C (0.3094, 0.532)
A→D (0.303, 0.652)
B→E (0.255, 0.458)
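The five steps of the algorithm can be strung together in one Python sketch of the three-source example. This is an illustrative reading rather than the authors' implementation: the function name `synthesize` and the data layout are hypothetical, a rule is kept when Num(R)/n is at least the threshold, and the weighted sums run over every source that reports a rule, so A→D includes its S3 term. The sketch uses unrounded weights, so its results can differ from hand arithmetic with rounded weights in the third or fourth decimal.

```python
from collections import Counter

# Rule sets of the example: rule -> (supp, conf), one dict per data source.
S = [
    {"AB->C": (0.40, 0.72), "A->D": (0.30, 0.64), "B->E": (0.34, 0.70)},
    {"B->C": (0.45, 0.87), "A->D": (0.36, 0.70), "B->E": (0.40, 0.60)},
    {"AB->C": (0.50, 0.82), "A->D": (0.25, 0.62)},
]


def synthesize(rule_sets, threshold, minsupp, minconf):
    """Return [(rule, supp, conf)] ranked by synthesized support."""
    n = len(rule_sets)
    # Step 1: rule selection, dropping rules reported by too few sources.
    num = Counter(r for s in rule_sets for r in s)
    kept = {r: c for r, c in num.items() if c / n >= threshold}
    # Step 2: rule weights, then normalized data-source weights.
    total = sum(kept.values())
    rule_w = {r: c / total for r, c in kept.items()}
    raw = [sum(kept[r] * rule_w[r] for r in s if r in kept) for s in rule_sets]
    norm = sum(raw)
    source_w = [w / norm for w in raw]
    # Step 3: weighted supp and conf over the sources that report each rule.
    result = []
    for r in kept:
        supp = sum(w * s[r][0] for w, s in zip(source_w, rule_sets) if r in s)
        conf = sum(w * s[r][1] for w, s in zip(source_w, rule_sets) if r in s)
        # Step 5 outputs only rules that still meet minsupp and minconf.
        if supp >= minsupp and conf >= minconf:
            result.append((r, supp, conf))
    # Step 4: rank by support, highest first.
    return sorted(result, key=lambda t: t[1], reverse=True)


for rule, supp, conf in synthesize(S, threshold=0.4, minsupp=0.2, minconf=0.3):
    print(f"{rule}: supp={supp:.3f}, conf={conf:.3f}")
```

B→C never reaches the output because it is reported by only one of the three sources and is removed in Step 1.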

Page 27: Synthesizing High-Frequency Rules from Different Data Sources

27

Relative Synthesizing Model

Framework

Unknown Di : Internet, Web, books, journals

X→Y conf=0.7
X→Y conf=0.72
X→Y conf=0.68

Synthesizing : X→Y conf=?

• clustering method
• roughly method

Page 28: Synthesizing High-Frequency Rules from Different Data Sources

28

Synthesizing Methods

Physical Meaning

If the confidences are irregularly distributed
Maximum synthesizing operator
Minimum synthesizing operator
Average synthesizing operator

If the confidences (X) follow a normal distribution
Clustering : find an interval [a, b] that satisfies
1. P{ a ≤ X ≤ b }, estimated as m/n, reaches a given minimum
2. | b – a | stays within a given maximum width
3. a, b > minconf

Page 29: Synthesizing High-Frequency Rules from Different Data Sources

29

Clustering Method

5 Steps

Step 1 : closeness, 1 - | confi – confj |
  The distance relation table

Step 2 : closeness degree measure
  The confidence-confidence matrix

Step 3 : are two confidences close enough ?
  The confidence relationship matrix

Step 4 : classes creating
  [a, b], the interval of the confidence of rule X→Y

Step 5 : interval verifying
  Does the interval satisfy the constraints ?

Page 30: Synthesizing High-Frequency Rules from Different Data Sources

30

An Example

Assume rule X→Y

conf1=0.7, conf2=0.72, conf3=0.68, conf4=0.5

conf5=0.71, conf6=0.69, conf7=0.7, conf8=0.91

3 parameters

minimum share of the confidences inside the interval = 0.7

maximum interval width = 0.08

closeness threshold = 0.69

Page 31: Synthesizing High-Frequency Rules from Different Data Sources

31

Step 1 : Closeness

Example conf1=0.7, conf2=0.72

c1, 2= 1 - | conf1 - conf2 | = 1 - |0.70-0.72|=0.98

Page 32: Synthesizing High-Frequency Rules from Different Data Sources

32

Step 2 : Closeness Degree Measure

Example

Page 33: Synthesizing High-Frequency Rules from Different Data Sources

33

Step 3 : Close Enough ?

Example

Threshold = 6.9

Closeness degree > 6.9 : close enough

Closeness degree < 6.9 : not close enough

Page 34: Synthesizing High-Frequency Rules from Different Data Sources

34

Step 4 : Classes Creating

Example

Class 1 : conf1~3, conf5~7

Class 2 : conf4

Class 3 : conf8

Page 35: Synthesizing High-Frequency Rules from Different Data Sources

35

Step 5 : Interval Verifying

Example : Class 1

conf1=0.7, conf2=0.72, conf3=0.68, conf5=0.71, conf6=0.69, conf7=0.7

[min, max] = [conf3, conf2] = [0.68, 0.72]

constraint 1 : P{ 0.68 ≤ X ≤ 0.72 } = 6/8 = 0.75 ≥ 0.7
constraint 2 : |0.72-0.68| = 0.04 < 0.08
constraint 3 : 0.68, 0.72 > minconf (0.65)

In the same way, Class 2 & Class 3 are wiped out

Result : X→Y : conf=[0.68, 0.72]

Support ? It is synthesized in the same way, as an interval
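The five clustering steps can be sketched in Python on the eight confidences of this example. Two parts are assumptions, because the corresponding slide content did not survive: the closeness degree of two confidences is taken to be the inner product of their rows in the closeness table, and classes are formed by linking confidences that are close enough. The threshold 6.9 is the value shown in Step 3, and the function name `cluster_confidences` and its parameter names are hypothetical.

```python
def cluster_confidences(confs, close_threshold, min_share, max_width, minconf):
    """Return the confidence intervals [a, b] that pass all three constraints."""
    n = len(confs)
    # Step 1: closeness c[i][j] = 1 - |conf_i - conf_j|.
    c = [[1 - abs(a - b) for b in confs] for a in confs]
    # Step 2 (assumed measure): inner product of rows i and j of the table.
    degree = [[sum(c[i][k] * c[j][k] for k in range(n)) for j in range(n)]
              for i in range(n)]
    # Step 3: relationship matrix, True where two confidences are close enough.
    close = [[degree[i][j] > close_threshold for j in range(n)]
             for i in range(n)]
    # Step 4 (assumed grouping): merge the classes of any two close confidences.
    label = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if close[i][j]:
                old, new = label[j], label[i]
                label = [new if x == old else x for x in label]
    classes = {}
    for idx, lab in enumerate(label):
        classes.setdefault(lab, []).append(confs[idx])
    # Step 5: keep a class only if its interval satisfies the constraints.
    intervals = []
    for members in classes.values():
        a, b = min(members), max(members)
        if len(members) / n >= min_share and b - a <= max_width and a > minconf:
            intervals.append((a, b))
    return intervals


confs = [0.7, 0.72, 0.68, 0.5, 0.71, 0.69, 0.7, 0.91]
print(cluster_confidences(confs, close_threshold=6.9, min_share=0.7,
                          max_width=0.08, minconf=0.65))
# [(0.68, 0.72)]
```

Under these assumptions the sketch forms the same three classes as the example, and only the class around 0.7 survives the interval check.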

Page 36: Synthesizing High-Frequency Rules from Different Data Sources

36

Roughly Method

Example

R : AB→C
supp1=0.4, conf1=0.72
supp2=0.5, conf2=0.82

Maximum
max ( supp (R) )=max (0.4, 0.5)=0.5
max ( conf (R) )=max (0.72, 0.82)=0.82

Minimum
min ( supp (R) )=0.4
min ( conf (R) )=0.72

Average
avg ( supp (R) )=0.45
avg ( conf (R) )=0.77
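The three rough operators amount to applying max, min, or the mean to the supports and confidences reported for the same rule. A minimal Python sketch, with the hypothetical helper `synthesize_roughly`:

```python
from statistics import mean

# (supp, conf) reported for rule AB->C by two sources, as in the example.
occurrences = [(0.4, 0.72), (0.5, 0.82)]


def synthesize_roughly(occurrences, operator):
    """Apply one synthesizing operator (max, min or mean) to supp and conf."""
    supps, confs = zip(*occurrences)
    return operator(supps), operator(confs)


print(synthesize_roughly(occurrences, max))   # maximum operator
print(synthesize_roughly(occurrences, min))   # minimum operator
print(synthesize_roughly(occurrences, mean))  # average operator
```

Which operator fits depends on how cautious the synthesized rule should be: the minimum is the most conservative estimate and the maximum the most optimistic.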

Page 37: Synthesizing High-Frequency Rules from Different Data Sources

37

Experiments

Time

SWNBS (without rules selection)
SWBRS (with rules selection)
SWNBS > SWBRS

Error (first 20 frequent itemsets)

Max=0.000065
Avg=0.00003165

Page 38: Synthesizing High-Frequency Rules from Different Data Sources

38

Conclusion

Synthesizing Model

Data sources known : weighting

Data sources unknown : clustering method, roughly method

Page 39: Synthesizing High-Frequency Rules from Different Data Sources

39

Future work

Sequential patterns

Combining GA and other techniques