data warehousing overview cs245 notes 11 hector garcia-molina stanford university cs 245 1 notes11

95
Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

Upload: abigayle-pearson

Post on 21-Dec-2015

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

Data Warehousing OverviewCS245 Notes 11

Hector Garcia-Molina

Stanford University

CS 2451

Notes11

Page 2: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes112

Outline

What is a data warehouse? Why a warehouse? Models & operations Implementing a warehouse

Page 3: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes113

What is a Warehouse?

Collection of diverse data subject oriented aimed at executive, decision maker often a copy of operational data with value-added data (e.g., summaries, history)

integrated time-varying non-volatile

more

Page 4: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes114

What is a Warehouse?

Collection of tools gathering data cleansing, integrating, ... querying, reporting, analysis data mining monitoring, administering warehouse

Page 5: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes115

Warehouse Architecture

Client Client

Warehouse

Source Source Source

Query & Analysis

Integration

Metadata

Page 6: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes116

Motivating Examples

Forecasting Comparing performance of units Monitoring, detecting fraud Visualization

Page 7: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes117

Alternative to Warehousing

Two Approaches: Query-Driven (Lazy) Warehouse (Eager)

Source Source

?

Page 8: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes118

Query-Driven Approach

Client Client

Wrapper Wrapper Wrapper

Mediator

Source Source Source

Page 9: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes119

Advantages of Warehousing

High query performance Queries not visible outside warehouse Local processing at sources unaffected Can operate when sources unavailable Can query data not stored in a DBMS Extra information at warehouse

Modify, summarize (store aggregates) Add historical information

Page 10: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1110

Advantages of Query-Driven

No need to copy data less storage no need to purchase data

More up-to-date data Query needs can be unknown Only query interface needed at sources May be less draining on sources

Page 11: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1111

Warehouse Models & Operators

Data Models relational cubes

Operators

Page 12: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1112

Star

customer custId name address city53 joe 10 main sfo81 fred 12 main sfo

111 sally 80 willow la

product prodId name pricep1 bolt 10p2 nut 5

store storeId cityc1 nycc2 sfoc3 la

sale oderId date custId prodId storeId qty amto100 1/7/97 53 p1 c1 1 12o102 2/7/97 53 p2 c1 2 11105 3/8/97 111 p1 c3 5 50

Page 13: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1113

Star Schema

saleorderId

datecustIdprodIdstoreId

qtyamt

customercustIdname

addresscity

productprodIdnameprice

storestoreId

city

Page 14: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1114

Terms

Fact table Dimension tables Measures

saleorderId

datecustIdprodIdstoreId

qtyamt

customercustIdname

addresscity

productprodIdnameprice

storestoreId

city

Page 15: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1115

Dimension Hierarchies

store storeId cityId tId mgrs5 sfo t1 joes7 sfo t2 freds9 la t1 nancy

city cityId pop regIdsfo 1M northla 5M south

region regId namenorth cold regionsouth warm region

sType tId size locationt1 small downtownt2 large suburbs

storesType

city region

snowflake schema constellations

Page 16: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1116

Cube

sale prodId storeId amtp1 c1 12p2 c1 11p1 c3 50p2 c2 8

c1 c2 c3p1 12 50p2 11 8

Fact table view: Multi-dimensional cube:

dimensions = 2

Page 17: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1117

3-D Cube

sale prodId storeId date amtp1 c1 1 12p2 c1 1 11p1 c3 1 50p2 c2 1 8p1 c1 2 44p1 c2 2 4

day 2c1 c2 c3

p1 44 4p2 c1 c2 c3

p1 12 50p2 11 8

day 1

dimensions = 3

Multi-dimensional cube:Fact table view:

Page 18: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

Operators

Traditional selection aggregation ...

Analysis clean data find trends ...

CS 245 Notes1118

Relational

Cube

Page 19: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1119

Aggregates

sale prodId storeId date amtp1 c1 1 12p2 c1 1 11p1 c3 1 50p2 c2 1 8p1 c1 2 44p1 c2 2 4

• Add up amounts for day 1• In SQL: SELECT sum(amt) FROM SALE WHERE date = 1

81

Page 20: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1120

Aggregates

sale prodId storeId date amtp1 c1 1 12p2 c1 1 11p1 c3 1 50p2 c2 1 8p1 c1 2 44p1 c2 2 4

• Add up amounts by day• In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

ans date sum1 812 48

Page 21: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1121

Another Example

sale prodId storeId date amtp1 c1 1 12p2 c1 1 11p1 c3 1 50p2 c2 1 8p1 c1 2 44p1 c2 2 4

• Add up amounts by day, product• In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId

sale prodId date amtp1 1 62p2 1 19p1 2 48

drill-down

rollup

Page 22: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1122

Aggregates

Operators: sum, count, max, min, median, ave

“Having” clause Using dimension hierarchy

average by region (within store) maximum by month (within date)

Page 23: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1123

Cube Aggregation

day 2c1 c2 c3

p1 44 4p2 c1 c2 c3

p1 12 50p2 11 8

day 1

c1 c2 c3p1 56 4 50p2 11 8

c1 c2 c3sum 67 12 50

sump1 110p2 19

129

. . .

drill-down

rollup

Example: computing sums

Page 24: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1124

Cube Operators

day 2c1 c2 c3

p1 44 4p2 c1 c2 c3

p1 12 50p2 11 8

day 1

c1 c2 c3p1 56 4 50p2 11 8

c1 c2 c3sum 67 12 50

sump1 110p2 19

129

. . .

sale(c1,*,*)

sale(*,*,*)sale(c2,p2,*)

Page 25: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1125

c1 c2 c3 *p1 56 4 50 110p2 11 8 19* 67 12 50 129

Extended Cube

day 2 c1 c2 c3 *p1 44 4 48p2* 44 4 48

c1 c2 c3 *p1 12 50 62p2 11 8 19* 23 8 50 81

day 1

*

sale(*,p2,*)

Page 26: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1126

Aggregation Using Hierarchies

day 2c1 c2 c3

p1 44 4p2 c1 c2 c3

p1 12 50p2 11 8

day 1

region A region Bp1 56 54p2 11 8

customer

region

country

(customer c1 in Region A;customers c2, c3 in Region B)

Page 27: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1127

Data Analysis

Decision Trees Clustering Association Rules

Page 28: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1128

Decision Trees

sale custId car age city newCarc1 taurus 27 sf yesc2 van 35 la yesc3 van 40 sf yesc4 taurus 22 sf yesc5 merc 50 la noc6 taurus 25 la no

Example:• Conducted survey to see what customers were interested in new model car• Want to select customers for advertising campaign

trainingset

Page 29: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1129

One Possibility

sale custId car age city newCarc1 taurus 27 sf yesc2 van 35 la yesc3 van 40 sf yesc4 taurus 22 sf yesc5 merc 50 la noc6 taurus 25 la no

age<30

city=sf car=van

likely likelyunlikely unlikely

YY

Y

NN

N

Page 30: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1130

Another Possibility

sale custId car age city newCarc1 taurus 27 sf yesc2 van 35 la yesc3 van 40 sf yesc4 taurus 22 sf yesc5 merc 50 la noc6 taurus 25 la no

car=taurus

city=sf age<45

likely likelyunlikely unlikely

YY

Y

NN

N

Page 31: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1131

Issues

Decision tree cannot be “too deep” would not have statistically significant amounts of

data for lower decisions

Need to select tree that most reliably predicts outcomes

Page 32: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1132

Clustering

age

income

education

Page 33: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1133

Clustering

age

income

education

Page 34: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1134

Another Example: Text

Each document is a vector e.g., <100110...> contains words 1,4,5,...

Clusters contain “similar” documents Useful for understanding, searching

documents

internationalnews

sports

business

Page 35: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1135

Issues

Given desired number of clusters? Finding “best” clusters Are clusters semantically meaningful?

e.g., “yuppies’’ cluster? Using clusters for disk storage

Page 36: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1136

Association Rule Mining

tran1 cust33 p2, p5, p8tran2 cust45 p5, p8, p11tran3 cust12 p1, p9tran4 cust40 p5, p8, p11tran5 cust12 p2, p9tran6 cust12 p9

transactio

n

id custo

mer

id products

bought

salesrecords:

• Trend: Products p5, p8 often bough together• Trend: Customer 12 likes product p9

market-basketdata

Page 37: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1137

Association Rule

Rule: {p1, p3, p8} Support: number of baskets where these

products appear High-support set: support threshold s Problem: find all high support sets

Page 38: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

Implementation Issues

ETL (Extraction, transformation, loading) Getting data to the warehouse Entity Resolution

What to materialize? Efficient Analysis

Association rule mining ...

CS 245 Notes1138

Page 39: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1139

ETL: Monitoring Techniques

Periodic snapshots Database triggers Log shipping Data shipping (replication service) Transaction shipping Polling (queries to source) Screen scraping Application level monitoring

A

dvan

tage

s &

Dis

adva

ntag

es!!

Page 40: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1140

ETL: Data Cleaning

Migration (e.g., yen dollars) Scrubbing: use domain-specific knowledge (e.g.,

social security numbers) Fusion (e.g., mail list, customer merging)

Auditing: discover rules & relationships(like data mining)

billing DB

service DB

customer1(Joe)

customer2(Joe)

merged_customer(Joe)

Page 41: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

41

More details: Entity Resolution

N: a A: b CC#: c Ph: e

e1

N: a Exp: d Ph: e

e2

Page 42: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

42

Applications

comparison shopping mailing lists classified ads customer files counter-terrorism

N: a A: b CC#: c Ph: e

e1

N: a Exp: d Ph: e

e2

Page 43: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

Why is ER Challenging?

Huge data sets No unique identifiers Lots of uncertainty Many ways to skin the cat

43

Page 44: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

Taxonomy: Pairwise vs Global

Decide if r, s match only by looking at r, s? Or need to consider more (all) records?

44

Nm: Pat SmithAd: 123 Main StPh: (650) 555-1212

Nm: Patrick SmithAd: 132 Main StPh: (650) 555-1212

Nm: Patricia SmithAd: 123 Main StPh: (650) 777-1111

or

Page 45: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

Taxonomy: Pairwise vs Global

Global matching complicates things a lot! e.g., change decision as new records arrive

45

Nm: Pat SmithAd: 123 Main StPh: (650) 555-1212

Nm: Patrick SmithAd: 132 Main StPh: (650) 555-1212

Nm: Patricia SmithAd: 123 Main StPh: (650) 777-1111

or

Page 46: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

Taxonomy: Outcome

Partition of records e.g., comparison shopping

Merged records

46

Nm: Pat SmithAd: 123 Main StPh: (650) 555-1212

Nm: Patricia SmithAd: 123 Main StPh: (650) 555-1212 (650) 777-1111Hair: Black

Nm: Patricia SmithAd: 132 Main StPh: (650) 777-1111Hair: Black

Page 47: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

47

Taxonomy: Outcome Iterate after merging

Nm: TomWk: IBMOc: laywerSal: 500K

Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBM

Nm: ThomasAd: 123 MaimOc: lawyer

Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBMOc: lawyer

Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBMOc: lawyerSal: 500K

Page 48: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

Taxonomy: Record Reuse One record related to multiple entities?

48

Nm: Pat Smith Sr.Ph: (650) 555-1212

Ph: (650) 555-1212Ad: 123 Main St

Nm: Pat Smith Jr.Ph: (650) 555-1212

Nm: Pat Smith Sr.Ph: (650) 555-1212Ad: 123 Main St

Nm: Pat Smith Jr.Ph: (650) 555-1212Ad: 123 Main St

Page 49: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

Taxonomy: Record Reuse

Partitions

49

• Merges

r s t r

s

t

rs

st

Page 50: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

Taxonomy: Record Reuse

Partitions

50

• Merges

r s t r

s

t

rs

st

• Record reuse complex and expensive!

Page 51: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

51

Taxonomy: Multiple Entity Types

person 1 person 2

Organization B

Organization Abrother

member

business

member

Page 52: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

52

Taxonomy: Multiple Entity Types

p1

p2

p5

p7

a1

a2

a3

a5

a4

authors papers

same??

Page 53: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

53

Taxonomy: Exact vs Approximate

products

camerasresolvedcameras

CDs

books

...

resolvedCDs

resolvedbooks

...

ER

ER

ER

Page 54: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

54

Taxonomy: Exact vs Approximate

terrorists terroristssortby age

B Cooper 30match against

ages 25-35

Page 55: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

Implementation Issues

ETL (Extraction, transformation, loading) Getting data to the warehouse Entity Resolution

What to materialize? Efficient Analysis

Association rule mining ...

CS 245 Notes1155

Page 56: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1156

What to Materialize?

Store in warehouse results useful for common queries

Example:day 2

c1 c2 c3p1 44 4p2 c1 c2 c3

p1 12 50p2 11 8

day 1

c1 c2 c3p1 56 4 50p2 11 8

c1 c2 c3p1 67 12 50

c1p1 110p2 19

129

. . .

total sales

materialize

Page 57: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1157

Materialization Factors

Type/frequency of queries Query response time Storage cost Update cost

Page 58: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1158

Cube Aggregates Lattice

city, product, date

city, product city, date product, date

city product date

all

day 2c1 c2 c3

p1 44 4p2 c1 c2 c3

p1 12 50p2 11 8

day 1

c1 c2 c3p1 56 4 50p2 11 8

c1 c2 c3p1 67 12 50

129

use greedyalgorithm todecide whatto materialize

Page 59: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1159

Dimension Hierarchies

all

state

city

cities city statec1 CAc2 NY

Page 60: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1160

Dimension Hierarchies

city, product

city, product, date

city, date product, date

city product date

all

state, product, date

state, date

state, product

state

not all arcs shown...

Page 61: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1161

Interesting Hierarchy

all

years

quarters

months

days

weeks

time day week month quarter year1 1 1 1 20002 1 1 1 20003 1 1 1 20004 1 1 1 20005 1 1 1 20006 1 1 1 20007 1 1 1 20008 2 1 1 2000

conceptualdimension table

Page 62: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

Implementation Issues

ETL (Extraction, transformation, loading) Getting data to the warehouse Entity Resolution

What to materialize? Efficient Analysis

Association rule mining ...

CS 245 Notes1162

Page 63: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1163

Finding High-Support Pairs

Baskets(basket, item) SELECT I.item, J.item, COUNT(I.basket)

FROM Baskets I, Baskets JWHERE I.basket = J.basket AND I.item < J.itemGROUP BY I.item, J.itemHAVING COUNT(I.basket) >= s;

Page 64: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1164

Finding High-Support Pairs

Baskets(basket, item) SELECT I.item, J.item, COUNT(I.basket)

FROM Baskets I, Baskets JWHERE I.basket = J.basket AND I.item < J.itemGROUP BY I.item, J.itemHAVING COUNT(I.basket) >= s;

WHY?

Page 65: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1165

Example

basket itemt1 p2t1 p5t1 p8t2 p5t2 p8t2 p11... ...

basket item1 item2t1 p2 p5t1 p2 p8t1 p5 p8t2 p5 p8t2 p5 p11t2 p8 p11... ... ...

Page 66: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1166

Example

basket itemt1 p2t1 p5t1 p8t2 p5t2 p8t2 p11... ...

basket item1 item2t1 p2 p5t1 p2 p8t1 p5 p8t2 p5 p8t2 p5 p11t2 p8 p11... ... ...

check ifcount s

Page 67: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1167

Issues

Performance for size 2 rulesbasket item

t1 p2t1 p5t1 p8t2 p5t2 p8t2 p11... ...

basket item1 item2t1 p2 p5t1 p2 p8t1 p5 p8t2 p5 p8t2 p5 p11t2 p8 p11... ... ...

bigevenbigger!

Performance for size k rules

Page 68: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1168

Association Rules

How do we perform rule mining efficiently?

Page 69: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1169

Association Rules

How do we perform rule mining efficiently? Observation: If set X has support t, then

each X subset must have at least support t

Page 70: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1170

Association Rules

How do we perform rule mining efficiently? Observation: If set X has support t, then

each X subset must have at least support t For 2-sets:

if we need support s for {i, j} then each i, j must appear in at least s

baskets

Page 71: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1171

Algorithm for 2-Sets

(1) Find OK products those appearing in s or more baskets

(2) Find high-support pairs using only OK products

Page 72: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1172

Algorithm for 2-Sets

INSERT INTO okBaskets(basket, item) SELECT basket, item FROM Baskets GROUP BY item HAVING COUNT(basket) >= s;

Page 73: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1173

Algorithm for 2-Sets

INSERT INTO okBaskets(basket, item) SELECT basket, item FROM Baskets GROUP BY item HAVING COUNT(basket) >= s;

Perform mining on okBaskets SELECT I.item, J.item, COUNT(I.basket) FROM okBaskets I, okBaskets J WHERE I.basket = J.basket AND I.item < J.item GROUP BY I.item, J.item HAVING COUNT(I.basket) >= s;

Page 74: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1174

Counting Efficiently

One way:

basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...

threshold = 3

Page 75: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1175

Counting Efficiently

One way:

basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...

sort

basket I.item J.itemt3 p2 p3t3 p2 p8t1 p5 p8t2 p5 p8t3 p5 p8t2 p8 p11... ... ...

threshold = 3

Page 76: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1176

Counting Efficiently

One way:

basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...

sort

basket I.item J.itemt3 p2 p3t3 p2 p8t1 p5 p8t2 p5 p8t3 p5 p8t2 p8 p11... ... ...

count &remove count I.item J.item

3 p5 p85 p12 p18... ... ...

threshold = 3

Page 77: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1177

Counting Efficiently

Another way:

basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...

threshold = 3

Page 78: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1178

Counting Efficiently

Another way:

basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...

scan &count

count I.item J.item1 p2 p32 p2 p83 p5 p85 p12 p181 p21 p222 p21 p23... ... ...

keep counterarray in memory

threshold = 3

Page 79: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1179

Counting Efficiently

Another way:

basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...

remove count I.item J.item3 p5 p85 p12 p18... ... ...

scan &count

count I.item J.item1 p2 p32 p2 p83 p5 p85 p12 p181 p21 p222 p21 p23... ... ...

keep counterarray in memory

threshold = 3

Page 80: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1180

Counting Efficiently

Another way:

basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...

remove count I.item J.item3 p5 p85 p12 p18... ... ...

scan &count

count I.item J.item1 p2 p32 p2 p83 p5 p85 p12 p181 p21 p222 p21 p23... ... ...

keep counterarray in memory

threshold = 3

Page 81: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1181

Yet Another Way

basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...

(1)scan &hash &count

count bucket1 A5 B2 C1 D8 E1 F... ...

in-memoryhash table threshold = 3

Page 82: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1182

Yet Another Way

basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...

(1)scan &hash &count

count bucket1 A5 B2 C1 D8 E1 F... ...

in-memoryhash table threshold = 3

basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p5 p8t5 p12 p18t8 p12 p18... ... ...

(2) scan &remove

Page 83: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1183

Yet Another Way

basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...

(1)scan &hash &count

count bucket1 A5 B2 C1 D8 E1 F... ...

in-memoryhash table threshold = 3

basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p5 p8t5 p12 p18t8 p12 p18... ... ...

(2) scan &remove

false positive

Page 84: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1184

Yet Another Way

basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...

(1)scan &hash &count

count bucket1 A5 B2 C1 D8 E1 F... ...

in-memoryhash table threshold = 3

basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p5 p8t5 p12 p18t8 p12 p18... ... ...

(2) scan &remove

count I.item J.item3 p5 p81 p8 p115 p12 p18... ... ...

(3) scan& count

in-memorycounters

false positive

Page 85: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1185

Yet Another Way

basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p2 p3t3 p5 p8t3 p2 p8... ... ...

(1)scan &hash &count

count bucket1 A5 B2 C1 D8 E1 F... ...

in-memoryhash table threshold = 3

basket I.item J.itemt1 p5 p8t2 p5 p8t2 p8 p11t3 p5 p8t5 p12 p18t8 p12 p18... ... ...

(2) scan &remove

count I.item J.item3 p5 p85 p12 p18... ... ...

(4) removecount I.item J.item

3 p5 p81 p8 p115 p12 p18... ... ...

(3) scan& count

in-memorycounters

false positive

Page 86: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1186

Discussion

Hashing scheme: 2 (or 3) scans of data Sorting scheme: requires a sort! Hashing works well if few high-support pairs

and many low-support ones

Page 87: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CS 245 Notes1187

Discussion

Hashing scheme: 2 (or 3) scans of data Sorting scheme: requires a sort! Hashing works well if few high-support pairs

and many low-support ones

item-pairs ranked by frequency

fre

que

ncy

threshold

iceberg queries

Page 88: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

Implementation Issues

ETL (Extraction, transformation, loading) Getting data to the warehouse Entity Resolution

What to materialize? Efficient Analysis

Association rule mining ...

CS 245 Notes1188

Page 89: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

Extra: Data Mining in the InfoLab

CS 245 Notes1189

quarters

Recommendations in CourseRank

user q1 q2 q3 q4u1 a: 5 b: 5 d: 5 u2 a: 1 e: 2 d: 4 f: 3u3 g: 4 h: 2 e: 3 f: 3u4 b: 2 g: 4 h: 4 e: 4u a: 5 g: 4 e: 4

Page 90: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

Extra: Data Mining in the InfoLab

CS 245 Notes1190

quarters

Recommendations in CourseRank

user q1 q2 q3 q4u1 a: 5 b: 5 d: 5 u2 a: 1 e: 2 d: 4 f: 3u3 g: 4 h: 2 e: 3 f: 3u4 b: 2 g: 4 h: 4 e: 4u a: 5 g: 4 e: 4

u3 and u4 are similar to u Recommend h

Page 91: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

Extra: Data Mining in the InfoLab

CS 245 Notes1191

quarters

Recommendations in CourseRank

user q1 q2 q3 q4u1 a: 5 b: 5 d: 5 u2 a: 1 e: 2 d: 4 f: 3u3 g: 4 h: 2 e: 3 f: 3u4 b: 2 g: 4 h: 4 e: 4u a: 5 g: 4 e: 4

Recommend d (and f, h)

Page 92: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

Sequence Mining

Given a set of transcripts, use Pr[x|a]to predict if x is a good recommendationgiven user has taken a.

Two issues...

CS 245 Notes1192

Page 93: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

Pr[x|a] Not Quite Right

CS 245 Notes1193

transcript containing1 -2 a3 x4 a -> x5 x -> a

Pr[x|a] = 2/3

Pr[x|a~x] = 1/2

target user’s transcript:[ ... a .... || unknown ]

recommend x?

Page 94: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

User Has Taken >= 1 Course

User has taken T= {a, b, c} Need Pr[x|T~x] Approximate as Pr[x|a~x b~x c~x ] Expensive to compute, so...

CS 245 Notes1194

Page 95: Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11

CourseRank User Study

CS 245 Notes1195

good, expected

good, unexpected

per

cent

ag

e o

f ra

ting

s