1 acctg 6910 building enterprise & business intelligence systems (e.bis) physical data warehouse...

24
1 ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis) Physical Data Warehouse Design Olivia R. Liu Sheng, Ph.D. Emma Eccles Jones Presidential Chair of Business

Post on 21-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

1

ACCTG 6910Building Enterprise &

Business Intelligence Systems(e.bis)

ACCTG 6910Building Enterprise &

Business Intelligence Systems(e.bis)

Physical Data Warehouse Design

Olivia R. Liu Sheng, Ph.D.Emma Eccles Jones Presidential Chair of Business

Olivia R. Liu Sheng, Ph.D.Emma Eccles Jones Presidential Chair of Business

2

It’s all about trading storage for speed!

• Fundamentals• Aggregates (Ch. 16, pp. 356 - 357)• Indexes (Ch. 16, p. 357)

3

Fundamentals: the Storage Hierarchy

CPU

Cache

Memory

Disk

Storage Capacity

Small

Large

Access Speed

Slow

Fast

10-8 second

10-7 second

10-2 second

500-1000 MIPS

512 KB

512 MB

512 GB

4

Fundamentals: the Storage Hierarchy

CPU

Memory Disk

Disk Drive (I/O Channel)

Cache

Bus

How long does it take to query sales by city?How large is the Sales Fact table?How long does it take to access the Sales Fact table?

5

FundamentalsHow large is the fact table?e.g., 1 million records/day, 0.2KB/record 0.2

GB/day

SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES

CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY

PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME

TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER

reference

referenced by

reference

referenced by

reference

referenced by

6

Fundamentals

How long does it take to access all the fact records?

E.g., the small fact table is 1 Terabyte in size!

– 0.01s*1012=325 years LONG!!!!!!!!!!!!!

7

Fundamentals: the Storage Hierarchy

CPU

Memory Disk

Disk Drive (I/O Channel)

Cache

Bus

The logical unit of data transferred between disk and memory is block (e.g., 4k bytes)

8

Fundamentals

How long does it take to access all the fact records?

E.g., the small fact table is 1 Terabyte in size!

– Number of blocks: 2.5 millions– Access time = 0.01s*2500000= < 7

hrs!!!

9

Aggregate

• In data warehouse design, we choose the gain of fact table to be the possible lowest level.

SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES

CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY

PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME

TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER

reference

referenced by

reference

referenced by

reference

referenced by

Grain: orderline

10

Aggregate

• The reasons to choose the lowest level of fact: – (X) Analysts want to query on single

record

– (O) Analysts want to flexibly cut and group records.

11

Aggregate

• However, keeping the most detailed fact records could result in

– huge-size fact table: TeraBytes?! (1 million records/day, 256 Bytes/record

-> 0.2 GB/day)

– slow query

12

Aggregate

• To keep s data warehouse flexible, fact tables need to store facts in their lowest levels of detail.

• To improve query performance, another type of fact table which stores pre-computed summaries of detailed facts helps.

• Reduced to a logical design solution

13

Aggregate

• An aggregate fact table is a fact table that summarizes base-level fact table records

along one or several dimensions.• An aggregate dimension table is a

dimension table that summarizes base-level dimension table records.

• E.g., marketing managers check daily product sales by city --- aggregate by city in customer dimension

14

Aggregate

CUST_CITY# CITY_KEY* CITY* STATE

SALES_BY_CITY# TIME_KEY# PRODUCT_KEY# CITY_KEY* AVERAGE_PRICE_BY_CITY* TOTAL_QUANTITY_BY_CITY* TOTAL_SALES_BY_CITY

PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME

TIME# TIME_KEY* ORDERDATE* DAY_NUMBER_IN_MONTH

ref

ref

ref

ref

ref

refAggregate fact table

Aggregate dimension table

15

Aggregate

SALES_BY_CITY# TIME_KEY# PRODUCT_KEY# CITY_KEY* AVERAGE_PRICE_BY_CITY* TOTAL_QUANTITY_BY_CITY* TOTAL_SALES_BY_CITY

CUST_CITY# CITY_KEY* CITY* STATE

SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES

PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME

CUSTOMER# CUSTOMER_KEY* CID* CNAME* CITY* STATE

TIME# TIME_KEY* ORDERDATE* DAY_NUMBER_IN_MONTH

ref

ref

ref

ref

ref

ref

ref

ref

refref

ref

ref

16

Indexes

SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES

CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY

PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME

TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER

reference

referenced by

reference

referenced by

reference

referenced by

How long does it take to

find out the total purchase

Amt by Tom Jones?

17

Indexes

• Customer table– 1M records, each record 0.200 Kbytes

long– Block is 4K size, block access time is 0.01s– Number of records/block: 4/0.2=20 – Number of blocks: 1M/20=50K

• Sequential search– Time: 25K*0.01s=250s=4min.

18

Indexes

• Binary search– Time: log(50K)*0.01s=16*0.01s=0.16s

• B+ tree index– Create index pn on customer(cname);– If each node (block) in B+ tree has 117 keys, then

• # of access to indexes: log117(1M)=3 (i.e.height of the tree)

• # of access to Customer Dimension: 1• Total time = 4*0.01 = 0.04s

19

...(11 key values, 12 pointers)

...

B+-trees - P=12

Indexes to customer records

……….

Indexes to indexes

20

Indexes

SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES

CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY

PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME

TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER

reference

referenced by

reference

referenced by

reference

referenced by

How long does it take to

find out the total sales of

Desktop computers?

21

Performance Improvement

• Suppose there are only 4 product categories for 1M products

• Create a B+ tree index???– Suppose the size of product category

and block ID is 10 bytes– Size of index = 1M * 10 = 10 M bytes

22

Performance Improvement

• A bitmap index for an attribute A is a collection of bit vectors, one for each possible value of A. The vector for value v has 1 in position i if the ith record has v for attribute A.

23

Bitmaps

Desktop 1 0 1

Notebook 0 1 0

Server 0 0 0

Accessory

0 0 0

Product record 1 record 2 record 3

A bitmap index for an attribute A is a collection of bit vectors, one for each possible value of A. The vector for value v has 1 in position i if the ith record has v for attribute A.

24

Performance Improvement

• Bitmap index is suitable for low cardinality attribute.– Cardinality(A) = # of possible values for A/#of records

• Compared with B+ tree index, bitmap index has the following advantages for low cardinality attributes– Storage space saving (1M*4/8=500K bytes)– Efficient for boolean operations

• CREATE BITMAP INDEX bitpc ON PRODUCT (PCNAME);