1
ACCTG 6910Building Enterprise &
Business Intelligence Systems(e.bis)
ACCTG 6910Building Enterprise &
Business Intelligence Systems(e.bis)
Physical Data Warehouse Design
Olivia R. Liu Sheng, Ph.D.Emma Eccles Jones Presidential Chair of Business
Olivia R. Liu Sheng, Ph.D.Emma Eccles Jones Presidential Chair of Business
2
It’s all about trading storage for speed!
• Fundamentals• Aggregates (Ch. 16, pp. 356 - 357)• Indexes (Ch. 16, p. 357)
3
Fundamentals: the Storage Hierarchy
CPU
Cache
Memory
Disk
Storage Capacity
Small
Large
Access Speed
Slow
Fast
10-8 second
10-7 second
10-2 second
500-1000 MIPS
512 KB
512 MB
512 GB
4
Fundamentals: the Storage Hierarchy
CPU
Memory Disk
Disk Drive (I/O Channel)
Cache
Bus
How long does it take to query sales by city?How large is the Sales Fact table?How long does it take to access the Sales Fact table?
5
FundamentalsHow large is the fact table?e.g., 1 million records/day, 0.2KB/record 0.2
GB/day
SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES
CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY
PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME
TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER
reference
referenced by
reference
referenced by
reference
referenced by
6
Fundamentals
How long does it take to access all the fact records?
E.g., the small fact table is 1 Terabyte in size!
– 0.01s*1012=325 years LONG!!!!!!!!!!!!!
7
Fundamentals: the Storage Hierarchy
CPU
Memory Disk
Disk Drive (I/O Channel)
Cache
Bus
The logical unit of data transferred between disk and memory is block (e.g., 4k bytes)
8
Fundamentals
How long does it take to access all the fact records?
E.g., the small fact table is 1 Terabyte in size!
– Number of blocks: 2.5 millions– Access time = 0.01s*2500000= < 7
hrs!!!
9
Aggregate
• In data warehouse design, we choose the gain of fact table to be the possible lowest level.
SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES
CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY
PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME
TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER
reference
referenced by
reference
referenced by
reference
referenced by
Grain: orderline
10
Aggregate
• The reasons to choose the lowest level of fact: – (X) Analysts want to query on single
record
– (O) Analysts want to flexibly cut and group records.
11
Aggregate
• However, keeping the most detailed fact records could result in
– huge-size fact table: TeraBytes?! (1 million records/day, 256 Bytes/record
-> 0.2 GB/day)
– slow query
12
Aggregate
• To keep s data warehouse flexible, fact tables need to store facts in their lowest levels of detail.
• To improve query performance, another type of fact table which stores pre-computed summaries of detailed facts helps.
• Reduced to a logical design solution
13
Aggregate
• An aggregate fact table is a fact table that summarizes base-level fact table records
along one or several dimensions.• An aggregate dimension table is a
dimension table that summarizes base-level dimension table records.
• E.g., marketing managers check daily product sales by city --- aggregate by city in customer dimension
14
Aggregate
CUST_CITY# CITY_KEY* CITY* STATE
SALES_BY_CITY# TIME_KEY# PRODUCT_KEY# CITY_KEY* AVERAGE_PRICE_BY_CITY* TOTAL_QUANTITY_BY_CITY* TOTAL_SALES_BY_CITY
PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME
TIME# TIME_KEY* ORDERDATE* DAY_NUMBER_IN_MONTH
ref
ref
ref
ref
ref
refAggregate fact table
Aggregate dimension table
15
Aggregate
SALES_BY_CITY# TIME_KEY# PRODUCT_KEY# CITY_KEY* AVERAGE_PRICE_BY_CITY* TOTAL_QUANTITY_BY_CITY* TOTAL_SALES_BY_CITY
CUST_CITY# CITY_KEY* CITY* STATE
SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES
PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME
CUSTOMER# CUSTOMER_KEY* CID* CNAME* CITY* STATE
TIME# TIME_KEY* ORDERDATE* DAY_NUMBER_IN_MONTH
ref
ref
ref
ref
ref
ref
ref
ref
refref
ref
ref
16
Indexes
SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES
CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY
PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME
TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER
reference
referenced by
reference
referenced by
reference
referenced by
How long does it take to
find out the total purchase
Amt by Tom Jones?
17
Indexes
• Customer table– 1M records, each record 0.200 Kbytes
long– Block is 4K size, block access time is 0.01s– Number of records/block: 4/0.2=20 – Number of blocks: 1M/20=50K
• Sequential search– Time: 25K*0.01s=250s=4min.
18
Indexes
• Binary search– Time: log(50K)*0.01s=16*0.01s=0.16s
• B+ tree index– Create index pn on customer(cname);– If each node (block) in B+ tree has 117 keys, then
• # of access to indexes: log117(1M)=3 (i.e.height of the tree)
• # of access to Customer Dimension: 1• Total time = 4*0.01 = 0.04s
19
...(11 key values, 12 pointers)
...
B+-trees - P=12
Indexes to customer records
……….
Indexes to indexes
20
Indexes
SALES# TIME_KEY# PRODUCT_KEY# CUSTOMER_KEY* PRICE* QUANTITY* SALES
CUSTOMER# CUSTOMER_KEY* CID* CNAME* STATE* CITY
PRODUCT# PRODUCT_KEY* PID* PNAME* PCNAME
TIME# TIME_KEY* ORDERDATE* DAY_OF_WEEK* DAY_NUMBER_IN_MONTH* DAY_NUMBER_IN_YEAR* WEEK_NUMBER* MONTH* QUARTER* HOLIDAY_FLAG* FISCAL_YEAR* FISCAL_QUARTER
reference
referenced by
reference
referenced by
reference
referenced by
How long does it take to
find out the total sales of
Desktop computers?
21
Performance Improvement
• Suppose there are only 4 product categories for 1M products
• Create a B+ tree index???– Suppose the size of product category
and block ID is 10 bytes– Size of index = 1M * 10 = 10 M bytes
22
Performance Improvement
• A bitmap index for an attribute A is a collection of bit vectors, one for each possible value of A. The vector for value v has 1 in position i if the ith record has v for attribute A.
23
Bitmaps
Desktop 1 0 1
Notebook 0 1 0
Server 0 0 0
Accessory
0 0 0
Product record 1 record 2 record 3
A bitmap index for an attribute A is a collection of bit vectors, one for each possible value of A. The vector for value v has 1 in position i if the ith record has v for attribute A.
24
Performance Improvement
• Bitmap index is suitable for low cardinality attribute.– Cardinality(A) = # of possible values for A/#of records
• Compared with B+ tree index, bitmap index has the following advantages for low cardinality attributes– Storage space saving (1M*4/8=500K bytes)– Efficient for boolean operations
• CREATE BITMAP INDEX bitpc ON PRODUCT (PCNAME);