december 01-03, 2009 minneapolis, chicago, milwaukee 0 best practices to improve query performance...
TRANSCRIPT
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee1
Best Practices to Improve Query Performance in a Data Warehouse - 2
Calisto Zuzarte, STSM, IBM, [email protected]
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee2
Data Warehouse Life Cycle
Database design / Application design– The Warehouse Application architects and Database
Administrators work together to design the queries and schema before they put the application in production
Database performance layer implementation– In order to meet SLAs, DBAs usual go through some iterations
augmenting the database with performance layer objects and set up the initial configuration to get good performance
Database tuning operations– During production, with changing requirements and change in
data, there is on-going tuning required to keep operations smooth.
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee3
Motivation
Data warehouse environments characteristics:– Large volumes of data
• Millions/Billions of rows involved in some tables
• Large Joins
• Large Sorts,
• Large Aggregations
• Many tables involved
• Large amount of data rolled-in and rolled-out
– Complex queries • Report Queries
• Ad Hoc Queries
It is important to pay attention to query performance
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee4
Objective
Provide recommendations from a DB2 optimizer perspective to improve query performance through the Data Warehouse life cycle
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee5
Agenda
SESSION 1
Best Practices – Database Design
Best Practices – Application Design
Best Practices – Configuration and Operations
SESSION 2
Best Practices – Performance Layer
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee6
Best Practices – Performance Layer
Indexes
Statistics
Constraints
Materialized Query Tables
Replicated Tables
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee7
We want Index Only Access instead of more expensive ISCAN-FETCH or TSCAN (Table Scan)
We want to avoid SORTs particularly those that spill
We want to promote index-ORing and index-ANDing
We want to promote Star Joins
When you have range join predicates– Better performance with Nested Loop Join
When we want to avoid random I/O with better clustering
Indexes are useful when …
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee8
ISCAN
INDEXONLY
INDEX_ORing INDEX-ANDingLIST PREFETCH
FETCH|
ISCAN
FETCH|
RIDSCN|
SORT|
ISCAN
ISCANFETCH
|RIDSCN
SORT|
ISCAN
SORT|
ISCAN ISCAN
FETCH|
SORT(RID)|
IXAND
REGULAR
Types of Index Access
Minimize List Prefetch with REORG
Encourage Index-ANDing
Encourage Index-ORing
Encourage Index Only Access
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee9
Nested Loop Join Merge Join Hash Join
HSJOIN
outer
NLJOINMGJN
inner outer inner probe build
4030
1919
29
319
319
31219
79
0239
191940
33379
12191919
SORT/ Index SORT / Index
319
319
31219
79
51020151015
555
51020
55
155
1015
51020151015
555
4030
1919
29
3231233
3231233
3323123
Better With Index
Indexes Can Help NLJN and MGJN
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee10
HSJOIN /------------------+-----------------\ TBSCAN DTQ | | TABLE:SUPPLIER NLJOIN /------+------\ TBSCAN FETCH | /----+----\ SORT IXSCAN TABLE: PART | | FETCH P_PK /----+---\ RIDSCN TABLE: PARTSUPP | SORT | IXAND /-----------+-----------\ NLJOIN NLJOIN /------+-----\ /----+---\ TBSCAN IXSCAN BTQ IXSCAN | | | | TABLE: PART INDEX: PS_PK TBSCAN INDEX: PS_SK | TABLE: SUPPLIER Star Join requires fact table Indexes
Indexes Promote Star Joins
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee11
SORT Spills
The details for a SORT will indicate if the SORT spilled
The I/Os indicate that there was spilling associated with the SORT.
Minimize spills by considering indexes and (Also discussed later by balancing SORTHEAP, SHEAPTHRES and BUFFERPOOL) SORT
( 16) 6.14826e+06 1.30119e+06 | 3.65665e+07 TBSCAN ( 17) 2.00653e+06 1.14286e+06 | 3.74999e+07 TABLE: TPCD.ORDERS
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee12
Define Indexes-To get index-only access-To avoid SORT (spills)-To promote Index-OR / Index AND-To promote star joins-To better handle range join predicate-For point Queries (Nested Loop Join)-For better clustering (MDC Indexes)
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee13
Best Practices – Performance Layer
Indexes
Statistics
Constraints
Materialized Query Tables
Replicated Tables
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee14
Estimating the size of intermediate results is critical to getting good query execution plans
Without sufficient information, the optimizer can only guess based on some assumptions
Data skew and statistical correlation between multiple column values introduce uncertainty
Cardinality Estimation
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee15
Cardinality Estimation With and Without Distribution Statistics
Distribution statistics are exploited ONLY when predicates use constantsThey make a significant difference if you have skew
SELECT * FROM T WHERE C1 = 300
Equality Predicate (no distribution Statistics)• CARD = 1000, COLCARD = 100• FF = 1/COLCARD = 1/100 = 0.01 (1% of the rows)• Cardinality Estimate = 0.01 * 1000 = 10 rows
Equality Predicate with distribution • Value 300 has VALCOUNT 200• Cardinality = 200
0
50
100
150
200
250
45 300 50 200 250 10 20 50 30 280 … …
Frequency
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee16
Collect Distribution Statistics
when you have skew and predicates with constants
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee17
Quantile or Histogram Statistics – Date Columns
Table size ~400,000 ROWS
4 buckets each ~100,000 ROWS (number of quantiles = 4)
Assume 4 months per bucket
20080601 20081001 20090201 20090601 20101001
WHERE DateCol BETWEEN 20081115 AND 20081101
~100,000
Cardinality Estimate = ((20081115 – 20081101) / (20090201 – 20081001)) * 100000 = (14 / 9200 ) * 100000 = ~152 ROWS
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee18
Quantile or Histogram Statistics – Date Columns
Table size ~400,000 ROWS
16 buckets each ~25,000 ROWS (number of quantiles = 16)
20080601 20081001 20090201 20090601 20101001
20081101 20081201 20090101
~25,000
WHERE DateCol BETWEEN 20081115 AND 20081101
Cardinality Estimate = ((20081115 – 20081101) / (20081201 - 20081101)) * 25000 = (14 / 100 ) * 25000 = ~3500 ROWS
Could be more accurate if 20081130
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee19
Collect a large number of Quantile Statistics
on DATE columns
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee20
Column Group Statistics
Example: WHERE COUNTRY = ‘Germany’ And CITY = ‘Frankfurt’
No CGS: Selectivity = ½ * 1/3 = 1/6 …Estimate 1 row
With CGS: Selectivity = 1/3 …Estimate 2 rows
Country City Hotel Name
Germany Bremen Hilton
Germany Bremen Best Western
Germany Frankfurt InterCity
Germany Frankfurt Shangri-La
Canada Toronto Four Seasons
Canada Toronto Intercontinental
1
2
3
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee21
Column Group Statistics
The DB2 optimizer can exploit column group statistics for better estimates
SELECT … FROM dw.hotel H
WHERE H.country = ? and H.city = ?
RUNSTATS ON TABLE dw.hotel … ON COLUMNS (… (country, city) … ) …
If you have an index on (country, city) we have FIRST2KEYCARD statistic so no need of CGS
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee22
Collect Column Group Statistics
when you have multiple equality predicates on the
same table on columns that might be correlated
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee23
Join Cardinality Estimation
Join cardinality estimation is typically based on inclusion and uniformity
PRODUCT.PROD_ID = SALES.PROD_ID– PRODUCT Table
• 20000 rows• 10000 distinct PROD_IDs
– SALES• 100000 rows• 5000 distinct PROD_IDs
– Join Cardinality• 20000*100000 / MAX(10000, 5000) = 200000
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee24
CUSTID CNAME
1 ‘ABC’
2 ‘DEF’
3 ‘GHI’
4 ‘IBM’
5 ‘JKL’
6 ‘MNO’
7 ‘PQR’
… …
… …
100 ‘XYZ’
CUSTID # of Rows
4 2000000
10 700000
8 500000
63 300000
72 100000
9 50000
12 20000
… …
… …
5 5
CUST Table Frequency Statistics – SALES Table
SELECT … FROM SALES, CUSTWHERE CUST.CNAME = ‘IBM’ AND CUST.CUSTID = SALES.CUSTID
Cardinality Estimate with Uniformity : Cardinality Estimate of CUST after local predicate = (1/100 * 100) = 1Join Card Estimate= (1 * 10000000) / MAX(1, 100) = 100000Actual Cardinality : 2000000 !!!!!!!!!!!!!!!!!!!!
100 rows100 distinct values
10000000 rows100 distinct values
Problem Scenario - Skew
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee25
What Are Statistical Views
Views with statistics that can be used by the DB2 optimizer
If appropriate for a query, these statistics can be exploited to get better cardinality estimates
The view need not be directly referenced in the query - matched like MQTs
Single table or simple join views works best
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee26
Creating Statistical Views
Create a view like a regular view :CREATE VIEW cust_fact AS(SELECT C.* FROM CUSTOMER C, FACT F
WHERE C.CUST_ID = F.CUST_ID)
Alter the view to make it a statistical viewALTER VIEW cust_fact ENABLE QUERY OPTIMIZATION
Collecting StatisticsRUNSTATS ON TABLE dba.cust_fact WITH DISTRIBUTION
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee27
Define Statistical Views
when you have skew on join columns
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee28
Best Practices – Performance Layer
Indexes
Statistics
Constraints
Materialized Query Tables
Replicated Tables
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee29
Referential Integrity (RI)
Facilitates aggregation push down– Example follows
Eliminates redundant joins in views– RI helps determine that queries that do not require data from a
primary key table need not do that join even if it is in the view
Helps with Materialized Query Table matching– Allows Queries to match MQTs with more dimension table joins
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee30
Exploiting Referential Integrity Constraints
Find the top 20 stores in terms of total sales, and include the store name and city information
SELECT st.store_id, st.name, st.city, sum(F.sales) as sm FROM salesF as F, store as ST WHERE F.store_id = ST.store_id GROUP BY st.store_id, st.name, st.city ORDER BY sm desc FETCH FIRST 20 ROWS ONLY;
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee31
Exploiting Referential Integrity Constraints
f.store_id=st.store_id /* FK=PK */
sum(sales) as sm group by store_id, name, city
store_id, name, city, sm
order by sm descfetch first 20 rows only
store_id, name, city
2,000 rows store_id, sales
1,000,000 rows
JOIN
Sales (F) Store (ST)
GROUPBY
SORT
SELECT
2,000 rows
20 rows
1,000,000 rows
20 rows
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee32
Exploiting Referential Integrity Constraints
f.store_id=st.store_id /* FK=PK */
sum(sales) as sm group by store_id
store_id, name, city, sm
order by sm descfetch first 20 rows only
store_id, name, city
2,000 rows
GROUPBY
Sales (F) Store (ST)
JOIN
SORT
SELECT
2,000 rows
store_id, sales
1,000,000 rows
20 rows
20 rows
2,000 rows
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee33
Exploiting Referential Integrity Constraints
f.store_id=st.store_id /* FK=PK */
sum(sales) as sm group by store_id
store_id, name, city, sm
order by sm descfetch first 20 rows only
store_id, name, city
2,000 rows
GROUPBY
Sales (F) Store (ST)
SORT
JOIN
SELECT
2,000 rows
store_id, sales
1,000,000 rows
20 rows
20 rows
20 rows
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee34
Define Referential Integrity
Whenever possible(Informational if you know for sure)
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee35
Best Practices – Performance Layer
Indexes
Statistics
Constraints
Materialized Query Tables
Replicated Tables
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee36
Consider Materialized Query Tables
Fact
GB
JOIN
Dim2
Dim1
JOIN
Fact
GB
JOIN
Dim2
Dim1
JOIN
Fact
GB
JOIN
Dim2
Dim1
JOIN
Fact
GB
JOIN
Dim2
Dim1
JOIN
Joe’s Query
Sue’s Query
Bob’s QueryFact
GB
JOIN
Dim2
Dim1
JOIN
MQT
Q9 Q9
Bob’s QSue’s
Joe’s Q
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee37
Best Practices - Defining Materialized Query Tables
Compute and use the dimension lattice to assist in MQT selection. Determine cardinality of aggregates that appear to be good candidates by executing COUNT queries against base tables.
Try to achieve at least a 10X reduction in size between fact and the MQT
Build MQTs with a reasonable number of GROUP BY columns (3 to 4 dimension keys) at a time based on query patterns
As far as possible build the MQT from the fact table alone. This is easier to manage although the dimension tables may need to be joined for most queries.
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee38
Best Practices - MQT Matching
Define Referential Integrity to help with matching MQTs that contain more tables than the queries
Define Functional Dependencies for thinner MQTs
Use COUNT_BIG instead of COUNT for DPF MQTs
Define indexes on MQTs
Keep statistics up-to-date
Include as many Foreign Key Columns as the size allows in the MQT
– Allows joins of dimension tables when needed
Define base table columns NOT NULL as far as possible– For example we can match SUM(A + B) with SUM(A) + SUM(B)
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee39
Best Practices – MQT Maintenance
REFRESH IMMEDIATE– Create an index on the GROUP BY columns
– If not aggregating, create the index on the set of columns that form a unique key
– If possible, define the base table columns used in the GROUP BY clause of REFRESH IMMEDIATE MQTs as NOT NULL
– Define RI relationships wherever applicable
– Always keep the base table and MQT statistics up-to-date
REFRESH DEFERRED– If log space is an issue, consider ALTER TABLE … NOT LOGGED
INITIALLY
– An MQT can be temporarily toggled into a regular table by using • ALTER TABLE … DROP MATERIALIZED QUERY
• ALTER TABLE … ADD MATERIALIZED QUERY
– For Full REFRESH, consider LOAD FROM CURSOR.
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee40
Best Practices MQTs
Define Referential Integrity Make FK columns
NOT NULL
Define FunctionalDependencies
Use COUNT_BIG(*) in DPFConsider GROUPING SETS,
ROLLUP or CUBE
10X Reduction ?
Indexes and RUNSTATS
Include FK columns in the MQT
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee41
Best Practices – Performance Layer
Indexes
Statistics
Constraints
Materialized Query Tables
Replicated Tables
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee42
Replicated Tables
Replicate dimension tables (unless collocated with fact table)
Benefit : Avoids data movement, Avoids too many agents
Important : Define suitable indexes
If too large, replicate a subset of frequently used columns and/or rows
BTQ
JOIN
BTQ
JOIN
SALES SALESCUST SALES
BTQ
JOIN
CUSTCOPY
CUSTCOPY
CUSTCOPY
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee43
Largest dimension table- Collocate with the fact table
All other dimension tables-Single partition and replicated
- At least the HOT columns
December 01-03, 2009 •Minneapolis, Chicago, Milwaukee44
Summary
Session 2 : Best practices to improve data warehouse query performance related to the performance layer
– Indexes
– Statistics (Column Group Statistics and Statistical Views)
– Referential Integrity
– Materialized Query Tables
– Replicated Tables
Recap : Session 1 covered Database Design, Application Design, Configuration and Operations
– These include considerations related to Parallelism, Partitioning, Schema, Application queries, Configuration
45
© Copyright IBM Corporation [current year]. All rights reserved.U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR SOFTWARE.
Please update paragraph below for the particular product or family brand trademarks you mention such as WebSphere, DB2, Maximo, Clearcase, Lotus, etc
IBM, the IBM logo, ibm.com, [IBM Brand, if trademarked], and [IBM Product, if trademarked] are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
If you have mentioned trademarks that are not from IBM, please update and add the following lines:
[Insert any special 3rd party trademark names/attributions here] Other company, product, or service names may be trademarks or service marks of others.
Disclaimer