statistics enhancements teradata 14

Teradata 14 Statistics Enhancement

1

TERADATA 14 STATISTICS

ENHANCEMENT

By: Smarak Das Employee Id: 391485 Project: Kaiser Permanente Role: Teradata DBA


2

1. INTRODUCTION

The essential task of the optimizer is to produce the optimal execution plan (i.e., with

lowest cost) among many possible plans. The estimations in the Teradata Database are

derived primarily from statistics on tables, columns and indexes. But, collecting statistics

involves time and resources. So, it is important for the users of the Teradata Database to

understand the new statistics enhancements and make best use of them to keep the cost of

statistics collection and maintenance to the minimum.

2. AUDIENCE The targeted audience for this document is experienced Teradata Database administrators

and those with a background in query tuning and physical database design. However, the

content is designed to be readily understood by any reader with a reasonable background

in database technologies. This Document has been prepared by referencing several

Teradata Guides or Orange Books.


3

3. NEW STATISTICS COLLECTION OPTIONS

3.1 COLLECT SUMMARY Statistics

A new option called SUMMARY is introduced to collect only the table-level statistical

information such as row count, average block size, average row size, etc. without the

histogram detail. This option can be used to provide up-to-date summary information to

the optimizer in a quick and efficient way. This runs very quickly with negligible impact

to the system resources.

Note that SUMMARY statistics are different from regular column or index statistics and

doesn’t contain any histogram. When SUMMARY option is specified in a collect

statistics statement, no column or index specification is allowed.

The SUMMARY statistics are automatically collected when statistics are collected or

recollected on any column or index. Therefore, it is not required to explicitly submit a

request to collect summary statistical information when statistics are being collected or

recollected for individual columns or indexes.

The optimizer depends on the summary statistical information to estimate the table row

count and cost to do a full table scan of the table. For a row count estimate, it depended

on the primary index and/or PARTITION statistics in releases prior to Teradata Database

14.0.

SYNTAX:

COLLECT SUMMARY STATISTICS ON Orders;

3.2 SAMPLE STATISTICS

The enhanced SAMPLE option allows the users to customize sample percentage for

different columns. The recommended approach to collect sample statistics is to use the

system determined sample percentage by specifying the SYSTEM SAMPLE option. If

the final statistics are not reasonably accurate with the system-determined sample

percentage, this new capability can be used to customize the sample percentage. The

sampling options are remembered when explicitly specified and are applied during

recollections.


4

The following are the different variations of the sampling options:

SYNTAX:

COLLECT STATISTICS USING SYSTEM SAMPLE COLUMN EmployeeID ON Employee;

COLLECT STATISTICS USING SAMPLE 20 PERCENT COLUMN EmployeeID ON Employee;

3.3 MAX INTERVALS

The maximum number of intervals of a histogram can be customized using the

MAXINTERVALS option. If this option is not specified, the system uses a default

maximum number of intervals which is defined as 250.

A greater number of intervals increases the granularity of the data in the histogram

and helps to get better single-table and join selectivity estimations for non-uniform

data. However, this should be done selectively as needed (for columns involved in a

predicate exhibiting over or under estimations) as it increases the size of the

histogram which can increase the query optimization time.


5

3.4 MAXVALUELENGTH

Histogram records values such as min, max, mode and biased values. The

MAXVALUELENGTH refers to the maximum value length that can be used to build

these values. If the value length is larger than the system-determined or specified

maximum, it gets truncated. Note that in prior releases of the Teradata Database 14.0,

the maximum value length could not exceed 16 bytes.

Increase the maximum value length for columns that require more detailed

information in the histogram to improve single-table predicate selectivity estimations.

However, this should be done selectively as needed as it increases the size of the

histogram which can increase the query optimization time.

For example, if a UserId column has the same characters for the first 25 characters for

all users, the default value length truncates the data to the first 25 characters which

makes the histogram unreliable for estimations. Increasing the value length to the

length which can uniquely identify each UserId greatly improves the selectivity for

the single-table predicates based on this column.

If this option is not specified, the system uses a default maximum value length which

is defined as 25. For single-column statistics of non-LATIN type, the default is used

as 25 Unicode characters which translate to 50 bytes. In all other cases, it is used as

25 bytes.

SYNTAX:

COLLECT STATISTICS COLUMN EmployeeID ON Employee;

COLLECT STATISTICS USING SYSTEM MAXVALUELENGTH COLUMN EmployeeID ON

Employee;

COLLECT STATISTICS USING MAXVALUELENGTH 50 COLUMN EmployeeID ON Employee;

3.5 COLUMN ORDERING FOR MULTICOLUMN STATISTICS

The column ordering specified in the collect statistics statement is honored with the

exception of the statistics collected with INDEX specification. Column ordering is

useful when collecting multicolumn statistics on columns having single-table and join

predicates or group by clause. It is recommended to use columns having single-table

predicates as leading columns. In other cases, the ordering of the columns does not

have any relevance.


6

COLLECT STATISTICS COLUMN (EmployeeID, EmployeeName)

ON EMPLOYEE;

COLLECT STATISTICS COLUMN (EmployeeName, EmployeeID)

ON EMPLOYEE; [Not Allowed as ”Column” Type Stats Collection]

COLLECT STATISTICS INDEX (EmployeeID, EmployeeName)

ON EMPLOYEE;

COLLECT STATISTICS INDEX (EmployeeName, EmployeeID)

ON EMPLOYEE; [Allowed as “INDEX” Type Stats Collection]

3.6 COLLECTING PARTITION STATISTICS

Statistics can be collected on the system-derived column PARTITION as before.

When statistics are collected on PARTITION on row and column partitioned tables,

the system gathers the following two histograms.

(1) A row partitioning histogram with combined partition number and the

corresponding number of active rows.

(2) A column partitioning histogram with the column partition number and the

corresponding compression ratio for each column partition.

The collection of PARTITION statistics is highly optimized and runs quickly with

minimal impact to the system resources. The collection of PARTITION statistics is

highly recommended on all partitioned tables.

PARTITION statistics can also be collected on non-partitioned tables. All the rows

are considered to be in a single partition (with PARTITION number of 0) for non-

partitioned tables. This also results in updating the table-level demographics (as noted

before, collecting statistics on any column automatically updates the table-level

demographics). The only advantage of this operation is to update the table-level

demographics indirectly. In Teradata Database 14.0, to provide up-to-date table row

count information to the optimizer, use SUMMARY statistics collection instead of

PARTITION statistics on non-partitioned tables.


7

4. DROP STATISTICS When the statistics are dropped on the table by explicit column or index specification,

the summary statistical information is not dropped. Only the table-level drop

statement can drop the summary statistical information. This is done to retain the

summary statistical information of the table which is useful to the optimizer even

when no column or index statistics are present

SYNTAX:

DROP STATS ON customer;

5. HELP STATISTICS The HELP STATISTICS statement displays the summary of statistical information. It

is enhanced to display the summary statistical information with column name “*”. A

new CURRENT clause is introduced to display the extrapolated statistical

information.


8

6. SHOW STATISTICS

The SHOW STATISTICS statement displays the data definition statement more recently used to

collect the statistics with the most recent USING options in effect.


9

7. STATISTICS COLLECTION PERFORMANCE IMPROVEMENTS

Collecting statistics is resource intensive in terms of CPU and IO. The statistics collection

process has been enhanced with various optimization techniques to reduce the resource

consumption. The new optimizer logic analyzes all the applicable access paths to read the base

table data during statistics collection and chooses the optimal one based on the cost.

These optimizations are automatic and are cost-based. When statistics are collected the first time,

some of the optimizations may not take place as the optimizer may not have the required

statistical information to do the costing. The optimizer will be more aggressive in considering all

the applicable optimizations during statistics recollections.

(a) ROLLUP AGGREGATIONS

When statistics are requested on both a single-column and a multicolumn on the same set of

columns, the aggregation result used to produce multicolumn statistics is reused and rolled up to

produce the single-column statistics.

For example, if statistics are requested on (Employee_ID, Employee_Name), (Employee_Name)

and (Employee_ID), the optimizer performs an aggregation based on (Employee_ID,

Employee_Name) first and uses it to roll up to individual columns Employee_ID and

Employee_Name.

(b) PRE-AGGREGATIONS

When multiple statistics are requested in a single-statement, the optimizer considers the option of

pre-aggregations followed by a rollup to the specified column combinations. This avoids reading

the base table multiple times and also helps the subsequent aggregations as they read smaller

data set since the initial pre-aggregations collapses the data based on the pre-aggregation

grouping columns.


10

8. STATISTICS CACHE

In pre-Teradata Database 14.0 releases, statistics are cached in the general purpose dictionary

cache. Since it is expected there will be an increase in histogram size with the new options such

as increased number of intervals and value length, a new dedicated cache is implemented for

statistics to keep the query parsing time low.

9. STALE STATISTICS EXTRAPOLATIONS

Statistics extrapolation functionality is designed to reduce the frequency of statistics

recollections. Even though recollecting the statistics at regular intervals is still required to get

optimal plans, this feature is designed to eliminate the need to do expensive statistics

recollections on a daily basis particularly on DATE and TIMESTAMP columns.

Understanding how the optimizer detects stale statistics and does extrapolations helps Teradata

Database users to take full advantage of this feature. This is especially important because

extrapolations are predictions based on what is known to and assumed by the optimizer about the

newly added data. Even though extrapolations provide better estimates for a majority of the

scenarios, they may not help in all scenarios.

The optimizer attempts to extrapolate the following statistical information.

(1) Table row count

(2) Stale Histograms

(a) Number of distinct values

(b) Number of nulls

(c) High mode frequency

(d) Maximum value of the histogram

Note that the extrapolation works for only upward data growth. If the table size (number of rows)

is constant or smaller, the extrapolations may not happen.


11

10. GUIDELINES FOR USE

Use the following guidelines to optimize the collect statistics processing and to simplify the use

of various collect statistics operations.

When multiple statistics need to be collected, it is recommended to group the statistics

into a single collect statistics statement. For first time collections, group all the statistics

that have the same USING options. For recollections, use table-level collect statistics

command without any column or index specification. This enables the optimizer to apply

global optimizations such as pre-aggregation and aggregation rollups to speed to up the

collection process.

Use SHOW STATISTICS statement to export the statistics from one system and import

them in another system in dual active environments. This allows the system resources to

be consumed in only one system.

Do not copy or transfer SUMMARY statistics between tables within a system or across

the systems. It is recommended to recollect them natively.

Do not copy or transfer the PARTITION statistics between tables within a system or

across the systems when their partitioning expressions don’t match or when the database

issues a warning (in some cases the internal PARTITION numbers can differ even with

the same partitioning expression; database identifies these scenarios and issues a warning

when attempted to copy within a system). It is recommended to recollect the

PARTITION statistics natively in these scenarios which runs very quickly.

When a pattern of rows within a column changes completely (an example can be a new

set of product codes in the beginning of the year) or when the data has sudden highs or

lows (not following the normal pattern), the history records from the histogram may not

correctly predict the nature of the newly added data and can potentially cause

unreasonable extrapolations. It is recommended to drop the statistics (which also gets rid

of the history records), and recollect the statistics.

Recollect SUMMARY statistics after the data loads. This command runs quickly with

negligible impact to the system resources and provides up-to-date row count to the

optimizer. The accurate row count is a crucial piece of information for the optimizer to

identify stale statistics and to enable extrapolations.

statistics enhancements teradata 14

Documents