cs 345: topics in data warehousing thursday, october 21, 2004

CS 345:Topics in Data Warehousing

Thursday, October 21, 2004

Review of Tuesday’s Class• Database System Architecture

– Memory management– Secondary storage (disk) – Query planning process

• Joins– Nested Loop Join– Merge Join– Hash Join

• Grouping– Sort vs. Hash

Outline of Today’s Class

• Indexes– B-Tree and Hash Indexes– Clustered vs. Non-Clustered– Covering Indexes

• Using Indexes in Query Plans

• Bitmap Indexes– Index intersection plans– Bitmap compression

Indexes

• Provide efficient access to relevant records– Based on values of particular attribute(s)

• Same idea as index in back of a book• “fact tables 16, 17, 49”

– Information about fact tables on pages 16, 17, and 49– No information about fact tables on other pages– Without an index, we’d have to look through the whole

book page by page

Typical Index Structure

• Indexes organized based on some search key– Column (or set of columns) whose values are used to

access the index– Organization can be sorting or hashing

• Index is built for some relation– One index entry per record in the relation

• Index consists of <Value, RID> pairs– Value = value of the search key for this record– RID = record identifier

• Tells the DBMS where the record is stored• Usually (page number, offset in page)

Sorted Index

• Index entries usually much smaller than records– Record has many attributes besides search key

• Build search tree on top of index entries– Allows particular value to be located quickly

2 4 4 5 7 8

2 5

B-Tree Index

• By far the most common type of index• Sorted index with search tree• Good for point queries and range queries

– Point query: A = 5– Range query: A BETWEEN 5 AND 10

• Search tree nodes are page-sized– Contain <Value, Pointer> pairs– Each Pointer is to a node of the level below

• Trade-off in choosing index page sizes– Larger pages → fewer search tree levels → fewer

page reads– Larger pages → each page read takes longer

Hash Indexes

• Useful for point queries– Slightly better performance than B-Trees– Not useful for range queries

• Less widely supported than B-Trees

Alternate B-Tree Organization

• Many records with same search key causes redundancy– <Stanford,RID1>,<Stanford,RID2>,

<Stanford,RID3>,<Stanford,RID4>

• Can store RID-lists instead– <Stanford, (RID1,RID2,RID3,RID4)>– Each value occurs once in the index– Index entry is <Value,RID-list> instead of

<Value,RID>– Saves space when search key has many repeated

values

Clustered Indexes• An index is clustered (or “clustering”) if records in the

relation are organized based on index search key• Clustered indexes are good because:

– Records satisfying a range query are packed onto a small number of consecutive pages

• In unclustered indexes, by contrast:– Records satisfying a range query are spread across a large

number of random pages– Commingled with other records that do not satisfy the query

• Only one clustered index allowed per relation– A relation can’t be simultaneously sorted by 2 different attributes– (Unless there are multiple copies of the relation)

Clustered vs. Unclustered

2 4 4 5 7 8

4

2

7

4

5

8

2 5

2 4 4 5 7 8

2

5

4

7

4

8

2 5Clustered

Unclustered

SequentialReads

RandomReads

Comparing Access Plans

• Consider query “SELECT * FROM R WHERE A=5”• Three query plans:

– Scan relation R• Sequential read of all pages in R• Regardless of how many tuples have A=5

– Use clustered index on A• Sequential read of relevant pages in R• Num. relevant pages = (# of tuples with A=5) / (# of tuples per page)• Plus overhead of accessing index pages

– Use unclustered index on A• Random read of relevant pages in R• Number of relevant pages = (# of tuples with A=5)

– Less if A is highly correlated with sort order of relation

• Plus overhead of accessing index pages

Comparing Access Plans

• Clustered index is always best– Unless all tuples are being returned (then use scan)– But clustered index may not be available

• Unclustered index beats scan when fraction of tuples returned is small – Depends on these factors:

• % of tuples being returned• Cost ratio of random I/O vs. sequential I/O• # of tuples per page

– Query returns >10% of rows → scan is almost certainly faster

Covering Indexes

• Example using index in a book:– “What does this book say about fact tables?”

• Look up “fact tables” in the index• Turn to each page that is listed• Read that page and see what it says

– “Which of these topics are discussed in this book: fact tables, bridge tables, B-trees?”

• Look up the three topics in the index• See how many of them appear• Don’t need to read any of the actual book

Covering Indexes

• Sometimes an index has all the data you need– Allows index-only query plan– Not necessary to access the actual tuples– Such an index is called a covering index

• SELECT COUNT(*) FROM R WHERE A=5– Use index on A– Count number of <5,RID> entries– No need to look up records referenced by RIDs

• An index is a “thin” copy of a relation– Not all columns from the relation are included– The index is sorted in a particular way

Multi-Column Indexes

• Multi-column indexes are very useful in data warehousing– We say such an index has a composite key

• Example: B-Tree index on (A,B)– Search key is (A,B) combination– Index entries sorted by A value– Entries with same A value are sorted by B value– Called a lexicographic sort

• SELECT SUM(B) FROM R WHERE A=5– Our (A,B) index covers this query!

• Coverage vs. size trade-off– More attributes in search key → index covers more queries– More attributes in search key → index takes up more disk space

Fact and Dimension Indexes• Dimension table index• Narrow version of table with

only frequently-queried attributes

• Always include dimension key!• Improve performance on large

dimension tables

• Fact table index• Narrow version of fact that

omits certain dimensions / measures

• Useful for queries that exclusively reference indexed dimensions / measures

Order of Composite Key

• Index on (A,B) ≠ Index on (B,A)– Can efficiently search based on leading terms– No efficient search for trailing terms

• SELECT SUM(B) FROM R WHERE A=5– Index on (A,B) is sorted by A

• Search for records where A=5• Scan only the relevant portion of the index

– Index on (B,A) is sorted by B• Records with A=5 are scattered throughout index• Need to scan the entire index• Or else do one search for each distinct value of B

– Oracle’s “index skip scans”

– Index on (A,B) is better for this query– Either index is much faster than accessing relation!

Index Summary

• Indexes are useful in two ways:– Indexes allow efficient search on some attributes due

to the way they are organized– Index-only plans use small indexes in place of large

relations

• For OLAP queries, the second use is generally more important– Search via non-covering, non-clustered index leads to

random I/O– Analysis queries typically aggregate lots of tuples– Doing one random I/O per tuple can be costly

Example

• Sales(Date, Store, Product, Promotion, TransactionId, Quantity, DollarAmt)– Index on (Date, Store, Quantity, DollarAmt)– Index on (Date, Promotion, Product, Quantity,

DollarAmt)– Index on (Product, Date, Store, Quantity, DollarAmt)

• Store– Index on (Name, District, StoreKey)

• Product– Index on (Name, Brand, Dept, ProductKey)– Index on (Brand, Dept, ProductKey)

Example Query

SELECT Brand, SUM(DollarAmt)FROM Sales, Product, StoreWHERE Sales.ProductKey = Product.ProductKeyAND Sales.StoreKey = Store.StoreKeyAND Store.Name = 'Crystal Springs Safeway‘GROUP BY Brand

Product:Brand

Store:Name

Sales:DollarAmt

Selecting Indexes

• Sales(Date, Store, Product, Promotion, TransactionId, Quantity, DollarAmt)– Index on (Date, Store, Quantity, DollarAmt)– Index on (Date, Promotion, Product, Quantity,

DollarAmt)– Index on (Product, Date, Store, Quantity, DollarAmt)

• Store– Index on (Name, District, StoreKey)

• Product– Index on (Name, Brand, Dept, ProductKey)– Index on (Brand, Dept, ProductKey)

LacksProduct

LacksStore

WiderThan

Needed

Query Plan

• Search Store(Name, District, StoreKey) index for Name=‘Crystal Springs Safeway’

• Nested Loop Join– Outer = Sales(Product,Date,Store,Quantity,DollarAmt) index – Inner = Qualifying Store index entries– Output preserves sort order of Sales index

• Sort Product(Brand,Dept,ProductKey) index entries by ProductKey

• Merge Join– Result of Nested Loop Join (already sorted by ProductKey)– Product(Brand,Dept,ProductKey)

• Hash resulting tuples on Brand (for GROUP BY)– Compute SUM(DollarAmt) for each Brand

Index Intersection

• Suppose we have table R(A,B,C,D,E)– B-Tree index on A– B-Tree index on B– No multi-column indexes

• SELECT COUNT(*) FROM R WHERE A=5 AND B < 10• Use an index intersection plan

– Search A index for A=5• Index entries have <A,RID>• Think of the index as a 2-column table with schema I1(A,RID)

– Search B index for B<10• Index entries have <B,RID>• Think of the index as a 2-column table with schema I2(B,RID)

– Join qualifying index entries on I1.RID = I2.RID

Index Intersection

• Index intersection works well for conjunction of multiple, moderately selective filters– SELECT SUM(C) FROM R WHERE A=5 AND B<10– 5% of rows have A=5– 5% of rows have B<10– 5% * 5% = 0.25% of rows have A=5 AND B<10– Retrieving rows matching A index alone, or B index

alone, would be slow– Only a few rows match both indexes

• Intersect indexes and retrieve rows that match both– Overhead of joining indexes often small relative to

cost of retrieving matching records from relation

Bitmap Indexes

• Earlier idea: use RID-lists in place of RIDs– Save space when attribute values repeat

• Bitmap indexes take this one step further– Use Bitmap in place of RID-list– Each RID in the entire relation is represented by 1 bit

• 1 = RID is present in RID-list• 0 = RID is absent from RID-list

– Bitmaps are usually compressed• E.g using run-length encoding

Bitmap Index Example

• Bitmap index looks like this:<M,10100011><F,01011100>

ID Name Sex

1 Fred M

2 Jill F

3 Joe M

4 Fran F

5 Ellen F

6 Kate F

7 Matt M

8 Bob M

Why Bitmap Indexes?

• Index intersection plans with bitmap indexes are fast– Just perform bitwise AND!– Index intersection with B-Trees requires a join

• SELECT COUNT(*) FROM R WHERE A=5 AND B < 10– Bitmap index on A– Bitmap index on B– OR together bitmaps for B values that are < 10– AND the result with the bitmap for A=5– Can be computed very quickly

• Assuming not too many distinct B values that are < 10

• Save space for low-cardinality attributes– As compared to a B-Tree or Hash index– Particularly if compression is used

• Most useful for attributes with low or medium cardinality– Not good for something like LastName

Compressing Bitmaps• Consider a bitmap index on an attribute with 20 distinct values• Each row has 1 value for that attribute• 20 different bitmaps

– ith bit is set to 1 in one bitmap– ith is set to 0 in 19 bitmaps

• Bitmaps consist mostly of zeros (95% of bits are zero)– Good opportunity for compression

• Compression via run length encoding– Just record number of zeros between adjacent ones– 00000001000010000000000001100000– Store this as “7,4,12,0,5”

• Compression Pros and Cons– Reduce storage space → reduce number of I/Os required– Need to compress/uncompress → increase CPU work required – Each compression scheme negotiates this trade-off differently– Operate directly on compressed bitmap → improved performance

cs 345: topics in data warehousing thursday, october 21, 2004

Documents