CS 345: Topics in Data Warehousing Thursday, November 11, 2004


Page 1: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

CS 345: Topics in Data Warehousing

Thursday, November 11, 2004

Page 2: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Review of Thursday’s Class

• Eliminating Bottlenecks
  – Multi-channel RAID
  – Clusters of Processors
    • Shared-Memory (SMP)
    • Shared-Disk
    • Shared-Nothing
• Horizontal Partitioning
• Partitioning and Clustered Indexes
• Multi-Dimensional Clustering
• Clustered File Organization
• Integer Mapped Tables

Page 3: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Outline of Today’s Class

• Dimension Key Mapping Revisited
  – Comments on Assignment #2
• Updating the Data Warehouse
  – Incremental maintenance vs. drop & rebuild
  – Self-maintainable views
• Approximate Query Processing
  – Sampling-based techniques
    • Computing confidence intervals
    • Online vs. pre-computed samples
    • Sampling and joins
  – Alternative techniques

Page 4: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Notes on Assignment #2

• During the 2nd warehouse load, need to determine which player_key to use for fact rows
  – Some player_ids had multiple player_keys in the dimension table
  – Due to preserving history of dimension value changes
• Many people used the rule “the maximum player_key for each player_id is the most current” (see the sketch after this list)
  – Relies on the fact that surrogate keys are assigned in increasing order
• Worked fine for this assignment, but two things to note
  – Fixing past errors when preserving history
  – Non-determinism of order in SQL query results
    • Frequent source of “Heisenbugs”
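
A minimal SQL sketch of the “maximum key is most current” rule; the column names player_id and player_key follow the slide, and the exact assignment schema may differ:

  -- Pick the most-current surrogate key for each player_id
  -- (relies on surrogate keys being assigned in increasing order)
  SELECT player_id, MAX(player_key) AS current_player_key
  FROM player
  GROUP BY player_id;

During the fact load, this result can be joined to incoming fact rows on player_id to assign the key.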

Page 5: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Rewriting History Retroactively

• Simplest technique for handling change = update dimension table
  – Restates attributes of past facts that reference modified dimension rows
  – Desirable if the reason for the update is to correct a data entry error
  – Often undesirable if update represents a genuine change in an attribute over time
• Preserve history by creating new dimension rows in response to a value change
• But what if we later discover a past error?
  – John Smith lives in CA in 2000
  – John Smith moves to WA in 2001, but the source system is not changed
  – John Smith moves to OR in 2002, and the source system is changed
  – In 2003, we learn that John Smith lived in WA in 2001
  – Would like to restate history
  – Insert a 3rd dimension row for John Smith
  – Update all 2001 facts for John Smith to point to the new dimension row
    • Also update aggregate tables
  – But now maximum dimension key for John Smith <> current dimension key
• Two solutions (see the sketch after this list)
  – Add a 4th dimension row for John Smith with State = ‘OR’
    • Row #2 and Row #4 for John Smith are identical
  – Maintain a mapping table in the staging area with the current key for each person
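
A minimal sketch of the mapping-table solution, assuming a staging-area table; the names customer_key_map, customer_id, and current_customer_key are illustrative, not from any particular schema:

  -- Staging-area mapping table: one row per natural key, holding the current surrogate key
  CREATE TABLE customer_key_map (
      customer_id          INTEGER PRIMARY KEY,   -- natural key from the source system
      current_customer_key INTEGER NOT NULL       -- surrogate key of the current dimension row
  );

  -- When a genuine change creates a new dimension row, record its key here
  UPDATE customer_key_map
  SET current_customer_key = 1234                 -- key of the newly inserted dimension row
  WHERE customer_id = 987;

  -- During fact loads, look up the current key from the mapping table
  -- instead of relying on MAX(surrogate key)
  SELECT current_customer_key
  FROM customer_key_map
  WHERE customer_id = 987;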

Page 6: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Use ORDER BY to guarantee order

• SQL uses multiset semantics, not ordered list semantics
• Inserting same records in different order → same relation
• Query result sets have order, but order is arbitrary
• Mandatory to use ORDER BY if ordering is important
• Example:

  INSERT INTO player (id, key) VALUES (4503, 23);
  INSERT INTO player (id, key) VALUES (4503, 640);
  SELECT key FROM player WHERE id = 4503;

• What will be the result?
  – Two possibilities
    • 23, 640
    • 640, 23
  – To guarantee the first result, use:

  SELECT key FROM player WHERE id = 4503 ORDER BY key;

Page 7: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Warehouse Updates

• Typical data warehousing practice is to batch updates
  – Data warehouse is read-only most of the time
  – Periodically the warehouse is refreshed with new data
• Benefits of batch updates
  – Manageability
    • Avoid synchronization issues caused by concurrent queries and updates
    • Perform data transformations and integration tasks occasionally rather than continually
    • Quality assurance becomes easier
  – Performance
    • Indexes, materialized views, etc. can be updated once instead of continuously
    • Amortize fixed update costs across many changed records
    • Potentially substitute sequential access for random access
• Downside of batch updates
  – Data staleness (warehouse does not offer real-time view of data)

Page 8: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Updating Indexes and Aggregates

• Data warehouses use lots of indexes and materialized views
• These must be updated whenever the warehouse is updated
• Updating existing indexes can be expensive
  – Index becomes fragmented, requires reorganization
    • Index entries are stored sorted by index search key
    • Deleting a record leaves gaps in the index
    • Inserting a new record requires moving index entries around to make room
      – Assuming that entries are packed tightly
    • Similar to the problem of maintaining a sorted array
    • Typically “overflow pages” are used
      – Physical ordering of pages on disk may not exactly match sort order
  – Applying updates one at a time involves random reads
    • To avoid random reads, sort updates first and apply them en masse
    • Similar idea to sort-merge join
• Specify “fill factor” when creating an index (see the sketch below)
  – How much empty space is built in to accommodate future inserts?
  – High fill factor → More efficient space usage
  – Low fill factor → Fragmentation is less of a problem
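
Fill factor syntax is DBMS-specific; as an illustrative sketch, SQL Server exposes it as FILLFACTOR and PostgreSQL as the fillfactor storage parameter (the index and table names here are hypothetical):

  -- SQL Server: leave ~30% free space in each leaf page for future inserts
  CREATE INDEX ix_sales_date ON sales (date_key) WITH (FILLFACTOR = 70);

  -- PostgreSQL equivalent: the fillfactor storage parameter
  CREATE INDEX ix_sales_date ON sales (date_key) WITH (fillfactor = 70);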

Page 9: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Updating Indexes

Index before update:
  [<2,RID>, <4,RID>, <4,RID>]   [<5,RID>, <7,RID>, <10,RID>]   [<12,RID>, <14,RID>, <15,RID>]

After inserting a record with value 6 (overflow page created):
  [<2,RID>, <4,RID>, <4,RID>]   [<5,RID>, <6,RID>]   [<12,RID>, <14,RID>, <15,RID>]
                                 overflow page: [<7,RID>, <10,RID>]

A scan of the index now jumps around on disk.

Page 10: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Maintain vs. Drop/Rebuild

• Two techniques for managing indexes during incremental loads
  – Incrementally maintain
  – Drop and rebuild
• Incrementally maintain indexes
  – Allow the DBMS to update indexes as new records are loaded
  – Occasionally reorganize indexes to combat fragmentation
    • Some systems will do this automatically on-the-fly
  – Loading is slowed down by the cost of index maintenance
• Drop and rebuild (see the sketch after this list)
  – Drop indexes before loading new data
  – Load the new data
  – Recreate indexes from scratch after load completes
• Which is best depends on the situation
  – Percentage of new or changed records
    • Incrementally maintain: cost proportional to number of new records
    • Drop and rebuild: cost proportional to total number of records
  – Number of indexes
    • More indexes → worse locality of access when maintaining indexes

Page 11: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Maintain vs. Drop/Rebuild

• Drop/rebuild allows fill factor of 100%
  – Since indexes won’t be updated, no need to leave extra space
  – Use smaller fill factor for incrementally maintained indexes
• Partitioning based on date improves performance of both methods
  – Especially drop/rebuild
  – Only need to rebuild most recent partition
  – (Size of new data) / (Size of existing partition data) is often large
    → partition-level rebuild is a good choice
• Rough rule of thumb
  – Rebuild if > 20% of rows are new/changed
  – Maintain if < 20% of rows are new/changed
  – Actual threshold depends on hardware, number of indexes

Page 12: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Self-Maintainable Views

• Incremental maintenance of materialized views is a hard problem
• Naïve solution: Always recompute view contents from scratch
  – Re-execute the query specified in the view definition whenever data changes
• Better solution: Apply incremental maintenance rules (see the sketch after this list)
  – Nice when it’s possible!
  – Example: SELECT SUM(Quantity) FROM Sales
  – Current value is 45
  – Insert a new row into Sales with Quantity = 4
  – New value for materialized view = 45 + 4 = 49
• A view (or set of views) is self-maintainable if the new contents of the view(s) can be computed from:
  – The old contents of the view(s)
  – The changed records
• Not all views are self-maintainable
  – Example: SELECT MAX(Quantity) FROM Sales
  – Self-maintainable if database is append-only
  – Not self-maintainable if deletes are allowed
    • What if the record with maximum quantity is deleted?
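
A minimal sketch of the incremental maintenance rule for the SUM example, assuming the materialized view is stored as an ordinary one-row table named sales_total and maintained by hand (the name and approach are illustrative; many DBMSs can maintain such views automatically):

  -- Materialized view stored as a plain table; currently holds 45
  CREATE TABLE sales_total (total_quantity INTEGER);

  -- A new row arrives in the base table
  INSERT INTO Sales (Quantity) VALUES (4);

  -- Incremental maintenance: fold the delta into the old view contents
  UPDATE sales_total
  SET total_quantity = total_quantity + 4;   -- 45 + 4 = 49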

Page 13: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Self-Maintainable Views

• V1 = SELECT Customer.State, Sales.QuantityFROM Sales, CustomerWHERE Sales.Cust_key = Customer.Cust_keyGROUP BY Customer.State

• V1 is not self-maintainable– Suppose a new Sales record is inserted– Can’t determine State from the new Sales record alone

• V2 =SELECT Cust_key, State FROM Customer

• {V1, V2} is self-maintainable– New Customer record → add a row to V2– New Sales record → look up State from V2 and update V1

• Given a view, how to add more views to achieve self-maintenance?– Interesting research problem (goal is to add as little as possible)
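
A sketch of the maintenance steps, assuming V1 and V2 are stored as tables v1(State, TotalQuantity) and v2(Cust_key, State); the stored names and the :new_* placeholders are illustrative:

  -- New Sales row (:new_cust_key, :new_quantity) arrives:
  -- look up the customer's state in V2, then fold the quantity into V1
  -- (a State not yet present in v1 would need an INSERT instead)
  UPDATE v1
  SET TotalQuantity = TotalQuantity + :new_quantity
  WHERE State = (SELECT State FROM v2 WHERE Cust_key = :new_cust_key);

  -- New Customer row arrives: just add it to V2
  INSERT INTO v2 (Cust_key, State)
  VALUES (:new_cust_key, :new_state);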

Page 14: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Exploratory Data Analysis

• Much data analysis is exploratory in nature
  – Primary goal is improved understanding
  – Better understanding of data leads (eventually) to specific, actionable queries
• Direction and scope of exploratory analysis changes as analysis proceeds
  – Future queries are inherently unpredictable
  – They depend on insights gained from the results of present queries
• Unpredictable queries are hard to speed up
  – Pre-computed summary structures work best when workload is well-known, static
  – Each index, aggregate, etc. targets a specific class of queries
  – Only a limited number of indexes, etc. can be constructed
    • Due to disk space and load time considerations

Page 15: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Exploratory Data Analysis

• The most important computation takes place in the user’s head
  – Make connections
  – Develop mental models
  – Hypothesize and experiment
• Interactive vs. batch-mode processing
  – Interactive processing
    • User asks a series of questions in one sitting
    • Mental focus and concentration is maintained
    • No “context switches” necessary
  – Batch-mode processing
    • User asks a question and goes off to do something else
    • User retrieves the answer later, then formulates another question
    • Cognitively expensive → many context switches
• Interactive processing is much preferable
  – Places requirements on query processing time
  – User needs to be able to maintain attention
  – Limit time spent waiting for query response to intervals that can be productively spent thinking

Page 16: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Approximate Query Processing

• Interactive response times are often impractical for large data warehouses
  – Data volumes are prohibitive
  – Merely scanning the fact table may take many minutes!
  – Queries that miss indexes/aggregates are slow
• One approach: fast, approximate answers
  – Provide quick “ballpark estimates” based on a small synopsis of the database
  – Approximate answer:
    • $85,000 +/- $2000 (with 95% confidence)
    • Return answer in 10 seconds
  – Exact answer:
    • $84,792.45
    • Return answer in 1 hour
  – Good enough for high-level view of database structure, trends
• Combine approximate and exact answers
  – Start with approximate answers for high-level “quick-and-dirty” analysis
  – Use exact answers to refine and confirm hypotheses

Page 17: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

AQP Using Random Samples

• Simplest data synopsis = uniform random sample
  – Evaluate query over a random sample of the data
  – Extrapolate from the random sample to estimate the overall result
  – Large body of knowledge from statistics
  – Example: Pre-election polls
• Unbiased estimator for COUNT / SUM queries
  – Compute the aggregate over the random sample
  – Scale the answer by (1 / sampling rate)
  – Expected value of estimate = actual query answer
    • Definition of unbiased estimator
  – Variance of estimate is also important
    • Corresponds to confidence intervals reported by pollsters
    • The lower the variance, the more confident the estimate

Page 18: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

AQP Example

Sales                    SalesSample
Product   Amount         Product   Amount
CPU       1              CPU       1
CPU       1              CPU       2
CPU       2              CPU       3
CPU       3              Disk      2
CPU       4
Disk      1
Disk      2
Monitor   1

SELECT SUM(Amount) FROM Sales WHERE Product = 'CPU'

Exact answer: 1 + 1 + 2 + 3 + 4 = 11

Approximate answer (50% sample, so scale by 2): (1 + 2 + 3) * 2 = 12 (see the rewritten query below)
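
As a sketch, the approximate answer corresponds to rewriting the query against the sample table and scaling by 1 / sampling rate; here the sample holds 4 of 8 rows, so the factor is 2 (the table name SalesSample follows the slide):

  -- Rewritten query: aggregate over the sample, then scale up
  SELECT SUM(Amount) * 2 AS approx_total    -- 2 = 1 / 0.5 sampling rate
  FROM SalesSample
  WHERE Product = 'CPU';                    -- returns (1 + 2 + 3) * 2 = 12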

Page 19: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Variance of Estimates

• An estimate can be unbiased but still useless if confidence is too low
• Example: Ask 1 random person “Bush or Kerry?”
  – If they say Bush, estimate that 100% of the population prefers Bush.
  – If they say Kerry, estimate that 0% of the population prefers Bush.
  – Estimate is unbiased
    • Assume that in reality 51% of the population prefers Bush
    • Expected value = (0.51 * 100%) + (0.49 * 0%) = 51%
  – Estimate is not very helpful because variance is too high
• Factors that affect variance of estimate
  – Bigger sample size → less variance
    • Inversely proportional to square root of sample size
  – Attribute being SUMmed is highly skewed → more variance
    • Hard to find a needle in a haystack
    • Scenario 1: 99 people have $0, 1 person has $100
    • Scenario 2: 50 people have $2, 50 people have $0
    • Estimate total wealth via a sample of 10 people → Scenario 2 is easier
  – More restrictive filters → more variance
    • Records that are filtered out don’t help

Page 20: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Confidence Intervals for Sampling

Confidence intervals for Average: SELECT AVG(R.A) FROM R
(Can replace R.A with any arithmetic expression on the attributes in R)
σ(R) = standard deviation of the values of R.A; σ(S) = standard deviation of S.A (S = the sample of R)

Method                    90% Confidence Interval (±)       Guarantees?
Central Limit Theorem     1.65 * σ(S) / sqrt(|S|)           as |S| → ∞
Hoeffding                 1.22 * (MAX - MIN) / sqrt(|S|)    always
Chebychev (know σ(R))     3.16 * σ(R) / sqrt(|S|)           always
Chebychev (est. σ(R))     3.16 * σ(S) / sqrt(|S|)           as σ(S) → σ(R)

Slide from Garofalakis & Gibbons, VLDB 2001 tutorial
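
As a sketch of the Central Limit Theorem row applied in SQL, assuming a pre-computed sample table named sales_sample (illustrative) and a DBMS that provides STDDEV and SQRT:

  -- Point estimate and 90% confidence half-width for AVG(Amount),
  -- computed entirely from the sample (CLT row of the table above)
  SELECT AVG(Amount)                              AS estimate,
         1.65 * STDDEV(Amount) / SQRT(COUNT(*))   AS ci_90_half_width
  FROM sales_sample;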

Page 21: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Online vs. Pre-Computed Sample

• Two basic approaches to sampling
  – Online sampling
    • Generate sample “on-the-fly” when query is asked
  – Pre-computed samples
    • Generate samples of big tables in advance
    • Store the pre-computed samples separately
• Online sampling can be slow (see the sketch after this list)
  – Generation of the sample can be very expensive
    • Sometimes even more expensive than scanning the entire relation!
  – Example:
    • 200 tuples fit on each disk block
    • Take a 1% sample of the relation
    • Average page has 2 tuples included in the sample
    • Most pages have ≥ 1 tuple included in the sample (with high probability)
    • Most pages need to be read when generating the sample
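
Two common ways to draw an online sample in SQL, shown as a sketch; TABLESAMPLE is in the SQL standard but support and exact syntax vary by DBMS:

  -- Row-level Bernoulli sample: roughly 1% of rows, but may still touch most pages
  SELECT * FROM Sales TABLESAMPLE BERNOULLI (1);

  -- Block-level sample: picks ~1% of disk blocks and returns all rows on them
  SELECT * FROM Sales TABLESAMPLE SYSTEM (1);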

Page 22: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Online Sampling

• Alternatives to tuple-level online sampling
  – Block-level sampling
    • Choose a random sample of the disk blocks on which a relation is stored
    • Include all tuples on the selected blocks
    • Problem: Correlations between queried attributes and sort order
      – Customer relation is clustered by Age
      – Query: count of retired customers
      – Age and RetiredFlag are highly correlated
      – Retired customers are not spread evenly across blocks
      – Confidence of estimate decreases
      – Difficult to calculate accurate confidence intervals
  – Store relation in random order
    • To generate an x% random sample, scan x% of rows
    • Problem: Lose the ability to have a clustered index
    • Problem: Relation must be rearranged when new records are added

Page 23: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

AQP with Pre-Computed Samples

[Architecture diagram: warehouse clients send SQL queries to a query rewriter;
the rewriter sends rewritten queries to the data warehouse, which stores the
pre-computed samples alongside the base tables]

• Similar architecture to aggregate navigator
• Client queries reference base-level fact and dimension tables
• Re-written queries reference sample tables

Page 24: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Sampling and Joins

• Sampling for join queries presents added difficulties
• Here’s a poor approach:
  – Construct 1% sample of fact
  – Construct 1% sample of dimension 1
  – Construct 1% sample of dimension 2
  – Compute join of the three samples
  – Expected number of tuples in result = only 0.0001% of tuples in actual join!
• Here’s a better approach (see the sketch after this list):
  – Construct 1% sample of fact
  – Compute join of fact sample with full dimension tables
  – Result is a 1% sample of join result
• Fact table is usually much bigger than dimensions
  – Sampling dimension tables doesn’t speed things up much
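
A sketch of the better approach, assuming a pre-computed 1% fact sample named sales_sample_1pct; the table and column names are illustrative star-schema names:

  -- Join the 1% fact sample with the full dimension tables;
  -- the result is a 1% sample of the full join, so scale aggregates by 100
  SELECT d.year, c.state, SUM(f.amount) * 100 AS approx_sales
  FROM sales_sample_1pct f
  JOIN date_dim     d ON f.date_key = d.date_key
  JOIN customer_dim c ON f.cust_key = c.cust_key
  GROUP BY d.year, c.state;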

Page 25: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Sampling and Joins

• Alternative approach for big dimension tables (see the sketch after this list)
  – Compute 1% sample of fact
  – Construct semi-join of big dimension with fact sample
    • In other words, identify dimension rows that join with at least one row in the fact sample
  – Store semi-join result as “sample” of dimension
    • The rest of the dimension rows are not relevant for queries that use the fact sample
  – (# of rows in semi-join result) ≤ (# of rows in fact sample)
• Non-foreign-key joins cause further problems
  – Fortunately standard star schema queries join on primary key-foreign key relationships
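
A sketch of the semi-join step, with illustrative names (customer_dim as the big dimension, sales_sample_1pct as the fact sample); CREATE TABLE ... AS syntax varies by DBMS (some use SELECT INTO):

  -- Keep only the dimension rows that join with at least one sampled fact row
  CREATE TABLE customer_dim_sample AS
  SELECT *
  FROM customer_dim c
  WHERE EXISTS (SELECT 1
                FROM sales_sample_1pct f
                WHERE f.cust_key = c.cust_key);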

Page 26: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Additional AQP Topics

• Non-uniform (stratified) sampling
  – Can provide more accurate answers for some queries
  – Example: query with filter State = ‘NV’
    • Compare two equal-sized samples
      – Uniform random sample
      – Sample of only records from states in the Pacific Time Zone
    • Second sample has higher sampling rate → more relevant tuples → more accurate answer
  – Example: query that groups by State
    • CA is better represented than WY in a uniform sample
    • Estimate for CA will be more accurate than estimate for WY
    • Could use a stratified sample with an equal # of records from each state (see the sketch after this list)
      – Different sampling rates for each state
      – Scaling up the answer is more complicated
      – Multiply sub-total for each state by a different scaling factor
    • Stratified sample delivers lowest average error across all groups
  – Researchers have explored various non-uniform sampling techniques
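
A sketch of per-stratum scaling for the group-by-State example, assuming a stratified sample table sales_stratified that also records each state's sampling rate in a sampling_rate column (all names are illustrative):

  -- Each state was sampled at its own rate, so weight each row
  -- by 1 / sampling_rate before summing (per-stratum scaling)
  SELECT state,
         SUM(amount / sampling_rate) AS approx_total_sales
  FROM sales_stratified
  GROUP BY state;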

Page 27: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Additional AQP Topics

• Approximate answers using other synopses
  – A sample is one data structure for summarizing a large data set
  – Many other techniques exist
    • Histograms
    • Compression via wavelet decomposition
    • “Sketches” based on random projection
  – Methodology for AQP with other synopses is similar to using pre-computed samples
    • Pre-compute a synopsis of the fact table
      – And optionally synopses of the dimensions
    • Translate query on facts and dimensions into a query over the synopses
    • Answer the query over the synopses using the database server
    • Generate estimated answer and confidence interval based on the result

Page 28: CS 345: Topics in Data Warehousing Thursday, November 11, 2004

Next Week: Data Mining

• High-level course outline
  – Logical database design
  – Query processing
  – Physical database design
  – Data mining
• Data mining:
  – “the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”
  – Or, “What can I do with my data warehouse?”
• Remainder of course
  – Next 2 weeks: Data mining
    • Decision trees, association rules, clustering
  – Last week: Project presentations & course review
  – Final exam: Wednesday, December 8