Performance Tuning in Database, SQL, and LookML for Optimal Data Exploration: A Case Study
Connect. Describe. Explore.
Performance Tuning for Optimal Data Exploration
Segah Meer, Data & Analytics
1. Signals to Watch for (for existing Lookers)
2. Case #1: Very wide reports
3. Case #2: Lots of event data
4. Case #3: Live data with complex analysis
5. When to use PDTs for performance needs
Performance Tuning for Existing Lookers: Signals to Watch For...

● Slow response time
  ○ Queries returning in >1 minute on large* datasets
  ○ Queries returning in >10-20 seconds on small* datasets
● Bad throughput/resource utilization
  ○ Concurrent queries are affected
  ○ Excessive CPU usage
  ○ Queries are aborted due to limited disk space
Case 1: Very Wide Reports

“What’s wrong with the Donkey?”

Specs:
❏ Database: MySQL
❏ Size: >100 million rows
❏ Normalization: 3rd Normal Form
❏ Indexing: lots of missing indexes on timestamps
❏ Model stage: Complete

Ask:
❏ Query Excel-like reports with 30-50 columns
❏ Show metrics (by date) associated with different parts of the business in one Look
Problem:
❏ One query could generate 1,000+ lines of SQL against the entire dataset (all 100 million rows), leading to slow queries, stalled concurrent queries, and high CPU usage

Solution:
➔ Force date ranges with an always_filter
➔ Heavy use of templated_filter
➔ Result caching with persist_for
● Force date ranges with an always_filter
● Cache every result for 8 hours
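A minimal LookML sketch of these two settings, assuming a hypothetical orders explore with a created_date dimension:

    explore: orders {
      # Refuse unfiltered scans: every query starts with a 30-day
      # window on the timestamp; users can change it but not remove it.
      always_filter: {
        filters: [orders.created_date: "30 days"]
      }
      # Serve repeat queries from Looker's cache for 8 hours
      # instead of re-scanning the 100M-row dataset.
      persist_for: "8 hours"
    }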
Takeaway #1: “most queries do NOT need to run against the entire dataset”
Caveat #1: “for date ranges to be effective, timestamp columns should be indexed”
Caveat #2: “Normalized schemas require more understanding of the final model for correct setup of indexes”
Check indexes in the SQL Runner:
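For example, against MySQL the SQL Runner can run a quick index check like this (table name hypothetical):

    -- List indexes on the table behind the explore; the timestamp
    -- columns used for date filtering should appear here.
    SHOW INDEX FROM orders;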
Heavy use of templated_filter
Takeaway #2: “Joins can be optimized by filtering the joined table for a subset of data”
More templated filter optimization (in Derived Tables), as sketched below:
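A hedged LookML sketch of a templated filter inside a derived table (view, filter, and column names are all hypothetical): the {% condition %} block injects the user's filter into the subquery itself, so the join only touches the matching subset of rows rather than the whole joined table:

    view: filtered_order_totals {
      derived_table: {
        sql:
          SELECT user_id, SUM(amount) AS total_amount
          FROM orders
          -- The user's date filter is pushed down into the derived
          -- table instead of being applied after the join.
          WHERE {% condition order_date %} orders.created_at {% endcondition %}
          GROUP BY user_id ;;
      }
      filter: order_date {
        type: date
      }
      dimension: user_id {}
      dimension: total_amount {
        type: number
      }
    }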
Case 2: Lots of Event Data

Specs:
❏ Database: Redshift (3 HDD nodes)
❏ Size: >5 billion rows
❏ Normalization: Heavily denormalized
❏ Indexing: some Sortkeys/Distkeys
❏ Growth: >50 million new events added daily
❏ Model stage at start: Partially complete (suitable for general exploration)

Ask:
❏ Query new retention metrics by date
❏ Show many metrics in a dashboard (drill paths are not important)
Problem:
❏ These metrics require 1) date ranges spanning 500+ million events (even for 2-4 week or 30-day intervals) and 2) expensive self-joins
❏ Queries were taking minutes, or had to be scheduled to pre-run at midnight and be cached

Solution:
➔ Roll up events in PDTs into metrics by Date, Channel, Campaign
➔ Match date filters with dates and time filters with timestamps
Roll-up Events in PDTs into metrics by Date, Channel, Campaign
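A minimal LookML sketch of such a roll-up PDT (table, view, and column names are hypothetical): it pre-aggregates the raw events once a day, so dashboards scan thousands of daily rows instead of billions of events:

    view: daily_event_rollup {
      derived_table: {
        sql:
          SELECT
            TRUNC(event_at) AS event_date,  -- a true DATE, not a timestamp
            channel,
            campaign,
            COUNT(*) AS event_count
          FROM events
          GROUP BY 1, 2, 3 ;;
        # Rebuild once per day instead of on every query.
        sql_trigger_value: SELECT CURRENT_DATE ;;
        sortkeys: ["event_date"]
      }
      dimension_group: event {
        type: time
        timeframes: [date, week, month]
        datatype: date
        sql: ${TABLE}.event_date ;;
      }
      dimension: channel {}
      dimension: campaign {}
      measure: total_events {
        type: sum
        sql: ${TABLE}.event_count ;;
      }
    }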
Takeaway: “Event data is rarely looked at on a row level. OK to roll up!”
Caveat #1: “Incorrect use of datatype can lead to Redshift not effectively making use of SORTKEYs”
Caveat #2: “Sometimes this even requires us to create a date-timestamp PDT”
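One way to read these two caveats together, sketched below with hypothetical names: materialize a true DATE column next to the raw TIMESTAMP, sort on it, and declare matching datatypes in LookML, so Looker never has to cast the sortkey column inside the WHERE clause:

    view: events_with_date {
      derived_table: {
        sql:
          SELECT
            event_id,
            created_at,                        -- original TIMESTAMP
            TRUNC(created_at) AS created_date  -- true DATE for date filters
          FROM events ;;
        sql_trigger_value: SELECT CURRENT_DATE ;;
        sortkeys: ["created_date"]
      }
      dimension_group: created {
        type: time
        timeframes: [date, week, month]
        # datatype matches the column, so Looker compares DATE to DATE
        # and Redshift can range-restrict on the sortkey.
        datatype: date
        sql: ${TABLE}.created_date ;;
      }
    }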
Case 3: Live Data with Complex Analysis

Specs:
❏ Database: Redshift (10 SSD nodes)
❏ Size: >800 million rows
❏ Normalization: 3rd Normal Form
❏ Indexing: correct Sortkeys/Distkeys
❏ Growth: 0.1-10 million new rows, depending on the table
❏ Model stage at start: Complete

Ask:
❏ Very flexible exploration environment (drill to row level)
❏ Very complex analytical patterns (financial use case)
❏ Nearly live data (ETL running every 10 minutes)
Problem:
❏ Complex queries were not returning results fast enough (or at all) to reflect changes in the data
❏ Data caching overloaded the storage

Solution:
➔ Swap sql_trigger_value for persist_for on intermediate PDTs
➔ Wait until Vacuum and Analyze commands complete before making the table accessible (2 approaches)
Swap sql_trigger_value for persist_for on intermediate PDTs to garbage collect scratch tables.
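A minimal sketch of the swap (view and column names hypothetical): a persist_for PDT expires and can be garbage-collected after its window, whereas a sql_trigger_value PDT sits in the scratch schema until its trigger fires again:

    view: intermediate_balances {
      derived_table: {
        sql:
          SELECT account_id, SUM(amount) AS balance
          FROM transactions
          GROUP BY account_id ;;
        # Expires after 10 minutes (the ETL cadence); Looker can then
        # drop the scratch table instead of letting it accumulate.
        persist_for: "10 minutes"
      }
      dimension: account_id {}
      dimension: balance {
        type: number
      }
    }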
Takeaway #1: “Choosing Compute Nodes over Dense Storage does NOT have to come at a steeper price if done intelligently”
Takeaway #2: “Even with appropriate Sortkeys and Distkey, queries can be slow if Redshift never finishes Vacuuming and Sorting the data.”
● Union all tables in a view and then only query the Redshift view from Looker
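A hedged SQL sketch of this first approach (table names hypothetical): new loads land in a small table that is cheap to keep sorted, a view unions it with the vacuumed history, and Looker only ever queries the view:

    -- Large, fully vacuumed history table + small recent-load table.
    CREATE VIEW all_events AS
    SELECT * FROM events_history
    UNION ALL
    SELECT * FROM events_recent;
    -- Looker queries all_events only; the two underlying tables
    -- can be vacuumed and merged on their own schedule.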
● Swap table schemas upon completion of Vacuum and Analyze commands.
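A hedged SQL sketch of this second approach (table names hypothetical): the fresh table is built, vacuumed, and analyzed off to the side, then swapped in atomically, so queries only ever hit a fully sorted, analyzed table:

    -- Finish housekeeping on the staging copy first
    -- (VACUUM cannot run inside a transaction block).
    VACUUM events_staging;
    ANALYZE events_staging;
    -- Then swap it into place in one transaction.
    BEGIN;
    ALTER TABLE events RENAME TO events_old;
    ALTER TABLE events_staging RENAME TO events;
    COMMIT;
    DROP TABLE events_old;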
Major Takeaways
When to use PDTs for Performance Improvements

Do NOT use:
● Should not be a first line of defense
● If PDTs are considered only on the basis of performance, the logic is best suited for the ETL process

Do Use:
● If no exploration is necessary into a complex, slow-running query other than into the summary it produces
● If the joins in the query are too expensive due to the size of the joined tables
Questions?