Performance Tuning in Database, SQL, and LookML for Optimal Data Exploration: A Case Study
Connect. Describe. Explore.
Performance Tuning for Optimal Data Exploration
Segah Meer, Data & Analytics
1. Signals to Watch for (for existing Lookers)
2. Case #1: Very wide reports
3. Case #2: Lots of event data
4. Case #3: Live data with complex analysis
5. When to use PDTs for performance needs
Performance Tuning for Existing Lookers: Signals to Watch For...

● Slow response time
  ○ Queries returning in >1 minute on large* datasets
  ○ Queries returning in >10-20 seconds on small* datasets
● Bad throughput/resource utilization
  ○ Concurrent queries are affected
  ○ Excessive CPU usage
  ○ Queries are aborted due to limited disk space
Case 1: Very Wide Reports

“What’s wrong with the Donkey?”

Specs:
❏ Database: MySQL
❏ Size: >100 million rows
❏ Normalization: 3rd Normal Form
❏ Indexing: lots of missing indexes on timestamps
❏ Model stage: Complete

Ask:
❏ Query Excel-like reports with 30-50 columns
❏ Show metrics (by date) associated with different parts of the business in one Look
Problem:
❏ One query could generate 1,000+ lines of SQL against the entire dataset (all 100 million rows), leading to slow queries, stalled concurrent queries, and high CPU usage

Solution:
➔ Force date ranges with an always_filter
➔ Heavy use of templated_filter
➔ Result caching with persist_for
● Force date ranges with an always_filter
● Cache every result for 8 hours
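A minimal LookML sketch of these two settings, assuming a hypothetical orders explore with a created_date dimension:

    explore: orders {
      # Refuse unfiltered scans: every query starts with a 30-day
      # window on the timestamp; users can change it but not remove it.
      always_filter: {
        filters: [orders.created_date: "30 days"]
      }
      # Serve repeat queries from Looker's cache for 8 hours
      # instead of re-scanning the 100M-row dataset.
      persist_for: "8 hours"
    }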
Takeaway #1: “most queries do NOT need to run against the entire dataset”
Caveat #1: “for date ranges to be effective, timestamp columns should be indexed”
Caveat #2: “Normalized schemas require more understanding of the final model for correct setup of indexes”
Check indexes in the SQL Runner:
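For example, against MySQL the SQL Runner can run a quick index check like this (table name hypothetical):

    -- List indexes on the table behind the explore; the timestamp
    -- columns used for date filtering should appear here.
    SHOW INDEX FROM orders;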
Heavy use of templated_filter
Takeaway #2: “Joins can be optimized by filtering the joined table for a subset of data”
More templated filter optimization (in Derived Tables), as sketched below:
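A hedged LookML sketch of a templated filter inside a derived table (view, filter, and column names are all hypothetical): the {% condition %} block injects the user's filter into the subquery itself, so the join only touches the matching subset of rows rather than the whole joined table:

    view: filtered_order_totals {
      derived_table: {
        sql:
          SELECT user_id, SUM(amount) AS total_amount
          FROM orders
          -- The user's date filter is pushed down into the derived
          -- table instead of being applied after the join.
          WHERE {% condition order_date %} orders.created_at {% endcondition %}
          GROUP BY user_id ;;
      }
      filter: order_date {
        type: date
      }
      dimension: user_id {}
      dimension: total_amount {
        type: number
      }
    }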
Case 2: Lots of Event Data

Specs:
❏ Database: Redshift (3 HDD nodes)
❏ Size: >5 billion rows
❏ Normalization: Heavily denormalized
❏ Indexing: some Sortkeys/Distkeys
❏ Growth: >50 million new events added daily
❏ Model stage at start: Partially complete (suitable for general exploration)

Ask:
❏ Query new retention metrics by date
❏ Show many metrics in a dashboard (drill paths are not important)
Problem:
❏ These metrics require 1) date ranges spanning 500+ million events (even for 2-4 week or 30-day intervals) and 2) expensive self-joins
❏ Queries were taking minutes, or had to be scheduled to pre-run at midnight and be cached

Solution:
➔ Roll up events in PDTs into metrics by Date, Channel, Campaign
➔ Match date filters with dates and time filters with timestamps
Roll-up Events in PDTs into metrics by Date, Channel, Campaign
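A minimal LookML sketch of such a roll-up PDT (table, view, and column names are hypothetical): it pre-aggregates the raw events once a day, so dashboards scan thousands of daily rows instead of billions of events:

    view: daily_event_rollup {
      derived_table: {
        sql:
          SELECT
            TRUNC(event_at) AS event_date,  -- a true DATE, not a timestamp
            channel,
            campaign,
            COUNT(*) AS event_count
          FROM events
          GROUP BY 1, 2, 3 ;;
        # Rebuild once per day instead of on every query.
        sql_trigger_value: SELECT CURRENT_DATE ;;
        sortkeys: ["event_date"]
      }
      dimension_group: event {
        type: time
        timeframes: [date, week, month]
        datatype: date
        sql: ${TABLE}.event_date ;;
      }
      dimension: channel {}
      dimension: campaign {}
      measure: total_events {
        type: sum
        sql: ${TABLE}.event_count ;;
      }
    }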
Takeaway: “Event data is rarely looked at on a row level. OK to roll up!”
Caveat #1: “Incorrect use of datatype can lead to Redshift not effectively making use of SORTKEYs”
Caveat #2: “Sometimes this even requires us to create a date-timestamp PDT”
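One way to read these two caveats together, sketched below with hypothetical names: materialize a true DATE column next to the raw TIMESTAMP, sort on it, and declare matching datatypes in LookML, so Looker never has to cast the sortkey column inside the WHERE clause:

    view: events_with_date {
      derived_table: {
        sql:
          SELECT
            event_id,
            created_at,                        -- original TIMESTAMP
            TRUNC(created_at) AS created_date  -- true DATE for date filters
          FROM events ;;
        sql_trigger_value: SELECT CURRENT_DATE ;;
        sortkeys: ["created_date"]
      }
      dimension_group: created {
        type: time
        timeframes: [date, week, month]
        # datatype matches the column, so Looker compares DATE to DATE
        # and Redshift can range-restrict on the sortkey.
        datatype: date
        sql: ${TABLE}.created_date ;;
      }
    }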
Case 3: Live Data with Complex Analysis

Specs:
❏ Database: Redshift (10 SSD nodes)
❏ Size: >800 million rows
❏ Normalization: 3rd Normal Form
❏ Indexing: correct Sortkeys/Distkeys
❏ Growth: 0.1-10 million new rows, depending on the table
❏ Model stage at start: Complete

Ask:
❏ Very flexible exploration environment (drill to row level)
❏ Very complex analytical patterns (financial use case)
❏ Nearly live data (ETL running every 10 minutes)
Problem:
❏ Complex queries were not returning results fast enough (or at all) to reflect changes in the data
❏ Data caching overloaded the storage

Solution:
➔ Swap sql_trigger_value for persist_for on intermediate PDTs
➔ Wait until Vacuum and Analyze commands complete before making the table accessible (2 approaches)
Swap sql_trigger_value for persist_for on intermediate PDTs to garbage collect scratch tables.
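A minimal sketch of the swap (view and column names hypothetical): a persist_for PDT expires and can be garbage-collected after its window, whereas a sql_trigger_value PDT sits in the scratch schema until its trigger fires again:

    view: intermediate_balances {
      derived_table: {
        sql:
          SELECT account_id, SUM(amount) AS balance
          FROM transactions
          GROUP BY account_id ;;
        # Expires after 10 minutes (the ETL cadence); Looker can then
        # drop the scratch table instead of letting it accumulate.
        persist_for: "10 minutes"
      }
      dimension: account_id {}
      dimension: balance {
        type: number
      }
    }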
Takeaway #1: “Choosing Compute Nodes over Dense Storage does NOT have to come at a steeper price if done intelligently”
Takeaway #2: “Even with appropriate Sortkeys and Distkey, queries can be slow if Redshift never finishes Vacuuming and Sorting the data.”
● Union all tables in a view and then only query the Redshift view from Looker
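A hedged SQL sketch of this first approach (table names hypothetical): new loads land in a small table that is cheap to keep sorted, a view unions it with the vacuumed history, and Looker only ever queries the view:

    -- Large, fully vacuumed history table + small recent-load table.
    CREATE VIEW all_events AS
    SELECT * FROM events_history
    UNION ALL
    SELECT * FROM events_recent;
    -- Looker queries all_events only; the two underlying tables
    -- can be vacuumed and merged on their own schedule.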
● Swap table schemas upon completion of Vacuum and Analyze commands.
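A hedged SQL sketch of this second approach (table names hypothetical): the fresh table is built, vacuumed, and analyzed off to the side, then swapped in atomically, so queries only ever hit a fully sorted, analyzed table:

    -- Finish housekeeping on the staging copy first
    -- (VACUUM cannot run inside a transaction block).
    VACUUM events_staging;
    ANALYZE events_staging;
    -- Then swap it into place in one transaction.
    BEGIN;
    ALTER TABLE events RENAME TO events_old;
    ALTER TABLE events_staging RENAME TO events;
    COMMIT;
    DROP TABLE events_old;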
Major Takeaways
When to use PDTs for Performance Improvements

Do NOT use:
● Should not be a first line of defense
● If PDTs are considered only on the basis of performance, the logic is best suited for the ETL process

Do Use:
● If no exploration is necessary into a complex, slow-running query other than into the summary it produces
● If the joins in the query are too expensive due to the size of the joined tables
Questions?