AWS June 2016 Webinar Series - Amazon Redshift for Big Data Analytics
TRANSCRIPT
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Zach Christopherson, Database Engineer, AWS
June 28, 2016
Amazon Redshift for Big Data Analytics
Agenda
• Service Overview
• Migration Considerations
• Optimization:
  • Schema / Table Design
  • Data Ingestion
  • Maintenance and Database Tuning
• Demonstration
Service Overview
Relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
$1,000/TB/year; starts at $0.25/hour
Amazon Redshift: a lot faster, a lot simpler, a lot cheaper
Selected Amazon Redshift customers
Amazon Redshift system architecture

Leader node
• SQL endpoint (JDBC/ODBC)
• Stores metadata
• Coordinates query execution

Compute nodes
• Local, columnar storage
• Execute queries in parallel
• Load, backup, restore via Amazon S3; load from Amazon DynamoDB, Amazon EMR, or SSH
• Interconnected over 10 GigE (HPC)

Two hardware platforms, optimized for data processing
• DS2: HDD; scale from 2TB to 2PB
• DC1: SSD; scale from 160GB to 326TB

[Architecture diagram: SQL clients/BI tools connect to the leader node via JDBC/ODBC; the leader node coordinates multiple compute nodes (each shown with 16 cores, 128GB RAM, 16TB disk); ingestion, backup, and restore flow through S3 / EMR / DynamoDB / SSH.]
A deeper look at compute node architecture

Each node contains multiple slices
• DS2 – 2 slices on XL, 16 on 8XL
• DC1 – 2 slices on L, 32 on 8XL

Each slice is allocated CPU, memory, and table data

Each slice processes a piece of the workload in parallel
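You can see the slice layout of your own cluster directly. A minimal sketch, querying the standard Redshift system view STV_SLICES:

```sql
-- Count the slices on each compute node of the current cluster.
select node, count(*) as slices
from stv_slices
group by node
order by node;
```

On a single DS2.8XL node this would report 16 slices for node 0.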
Amazon Redshift dramatically reduces I/O: column storage, data compression, zone maps

| ID  | Age | State | Amount |
|-----|-----|-------|--------|
| 123 | 20  | CA    | 500    |
| 345 | 25  | WA    | 250    |
| 678 | 40  | FL    | 125    |
| 957 | 37  | WA    | 375    |

• Calculating SUM(Amount) with row storage:
  – Need to read everything
  – Unnecessary I/O
Amazon Redshift dramatically reduces I/O: column storage, data compression, zone maps

| ID  | Age | State | Amount |
|-----|-----|-------|--------|
| 123 | 20  | CA    | 500    |
| 345 | 25  | WA    | 250    |
| 678 | 40  | FL    | 125    |
| 957 | 37  | WA    | 375    |

• Calculating SUM(Amount) with column storage:
  – Only scan the necessary blocks
Amazon Redshift dramatically reduces I/O: column storage, data compression, zone maps

• Columnar compression
  – Effective due to like data
  – Reduces storage requirements
  – Reduces I/O

```
analyze compression orders;

 Table  | Column | Encoding
--------+--------+----------
 orders | id     | mostly32
 orders | age    | mostly32
 orders | state  | lzo
 orders | amount | mostly32
```
Amazon Redshift dramatically reduces I/O: column storage, data compression, zone maps

• In-memory block metadata
• Contains per-block MIN and MAX values
• Effectively prunes blocks which don't contain data for a given query
• Minimizes unnecessary I/O
Migration Considerations
Forklift = BAD
On Redshift
• Optimal throughput at 8-12 concurrent queries; up to 50 is possible
• Up-front table design is critical; you cannot add indexes
  • DISTKEY and SORTKEY significantly influence performance
  • Primary key, foreign key, and unique constraints help the planner
  • CAUTION: these constraints are not enforced
• Compression improves speed as well as storage
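The design points above can be sketched in DDL. A hypothetical orders table (the names and column choices are illustrative, not from the webinar):

```sql
-- Up-front design: DISTKEY, SORTKEY, and an informational primary key.
create table orders (
  id          bigint        not null,
  customer_id bigint        not null,
  order_date  date          not null,
  state       char(2),
  amount      decimal(12,2),
  primary key (id)   -- not enforced, but used by the query planner
)
diststyle key
distkey (customer_id)  -- co-locate rows joined on customer_id
sortkey (order_date);  -- prune blocks on date-range predicates
```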
On Redshift
• Commits are expensive; batch your loads
• Blocks are immutable, so a small write (~1-10 rows) and a relatively larger write (~100K rows) have a similar cost
• Partition temporal data with time-series tables and UNION ALL views
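The time-series pattern can be sketched as monthly tables behind a view (table names are illustrative):

```sql
-- One physical table per month, sharing the parent's design.
create table orders_2016_05 (like orders);
create table orders_2016_06 (like orders);

-- A single logical table for queries.
create view orders_all as
select * from orders_2016_05
union all
select * from orders_2016_06;

-- Retiring a month is then a cheap DROP TABLE
-- instead of an expensive DELETE plus VACUUM.
```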
Optimization: Schema Design
Goals of Distribution
• Distribute data evenly for parallel processing
• Minimize data movement
  • Co-located joins
  • Localized aggregations
Three distribution styles:
• Distribution Key – same key to same location (keyA, keyB, keyC, keyD each hash to a particular slice)
• All – full table data on the first slice of every node
• Even – round-robin distribution across all slices

[Diagram: two nodes with two slices each (slices 1-4), illustrating each style.]
Choosing a Distribution Style

KEY
• Large FACT tables
• Large or rapidly changing tables used in joins
• Localize columns used within aggregations

ALL
• Slowly changing data
• Reasonable size (i.e., a few million rows, but not hundreds of millions)
• No common distribution key for frequent joins
• Typical use case – a joined dimension table without a common distribution key

EVEN
• Tables not frequently joined or aggregated
• Large tables without acceptable candidate keys
Goals of Sorting

• Physically sort data within blocks and throughout the table
• Enable range-restricted scans to prune blocks by leveraging zone maps
• The optimal SORTKEY depends on:
  • Query patterns
  • Data profile
  • Business requirements
Choosing a SORTKEY

COMPOUND
• Most common
• Well-defined filter criteria
• Time-series data

INTERLEAVED
• Edge cases
• Large tables (> billion rows)
• No common filter criteria
• Non-time-series data

• Primarily choose a column used as a query predicate (date, identifier, …)
• Optionally choose a column frequently used for aggregates
• Optionally choose the same column as the distribution key for the most efficient joins (merge join)
Compressing Data
• COPY automatically analyzes and compresses data when loading into empty tables
• ANALYZE COMPRESSION checks existing tables and proposes optimal compression algorithms for each column
• Changing column encoding requires a table rebuild
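At the time of this webinar that rebuild means a deep copy. A sketch, using the hypothetical orders table and illustrative encodings:

```sql
-- Build a copy with the desired encodings...
create table orders_new (
  id     bigint        encode mostly32,
  age    int           encode mostly16,
  state  char(2)       encode lzo,
  amount decimal(12,2) encode mostly32
);
insert into orders_new select * from orders;
analyze orders_new;

-- ...then swap it in atomically.
begin;
alter table orders rename to orders_old;
alter table orders_new rename to orders;
commit;
drop table orders_old;
```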
Automatic compression is a good thing (mostly)
“The definition of insanity is doing the same thing over and over again, but expecting different results”.
- Albert Einstein
If you have a regular ETL process and you use temp tables or staging tables, turn off automatic compression:
• Use ANALYZE COMPRESSION to determine the right encodings
• Bake those encodings into your DML
• Use CREATE TABLE … LIKE
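A sketch of that pattern (bucket, role, and table names are illustrative): CREATE TABLE … LIKE inherits the parent table's encodings, distribution key, and sort key, and COMPUPDATE OFF skips the per-load compression analysis.

```sql
-- Staging table inherits design from orders, so no analysis is needed.
create temp table orders_staging (like orders);

copy orders_staging
from 's3://mybucket/batch/'                            -- hypothetical prefix
iam_role 'arn:aws:iam::123456789012:role/RedshiftCopy' -- hypothetical role
gzip
compupdate off;   -- skip automatic compression analysis on every load

insert into orders select * from orders_staging;
```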
Automatic compression is a good thing (mostly)

• From the zone maps we know:
  • Which block(s) contain the range
  • Which row offsets to scan
• Highly compressed sort keys mean:
  • Many rows per block
  • Large row offsets

Skip compression on just the leading column of the compound SORTKEY.
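A sketch of leaving the leading sort key uncompressed (table and column names are illustrative):

```sql
create table events (
  event_ts timestamp     encode raw,      -- leading sortkey: uncompressed
  user_id  bigint        encode mostly32,
  payload  varchar(2000) encode lzo
)
distkey (user_id)
sortkey (event_ts, user_id);
```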
Optimization: Ingest Performance
Amazon Redshift Loading Data Overview

• From the AWS cloud: Amazon S3 (multipart upload), Amazon DynamoDB, Amazon Elastic MapReduce, Amazon RDS; archive to Amazon Glacier
• From the corporate data center: logs / files and source DBs over a VPN connection or AWS Direct Connect; AWS Import/Export for large data volumes
• From EC2 or on-premises hosts (using SSH)
Parallelism is a function of load files

Each slice's query processors can load one file at a time:
• Streaming decompression
• Parse
• Distribute
• Write

A single input file means only one slice is ingesting data. On a DS2.8XL compute node (16 slices, 0-15), that realizes only partial cluster usage, as 6.25% of slices are active.
Maximize Throughput with Multiple Files

Use at least as many input files as there are slices in the cluster. With 16 input files on a DS2.8XL compute node, all slices are working, so you maximize throughput. COPY continues to scale linearly as you add additional nodes.
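Preparing the load files can be sketched with standard Unix tools (file names are illustrative). A COPY pointed at the common key prefix, e.g. `from 's3://mybucket/orders.csv.part-'`, then loads all parts in parallel:

```shell
# Stand-in for a real data export: 160,000 rows.
seq 1 160000 > orders.csv

# Split into 16 equal pieces, one per slice of a DS2.8XL node.
split -l 10000 orders.csv orders.csv.part-

# Compress each piece; COPY decompresses them with the GZIP option.
gzip orders.csv.part-*

ls orders.csv.part-*.gz | wc -l   # 16
```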
Optimization: Performance Tuning
Optimizing a database for querying

• Periodically check your table status
  • VACUUM and ANALYZE regularly
  • SVV_TABLE_INFO: missing statistics, table skew, uncompressed columns, unsorted data
• Check your cluster status
  • WLM queuing
  • Commit queuing
  • Database locks
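A sketch of a periodic health check against SVV_TABLE_INFO (a standard Redshift system view; the thresholds here are illustrative, not prescribed by the webinar):

```sql
select "table",
       unsorted,   -- percent of rows that are unsorted
       stats_off,  -- staleness of statistics (0 = current)
       skew_rows,  -- ratio of rows on fullest vs. emptiest slice
       encoded     -- whether column compression is defined
from svv_table_info
where unsorted > 20 or stats_off > 10 or skew_rows > 4
order by unsorted desc;
```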
Missing Statistics
• Amazon Redshift’s query optimizer relies on up-to-date statistics
• Statistics are only necessary for the data you are accessing
• Up-to-date stats are most important on:
  • SORTKEY columns
  • DISTKEY columns
  • Columns in query predicates
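A sketch (table and column names are illustrative); ANALYZE accepts an optional column list, so you can target just the columns the optimizer needs:

```sql
-- Refresh statistics for the whole table.
analyze orders;

-- Or restrict to sortkey, distkey, and predicate columns.
analyze orders (order_date, customer_id, state);
```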
Table Maintenance and Status

Table skew
• Unbalanced workload
• A query completes only as fast as the slowest slice completes
• Can cause skew inflight: temp data fills a single node, resulting in query failure

Unsorted table
• The sortkey is just a guide; data needs to actually be sorted
• VACUUM or DEEP COPY to sort
• Scans against unsorted tables continue to benefit from zone maps if you load sequential blocks
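The VACUUM variants can be sketched as follows (table name illustrative):

```sql
-- Re-sort rows and reclaim deleted space in one pass.
vacuum full orders;

-- Re-sort only, skipping the delete-reclaim phase.
vacuum sort only orders;

-- Reclaim deleted rows without re-sorting.
vacuum delete only orders;
```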
Cluster Status: Commits and WLM

WLM queue
• Identify short- and long-running queries and prioritize them
• Define multiple queues to route queries appropriately
• Default concurrency of 5
• Leverage wlm_apex_hourly to tune WLM based on peak concurrency requirements

Commit queue
• How long is your commit queue?
• Identify needless transactions
• Group dependent statements within a single transaction
• Offload operational workloads
• STL_COMMIT_STATS
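A sketch of inspecting the commit queue through STL_COMMIT_STATS (a standard system table; the node = -1 row summarizes the whole commit):

```sql
select xid,
       datediff(ms, startqueue, startwork) as queue_ms,  -- time queued
       datediff(ms, startwork, endtime)    as commit_ms, -- time committing
       queuelen
from stl_commit_stats
where node = -1
order by startqueue desc
limit 20;
```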
Cluster Status: Database Locks
• Database locks
  • Read locks, write locks, exclusive locks
  • Reads block exclusives
  • Writes block writes and exclusives
  • Exclusives block everything
• Ungranted locks block subsequent lock requests
• Exposed through SVV_TRANSACTIONS
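A sketch of finding sessions stuck behind a lock via SVV_TRANSACTIONS (a standard system view):

```sql
select txn_owner, pid, lock_mode,
       relation,   -- OID of the locked table
       txn_start
from svv_transactions
where granted = false   -- lock requests still waiting
order by txn_start;
```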
Demonstration
https://s3.amazonaws.com/chriz-webinar/webinar.zip
Use SORTKEYs to effectively prune blocks
Don’t compress initial SORTKEY column
Use compression encoding to reduce I/O
Choose a DISTKEY which avoids data skew
Ingest: Disable predictable compression analysis
Ingest: Load multiple files to match cluster slices
VACUUM to physically remove deleted rows
VACUUM to keep your tables sorted
Gather statistics to assist the query planner
Thank you!
Questions?
https://s3.amazonaws.com/chriz-webinar/webinar.zip