awssummit berlin day2 steffeng query your data in …...including csv, json, orc, avro, and parquet...
Post on 20-May-2020
23 Views
Preview:
TRANSCRIPT
S U MM I TBERL IN
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Query your data in S3 with SQL andoptimize for cost and performance
Steffen GrunwaldSolutions Architect, Amazon Web Services
steffeng@amazon.deTwitter: @steffeng
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Data
every 5 years
There is more data than people think.
years
live for
Data platforms need to
scalegrows
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Traditionally, analytics looked like this
Expensive: Large initial capex + $10k $50k/TB/year
GBs-TBs scale [not designed for PB/EBs]
Relational data
90% of data was thrown away because of cost
OLTP ERP CRM LOB
Data Warehouse
Business Intelligence
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Snowball
Snowmobile KinesisData Firehose
KinesisData Streams
S3Redshift
EMRAthena Kinesis
Elasticsearch Service
Data Lakes on AWS
KinesisVideo Streams
AI Services
QuickSight
Exabyte scale
Store and analyze relational and non-relational data
Purpose-built analytic tools
Analytics run on the same data without movement
Cost effective• Store at $0.0245 per GB-month in S3 Standard (Frankfurt)• Query with Athena at $0.05/GB scanned• DW with Redshift for $1,000/TB/year
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Ingest data with Amazon Kinesis Firehose
• Stores stream of records as files in a bucket• "YYYY/MM/DD/HH“ prefix or
(New!) custom pattern based on ingestion time (UTC)• Optionally compress (GZIP, ZIP, Snappy)• Optionally store as columnar format (ORC, Parquet)• Optionally transform records with AWS Lambda
Amazon Kinesis Data Firehose
Amazon Simple Storage Service (S3)
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Amazon Athena
Fast. Really Fast.
Interactive performance even for large datasets. Athena automatically
executes queries in parallel, so most results come back
within seconds.
Open. Powerful. Standard
Start Querying Instantly Pay Per Query
Athena is serverless. Just point to your data in
Amazon S3, define the schema, and start querying
using the built-in query editor.
Amazon Athena uses Presto with ANSI SQL support and
works with a variety of standard data formats,
including CSV, JSON, ORC, Avro, and Parquet
With Amazon Athena, you pay only for the queries that you run. You are charged $5 per terabyte scanned by your
queries.
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Familiar technologies under the covers
Used for SQL QueriesIn-memory distributed query engineANSI-SQL compatible with extensions(Eg. SELECT * FROM tableName)
Used for DDL functionalityComplex data typesMultitude of formats Supports data partitioning(Eg. CREATE TABLE, ALTER TABLE,
MSCK REPAIR)
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Movable Ink provides real-time personalization of marketing emails based on a wide range of user, device, and contextual data
Movable Ink has been collecting data on user actions since 2011, and this database grows by 75 to 100GB per day. To reduce time-to-insight, optimize costs, and increase flexibility for its analytics, the company recently adopted the serverless Amazon Athena query service.Since the company began using Amazon Athena, it has realized both cost savings and improved performance for analytics related to user actions.
“Using Amazon Athena, we’re able to query seven years’ worth of data—adding up to hundreds of terabytes—get results at least 50 percent faster, and save nearly $15,000 per month, all without keeping a cluster running.”
Matt Chesler, director of DevOps at Movable Ink
SUMM IT © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
The example data: NYC Taxi trips
Data is generated by kinesis-taxi-stream-produceravailable at http://a.co/3KyaGLV
java -jar kinesis-taxi-stream-producer.jar-speedup 400 -statisticsFrequency 10000-stream nyctlc-ingestion –noWatermark-region eu-central-1 -adaptTime ingestion
à ~2GB/h of raw data, 11 days, 487 GB total
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT Firehose (orc)
Firehose (raw)
Firehose (gzip)
Firehose (parquet)
Test setup: Ingesting data with different settings
(max. Firehose bufferinghints: 128MB & 900s)
S3Amazon Kinesis Data Streams
Instance
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Example query I
Show some rides on 2nd September 10-11h:SELECT *FROM "128mb"WHERE pickup_datetime
BETWEEN '2018-09-02T10' AND '2018-09-02T11'LIMIT 10
Run time: 3.53 seconds, Data scanned: 4.62GB
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Example query II (gzip)gzip reduces 487GB to 76GB.
Show some rides on 2nd September 10-11h:SELECT *FROM "128mbgz"WHERE pickup_datetime
BETWEEN '2018-09-02T10' AND '2018-09-02T11'LIMIT 10
Run time: 3.53 seconds, Data scanned: 4.62GBRun time: 2.45 seconds, Data scanned: 303.04KB
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Example query III (without LIMIT 10)What was the distribution of passenger load on 2nd September 10-11h?
SELECT passenger_count, count(*) countFROM "128mbgz"WHERE pickup_datetime
BETWEEN '2018-09-02T10' AND '2018-09-02T11'GROUP BY passenger_count
Run time: 50.36 seconds, Data scanned: 76.5GB
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT Photo by Tang Junwen on Unsplash
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Partitions to theRescue
AWS Glue crawlers addpartitions based on fileprefixes/ directories.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Example query IV
What was the distribution of passenger load on 2nd September 10-11h?
SELECT passenger_count, count(*) countFROM "128mbgz"WHERE pickup_datetime
BETWEEN '2018-09-02T10' AND '2018-09-02T11'AND partition_0 || partition_1 || partition_2 ||
partition_3BETWEEN '2018090210' AND '2018090215'
GROUP BY passenger_count
Run time: 50.36 seconds, Data scanned: 76.5GBRun time: 5.59 seconds, Data scanned: 1.77GB
SUMM IT © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Option 1: Crawl with AWS Glue
Why? Just schedule the crawler, no need to code!Deals with schema evolution.
Lookupschema
Create tablepartitions
Query data
Crawldata
S3 Bucket
Glue Crawler
Glue Datacatalog
Amazon AthenaLog Data
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Option 2: Use Apache Hive-style file format in S3Move/ copy:
YYYY/MM/DD/HH/file
year=YYYY/month=MM/day=DD/hours=HH/file
Why? Format easy to create on write, easy to move.Reload partitions by: msck repair table.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Option 3: Add partitions with AWS Lambda
Why? Add partitions instantly, just AWS Lambda cost.
Lookupschema
Add tablepartition
Query data
New filetrigger
S3 Bucket
LambdaFunction
Glue Datacatalog
Amazon AthenaLog Data
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Option 4: Prepopulate partitions
Issue statements with Amazon Athena every hour/ day:
ALTER TABLE mytableADD PARTITION(year='2015',month='01',day='01')LOCATION 's3://[...]/2015/01/01/'
Why? Easy for predictable paths.Data is instantly available in queries.
SUMM IT © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Last_NameFischerMeierSchulzMalerMüller
Age3425456329
GenderFemFemFemMalMal
First_NameHeidiAnnaNicoleKlausRainer
Flat file sample layout
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Last_NameFischerMeierSchulzMalerMüller
MIN: FischerMAX: Schulz
Age3425456329
MIN: 25MAX: 63
GenderFemFemFemMalMal
MIN: FemMAX: Mal
First_NameHeidiAnnaNicoleKlaus
RainerMIN: Anna
MAX: Rainer
Columnar formats layout (Parquet & ORC)
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
SELECT * FROM ... WHERE Age > 30
Last_NameFischerMeierSchulzMalerMüller
MIN: FischerMAX: Schulz
Age3425456329
MIN: 25MAX: 63
GenderFemFemFemMalMal
MIN: FemMAX: Mal
First_NameHeidiAnnaNicoleKlaus
RainerMIN: Anna
MAX: Rainer
Benefit 1: predicate pushdown
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Last_NameFischerMeierSchulzMalerMüller
MIN: FischerMAX: Schulz
Age3425456329
MIN: 25MAX: 63
GenderFemFemFemMalMal
MIN: FemMAX: Mal
First_NameHeidiAnnaNicoleKlaus
RainerMIN: Anna
MAX: Rainer
SELECT First_Name FROM ... WHERE Age > 30
Benefit 2: projection pushdown/ column pruning
SUMM IT © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Example query V (parquet)What was the distribution of passenger loadon 2nd September 10-11h?SELECT passenger_count, count(*) countFROM "128mbparquet"WHERE pickup_datetime
BETWEEN '2018-09-02T10' AND '2018-09-02T11'AND partition_0 || partition_1 || partition_2 ||
partition_3BETWEEN '2018090210' AND '2018090215'
GROUP BY passenger_count
Run time: 5.59 seconds, Data scanned: 1.77GBRun time: 3.21 seconds, Data scanned: 300.7MB
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Analyzing an Apache parquet file
• parquet-tools• head – view data in file• meta – get metadata summary• dump -d -n – get detailed metadata down to page
level stats included
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Schema information
1. Download and build https://github.com/apache/parquet-mr/2. run: java -jar parquet-tools.jar meta parquetfile
Row Count Total Byte Size Size in Bytes Value Count Encoding
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
parquet-tools dump: encoding & statisticstotal_amount:- DOUBLE SNAPPY DO:0 FPO:4155231 SZ:329324/338501/1.03 [more]... ST:[min: -76.8, max: 1121.3, num_nulls: 0]dropoff_datetime:- BINARY SNAPPY DO:0 FPO:3315979 SZ:839131/5540639/6.60 [more]... ST:[no stats for this column]
Use (unix epoch) or partition by timestamp for time seriesdata.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Example query VI (ORC)What was the distribution of passenger loadon 2nd September 10-11h?SELECT passenger_count, count(*) countFROM "128mborc"WHERE pickup_datetime
BETWEEN '2018-09-02T10' AND '2018-09-02T11'AND partition_0 || partition_1 || partition_2 ||
partition_3BETWEEN '2018090210' AND '2018090215'
GROUP BY passenger_count
Run time: 3.21 seconds, Data scanned: 300.7MBRun time: 3.61 seconds, Data scanned: 303.38MB
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Analyzing Apache ORC: orcdumpfile
Spin up a single node/ master EMR Cluster and use thehive command:
hive --orcfiledump file://<absolutepath>/file.orc
[…]
Column 7: count: 210141 hasNull: false min: -76.96324157714844 max: 0.0 sum: -1.5329986951126099E7
Column 8: count: 210141 hasNull: false min: 2018-08-30T00:13:48.573Z max: 2018-08-30T00:28:49.564Z sum: 5043384
[…]
SUMM IT © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Option 1: Extract, Transform, Load with AWS Glue
Lookupschema
Create tablepartitions
Query data
Read/Write
S3 Bucket
Glue Job
Glue Datacatalog
Amazon AthenaLog Data
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Example Zeppelin/ AWS Glue Notebook
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Option 2: Amazon Athena – Create Table as Select
Lookupschema
Create tablepartitions
Query data
Read/Write
S3 Bucket
AthenaCTAS
Glue Datacatalog
Amazon AthenaLog Data
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Athena CTAS
Repartition one day:
Pay for data scanned
4.86GB / 1TB * 5 USD= 0.0243 USD
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Example query VII (repartitioned)What was the distribution of passenger loadon 2nd September 10-11h?SELECT passenger_count, count(*) countFROM "partitioned_by_hour"WHERE year = 2018
AND month = 9AND day = 2AND hour = 10
GROUP BY passenger_count
Run time: 3.21 seconds, Data scanned: 300.7MBRun time: 2.68 seconds, Data scanned: 1.41 MB
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Example query VIII (aggregated)What was the distribution of passenger loadon 2nd September 10-11h?SELECT passenger_count, trip_countFROM "aggregates_by_hour"WHERE year = 2018
AND month = 9AND day = 2AND hour = 10
Run time: 2.42 seconds, Data scanned: 2.06MBRun time: 1.84 seconds, Data scanned: 1.9 KB
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT Photo by Benjamin Davies on Unsplash
I applied these simple tricks when storing datafor Amazon Athena andyou won‘t believe whathappened next...
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Measure.Then optimize. There‘s no silver bullet.
Photo by Cesar Carlevarino Aragon on Unsplash
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Optimize for cost and performance 1/2
• Use Athena in the region of your buckets.• Compress your data for less storage & query cost.• Use LIMIT in queries for faster results.• Partition your data based on data access patterns.• Use partitions in your queries.• Add partitions by crawling or S3 triggers.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Optimize for cost and performance 2/2
• Columnar formats as ORC & parquet reduce scanneddata: faster, less cost
• Pick format depending on data, access patterns, clients• Inspect/ verify the results• Create aggregates for frequent queries• Try Amazon Athena's Create Table as Select first• Shorten turnaround times for Glue job development:
• Use a provisioned development endpoint• Use small subset of your data (think KB!)
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Get started!
Find more detailsand deploy a sample application forAmazon CloudFront logs:
http://a.co/aJFQuMI
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM IT
Thank you!
SUMM IT © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Steffen Grunwald
steffeng@amazon.deTwitter: @steffeng
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.SUMM ITSUMM IT © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
top related