awssummit berlin day2 steffeng query your data in …...including csv, json, orc, avro, and parquet...

S U MM I TBERL IN

Query your data in S3 with SQL andoptimize for cost and performance

Steffen GrunwaldSolutions Architect, Amazon Web Services

steffeng@amazon.deTwitter: @steffeng

every 5 years

There is more data than people think.

live for

Data platforms need to

scalegrows

Traditionally, analytics looked like this

Expensive: Large initial capex + $10k $50k/TB/year

GBs-TBs scale [not designed for PB/EBs]

Relational data

90% of data was thrown away because of cost

OLTP ERP CRM LOB

Data Warehouse

Business Intelligence

Snowball

Snowmobile KinesisData Firehose

KinesisData Streams

S3Redshift

EMRAthena Kinesis

Elasticsearch Service

Data Lakes on AWS

KinesisVideo Streams

AI Services

QuickSight

Exabyte scale

Store and analyze relational and non-relational data

Purpose-built analytic tools

Analytics run on the same data without movement

Cost effective• Store at $0.0245 per GB-month in S3 Standard (Frankfurt)• Query with Athena at $0.05/GB scanned• DW with Redshift for $1,000/TB/year

Ingest data with Amazon Kinesis Firehose

• Stores stream of records as files in a bucket• "YYYY/MM/DD/HH“ prefix or

(New!) custom pattern based on ingestion time (UTC)• Optionally compress (GZIP, ZIP, Snappy)• Optionally store as columnar format (ORC, Parquet)• Optionally transform records with AWS Lambda

Amazon Kinesis Data Firehose

Amazon Simple Storage Service (S3)

Amazon Athena

Fast. Really Fast.

Interactive performance even for large datasets. Athena automatically

executes queries in parallel, so most results come back

within seconds.

Open. Powerful. Standard

Start Querying Instantly Pay Per Query

Athena is serverless. Just point to your data in

Amazon S3, define the schema, and start querying

using the built-in query editor.

Amazon Athena uses Presto with ANSI SQL support and

works with a variety of standard data formats,

including CSV, JSON, ORC, Avro, and Parquet

With Amazon Athena, you pay only for the queries that you run. You are charged $5 per terabyte scanned by your

queries.

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.

Familiar technologies under the covers

Used for SQL QueriesIn-memory distributed query engineANSI-SQL compatible with extensions(Eg. SELECT * FROM tableName)

Used for DDL functionalityComplex data typesMultitude of formats Supports data partitioning(Eg. CREATE TABLE, ALTER TABLE,

MSCK REPAIR)

Movable Ink provides real-time personalization of marketing emails based on a wide range of user, device, and contextual data

Movable Ink has been collecting data on user actions since 2011, and this database grows by 75 to 100GB per day. To reduce time-to-insight, optimize costs, and increase flexibility for its analytics, the company recently adopted the serverless Amazon Athena query service.Since the company began using Amazon Athena, it has realized both cost savings and improved performance for analytics related to user actions.

“Using Amazon Athena, we’re able to query seven years’ worth of data—adding up to hundreds of terabytes—get results at least 50 percent faster, and save nearly $15,000 per month, all without keeping a cluster running.”

Matt Chesler, director of DevOps at Movable Ink

The example data: NYC Taxi trips

Data is generated by kinesis-taxi-stream-produceravailable at http://a.co/3KyaGLV

java -jar kinesis-taxi-stream-producer.jar-speedup 400 -statisticsFrequency 10000-stream nyctlc-ingestion –noWatermark-region eu-central-1 -adaptTime ingestion

à ~2GB/h of raw data, 11 days, 487 GB total

Firehose (raw)

Firehose (gzip)

Firehose (parquet)

Test setup: Ingesting data with different settings

(max. Firehose bufferinghints: 128MB & 900s)

S3Amazon Kinesis Data Streams

Instance

Example query I

Show some rides on 2nd September 10-11h:SELECT *FROM "128mb"WHERE pickup_datetime

BETWEEN '2018-09-02T10' AND '2018-09-02T11'LIMIT 10

Run time: 3.53 seconds, Data scanned: 4.62GB

Example query II (gzip)gzip reduces 487GB to 76GB.

Show some rides on 2nd September 10-11h:SELECT *FROM "128mbgz"WHERE pickup_datetime

BETWEEN '2018-09-02T10' AND '2018-09-02T11'LIMIT 10

Run time: 3.53 seconds, Data scanned: 4.62GBRun time: 2.45 seconds, Data scanned: 303.04KB

Example query III (without LIMIT 10)What was the distribution of passenger load on 2nd September 10-11h?

SELECT passenger_count, count(*) countFROM "128mbgz"WHERE pickup_datetime

BETWEEN '2018-09-02T10' AND '2018-09-02T11'GROUP BY passenger_count

Run time: 50.36 seconds, Data scanned: 76.5GB

Partitions to theRescue

AWS Glue crawlers addpartitions based on fileprefixes/ directories.

Example query IV

What was the distribution of passenger load on 2nd September 10-11h?

SELECT passenger_count, count(*) countFROM "128mbgz"WHERE pickup_datetime

BETWEEN '2018-09-02T10' AND '2018-09-02T11'AND partition_0 || partition_1 || partition_2 ||

partition_3BETWEEN '2018090210' AND '2018090215'

GROUP BY passenger_count

Run time: 50.36 seconds, Data scanned: 76.5GBRun time: 5.59 seconds, Data scanned: 1.77GB

Option 1: Crawl with AWS Glue

Why? Just schedule the crawler, no need to code!Deals with schema evolution.

Lookupschema

Create tablepartitions

Query data

Crawldata

S3 Bucket

Glue Crawler

Glue Datacatalog

Amazon AthenaLog Data

Option 2: Use Apache Hive-style file format in S3Move/ copy:

YYYY/MM/DD/HH/file

year=YYYY/month=MM/day=DD/hours=HH/file

Why? Format easy to create on write, easy to move.Reload partitions by: msck repair table.

Option 3: Add partitions with AWS Lambda

Why? Add partitions instantly, just AWS Lambda cost.

Lookupschema

Add tablepartition

Query data

New filetrigger

S3 Bucket

LambdaFunction

Glue Datacatalog

Option 4: Prepopulate partitions

Issue statements with Amazon Athena every hour/ day:

ALTER TABLE mytableADD PARTITION(year='2015',month='01',day='01')LOCATION 's3://[...]/2015/01/01/'

Why? Easy for predictable paths.Data is instantly available in queries.

Last_NameFischerMeierSchulzMalerMüller

Age3425456329

GenderFemFemFemMalMal

First_NameHeidiAnnaNicoleKlausRainer

Flat file sample layout

MIN: FischerMAX: Schulz

Age3425456329

MIN: 25MAX: 63

MIN: FemMAX: Mal

First_NameHeidiAnnaNicoleKlaus

RainerMIN: Anna

MAX: Rainer

Columnar formats layout (Parquet & ORC)

SELECT * FROM ... WHERE Age > 30

Age3425456329

MIN: 25MAX: 63

MIN: FemMAX: Mal

RainerMIN: Anna

MAX: Rainer

Benefit 1: predicate pushdown

Age3425456329

MIN: 25MAX: 63

MIN: FemMAX: Mal

RainerMIN: Anna

MAX: Rainer

SELECT First_Name FROM ... WHERE Age > 30

Benefit 2: projection pushdown/ column pruning

Example query V (parquet)What was the distribution of passenger loadon 2nd September 10-11h?SELECT passenger_count, count(*) countFROM "128mbparquet"WHERE pickup_datetime

Run time: 5.59 seconds, Data scanned: 1.77GBRun time: 3.21 seconds, Data scanned: 300.7MB

Analyzing an Apache parquet file

• parquet-tools• head – view data in file• meta – get metadata summary• dump -d -n – get detailed metadata down to page

level stats included

Schema information

1. Download and build https://github.com/apache/parquet-mr/2. run: java -jar parquet-tools.jar meta parquetfile

Row Count Total Byte Size Size in Bytes Value Count Encoding

parquet-tools dump: encoding & statisticstotal_amount:- DOUBLE SNAPPY DO:0 FPO:4155231 SZ:329324/338501/1.03 [more]... ST:[min: -76.8, max: 1121.3, num_nulls: 0]dropoff_datetime:- BINARY SNAPPY DO:0 FPO:3315979 SZ:839131/5540639/6.60 [more]... ST:[no stats for this column]

Use (unix epoch) or partition by timestamp for time seriesdata.

Example query VI (ORC)What was the distribution of passenger loadon 2nd September 10-11h?SELECT passenger_count, count(*) countFROM "128mborc"WHERE pickup_datetime

Run time: 3.21 seconds, Data scanned: 300.7MBRun time: 3.61 seconds, Data scanned: 303.38MB

Analyzing Apache ORC: orcdumpfile

Spin up a single node/ master EMR Cluster and use thehive command:

hive --orcfiledump file://<absolutepath>/file.orc

Column 7: count: 210141 hasNull: false min: -76.96324157714844 max: 0.0 sum: -1.5329986951126099E7

Column 8: count: 210141 hasNull: false min: 2018-08-30T00:13:48.573Z max: 2018-08-30T00:28:49.564Z sum: 5043384

Option 1: Extract, Transform, Load with AWS Glue

Lookupschema

Query data

Read/Write

S3 Bucket

Glue Job

Glue Datacatalog

Example Zeppelin/ AWS Glue Notebook

Option 2: Amazon Athena – Create Table as Select

Lookupschema

Query data

Read/Write

S3 Bucket

AthenaCTAS

Glue Datacatalog

Athena CTAS

Repartition one day:

Pay for data scanned

4.86GB / 1TB * 5 USD= 0.0243 USD

Example query VII (repartitioned)What was the distribution of passenger loadon 2nd September 10-11h?SELECT passenger_count, count(*) countFROM "partitioned_by_hour"WHERE year = 2018

AND month = 9AND day = 2AND hour = 10

Run time: 3.21 seconds, Data scanned: 300.7MBRun time: 2.68 seconds, Data scanned: 1.41 MB

Example query VIII (aggregated)What was the distribution of passenger loadon 2nd September 10-11h?SELECT passenger_count, trip_countFROM "aggregates_by_hour"WHERE year = 2018

AND month = 9AND day = 2AND hour = 10

Run time: 2.42 seconds, Data scanned: 2.06MBRun time: 1.84 seconds, Data scanned: 1.9 KB

I applied these simple tricks when storing datafor Amazon Athena andyou won‘t believe whathappened next...

Measure.Then optimize. There‘s no silver bullet.

Photo by Cesar Carlevarino Aragon on Unsplash

Optimize for cost and performance 1/2

• Use Athena in the region of your buckets.• Compress your data for less storage & query cost.• Use LIMIT in queries for faster results.• Partition your data based on data access patterns.• Use partitions in your queries.• Add partitions by crawling or S3 triggers.

Optimize for cost and performance 2/2

• Columnar formats as ORC & parquet reduce scanneddata: faster, less cost

• Pick format depending on data, access patterns, clients• Inspect/ verify the results• Create aggregates for frequent queries• Try Amazon Athena's Create Table as Select first• Shorten turnaround times for Glue job development:

• Use a provisioned development endpoint• Use small subset of your data (think KB!)

Get started!

Find more detailsand deploy a sample application forAmazon CloudFront logs:

http://a.co/aJFQuMI

Thank you!

Steffen Grunwald

steffeng@amazon.deTwitter: @steffeng

awssummit berlin day2 steffeng query your data in …...including csv, json, orc, avro, and parquet...

Documents

avro introduction

apache avro 1.10.2 specification

the avro arrow, sputnik and sdi

fly model nº 09 - avro lancaster

avro rj flows

3 avro hug-2010-07-21

avro manhatten-the vaticans holocaust,2007

avro flying insect killers

avro 146-rj70 - avionmar

aerodata international 10 avro lancaster

annual report - avro

serialization (avro, message pack, kryo)

1957- avro cf-105 arrow avweak

avro intro

avro furniture

avro lancaster assembly instructions

avro rj70 - m-stry

#awssummit démarrer sur...

avro - tutorialspoint.com · 2019-03-01 · avro v to...

contents...avro 621 tutor/sea tutor 22 avro 626 prefect 23...