(BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014


DESCRIPTION

Want to get ramped up on how to use Amazon's big data web services and launch your first big data application on AWS? Join us on our journey as we build a big data application in real-time using Amazon EMR, Amazon Redshift, Amazon Kinesis, Amazon DynamoDB, and Amazon S3. We review architecture design patterns for big data solutions on AWS, and give you access to a take-home lab so that you can rebuild and customize the application yourself.

TRANSCRIPT

November 12th, 2014 | Las Vegas, NV

Matt Yanchyshyn, Principal Solutions Architect

[Architecture overview slide: Collect and Store (Amazon Kinesis, Amazon S3, Amazon DynamoDB, Amazon Glacier, AWS Direct Connect, AWS Import/Export), Process & Analyze (Amazon EMR, Amazon EC2, Amazon Redshift), Automate (AWS Data Pipeline)]

[Demo pipeline slide: a Log4J appender publishes log data into Amazon Kinesis; the EMR-Kinesis connector lets Hive on Amazon EMR read the stream while tracking Amazon Kinesis processing state, and writes results to Amazon S3; Amazon Redshift then loads them with a parallel COPY from Amazon S3]

Launch a 3-instance Hadoop 2.4 cluster with Hive installed, using m3.xlarge instances in YOUR-AWS-REGION, your SSH key YOUR-AWS-SSH-KEY, and an Amazon S3 log bucket YOUR-BUCKET-NAME.
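A sketch of that launch with the aws emr CLI; the cluster name, AMI version, and role options are assumptions, while the instance type, instance count, and placeholders come from the slide.

# launch a 3 x m3.xlarge Hadoop cluster with Hive
# (AMI 3.x releases shipped Hadoop 2.4; newer EMR uses --release-label)
# run 'aws emr create-default-roles' once if you have not already
aws emr create-cluster \
    --name "bdt205-demo" \
    --ami-version 3.3.1 \
    --applications Name=Hive \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --ec2-attributes KeyName=YOUR-AWS-SSH-KEY \
    --use-default-roles \
    --log-uri s3://YOUR-BUCKET-NAME/emr-logs/ \
    --region YOUR-AWS-REGION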

Create an Amazon Kinesis stream to hold incoming data:

aws kinesis create-stream \

--stream-name AccessLogStream \

--shard-count 2

Create a single-node Amazon Redshift cluster, choosing a master password (CHOOSE-A-REDSHIFT-PASSWORD).
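A sketch of the matching call; the cluster identifier, database name, node type, and master user name are assumptions, and only the password placeholder comes from the slide.

# create a single-node Amazon Redshift cluster
aws redshift create-cluster \
    --cluster-identifier demo \
    --db-name demo \
    --node-type dw2.large \
    --cluster-type single-node \
    --master-username master \
    --master-user-password CHOOSE-A-REDSHIFT-PASSWORD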

Log4J producer: configure the Amazon Kinesis Log4J appender with YOUR-IAM-ACCESS-KEY and YOUR-IAM-SECRET-KEY, then start it so that Apache access-log lines are published to AccessLogStream (sketched below).
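A rough sketch of that producer step; the credentials-file layout, jar name, and sample log file are assumptions (the talk's take-home lab provides the actual artifacts), only the IAM-key placeholders and stream name come from the slides.

# give the Kinesis Log4J appender your IAM credentials
# (file name and keys are assumptions; see the take-home lab for the real artifact)
cat > AwsCredentials.properties <<EOF
accessKey=YOUR-IAM-ACCESS-KEY
secretKey=YOUR-IAM-SECRET-KEY
EOF

# replay sample Apache access-log lines into AccessLogStream
java -jar kinesis-log4j-appender-1.0.0.jar access_log_1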

SSH to the EMR master node with YOUR-AWS-SSH-KEY and YOUR-EMR-MASTER-PRIVATE-DNS; use YOUR-EMR-HOSTNAME when tunneling to the Hadoop web UIs (sketched below).
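A sketch of the SSH step; the master-node ports (9026 for the YARN ResourceManager and 9101 for the NameNode on EMR AMI 3.x) are assumptions chosen to match the forwarded URLs shown later.

# log in to the EMR master node
ssh -i YOUR-AWS-SSH-KEY hadoop@YOUR-EMR-MASTER-PRIVATE-DNS

# optional: from your laptop, forward the Hadoop web UIs used later
ssh -i YOUR-AWS-SSH-KEY -N \
    -L 19026:YOUR-EMR-MASTER-PRIVATE-DNS:9026 \
    -L 19101:YOUR-EMR-MASTER-PRIVATE-DNS:9101 \
    hadoop@YOUR-EMR-HOSTNAME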

Start Hive:

hive

In the Hive session, set your IAM credentials (YOUR-IAM-ACCESS-KEY / YOUR-IAM-SECRET-KEY) and YOUR-AWS-REGION, then create a Hive table backed by the Amazon Kinesis stream:

hive> CREATE TABLE ...
      STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
      TBLPROPERTIES("kinesis.stream.name"="AccessLogStream");
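A fuller sketch of that table and the preceding settings, assuming the Apache access-log columns and RegexSerDe from the published Amazon EMR / Kinesis Hive sample; the table name apache_log, the column list, the regex, and the S3 credential property names are assumptions, and the exact property used for YOUR-AWS-REGION is left out rather than guessed.

-- credentials for reading and writing Amazon S3 from Hive (property names assumed)
hive> set fs.s3.awsAccessKeyId=YOUR-IAM-ACCESS-KEY;
hive> set fs.s3.awsSecretAccessKey=YOUR-IAM-SECRET-KEY;

-- table that reads Apache access-log records directly from the Kinesis stream
hive> CREATE TABLE apache_log (
        host STRING, identity STRING, `user` STRING, request_time STRING,
        request STRING, status STRING, size STRING, referrer STRING, agent STRING)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
      WITH SERDEPROPERTIES (
        "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?")
      STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
      TBLPROPERTIES("kinesis.stream.name"="AccessLogStream");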

-- return the first row in the stream

-- return the count of all items in the stream

-- return the count of all rows with a given host
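Sketches of those three queries, in order, against the apache_log table assumed above; the host value is a hypothetical example.

hive> SELECT * FROM apache_log LIMIT 1;

hive> SELECT COUNT(1) FROM apache_log;

hive> SELECT COUNT(1) FROM apache_log WHERE host = '66.249.67.3';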


While the queries run, watch the job in the Hadoop web UIs through your SSH tunnel: http://127.0.0.1:19026/cluster (YARN ResourceManager) and http://127.0.0.1:19101 (HDFS NameNode).

Next, write the query results to Amazon S3 through an external Hive table located at s3://YOUR-S3-BUCKET/emroutput:
• set up Hive's "dynamic partitioning", which splits output files when writing to Amazon S3
• compress output files on Amazon S3 using Gzip
• convert the Apache log timestamp to a UNIX timestamp and split files in Amazon S3 by the hour in the log lines
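A sketch of those steps; the output table name and column list are assumptions carried over from the apache_log sketch above, while the settings follow the slide's comments.

-- an external table on Amazon S3 to hold the results
hive> CREATE EXTERNAL TABLE apache_log_s3 (
        request_time BIGINT, host STRING, request STRING,
        status STRING, referrer STRING, agent STRING)
      PARTITIONED BY (hour INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION 's3://YOUR-S3-BUCKET/emroutput';

-- set up Hive's dynamic partitioning, which splits output files
-- when writing to Amazon S3
hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;

-- compress output files on Amazon S3 using Gzip
hive> SET hive.exec.compress.output=true;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- convert the Apache log timestamp to a UNIX timestamp and
-- split files in Amazon S3 by the hour in the log lines
hive> INSERT OVERWRITE TABLE apache_log_s3 PARTITION (hour)
      SELECT
        unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]') AS request_time,
        host, request, status, referrer, agent,
        hour(from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]'))) AS hour
      FROM apache_log;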


Confirm the partitioned, Gzip-compressed output files under s3://YOUR-S3-BUCKET/emroutput in the Amazon S3 console.

# using the PostgreSQL CLI
Connect to your cluster endpoint, YOUR-REDSHIFT-ENDPOINT, on port 5439.
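A sketch of the connection, assuming the database and user names from the create-cluster sketch above (demo / master); 5439 is Amazon Redshift's default port.

psql -h YOUR-REDSHIFT-ENDPOINT -p 5439 -U master demo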

Or use any JDBC or ODBC SQL client with the PostgreSQL 8.x drivers or native Amazon Redshift support:

• Aginity Workbench for Amazon Redshift

• SQL Workbench/J

Load the Hive output into Amazon Redshift with a parallel COPY from s3://YOUR-S3-BUCKET, using YOUR-IAM-ACCESS-KEY and YOUR-IAM-SECRET-KEY as the COPY credentials (sketched below).
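A sketch of the table and the parallel COPY; the table name, columns, keys, and options are assumptions matching the Hive output sketched earlier, while the bucket and credential placeholders come from the slide.

-- table to hold the access-log records (columns assumed)
CREATE TABLE accesslog (
  request_time TIMESTAMP,
  host VARCHAR(50),
  request VARCHAR(1024),
  status VARCHAR(10),
  referrer VARCHAR(1024),
  agent VARCHAR(1024))
DISTKEY(host)
SORTKEY(request_time);

-- parallel COPY of the Gzip'd, tab-delimited Hive output from Amazon S3
COPY accesslog
FROM 's3://YOUR-S3-BUCKET/emroutput/'
CREDENTIALS 'aws_access_key_id=YOUR-IAM-ACCESS-KEY;aws_secret_access_key=YOUR-IAM-SECRET-KEY'
DELIMITER '\t'
TIMEFORMAT 'epochsecs'
GZIP;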

-- show all requests from a given IP address

-- count all requests on a given day

-- show all requests referred from other sites
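Sketches of those queries against the accesslog table above; the IP address, date, and site are hypothetical values.

-- show all requests from a given IP address
SELECT * FROM accesslog WHERE host = '66.249.67.3';

-- count all requests on a given day
SELECT COUNT(1) FROM accesslog WHERE DATE_TRUNC('day', request_time) = '2011-09-15';

-- show all requests referred from other sites
SELECT request, referrer FROM accesslog
WHERE referrer LIKE '%http%' AND referrer NOT LIKE '%example.com%';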


Bonus: process the Amazon Kinesis stream in discrete, checkpointed iterations. Create an external table on Amazon S3 (under YOUR-S3-BUCKET) to hold the query results, partitioned (split into files on Amazon S3) by iteration.

-- set up a first iteration
-- create the error-count result (404 error codes) under dynamic partition 0

-- set up a second iteration over the data in the Amazon Kinesis stream
-- create the error-count result under dynamic partition 1;
-- if that file is empty, the previous iteration read all remaining stream data
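A sketch of the bonus steps, assuming the checkpointing properties documented for the EMR-Kinesis connector; the metastore table, logical name, output table definition, and S3 prefix are assumptions.

-- enable Kinesis checkpointing so each query run is a bounded iteration
hive> SET kinesis.checkpoint.enabled=true;
hive> SET kinesis.checkpoint.metastore.table.name=MyEMRKinesisTable;
hive> SET kinesis.checkpoint.metastore.hash.key.name=HashKey;
hive> SET kinesis.checkpoint.metastore.range.key.name=RangeKey;
hive> SET kinesis.checkpoint.logical.name=AccessLogAnalysis;

-- external table on Amazon S3 to hold query results, partitioned by iteration
hive> CREATE EXTERNAL TABLE error_count (status STRING, cnt INT)
      PARTITIONED BY (iteration INT)
      LOCATION 's3://YOUR-S3-BUCKET/error_count/';

-- first iteration: count 404 errors seen so far, written under dynamic partition 0
hive> SET kinesis.checkpoint.iteration.no=0;
hive> INSERT OVERWRITE TABLE error_count PARTITION (iteration)
      SELECT status, COUNT(1), 0 AS iteration
      FROM apache_log WHERE status = '404' GROUP BY status;

-- second iteration: only records that arrived after iteration 0;
-- if its output file is empty, the first iteration read all remaining stream data
hive> SET kinesis.checkpoint.iteration.no=1;
hive> INSERT OVERWRITE TABLE error_count PARTITION (iteration)
      SELECT status, COUNT(1), 1 AS iteration
      FROM apache_log WHERE status = '404' GROUP BY status;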


Finally, verify the results in Amazon S3: list the objects in YOUR-S3-BUCKET, copy one of the Gzip output files (YOUR-PREFIX.gz) to your machine, and inspect its contents (sketched below).
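A sketch of that check with the AWS CLI; only the bucket and file placeholders come from the slide.

# list the output written by Hive
aws s3 ls s3://YOUR-S3-BUCKET/ --recursive

# copy one Gzip result file locally and inspect it
aws s3 cp s3://YOUR-S3-BUCKET/YOUR-PREFIX.gz .
zcat YOUR-PREFIX.gz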

Big Data software on AWS Marketplace:

http://amzn.to/1va4KQ6

http://bit.ly/aws-bdt205

Learn from AWS big data experts

blogs.aws.amazon.com/bigdata

http://bit.ly/awsevals
