(BDT205) Your First Big Data Application on AWS | AWS re:Invent 2014


DESCRIPTION

Want to get ramped up on how to use Amazon's big data web services and launch your first big data application on AWS? Join us on our journey as we build a big data application in real-time using Amazon EMR, Amazon Redshift, Amazon Kinesis, Amazon DynamoDB, and Amazon S3. We review architecture design patterns for big data solutions on AWS, and give you access to a take-home lab so that you can rebuild and customize the application yourself.

TRANSCRIPT

November 12th, 2014 | Las Vegas, NV

Matt Yanchyshyn, Principal Solutions Architect

[Architecture overview slide: Collect and Store (Amazon Kinesis, Amazon S3, Amazon DynamoDB, Amazon Glacier, AWS Direct Connect, AWS Import/Export), Process & Analyze (Amazon EMR, Amazon EC2, Amazon Redshift), Automate (AWS Data Pipeline)]

[Demo pipeline slide: a Log4J appender publishes log data into Amazon Kinesis; the EMR-Kinesis connector lets Hive on Amazon EMR read the stream while tracking Amazon Kinesis processing state, and writes results to Amazon S3; Amazon Redshift then loads them with a parallel COPY from Amazon S3]

Launch a 3-instance Hadoop 2.4 cluster with Hive installed, using m3.xlarge instances in YOUR-AWS-REGION, your SSH key YOUR-AWS-SSH-KEY, and an Amazon S3 log bucket YOUR-BUCKET-NAME.
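A sketch of that launch with the aws emr CLI; the cluster name, AMI version, and role options are assumptions, while the instance type, instance count, and placeholders come from the slide.

# launch a 3 x m3.xlarge Hadoop cluster with Hive
# (AMI 3.x releases shipped Hadoop 2.4; newer EMR uses --release-label)
# run 'aws emr create-default-roles' once if you have not already
aws emr create-cluster \
    --name "bdt205-demo" \
    --ami-version 3.3.1 \
    --applications Name=Hive \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --ec2-attributes KeyName=YOUR-AWS-SSH-KEY \
    --use-default-roles \
    --log-uri s3://YOUR-BUCKET-NAME/emr-logs/ \
    --region YOUR-AWS-REGION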

Create an Amazon Kinesis stream to hold incoming data:

aws kinesis create-stream \

--stream-name AccessLogStream \

--shard-count 2

Create a single-node Amazon Redshift cluster, choosing a master password (CHOOSE-A-REDSHIFT-PASSWORD).
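A sketch of the matching call; the cluster identifier, database name, node type, and master user name are assumptions, and only the password placeholder comes from the slide.

# create a single-node Amazon Redshift cluster
aws redshift create-cluster \
    --cluster-identifier demo \
    --db-name demo \
    --node-type dw2.large \
    --cluster-type single-node \
    --master-username master \
    --master-user-password CHOOSE-A-REDSHIFT-PASSWORD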

Log4J producer: configure the Amazon Kinesis Log4J appender with YOUR-IAM-ACCESS-KEY and YOUR-IAM-SECRET-KEY, then start it so that Apache access-log lines are published to AccessLogStream (sketched below).
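A rough sketch of that producer step; the credentials-file layout, jar name, and sample log file are assumptions (the talk's take-home lab provides the actual artifacts), only the IAM-key placeholders and stream name come from the slides.

# give the Kinesis Log4J appender your IAM credentials
# (file name and keys are assumptions; see the take-home lab for the real artifact)
cat > AwsCredentials.properties <<EOF
accessKey=YOUR-IAM-ACCESS-KEY
secretKey=YOUR-IAM-SECRET-KEY
EOF

# replay sample Apache access-log lines into AccessLogStream
java -jar kinesis-log4j-appender-1.0.0.jar access_log_1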

SSH to the EMR master node with YOUR-AWS-SSH-KEY and YOUR-EMR-MASTER-PRIVATE-DNS; use YOUR-EMR-HOSTNAME when tunneling to the Hadoop web UIs (sketched below).
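A sketch of the SSH step; the master-node ports (9026 for the YARN ResourceManager and 9101 for the NameNode on EMR AMI 3.x) are assumptions chosen to match the forwarded URLs shown later.

# log in to the EMR master node
ssh -i YOUR-AWS-SSH-KEY hadoop@YOUR-EMR-MASTER-PRIVATE-DNS

# optional: from your laptop, forward the Hadoop web UIs used later
ssh -i YOUR-AWS-SSH-KEY -N \
    -L 19026:YOUR-EMR-MASTER-PRIVATE-DNS:9026 \
    -L 19101:YOUR-EMR-MASTER-PRIVATE-DNS:9101 \
    hadoop@YOUR-EMR-HOSTNAME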

Start Hive:

hive

In the Hive session, set your IAM credentials (YOUR-IAM-ACCESS-KEY / YOUR-IAM-SECRET-KEY) and YOUR-AWS-REGION, then create a Hive table backed by the Amazon Kinesis stream:

hive> CREATE TABLE ...
      STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
      TBLPROPERTIES("kinesis.stream.name"="AccessLogStream");
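A fuller sketch of that table and the preceding settings, assuming the Apache access-log columns and RegexSerDe from the published Amazon EMR / Kinesis Hive sample; the table name apache_log, the column list, the regex, and the S3 credential property names are assumptions, and the exact property used for YOUR-AWS-REGION is left out rather than guessed.

-- credentials for reading and writing Amazon S3 from Hive (property names assumed)
hive> set fs.s3.awsAccessKeyId=YOUR-IAM-ACCESS-KEY;
hive> set fs.s3.awsSecretAccessKey=YOUR-IAM-SECRET-KEY;

-- table that reads Apache access-log records directly from the Kinesis stream
hive> CREATE TABLE apache_log (
        host STRING, identity STRING, `user` STRING, request_time STRING,
        request STRING, status STRING, size STRING, referrer STRING, agent STRING)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
      WITH SERDEPROPERTIES (
        "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?")
      STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
      TBLPROPERTIES("kinesis.stream.name"="AccessLogStream");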

-- return the first row in the stream

-- return the count of all items in the stream

-- return the count of all rows with a given host
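Sketches of those three queries, in order, against the apache_log table assumed above; the host value is a hypothetical example.

hive> SELECT * FROM apache_log LIMIT 1;

hive> SELECT COUNT(1) FROM apache_log;

hive> SELECT COUNT(1) FROM apache_log WHERE host = '66.249.67.3';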


While the queries run, watch the job in the Hadoop web UIs through your SSH tunnel: http://127.0.0.1:19026/cluster (YARN ResourceManager) and http://127.0.0.1:19101 (HDFS NameNode).

Next, write the query results to Amazon S3 through an external Hive table located at s3://YOUR-S3-BUCKET/emroutput:
• set up Hive's "dynamic partitioning", which splits output files when writing to Amazon S3
• compress output files on Amazon S3 using Gzip
• convert the Apache log timestamp to a UNIX timestamp and split files in Amazon S3 by the hour in the log lines
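A sketch of those steps; the output table name and column list are assumptions carried over from the apache_log sketch above, while the settings follow the slide's comments.

-- an external table on Amazon S3 to hold the results
hive> CREATE EXTERNAL TABLE apache_log_s3 (
        request_time BIGINT, host STRING, request STRING,
        status STRING, referrer STRING, agent STRING)
      PARTITIONED BY (hour INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION 's3://YOUR-S3-BUCKET/emroutput';

-- set up Hive's dynamic partitioning, which splits output files
-- when writing to Amazon S3
hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;

-- compress output files on Amazon S3 using Gzip
hive> SET hive.exec.compress.output=true;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- convert the Apache log timestamp to a UNIX timestamp and
-- split files in Amazon S3 by the hour in the log lines
hive> INSERT OVERWRITE TABLE apache_log_s3 PARTITION (hour)
      SELECT
        unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]') AS request_time,
        host, request, status, referrer, agent,
        hour(from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]'))) AS hour
      FROM apache_log;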


Confirm the partitioned, Gzip-compressed output files under s3://YOUR-S3-BUCKET/emroutput in the Amazon S3 console.

# using the PostgreSQL CLI
Connect to your cluster endpoint, YOUR-REDSHIFT-ENDPOINT, on port 5439.
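A sketch of the connection, assuming the database and user names from the create-cluster sketch above (demo / master); 5439 is Amazon Redshift's default port.

psql -h YOUR-REDSHIFT-ENDPOINT -p 5439 -U master demo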

Or use any JDBC or ODBC SQL client with the PostgreSQL 8.x drivers or native Amazon Redshift support:

• Aginity Workbench for Amazon Redshift

• SQL Workbench/J

Load the Hive output into Amazon Redshift with a parallel COPY from s3://YOUR-S3-BUCKET, using YOUR-IAM-ACCESS-KEY and YOUR-IAM-SECRET-KEY as the COPY credentials (sketched below).
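A sketch of the table and the parallel COPY; the table name, columns, keys, and options are assumptions matching the Hive output sketched earlier, while the bucket and credential placeholders come from the slide.

-- table to hold the access-log records (columns assumed)
CREATE TABLE accesslog (
  request_time TIMESTAMP,
  host VARCHAR(50),
  request VARCHAR(1024),
  status VARCHAR(10),
  referrer VARCHAR(1024),
  agent VARCHAR(1024))
DISTKEY(host)
SORTKEY(request_time);

-- parallel COPY of the Gzip'd, tab-delimited Hive output from Amazon S3
COPY accesslog
FROM 's3://YOUR-S3-BUCKET/emroutput/'
CREDENTIALS 'aws_access_key_id=YOUR-IAM-ACCESS-KEY;aws_secret_access_key=YOUR-IAM-SECRET-KEY'
DELIMITER '\t'
TIMEFORMAT 'epochsecs'
GZIP;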

-- show all requests from a given IP address

-- count all requests on a given day

-- show all requests referred from other sites
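Sketches of those queries against the accesslog table above; the IP address, date, and site are hypothetical values.

-- show all requests from a given IP address
SELECT * FROM accesslog WHERE host = '66.249.67.3';

-- count all requests on a given day
SELECT COUNT(1) FROM accesslog WHERE DATE_TRUNC('day', request_time) = '2011-09-15';

-- show all requests referred from other sites
SELECT request, referrer FROM accesslog
WHERE referrer LIKE '%http%' AND referrer NOT LIKE '%example.com%';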


Bonus: process the Amazon Kinesis stream in discrete, checkpointed iterations. Create an external table on Amazon S3 (under YOUR-S3-BUCKET) to hold the query results, partitioned (split into files on Amazon S3) by iteration.

-- set up a first iteration
-- create the error-count result (404 error codes) under dynamic partition 0

-- set up a second iteration over the data in the Amazon Kinesis stream
-- create the error-count result under dynamic partition 1;
-- if that file is empty, the previous iteration read all remaining stream data
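A sketch of the bonus steps, assuming the checkpointing properties documented for the EMR-Kinesis connector; the metastore table, logical name, output table definition, and S3 prefix are assumptions.

-- enable Kinesis checkpointing so each query run is a bounded iteration
hive> SET kinesis.checkpoint.enabled=true;
hive> SET kinesis.checkpoint.metastore.table.name=MyEMRKinesisTable;
hive> SET kinesis.checkpoint.metastore.hash.key.name=HashKey;
hive> SET kinesis.checkpoint.metastore.range.key.name=RangeKey;
hive> SET kinesis.checkpoint.logical.name=AccessLogAnalysis;

-- external table on Amazon S3 to hold query results, partitioned by iteration
hive> CREATE EXTERNAL TABLE error_count (status STRING, cnt INT)
      PARTITIONED BY (iteration INT)
      LOCATION 's3://YOUR-S3-BUCKET/error_count/';

-- first iteration: count 404 errors seen so far, written under dynamic partition 0
hive> SET kinesis.checkpoint.iteration.no=0;
hive> INSERT OVERWRITE TABLE error_count PARTITION (iteration)
      SELECT status, COUNT(1), 0 AS iteration
      FROM apache_log WHERE status = '404' GROUP BY status;

-- second iteration: only records that arrived after iteration 0;
-- if its output file is empty, the first iteration read all remaining stream data
hive> SET kinesis.checkpoint.iteration.no=1;
hive> INSERT OVERWRITE TABLE error_count PARTITION (iteration)
      SELECT status, COUNT(1), 1 AS iteration
      FROM apache_log WHERE status = '404' GROUP BY status;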


Finally, verify the results in Amazon S3: list the objects in YOUR-S3-BUCKET, copy one of the Gzip output files (YOUR-PREFIX.gz) to your machine, and inspect its contents (sketched below).
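A sketch of that check with the AWS CLI; only the bucket and file placeholders come from the slide.

# list the output written by Hive
aws s3 ls s3://YOUR-S3-BUCKET/ --recursive

# copy one Gzip result file locally and inspect it
aws s3 cp s3://YOUR-S3-BUCKET/YOUR-PREFIX.gz .
zcat YOUR-PREFIX.gz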

Big Data software on AWS Marketplace:

http://amzn.to/1va4KQ6

http://bit.ly/aws-bdt205

Learn from AWS big data experts

blogs.aws.amazon.com/bigdata

http://bit.ly/awsevals
