(BDT308) Using Amazon Elastic MapReduce as Your Scalable Data Warehouse | AWS re:Invent 2014

November 14, 2014 | Las Vegas, NV | Steve McPherson

Upload: amazon-web-services

Post on 02-Jul-2015


Category:

Technology



DESCRIPTION

In this presentation, we will demonstrate how to use Amazon Elastic MapReduce (Amazon EMR) as your scalable data warehouse. Amazon EMR supports clusters with thousands of nodes and is used to access petabyte-scale data warehouses. Amazon EMR is not only fast but also easy to use for rapid development and ad hoc analysis. We will show you how to access large-scale data warehouses with emerging tools such as Hue and Hive, low-latency SQL applications like Presto, and alternative execution engines like Apache Spark. We will also show you how these tools integrate directly with other AWS big data services such as Amazon S3, Amazon DynamoDB, and Amazon Kinesis.

TRANSCRIPT

Page 1: (BDT308) Using Amazon Elastic MapReduce as Your Scalable Data Warehouse | AWS re:Invent 2014

November 14, 2014 | Las Vegas, NV

Steve McPherson

Page 2:

[Slide graphic: AWS Simple Icons sheet of service and resource labels]

Amazon EMR

Page 3:

Big decisions need Big Data

Server

Purchase

Social Media

Page 4:

Extract, Transform, Load to Data Warehouse

Report Generation

Ad Hoc Analysis

Page 5:

Hadoop

Hadoop can help

Page 6:

Difficult, expensive, and time consuming to operate

Hadoop

But Hadoop needs help

Page 7:

Amazon EMR makes Hadoop easy

Page 8:

Extract Transform & Load Data Warehouse Report Generation & Ad Hoc Analysis

Amazon S3

• MapReduce API

• Sqoop

• Spark

• Cascading

• Pig

• MR

• Hive

• Spark

• Cascading

• Pig

• Presto

• Hive

• Spark-SQL

• Lingual

• Parquet

• ORC

• SEQ

• Text

Extract Transform & Load

Data Warehouse Report Generation

Ad Hoc Analysis

write read

Page 9:

Amazon S3 is your Data Lake

Amazon S3

Page 10:

Amazon EMR with Amazon S3 is your Data Warehouse

Hive, Pig, Cascading

Spark

Presto

HBase

Amazon S3

Page 11:

Disaster Recovery built in

Cluster 1 Cluster 2

Cluster 3 Cluster 4

Amazon S3

Availability Zone Availability Zone

Page 12:

Amazon EMR reads from and writes to AWS data sources

Amazon S3 bucket

Amazon Kinesis

Amazon DynamoDB

Amazon S3 bucket

Amazon DynamoDB

Amazon Redshift

Page 13:
Page 14:
Page 15:

Client/Sensor Recording Service

Aggregator/ Sequencer

Continuous Processor

Data Warehouse

Analytics and Reporting

Page 16:

Client/Sensor Recording Service

Aggregator/ Sequencer

Continuous Processor

Data Warehouse

Analytics and Reporting

Kafka

Page 17:

Streaming Data Repository

Amazon Kinesis

Page 18:

Amazon Kinesis + Amazon EMR = Fewer Moving Parts

Client/ Sensor Recording Service

Aggregator/ Sequencer

Continuous Processor for Dashboard

Data Warehouse

Analytics and Reporting

Amazon Kinesis Amazon EMR

Streaming Data Repository

Logging

Data Processing

Log4J

Page 19:

Processing

Input

• User

• Dev

push to

Hive, Pig

Cascading

pull from

Spark

Amazon Kinesis

Amazon DynamoDB

Page 20:

Processing Amazon Kinesis data from Amazon EMR using Hive

private static final KinesisAppender.class

SELECT

FROM

WHERE

InstanceTime | InstanceID | Message

11/13/2014:07:51 InstanceID123 Cannot find resource XYZ
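
The record layout on this slide can be illustrated outside Hive. The following is a minimal Python sketch, not the presenter's code: it splits the sample row into the three columns shown (InstanceTime, InstanceID, Message) and filters on the message text, roughly what a SELECT ... WHERE over the stream records would do. The whitespace-delimited layout and the names `LogRecord`, `parse_record`, and `select_where` are assumptions made for illustration.

```python
from typing import Callable, Iterable, List, NamedTuple


class LogRecord(NamedTuple):
    """One row with the three columns named on the slide."""
    instance_time: str
    instance_id: str
    message: str


def parse_record(line: str) -> LogRecord:
    # Assumption: the first two whitespace-delimited tokens are the time and
    # the instance ID; everything after them is the free-text message.
    instance_time, instance_id, message = line.strip().split(None, 2)
    return LogRecord(instance_time, instance_id, message)


def select_where(lines: Iterable[str],
                 predicate: Callable[[LogRecord], bool]) -> List[LogRecord]:
    # Analogue of SELECT * FROM <records> WHERE <predicate>.
    return [rec for rec in map(parse_record, lines) if predicate(rec)]


# The sample row from the slide.
sample = "11/13/2014:07:51 InstanceID123 Cannot find resource XYZ"
record = parse_record(sample)
matches = select_where([sample], lambda r: "Cannot find" in r.message)
```

Keeping the message as the unsplit remainder of the line means free text containing spaces stays in one column.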

Page 21:
Page 22:
Page 23:
Page 24:

Amazon S3

Page 25:

Long-Running Clusters Scheduled Jobs

Page 26:

Amazon EMR integrates with your tools

Page 27:

Recent Integrations

Page 28:

Recent Integrations

Page 29:

http://emr.looker.com

Page 30:

Flexible and cost effective – Burst when you need to

On Demand pricing: 672 × 0.113 = 75.936

Reserved Instance: 672 × 0.022 = 14.784

Spot Instance (accepted): 672 × 0.015 = 10.08

481

Bill: $10.08
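
The figures on this slide are consistent with multiplying a shared quantity of 672 by three different rates. The sketch below reproduces that arithmetic in Python, using `decimal` to avoid floating-point rounding. Treating 672 as hours and the rates as dollars per hour is an assumption (the slide text does not label units), the pairing of each rate with its label is inferred from the products, and the names `QUANTITY`, `RATES`, and `cost` are illustrative only.

```python
from decimal import Decimal

QUANTITY = Decimal("672")  # shared multiplier on the slide (assumed: hours)

# Rates inferred by matching each product on the slide to its label.
RATES = {
    "On Demand": Decimal("0.113"),
    "Reserved Instance": Decimal("0.022"),
    "Spot Instance": Decimal("0.015"),
}


def cost(rate: Decimal, quantity: Decimal = QUANTITY) -> Decimal:
    """Total for one pricing option: rate times quantity."""
    return rate * quantity


totals = {name: cost(rate) for name, rate in RATES.items()}
cheapest = min(totals, key=totals.get)

for name, total in totals.items():
    print(f"{name}: {total}")
print(f"Cheapest option: {cheapest}")
```

Under these assumptions the Spot Instance total equals the $10.08 bill shown on the slide.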

Page 31:

Setup for security

Page 32:

Flexible, Reliable, Scalable, Secure, and Low-Cost

Data Warehouse

Page 33:

AWS Big Data Blog

• R

• Amazon Kinesis

• Visualization with Tableau

• Bootstrap actions and steps

http://blogs.aws.amazon.com/bigdata/

Page 34:

Get started today

http://aws.amazon.com/elasticmapreduce/

Page 35:

http://bit.ly/awsevals