(BDT308) Using Amazon Elastic MapReduce as Your Scalable Data Warehouse | AWS re:Invent 2014
DESCRIPTION
In this presentation, we will demonstrate how to use Amazon Elastic MapReduce as your scalable data warehouse. Amazon EMR supports clusters with thousands of nodes and is used to access petabyte-scale data warehouses. Amazon EMR is not only fast, but also easy to use for rapid development and ad hoc analysis. We will show you how to access large-scale data warehouses with emerging tools such as Hue and Hive, low-latency SQL applications like Presto, and alternative execution engines like Apache Spark. We will also show you how these tools integrate directly with other AWS big data services such as Amazon S3, Amazon DynamoDB, and Amazon Kinesis.
TRANSCRIPT
November 14, 2014 | Las Vegas, NV
Steve McPherson
Amazon EMR
Big decisions need Big Data
Server
Purchase
Social Media
Extract, Transform, Load to Data Warehouse
Report Generation
Ad Hoc Analysis
Hadoop
Hadoop can help
Difficult, expensive, and time-consuming to operate
Hadoop
But Hadoop needs help
Amazon EMR makes Hadoop easy
Extract, Transform & Load
Data Warehouse
Report Generation & Ad Hoc Analysis
Amazon S3
• MapReduce API
• Sqoop
• Spark
• Cascading
• Pig
• MR
• Hive
• Spark
• Cascading
• Pig
• Presto
• Hive
• Spark-SQL
• Lingual
• Parquet
• ORC
• SEQ
• Text
Extract Transform & Load
Data Warehouse Report Generation
Ad Hoc Analysis
write read
Amazon S3 is your Data Lake
Amazon S3
Amazon EMR with Amazon S3 is your Data Warehouse
Hive, Pig,
Cascading
Spark
Presto HBase
Amazon S3
Disaster Recovery built in
Cluster 1 Cluster 2
Cluster 3 Cluster 4
Amazon S3
Availability Zone Availability Zone
Amazon EMR reads from and writes to AWS data sources
Amazon S3
bucket
Amazon
Kinesis
Amazon
DynamoDB
Amazon S3
bucket
Amazon
DynamoDB
Amazon
Redshift
Client/Sensor Recording Service
Aggregator/Sequencer
Continuous Processor
Data Warehouse
Analytics and Reporting
Client/Sensor Recording Service
Aggregator/Sequencer
Continuous Processor
Data Warehouse
Analytics and Reporting
Kafka
Streaming Data Repository
Amazon Kinesis
Amazon Kinesis + Amazon EMR = Fewer Moving Parts
Client/Sensor Recording Service
Aggregator/Sequencer
Continuous Processor for Dashboard
Data Warehouse
Analytics and Reporting
Amazon Kinesis Amazon EMR
Streaming Data Repository
Logging Data Processing
Log4J
Processing
Input
• User
• Dev
push to
Hive, Pig
Cascading
pull from
Spark
Amazon Kinesis
Amazon DynamoDB
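The "push to" side of the logging pipeline above can be sketched in Python. This snippet only builds the JSON body that the Amazon Kinesis PutRecord action expects (StreamName, PartitionKey, and base64-encoded Data); it does not sign or send the request. The stream name `app-logs`, the function name, and the choice of instance ID as partition key are assumptions for illustration, not details from the talk. The sample values are borrowed from the log record shown elsewhere in the deck.

```python
import base64
import json


def build_put_record_body(stream_name, instance_id, message):
    """Build the JSON body for a Kinesis PutRecord request (not sent here)."""
    # Kinesis carries an opaque blob; in the JSON API it is base64-encoded.
    data = base64.b64encode(message.encode("utf-8")).decode("ascii")
    return json.dumps(
        {
            "StreamName": stream_name,  # hypothetical stream name
            # Assumption: partition by instance ID so that records from
            # one instance share a partition key.
            "PartitionKey": instance_id,
            "Data": data,
        }
    )


body = build_put_record_body(
    "app-logs", "InstanceID123", "Cannot find resource XYZ"
)
print(body)
```

In practice an SDK or a Log4J appender would handle signing and delivery; the sketch is only meant to show what a single pushed log record looks like on the wire.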
Processing Amazon Kinesis data from Amazon EMR using Hive
private static final KinesisAppender.class
SELECT
FROM
WHERE
InstanceTime | InstanceID | Message
11/13/2014:07:51 InstanceID123 Cannot find resource XYZ
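The SELECT / FROM / WHERE query on the slide is elided, so the following Python sketch only mirrors its general shape: split a record into the three columns listed above and keep the rows that match a condition. The field layout is inferred from the single sample row, and the function names and the filter condition are hypothetical.

```python
from collections import namedtuple

LogRecord = namedtuple("LogRecord", ["instance_time", "instance_id", "message"])


def parse_record(line):
    """Split one log line into the three columns shown on the slide."""
    # Assumption: the first two fields contain no spaces and the message
    # is everything after them.
    instance_time, instance_id, message = line.split(None, 2)
    return LogRecord(instance_time, instance_id, message)


def select_where(lines, predicate):
    """Rough analogue of SELECT * FROM lines WHERE predicate(record)."""
    return [rec for rec in map(parse_record, lines) if predicate(rec)]


sample = ["11/13/2014:07:51 InstanceID123 Cannot find resource XYZ"]
errors = select_where(sample, lambda rec: "Cannot find" in rec.message)
for rec in errors:
    print(rec.instance_time, rec.instance_id, rec.message)
```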
Amazon S3
Long-Running Clusters Scheduled Jobs
Amazon EMR integrates with your tools
Recent Integrations
http://emr.looker.com
Flexible and cost-effective: Burst when you need to
On Demand pricing: 672 x 0.113 = 75.936
Reserved Instance: 672 x 0.022 = 14.784
Spot Instance (accepted): 672 x 0.015 = 10.08
481
Bill: $10.08
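The totals on this slide follow from multiplying 672 by each rate. A short Python check, assuming 672 is a count of instance-hours (for example 28 days × 24 hours) and the rates are US dollars per hour; neither unit is spelled out in the transcript, and the 481 figure is left out because its meaning is not clear:

```python
from decimal import Decimal

HOURS = Decimal("672")  # assumption: instance-hours

# Rates as they appear on the slide; pairing each rate with its
# purchase option is inferred from the totals.
RATES = {
    "On Demand": Decimal("0.113"),
    "Reserved Instance": Decimal("0.022"),
    "Spot Instance": Decimal("0.015"),
}

# Decimal avoids binary floating-point rounding in the products.
costs = {option: HOURS * rate for option, rate in RATES.items()}

for option, cost in costs.items():
    print(f"{option}: {cost}")
```

Under these assumptions the Spot total matches the final bill of $10.08 shown on the slide.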
Setup for security
Flexible, Reliable, Scalable, Secure, and Low-Cost Data Warehouse
AWS Big Data Blog
• R
• Amazon Kinesis
• Visualization with Tableau
• Bootstrap actions and steps
http://blogs.aws.amazon.com/bigdata/
Get started today
http://aws.amazon.com/elasticmapreduce/
http://bit.ly/awsevals