February 2016 Webinar Series - Architectural Patterns for Big Data on AWS
TRANSCRIPT
![Page 1: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/1.jpg)
Siva Raghupathy, Sr. Manager, Solutions Architecture, AWS
February, 2016
Architectural Patterns for Big Data on AWS
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
![Page 2: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/2.jpg)
Agenda
Big data challenges
How to simplify big data processing
What technologies should you use?
• Why?
• How?
Reference architecture
Design patterns
![Page 3: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/3.jpg)
Ever Increasing Big Data
![Page 4: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/4.jpg)
Big Data Evolution
Batch → Report
Real-time → Alerts
Prediction → Forecast
![Page 5: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/5.jpg)
Plethora of Tools
Amazon Glacier, S3, DynamoDB, RDS, EMR, Amazon Redshift, Data Pipeline, Amazon Kinesis, CloudSearch, Kinesis-enabled app, Lambda, ML, SQS, ElastiCache, DynamoDB Streams
![Page 6: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/6.jpg)
Is there a reference architecture?
What tools should I use?
How? Why?
![Page 7: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/7.jpg)
Architectural Principles
Decoupled “data bus”
• Data → Store → Process → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Use Lambda architecture ideas
• Immutable (append-only) log, batch/speed/serving layer
Leverage AWS managed services
• No/low admin
Big data ≠ big cost
![Page 8: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/8.jpg)
Simplify Big Data Processing
data → ingest / collect → store → process / analyze → consume / visualize → answers
Trade-offs: time to answer (latency), throughput, cost
![Page 9: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/9.jpg)
Ingest / Collect
![Page 10: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/10.jpg)
Types of Data
Transactional
• Database reads & writes (OLTP)
• Cache
Search
• Logs
• Streams
File
• Log files (/var/log)
• Log collectors & frameworks
Stream
• Log records
• Sensors & IoT data
[Diagram: Collect → Store. Mobile apps (iOS, Android), web apps, logging (Logstash), IoT, and applications emit transactional, search, file, and stream data, which is collected into database, search, file storage, and stream storage.]
![Page 11: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/11.jpg)
Store
![Page 12: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/12.jpg)
Stream Storage
[Diagram: Collect → Store, with stream storage in focus. Store categories shown: search, SQL, NoSQL, cache, stream storage, file storage. Services shown: Amazon RDS, Amazon DynamoDB, Amazon ES, Amazon S3, Apache Kafka, Amazon Glacier, Amazon Kinesis, Amazon ElastiCache.]
![Page 13: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/13.jpg)
Stream Storage Options
AWS managed services
• Amazon Kinesis → streams
• DynamoDB Streams → table + streams
• Amazon SQS → queue
• Amazon SNS → pub/sub
Unmanaged
• Apache Kafka → stream
![Page 14: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/14.jpg)
Why Stream Storage?
Decouple producers & consumers
Persistent buffer
Collect multiple streams
Preserve client ordering
Streaming MapReduce
Parallel consumption
[Diagram: Producers 1 to N write ordered records (1, 2, 3, 4) keyed Red, Green, Blue, and Violet into a DynamoDB stream, Kinesis stream, or Kafka topic with two shards (partitions). Consumer 1 reports Count of Red = 4 and Count of Violet = 4; Consumer 2 reports Count of Blue = 4 and Count of Green = 4.]
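The per-key counts in the shard diagram can be illustrated with a small in-memory sketch. It is not the Kinesis, DynamoDB Streams, or Kafka client API: the hash-based `shard_for` routing, the two-shard setup, and all function names are assumptions made for illustration. What it shows is that routing by key keeps each key's records in order on a single shard, so one consumer per shard can count in parallel.

```python
import hashlib
from collections import Counter, defaultdict

NUM_SHARDS = 2  # assumption: two shards/partitions, as drawn on the slide


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a partition key to a shard with a stable hash (illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def produce(records, num_shards: int = NUM_SHARDS):
    """Append (key, seq) records to per-shard logs, preserving arrival order."""
    shards = defaultdict(list)
    for key, seq in records:
        shards[shard_for(key, num_shards)].append((key, seq))
    return shards


def consume(shard_log):
    """One consumer per shard: count records per key (the 'reduce' step)."""
    return Counter(key for key, _ in shard_log)


# Four producers' worth of records: sequence numbers 1..4 for each colour key.
records = [(color, seq) for seq in range(1, 5)
           for color in ("Red", "Green", "Blue", "Violet")]
shards = produce(records)
counts_per_shard = {sid: consume(log) for sid, log in shards.items()}
total = sum(counts_per_shard.values(), Counter())
```

Because every record with the same key lands on the same shard, the per-shard counts can simply be added to get the global answer.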
![Page 15: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/15.jpg)
What About Queues & Pub/Sub?
• Decouple producers & consumers/subscribers
• Persistent buffer
• Collect multiple streams
• No client ordering
• No parallel consumption for Amazon SQS
• Amazon SNS can route to multiple queues or AWS Lambda (λ) functions
• No streaming MapReduce
[Diagram: producers → Amazon SQS queue → consumers; producers → Amazon SNS topic → Amazon SQS queue, AWS Lambda function, or subscriber.]
![Page 16: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/16.jpg)
Which stream storage should I use?

| | Amazon Kinesis | DynamoDB Streams | Amazon SQS / Amazon SNS | Kafka |
|---|---|---|---|---|
| Managed | Yes | Yes | Yes | No |
| Ordering | Yes | Yes | No | Yes |
| Delivery | at-least-once | exactly-once | at-least-once | at-least-once |
| Lifetime | 7 days | 24 hours | 14 days | Configurable |
| Replication | 3 AZ | 3 AZ | 3 AZ | Configurable |
| Throughput | No limit | No limit | No limit | ~ Nodes |
| Parallel clients | Yes | Yes | No (SQS) | Yes |
| MapReduce | Yes | Yes | No | Yes |
| Record size | 1 MB | 400 KB | 256 KB | Configurable |
| Cost | Low | Higher (table cost) | Low-Medium | Low (+admin) |
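As a rough illustration of how the comparison above might drive a choice, the sketch below encodes three of its rows (managed, ordering, record size) and filters on stated requirements. The values are copied from this slide and reflect the services as of the talk; the `shortlist` function and its parameters are hypothetical helpers, not an AWS API.

```python
# Rows taken from the slide's comparison; None means "configurable".
STREAM_STORES = {
    "Amazon Kinesis":   {"managed": True,  "ordering": True,  "max_record_kb": 1024},
    "DynamoDB Streams": {"managed": True,  "ordering": True,  "max_record_kb": 400},
    "Amazon SQS / SNS": {"managed": True,  "ordering": False, "max_record_kb": 256},
    "Kafka":            {"managed": False, "ordering": True,  "max_record_kb": None},
}


def shortlist(record_kb, need_ordering=False, need_managed=False):
    """Return the stream stores that satisfy the given requirements."""
    result = []
    for name, props in STREAM_STORES.items():
        if need_ordering and not props["ordering"]:
            continue
        if need_managed and not props["managed"]:
            continue
        limit = props["max_record_kb"]
        if limit is not None and record_kb > limit:
            continue
        result.append(name)
    return result


ordered_managed = shortlist(500, need_ordering=True, need_managed=True)
large_records = shortlist(2000, need_ordering=True)
```

A real decision would also weigh lifetime, delivery semantics, and cost, which this sketch leaves out.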
![Page 17: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/17.jpg)
File Storage
[Diagram: the Collect → Store diagram again, this time with file storage in focus.]
![Page 18: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/18.jpg)
Why is Amazon S3 Good for Big Data?
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Hadoop clusters & Amazon EC2 Spot instances
• Multiple distinct (Spark, Hive, Presto) clusters can use the same data
• Unlimited number of objects
• Very high bandwidth – no aggregate throughput limit
• Highly available – can tolerate AZ failure
• Designed for 99.999999999% durability
• Tiered storage (Standard, IA, Amazon Glacier) via life-cycle policy
• Secure – SSL, client/server-side encryption at rest
• Low cost
![Page 19: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/19.jpg)
What about HDFS & Amazon Glacier?
• Use HDFS for very frequently accessed (hot) data
• Use Amazon S3 Standard for frequently accessed data
• Use Amazon S3 Standard – IA for infrequently accessed data
• Use Amazon Glacier for archiving cold data
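The Standard → Standard-IA → Glacier tiering can be expressed as an S3 lifecycle rule. The sketch below only builds the configuration dictionary; its shape follows what boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`. The 30- and 90-day thresholds, the `logs/` prefix, and the rule ID are illustrative assumptions, not values from the talk.

```python
def tiering_lifecycle(prefix: str, rule_id: str,
                      ia_after_days: int = 30,
                      glacier_after_days: int = 90) -> dict:
    """Build a lifecycle configuration that moves objects to colder tiers."""
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},   # only objects under this prefix
                "Status": "Enabled",
                "Transitions": [
                    # infrequently accessed data -> S3 Standard-IA
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    # cold data -> Amazon Glacier
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = tiering_lifecycle("logs/", "tier-logs")
```

With credentials in place, a dictionary like `config` could be passed as `LifecycleConfiguration` alongside a `Bucket` name; that call is omitted here so the sketch runs without AWS access.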
![Page 20: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/20.jpg)
Database + Search Tier
[Diagram: the Collect → Store diagram again, this time with the database and search tier in focus.]
![Page 21: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/21.jpg)
Database + Search Tier Anti-pattern
[Diagram: applications → a single RDBMS serving as the entire database + search tier.]
![Page 22: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/22.jpg)
Best Practice - Use the Right Tool for the Job
Data tier
Search: Amazon Elasticsearch Service, Amazon CloudSearch
Cache: Redis, Memcached
SQL: Amazon Aurora, MySQL, PostgreSQL, Oracle, SQL Server
NoSQL: Cassandra, Amazon DynamoDB, HBase, MongoDB
[Diagram: applications use each of these stores within the database + search tier.]
![Page 23: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/23.jpg)
Materialized Views
![Page 24: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/24.jpg)
What Data Store Should I Use?
Data structure → Fixed schema, JSON, key-value
Access patterns → Store data in the format you will access it
Data / access characteristics → Hot, warm, cold
Cost → Right cost
![Page 25: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/25.jpg)
Data Structure and Access Patterns

| Access patterns | What to use? |
|---|---|
| Put/Get (key, value) | Cache, NoSQL |
| Simple relationships → 1:N, M:N | NoSQL |
| Cross table joins, transaction, SQL | SQL |
| Faceting, search | Search |

| Data structure | What to use? |
|---|---|
| Fixed schema | SQL, NoSQL |
| Schema-free (JSON) | NoSQL, Search |
| (Key, value) | Cache, NoSQL |
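The two lookups on this slide can be combined mechanically. In the sketch below, the dictionary keys are hypothetical labels for the slide's rows, and intersecting the two recommendations is one possible reading of the slide (an assumption), not a rule stated in the talk.

```python
# Keys are hypothetical labels for the rows on the slide.
ACCESS_PATTERN_TO_STORE = {
    "put_get_key_value": ("Cache", "NoSQL"),
    "simple_relationships": ("NoSQL",),
    "joins_transactions_sql": ("SQL",),
    "faceting_search": ("Search",),
}

DATA_STRUCTURE_TO_STORE = {
    "fixed_schema": ("SQL", "NoSQL"),
    "schema_free_json": ("NoSQL", "Search"),
    "key_value": ("Cache", "NoSQL"),
}


def candidate_stores(access_pattern: str, data_structure: str) -> list:
    """Return store categories recommended by both lookups, in slide order."""
    by_access = ACCESS_PATTERN_TO_STORE[access_pattern]
    by_structure = DATA_STRUCTURE_TO_STORE[data_structure]
    return [store for store in by_access if store in by_structure]


kv_choice = candidate_stores("put_get_key_value", "key_value")
sql_choice = candidate_stores("joins_transactions_sql", "fixed_schema")
```

An empty result (for example, faceted search over a fixed schema) signals that the two criteria pull in different directions and the data may belong in more than one store.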
![Page 26: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/26.jpg)
What Is the Temperature of Your Data / Access?
![Page 27: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/27.jpg)
Data / Access Characteristics: Hot, Warm, Cold

| | Hot | Warm | Cold |
|---|---|---|---|
| Volume | MB–GB | GB–TB | PB |
| Item size | B–KB | KB–MB | KB–TB |
| Latency | ms | ms, sec | min, hrs |
| Durability | Low–High | High | Very High |
| Request rate | Very High | High | Low |
| Cost/GB | $$-$ | $-¢¢ | ¢ |
![Page 28: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/28.jpg)
[Chart: data stores (cache, NoSQL, SQL, search, HDFS, S3, Glacier) plotted by structure (low to high) against data temperature (hot, warm, cold). From hot to cold: request rate goes high → low, cost/GB high → low, latency low → high, data volume low → high.]
![Page 29: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/29.jpg)
What Data Store Should I Use?

| | Amazon ElastiCache | Amazon DynamoDB | Amazon Aurora | Amazon Elasticsearch | Amazon EMR (HDFS) | Amazon S3 | Amazon Glacier |
|---|---|---|---|---|---|---|---|
| Average latency | ms | ms | ms, sec | ms, sec | sec, min, hrs | ms, sec, min (~ size) | hrs |
| Data volume | GB | GB–TBs (no limit) | GB–TB (64 TB max) | GB–TB | GB–PB (~ nodes) | MB–PB (no limit) | GB–PB (no limit) |
| Item size | B–KB | KB (400 KB max) | KB (64 KB) | KB (1 MB max) | MB–GB | KB–GB (5 TB max) | GB (40 TB max) |
| Request rate | High – Very High | Very High (no limit) | High | High | Low – Very High | Low – Very High (no limit) | Very Low |
| Storage cost GB/month | $$ | ¢¢ | ¢¢ | ¢¢ | ¢ | ¢ | ¢/10 |
| Durability | Low – Moderate | Very High | Very High | High | High | Very High | Very High |

Stores are ordered left to right from hot data through warm data to cold data.
![Page 30: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/30.jpg)
Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB?
“I’m currently scoping out a project that will greatly increase my team’s use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…”
| Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month |
|---|---|---|---|
| 300 | 2,048 | 1,483 | 777,600,000 |
![Page 31: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/31.jpg)
Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB?
https://calculator.s3.amazonaws.com/index.html
Simple Monthly Calculator
![Page 32: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/32.jpg)
| Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month |
|---|---|---|---|
| 300 | 2,048 | 1,483 | 777,600,000 |

Amazon S3 or Amazon DynamoDB?
![Page 33: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/33.jpg)
| | Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month |
|---|---|---|---|---|
| Scenario 1 | 300 | 2,048 | 1,483 | 777,600,000 |
| Scenario 2 | 300 | 32,768 | 23,730 | 777,600,000 |

[Slide callouts: use Amazon S3; use Amazon DynamoDB]
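The figures for Scenario 1 and Scenario 2 can be reproduced from the request rate and object size alone. The sketch below assumes a 30-day month and binary gigabytes (1024³ bytes); those two assumptions are inferred because they match the slide's numbers, and the function name is hypothetical. It covers only the sizing arithmetic, not the pricing that the Simple Monthly Calculator adds on top.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30        # assumption: 777,600,000 objects implies a 30-day month
BYTES_PER_GB = 1024 ** 3   # assumption: binary gigabytes reproduce the slide totals


def monthly_footprint(writes_per_sec: int, object_bytes: int):
    """Return (objects per month, total size in GB per month)."""
    objects = writes_per_sec * SECONDS_PER_DAY * DAYS_PER_MONTH
    total_gb = objects * object_bytes / BYTES_PER_GB
    return objects, total_gb


scenario_1 = monthly_footprint(300, 2_048)    # 2 KB objects
scenario_2 = monthly_footprint(300, 32_768)   # 32 KB objects
```

Both scenarios write the same number of objects; only the object size, and therefore the stored volume, differs by a factor of 16.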
![Page 34: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/34.jpg)
Process / Analyze
![Page 35: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/35.jpg)
Analyze
[Diagram: Collect → Store → Analyze. The store tier from the earlier diagram is annotated by data temperature (hot, warm, cold). Analyze-tier categories: stream processing, batch, interactive, ML. Tools shown: Amazon Redshift, Impala, Pig, Amazon ML, Amazon Kinesis, AWS Lambda, Amazon Elastic MapReduce, streaming.]
![Page 36: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/36.jpg)
Process / Analyze
Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making.
Examples
• Interactive dashboards → Interactive analytics
• Daily/weekly/monthly reports → Batch analytics
• Billing/fraud alerts, 1 minute metrics → Real-time analytics
• Sentiment analysis, prediction models → Machine learning
![Page 37: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/37.jpg)
Interactive Analytics
Takes a large amount of (warm/cold) data
Takes seconds to get answers back
Example: Self-service dashboards
![Page 38: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/38.jpg)
Batch Analytics
Takes a large amount of (warm/cold) data
Takes minutes or hours to get answers back
Example: Generating daily, weekly, or monthly reports
![Page 39: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/39.jpg)
Real-Time Analytics
Takes a small amount of hot data and asks questions
Takes a short amount of time (milliseconds or seconds) to get your answer back
Real-time (event)
• Real-time response to events in data streams
• Example: Billing/fraud alerts
Near real-time (micro-batch)
• Near real-time operations on small batches of events in data streams
• Example: 1 minute metrics
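The "1 minute metrics" example is a micro-batch over a tumbling window. The sketch below is a minimal in-memory version, assuming events arrive as hypothetical `(epoch_seconds, metric_name)` pairs; it does not use Spark Streaming or the KCL, which would add checkpointing and late-data handling.

```python
from collections import Counter

WINDOW_SECONDS = 60  # one-minute tumbling window


def one_minute_counts(events):
    """Count events per (window start, metric name).

    `events` is an iterable of (epoch_seconds, metric_name) pairs.
    """
    counts = Counter()
    for ts, name in events:
        # Floor the timestamp to the start of its one-minute window.
        window_start = int(ts) - int(ts) % WINDOW_SECONDS
        counts[(window_start, name)] += 1
    return counts


events = [(0, "login"), (59, "login"), (60, "login"), (61, "error")]
metrics = one_minute_counts(events)
```

A real-time (event) handler, by contrast, would act on each record as it arrives instead of waiting for the window to close.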
![Page 40: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/40.jpg)
Predictions via Machine Learning
ML gives computers the ability to learn without being explicitly programmed
Machine learning algorithms:
Supervised learning ← “teach” program
- Classification ← Is this transaction fraud? (Yes/No)
- Regression ← Customer life-time value?
Unsupervised learning ← let it learn by itself
- Clustering ← Market segmentation
![Page 41: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/41.jpg)
Analysis Tools and Frameworks
Machine Learning
• Mahout, Spark ML, Amazon ML
Interactive Analytics
• Amazon Redshift, Presto, Impala, Spark
Batch Processing
• MapReduce, Hive, Pig, Spark
Stream Processing
• Micro-batch: Spark Streaming, KCL, Hive, Pig
• Real-time: Storm, AWS Lambda, KCL
[Diagram: the Analyze tier, grouped into stream processing, batch, interactive, and ML, showing Amazon Redshift, Impala, Pig, Amazon Machine Learning, Amazon Kinesis, AWS Lambda, Amazon Elastic MapReduce, and streaming.]
![Page 42: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/42.jpg)
What Stream Processing Technology Should I Use?

| | Spark Streaming | Apache Storm | Amazon Kinesis Client Library | AWS Lambda | Amazon EMR (Hive, Pig) |
|---|---|---|---|---|---|
| Scale / throughput | ~ Nodes | ~ Nodes | ~ Nodes | Automatic | ~ Nodes |
| Batch or real-time | Real-time | Real-time | Real-time | Real-time | Batch |
| Manageability | Yes (Amazon EMR) | Do it yourself | Amazon EC2 + Auto Scaling | AWS managed | Yes (Amazon EMR) |
| Fault tolerance | Single AZ | Configurable | Multi-AZ | Multi-AZ | Single AZ |
| Programming languages | Java, Python, Scala | Any language via Thrift | Java, via MultiLangDaemon (.NET, Python, Ruby, Node.js) | Node.js, Java, Python | Hive, Pig, streaming languages |

Query latency scale: Low → High (low is better)
![Page 43: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/43.jpg)
What Data Processing Technology Should I Use?

| | Amazon Redshift | Impala | Presto | Spark | Hive |
|---|---|---|---|---|---|
| Query latency | Low | Low | Low | Low | Medium (Tez) – High (MapReduce) |
| Durability | High | High | High | High | High |
| Data volume | 1.6 PB max | ~ Nodes | ~ Nodes | ~ Nodes | ~ Nodes |
| Managed | Yes | Yes (EMR) | Yes (EMR) | Yes (EMR) | Yes (EMR) |
| Storage | Native | HDFS / S3A* | HDFS / S3 | HDFS / S3 | HDFS / S3 |
| SQL compatibility | High | Medium | High | Low (SparkSQL) | Medium (HQL) |

Query latency scale: Low → High (low is better)
![Page 44: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/44.jpg)
What about ETL?
[Diagram: ETL sits between Store and Analyze.]
https://aws.amazon.com/big-data/partner-solutions/
![Page 45: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/45.jpg)
Consume / Visualize
![Page 46: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/46.jpg)
Collect → Store → Analyze → Consume
[Diagram: the Collect → Store → Analyze diagram extended with ETL and a Consume tier: analysis & visualization (Amazon QuickSight), notebooks, IDE, predictions, and apps & APIs. Paths into the Consume tier are labeled fast or slow.]
![Page 47: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/47.jpg)
Consume
• Predictions
• Analysis and visualization (Amazon QuickSight)
• Notebooks
• IDE
• Applications & API
Audiences: business users; data scientists and developers
[Diagram: Store → Analyze → Consume, with ETL.]
![Page 48: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/48.jpg)
Putting It All Together
![Page 49: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/49.jpg)
Reference Architecture
[Diagram: Collect → Store → Analyze → Consume, combining the earlier slides. Collect: mobile apps (iOS, Android), web apps, logging (Logstash), IoT, and applications emitting transactional, search, file, and stream data. Store: search, SQL, NoSQL, cache, stream storage, and file storage (Amazon ES, Amazon RDS, Amazon DynamoDB, Amazon ElastiCache, Amazon Kinesis, Apache Kafka, Amazon S3, Amazon Glacier), annotated hot, warm, or cold. Analyze: stream processing, batch, interactive, and ML (Amazon Kinesis, AWS Lambda, streaming, Amazon Elastic MapReduce, Pig, Impala, Amazon Redshift, Amazon ML), with ETL. Consume: analysis & visualization (Amazon QuickSight), notebooks, IDE, predictions, apps & APIs. Paths are labeled fast or slow.]
![Page 50: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/50.jpg)
Design Patterns
![Page 51: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/51.jpg)
Multi-Stage Decoupled “Data Bus”
Multiple stages
Storage decoupled from processing
Data → Store → Process → Store → Process → Answers
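A minimal in-memory sketch of the multi-stage "data bus" follows. The `Store` class stands in for any durable store (a stream, Amazon S3, a table) and every name in it is hypothetical; the point is only that each processing stage reads from one store and appends to another, so stages can be swapped or re-run independently.

```python
class Store:
    """Append-only in-memory stand-in for a durable store."""

    def __init__(self, name):
        self.name = name
        self._records = []

    def append(self, record):
        self._records.append(record)

    def read(self, offset=0):
        # Return a copy so readers cannot mutate the log.
        return list(self._records[offset:])


def run_stage(source, process, sink):
    """Read everything from `source`, transform it, append results to `sink`."""
    for out in process(source.read()):
        sink.append(out)


def parse(lines):
    """Stage 1: turn raw 'user,amount' lines into dictionaries."""
    for line in lines:
        user, amount = line.split(",")
        yield {"user": user, "amount": float(amount)}


def total_by_user(rows):
    """Stage 2: aggregate amounts per user."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    yield from sorted(totals.items())


raw, parsed, answers = Store("raw"), Store("parsed"), Store("answers")
for line in ["a,1.5", "b,2", "a,0.5"]:
    raw.append(line)

run_stage(raw, parse, parsed)              # Store -> Process -> Store
run_stage(parsed, total_by_user, answers)  # Store -> Process -> Answers
```

Because the raw store is never modified, a second processing application could read the same records and write to a different sink without affecting the first.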
![Page 52: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/52.jpg)
Multiple Processing Applications (or Connectors) Can Read from or Write to Multiple Data Stores
[Diagram: Data → Amazon Kinesis (store) → AWS Lambda and the Amazon Kinesis S3 Connector (process) → Amazon DynamoDB and Amazon S3 (store).]
![Page 53: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/53.jpg)
Processing Frameworks (KCL, Storm, Hive, Spark, etc.) Could Read from Multiple Data Stores
[Diagram: Data → Amazon Kinesis (store) → AWS Lambda and the Amazon Kinesis S3 Connector (process) → Amazon DynamoDB and Amazon S3 (store). Storm, Hive, and Spark read from these stores and produce answers.]
![Page 54: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/54.jpg)
Data Temperature vs. Processing Latency
[Chart: x-axis is data temperature (hot → cold); y-axis is processing latency (low → high); data flows in and answers flow out. Stores plotted: Amazon Kinesis, Apache Kafka, Amazon DynamoDB, Amazon S3, Amazon EMR (HDFS), Amazon Redshift (native). Processing regions: real-time, interactive, batch. Tools plotted: Spark Streaming, Apache Storm, AWS Lambda, KCL, Amazon Redshift, Spark, Impala, Presto, Hive.]
![Page 55: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/55.jpg)
Real-time Analytics
[Diagram: Producer → stream store (Apache Kafka, Amazon Kinesis, DynamoDB Streams) → process (KCL, AWS Lambda, Spark Streaming, Apache Storm) → outputs: notifications and alerts (Amazon SNS), real-time prediction (Amazon ML), and app state and KPIs (Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, Amazon ES).]
![Page 56: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/56.jpg)
Interactive & Batch Analytics
[Diagram: Producer → Amazon S3 (store) → process → consume. Batch: Amazon EMR (Hive, Pig, Spark). Interactive: Amazon Redshift, Amazon EMR (Presto, Impala, Spark). Amazon ML: batch prediction and real-time prediction.]
![Page 57: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/57.jpg)
Lambda Architecture
Batch layer: data → Amazon Kinesis → Amazon Kinesis S3 Connector → Amazon S3 → Amazon Redshift, Amazon EMR (Presto, Hive, Pig, Spark) → answer
Speed layer: Amazon Kinesis → KCL, AWS Lambda, Spark Streaming, Storm, Amazon ML → answer
Serving layer: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon ES → applications → answer
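The batch, speed, and serving layers on this slide can be sketched over a single immutable log. The code below is a toy illustration under stated assumptions: the log is an in-memory list of event names, the batch cutoff is an offset into it, and all function names are hypothetical. It shows the core idea that the serving layer merges a periodically recomputed batch view with a view of records that arrived since the last batch run.

```python
from collections import Counter


def batch_view(master_log, up_to):
    """Batch layer: recompute counts from the immutable log up to a cutoff."""
    return Counter(master_log[:up_to])


def speed_view(master_log, since):
    """Speed layer: counts for records that arrived after the last batch run."""
    return Counter(master_log[since:])


def serve(batch, speed, key):
    """Serving layer: merge both views at query time."""
    return batch[key] + speed[key]


log = ["click", "view", "click", "click", "view"]  # append-only event log
cutoff = 3                                         # last offset the batch job saw

batch = batch_view(log, cutoff)
speed = speed_view(log, cutoff)
clicks = serve(batch, speed, "click")
views = serve(batch, speed, "view")
```

When the next batch run completes, the cutoff advances and the speed view shrinks, so approximations in the speed layer are eventually replaced by batch results.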
![Page 58: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/58.jpg)
Summary
Build decoupled “data bus”
• Data → Store ↔ Process → Answers
Use the right tool for the job
• Latency, throughput, access patterns
Use Lambda architecture ideas
• Immutable (append-only) log, batch/speed/serving layer
Leverage AWS managed services
• No/low admin
Be cost conscious
• Big data ≠ big cost
![Page 59: February 2016 Webinar Series - Architectural Patterns for Big Data on AWS](https://reader033.vdocuments.us/reader033/viewer/2022042907/5878fc3e1a28ab49608b6c25/html5/thumbnails/59.jpg)
Thank you!
Find Getting Started Guides | Tutorials | Labs
aws.amazon.com/big-data