big data on aws - meetupfiles.meetup.com/11363042/big_data_meetup.pdf · gb tb pb zb eb the world...
![Page 1: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/1.jpg)
Big Data on AWS
Peter-Mark Verwoerd
Solutions Architect
![Page 2: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/2.jpg)
What to get out of this talk
• Non-technical:
– Big Data processing stages: ingest, store, process, visualize
– Hot vs. Cold data
– Low latency processing vs. high latency processing
• Technical:
– Concepts above
– Big Data reference architectures and design patterns
![Page 3: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/3.jpg)
The World is Producing Ever-Larger Volumes of Big Data
GB → TB → PB → EB → ZB
• IT / application server logs: IT infrastructure logs, metering, audit logs, change logs
• Web sites / mobile apps / ads: clickstream, user engagement
• Sensor data: weather, smart grids, wearables
• Social media, user content: 450MM+ Tweets/day
![Page 4: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/4.jpg)
Big Data: Best Served Fresh
Big Data
• Hourly server logs: how your systems were misbehaving an hour ago
• Weekly / monthly bill: what you spent this past billing cycle
• Daily customer-preferences report from your web site’s clickstream: tells you what deal or ad to try next time
• Daily fraud reports: tell you if there was fraud yesterday
Real-time Big Data
• CloudWatch metrics: what just went wrong now
• Real-time spending alerts/caps: guaranteeing you can’t overspend
• Real-time analysis: tells you what to offer the current customer now
• Real-time detection: blocks fraudulent use now
![Page 5: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/5.jpg)
The Challenge
Data → Big Data → Real-time Big Data = plethora of tools
![Page 6: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/6.jpg)
The Zoo
Apache Kafka
Amazon Kinesis
Apache Flume
Storm
Apache Spark
Apache Spark Streaming
Hadoop/EMR
Redshift S3
DynamoDB
Hive Pig Shark
HDFS
Impala
?
![Page 7: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/7.jpg)
Partners
Flume, Sqoop
HParser
![Page 8: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/8.jpg)
Simplify
Data → Ingest → Store → Process → Visualize → Answers
• Ingest: Kinesis, Flume, Scribe, Kafka
• Store: HDFS, DynamoDB, Redshift, S3
• Process: Storm, Shark, Spark, Spark Streaming, Hive/Pig, Hadoop/EMR
• Visualize: Jaspersoft, Tableau
![Page 9: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/9.jpg)
Ingest
Data → Ingest
![Page 10: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/10.jpg)
Ingest
• The act of collecting and storing data
![Page 11: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/11.jpg)
Why Data Ingest Tools?
• Collect random and high velocity data
– Many different sources
– High TPS
• Collecting random and high velocity data is a challenging task
– Hard to durably store data at scale
– Hard to keep highly available
– Hard to scale
![Page 12: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/12.jpg)
Why Data Ingest Tools?
• Data ingest tools convert random streams of data into a smaller set of sequential streams
– Sequential streams are easier to process
– Easier to scale
– Easier to persist
[Diagram: Kafka or Kinesis feeding Processing]
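
The idea of turning many random sources into a few sequential streams can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how Kafka or Kinesis is implemented: the shard count, the `shard_for` and `ingest` helpers, and the sample events are all hypothetical names chosen for this example. Each source is hashed to one shard, so records from the same source stay in arrival order within a single stream.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count, chosen only for illustration


def shard_for(source_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a source to a shard with a stable hash, so one source always lands in one stream."""
    digest = hashlib.md5(source_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def ingest(events, num_shards: int = NUM_SHARDS):
    """Append (source_id, payload) events to per-shard lists, preserving arrival order."""
    shards = defaultdict(list)
    for source_id, payload in events:
        shards[shard_for(source_id, num_shards)].append((source_id, payload))
    return dict(shards)


# Hypothetical events arriving interleaved from several sources.
events = [
    ("web-1", "click"),
    ("sensor-9", "21.5C"),
    ("web-1", "scroll"),
    ("app-3", "login"),
]
streams = ingest(events)
for shard_id in sorted(streams):
    print(shard_id, streams[shard_id])
```

A consumer can then read each shard sequentially, which is the property the slide credits with making processing, scaling, and persistence easier.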
![Page 13: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/13.jpg)
Data Ingest Tools
• Facebook Scribe: data collector
• Amazon Kinesis: data collector
• Apache Kafka: data collector
• Apache Flume: data movement and transformation
![Page 14: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/14.jpg)
Partners – Data Load and Transformation
Big Data Edition
Flume, Sqoop
HParser
![Page 15: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/15.jpg)
Storage
Data → Ingest → Store
![Page 16: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/16.jpg)
Storage
Structured – Complex Query
• SQL
– Amazon RDS (MySQL, Oracle, SQL Server, Postgres)
• Data Warehouse
– Amazon Redshift
• Search
– Amazon CloudSearch
Unstructured – Custom Query
• Hadoop/HDFS
– Amazon Elastic MapReduce (EMR)
Structured – Simple Query
• NoSQL
– Amazon DynamoDB
• Cache
– Amazon ElastiCache (Memcached, Redis)
Unstructured – No Query
• Cloud Storage
– Amazon S3
– Amazon Glacier
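
The four quadrants on this slide amount to a lookup keyed by how structured the data is and what kind of query it needs. The sketch below encodes that lookup in Python; the `STORAGE_OPTIONS` table and the `storage_candidates` function are illustrative names, and the groupings are copied from the slide rather than from any AWS guidance.

```python
# Quadrants as listed on the slide: (structure, query style) -> services.
STORAGE_OPTIONS = {
    ("structured", "complex"): ["Amazon RDS", "Amazon Redshift", "Amazon CloudSearch"],
    ("structured", "simple"): ["Amazon DynamoDB", "Amazon ElastiCache"],
    ("unstructured", "custom"): ["Amazon EMR (Hadoop/HDFS)"],
    ("unstructured", "none"): ["Amazon S3", "Amazon Glacier"],
}


def storage_candidates(structure: str, query: str) -> list:
    """Return the services the slide places in the given quadrant."""
    key = (structure.lower(), query.lower())
    if key not in STORAGE_OPTIONS:
        raise ValueError(f"no quadrant for structure={structure!r}, query={query!r}")
    return STORAGE_OPTIONS[key]


print(storage_candidates("structured", "simple"))
print(storage_candidates("unstructured", "none"))
```

The lookup only narrows the field; choosing within a quadrant still depends on latency, volume, and cost.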
![Page 17: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/17.jpg)
[Figure: storage services positioned by structure (low to high), request rate (high to low), cost/GB (high to low), latency (low to high), and data volume (low to high). Services shown: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon EMR, Amazon S3, Amazon Glacier]
![Page 18: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/18.jpg)
| | ElastiCache | Amazon DynamoDB | Amazon RDS | CloudSearch | Amazon Redshift | Amazon EMR (Hive) | Amazon S3 | Amazon Glacier |
|---|---|---|---|---|---|---|---|---|
| Average latency | ms | ms | ms, sec | ms, sec | sec, min | sec, min, hrs | ms, sec, min (~ size) | hrs |
| Data volume | GB | GB–TBs (no limit) | GB–TB (3 TB max) | GB–TB | TB–PB (1.6 PB max) | GB–PB (~ nodes) | GB–PB (no limit) | GB–PB (no limit) |
| Item size | B–KB | KB (64 KB max) | KB (~ row size) | KB (1 MB max) | KB (64 K max) | KB–MB | KB–GB (5 TB max) | GB (40 TB max) |
| Request rate | Very high | Very high | High | High | Low | Low | Low–very high (no limit) | Very low (no limit) |
| Cost ($/GB/month) | $$ | ¢¢ | ¢¢ | $ | ¢ | ¢ | ¢ | ¢ |
| Durability | Low–moderate | Very high | High | High | High | High | Very high | Very high |
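
One way to use the item-size figures above is as a first filter when picking a store. The Python sketch below does that under two stated assumptions: it includes only the stores whose maximum is listed with an explicit unit, and it treats KB/MB/TB as binary multiples. The limits are the ones printed on the slide and may not match current service limits; `MAX_ITEM_SIZE` and `stores_that_fit` are hypothetical names.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB

# Maximum item sizes as listed on the slide (assumption: binary units).
# Stores without an explicit maximum and unit on the slide are left out.
MAX_ITEM_SIZE = {
    "Amazon DynamoDB": 64 * KB,
    "CloudSearch": 1 * MB,
    "Amazon S3": 5 * TB,
    "Amazon Glacier": 40 * TB,
}


def stores_that_fit(item_bytes: int) -> list:
    """Return the stores whose slide-listed maximum item size can hold the item."""
    return [name for name, limit in MAX_ITEM_SIZE.items() if item_bytes <= limit]


print(stores_that_fit(100 * KB))  # too large for the 64 KB limit
print(stores_that_fit(10 * TB))   # only the largest limit remains
```

Size is only one row of the comparison; latency, request rate, and cost would narrow the result further.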
![Page 19: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/19.jpg)
Process
Data → Ingest → Store → Process
![Page 20: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/20.jpg)
Process
• Answering questions about data
• Questions
– Analytics: Think SQL/Data warehouse
– Classification: Think Sentiment Analysis
– Prediction: Think page-view prediction
– Etc
![Page 21: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/21.jpg)
Processing Frameworks
• Generally come in two major types
– Batch processing
– Stream processing
![Page 22: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/22.jpg)
Processing Frameworks
• Batch Processing
– Take a large amount (>100 TB) of cold data and ask questions
– Takes hours to get answers back
Example: Generating Monthly AWS Billing Reports
![Page 23: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/23.jpg)
Processing Frameworks
• Stream Processing (a.k.a. real-time)
– Take a small amount of hot data and ask questions
– Takes a short amount of time to get your answer back
Example: CloudWatch 1-minute metrics
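
The contrast between batch and stream processing can be shown on the same toy data. In this Python sketch (all names and the sample records are hypothetical), the batch function scans every record and answers once, while the stream function consumes time-ordered records one at a time and emits a count for each one-minute tumbling window as soon as that window closes, loosely in the spirit of the 1-minute metrics example.

```python
from collections import Counter


def batch_count(records):
    """Batch: all records are available up front; scan everything, answer once."""
    return Counter(kind for _, kind in records)


def stream_counts(records, window_seconds=60):
    """Stream: consume time-ordered (timestamp, kind) records one by one and
    yield (window_start, counts) each time a tumbling window closes."""
    current_window = None
    counts = Counter()
    for timestamp, kind in records:
        window = timestamp // window_seconds
        if current_window is not None and window != current_window:
            yield current_window * window_seconds, dict(counts)
            counts = Counter()
        current_window = window
        counts[kind] += 1
    if current_window is not None:
        # Flush the last, still-open window when the input ends.
        yield current_window * window_seconds, dict(counts)


# Hypothetical log records: (seconds since start, status).
records = [(0, "error"), (15, "ok"), (61, "error"), (75, "error"), (130, "ok")]

print(batch_count(records))
for window_start, counts in stream_counts(records):
    print(window_start, counts)
```

The batch answer is complete but arrives only after the full scan; the windowed answers are partial but arrive while the data is still hot.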
![Page 24: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/24.jpg)
Processing Frameworks
• Hadoop/EMR: batch processing
• Spark: batch processing
• Spark Streaming: stream processing
• Storm: stream processing
• Redshift: batch processing
![Page 25: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/25.jpg)
Partners – Advanced Analytics
Impala
![Page 26: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/26.jpg)
Visualize
Data → Ingest → Store → Process → Visualize
![Page 27: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/27.jpg)
Activities of Data Visualization Users
• Which country consumes the most oil?
• Which countries are oil exporters?
• Is there a trend of increasing oil consumption over time?
• Order countries by oil consumption/production
• Is there a cluster of oil producers?
• What is the oil consumption of the USA per day?
• What is the average oil consumption per day of Europe?
• Are there any outliers?
• What is the range of oil production?
• What is the distribution of oil-producing countries?
![Page 28: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/28.jpg)
![Page 29: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/29.jpg)
Partners – BI & Data Visualization
![Page 30: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/30.jpg)
Putting it all together (coupled architecture)
• Ingest/Store and processing tightly coupled
• Examples:
– S3 + EMR/Hadoop
– HDFS + EMR/Hadoop
– S3 + Redshift
![Page 31: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/31.jpg)
Putting it all together (coupled architecture)
• Coupled systems provide less flexibility
– Cold data vs. Hot
– High latency processing vs. Low latency processing
• Example
– EMR+HDFS/S3
• Cold: Can handle processing 100 records/sec
• Hot: processing 1,000,000 records/sec?
– Redshift + S3
• High latency: Generate reports once a day
• Low latency: Generate reports every minute
![Page 32: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/32.jpg)
Putting it all together (de-coupled architecture)
• Multi-tier data processing architecture
– Similar to multi-tier web-application architectures
• Ingest & Store de-coupled from Processing
– Concept of “databus”
Data → Databus → Process → Answers
![Page 33: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/33.jpg)
Putting it all together (de-coupled architecture)
• Ingest tools write to multiple data stores within the “databus”
• Processing frameworks (Hadoop, Spark, etc.) consume from the “databus”
• Consumers can decide which data store to read from depending on their data processing requirements
Data → Ingest/Store (Kafka, S3, HDFS) → Process → Answers
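
The de-coupled layout can be sketched as a toy in-memory "databus" in Python. Nothing here talks to Kafka, S3, or HDFS: the store names are just dictionary keys, and the `Databus` class and the two consumer functions are hypothetical. The point is only the shape: ingest fans each record out to every store, and each consumer picks the store that suits its latency needs.

```python
class Databus:
    """Toy in-memory databus: every ingested record is written to each registered store."""

    def __init__(self, store_names):
        self.stores = {name: [] for name in store_names}

    def ingest(self, record):
        # Fan out: the ingest side does not know or care who will read later.
        for records in self.stores.values():
            records.append(record)

    def read(self, store_name):
        # Return a copy so consumers cannot mutate the store.
        return list(self.stores[store_name])


def low_latency_consumer(bus):
    """Reads the stream-like store and reacts to the newest record only."""
    return bus.read("kafka")[-1]


def high_latency_consumer(bus):
    """Reads the bulk store and aggregates over everything collected so far."""
    return sum(record["amount"] for record in bus.read("s3"))


bus = Databus(["kafka", "s3", "hdfs"])
for amount in (5, 7, 30):
    bus.ingest({"amount": amount})

print(low_latency_consumer(bus))
print(high_latency_consumer(bus))
```

Because producers and consumers meet only at the bus, a new processing framework can be added without touching the ingest side.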
![Page 34: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/34.jpg)
Data temperature & processing latency
![Page 35: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/35.jpg)
Pattern 1: Redshift (cold and high)
![Page 36: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/36.jpg)
Pattern 2: DynamoDB (warm and low)
![Page 37: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/37.jpg)
Pattern 3: Hadoop (cold and high)
![Page 38: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/38.jpg)
Pattern 4: Hadoop (warm and low)
![Page 39: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/39.jpg)
Pattern 5: Spark (cold and low)
![Page 40: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/40.jpg)
Pattern 6: Stream Processing (hot and low)
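
The six pattern titles pair each tool with a data temperature and a processing latency, which can be read as a small lookup table. The Python sketch below records exactly the pairings named in the titles; `PATTERNS` and `candidates` are illustrative names, and the diagrams behind each pattern are not reproduced here.

```python
# (tool, data temperature, processing latency) as given in the pattern titles.
PATTERNS = [
    ("Redshift", "cold", "high"),
    ("DynamoDB", "warm", "low"),
    ("Hadoop", "cold", "high"),
    ("Hadoop", "warm", "low"),
    ("Spark", "cold", "low"),
    ("Stream processing", "hot", "low"),
]


def candidates(temperature: str, latency: str) -> list:
    """Return the tools whose pattern matches the given temperature and latency."""
    return [
        tool
        for tool, pattern_temperature, pattern_latency in PATTERNS
        if pattern_temperature == temperature and pattern_latency == latency
    ]


print(candidates("cold", "high"))
print(candidates("hot", "low"))
```

An empty result, for example for hot data with high latency, simply means the deck names no pattern for that combination.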
![Page 41: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/41.jpg)
Putting it All Together
![Page 42: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/42.jpg)
What to get out of this talk
• Non-technical:
– Big Data processing stages: ingest, store, process, visualize
– Hot vs. Cold data
– Low latency processing vs. high latency processing
• Technical:
– Concepts above
– Big Data reference architectures and design patterns
![Page 43: Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure](https://reader031.vdocuments.us/reader031/viewer/2022030510/5aba33777f8b9ab1118b9139/html5/thumbnails/43.jpg)
Questions?