(arc202) real-world real-time analytics | aws re:invent 2014
DESCRIPTION
Working with big volumes of data is a complicated task, but it's even harder if you have to do everything in real time and try to figure it all out yourself. This session will use practical examples to discuss architectural best practices and lessons learned when solving real-time social media analytics, sentiment analysis, and data visualization decision-making problems with AWS. Learn how you can leverage AWS services like Amazon RDS, AWS CloudFormation, Auto Scaling, Amazon S3, Amazon Glacier, and Amazon Elastic MapReduce to perform highly performant, reliable, real-time big data analytics while saving time, effort, and money. Gain insight from two years of real-time analytics successes and failures so you don't have to go down this path on your own.
TRANSCRIPT
![Page 1: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/1.jpg)
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
November 13, 2014 | Las Vegas, NV
ARC202
Real-World Real-Time Analytics
Gustavo Arjones | @arjones
CTO, Socialmetrix
Sebastian Montini | @sebamontini
Solutions Architect, Socialmetrix
![Page 2: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/2.jpg)
• SaaS Company—since 2008
• Social media analytics: track and measure the activity of brands and personalities, providing information for market research and brand comparison
• Multilanguage technology (English, Portuguese, and Spanish)
• Leader in Latin America, with operations in 5 countries and customers in Latin America and the US
• One of 34 Twitter Certified Program partners worldwide
![Page 3: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/3.jpg)
Our customers
![Page 4: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/4.jpg)
![Page 5: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/5.jpg)
![Page 6: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/6.jpg)
| Ranking | Brand 1 Q2 | Brand 1 Q3 | Brand 2 Q2 | Brand 2 Q3 | Brand 3 Q2 | Brand 3 Q3 |
|---|---|---|---|---|---|---|
| 1° | Flavor | Breakfast | Flavor | Flavor | Advertising | Flavor |
| 2° | Healthy | Flavor | Packaging | Brand I love | Flavor | Breakfast |
| 3° | Components | Components | Healthy | Packaging | Healthy | Healthy |
| 4° | Advertising | Healthy | Components | Addiction | Components | Advertising |
| 5° | Enquiries | Desire | Prices | Consumption | Prices | Components |
| TOTAL | 1.401 | 8.189 | 463 | 5.519 | 1.081 | 2.445 |
Share of topics
Which conversations are my brand and my competitors’ brands driving?
![Page 7: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/7.jpg)
smx.io/reinvent #reinvent
![Page 8: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/8.jpg)
Challenges
![Page 9: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/9.jpg)
Challenges: Variety
• Different data sources
• Different API
• SLA
• Method (pull or push)
• Rate-limit, backoff strategy
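A rate-limit and backoff strategy of the kind listed above could be sketched as follows. This is a minimal illustration, not the speakers' implementation: the `RateLimited` exception, the `fetch` callable, and the base/cap values are all assumptions made for the example.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised when a source API rejects a call due to rate limits."""


def backoff_delays(base=1.0, cap=60.0, attempts=5, rng=random.random):
    """Yield 'full jitter' delays: uniform in [0, min(cap, base * 2**n))."""
    for n in range(attempts):
        yield rng() * min(cap, base * 2 ** n)


def call_with_backoff(fetch, attempts=5, sleep=time.sleep):
    """Call fetch(); on RateLimited, wait and retry.

    After `attempts` failed tries, one last call is made and its error,
    if any, propagates to the caller.
    """
    for delay in backoff_delays(attempts=attempts):
        try:
            return fetch()
        except RateLimited:
            sleep(delay)
    return fetch()
```

Randomizing the delay (jitter) keeps many crawlers from retrying at the same instant after a shared rate-limit window resets.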
![Page 10: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/10.jpg)
Challenges: Velocity
• Updates every second
• Top users, top hashtags each minute
• Post-event analyses run as batch jobs over the complete dataset
• Spikes of 20,000+ tweets per minute
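The per-minute "top hashtags" requirement can be illustrated with a small in-memory sketch. The `(epoch_seconds, text)` input shape and the function name are assumptions for the example; a production pipeline would do this in a stream processor instead of a single process.

```python
from collections import Counter, defaultdict


def top_hashtags_per_minute(tweets, k=3):
    """Bucket (epoch_seconds, text) pairs by minute and return the top-k hashtags per bucket."""
    buckets = defaultdict(Counter)
    for ts, text in tweets:
        minute = int(ts) // 60  # tumbling one-minute window
        for word in text.split():
            if word.startswith("#") and len(word) > 1:
                buckets[minute][word.lower()] += 1  # case-insensitive count
    return {minute: counts.most_common(k) for minute, counts in buckets.items()}


sample = [(0, "#reinvent rocks"), (10, "hello #ReInvent #aws"), (61, "#aws #aws")]
print(top_hashtags_per_minute(sample, k=1))
```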
![Page 11: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/11.jpg)
Challenges: Velocity
[Chart annotations: "Last TV Debate", "Results Announced"]
![Page 12: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/12.jpg)
Challenges: Meaning
• Disambiguation
• Data enrichment
– Demographics
– Sentiment
– Influencers
• Human analysis
[Examples shown on slide: PAN; Orange Telecom; Oi Telecom, "Hi!"]
![Page 13: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/13.jpg)
Challenges: Alert and report
• Clear and understandable UI
• Slice-and-dice for business users (not BI experts)
• Real-time alerts for anomalies
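One simple way to raise real-time alerts for anomalies is a rolling threshold over per-minute counts. The sketch below is an assumption-laden illustration (window size, threshold, and the minimum deviation of 1.0 are arbitrary choices), not the detection method used in the talk.

```python
from collections import deque
from statistics import mean, pstdev


def spike_alerts(counts, window=5, threshold=3.0):
    """Return indexes whose value exceeds mean + threshold * stddev of the previous `window` values."""
    history = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(counts):
        if len(history) == window:
            mu, sigma = mean(history), pstdev(history)
            # Floor sigma at 1.0 so a perfectly flat history does not alert on tiny changes.
            if value > mu + threshold * max(sigma, 1.0):
                alerts.append(i)
        history.append(value)
    return alerts


per_minute = [100, 110, 90, 105, 95, 20000, 100]
print(spike_alerts(per_minute))
```

Note that a large spike inflates the rolling statistics for the next few minutes, so a real system would likely exclude flagged points from the baseline.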
![Page 14: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/14.jpg)
Architecture evolution
![Page 15: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/15.jpg)
Drivers for architecture evolution
• More customers, bigger customers
• Add new features
• Keep costs under control
![Page 16: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/16.jpg)
Architecture evolution
[Chart: active customers (axis 0 to 120) across architecture iterations #1 to #4]
![Page 17: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/17.jpg)
Architecture—1st iteration
What we needed:
• Complete data isolation
• Trying different solutions/offerings
![Page 18: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/18.jpg)
Architecture—1st iteration
What we did:
• All-in-one approach
• Multi-instance architecture
• Simple vertical scalability
• MySQL performance tuning
![Page 19: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/19.jpg)
Architecture—1st iteration
What we've learned:
• Multi-instance is harder to administer, but minimizes the impact of instability on customers
• Vertical scalability: poor resource management
• MySQL schema changes translate into downtime
![Page 20: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/20.jpg)
Architecture—2nd iteration
What we needed:
• Separation of responsibilities (crawling, processing)
• Horizontal scalability
• Fast provisioning
• Cost reduction
![Page 21: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/21.jpg)
Architecture—2nd iteration
What we changed:
• Migrated to AWS
• RabbitMQ (Single Node)
• Replaced MySQL with Amazon RDS
• AWS CloudFormation
• Auto Scaling groups
![Page 22: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/22.jpg)
Architecture—2nd iteration
What we've learned:
• PIOPS (Provisioned IOPS)
• Tuning the Auto Scaling policies can be hard
• AWS CloudFormation: great for migration, not enough for daily ops
![Page 23: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/23.jpg)
Architecture—3rd iteration
What we needed:
• Deliver new features (NRT, more complex analytics)
• Scale fast
• Be resilient against failure
• Adding and improving data sources
• Keep costs under control (always)
![Page 24: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/24.jpg)
Architecture—3rd iteration
What we changed:
• Apache Storm
• RabbitMQ HA
• Amazon Elastic MapReduce (Hadoop/Hive)
• AWS CloudFormation + Chef
• Amazon Glacier + Amazon S3 lifecycle policies
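As an illustration of the Amazon S3 lifecycle policies mentioned above, the sketch below builds a rule that transitions objects to Amazon Glacier after a number of days. The prefix, rule ID, and 90-day value are assumptions for the example; the dictionary follows the shape that present-day boto3 accepts as `LifecycleConfiguration` in `put_bucket_lifecycle_configuration`, which is not necessarily the tooling the speakers used in 2014.

```python
import json


def glacier_lifecycle_rule(prefix, days_to_glacier, rule_id="archive-raw-data"):
    """Build an S3 lifecycle configuration that moves objects under `prefix` to Glacier."""
    return {
        "Rules": [
            {
                "ID": rule_id,  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_to_glacier, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


# Hypothetical prefix and retention period, for illustration only.
config = glacier_lifecycle_rule("raw/tweets/", 90)
print(json.dumps(config, indent=2))
```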
![Page 25: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/25.jpg)
Architecture—3rd iteration
What we've learned:
• Spot Instances + Reserved Instances
• Hive = SQL, and SQL scripts are hard to test
• Bulk upserts on Amazon RDS can be expensive (PIOPS)
• Amazon DynamoDB is great, but expensive (for our use-case)
![Page 26: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/26.jpg)
Dashboard
![Page 27: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/27.jpg)
Architecture—4th iteration
What we needed:
• Monitor millions of social media profiles
• Make data accessible (exploration, PoC)
• Improve UI response times
• Testing our data pipelines
• Reprocessing (faster)
![Page 28: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/28.jpg)
Architecture—4th iteration
What we changed:
• Cassandra (DSE)
• MongoDB MMS
• Apache Spark
![Page 29: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/29.jpg)
Architecture—4th iteration
What we've learned:
• Leverage AWS ecosystem
• DataStax AMI + OpsCenter integration
• MongoDB MMS: automation magic!
• Apache Spark unit testing + Amazon EC2 launch scripts
• Amazon EMR doesn’t have the latest stable versions
![Page 30: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/30.jpg)
![Page 31: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/31.jpg)
Architecture evolution
[Chart: costs and active customers across architecture iterations #1 to #4]
![Page 32: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/32.jpg)
Lessons learned
![Page 33: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/33.jpg)
Lessons learned
• Automate from Day 1 (CloudFormation + Chef)
• Monitor system activity and understand your data patterns, e.g. with Logstash (ELK)
• Always have a Source of Truth (Amazon S3 + Glacier)
• Make your Source of Truth searchable
![Page 34: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/34.jpg)
Lessons Learned (II)
• Approximation is a good thing: HyperLogLog (HLL), Count-Min Sketch (CMS), Bloom filters
• Write your pipelines with reprocessing needs in mind
• Avoid framework explosion at all costs
• AWS ecosystem allows rapid prototyping
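To make the approximation point concrete, here is a minimal Count-Min Sketch, one of the structures named above. It estimates item frequencies in fixed memory and may overestimate but never underestimates. The width, depth, and SHA-256 based hashing are assumptions chosen for readability, not tuned values from the talk.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency estimator: depth rows of width counters each."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One independent-looking hash per row, derived by salting with the row index.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest estimate.
        return min(self.table[row][col] for row, col in self._columns(item))


cms = CountMinSketch()
for _ in range(50):
    cms.add("#reinvent")
cms.add("#aws", 7)
print(cms.estimate("#reinvent"), cms.estimate("#aws"))
```

Memory use depends only on `width * depth`, not on the number of distinct hashtags seen, which is what makes this attractive for high-volume streams.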
![Page 35: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/35.jpg)
Socialmetrix NextGen
2015
![Page 36: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/36.jpg)
Architecture evolution
[Chart: active customers (axis 0 to 120) across architecture iterations #1 to #4]
![Page 37: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/37.jpg)
Architecture nextgen
• Reduce moving parts
• Apache Spark as central processing framework
– Realtime (Micro-batch)
– Batch-processing
• Kafka or Amazon Kinesis (Message Broker)
• Cassandra (Time-series storage)
• ElasticSearch (Content Indexer)
![Page 38: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/38.jpg)
Architecture evolution
To infinity … and beyond!
[Chart: active customers (axis 0 to 120) across architecture iterations #1 to #4 and NextGen]
![Page 39: (ARC202) Real-World Real-Time Analytics | AWS re:Invent 2014](https://reader033.vdocuments.us/reader033/viewer/2022052622/5591afe91a28ab39518b466d/html5/thumbnails/39.jpg)
Gustavo Arjones, CTO
@arjones | [email protected]
Sebastian Montini, Solutions Architect
@sebamontini | [email protected]
Feedback and Q&A