Large-Scale Load Testing Amazon.com’s Traffic on AWS (CPN102) | AWS re:Invent 2013
DESCRIPTION
It’s 4am and you don’t know it, but you're about to get three times the traffic you were expecting. Is your service ready to handle it? Systems are only as scalable as their weakest component. Large-scale load testing in production is the best (and surest) way to ensure that services can truly scale to the unexpected. But the load generator itself can be difficult to scale, expensive to run on hundreds or thousands of hosts, challenging to keep the data secure, and time-consuming to develop. The Amazon.com retail site is one of the most heavily used sites in the world, and has to be ready for anything, at any time. How do you design a load test for this in record time while keeping it cost-effective? Well, you use AWS! Come learn best practices on how you can use Amazon SQS, Amazon S3, Amazon EC2, Amazon CloudWatch, Auto Scaling, and Amazon DynamoDB to design horizontally scalable large-scale load tests that can simulate the load that millions of users are putting onto your site. We met a tight schedule and did it under budget thanks to AWS, and you can too!
TRANSCRIPT
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Large-Scale Load Testing
Amazon.com’s Traffic on AWS
Carlos Arguelles, Amazon.com
November 15, 2013
What I’d like you to get out of this
Load and performance issues cost
What I’d like you to get out of this
How you can leverage AWS for load and stress tests
About me
Amazon.com retail site
Amazon.com receives a LOT of traffic
Amazon.com retail site
Significant fluctuation throughout the day
(not to scale)
Amazon.com retail site
Significant fluctuation throughout the year
(not to scale)
Amazon.com retail site
Significant growth year to year
(not to scale)
Some load-related issues
can
(Chart: CPU usage on our fleet. Series: regular day (off-peak), 1st test (cancelled), 2nd test (successful). Marked levels: 50%, 100%, 85%.)
Some load-related issues
can only
(Diagram: Ingestion Fleet, Database, Output Fleet, Amazon S3, Hadoop, Database, Amazon DynamoDB)
Some load-related issues
cannot
(Chart: disk usage. Marked values: 5%, 20%. Time span: 5 hours. Annotation: "Start load…")
What do you really want to do?
Resilience Testing
Load Testing
Stress Testing
Performance Testing
Load Testing
Stress Testing
Resilience Testing
Performance Testing
How does AWS help us?
Generating load
Replays from real-world traffic
Artificial rate, blend of operations
Most useful AWS design pattern, ever
Distributing load, the hard way
(Diagram: a Master splits 12,000 TPS across four Slaves at 3000 TPS each. When one Slave drops to 0 TPS, the other three must each generate 4000 TPS.)
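The numbers on this slide follow from simple division, which can be sketched as below. This is an illustration only, not code from the talk; the helper name `per_host_tps` is hypothetical. It shows why static partitioning is fragile: when a host drops out, every remaining host has to be told to generate more load.

```python
def per_host_tps(target_tps, healthy_hosts):
    """TPS each healthy host must generate to hold the overall target.

    Hypothetical helper for illustration; not from the talk.
    """
    if healthy_hosts <= 0:
        raise ValueError("no healthy hosts left to generate load")
    return target_tps / healthy_hosts


print(per_host_tps(12_000, 4))  # all four hosts healthy
print(per_host_tps(12_000, 3))  # one host at 0 TPS; the others absorb its share
```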
Distributing load, the easy way
(Diagram: Controller, a queue of Jobs, a pool of Workers, Metrics & Dashboards)
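A minimal sketch of the queue-based pattern, assuming small independent jobs that any worker can pick up. It is not the talk's implementation: the standard library `queue.Queue` and threads stand in for a distributed queue and a fleet of hosts, and `run_jobs` is a hypothetical name. The point is that the controller never assigns work to a specific worker, so a slow or missing worker just means the others pull more jobs.

```python
import queue
import threading


def run_jobs(jobs, num_workers):
    """Enqueue jobs, let workers pull them, return what each worker processed."""
    job_queue = queue.Queue()
    for job in jobs:
        job_queue.put(job)

    # Each worker appends only to its own list, so no extra locking is needed.
    processed = {worker_id: [] for worker_id in range(num_workers)}

    def worker(worker_id):
        while True:
            try:
                job = job_queue.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            processed[worker_id].append(job)  # stand-in for generating the job's load
            job_queue.task_done()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed


result = run_jobs(jobs=list(range(7)), num_workers=5)
done = sorted(job for jobs_done in result.values() for job in jobs_done)
print(done)
```

Which worker handles which job varies from run to run; only the fact that every job is processed exactly once is guaranteed.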
Replaying traffic to generate load
(Diagram: Test Data Repository, Controller, a queue of Jobs, a pool of Workers, Service under test, Metrics & Dashboards)
Amazon S3 for storing data
Amazon DynamoDB for indexing
Amazon SQS for state, resilience
Amazon EC2 & Auto Scaling for hardware
Amazon CloudWatch
Reactive auto scaling based on queue size
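Scaling on queue size comes down to a decision like the one sketched below. This is an assumption-laden illustration, not the talk's configuration: in practice the rule would typically live in a CloudWatch alarm on the queue's depth metric that triggers an Auto Scaling policy, and the function name, the `jobs_per_worker` ratio, and the bounds here are all hypothetical.

```python
import math


def desired_workers(queue_depth, jobs_per_worker, min_workers, max_workers):
    """Pick a worker count proportional to the backlog, clamped to fleet limits.

    All parameters are illustrative assumptions, not values from the talk.
    """
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    needed = math.ceil(queue_depth / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))


print(desired_workers(queue_depth=95, jobs_per_worker=10, min_workers=1, max_workers=100))
```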
Generating load
Replays from real-world traffic
Artificial rate, blend of operations
Artificial traffic to generate load
• Why?
– You do not have real-world data
– You expect a change in traffic
• How?
– Control rate
– Control blend
– Control duration
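The three controls (rate, blend, duration) can be sketched as a small generator. This is a minimal illustration under stated assumptions, not the talk's load generator: `plan_operations` is a hypothetical name, the blend is reduced to a read/write split, and the seeded random choice is just one way to hit the blend on average.

```python
import random


def plan_operations(tps, read_fraction, duration_seconds, seed=0):
    """Yield (second, operation) pairs for one artificial load phase.

    tps controls the rate, read_fraction the blend, duration_seconds the duration.
    """
    rng = random.Random(seed)  # seeded so a plan can be reproduced
    for second in range(duration_seconds):
        for _ in range(tps):
            operation = "read" if rng.random() < read_fraction else "write"
            yield second, operation


ops = list(plan_operations(tps=10, read_fraction=0.99, duration_seconds=60))
reads = sum(1 for _, operation in ops if operation == "read")
print(len(ops), "operations planned,", reads, "of them reads")
```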
Artificial traffic to generate load
50,000 TPS for 20 minutes, 99% Read, 1% Writes
95,000 TPS for 3 hours, 80% Read, 20% Writes
85,000 TPS for 45 minutes, 90% Read, 10% Writes
Minute#1: 50,000 TPS, 99% Read, 1% Writes
…
Minute#20: 50,000 TPS, 99% Read, 1% Writes
Minute#1, jobs 1 to 5000: each 10 TPS for 1 minute, 99% R 1% W
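The decomposition on this slide (a long, high-rate phase broken into one-minute, 10 TPS jobs) can be sketched as follows. It is an illustration rather than the talk's code; `Job` and `split_phase` are hypothetical names, and the 10 TPS job size is taken from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    minute: int      # which minute of the phase this job covers (1-based)
    tps: int         # rate this single job generates
    read_pct: int
    write_pct: int


def split_phase(total_tps, minutes, read_pct, job_tps=10):
    """Split one load phase into small one-minute jobs of job_tps each."""
    if total_tps % job_tps != 0:
        raise ValueError("total_tps must be a multiple of job_tps")
    jobs_per_minute = total_tps // job_tps
    return [
        Job(minute, job_tps, read_pct, 100 - read_pct)
        for minute in range(1, minutes + 1)
        for _ in range(jobs_per_minute)
    ]


jobs = split_phase(total_tps=50_000, minutes=20, read_pct=99)
print(len(jobs))  # 5000 jobs per minute over 20 minutes
```

Small uniform jobs are what make the queue pattern work: any worker can take any job, and losing a worker loses at most one minute of 10 TPS.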
Artificial traffic to generate load
(Diagram: Controller, a queue of Jobs, a pool of Workers)
Amazon EC2 Spot Instances
• A great way to inexpensively test
– Up to 90% off regular price (name your price)
– Interruption-tolerant, time-flexible tasks
• Approaches
– Combine with on-demand instances (burst)
– Try Spot Instances first, then fall back to on-demand
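The "Spot first, then on-demand" approach can be sketched as a simple fallback. Everything here is hypothetical scaffolding for illustration: `launch_workers`, `CapacityUnavailable`, and the two fake launchers are not AWS APIs, and a real version would wrap the actual instance-request calls.

```python
class CapacityUnavailable(Exception):
    """Raised by a (hypothetical) launcher that cannot provide instances."""


def launch_workers(count, launch_spot, launch_on_demand):
    """Try the cheaper Spot launcher first; fall back to on-demand if it fails."""
    try:
        return "spot", launch_spot(count)
    except CapacityUnavailable:
        return "on-demand", launch_on_demand(count)


# Fake launchers so the sketch runs without touching any cloud API.
def fake_spot(count):
    raise CapacityUnavailable("Spot capacity not available at our price")


def fake_on_demand(count):
    return [f"on-demand-{i}" for i in range(count)]


kind, instances = launch_workers(3, fake_spot, fake_on_demand)
print(kind, instances)
```

Because the load generator's jobs are small and sit on a queue, a Spot interruption only returns a few jobs to the queue, which is what makes this kind of workload a good fit for Spot.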
Takeaways