(PFC403) Maximizing Amazon S3 Performance | AWS re:Invent 2014
DESCRIPTION
This session drills deep into the Amazon S3 technical best practices that help you maximize storage performance for your use case. We provide real-world examples and discuss the impact of object naming conventions and parallelism on Amazon S3 performance, and describe the best practices for multipart uploads and byte-range downloads.
TRANSCRIPT
Architecture
Choosing a region
Building a naming scheme
Considering LISTs
Optimizing PUTs
Multipart upload
Demo
Optimizing GETs
Using CloudFront
Range-based GETs
Demo
Customer Case
BigData Corp
Example sizing: 100 events / 8 seconds = 12.5 events/sec
100,000 users @ 10 events an hour ≈ 278 TPS
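As a sanity check, the sizing arithmetic above can be reproduced directly (the workload numbers are the slide's example, not measured values):

```python
# Back-of-envelope request-rate sizing from the example workload.
users = 100_000
events_per_user_per_hour = 10

events_per_hour = users * events_per_user_per_hour  # 1,000,000 events/hour
tps = events_per_hour / 3600                        # transactions per second

print(round(tps))  # ≈ 278
```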
<my_bucket>/2013_11_13-164533125.jpg
<my_bucket>/2013_11_13-164533126.jpg
<my_bucket>/2013_11_13-164533127.jpg
<my_bucket>/2013_11_13-164533128.jpg
<my_bucket>/2013_11_12-164533129.jpg
<my_bucket>/2013_11_12-164533130.jpg
<my_bucket>/2013_11_12-164533131.jpg
<my_bucket>/2013_11_12-164533132.jpg
<my_bucket>/2013_11_11-164533133.jpg
<my_bucket>/2013_11_11-164533134.jpg
<my_bucket>/2013_11_11-164533135.jpg
<my_bucket>/2013_11_11-164533136.jpg
[Diagram: with sequential, date-led key names, load concentrates on a single one of partitions 1, 2, …, N]
<my_bucket>/521335461-2013_11_13.jpg
<my_bucket>/465330151-2013_11_13.jpg
<my_bucket>/987331160-2013_11_13.jpg
<my_bucket>/465765461-2013_11_13.jpg
<my_bucket>/125631151-2013_11_13.jpg
<my_bucket>/934563160-2013_11_13.jpg
<my_bucket>/532132341-2013_11_13.jpg
<my_bucket>/565437681-2013_11_13.jpg
<my_bucket>/234567460-2013_11_13.jpg
<my_bucket>/456767561-2013_11_13.jpg
<my_bucket>/345565651-2013_11_13.jpg
<my_bucket>/431345660-2013_11_13.jpg
[Diagram: with randomized key prefixes, load spreads evenly across partitions 1, 2, …, N]
• Store objects under a hash of their name
– add the original name as metadata
– “deadmau5_mix.mp3” → 0aa316fb000eae52921aab1b4697424958a53ad9
• Or prepend the key name with a short hash
– 0aa3-deadmau5_mix.mp3
• Or prefix with epoch time, reversed
– 5321354831-deadmau5_mix.mp3
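A minimal sketch of the two prefix schemes above, assuming a 4-character SHA-1 prefix (the hash function and prefix length are illustrative; any well-distributed short prefix works):

```python
import hashlib


def hashed_key(name: str, prefix_len: int = 4) -> str:
    """Prepend a short hash of the object name so keys spread across
    S3's index partitions instead of clustering on one prefix."""
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{name}"


def reversed_epoch_key(name: str, epoch: int) -> str:
    """Alternative: prefix with the reversed epoch timestamp, so the
    fastest-changing digit comes first and successive keys diverge."""
    return f"{str(epoch)[::-1]}-{name}"
```

Either way, the original file name survives at the end of the key (or in metadata), while the leading characters distribute writes.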
<my_bucket>/images/521335461-2013_11_13.jpg
<my_bucket>/images/465330151-2013_11_13.jpg
<my_bucket>/movies/293924440-2013_11_13.jpg
<my_bucket>/movies/987331160-2013_11_13.jpg
<my_bucket>/thumbs-small/838434842-2013_11_13.jpg
<my_bucket>/thumbs-small/342532454-2013_11_13.jpg
<my_bucket>/thumbs-small/345233453-2013_11_13.jpg
<my_bucket>/thumbs-small/345453454-2013_11_13.jpg
Multipart upload lets you upload an object as a set of parts; Amazon S3 then presents all parts as a single object. It is faster and more flexible: parts can be uploaded in parallel, uploads can be paused and resumed, and you can begin uploading before you know the total object size.
DEMO: Multipart Uploads
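The core of a multipart upload is splitting the object into numbered byte ranges. A minimal sketch of that split (the 8 MB default part size is an assumption; S3 requires every part except the last to be at least 5 MB):

```python
def part_ranges(object_size: int, part_size: int = 8 * 1024 * 1024):
    """Yield (part_number, start, end) inclusive byte ranges for a
    multipart upload. S3 part numbers start at 1, and only the last
    part may be smaller than part_size."""
    part_number, start = 1, 0
    while start < object_size:
        end = min(start + part_size, object_size) - 1
        yield part_number, start, end
        part_number += 1
        start = end + 1
```

Each (start, end) slice would then be sent as its own part-upload request, in parallel, before a final "complete multipart upload" call stitches the parts into one object.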
DEMO: Amazon CloudFront vs. Amazon S3 download performance
• Align your ranges with your parts!
DEMO: Range-based GETs
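The "align your ranges with your parts" advice can be sketched as follows: generate Range header values whose boundaries match the part size used on upload, so each parallel GET fetches exactly one stored part (the part size here is illustrative):

```python
def aligned_range_headers(object_size: int, part_size: int):
    """HTTP Range header values aligned with multipart part boundaries,
    one per parallel GET."""
    headers, start = [], 0
    while start < object_size:
        end = min(start + part_size, object_size) - 1  # Range ends are inclusive
        headers.append(f"bytes={start}-{end}")
        start = end + 1
    return headers
```

Each value is sent as `Range: bytes=start-end` on its own GET request, and the responses are reassembled in order on the client.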
[Architecture diagram: BigData Corp crawl pipeline, using Amazon DynamoDB, Amazon RDS, Amazon CloudSearch, and Amazon EC2]
• Maestro (Reserved Instance): holds the list of crawl URLs
• Main workers (Spot Instances): execute crawling and process data
• Secondary workers, queue listeners (Spot Instances): reprocess data, query additional services, store data on MongoDB
• Secondary work queues: processed data
• MongoDB cluster
• Command and Control Queue
Please give us your feedback on this presentation.