the challenges of live events scalability
THE CHALLENGES OF SUPPORTING ONLINE LIVE EVENTS WITH TV PARTICIPATION NUMBERS
Presentation for B.Sc students from IDC
By Guy Tomer, November 2011
A STARTUP PERSPECTIVE
Hello
• I’m Guy Tomer
• Founding and working in start-ups for the last 13 years
• Founder & CTO of attracTV for the last 4 years
• This presentation is about
• Building a scalable system for “a lot” of users
• More specifically, handling usage peaks of live TV events on the internet
• Even more specifically – how we tackle it as a small start-up
attracTV
Web-based self-service solution and tools for managing viewers’ engagement and interaction on the online screen
Social, Information, Advertisement, eCommerce
Our Use Case – MTV European Music Awards
• One of the biggest online live streams ever
• Can’t expose precise numbers but
• 7 digits (> 1,000,000) – number of streams
• 6 digits (> 100,000) – number of concurrent users
• 5 digits (> 10,000) – number of users joining every minute at peak
• In addition
• International event, 20 sites, viewers from >150 countries
• 9 languages
What Are The Challenges
1. Scaling for these numbers
2. Handling very steep ramp-up
3. Big data
4. High availability
5. Testing & preparing for such numbers
6. The cost of the above – how to do it and still make money
We’ll discuss mainly 1, 5 & 6
Some Big “Internet Scale” Examples
• Google uses about 900,000 servers
• (MapReduce) Google sorted a ten-petabyte input set in 6 hours and 27 minutes on 8000 computers
• Facebook serves 1 trillion pages per month
• (2010) 30 billion – pieces of content (links, notes, photos, etc.) shared on Facebook per month
• (2010) 2 billion – the number of videos watched per day on YouTube
• Akamai, the “CDN to the stars”, has 95,811 servers (Q2 2011) in 1000 networks and 70 countries
Challenge 1 – Handling The Scale
• We are prepared for 400,000 concurrent viewers
• HTTP polling every N seconds, 10 <= N <= 30
• This means ~20,000 HTTP R/S (requests per second), e.g. 400,000 viewers / 20 seconds
• For comparison
• Stack Overflow recently reported 800 R/S
• Sify.com (leading portal in India) reported 3900 R/S
• Jobs' death resulted in a record-breaking 10,000 tweets/s (they do have a lot more requests, that’s just to feel the scale)
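The ~20,000 R/S figure can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes polls are spread evenly over the interval, and the function name is made up for illustration.

```python
def requests_per_second(concurrent_viewers: int, poll_interval_s: float) -> float:
    """Average request rate when every viewer polls once per interval.

    Assumption: polls are spread evenly in time (no synchronized bursts).
    """
    if poll_interval_s <= 0:
        raise ValueError("poll_interval_s must be positive")
    return concurrent_viewers / poll_interval_s


viewers = 400_000  # capacity target from the slide
for interval in (10, 20, 30):  # the slide's polling range, 10 <= N <= 30 seconds
    rate = requests_per_second(viewers, interval)
    print(f"poll every {interval}s -> {rate:,.0f} R/S")
```

The slide's ~20,000 R/S corresponds to the middle of the range (N = 20); the fastest polling interval (N = 10) would double it.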
What Is Scalability
• From Wikipedia: “Scalability is the ability of a system, network, or process, to handle growing amounts of work in a graceful manner or its ability to be enlarged to accommodate that growth.”
Performance ≠ Scalability
The fact that your code runs very fast for X users doesn’t mean your architecture supports 100*X users.
Vertical Scalability (scale up)
• “Get a bigger server”
• “Use faster CPUs”
• Cons
• Can only help so much (with bad scale/$ value).
• A server twice as fast is more than twice as expensive
• Pros
• Easier to manage fewer computers
• Can use virtualization
Horizontal Scalability (scale out)
• “Just add another box” (or another thousand or ...)
• Plan the architecture right first, do micro-optimizations later
• Pros
• Theoretically unlimited
• Works well with the elasticity of cloud services
• Cons
• More complex to manage
• More complex programming models
Challenge #2 – Steep Ramp-up
• Live Event - Everyone comes at the same time
• The fact that a car can drive 250 km/h doesn’t mean it can do 0-100 km/h in 4 seconds
Standard website example (Wikimedia) ≠ Steep ramp-up
Challenge #3 – Big Data
• From Wikipedia:
“Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools”
• One of the biggest hypes in the industry today
• During this event we had ~10,000,000 records written to our analytics system per hour
• We’re not “Big Data” yet but it’s coming
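To get a feel for that write volume, the hourly figure can be converted to a per-second rate. A small sketch; it assumes writes are spread evenly over the hour, which a live event with peaks will not do.

```python
RECORDS_PER_HOUR = 10_000_000  # figure from the slide
SECONDS_PER_HOUR = 3600

# Average write rate, assuming an even spread over the hour.
writes_per_second = RECORDS_PER_HOUR / SECONDS_PER_HOUR
print(f"~{writes_per_second:,.0f} analytics writes per second on average")
```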
Challenge #4 – High Availability
“High availability refers to a system or component that is continuously operational for a desirably long length of time.”
• We need to meet a Service Level of 99.9%
• Backup and failover systems are expensive
• The cloud comes to our help
High availability in the cloud
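What a 99.9% service level allows in practice can be worked out as a downtime budget. A minimal sketch: the measurement periods below (a 30-day month and a 12-hour event) are assumptions, since the slide does not say over which period the service level is measured.

```python
def downtime_budget_minutes(availability: float, period_hours: float) -> float:
    """Maximum downtime (in minutes) that still meets the availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_hours * 60


SLA = 0.999  # 99.9%, from the slide

# Assumed measurement periods, for illustration only.
print(f"30-day month:  {downtime_budget_minutes(SLA, 30 * 24):.1f} minutes")
print(f"12-hour event: {downtime_budget_minutes(SLA, 12) * 60:.0f} seconds")
```

Measured over a single live event, the budget shrinks to well under a minute, which is why failover has to be automatic.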
Challenge #5 – Testing
• Simulating 100s of thousands of concurrent users… not trivial
• Requires 10s of strong servers
• Very difficult to collect the data
• The cloud comes to our help
Challenge #6 – Handling The Costs Of Such Event (Hint - Elasticity)
• For production we used ~50 servers, each with 4 cores at 2 GHz and 15 GB RAM (m1.xlarge)
• Some options (rough estimation) for this are:
• Buy - ~$3500 per box = $175,000. Not for us…
• Dedicated server for a month - ~$1000 per instance = $50,000
• VPS (virtual private server) monthly - ~$300 per box = $15,000
• Solution: Cloud on-demand (Amazon AWS) - ~$500 per instance = $25,000 for a month… BUT… no need to take it for a month, we activate it on demand for 12 hours and it costs $416!
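The comparison above can be checked with a short calculation. This is a rough sketch: it assumes a 30-day month and that the on-demand price is billed per hour in proportion to the monthly figure; the function names are illustrative only.

```python
SERVERS = 50
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day month


def fleet_cost(price_per_server: float, servers: int = SERVERS) -> float:
    """Cost of the whole fleet at a fixed price per server."""
    return price_per_server * servers


def on_demand_cost(monthly_price: float, hours: float, servers: int = SERVERS) -> float:
    """Cost of running the fleet for only `hours`, prorated from a monthly price."""
    hourly_price = monthly_price / HOURS_PER_MONTH
    return hourly_price * hours * servers


print(f"Buy:                 ${fleet_cost(3500):,.0f}")
print(f"Dedicated (1 month): ${fleet_cost(1000):,.0f}")
print(f"VPS (1 month):       ${fleet_cost(300):,.0f}")
print(f"Cloud (1 month):     ${fleet_cost(500):,.0f}")
print(f"Cloud (12 hours):    ${on_demand_cost(500, hours=12):,.2f}")
```

The cloud is not the cheapest option per month; it wins only because the servers are needed for hours, not weeks.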
Our #1 Lesson - Think Horizontal!
• Why not vertical?
• We don’t want it to be our business’s bottleneck at any point in time
• We don’t want to buy giant servers
• We wanted a cheap start
• We want elasticity
• We don’t want to buy anything at this point
• How? (deserves a separate lecture)
• Everything in the architecture
• No state shared between the web/app servers
• No relation between the # of users and the load on the database
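One common way to decouple database load from the number of users is a short-lived cache on each app server. The sketch below is an assumption for illustration, not necessarily how attracTV did it: the class, the loader and the key name are all hypothetical, and a fake clock is used so the demo runs instantly.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Caches loader results for ttl_s seconds (not thread-safe; illustration only).

    Backend reads then depend on elapsed time and the number of keys,
    not on how many requests arrive.
    """

    def __init__(self, loader: Callable[[str], Any], ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl_s:
            return entry[1]  # still fresh: no database access
        value = self._loader(key)  # expired or missing: one database read
        self._entries[key] = (now, value)
        return value


# Hypothetical stand-in for a database query; counts how often it is called.
db_reads = {"count": 0}


def load_event_state(key: str) -> dict:
    db_reads["count"] += 1
    return {"key": key, "version": db_reads["count"]}


fake_now = [0.0]  # fake clock so the demo does not have to sleep
cache = TTLCache(load_event_state, ttl_s=10.0, clock=lambda: fake_now[0])

for _ in range(100_000):  # 100,000 polls within the same 10-second window
    cache.get("event:live")
print("DB reads after 100,000 polls:", db_reads["count"])

fake_now[0] = 10.0  # the entry expires, so the next poll reloads it
cache.get("event:live")
print("DB reads after expiry:", db_reads["count"])
```

With this pattern each app server hits the database at most once per key per TTL window, so adding viewers adds load to the app servers (which scale out) rather than to the database.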
Lesson #2 KISS
• Keep It Simple Stupid
• Your system architecture
• Your code
• Your features
• Your business model
• If you don’t, you won’t scale (from personal experience)
Hug out all the complexity in your system
Lesson #3 – Load Test Everything, Focus On Real World Usage Patterns
• We did massive stress testing
• We launched tens of servers just for stress testing
• Automated with JMeter and monitored the same way as production
Why?
• The only way to test your scaling capabilities
• Code inspection and manual tests are irrelevant here
• Measure the capacity of a single app server
• Test the specific ramp-up scenario because
• Example: 1 app server = 5000 users; we need to support 200,000 users, so we need to prepare at least 40 servers
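The sizing rule in the example can be written down as a small helper. A minimal sketch: the `headroom` parameter is an added assumption (the slide only says “at least 40”), and the per-server capacity is whatever the load test measured.

```python
import math


def servers_needed(target_users: int, users_per_server: int, headroom: float = 0.0) -> int:
    """Number of app servers for target_users, rounded up.

    headroom is an assumed safety margin, e.g. 0.25 for 25% spare capacity.
    """
    if users_per_server <= 0:
        raise ValueError("users_per_server must be positive")
    return math.ceil(target_users * (1 + headroom) / users_per_server)


# The slide's example: 5000 users per app server, 200,000 users to support.
print(servers_needed(200_000, 5000))                 # bare minimum
print(servers_needed(200_000, 5000, headroom=0.25))  # with an assumed 25% margin
```

The result is only as good as the measured capacity of a single server, which is why that measurement comes first.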
Lesson #4 – S*t Happens, Don’t Save On Real-Time Monitoring and Support
• We had a series of successful big events before this one
• We launched tens of servers just for the stress testing
• And yet we had two problems during the event
Why?
• Murphy is always (eventually) right…
• Because of a feature no one uses (see lesson #2 - KISS) that wasn’t active in the stress tests
• The specific usage of 9 languages caused unexpected load (see lesson #3 – stress real world scenarios)
Luckily the whole team was in monitoring mode and the issues were quickly handled on the fly.
Lesson #5 – Use The Cloud (startups)
• It’s Elastic, pay on demand
• Flexible when you don’t know your parameters
• Solution for affordable High Availability & Testing
• Focus on development
• I am not getting paid by Amazon – check others as well!
Summary - What To Remember?
• Scalability is the ability of a system to handle a growing amount of work with additional resources
• Think horizontal
• Keep It Simple (Stupid) – everything
• Stress test everything, focus on real world scenarios
• Real-time monitoring and support
• Cloud is great for start-ups
The End
• Questions? Comments? Consulting? Preguntas? 问题?
• Just Shy? Think you should be working in attracTV?
Contact me:
www.guytomer.com
Special Thanks (presentations, websites I “borrowed” from)
• Ask Bjørn Hansen
(http://groups.google.com/group/scalable)
• High Scalability blog http://highscalability.com/
• http://royal.pingdom.com
• Google images
• Entourage (http://www.hbo.com/entourage/index.html)