the challenges of live events scalability

THE CHALLENGES OF SUPPORTING ONLINE LIVE EVENTS WITH TV PARTICIPATION NUMBERS

Presentation for B.Sc students from IDC

By Guy Tomer, November 2011

A STARTUP PERSPECTIVE

Hello

• I’m Guy Tomer

• Founding and working in start-ups for the last 13 years

• Founder & CTO of attracTV for the last 4 years

• This presentation is about

• Building a scalable system for “a lot” of users

• More specifically, handling usage peaks of live TV events on the internet

• Even more specifically – how we tackle it as a small start-up

attracTV

Web-based self-service solution and tools for managing viewers’ engagement and interaction on the online screen

Social Information Advertisement eCommerce

Our Use Case – MTV European Music Awards

• One of the biggest online live streams ever

• Can’t expose precise numbers, but:

• 7 digits (> 1,000,000) – number of streams

• 6 digits (> 100,000) – number of concurrent users

• 5 digits (> 10,000) – number of users joining every minute at peak

• In addition:

• International event, 20 sites, viewers from > 150 countries

• 9 languages

What Are The Challenges

1. Scaling for these numbers

2. Handling very steep ramp-up

3. Big data

4. High availability

5. Testing & preparing for such numbers

6. The cost of the above – how to do it and still make

money

We’ll discuss mainly 1, 5 & 6

Some Big “Internet Scale” Examples

• Google Uses About 900,000 Servers

• (Map-Reduce) Google sorted a ten-petabyte input set in 6 hours and 27 minutes on 8,000 computers

• Facebook serves 1 trillion pages per month

• (2010) 30 billion – pieces of content (links, notes, photos, etc.) shared on Facebook per month

• (2010) 2 billion – the number of videos watched per day on YouTube

• Akamai, the “CDN to the stars”, has 95,811 servers (Q2 2011) in 1,000 networks and 70 countries

Challenge 1 – Handling The Scale

• We are prepared for 400,000 concurrent viewers

• HTTP polling every N seconds, 10 <= N <= 30

• This means ~20,000 HTTP R/S (requests per second), assuming an average interval of 20 seconds

• For comparison

• Stack Overflow recently reported 800 R/S

• Sify.com (leading portal in India) reported 3,900 R/S

• Jobs' death resulted in a record-breaking 10,000 tweets/s (they do have a lot more requests; that’s just to get a feel for the scale)
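The request-rate estimate on this slide can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function name is made up here, and it assumes every viewer polls exactly once per interval with requests spread evenly over time, which a real event will not quite do.

```python
def requests_per_second(concurrent_viewers: int, poll_interval_s: float) -> float:
    """Average request rate if every viewer polls once per interval.

    Assumption (not from the slides): polls are spread evenly over time.
    """
    if poll_interval_s <= 0:
        raise ValueError("poll_interval_s must be positive")
    return concurrent_viewers / poll_interval_s


VIEWERS = 400_000  # the capacity target from the slide

# The slide gives a polling interval between 10 and 30 seconds.
for interval in (10, 20, 30):
    rate = requests_per_second(VIEWERS, interval)
    print(f"poll every {interval:>2}s -> {rate:,.0f} R/S")
```

At the 20-second midpoint this gives the ~20,000 R/S quoted above; the fastest polling interval doubles it, so the 10-second case is the one worth sizing for.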

What Is Scalability

• From Wikipedia “Scalability is the ability of a system, network, or process, to handle growing amounts of work in a graceful manner or its ability to be enlarged to accommodate that growth.”

Performance ≠ Scalability

The fact that your code runs very fast for X users doesn’t mean your architecture supports 100*X users.

Vertical Scalability (scale up)

• “Get a bigger server”

• “Use faster CPUs”

• Cons

• Can only help so much (with bad scale/$ value)

• A server twice as fast is more than twice as expensive

• Pros

• Easier to manage fewer computers

• Can use virtualization

Horizontal Scalability (scale out)

• “Just add another box” (or another thousand or ...)

• Plan the architecture right first, do micro-optimizations later

• Pros

• Theoretically unlimited

• Works well with the elasticity of cloud services

• Cons

• More complex to manage

• More complex programming models

Challenge #2 – Steep Ramp-up

• Live Event - Everyone comes at the same time

• A car that can drive 250 km/h can’t necessarily do 0-100 km/h in 4 seconds

Standard website example (Wikimedia) ≠ Steep ramp-up

Challenge #3 – Big Data

• From Wikipedia: “Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools”

• One of the biggest hypes in the industry today

• During this event we had ~10,000,000 records written to our analytics system per hour

• We’re not “Big Data” yet but it’s coming

Challenge #4 – High Availability

“High availability refers to a system or component that is continuously operational for a desirably long length of time.”

• We need to meet a service level of 99.9%

• Backup and failover systems are expensive

• The cloud helps us here

High availability in the cloud
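To get a feel for what a 99.9% service level allows, the sketch below converts an availability percentage into a downtime budget. The function name and the 30-day and 365-day periods are assumptions chosen for illustration; the slides do not say over which period the service level is measured.

```python
def allowed_downtime_minutes(availability_pct: float, period_hours: float) -> float:
    """Downtime budget, in minutes, for a given availability over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_hours * 60 * (1 - availability_pct / 100)


SERVICE_LEVEL = 99.9  # from the slide

# Assumed measurement periods: a 30-day month and a 365-day year.
month_budget = allowed_downtime_minutes(SERVICE_LEVEL, 30 * 24)
year_budget = allowed_downtime_minutes(SERVICE_LEVEL, 365 * 24)

print(f"per month: {month_budget:.1f} minutes")
print(f"per year:  {year_budget / 60:.2f} hours")
```

Under these assumptions the budget is roughly 43 minutes a month, which is why a single failed server during a live event cannot be left to manual recovery.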

Challenge #5 – Testing

• Simulating hundreds of thousands of concurrent users is not trivial

• Requires tens of strong servers

• Very difficult to collect the data

• The cloud helps us here

Challenge #6 – Handling The Costs Of Such An Event (Hint - Elasticity)

• For production we used ~50 servers, each with 4 cores at 2 GHz and 15 GB RAM (m1.xlarge)

• Some options (rough estimates) for this are:

• Buy - ~$3,500 per box = $175,000. Not for us…

• Dedicated server for a month - ~$1,000 per instance = $50,000

• VPS (virtual private server) monthly - ~$300 per box = $15,000

• Solution: cloud on-demand (Amazon AWS) - ~$500 per instance = $25,000 for a month… BUT… no need to take it for a month: we activate it on demand for 12 hours and it costs $416!
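The cost comparison above can be checked with a short calculation. This is a sketch using only the rough per-server prices from the slide; the dictionary keys, function names and the 30-day month used to derive an hourly rate are assumptions made for the example.

```python
SERVERS = 50               # from the slide
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month

# Rough per-server prices from the slide, in dollars (monthly, or one-off for "buy").
PRICE_PER_SERVER = {
    "buy": 3500,
    "dedicated": 1000,
    "vps": 300,
    "cloud_on_demand": 500,
}


def fleet_cost(price_per_server: float, servers: int = SERVERS) -> float:
    """Cost of the whole fleet at a flat per-server price."""
    return price_per_server * servers


def on_demand_cost(monthly_price: float, hours: float, servers: int = SERVERS) -> float:
    """Cost of renting the fleet for a few hours at a pro-rated hourly rate."""
    hourly = monthly_price / HOURS_PER_MONTH
    return hourly * hours * servers


for option, price in PRICE_PER_SERVER.items():
    print(f"{option:<16} ${fleet_cost(price):>10,.0f}")

event = on_demand_cost(PRICE_PER_SERVER["cloud_on_demand"], hours=12)
print(f"12-hour event    ${event:>10,.2f}")
```

Pro-rating $500 a month over 12 hours for 50 servers comes to about $416.67, matching the figure on the slide once rounded down.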

Our #1 Lesson - Think Horizontal!

• Why not vertical?

• We don’t want it to be our business’s bottleneck at any point in time

• We don’t want to buy giant servers

• We wanted a cheap start

• We want elasticity

• We don’t want to buy anything at this point

• How? (deserves a separate lecture)

• Everything in the architecture

• No state shared between the web/app servers

• No relation between the # of users and the load on the database
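One common way to decouple database load from the number of users is for each app server to answer polls from a short-lived local cache, so the database is hit once per refresh interval rather than once per request. The slides do not say how attracTV did it, so the sketch below is purely an assumption; TTLCache, FakeDatabase and the 5-second lifetime are hypothetical names and values.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class TTLCache(Generic[T]):
    """Caches one value and reloads it at most once per ttl_s seconds.

    Hypothetical helper, single-threaded for brevity (no locking).
    """

    def __init__(self, loader: Callable[[], T], ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl_s = ttl_s
        self._clock = clock
        self._value: Optional[T] = None
        self._expires_at = float("-inf")  # forces a load on first use

    def get(self) -> T:
        now = self._clock()
        if now >= self._expires_at:
            self._value = self._loader()       # the only place the DB is touched
            self._expires_at = now + self._ttl_s
        return self._value  # type: ignore[return-value]


class FakeDatabase:
    """Stand-in for a real database; counts how often it is queried."""

    def __init__(self) -> None:
        self.calls = 0

    def load_event_state(self) -> dict:
        self.calls += 1
        return {"active_poll": "best live act", "version": self.calls}


db = FakeDatabase()
cache = TTLCache(db.load_event_state, ttl_s=5.0)

# Simulate a burst of viewer polls hitting one app server.
for _ in range(10_000):
    cache.get()

print(f"viewer requests: 10000, database queries: {db.calls}")
```

The cache lives inside each server and is never shared, so adding servers adds one database query per server per refresh, independent of how many viewers are polling.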

Lesson #2 KISS

• Keep It Simple Stupid

• Your system architecture

• Your code

• Your features

• Your business model

• If you don’t, you won’t scale (from personal experience)

Hug out all the complexity in your system

Lesson #3 – Load Test Everything, Focus On Real World Usage Patterns

• We did massive stress testing

• We launched tens of servers just for stress testing

• Automated with JMeter and monitored the same way as production

Why?

• The only way to test your scaling capabilities

• Reading the code and running manual tests are irrelevant here

• Measure the capacity of a single app server

• Test the specific ramp-up scenario because

• Example: 1 app server = 5,000 users; we need to support 200,000 users, so we need to prepare at least 40 servers
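The sizing rule in the example above is a ceiling division of the target load by the measured capacity of one server. The sketch below shows it; the function name and the optional headroom parameter are assumptions added for illustration, not something the slides specify.

```python
import math


def servers_needed(target_users: int, users_per_server: int,
                   headroom: float = 0.0) -> int:
    """Minimum number of servers for target_users.

    headroom is an assumed safety margin, e.g. 0.25 for 25% spare capacity.
    """
    if users_per_server <= 0:
        raise ValueError("users_per_server must be positive")
    return math.ceil(target_users * (1 + headroom) / users_per_server)


# Numbers from the slide: one app server handles 5,000 users, target is 200,000.
print(servers_needed(200_000, 5_000))                 # 40
print(servers_needed(200_000, 5_000, headroom=0.25))  # 50
```

The per-server figure only holds if it was measured under the real usage pattern, including the steep ramp-up, which is the reason for load testing it rather than estimating it.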

Lesson #4 – S*t Happens, Don’t Save On Real-Time Monitoring and Support

• We had a series of successful big events before this one

• We launched tens of servers just for the stress testing

• And yet we had two problems during the event

Why?

• Murphy is always (eventually) right…

• Because of a feature no one uses (see lesson #2 - KISS) that wasn’t active in the stress tests

• The specific usage of 9 languages caused unexpected load (see lesson #3 – stress real world scenarios)

Luckily the whole team was in monitoring mode and the issues were quickly handled on the fly.

Lesson #5 – Use The Cloud (startups)

• It’s Elastic, pay on demand

• Flexible when you don’t know your parameters

• Solution for affordable High Availability & Testing

• Focus on development

• I am not getting paid by Amazon – check others as well!

Summary - What To Remember?

• Scalability is the ability of a system to handle a growing amount of work by adding resources

• Think horizontal

• Keep It Simple (Stupid) – everything

• Stress test everything, focus on real world scenarios

• Monitor and provide real-time support

• Cloud is great for start-ups

The End

• Questions? Comments? Consulting Preguntas?

问题 ?

• Just Shy? Think you should be working in attracTV?

Contact me:

[email protected]

www.guytomer.com


Special Thanks (presentations, websites I “borrowed” from)

• Ask Bjørn Hansen (http://groups.google.com/group/scalable)

• High Scalability blog http://highscalability.com/

• http://royal.pingdom.com

• Google images

• Entourage (http://www.hbo.com/entourage/index.html)
