SCALING

Òscar Vilaplana @grimborg http://oscarvilaplana.cat

WHAT’S THIS ABOUT?

People

Technology

Tools

PEOPLE

Care

Focus

Automate & Test.

Shared brain

Finish & DRY.

TECH

Design to clone

Separate pieces

API

Offload everything

Measure

VIRTUAL QUEUE

[Diagram: virtual queue with multiple queue instances.]

TECH

• Design to clone

• Separate pieces

• API

• Offload everything

• Measure

TYPES OF TASKS

• Realtime

• ASAP (async)

• When you have time (async)

INSTAGRAM’S FEED

• Redis queue per follower.

• New media: push to queues

• Small chained tasks

INSTAGRAM’S FEED

[Diagram: followers harro, wouter, orestis, siebejan, oscar; labels "batch" and "schedule next".]

SMALL TASKS

• 10k followers per task

• < 2s

• Finer-grained load balancing

• Lower penalty of failure/reload
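The fan-out in small chained tasks can be sketched roughly as below. This is a minimal in-memory illustration, not Instagram's code: a dict of deques stands in for the per-follower Redis queues, and the names `feeds`, `fan_out_batch`, and `fan_out` are hypothetical. Each call handles one batch and returns where the next batch starts, which is what lets a scheduler chain the next small task.

```python
from collections import defaultdict, deque

BATCH_SIZE = 10_000  # followers handled per task, as on the slide

# Stand-in for one Redis queue per follower (assumption: in-memory only).
feeds = defaultdict(deque)


def fan_out_batch(media_id, followers, start=0, batch_size=BATCH_SIZE):
    """Push media_id to one batch of follower feeds.

    Returns the offset of the next batch, or None when all followers
    are done, so a scheduler can chain the next small task.
    """
    batch = followers[start:start + batch_size]
    for follower_id in batch:
        feeds[follower_id].appendleft(media_id)
    next_start = start + batch_size
    return next_start if next_start < len(followers) else None


def fan_out(media_id, followers, batch_size=BATCH_SIZE):
    """Run the chain of small tasks in-process; returns the task count."""
    tasks_run = 0
    start = 0
    while start is not None:
        start = fan_out_batch(media_id, followers, start, batch_size)
        tasks_run += 1
    return tasks_run
```

With a task queue, `fan_out_batch` would enqueue its successor instead of returning an offset; a failed or reloaded worker then only repeats one batch.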

CELERY: REDIS

• Good: Fast

• Bad:

• Polling for task distribution

• Messy non-synchronous replication

• Memory limits task capacity

CELERY: BEANSTALK

• Good:

• Fast

• Push to consumers

• Writes to disk

• Bad:

• No replication

• Only useful for Celery

CELERY: RABBITMQ

• Fast

• Writes to disk

• Low-maintenance synchronous replication

• Excellent Celery compatibility

• Supports other use cases

RESERVATIONS

• UI

• Room locking

• Room availability

• Registration manager

• Email, PDF invoice

• Payment

• Login

• …

WE DON’T DO THIS

    def do_everything(request):
        hotel_id = request.GET.hotel_id
        room_number = request.GET.room_number
        # One handler locks the room, reserves it, and starts the payment.
        with room_mutex(hotel_id, room_number):
            room = (session.query(Room)
                    .filter(Room.hotel_id == hotel_id)
                    .filter(Room.room_number == room_number)
                    .one())
            if not room.available:
                return Response("Room not available", template=room_template)
            reservation = Reservation(client=request.client, room=room)
            session.add(reservation)
            room.available = False
            price = ...  # price calculation (elided on the slide)
            payment = Payment(reservation=reservation, price=price)
            session.add(payment)
            session.commit()
            url = payment.get_psp_url()
        return Redirect(url)

BUT WE DO THIS

• Frontend UI

• Locking rooms

• Calculating room availability

• Temporarily locking rooms

• Payment processing

• Mail

• PDF invoice generation
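As one illustration of the "temporarily locking rooms" responsibility, here is a minimal sketch of a lock that expires by itself. It assumes a single process with in-memory state; the class name `RoomLocks`, the `ttl_seconds` parameter, and the injectable `clock` are all hypothetical and not taken from the talk.

```python
import time


class RoomLocks:
    """In-memory temporary locks keyed by (hotel_id, room_number)."""

    def __init__(self, ttl_seconds=600, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested
        self._locks = {}    # (hotel_id, room_number) -> (client_id, expires_at)

    def acquire(self, hotel_id, room_number, client_id):
        """Lock the room for client_id; return True on success."""
        key = (hotel_id, room_number)
        now = self.clock()
        holder = self._locks.get(key)
        if holder is not None and holder[1] > now and holder[0] != client_id:
            return False  # someone else holds an unexpired lock
        self._locks[key] = (client_id, now + self.ttl_seconds)
        return True

    def release(self, hotel_id, room_number, client_id):
        """Release the lock if client_id still holds it."""
        key = (hotel_id, room_number)
        holder = self._locks.get(key)
        if holder is not None and holder[0] == client_id:
            del self._locks[key]
            return True
        return False
```

The expiry means an abandoned checkout frees the room without a cleanup job. The sketch is not thread-safe and keeps state in one process, so it only shows the behavior, not a deployable locking service.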

BUT WE CAN SCALE!

SCALE DB: HARD

• Slaves

• Master-Master?

• Sharding?

SCALING

MINOR SCALE

MAJOR SCALE

FRONTEND

[Diagram: user, two Everything Frontend instances, external payment providers, master database, read slaves.]

SPLIT

• Responsibility

• Stateful/stateless

• Type of system

TYPES OF SYSTEMS

• Unique (mutex, datastore)

• Multiple

TYPES OF TASKS

• Realtime

• ASAP (async)

• When you have time (async)

SPLIT THIS

[Diagram: user, two Everything Frontend instances, master database, read slaves.]

AUTONOMOUS SYSTEMS

[Diagram: User, UI, Reservations Manager, Payment, Locking, Invoice PDF, Mailer, Session Storage, Data Warehouse, Reporting, Configuration, Payout.]

CLONABILITY

[Diagram: Frontend; then user, two Everything Frontend instances, master database, read slaves.]

WHAT’S IN AN EASY STEP

As little change as possible.

Reuse.

Unintrusive.

Measure.

Go in the right direction.

SMALL STEPS

PROBLEMS?

Oversells, Configuration, Reporting, Payout

[Diagram: two Everything Frontend instances.]

SMALL STEPS

PROBLEMS?

Oversells, Configuration, Reporting, Payout, Sessions

[Diagram: Room Availability, Lock, Read, and two Everything Frontend instances.]

ISOLATED SYSTEM

Best technology

Decoupled

API

Testable


Oversells, Configuration, Reporting, Payout, Sessions

[Diagram: Everything Frontend, Config Backend, Settings.]

INITIAL SYSTEM

[Diagram: Everything Frontend.]

INITIAL SYSTEM (MODIFIED)

[Diagram: Everything Frontend, Sales, Sync.]

INITIAL SYSTEM (MODIFIED)

[Diagram: Sales Backend.]

Oversells, Configuration, Reporting, Payout, Sessions

[Diagram: Everything Frontend, Sales Backend, Sales, Main DB.]

Oversells, Configuration, Reporting, Payout, Sessions

[Diagram: Session Storage and two Everything Frontend instances.]

WHEN?

• Difficult.

• Measure everything.

• Find patterns.

• Define thresholds.

• Design: address as risk.

• Don’t overengineer, but don’t ignore it either.
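"Measure everything" and "define thresholds" can be made concrete with a small sketch like the one below. It is only an illustration under assumptions: the class `ThresholdMonitor`, the sliding-window size, and the choice of a nearest-rank 95th percentile are hypothetical, not something the talk prescribes.

```python
from collections import deque


class ThresholdMonitor:
    """Keep a sliding window of samples and flag when p95 exceeds a threshold."""

    def __init__(self, threshold, window=100):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # oldest samples fall off

    def record(self, value):
        self.samples.append(value)

    def p95(self):
        """Nearest-rank 95th percentile of the current window."""
        if not self.samples:
            return None
        ordered = sorted(self.samples)
        rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
        return ordered[rank - 1]

    def breached(self):
        """True when the windowed p95 is above the threshold."""
        p95 = self.p95()
        return p95 is not None and p95 > self.threshold
```

A breach is a signal to look at the design risk identified earlier, not an automatic trigger to rebuild.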

EVENTBRITE

• 2012: $600M ticket sales

• Accumulated: $1B

TECHNOLOGY

• Monitoring: Nagios, Ganglia, Pingdom

• Email: offloaded to StrongMail

• Load-balanced read slave pool

• Feature flags

• Automated server configuration and release with Puppet and Jenkins

TECHNOLOGY

• Feature flags

• Develop on Vagrant

• Celery + RabbitMQ

• Virtual customer queue

• Big data for reporting, fraud, spam, event recommendations

TECHNOLOGY

• Hadoop

• Cassandra

• HBase

• Hive

• Separated into independent services

TIPS

• Instrument and monitor everything

• Lean

HOW BIG?

• 2 GB/day of database transactions

• 3.5 TB/day of social data analyzed

• 15 GB/day of logs

ORDER PROCESSOR

• Pub/sub queue with Cassandra and Zookeeper

PUBLISHING

Publisher

Get queue lock and last batch id

Create new batch: “process orders 10, 11, 12”

Store batch id, release lock

SUBSCRIBING

Subscriber

Get my latest processed batch id

Store result

Update my latest processed batch id
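The publishing and subscribing steps above can be sketched in memory as follows. This is not Eventbrite's implementation: a `threading.Lock` stands in for the Zookeeper queue lock, plain dicts stand in for Cassandra, and the names `OrderQueue`, `publish`, and `consume` are hypothetical.

```python
import threading


class OrderQueue:
    """In-memory stand-in for the batch store and the queue lock."""

    def __init__(self):
        self.lock = threading.Lock()  # stands in for the Zookeeper lock
        self.batches = {}             # batch id -> list of order ids
        self.last_batch_id = 0
        self.processed = {}           # subscriber -> latest processed batch id
        self.results = {}             # (subscriber, batch id) -> result

    def publish(self, order_ids):
        """Take the lock, create the next batch, store its id, release."""
        with self.lock:
            batch_id = self.last_batch_id + 1
            self.batches[batch_id] = list(order_ids)
            self.last_batch_id = batch_id
        return batch_id

    def consume(self, subscriber, handler):
        """Process every batch this subscriber has not handled yet."""
        latest = self.processed.get(subscriber, 0)
        handled = []
        while latest < self.last_batch_id:
            latest += 1
            self.results[(subscriber, latest)] = handler(self.batches[latest])
            self.processed[subscriber] = latest  # update only after storing
            handled.append(latest)
        return handled
```

Because each subscriber tracks its own latest batch id, subscribers progress independently and a restarted subscriber resumes from where it stopped.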

SCALING STORAGE

• Move to NoSQL

• Aggressively move queries to slaves

• Different indexes per slave

• Better hardware

• Optimized tables for large, heavily used datasets

EMAIL ADDRESSES

• Users have many email addresses.

• Lookup by email, join to users table

FIRST ATTEMPT

    CREATE TABLE `user_emails` (
      `id` int NOT NULL AUTO_INCREMENT,
      `email_address` varchar(255) NOT NULL,
      ... -- other columns about the user
      `user_id` int, -- foreign key to users
      KEY (`email_address`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
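A lookup by email that joins to the users table might look like the sketch below. It runs on SQLite through Python's `sqlite3` module rather than MySQL/InnoDB, so storage and index behavior differ from the slides; the `users` table, its `name` column, the index name, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE user_emails (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        email_address TEXT NOT NULL,
        user_id INTEGER REFERENCES users(id)
    );
    -- Secondary index on the email, like KEY (email_address) above.
    CREATE INDEX idx_user_emails_address ON user_emails (email_address);
""")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")
conn.executemany(
    "INSERT INTO user_emails (email_address, user_id) VALUES (?, ?)",
    [("ada@example.com", 1), ("ada@work.example", 1)],
)


def find_user_by_email(email):
    """Find the email row, then join to the owning user."""
    return conn.execute(
        """
        SELECT users.id, users.name
        FROM user_emails
        JOIN users ON users.id = user_emails.user_id
        WHERE user_emails.email_address = ?
        """,
        (email,),
    ).fetchone()
```

Every such lookup goes through the secondary index on the email before reaching the user row, which is the cost the following slides examine.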

FIRST ATTEMPT

LOOKUP

CAN IT BE IMPROVED?

INDEX VS PK

• InnoDB: B+trees, O(log n)

• Known user id: index on email not needed.

• Small win on lookup: O(1)

• Big win on not storing the index.

INNODB INDEXES

HASH TABLE

DISQUS

• >165K messages per second

• <10ms latency

• 1.3B unique visitors

• 10B page views

• 500M users in discussions

• 3M communities

• 25M comments

ORIGINAL REALTIME BACKEND

• Python + gevent

• NginxPushStream

• Network IO: great

• CPU: choking at peaks

• <15ms latency

CURRENT REALTIME BACKEND

• Go

• Handles all users

• Normal load: 3200 connections/machine/sec

• <10ms latency

• Only 10%-20% CPU

CURRENT REALTIME BACKEND

[Diagram: Workers; Subscribed to results; Push result to user; NginxPushStream.]

TESTING

• Test with real traffic

• Measure everything

LESSONS

• Do work once, distribute results.

• Most likely to fail: your code. Don’t reinvent. Keep team small.

• End-to-end ACKs are expensive. Avoid.

• Understand use cases when load testing.

• Tune architecture to scale.
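The first lesson, "do work once, distribute results", can be illustrated with a small sketch. It is an assumption-laden toy, not Disqus's backend: the class `ResultFanout`, its `render` callable, and the channel names are hypothetical, and everything runs in one process.

```python
class ResultFanout:
    """Compute each result once, then hand the same result to every subscriber."""

    def __init__(self, render):
        self.render = render      # the expensive work, done once per message
        self.subscribers = {}     # channel -> list of callbacks
        self.render_calls = 0

    def subscribe(self, channel, callback):
        self.subscribers.setdefault(channel, []).append(callback)

    def publish(self, channel, message):
        """Render once and deliver to all subscribers; returns delivery count."""
        callbacks = self.subscribers.get(channel, [])
        if not callbacks:
            return 0  # nobody listening: skip the work entirely
        result = self.render(message)
        self.render_calls += 1
        for callback in callbacks:
            callback(result)
        return len(callbacks)
```

The cost of `render` is paid once per message regardless of how many users are subscribed; only the cheap delivery step scales with the audience.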

LEARN MORE

• Instagram

• Braintree

• highscalability.com

• VelocityConf (YouTube; Nov 2014 in Barcelona?)

QUESTIONS? ANSWERS?

THANKS!
