Tales Of The Black Knight - Keeping EverythingMe running

Upload: dvir-volk

Post on 02-Jul-2015

Category:

Software


DESCRIPTION

Tips, tricks and strategies we use at EverythingMe to scale and keep our servers always running, no matter what

TRANSCRIPT

Page 1: Tales Of The Black Knight - Keeping EverythingMe running
Page 2: Tales Of The Black Knight - Keeping EverythingMe running

• “One Tap Happiness”
• Smart and Contextual Launcher / Phone UI
• App Organization
• In-Phone Search
• App Search
• App Recommendations (+ sponsored)
• Contextual Content Discovery (cards)

What this means:
• Lots of algorithms and data from the servers
• Millions of downloads, hundreds of sustained requests per second
• 1B use events collected per month
• Fucking up means fucking up the users’ phones

Page 3: Tales Of The Black Knight - Keeping EverythingMe running

• 100% Cloud-based (EC2)
• 100% Automated Infrastructure
• Third-party software is FOSS only
• Continuous Deployment
• Loads of metrics and logging
• Databases: Redis, MySQL, Cassandra
• Languages: Python, Go, C++ (Java on Android)
• Important Tools: Tornado, Thrift, Scribe, Statsd, Kibana, Celery, ZooKeeper, Chef, Docker

Page 4: Tales Of The Black Knight - Keeping EverythingMe running

• Servers may be terminated at any moment
• Disks may fail at any moment
• The LAN may fail or hiccup
• Services you query may crash or be restarted
• Your code may crash at any moment
• The Client is an idiot that might send garbage
• The Server is an idiot that might return garbage
• We Are All Idiots
• We should accept and embrace all that

Page 5: Tales Of The Black Knight - Keeping EverythingMe running

• Separation into many small services
• No SPOFs
• Little reliance on disk
• Aspire for statelessness
• Dynamic Endpoint Management
• In App Failover and LB
• Aggressive Timeouts
• Multi tiered alerting
• Sane Fallback Values
• Graceful Degradation

Page 6: Tales Of The Black Knight - Keeping EverythingMe running

[Diagram labels: API, Search, Ads, Images, Geo, Context, Auto-Complete, Redis (x4), C*]

● Thin API Layer
  ○ Input validation
  ○ Connection funnelling
● Many smaller services
● Many Redis instances
  ○ “database” = instance
● Thrift for internal APIs
● Deploys are less scary
● Scaling is easier
● Well defined contracts

Page 7: Tales Of The Black Knight - Keeping EverythingMe running

    if machine_is_down:
        # All is well
        return KeepFighting
    elif fucked_services_count == 1:
        log.info("Tis But a Scratch")
        return KeepFighting
    elif fucked_services_count == 2:
        log.info("Just a Flesh Wound!")
        return KeepFighting
    elif num_running_services >= 2:
        log.info("I'll Bite Your Legs Off!")
        return KeepFighting
    else:
        log.info("All Right, We'll Call It A Draw")
        return SwitchDataCenter_PLZ_KTXBAI

Page 8: Tales Of The Black Knight - Keeping EverythingMe running

• No Database Master ⇒ remain read only
• No Queue ⇒ Write to log for future processing
• No Service X ⇒ Return a default response, not an exception
• No MySQL ⇒ Everything is ready for serving in Redis anyway
• No internal service ⇒ fall back to external service
• Etc...
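The "default response, not an exception" rule can be sketched as a small wrapper. This is only an illustration under assumptions: `ServiceUnavailable`, `call_with_fallback`, the two fake recommendation functions and the empty-list default are hypothetical names, not EverythingMe code.

```python
import logging

log = logging.getLogger("fallback_sketch")


class ServiceUnavailable(Exception):
    """Raised by a (hypothetical) service client when the backend is unreachable."""


def call_with_fallback(call, default):
    """Run `call`; on failure, log it and return `default` instead of raising."""
    try:
        return call()
    except ServiceUnavailable as exc:
        log.warning("service call failed, serving default: %s", exc)
        return default


def healthy_recommendations():
    # Stand-in for a successful remote call.
    return ["app-a", "app-b"]


def broken_recommendations():
    # Stand-in for a remote call to a service that is currently down.
    raise ServiceUnavailable("recommendations service is down")


DEFAULT_RECOMMENDATIONS = []  # assumed "sane default": show nothing rather than fail

print(call_with_fallback(healthy_recommendations, DEFAULT_RECOMMENDATIONS))
print(call_with_fallback(broken_recommendations, DEFAULT_RECOMMENDATIONS))
```

The caller never sees the exception; the failure is still logged so that alerting can pick it up.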

Page 9: Tales Of The Black Knight - Keeping EverythingMe running

• Multi Edge / single Central DCs
• Geo-DNS based
• Edges are read only
• Central is write only
  • API / Logs
• All edges are data-symmetrical
• Any Edge may be taken out
• Central may be taken out without service disruption

Page 10: Tales Of The Black Knight - Keeping EverythingMe running

• ZooKeeper manages an endpoint tree
• Watchdog registers services - no self announce
• Changes to endpoints are published to all services
• Automatic switching and adding of endpoints
• Facilitates no-downtime deploys with downtime :)
• A dead machine is deleted from ZK automatically
• A static snapshot of endpoints kept on all machines
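The publish-and-snapshot flow above can be illustrated without a real ZooKeeper. The sketch below uses an in-memory registry as a stand-in; `EndpointRegistry`, `load_snapshot`, the JSON snapshot format and the addresses are all assumptions made for the example, not the actual tooling.

```python
import json
import os
import tempfile


class EndpointRegistry:
    """In-memory stand-in for the endpoint tree; NOT a ZooKeeper client."""

    def __init__(self, snapshot_path):
        self.snapshot_path = snapshot_path
        self.endpoints = {}   # service name -> sorted list of "host:port"
        self.listeners = []   # callbacks invoked on every change

    def subscribe(self, callback):
        self.listeners.append(callback)

    def _publish(self, service):
        # Persist a static snapshot first, then notify every subscriber.
        with open(self.snapshot_path, "w") as fh:
            json.dump(self.endpoints, fh)
        for callback in self.listeners:
            callback(service, list(self.endpoints.get(service, [])))

    def register(self, service, address):
        # Called by a watchdog process, not by the service itself.
        current = set(self.endpoints.get(service, []))
        current.add(address)
        self.endpoints[service] = sorted(current)
        self._publish(service)

    def remove(self, service, address):
        current = self.endpoints.get(service, [])
        self.endpoints[service] = [a for a in current if a != address]
        self._publish(service)


def load_snapshot(path):
    """Fallback for when the registry itself cannot be reached."""
    with open(path) as fh:
        return json.load(fh)


snapshot = os.path.join(tempfile.mkdtemp(), "endpoints.json")
registry = EndpointRegistry(snapshot)
seen = []
registry.subscribe(lambda service, addrs: seen.append((service, addrs)))
registry.register("search", "10.0.0.1:9090")
registry.register("search", "10.0.0.2:9090")
registry.remove("search", "10.0.0.1:9090")   # e.g. the machine died
print(seen[-1])
print(load_snapshot(snapshot))
```

Because the snapshot is written on every change, a service that starts while the registry is down can still find its last known peers.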

Page 11: Tales Of The Black Knight - Keeping EverythingMe running

• Internal “learning” Load Balancing Connection Pool
• Protocol Agnostic (kinda...)
• Python magic - no code changes
• Silent failovers
• Automatic fast banning / exploration
• Why in-app?
  • Application Aware
  • Less latency
  • EP management support
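A rough sketch of what fast banning with silent failover might look like inside the application. The real pool is not shown in the slides, so `BanningPool`, its round-robin choice, the 5-second ban and the fake request function are assumptions for illustration only.

```python
import time


class BanningPool:
    """Round-robin over endpoints, temporarily banning the ones that fail."""

    def __init__(self, endpoints, ban_seconds=5.0, clock=time.monotonic):
        self.endpoints = list(endpoints)
        self.ban_seconds = ban_seconds
        self.clock = clock
        self.banned_until = {}  # endpoint -> time at which its ban expires
        self._next = 0

    def _candidates(self):
        now = self.clock()
        live = [e for e in self.endpoints if self.banned_until.get(e, 0.0) <= now]
        # If everything is banned, try all endpoints rather than give up.
        return live or list(self.endpoints)

    def call(self, request):
        """Try candidates in turn; ban failures and fail over silently."""
        candidates = self._candidates()
        start = self._next
        self._next += 1
        last_error = None
        for i in range(len(candidates)):
            endpoint = candidates[(start + i) % len(candidates)]
            try:
                return request(endpoint)
            except ConnectionError as exc:
                # Ban now; the endpoint is explored again once the ban expires.
                self.banned_until[endpoint] = self.clock() + self.ban_seconds
                last_error = exc
        raise last_error


def fake_request(endpoint):
    # Stand-in for a real RPC: the first endpoint always refuses connections.
    if endpoint == "10.0.0.1:9090":
        raise ConnectionError("refused")
    return "ok from " + endpoint


pool = BanningPool(["10.0.0.1:9090", "10.0.0.2:9090"])
print(pool.call(fake_request))
print(sorted(pool.banned_until))
```

The caller only sees the successful response; the failed endpoint is skipped on later calls until its ban runs out.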

Page 12: Tales Of The Black Knight - Keeping EverythingMe running

• Proper timeouts are a key factor for a distributed system
• They should be as low as possible while avoiding false positives
• Internal service timeouts should be < 50ms
• Client timeouts can be rather big to support retries
• Without proper timeouts any link in the chain can bring you down
• Log them but try to recover
• Don’t forget they add up!
• Bad Timeout == Point Of Failure
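To make "they add up" concrete, here is a small sketch of a per-request time budget. `Deadline` and `worst_case_latency_ms` are hypothetical helpers, and the 40 ms cap is just an example value under the 50 ms guideline above.

```python
import time


class Deadline:
    """Tracks how much of a request's total time budget is left."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self.clock = clock
        self.expires_at = clock() + budget_seconds

    def remaining(self):
        return max(0.0, self.expires_at - self.clock())

    def timeout_for(self, per_call_cap):
        """Timeout for the next call: the cap, or whatever is left if less."""
        return min(per_call_cap, self.remaining())


def worst_case_latency_ms(per_call_timeouts_ms, retries_per_call=0):
    """Upper bound for sequential calls where each may be retried."""
    return sum(t * (1 + retries_per_call) for t in per_call_timeouts_ms)


# Three sequential internal calls, 40 ms timeout each, one retry each.
print(worst_case_latency_ms([40, 40, 40], retries_per_call=1))  # 240

deadline = Deadline(budget_seconds=0.2)
print(deadline.timeout_for(0.04) <= 0.04)  # True
```

Three small timeouts with a single retry each already allow a quarter of a second in the worst case, which is why the client-facing timeout has to be much larger than any internal one.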

Page 13: Tales Of The Black Knight - Keeping EverythingMe running

• The obvious dark side of all this
• Survival Strategies
  • Separate non time-critical API calls
  • Client Side Backoffs
  • Selective Failover Retries
  • Capacity barriers in internal services
    • Tune well
    • Fail fast!
  • Return sane defaults, not errors
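Two of these strategies are easy to sketch: client-side backoff and a capacity barrier that fails fast. The code below is a minimal illustration, assuming exponential backoff with jitter and a semaphore-based barrier; the slides do not say which schemes are actually used, and all names and parameter values here are made up.

```python
import random
import threading


def backoff_delays(base=0.1, factor=2.0, cap=5.0, attempts=5, rng=random.random):
    """Exponential backoff with jitter: delay i is rng() * min(cap, base * factor**i)."""
    return [rng() * min(cap, base * factor ** i) for i in range(attempts)]


class CapacityBarrier:
    """Admit at most `limit` concurrent requests; reject the rest immediately."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def run(self, handler, default):
        # Non-blocking acquire: fail fast instead of queueing behind slow work.
        if not self._slots.acquire(blocking=False):
            return default  # a sane default, not an error
        try:
            return handler()
        finally:
            self._slots.release()


barrier = CapacityBarrier(limit=1)
# The inner call arrives while the outer one still holds the only slot.
print(barrier.run(lambda: barrier.run(lambda: "served", "busy-default"), "busy-default"))
print(barrier.run(lambda: "served", "busy-default"))
print(backoff_delays(rng=lambda: 1.0))
```

Rejecting work at the barrier keeps a slow dependency from piling up requests, and the jittered delays keep retrying clients from arriving in synchronized waves.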

Page 14: Tales Of The Black Knight - Keeping EverythingMe running

• We are constantly improving our infrastructure
• Our DevOps team are doing that, not maintaining servers
• We accept that all solutions are temporary
• Start-up infrastructure is always a compromise
• We embrace Post-Mortems as an opportunity to improve

Page 15: Tales Of The Black Knight - Keeping EverythingMe running

Want to improve our infra? We’re hiring ;)

Ping me: Twitter: @dvirsky

[email protected]

Page 16: Tales Of The Black Knight - Keeping EverythingMe running

• MySQL/Redis data abstraction library
• Objects can be saved or loaded from either
• MySQL is write only, Redis read only
• MySQL can be down without disruption
• Only MySQL is replicated between DCs
• Automatic migrations to Redis
• Spartacus - a pseudo MySQL slave notifies on changes

[Diagram labels: Central (MySQL, Redis); Edge (MySQL, Spartacus, Redis)]
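The write-to-MySQL, read-from-Redis split can be sketched with dict-backed stand-ins. Nothing below talks to a real database: `FakeMySQL`, `FakeRedis` and `ObjectStore` are hypothetical, and the change listener only plays the role the slide gives to Spartacus.

```python
class FakeMySQL:
    """Dict-backed stand-in for the write-side store; not a real MySQL client."""

    def __init__(self):
        self.rows = {}
        self.listeners = []  # change listeners, playing the Spartacus role

    def save(self, key, value):
        self.rows[key] = value
        for notify in self.listeners:
            notify(key, value)


class FakeRedis:
    """Dict-backed stand-in for the read-side store; not a real Redis client."""

    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


class ObjectStore:
    """Writes go to MySQL only; reads are served from Redis only."""

    def __init__(self, mysql, redis):
        self.mysql = mysql
        self.redis = redis
        # Every change written to MySQL is copied into Redis by the listener.
        mysql.listeners.append(redis.set)

    def save(self, key, value):
        self.mysql.save(key, value)

    def load(self, key):
        return self.redis.get(key)


store = ObjectStore(FakeMySQL(), FakeRedis())
store.save("app:1", {"name": "Launcher"})
store.mysql = None            # simulate MySQL going away
print(store.load("app:1"))    # reads keep working from Redis
```

Since the read path never touches MySQL, losing it only blocks new writes.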