Tales of the Black Knight - Keeping EverythingMe Running

Post on 02-Jul-2015

Category:

Software

DESCRIPTION

Tips, tricks and strategies we use at EverythingMe to scale and keep our servers always running, no matter what

TRANSCRIPT

• “One Tap Happiness”
• Smart and Contextual Launcher / Phone UI
• App Organization
• In-Phone Search
• App Search
• App Recommendations (+ sponsored)
• Contextual Content Discovery (cards)

What this means:
• Lots of algorithms and data from the servers
• Millions of downloads, hundreds of sustained requests per second (R/S)
• 1B use events collected per month
• Fucking up means fucking up the users’ phones

• 100% Cloud-based (EC2)
• 100% Automated Infrastructure
• Third-party software is FOSS only
• Continuous Deployment
• Loads of metrics and logging
• Databases: Redis, MySQL, Cassandra
• Languages: Python, Go, C++ (Java on Android)
• Important Tools: Tornado, Thrift, Scribe, Statsd, Kibana, Celery, ZooKeeper, Chef, Docker

• Servers may be terminated at any moment
• Disks may fail at any moment
• The LAN may fail or hiccup
• Services you query may crash or be restarted
• Your code may crash at any moment
• The Client is an idiot that might send garbage
• The Server is an idiot that might return garbage
• We Are All Idiots
• We should accept and embrace all that

• Separation into many small services
• No SPOFs
• Little reliance on disk
• Aspire for statelessness
• Dynamic Endpoint Management
• In App Failover and LB
• Aggressive Timeouts
• Multi tiered alerting
• Sane Fallback Values
• Graceful Degradation

[Diagram labels: API, Search, Ads, Images, Geo, Context, Auto-Complete, Redis (four instances), C*]

● Thin API Layer
  ○ Input validation
  ○ Connection funnelling
● Many smaller services
● Many Redis instances
  ○ “database” = instance
● Thrift for internal APIs
● Deploys are less scary
● Scaling is easier
● Well defined contracts

    if machine_is_down:
        # All is well
        return KeepFighting
    elif fucked_services_count == 1:
        log.info("Tis But a Scratch")
        return KeepFighting
    elif fucked_services_count == 2:
        log.info("Just a Flesh Wound!")
        return KeepFighting
    elif num_running_services >= 2:
        log.info("I'll Bite Your Legs Off!")
        return KeepFighting
    else:
        log.info("All Right, We'll Call It A Draw")
        return SwitchDataCenter_PLZ_KTXBAI

• No Database Master ⇒ remain read only
• No Queue ⇒ Write to log for future processing
• No Service X ⇒ Return a default response, not an exception
• No MySQL ⇒ Everything is ready for serving in Redis anyway
• No internal service ⇒ fall back to external service
• Etc...
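The "default response, not an exception" rule can be sketched as a small decorator. This is only an illustration: `with_fallback` and `get_recommendations` are hypothetical names, not part of the EverythingMe codebase, and the failing service is simulated.

```python
import functools
import logging

log = logging.getLogger("fallback")


def with_fallback(default_factory):
    """Return default_factory() instead of raising when the wrapped call fails."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Keep the failure visible in the logs, then degrade gracefully.
                log.exception("%s failed, serving default", func.__name__)
                return default_factory()
        return wrapper
    return decorator


@with_fallback(default_factory=list)
def get_recommendations(user_id):
    # Hypothetical internal service call; simulated here as always failing.
    raise ConnectionError("recommendation service is down")
```

A caller of `get_recommendations` receives an empty list while the service is down, so the rest of the response can still be built. Using a factory rather than a fixed value avoids sharing one mutable default between callers.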

• Multi Edge / single Central DCs
• Geo-DNS based
• Edges are read only
• Central is write only
  • API / Logs
• All edges are data-symmetrical
• Any Edge may be taken out
• Central may be taken out without service disruption

• ZooKeeper manages an endpoint tree
• Watchdog registers services - no self announce
• Changes to endpoints are published to all services
• Automatic switching and adding of endpoints
• Facilitates no-downtime deploys with downtime :)
• A dead machine is deleted from ZK automatically
• A static snapshot of endpoints is kept on all machines
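One way to picture the published endpoint changes and the static snapshot is the sketch below. It does not talk to ZooKeeper at all: `EndpointCache` and `on_change` are hypothetical names, the JSON file format is an assumption, and the idea that the snapshot serves as a fallback when the coordinator is unreachable is an interpretation of the slide rather than something it states.

```python
import json
import os
import tempfile


class EndpointCache:
    """Holds the live endpoint list for each service and mirrors it to disk."""

    def __init__(self, snapshot_path):
        self.snapshot_path = snapshot_path
        self.endpoints = {}

    def on_change(self, service, endpoints):
        """Callback for a published endpoint change (for example from a watch)."""
        self.endpoints[service] = sorted(endpoints)
        self._write_snapshot()

    def _write_snapshot(self):
        # Write to a temp file and rename, so readers never see a partial file.
        directory = os.path.dirname(self.snapshot_path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as tmp:
            json.dump(self.endpoints, tmp)
        os.replace(tmp_path, self.snapshot_path)

    def load_snapshot(self):
        """Restore the last known endpoints from the on-disk snapshot."""
        with open(self.snapshot_path) as f:
            self.endpoints = json.load(f)
        return self.endpoints
```

A freshly started process can call `load_snapshot()` to begin with the last known endpoints and then switch to live updates once they arrive.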

• Internal “learning” Load Balancing Connection Pool
• Protocol Agnostic (kinda...)
• Python magic - no code changes
• Silent failovers
• Automatic fast banning / exploration
• Why in-app?
  • Application Aware
  • Less latency
  • Endpoint (EP) management support
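A rough sketch of in-app failover with fast banning follows. It is not the pool described in the talk: `BanningPool`, the random selection, and the fixed `ban_seconds` window are all assumptions chosen to keep the example short, and "exploration" is reduced to retrying an endpoint once its ban expires.

```python
import random
import time


class BanningPool:
    """Picks a healthy endpoint at random and temporarily bans ones that fail."""

    def __init__(self, endpoints, ban_seconds=5.0, clock=time.monotonic):
        self.endpoints = list(endpoints)
        self.ban_seconds = ban_seconds
        self.clock = clock
        self.banned_until = {}  # endpoint -> time at which its ban expires

    def healthy(self):
        now = self.clock()
        return [ep for ep in self.endpoints if self.banned_until.get(ep, 0.0) <= now]

    def call(self, request_fn):
        """Try candidate endpoints in random order; fail over silently on errors."""
        # If every endpoint is banned, try them all anyway instead of giving up.
        candidates = self.healthy() or list(self.endpoints)
        random.shuffle(candidates)
        last_error = None
        for ep in candidates:
            try:
                return request_fn(ep)
            except Exception as err:
                # Ban quickly; the endpoint is tried again once the ban expires.
                self.banned_until[ep] = self.clock() + self.ban_seconds
                last_error = err
        if last_error is None:
            raise RuntimeError("no endpoints configured")
        raise last_error
```

The caller passes a function that performs the request against one endpoint, so the pool itself stays independent of the wire protocol.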

• Proper timeouts are a key factor for a distributed system
• They should be as low as possible while avoiding false positives
• Internal service timeouts should be < 50ms
• Client timeouts can be rather big to support retries
• Without proper timeouts any link in the chain can bring you down
• Log them but try to recover
• Don’t forget they add up!
• Bad Timeout == Point Of Failure
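To see how timeouts add up: three sequential internal calls that each allow close to 50 ms can together take close to 150 ms in the worst case. One common way to keep the total bounded is a shared per-request budget. The sketch below assumes such a scheme; `Deadline` and `timeout_for` are hypothetical names and the talk does not say this is how EverythingMe does it.

```python
import time


class Deadline:
    """A shared time budget for one request that passes through several calls."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self.clock = clock
        self.expires_at = clock() + budget_seconds

    def remaining(self):
        return max(0.0, self.expires_at - self.clock())

    def timeout_for(self, per_call_cap):
        """Timeout for the next call: the per-call cap or what is left, whichever is smaller."""
        left = self.remaining()
        if left <= 0.0:
            raise TimeoutError("request budget exhausted")
        return min(per_call_cap, left)
```

Each downstream call asks `timeout_for(0.050)` just before it runs, so later calls in the chain get shorter timeouts as the budget is used up, and the request fails fast once nothing is left.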

• The obvious dark side of all this
• Survival Strategies
  • Separate non time-critical API calls
  • Client Side Backoffs
  • Selective Failover Retries
  • Capacity barriers in internal services
    • Tune well
    • Fail fast!
  • Return sane defaults, not errors
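Two of these strategies, client-side backoff and a capacity barrier that fails fast, can be sketched as follows. The exponential schedule with jitter and the semaphore-based barrier are assumptions picked for illustration; `backoff_delays` and `CapacityBarrier` are hypothetical names.

```python
import random
import threading


def backoff_delays(base=0.5, cap=30.0, attempts=5, rng=random.random):
    """Yield exponentially growing retry delays, each scaled by a random factor."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


class CapacityBarrier:
    """Rejects work immediately once `limit` calls are already in flight."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def run(self, func, default):
        # Non-blocking acquire: fail fast rather than queue behind slow calls.
        if not self._slots.acquire(blocking=False):
            return default
        try:
            return func()
        finally:
            self._slots.release()
```

A client would sleep for each value from `backoff_delays()` between retries, while a service would wrap expensive handlers in `barrier.run(handler, default)` so that overload produces a sane default instead of a growing queue. The `limit` is the part that needs tuning.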

• We are constantly improving our infrastructure
• Our DevOps team does that instead of maintaining servers
• We accept that all solutions are temporary
• Start-up infrastructure is always a compromise
• We embrace Post-Mortems as an opportunity to improve

Want to improve our infra? We’re hiring ;)

Ping me: Twitter: @dvirsky

dvir@everything.me

• MySQL/Redis data abstraction library
• Objects can be saved or loaded from either
• MySQL is write only, Redis read only
• MySQL can be down without disruption
• Only MySQL is replicated between DCs
• Automatic migrations to Redis
• Spartacus - a pseudo MySQL slave notifies on changes

[Diagram labels: Central, MySQL, Redis; Edge, MySQL, Spartacus, Redis]
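The write-to-MySQL, read-from-Redis split can be sketched with plain dictionaries standing in for both stores. Everything here is hypothetical: `DualStore`, its method names, and the JSON encoding are assumptions, and the change notification that Spartacus provides in the talk is replaced by a direct method call.

```python
import json


class DualStore:
    """Writes go to the primary (MySQL-like) store; reads come from the cache (Redis-like)."""

    def __init__(self, primary, cache):
        self.primary = primary  # stand-in for MySQL: dict of key -> row dict
        self.cache = cache      # stand-in for Redis: dict of key -> JSON string

    def save(self, key, obj):
        self.primary[key] = dict(obj)
        # In the talk a replication listener (Spartacus) notifies on changes;
        # this sketch calls the migration step inline instead.
        self.on_row_changed(key)

    def on_row_changed(self, key):
        """Migrate one changed row from the primary store into the cache."""
        self.cache[key] = json.dumps(self.primary[key])

    def load(self, key):
        # Reads never touch the primary, so it can be down without disruption.
        raw = self.cache.get(key)
        return json.loads(raw) if raw is not None else None
```

Because `load` only consults the cache, emptying or losing the primary store in this sketch does not affect reads of objects that were already migrated.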
