Tales of the Black Knight - Keeping EverythingMe Running
DESCRIPTION
Tips, tricks and strategies we use at EverythingMe to scale and keep our servers always running, no matter what
TRANSCRIPT
• “One Tap Happiness”
• Smart and Contextual Launcher / Phone UI
• App Organization
• In-Phone Search
• App Search
• App Recommendations (+ sponsored)
• Contextual Content Discovery (cards)
What this means:
• Lots of algorithms and data from the servers
• Millions of downloads, hundreds of sustained requests per second
• 1B use events collected per month
• Fucking up means fucking up the users’ phones
• 100% Cloud-based (EC2)
• 100% Automated Infrastructure
• Third-party software is FOSS only
• Continuous Deployment
• Loads of metrics and logging
• Databases: Redis, MySQL, Cassandra
• Languages: Python, Go, C++ (Java on Android)
• Important Tools: Tornado, Thrift, Scribe, Statsd, Kibana, Celery, ZooKeeper, Chef, Docker
• Servers may be terminated at any moment
• Disks may fail at any moment
• The LAN may fail or hiccup
• Services you query may crash or be restarted
• Your code may crash at any moment
• The Client is an idiot that might send garbage
• The Server is an idiot that might return garbage
• We Are All Idiots
• We should accept and embrace all that
• Separation into many small services
• No SPOFs
• Little reliance on disk
• Aspire for statelessness
• Dynamic Endpoint Management
• In-App Failover and LB
• Aggressive Timeouts
• Multi-tiered alerting
• Sane Fallback Values
• Graceful Degradation
[Architecture diagram; labels: API, Search, Auto-Complete, Ads, Images, Geo, Context, C* (Cassandra), Redis ×4]
● Thin API Layer
  ○ Input validation
  ○ Connection funnelling
● Many smaller services
● Many Redis instances
  ○ “database” = instance
● Thrift for internal APIs
● Deploys are less scary
● Scaling is easier
● Well defined contracts
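import logging

log = logging.getLogger("black_knight")
KeepFighting = "keep fighting"
SwitchDataCenter_PLZ_KTXBAI = "switch data center"

def on_failure(machine_is_down, fucked_services_count, num_running_services):
    if machine_is_down:
        # All is well
        return KeepFighting
    elif fucked_services_count == 1:
        log.info("Tis But a Scratch")
        return KeepFighting
    elif fucked_services_count == 2:
        log.info("Just a Flesh Wound!")
        return KeepFighting
    elif num_running_services >= 2:
        log.info("I'll Bite Your Legs Off!")
        return KeepFighting
    else:
        log.info("All Right, We'll Call It A Draw")
        return SwitchDataCenter_PLZ_KTXBAI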
• No Database Master ⇒ Remain read only
• No Queue ⇒ Write to log for future processing
• No Service X ⇒ Return a default response, not an exception
• No MySQL ⇒ Everything is ready for serving in Redis anyway
• No internal service ⇒ Fall back to external service
• Etc...
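The "default response, not an exception" rule could be sketched as a small decorator. This is only an illustration, not EverythingMe's code: `with_fallback`, `recommend_apps` and the empty-tuple default are hypothetical names chosen for the example.

```python
import functools
import logging

log = logging.getLogger("fallback")


def with_fallback(default):
    """Wrap a service call so that any failure yields `default` instead of raising."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Log the failure for later inspection, but keep serving.
                log.exception("%s failed, returning default", func.__name__)
                return default
        return wrapper
    return decorator


@with_fallback(default=())
def recommend_apps(user_id):
    # Hypothetical call to an internal service that happens to be down.
    raise ConnectionError("recommendation service unreachable")
```

Calling `recommend_apps(42)` logs the error and returns the empty default, so the caller renders a page without recommendations rather than an error page.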
• Multi Edge / single Central DCs
• Geo-DNS based
• Edges are read only
• Central is write only
• API / Logs
• All edges are data-symmetrical
• Any Edge may be taken out
• Central may be taken out without service disruption
• ZooKeeper manages an endpoint tree
• Watchdog registers services - no self announce
• Changes to endpoints are published to all services
• Automatic switching and adding of endpoints
• Facilitates no-downtime deploys with downtime :)
• A dead machine is deleted from ZK automatically
• A static snapshot of endpoints kept on all machines
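The "static snapshot" idea might look roughly like the sketch below. It deliberately uses no real ZooKeeper client: `EndpointRegistry`, `on_change` and the JSON file layout are assumptions made for illustration, and `on_change` stands in for whatever callback receives published endpoint changes.

```python
import json
import os
import tempfile


class EndpointRegistry:
    """Keeps the current endpoints per service and mirrors them to a local file."""

    def __init__(self, snapshot_path):
        self.snapshot_path = snapshot_path
        self.endpoints = {}

    def on_change(self, service, hosts):
        # Called whenever the coordinator publishes a new endpoint list.
        self.endpoints[service] = sorted(hosts)
        self._save_snapshot()

    def _save_snapshot(self):
        # Write to a temp file and rename, so readers never see a half-written file.
        directory = os.path.dirname(self.snapshot_path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as tmp:
            json.dump(self.endpoints, tmp)
        os.replace(tmp_path, self.snapshot_path)

    def load_snapshot(self):
        # Used at startup when the coordinator cannot be reached.
        with open(self.snapshot_path) as snapshot:
            self.endpoints = json.load(snapshot)
```

A process that starts while the coordinator is unreachable can call `load_snapshot()` and keep talking to the last known endpoints until live updates resume.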
• Internal “learning” Load Balancing Connection Pool
• Protocol Agnostic (kinda...)
• Python magic - no code changes
• Silent failovers
• Automatic fast banning / exploration
• Why in-app?
  ○ Application Aware
  ○ Less latency
  ○ EP management support
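A much-simplified sketch of silent failover with fast banning and exploration follows. The real pool is described as "learning" and protocol agnostic; this version only bans failing endpoints for a fixed period, and `BanningPool`, `ban_seconds` and `request_fn` are hypothetical names.

```python
import random
import time


class BanningPool:
    """Picks a random healthy endpoint; failing endpoints are banned for a while."""

    def __init__(self, endpoints, ban_seconds=5.0, clock=time.monotonic):
        self.endpoints = list(endpoints)
        self.ban_seconds = ban_seconds
        self.clock = clock
        self.banned_until = {}  # endpoint -> time at which the ban expires

    def _healthy(self):
        now = self.clock()
        return [ep for ep in self.endpoints if self.banned_until.get(ep, 0.0) <= now]

    def call(self, request_fn):
        """Try endpoints until one succeeds; the caller never sees the failovers."""
        # If everything is banned, explore all endpoints rather than give up.
        candidates = self._healthy() or list(self.endpoints)
        random.shuffle(candidates)
        last_error = None
        for endpoint in candidates:
            try:
                return request_fn(endpoint)
            except Exception as error:
                # Ban quickly; the endpoint is retried once the ban expires.
                self.banned_until[endpoint] = self.clock() + self.ban_seconds
                last_error = error
        raise RuntimeError("all endpoints failed") from last_error
```

Because the pool lives inside the application, it can react to an application-level failure on the very next request, with no extra network hop through an external load balancer.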
• Proper timeouts are a key factor for a distributed system
• They should be as low as possible while avoiding false positives
• Internal service timeouts should be < 50ms
• Client timeouts can be rather big to support retries
• Without proper timeouts any link in the chain can bring you down
• Log timeouts, but try to recover
• Don’t forget they add up!
• Bad Timeout == Point Of Failure
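"They add up" is simple arithmetic: three sequential internal calls that each time out at 50 ms can take 150 ms in the worst case. One way to keep the total bounded is a shared per-request budget, sketched below. The `Deadline` class and the 120 ms budget in the usage note are assumptions for illustration; only the 50 ms per-call figure comes from the slide.

```python
import time


class Deadline:
    """A shared time budget for one request across several downstream calls."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self.clock = clock
        self.expires_at = clock() + budget_seconds

    def remaining(self):
        return max(0.0, self.expires_at - self.clock())

    def timeout_for(self, per_call_cap):
        """Timeout for the next call: the cap, or whatever budget is left."""
        left = self.remaining()
        if left <= 0.0:
            raise TimeoutError("request budget exhausted")
        return min(per_call_cap, left)
```

With `Deadline(0.120)`, the first two calls may each use up to 50 ms, and a third call issued 100 ms into the request gets only the remaining 20 ms.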
• The obvious dark side of all this
• Survival Strategies
• Separate non time-critical API calls
• Client Side Backoffs
• Selective Failover Retries
• Capacity barriers in internal services
• Tune well
• Fail fast!
• Return sane defaults, not errors
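Two of these strategies, client-side backoff and capacity barriers, can be sketched in a few lines. Everything here is a hypothetical illustration: `backoff_delay` uses exponential backoff with full jitter (a common choice, not necessarily the one used at EverythingMe), and `CapacityBarrier` fails fast with a default once a fixed number of calls are in flight.

```python
import random
import threading


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random.random):
    """Full-jitter backoff: a random delay below min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


class CapacityBarrier:
    """Rejects work immediately once `limit` calls are already in flight."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def run(self, func, default):
        # Non-blocking acquire: fail fast instead of queueing behind slow calls.
        if not self._slots.acquire(blocking=False):
            return default
        try:
            return func()
        finally:
            self._slots.release()
```

The barrier's limit is the knob that needs tuning: too low and healthy traffic gets defaults, too high and a slow dependency can still exhaust the service.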
• We are constantly improving our infrastructure
• Our DevOps team spends its time on that, not on maintaining servers
• We accept that all solutions are temporary
• Start-up infrastructure is always a compromise
• We embrace Post-Mortems as an opportunity to improve
• MySQL/Redis data abstraction library
• Objects can be saved or loaded from either
• MySQL is write only, Redis read only
• MySQL can be down without disruption
• Only MySQL is replicated between DCs
• Automatic migrations to Redis
• Spartacus - a pseudo MySQL slave that notifies on changes
[Diagram; labels: Central, Edge, MySQL, Spartacus, Redis]
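The write-to-MySQL, read-from-Redis split can be illustrated with in-memory stand-ins. This is not the actual abstraction library: `DualStore` and its methods are hypothetical, plain dicts replace both databases, and `propagate` plays the role the slide assigns to Spartacus.

```python
class DualStore:
    """Writes go to the primary store; reads are served from the cache only."""

    def __init__(self):
        self.primary = {}   # stand-in for MySQL (the write side)
        self.cache = {}     # stand-in for Redis (the read side)
        self.pending = []   # stand-in for the change feed a replication listener sees

    def save(self, key, obj):
        self.primary[key] = obj
        self.pending.append(key)

    def propagate(self):
        """Play the notifier's role: copy changed rows into the cache."""
        while self.pending:
            key = self.pending.pop(0)
            self.cache[key] = self.primary[key]

    def load(self, key, default=None):
        # Reads never touch the primary, so it can be down without disruption.
        return self.cache.get(key, default)
```

A saved object becomes readable only after `propagate()` runs, which mirrors the trade-off of this design: reads survive a MySQL outage, at the cost of a short delay before writes become visible.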