
Uploaded by: bigpanda-inc

Posted on 15-Jul-2015


TRANSCRIPT

BUILDING A MISSION CRITICAL EVENT SYSTEM ON TOP OF MONGODB

by @shahar_kedar

BIGPANDA

SaaS platform that lets companies aggregate alerts from all their monitoring systems into one place for faster incident discovery and response.

HOW IT WORKS

Events:

• High CPU on prod-srv-1 · 18/06/14 16:05 · CRITICAL

• High CPU on prod-srv-1 · 18/06/14 16:07 · WARNING

• Memory usage on prod-srv-1 · 18/06/14 16:08 · CRITICAL

Entities:

• High CPU on prod-srv-1 · WARNING

• Memory usage on prod-srv-1 · CRITICAL

Incidents:

• 2 Alerts on prod-srv-1

PRODUCT REQUIREMENTS

• Events need to be processed into incidents and streamed to the user’s browser as fast as possible

• Incidents need to reliably reflect the state as it is in the monitoring system

• The service has to be up and running 24x7

MISSION CRITICAL

• It’s not rocket science, it’s not Google, but:

• It has to be super fast

• It has to be extremely reliable

• It has to always be available

OUR #1 COMPETITOR

WHY MONGO?

BECAUSE IT'S WEB SCALE!

WHY MONGO?

At first:

• NodeJS shop

• Schemaless

• Easy to master

Later on:

• Reliable

• Easy to evolve

• Partial and atomic updates

• Powerful query language

BECAUSE IT’S WEB SCALE!

SUPER FAST

Hardware

Schema Design

Lean & Stream

HARDWARE

03/13: 3 x m1.medium

02/14: 1 x i2.xlarge + 2 x m1.medium

06/14: 2 x i2.xlarge + 1 x m3.xlarge

m1.medium: 1 vCPU, 3.75GB RAM, EBS drive

m3.xlarge: 4 vCPUs, 15GB RAM, EBS drive

i2.xlarge: 4 vCPUs, 30.5GB RAM, 800GB SSD

Result: x3 reads, x4 writes

"Schema design is … the largest factor when it comes to performance and scalability … more important than hardware, how you shard, or anything else; schema is by far the most important thing."

–Eliot Horowitz

SCHEMA DESIGN

Event
{ timestamp: Date, status: String, description: String }

Entity
{ start: Date, end: Date, status: String, description: String, events: [ <embedded> ], source_system: String }

Incident
{ start: Date, end: Date, is_active: Boolean, description: String, entities: [ { entityId: ObjectId, status: String } ] }
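Since this was a NodeJS/Mongoose shop, the three documents above could be declared roughly as follows. This is a sketch only: the field names come from the slides, but the schema options and the use of a shared embedded sub-schema are illustrative, not BigPanda's actual code.

```javascript
var mongoose = require('mongoose');

var EventSchema = new mongoose.Schema({
  timestamp: Date,
  status: String,
  description: String
}, { _id: false });

var EntitySchema = new mongoose.Schema({
  start: Date,
  end: Date,
  status: String,
  description: String,
  source_system: String,
  events: [EventSchema]          // events are embedded, never accessed directly
});

var IncidentSchema = new mongoose.Schema({
  start: Date,
  end: Date,
  is_active: Boolean,
  description: String,
  entities: [{                   // partial embed + reference (see next slide)
    entityId: mongoose.Schema.Types.ObjectId,
    status: String
  }]
});
```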

DENORMALIZATION

• Go over the checklist (http://bit.ly/1vUdz2T)

• Incidents => Entities: partially embedded + ref

  • Cardinality: one-to-few

  • Direct access to Entities

  • Entities are frequently updated

• Entities => Events: embedded

  • Events are not directly accessed

  • Events are immutable

  • Cardinality: one-to-many ~ one-to-gazillion
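Hypothetical documents (all values made up) showing the partial-embed + ref pattern: the Incident caches each Entity's status next to its reference, so rendering an incident list needs no second query, while the full Entity, with its embedded Events, lives in its own collection.

```javascript
const entity = {
  _id: 'ent-1',
  start: new Date('2014-06-18T16:05:00Z'),
  status: 'WARNING',
  description: 'High CPU on prod-srv-1',
  source_system: 'nagios',
  events: [ // embedded, immutable event sub-documents
    { timestamp: new Date('2014-06-18T16:05:00Z'), status: 'CRITICAL', description: 'High CPU on prod-srv-1' },
    { timestamp: new Date('2014-06-18T16:07:00Z'), status: 'WARNING', description: 'High CPU on prod-srv-1' }
  ]
};

const incident = {
  start: new Date('2014-06-18T16:05:00Z'),
  is_active: true,
  description: '2 Alerts on prod-srv-1',
  entities: [
    { entityId: entity._id, status: entity.status } // partial embed + ref
  ]
};

// The incident alone is enough to render a summary row:
console.log(incident.entities.map(e => e.status)); // [ 'WARNING' ]
```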

INDEXES

• Optimized indexes using db.collection.find({..}).explain()

• Removed redundant indexes

• Truncated events collections (TTL index)
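In mongo-shell terms, the first and third bullets look roughly like this; the collection names, query, and the 7-day TTL window are assumptions for illustration, not BigPanda's actual values:

```javascript
// Verify a hot query is covered by an index (inspect the reported plan):
db.incidents.find({ is_active: true }).explain()

// Let MongoDB truncate old raw events automatically: a TTL index deletes
// documents once their indexed date field is older than expireAfterSeconds.
db.events.createIndex({ timestamp: 1 }, { expireAfterSeconds: 7 * 24 * 3600 })
```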

LEAN QUERIES

• Use projections to limit the fields returned by a query: Model.find().select('-events')

• Mongoose users: use .lean() when possible to gain a more than 50% performance boost: Model.find().lean()

• Stream results: Model.find().stream().on('data', function(doc){})
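The three techniques compose naturally on one query. A sketch against the Mongoose 3.x-era API shown above (model name, filter, and handlers are assumptions):

```javascript
Entity.find({ status: 'CRITICAL' })
  .select('-events')  // projection: skip the heavy embedded events array
  .lean()             // plain JS objects, skipping Mongoose document overhead
  .stream()           // emit documents as they arrive instead of buffering
  .on('data', function (doc) { /* push to the browser, e.g. over a socket */ })
  .on('error', function (err) { /* handle */ })
  .on('close', function () { /* done */ });
```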

RESULTS

• Average latency of all API calls went from 500ms to under 20ms

• Average latency of the full pipeline went from 2s to under 500ms

• Peak-time latency of the full pipeline went down from 5 minutes (!!) to less than 30s

EXTREMELY RELIABLE

Atomic & Partial Updates

ATOMIC & PARTIAL UPDATES

• Several services might try to update the same document at the same time, but:

  • Different systems update different parts of the document

  • Updates to the same document are sharded and ordered at the application level (read our awesome blog post: http://bit.ly/1nQVcbS)
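The application-level ordering in the last bullet can be sketched in plain Node.js. This is a minimal illustration, not BigPanda's mechanism (that's in the linked blog post): updates to the same document id are chained onto a per-document promise, so two writers never interleave on one incident while different incidents proceed in parallel.

```javascript
const chains = new Map();

function enqueueUpdate(docId, updateFn) {
  // Chain this update after whatever is already pending for docId.
  const prev = chains.get(docId) || Promise.resolve();
  const next = prev.then(updateFn);
  chains.set(docId, next.catch(function () {})); // keep the chain alive on errors
  return next;
}

// Demo: two updates to the same incident run strictly in order.
const log = [];
Promise.all([
  enqueueUpdate('incident-1', async () => { log.push('first'); }),
  enqueueUpdate('incident-1', async () => { log.push('second'); }),
]).then(() => console.log(log.join(' -> '))); // first -> second
```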

IMPOSSIBLE TO KILL

Replica Set

Disaster Recovery

REPLICA SET

• A 3-node replica set

• Using priorities to ensure the stronger nodes are elected primary

• Deployed on different availability zones
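A mongo-shell sketch of such a configuration; the set name, host names, availability zones, and priority values are invented for illustration:

```javascript
rs.initiate({
  _id: "bigpanda-rs",
  members: [
    // Higher priority on the stronger (SSD) boxes makes them win elections:
    { _id: 0, host: "mongo-a.us-east-1a:27017", priority: 2 },
    { _id: 1, host: "mongo-b.us-east-1b:27017", priority: 2 },
    // Weaker node in a third availability zone, still electable as a last resort:
    { _id: 2, host: "mongo-c.us-east-1c:27017", priority: 1 }
  ]
});
```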

DISASTER RECOVERY

• Cold backup using MMS Backup

• Full production replica in another EC2 region: MongoDB's replication mechanism continuously syncs data to the backup region

THANK YOU!