resilience from theory to practice
Resilience: From Theory to Practice
by:
Efim Dimenstein - Chief Architect
Ori Cohen - Lead Resilience Engineer
Jan 2016
What is LivePerson
LivePerson transforms the connection between brands and consumers.
Our Scale
1.5M concurrent visits
3BN visits/month
200BN API calls/month
2 PB data
Our Production
99.97% uptime
6 data centers
1000+ physical servers
6000+ VMs
Our Engineering
Fast release cycle
~250 people in R&D
Constant innovation
Multiple technologies
33 interruptions per month on average :)
The Past
The Present
LiveEngage Platform
Composable
~100 services
We keep splitting
Much easier to scale
Services are grouped into types
The platform is divided into layers
Everything That Can Go Wrong Will Go Wrong
Resilience Pyramid
DC
HW
SERVICE
COMPONENT
CODE
DC Resilience - Global
DC Resilience
Primary
Secondary
[Diagram, shown twice: Service X made up of Node 1, Node 2, Node 3, ..., Node N]
HA Functionality
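
The slides show a service spread over several nodes, with a primary and a secondary data center, but they do not say how a caller fails over between them. As a minimal sketch only, assuming a hypothetical `send` function and made-up node names, one way to try every node in the primary before moving on to the secondary could look like this:

```python
from typing import Callable, Dict, List


def call_with_failover(
    data_centers: Dict[str, List[str]],
    order: List[str],
    send: Callable[[str], str],
) -> str:
    """Try each node of each data center in order; return the first success.

    `data_centers`, `order` and `send` are hypothetical names used for
    illustration; they are not taken from the talk.
    """
    errors = []
    for dc in order:
        for node in data_centers[dc]:
            try:
                return send(node)
            except ConnectionError as exc:
                # Remember the failure and move on to the next node.
                errors.append(f"{dc}/{node}: {exc}")
    raise RuntimeError("all nodes failed: " + "; ".join(errors))
```

Calling it with `order=["primary", "secondary"]` only touches the secondary data center once every primary node has raised `ConnectionError`.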
Service Grouping
Administration & Configuration
Real Time
Near Real Time
Offline
Components
Solve once - reuse
The Glue
Level of abstraction
Isolates common problems
Components - Guidelines
Retries
Fallback
Cache
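
The three guidelines above (retries, fallback, cache) are often combined in a single reusable component. The sketch below is an assumption about how that might look, not the component described in the talk: `ResilientCall`, its parameters, and the "last known good value" caching policy are all hypothetical.

```python
import time
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class ResilientCall:
    """Hypothetical wrapper: retry, then cached value, then static fallback."""

    def __init__(self, retries: int = 3, backoff_seconds: float = 0.1):
        self.retries = retries
        self.backoff_seconds = backoff_seconds
        self._cache: Dict[str, object] = {}  # last known good value per key

    def call(self, key: str, fetch: Callable[[], T], fallback: T) -> T:
        for attempt in range(self.retries):
            try:
                value = fetch()
                self._cache[key] = value  # refresh cache on every success
                return value
            except Exception:
                # Exponential backoff between attempts, none after the last.
                if attempt < self.retries - 1:
                    time.sleep(self.backoff_seconds * (2 ** attempt))
        # All retries failed: prefer a stale cached value over the fallback.
        if key in self._cache:
            return self._cache[key]  # type: ignore[return-value]
        return fallback
```

A caller would wrap each remote lookup as `rc.call("user:1", fetch_profile, default_profile)`, where `fetch_profile` and `default_profile` are again illustrative names. Serving a stale value before the static fallback is a design choice that suits read-mostly data; it would be wrong for data that must be fresh.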
@ ground level
trust company
trust engineers
and still evaluate
knowledge is power
tooling
testing
deployment
metrics
logs
E2E
ALERTING
untested == unreliable
but… ?
cost effective
visibility
incident injection testing
process
opt-in
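
The slides give no detail on how incident injection is wired in. As a rough sketch under stated assumptions, an opt-in injector could fail calls only for services that have explicitly registered, leaving everything else untouched. `FaultInjector`, `InjectedFault`, and the service names are hypothetical.

```python
import random
from typing import Dict, Optional


class InjectedFault(Exception):
    """Raised in place of a real failure during an injection test."""


class FaultInjector:
    """Hypothetical opt-in fault injector; not the tool from the talk."""

    def __init__(self, seed: Optional[int] = None):
        self._rates: Dict[str, float] = {}  # service name -> failure rate
        self._rng = random.Random(seed)

    def opt_in(self, service: str, failure_rate: float) -> None:
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0.0 and 1.0")
        self._rates[service] = failure_rate

    def maybe_fail(self, service: str) -> None:
        rate = self._rates.get(service)
        if rate is None:
            return  # service never opted in: never inject
        # random() returns a float in [0.0, 1.0), so rate 1.0 always fails.
        if self._rng.random() < rate:
            raise InjectedFault(f"injected fault for {service}")
```

A service that has opted in would call `injector.maybe_fail("chat")` at the top of a request handler, so the surrounding retry and fallback logic is exercised against failures that are deliberate and visible.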
resilience @ scale
● multi layered solution
● requires monitoring and testing
● ingrained in the company culture
● keep things simple
● trust and empower your engineers
● break stuff
Thank you!
Q&A