Never gonna give you up

Upload: francesco-degrassi

Post on 18-Jul-2015

@EdMcBane 7 lessons learned building HP/HA systems

Never gonna give you up

Never gonna let you down

@EdMcBane

Francesco Degrassi

Enthusiastic yet pragmatic Lean Software Developer.

Uppish and cynical nihilist from time to time.

Lean Software Development and team coaching

Continuous Delivery, High availability, performance

Security sensitive & high uncertainty domains

The challenge

● Major European client

● Innovative service for the consumer market

● Large userbase (200K+ users)

● Very high request rate

● Low-latency requirement (<< RTT, i.e. far below one network round trip)

What we built

What did we learn?

Make your assumptions explicit

and keep testing them

Don’t eat the yellow snow

#1 Make your assumptions explicit and keep challenging them

#2 Performance & High Availability are not extra features

#3 Do not reinvent the wheel

...but keep things simple

In our case...

● Everything worked well in the single-core scenario

SO_REUSEPORT

For TCP, SO_REUSEPORT allows multiple listener sockets to be bound to the same port.

Received packets are distributed to the sockets bound to that port using a hash of the connection 4-tuple (source address, source port, destination address, destination port).

With SO_REUSEPORT the distribution across listeners is close to uniform.
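As a rough illustration of the option described above, the following sketch binds several TCP listeners to the same port. It assumes a platform that exposes `socket.SO_REUSEPORT` (for example Linux 3.9 or later). The helper name `open_reuseport_listeners` is hypothetical and not taken from the talk.

```python
import socket


def open_reuseport_listeners(count, host="127.0.0.1"):
    """Open `count` TCP listeners that all share one port (hypothetical helper)."""
    listeners = []
    port = 0  # 0 lets the kernel pick a free port for the first listener
    for _ in range(count):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # The option must be set on every socket before bind(); otherwise
        # the second bind() fails with "Address already in use".
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind((host, port))
        sock.listen(128)
        # Reuse the port chosen for the first listener for all later ones.
        port = sock.getsockname()[1]
        listeners.append(sock)
    return listeners
```

In a real multi-core setup each listener would typically live in its own worker process or thread, each running its own accept loop. This sketch only shows that the shared bind succeeds.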

Everything should be made as simple as possible, but not simpler

— Albert Einstein

LESS(1)                 General Commands Manual                 LESS(1)

NAME
    less - opposite of more

SYNOPSIS
    less -?
    less --help
    less -V
    less --version
    less [-[+]aABcCdeEfFgGiIJKLmMnNqQrRsSuUVwWX~]
         [-b space] [-h lines] [-j line] [-k keyfile]
         [-{oO} logfile] [-p pattern] [-P prompt] [-t tag]
         [-T tagsfile] [-x tab,...] [-y lines] [-[z] lines]
         [-# shift] [+[+]cmd] [--] [filename]...
    (See the OPTIONS section for alternate option syntax with long option names.)

DESCRIPTION
    Less is similar to more(1), but has many more features. Less does not have to read the entire input file before starting, so with large input files it starts up faster than text editors like vi(1). Less uses termcap (or terminfo on some systems), so it can run on

Manual page less(1) line 1 (press h for help or q to quit)

#4 Be wary of cargo-cult optimization

TCP_TW_RECYCLE

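Enable fast recycling of TIME-WAIT sockets. Default value is 0. It should not be changed without advice/request of technical experts.

With net.ipv4.tcp_tw_recycle enabled, Linux will drop any segment from a remote host whose TCP timestamp is not strictly greater than the latest timestamp recorded for that host. Clients behind the same NAT device share one source address but keep independent timestamp clocks, so their connections get dropped seemingly at random.

TCP_TW_RECYCLE + NAT = MADNESS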
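To make the NAT problem concrete, here is a deliberately simplified toy model of the per-host timestamp check described above. It is not the kernel's actual logic, which has more conditions. The class name, the addresses, and the timestamp values are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PerHostTimestampFilter:
    """Toy model: remember the latest TCP timestamp seen per source IP."""

    last_seen: Dict[str, int] = field(default_factory=dict)

    def accept(self, source_ip: str, tcp_timestamp: int) -> bool:
        previous = self.last_seen.get(source_ip)
        # Drop anything whose timestamp is not strictly greater than
        # the latest one recorded for this source address.
        if previous is not None and tcp_timestamp <= previous:
            return False
        self.last_seen[source_ip] = tcp_timestamp
        return True


# Two clients behind one NAT appear with the same public address
# (a documentation-range IP, chosen arbitrarily).
nat_ip = "203.0.113.7"
server = PerHostTimestampFilter()

client_a_ok = server.accept(nat_ip, 500_000)       # client A, long uptime
client_b_ok = server.accept(nat_ip, 1_200)         # client B, recently booted
client_a_again_ok = server.accept(nat_ip, 500_010)  # client A, a bit later
```

In this model client B is rejected only because client A, which has a much larger timestamp counter, was seen first from the same address. That is the failure mode the slide warns about.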

#5 High Availability is much more than just redundancy

Failure impact

● Redundant hardware

● Redundant software components

But there’s more!

● Graceful degradation

● Incremental rollouts
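Graceful degradation can take many forms. One minimal sketch, assuming a non-critical dependency that may fail, is to serve a static fallback and flag the response as degraded instead of failing the whole request. The recommendation service, the fallback items, and the function names below are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical static fallback served when the live dependency is down.
FALLBACK_ITEMS: List[str] = ["top-seller-1", "top-seller-2"]


def recommendations_with_fallback(
    fetch: Callable[[str], List[str]], user_id: str
) -> Dict[str, object]:
    """Return live recommendations, or a degraded default if `fetch` fails."""
    try:
        return {"items": fetch(user_id), "degraded": False}
    except Exception:
        # Broad catch on purpose for this sketch: any failure of the
        # optional dependency reduces the feature, not the whole page.
        return {"items": list(FALLBACK_ITEMS), "degraded": True}


def broken_fetch(user_id: str) -> List[str]:
    # Stand-in for a dependency outage.
    raise TimeoutError("recommendation service unavailable")


def healthy_fetch(user_id: str) -> List[str]:
    return ["item-for-" + user_id]


degraded_response = recommendations_with_fallback(broken_fetch, "u42")
normal_response = recommendations_with_fallback(healthy_fetch, "u42")
```

The `degraded` flag is there so monitoring can count how often the fallback path is taken. Otherwise the fallback would hide the outage.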

Failure frequency

But then also:

● Proven technology

● High-quality hardware

● Automation (to avoid errors)

Time to recover

● Effective monitoring
    ○ real-time
    ○ reliable
    ○ understandable
    ○ thorough
    ○ meaningful
    ○ actionable

● Rollback / rollforward

● Automation (for speed)
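One common way to relate failure frequency and time to recover is the steady-state availability formula, MTBF / (MTBF + MTTR). The sketch below only illustrates that relationship. The function names and the sample numbers are assumptions, not figures from the talk.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to recover."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_downtime_minutes(avail: float) -> float:
    """Expected downtime per 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24 * 60


# Illustrative numbers only: same failure frequency, faster recovery.
slow_recovery = availability(mtbf_hours=999, mttr_hours=1.0)
fast_recovery = availability(mtbf_hours=999, mttr_hours=0.5)
```

With the failure frequency held constant, halving the recovery time roughly halves the expected downtime. This is why monitoring, rollback, and automation appear under this heading.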

Our response plan goes something like this...

AaaaaAAaaaah

...but be prepared to improvise

● In-house experience

● Developers on call

● Drills (chaos monkeys)

Processes designed for ordinary times are not resilient in a crisis and need to be changed.

#6 Embrace diversity

#7 Monitoring is essential

...and we can do way better

No one size fits all

● “Monitor everything”, like “100% test coverage”, is a nice slogan.

● Each environment requires a slightly different solution

● Balance between data availability, cost and ability to keep it actionable

We are doing logging wrong

● Unstructured

● Inconsistent

● Poor defaults

● Complex, obscure components

● A huge waste of computing power
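As a contrast to the unstructured and inconsistent logging criticized above, the following sketch emits each log record as one JSON object using only Python's standard `logging` module. The field names (`ts`, `level`, `event`), the logger name, and the `fields` convention for extra data are assumptions made for this example, not a format proposed in the talk.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object with a fixed set of keys."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Structured context is passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True, default=str)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("hpha.demo")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info("request_served", extra={"fields": {"latency_ms": 3, "status": 200}})

# Each line can be parsed back without regular expressions.
parsed = json.loads(buffer.getvalue())
```

Because every line is machine-readable, the same record can feed a metric or be joined with an alert later, without a fragile text-parsing step.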

We need a complete overview

● Logs

● Metrics

● Alerts

● Together, coherent, cross-referenced

Human beings, who are almost unique in having the ability to learn from the experience of others, are also remarkable for their apparent disinclination to do so.

Douglas Adams