Scaling an ELK stack at bol.com

Elasticsearch NL meetup, 2014.09.22, Utrecht


DESCRIPTION

A presentation about the deployment of an ELK stack at bol.com. At bol.com we use Elasticsearch, Logstash and Kibana in a logsearch system that allows our developers and operations people to easily access and search through log events coming from all layers of its infrastructure. The presentation explains the initial design and its failures, then the latest design (mid 2014) and its improvements, and finally gives a set of tips on scaling Logstash and Elasticsearch. These slides were first presented at the Elasticsearch NL meetup on September 22nd 2014 at the bol.com HQ in Utrecht.

TRANSCRIPT

Page 1: Scaling an ELK stack at bol.com

Scaling an ELK stack
Elasticsearch NL meetup

2014.09.22, Utrecht

Page 2: Scaling an ELK stack at bol.com

Who am I?

Renzo Tomà

• IT operations
• Linux engineer
• Python developer
• Likes huge streams of raw data
• Designed metrics & logsearch platform
• Married, proud father of two

And you?

Page 3: Scaling an ELK stack at bol.com

ELK

Page 4: Scaling an ELK stack at bol.com

ELK at bol.com

Logsearch platform.

For developers & operations.

Search & analyze log events using Kibana.

Events from many sources (e.g. syslog, accesslog, log4j, …)

Part of our infrastructure.

Why? Faster root cause analyses => quicker time-to-repair.

Page 5: Scaling an ELK stack at bol.com

Real world examples

Case: release of new webshop version.
Nagios alert: JBoss processing time.
Metrics: increase in active threads (and proctime).
=> Inconclusive!

Find all HTTP requests to www.bol.com which were slower than 5 seconds:

@type:apache_access AND @fields.site:"www_bol_com" AND @fields.responsetimes:[5000000 TO *]

=> Hits for 1 URL. Enough for DEV to start its RCA.

Page 6: Scaling an ELK stack at bol.com

Real world examples

Case: strange performance spikes on webshop. Looks bad, but cause unknown.

Find all errors in webshop log4j logging:

@fields.application:wsp AND @fields.level:ERROR

Compare errors before vs during spike. Spot the difference.

=> Spikes caused by timeouts on a backend service.

Metrics correlation: timeouts not cause, but symptom of full GC issue.

Page 7: Scaling an ELK stack at bol.com

Initial design (mid-2013)

[Architecture diagram]

Servers, routers and firewalls send syslog. Apache webservers ship their accesslog with the remote_syslog pkg (tail). Java webapplications (JVM) use the log4j syslog appender. Everything uses the syslog protocol over UDP as transport, even accesslog + log4j, and flows through a central syslog server into Logstash. Logstash acts as syslog server and converts lines into events, into JSON docs, which are indexed in Elasticsearch and searched with Kibana2.

Page 8: Scaling an ELK stack at bol.com

Initial attempt #fail

Single Logstash instance not fast enough. Unable to keep up with events created.

High CPU load, due to intensive grokking (regex). Network buffer overflow. UDP traffic dropped.

Result: missing events.

Page 9: Scaling an ELK stack at bol.com

Initial attempt #fail

Log4j events can be multiline (e.g. stacktraces).

Events are sent per line: 100 lines = 100 syslog msgs.

Merging by Logstash.

Remember the UDP drops?

Result:
- unparseable events (if 1st line was missing)
- Swiss cheese. Stacktrace lines were missing.

Page 10: Scaling an ELK stack at bol.com

Initial attempt #fail

Syslog RFC 3164:

“The total length of the packet MUST be 1024 bytes or less.”

Rich Apache LogFormat + lots of cookies = 4kb easily.

Anything after byte 1024 got trimmed.

Result: unparseable events (mismatch grok pattern)

Page 11: Scaling an ELK stack at bol.com

The only way is up.

Improvement proposals:

- Use queuing to make Logstash horizontally scalable.

- Drop syslog as transport (for non-syslog).

- Reduce the amount of grokking. Pre-formatting at the source scales better. Less complexity.

Page 12: Scaling an ELK stack at bol.com

Latest design (mid-2014)

[Architecture diagram]

Servers, routers and firewalls still send syslog to a central syslog server. Apache webservers write their accesslog in jsonevent format and a local Logsheep ships it. Java webapplications (JVM) use the log4j jsonevent layout with the log4j redis appender. These, and lots of other sources, push log events into Redis (queue). Many Logstash instances consume from Redis; events arrive in jsonevent format, so no grokking is required. Logstash indexes into the Elasticsearch cluster, searched with Kibana2 + 3.

Page 13: Scaling an ELK stack at bol.com

Current status #win

- Logstash: up to 10 instances per env (because of logstash 1.1 version)

- ES cluster (v1.0.1): 6 data + 2 client nodes

- Each datanode has 7 datadisks (striping)

- Indexing at 2k – 4k docs added per second

- Avg. index time: 0.5ms

- Peak: 300M docs = 185GB, per day

- Searches: just a few per hour

- Shardcount: 3 per idx, 1 replica, 3000 total

- Retention: up to 60 days

Page 14: Scaling an ELK stack at bol.com

Our lessons learned

Before anything else!

Start collecting metrics so you get a baseline. No blind tuning: validate every change fact-based.

Our weapons of choice:
• Graphite
• Diamond (I am a contributor of the ES collector)

• Jcollectd

Alternative: try Marvel.

Page 15: Scaling an ELK stack at bol.com

Logstash tip #1

Insert Redis as a queue between the sources and the Logstash instances:
- Scale Logstash horizontally
- High availability (no events get lost)

[Diagram: Redis queues feeding multiple Logstash instances]
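
A minimal sketch of the indexer side, assuming Logstash 1.x; the Redis host, list key and ES host are placeholders, not the actual bol.com values:

# indexer.conf (one of many identical Logstash instances)
input {
  redis {
    host      => "redis.example.internal"   # hypothetical queue host
    data_type => "list"
    key       => "logstash"                 # hypothetical list key
  }
}
output {
  elasticsearch_http {
    host => "es-client.example.internal"    # hypothetical ES client node
  }
}

The shippers (log4j redis appender, Logsheep) push to the same Redis list, so adding another Logstash consumer is all it takes to scale out.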

Page 16: Scaling an ELK stack at bol.com

Logstash tip #2

Tune your workers. Find your chokepoint and increase its workers to improve throughput.

[Diagram: input -> filter -> output pipeline; the filter stage can run multiple worker threads]

$ top -H -p $(pgrep logstash)
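
If the filter threads turn out to be the busy ones, a Logstash 1.4-style agent lets you raise the number of filter workers on the command line; this is only a sketch, the config path and count are placeholders:

$ bin/logstash agent -f /etc/logstash/indexer.conf -w 4   # -w / --filterworkers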

Page 17: Scaling an ELK stack at bol.com

Logstash tip #3

Grok is very powerful, but CPU intensive. Hard to write, maintain and debug.

Fix: scale. Increase filterworkers (vertical) or add more Logstash instances (horizontal).

Better: feed Logstash with jsonevent input.

Solutions:
• Log4j: use log4j-jsonevent-layout
• Apache: define JSON output with LogFormat (sketch below)
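
For Apache, a LogFormat roughly like the one below (adapted from the untergeek.com post listed under "Tools we use") writes every access-log line as a Logstash-friendly JSON document. The field names mirror the @type/@fields names used in the queries earlier in this deck, but treat it as an illustrative sketch rather than the exact bol.com format:

# httpd.conf sketch; %D is the response time in microseconds
LogFormat "{ \"@timestamp\": \"%{%Y-%m-%dT%H:%M:%S%z}t\", \"@type\": \"apache_access\", \"@fields\": { \"site\": \"%v\", \"clientip\": \"%a\", \"verb\": \"%m\", \"request\": \"%U%q\", \"response\": \"%>s\", \"bytes\": \"%B\", \"responsetimes\": \"%D\" } }" logstash_json
CustomLog "/var/log/apache2/access_json.log" logstash_json

Because the line is already JSON, Logstash only has to parse it, not grok it.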

Page 18: Scaling an ELK stack at bol.com

Logstash tip #4 (last one)

Use the HTTP protocol for the Elasticsearch output.

Avoid a version lock-in!

HTTP may be slower, but a newer ES means:
- Lots of new features
- Lots of bug fixes
- Lots of performance improvements

Most important: you decide what versions to use.

Logstash v1.4.2 (June '14) embeds an ES client that requires ES v1.1.1 (April '14). The latest ES version is v1.3.2 (Aug '14).
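
As a sketch (the host name is a placeholder), the decoupled output in Logstash 1.4.x looks like this; the default elasticsearch output instead joins the cluster with the bundled client and therefore has to match the cluster's ES version:

output {
  elasticsearch_http {
    host => "es-client.example.internal"   # plain HTTP on port 9200
  }
}

Upgrading Elasticsearch then no longer forces a Logstash upgrade, and vice versa.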

Page 19: Scaling an ELK stack at bol.com

Elasticsearch tip #1

Do not download a 'great' configuration.

Elasticsearch is very complex. Lots of moving parts. Lots of different use-cases. Lots of configuration options. The defaults cannot be optimal for everyone.

Start with the defaults:
• Load it (stresstest or pre-launch traffic).
• Check your metrics.
• Find your chokepoint.
• Change one setting.
• Verify and repeat.
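
Even before Graphite/Diamond/Marvel are in place, the stock APIs already give you numbers to baseline against; two hedged examples against any ES 1.x node:

$ curl -s '0:9200/_cluster/health?pretty'   # status, node count, unassigned shards
$ curl -s '0:9200/_nodes/stats?pretty'      # per-node indexing, search, JVM heap/GC stats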

Page 20: Scaling an ELK stack at bol.com

Elasticsearch tip #2

Increase the 'index.refresh_interval' setting.

Refresh: make newly added docs available for search. Default value: one second. High impact on heavy indexing systems (like ours).

Change it at runtime & check the metrics:

$ curl -s -XPUT '0:9200/_all/_settings' -d '{ "index.refresh_interval": "5s" }'
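
Since Logstash starts a fresh index every day, it may also be worth putting the setting in an index template so new indices pick it up automatically; a sketch (the template name is made up, the pattern is the Logstash default):

$ curl -s -XPUT '0:9200/_template/logstash_refresh' -d '{
    "template": "logstash-*",
    "settings": { "index.refresh_interval": "5s" }
  }'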

Page 21: Scaling an ELK stack at bol.com

Elasticsearch tip #3

Use Curator to keep the total shardcount constant.

Uncontrolled shard growth may trigger a sudden hockey stick effect.

Our setup:
- 6 datanodes
- 6 shards per index
- 3 primary, 3 replica

“One shard per datanode” (YMMV)
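
Curator automates the cleanup, but under the hood it comes down to deleting the daily indices that fall outside the retention window; a sketch with a hypothetical index name:

$ curl -s -XDELETE '0:9200/logstash-2014.07.20'   # a daily index older than 60 days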

Page 22: Scaling an ELK stack at bol.com

Elasticsearch tip #4

Become experienced in rolling cluster restarts:
- to roll out new Elasticsearch releases
- to apply a config setting (e.g. heap, GC, ..)

- because it will solve an incident.

Control concurrency + bandwidth:
cluster.routing.allocation.node_concurrent_recoveries
cluster.routing.allocation.cluster_concurrent_rebalance
indices.recovery.max_bytes_per_sec

Get confident enough to trust doing a rolling restart on a Saturday evening! (To get this graph.)
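
All three are dynamic cluster settings, so they can be bumped just before a restart and reverted afterwards; a sketch with placeholder values, not a recommendation:

$ curl -s -XPUT '0:9200/_cluster/settings' -d '{
    "transient": {
      "cluster.routing.allocation.node_concurrent_recoveries": 4,
      "cluster.routing.allocation.cluster_concurrent_rebalance": 2,
      "indices.recovery.max_bytes_per_sec": "80mb"
    }
  }'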

Page 23: Scaling an ELK stack at bol.com

Elasticsearch tip #5 (last one)

Cluster restarts improve recovery time.

Recovery: compares replica vs primary shard. If different, recreate the replica. Costly (iowait) and very time consuming.

But … difference is normal. Primary and replica have their own segment merge management: same docs, but different bytes.

After recovery: the replica is an exact copy of the primary, so the next rolling restart recovers that shard much faster.

Note: this only works for stale shards (no more updates). You have a lot of those when using daily Logstash indices.

Page 24: Scaling an ELK stack at bol.com

Thank you for listening. Got questions?

You can contact me via: [email protected], or …

Page 25: Scaling an ELK stack at bol.com

Page 26: Scaling an ELK stack at bol.com

Relocation in action

Page 27: Scaling an ELK stack at bol.com

Tools we use

http://redis.io/
Key/value memory store, no-frills queuing, extremely fast. Used to scale Logstash horizontally.

https://github.com/emicklei/log4j-redis-appender
Sends log4j events to a Redis queue; non-blocking, batch, failover.

https://github.com/emicklei/log4j-jsonevent-layout
Formats log4j events in the Logstash event layout. Why have Logstash do lots of grokking if you can feed it Logstash-friendly JSON?

http://untergeek.com/2013/09/11/getting-apache-to-output-json-for-logstash-1-2-x/
Formats Apache access logging in the Logstash event layout. Again: avoid grokking.

https://github.com/bolcom/ (SOON)
Logsheep: custom multi-threaded logtailer / UDP listener, sends events to Redis.

https://github.com/BrightcoveOS/Diamond/
Great metrics-collector framework with an Elasticsearch collector. I am a contributor.

https://github.com/elasticsearch/curator
Tool for automatic Elasticsearch index management (delete, close, optimize, bloom).