Analyzing Response Time Distributions for Microservices Adrian Cockcroft @adrianco Technology Fellow - Battery Ventures February 2016

Upload: adrian-cockcroft

Post on 12-Feb-2017


Page 1: Microxchg Analyzing Response Time Distributions for Microservices

Analyzing Response Time Distributions for Microservices

Adrian Cockcroft @adrianco

Technology Fellow - Battery Ventures

February 2016

Page 2: Microxchg Analyzing Response Time Distributions for Microservices

What does @adrianco do?

@adrianco

Technology Due Diligence on Deals

Presentations at Conferences

Presentations at Companies

Technical Advice for Portfolio Companies

Program Committee for Conferences

Networking with Interesting People

Tinkering with Technologies

Maintain Relationship with Cloud Vendors

Page 3: Microxchg Analyzing Response Time Distributions for Microservices

Challenges for Microservice Platforms

Page 4: Microxchg Analyzing Response Time Distributions for Microservices

Managing Scale

Page 5: Microxchg Analyzing Response Time Distributions for Microservices

A Possible Hierarchy    How Many?

Continents              3 to 5
Regions                 2-4 per Continent
Zones                   1-5 per Region
Services                100’s per Zone
Versions                Many per Service
Containers              1000’s per Version
Instances               10,000’s

It’s much more challenging than just a large number of machines

Page 6: Microxchg Analyzing Response Time Distributions for Microservices

Flow

Page 7: Microxchg Analyzing Response Time Distributions for Microservices

Some tools can show the request flow across a few services

Page 8: Microxchg Analyzing Response Time Distributions for Microservices

Interesting architectures have a lot of microservices! Flow visualization is a big challenge.

See http://www.slideshare.net/LappleApple/gilt-from-monolith-ruby-app-to-micro-service-scala-service-architecture

Page 9: Microxchg Analyzing Response Time Distributions for Microservices

Simulated Microservices

Model and visualize microservices
Simulate interesting architectures
Generate large scale configurations
Eventually stress test real tools

See github.com/adrianco/spigo
Simulate Protocol Interactions in Go
Visualize with D3

ELB Load Balancer

Zuul API Proxy

Karyon Business Logic

Staash Data Access Layer

Priam Cassandra Datastore

Three Availability Zones

Page 10: Microxchg Analyzing Response Time Distributions for Microservices

Spigo Nanoservice Structure

    func Start(listener chan gotocol.Message) {
        ...
        for {
            select {
            case msg := <-listener:
                flow.Instrument(msg, name, hist)
                switch msg.Imposition {
                case gotocol.Hello: // get named by parent
                    ...
                case gotocol.NameDrop: // someone new to talk to
                    ...
                case gotocol.Put: // upstream request handler
                    ...
                    outmsg := gotocol.Message{gotocol.Replicate, listener, time.Now(), msg.Ctx.NewParent(), msg.Intention}
                    flow.AnnotateSend(outmsg, name)
                    outmsg.GoSend(replicas)
                }
            case <-eurekaTicker.C: // poll the service registry
                ...
            }
        }
    }

Nanoservice simulation totals about 200 lines of Go

Page 11: Microxchg Analyzing Response Time Distributions for Microservices

Flow Trace Recording

[Diagram: staash (us-east-1 zoneC) sends Put s896 to riak2 (us-east-1 zoneC). riak2 sends Replicate s908 to riak3 (us-east-1 zoneA), s909 to riak4 (us-east-1 zoneB) and s910 to riak9 (us-west-2 zoneA). riak9 sends Replicate s912 to riak10 (us-west-2 zoneB) and s913 to riak8 (us-west-2 zoneC).]

us-east-1.zoneC.riak2 t98p895s896 Put
us-east-1.zoneA.riak3 t98p896s908 Replicate
us-east-1.zoneB.riak4 t98p896s909 Replicate
us-west-2.zoneA.riak9 t98p896s910 Replicate
us-west-2.zoneB.riak10 t98p910s912 Replicate
us-west-2.zoneC.riak8 t98p910s913 Replicate
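The trace records on this slide look like they pack a trace ID, a parent span ID and a span ID into one token of the form t<trace>p<parent>s<span>. That reading is an assumption drawn from the sample lines, not from Spigo's source. Under that assumption, a minimal Go sketch for parsing the records and grouping spans by parent could look like this (the `Annotation` type and the function names are hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// Annotation is one flow trace record. The layout
// "<node> t<trace>p<parent>s<span> <message type>" is inferred from the
// sample records on the slide; it is an assumption, not Spigo's definition.
type Annotation struct {
	Node                string
	Trace, Parent, Span int
	Imposition          string
}

// ParseAnnotation splits a record such as
// "us-east-1.zoneC.riak2 t98p895s896 Put" into its parts.
func ParseAnnotation(line string) (Annotation, error) {
	var a Annotation
	fields := strings.Fields(line)
	if len(fields) != 3 {
		return a, fmt.Errorf("want 3 fields, got %d", len(fields))
	}
	a.Node, a.Imposition = fields[0], fields[2]
	if _, err := fmt.Sscanf(fields[1], "t%dp%ds%d", &a.Trace, &a.Parent, &a.Span); err != nil {
		return a, fmt.Errorf("bad trace token %q: %v", fields[1], err)
	}
	return a, nil
}

// Children groups span IDs by the span ID of their parent.
func Children(as []Annotation) map[int][]int {
	tree := make(map[int][]int)
	for _, a := range as {
		tree[a.Parent] = append(tree[a.Parent], a.Span)
	}
	return tree
}

func main() {
	lines := []string{
		"us-east-1.zoneC.riak2 t98p895s896 Put",
		"us-east-1.zoneA.riak3 t98p896s908 Replicate",
		"us-east-1.zoneB.riak4 t98p896s909 Replicate",
		"us-west-2.zoneA.riak9 t98p896s910 Replicate",
		"us-west-2.zoneB.riak10 t98p910s912 Replicate",
		"us-west-2.zoneC.riak8 t98p910s913 Replicate",
	}
	var as []Annotation
	for _, l := range lines {
		a, err := ParseAnnotation(l)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		as = append(as, a)
	}
	fmt.Println(Children(as))
}
```

Grouping by parent is enough to rebuild the fan-out shown in the diagram: the Put span has three Replicate children, and one of those has two more.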

Page 12: Microxchg Analyzing Response Time Distributions for Microservices

Open Zipkin

A common format for trace annotations
A Java tool for visualizing traces
Standardization effort to fold in other formats
Driven by Adrian Cole (currently at Pivotal)
Extended to load Spigo generated trace files

Page 13: Microxchg Analyzing Response Time Distributions for Microservices

Zipkin Trace Dependencies

Page 14: Microxchg Analyzing Response Time Distributions for Microservices

Zipkin Trace Dependencies

Page 15: Microxchg Analyzing Response Time Distributions for Microservices

Trace for one Spigo Flow

Page 16: Microxchg Analyzing Response Time Distributions for Microservices

Definition of an architecture

    {
      "arch": "lamp",
      "description": "Simple LAMP stack",
      "version": "arch-0.0",
      "victim": "webserver",
      "services": [
        { "name": "rds-mysql", "package": "store", "count": 2, "regions": 1, "dependencies": [] },
        { "name": "memcache", "package": "store", "count": 1, "regions": 1, "dependencies": [] },
        { "name": "webserver", "package": "monolith", "count": 18, "regions": 1, "dependencies": ["memcache", "rds-mysql"] },
        { "name": "webserver-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["webserver"] },
        { "name": "www", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["webserver-elb"] }
      ]
    }

Header includes chaos monkey victim

New tier name

Tier package

0 = non Regional

Node count

List of tier dependencies
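As an illustration of how an architecture definition like the one on this slide can be consumed, here is a minimal Go sketch that decodes it with the standard encoding/json package and lists dependency names that no tier declares. The struct and function names are hypothetical and are not taken from Spigo; only the JSON field names come from the slide.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Service mirrors one entry of the "services" array on the slide.
type Service struct {
	Name         string   `json:"name"`         // tier name
	Package      string   `json:"package"`      // tier package
	Count        int      `json:"count"`        // node count
	Regions      int      `json:"regions"`      // 0 = non regional
	Dependencies []string `json:"dependencies"` // tiers this tier calls
}

// Architecture mirrors the header fields plus the list of tiers.
type Architecture struct {
	Arch        string    `json:"arch"`
	Description string    `json:"description"`
	Version     string    `json:"version"`
	Victim      string    `json:"victim"` // chaos monkey victim
	Services    []Service `json:"services"`
}

// ParseArchitecture decodes a JSON architecture definition.
func ParseArchitecture(data []byte) (Architecture, error) {
	var a Architecture
	err := json.Unmarshal(data, &a)
	return a, err
}

// MissingDependencies returns dependency names that are not declared
// as a tier in the same file.
func MissingDependencies(a Architecture) []string {
	defined := make(map[string]bool, len(a.Services))
	for _, s := range a.Services {
		defined[s.Name] = true
	}
	var missing []string
	for _, s := range a.Services {
		for _, d := range s.Dependencies {
			if !defined[d] {
				missing = append(missing, d)
			}
		}
	}
	return missing
}

func main() {
	data := []byte(`{"arch":"lamp","description":"Simple LAMP stack",
	  "version":"arch-0.0","victim":"webserver","services":[
	  {"name":"rds-mysql","package":"store","count":2,"regions":1,"dependencies":[]},
	  {"name":"memcache","package":"store","count":1,"regions":1,"dependencies":[]},
	  {"name":"webserver","package":"monolith","count":18,"regions":1,
	   "dependencies":["memcache","rds-mysql"]}]}`)
	a, err := ParseArchitecture(data)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, s := range a.Services {
		fmt.Printf("%s (%s) x%d -> %v\n", s.Name, s.Package, s.Count, s.Dependencies)
	}
	fmt.Println("undeclared dependencies:", MissingDependencies(a))
}
```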

Page 17: Microxchg Analyzing Response Time Distributions for Microservices

Running Spigo

    $ ./spigo -a lamp -j -d 2
    2016/01/26 23:04:05 Loading architecture from json_arch/lamp_arch.json
    2016/01/26 23:04:05 lamp.edda: starting
    2016/01/26 23:04:05 Architecture: lamp Simple LAMP stack
    2016/01/26 23:04:05 architecture: scaling to 100%
    2016/01/26 23:04:05 lamp.us-east-1.zoneB.eureka01....eureka.eureka: starting
    2016/01/26 23:04:05 lamp.us-east-1.zoneA.eureka00....eureka.eureka: starting
    2016/01/26 23:04:05 lamp.us-east-1.zoneC.eureka02....eureka.eureka: starting
    2016/01/26 23:04:05 Starting: {rds-mysql store 1 2 []}
    2016/01/26 23:04:05 Starting: {memcache store 1 1 []}
    2016/01/26 23:04:05 Starting: {webserver monolith 1 18 [memcache rds-mysql]}
    2016/01/26 23:04:05 Starting: {webserver-elb elb 1 0 [webserver]}
    2016/01/26 23:04:05 Starting: {www denominator 0 0 [webserver-elb]}
    2016/01/26 23:04:05 lamp.*.*.www00....www.denominator activity rate 10ms
    2016/01/26 23:04:06 chaosmonkey delete: lamp.us-east-1.zoneC.webserver02....webserver.monolith
    2016/01/26 23:04:07 asgard: Shutdown
    2016/01/26 23:04:07 lamp.us-east-1.zoneB.eureka01....eureka.eureka: closing
    2016/01/26 23:04:07 lamp.us-east-1.zoneA.eureka00....eureka.eureka: closing
    2016/01/26 23:04:07 lamp.us-east-1.zoneC.eureka02....eureka.eureka: closing
    2016/01/26 23:04:07 spigo: complete
    2016/01/26 23:04:07 lamp.edda: closing

-a architecture lamp
-j graph json/lamp.json
-d run for 2 seconds

Page 18: Microxchg Analyzing Response Time Distributions for Microservices

Riak IoT Architecture

    {
      "arch": "riak",
      "description": "Riak IoT ingestion example for the RICON 2015 presentation",
      "version": "arch-0.0",
      "victim": "",
      "services": [
        { "name": "riakTS", "package": "riak", "count": 6, "regions": 1, "dependencies": ["riakTS", "eureka"]},
        { "name": "ingester", "package": "staash", "count": 6, "regions": 1, "dependencies": ["riakTS"]},
        { "name": "ingestMQ", "package": "karyon", "count": 3, "regions": 1, "dependencies": ["ingester"]},
        { "name": "riakKV", "package": "riak", "count": 3, "regions": 1, "dependencies": ["riakKV"]},
        { "name": "enricher", "package": "staash", "count": 6, "regions": 1, "dependencies": ["riakKV", "ingestMQ"]},
        { "name": "enrichMQ", "package": "karyon", "count": 3, "regions": 1, "dependencies": ["enricher"]},
        { "name": "analytics", "package": "karyon", "count": 6, "regions": 1, "dependencies": ["ingester"]},
        { "name": "analytics-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["analytics"]},
        { "name": "analytics-api", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["analytics-elb"]},
        { "name": "normalization", "package": "karyon", "count": 6, "regions": 1, "dependencies": ["enrichMQ"]},
        { "name": "iot-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["normalization"]},
        { "name": "iot-api", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["iot-elb"]},
        { "name": "stream", "package": "karyon", "count": 6, "regions": 1, "dependencies": ["ingestMQ"]},
        { "name": "stream-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["stream"]},
        { "name": "stream-api", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["stream-elb"]}
      ]
    }

New tier name

Tier package

Node count

List of tier dependencies

0 = non Regional

Page 19: Microxchg Analyzing Response Time Distributions for Microservices

Single Region Riak IoT

Page 20: Microxchg Analyzing Response Time Distributions for Microservices

Single Region Riak IoT

IoT Ingestion Endpoint

Stream Endpoint

Analytics Endpoint

Page 21: Microxchg Analyzing Response Time Distributions for Microservices

Single Region Riak IoT

IoT Ingestion Endpoint

Stream Endpoint

Analytics Endpoint

Load Balancer

Load Balancer

Load Balancer

Page 22: Microxchg Analyzing Response Time Distributions for Microservices

Single Region Riak IoT

IoT Ingestion Endpoint

Stream Endpoint

Analytics Endpoint

Load Balancer

Normalization Services

Load Balancer

Load Balancer

Stream Service

Analytics Service

Page 23: Microxchg Analyzing Response Time Distributions for Microservices

Single Region Riak IoT

IoT Ingestion Endpoint

Stream Endpoint

Analytics Endpoint

Load Balancer

Normalization Services

Enrich Message Queue Riak KV

Enricher Services

Load Balancer

Load Balancer

Stream Service

Analytics Service

Page 24: Microxchg Analyzing Response Time Distributions for Microservices

Single Region Riak IoT

IoT Ingestion Endpoint

Stream Endpoint

Analytics Endpoint

Load Balancer

Normalization Services

Enrich Message Queue Riak KV

Enricher Services

Ingest Message Queue

Load Balancer

Load Balancer

Stream Service

Analytics Service

Page 25: Microxchg Analyzing Response Time Distributions for Microservices

Single Region Riak IoT

IoT Ingestion Endpoint

Stream Endpoint

Analytics Endpoint

Load Balancer

Normalization Services

Enrich Message Queue Riak KV

Enricher Services

Ingest Message Queue

Load Balancer

Load Balancer

Stream Service Riak TS

Analytics Service

Ingester Service

Page 26: Microxchg Analyzing Response Time Distributions for Microservices

Two Region Riak IoT

IoT Ingestion Endpoint

Stream Endpoint

Analytics Endpoint

East Region Ingestion

West Region Ingestion

Multi Region TS Analytics

Page 27: Microxchg Analyzing Response Time Distributions for Microservices

Two Region Riak IoT

IoT Ingestion Endpoint

Stream Endpoint

Analytics Endpoint

East Region Ingestion

West Region Ingestion

Multi Region TS Analytics

What’s the response time of the stream endpoint?

Page 28: Microxchg Analyzing Response Time Distributions for Microservices

Response Times

Page 29: Microxchg Analyzing Response Time Distributions for Microservices

What’s the response time of a simple service?

memcached

rds-mysql

rds-mysql

webservers

elb

www

Page 30: Microxchg Analyzing Response Time Distributions for Microservices

What’s the response time of an even simpler storage backed web service?

memcached

mysql

disk volume

web service

load generator

Page 31: Microxchg Analyzing Response Time Distributions for Microservices

See http://www.getguesstimate.com/models/1307 https://github.com/getguesstimate/guesstimate-app by Ozzie Gooen

Page 32: Microxchg Analyzing Response Time Distributions for Microservices


Page 33: Microxchg Analyzing Response Time Distributions for Microservices


Page 34: Microxchg Analyzing Response Time Distributions for Microservices


Page 35: Microxchg Analyzing Response Time Distributions for Microservices

Hit rates: memcached 40% mysql 70%

Page 36: Microxchg Analyzing Response Time Distributions for Microservices

memcached hit %

memcached response

mysql response

service cpu time

memcached hit mode

mysql cache hit mode

mysql disk access mode

Hit rates: memcached 40% mysql 70%
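The three modes in this model (memcached hit, mysql cache hit, mysql disk access) can be reproduced with a small Monte Carlo simulation. The Go sketch below is illustrative only: the hit rates come from the slide, while every latency figure, the uniform distributions, and all type and function names are placeholder assumptions and are not taken from the Guesstimate model.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// The three response-time modes named on the slide.
const (
	ModeMemcachedHit = iota
	ModeMysqlCacheHit
	ModeMysqlDisk
)

// Model holds the two hit rates used in the talk.
type Model struct {
	MemcachedHit float64 // probability memcached serves the request
	MysqlHit     float64 // probability a memcached miss hits the mysql cache
}

// Sample draws one response time in microseconds plus the mode it fell in.
// All latency values below are assumed placeholders.
func (m Model) Sample(r *rand.Rand) (float64, int) {
	cpu := 100 + 50*r.Float64() // assumed service cpu time
	if r.Float64() < m.MemcachedHit {
		return cpu + 200 + 100*r.Float64(), ModeMemcachedHit
	}
	if r.Float64() < m.MysqlHit {
		return cpu + 1000 + 500*r.Float64(), ModeMysqlCacheHit
	}
	return cpu + 8000 + 4000*r.Float64(), ModeMysqlDisk
}

// Simulate draws n samples and returns the sorted response times
// together with the number of samples that landed in each mode.
func Simulate(m Model, n int, seed int64) ([]float64, [3]int) {
	r := rand.New(rand.NewSource(seed))
	times := make([]float64, 0, n)
	var counts [3]int
	for i := 0; i < n; i++ {
		t, mode := m.Sample(r)
		times = append(times, t)
		counts[mode]++
	}
	sort.Float64s(times)
	return times, counts
}

// Percentile reads the p-th percentile from an ascending, non-empty slice.
func Percentile(sorted []float64, p float64) float64 {
	return sorted[int(p/100*float64(len(sorted)-1))]
}

func main() {
	times, counts := Simulate(Model{MemcachedHit: 0.4, MysqlHit: 0.7}, 100000, 1)
	fmt.Println("samples per mode:", counts)
	fmt.Printf("median %.0fus, 99th percentile %.0fus\n",
		Percentile(times, 50), Percentile(times, 99))
}
```

Changing the two hit rates to the other pairs shown in the talk shifts weight between the modes, which is why the mean alone says little about such a distribution.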

Page 37: Microxchg Analyzing Response Time Distributions for Microservices

Hit rates: memcached 60% mysql 70%

Page 38: Microxchg Analyzing Response Time Distributions for Microservices

memcached hit %

memcached response

mysql response

service cpu time

memcached hit mode

mysql cache hit mode

mysql disk access mode

Hit rates: memcached 60% mysql 70%

Page 39: Microxchg Analyzing Response Time Distributions for Microservices

Hit rates: memcached 20% mysql 90%

Page 40: Microxchg Analyzing Response Time Distributions for Microservices

memcached hit %

memcached response

mysql response

service cpu time

memcached hit mode

mysql cache hit mode

mysql disk access mode

Hit rates: memcached 20% mysql 90%

Page 41: Microxchg Analyzing Response Time Distributions for Microservices

Measuring Response Time With Histograms

Page 42: Microxchg Analyzing Response Time Distributions for Microservices

Changes made to codahale/hdrhistogram

Changes made to go-kit/kit/metrics (today!)

Implementation in adrianco/spigo/collect

Page 43: Microxchg Analyzing Response Time Distributions for Microservices

What to measure?

Client

Server

GetRequest

GetResponse

Client Time

Client Send CS

Server Receive SR

Server Send SS

Client Receive CR

Server Time

Page 44: Microxchg Analyzing Response Time Distributions for Microservices

What to measure?

Client

Server

GetRequest

GetResponse

Client Time

Client Send CS

Server Receive SR

Server Send SS

Client Receive CR

Response = CR - CS

Service = SS - SR

Network (request) = SR - CS

Network (response) = CR - SS

Net Round Trip = (SR - CS) + (CR - SS) = (CR - CS) - (SS - SR)

Server Time

Page 45: Microxchg Analyzing Response Time Distributions for Microservices

Spigo Histogram Collection

    func Start(listener chan gotocol.Message) {
        ...
        for {
            select {
            case msg := <-listener:
                flow.Instrument(msg, name, nethist)
                switch msg.Imposition {
                ...
                case gotocol.GetResponse:
                    // return path from a request, terminate and log response time in histograms
                    flow.End(msg, resphist, servhist, rthist)
                case gotocol.Goodbye:
                    collect.SaveHist(nethist, name, "_net")
                    collect.SaveHist(resphist, name, "_resp")
                    collect.SaveHist(servhist, name, "_serv")
                    collect.SaveHist(rthist, name, "_rt")
                    collect.SaveAllGuesses(name)
                    gotocol.Message{gotocol.Goodbye, nil, time.Now(), gotocol.NilContext, name}.GoSend(parent)
                    return
                }
            case <-chatTicker.C:
                ...
                sm = gotocol.Message{gotocol.GetRequest, listener, now, ctx, "Why"}
                flow.AnnotateSend(sm, name)
                sm.GoSend(microindex[m]) // send to a randomly chosen dependency
            }
        }
    }

Page 46: Microxchg Analyzing Response Time Distributions for Microservices

Go-Kit Histogram Collection

    const (
        maxHistObservable = 1000000
        sampleCount       = 500
    )

    func NewHist(name string) metrics.Histogram {
        var h metrics.Histogram
        if name != "" && archaius.Conf.Collect {
            h = expvar.NewHistogram(name, 1000, maxHistObservable, 1, []int{50, 99}...)
            if sampleMap == nil {
                sampleMap = make(map[metrics.Histogram][]int64)
            }
            sampleMap[h] = make([]int64, 0, sampleCount)
            return h
        }
        return nil
    }

    func Measure(h metrics.Histogram, d time.Duration) {
        if h != nil && archaius.Conf.Collect {
            if d > maxHistObservable {
                h.Observe(int64(maxHistObservable))
            } else {
                h.Observe(int64(d))
            }
            s := sampleMap[h]
            if s != nil && len(s) < sampleCount {
                sampleMap[h] = append(s, int64(d))
            }
        }
    }

Nanoseconds!

Median and 99%ile

Slice for first 500 values as samples for export to Guesstimate

Page 47: Microxchg Analyzing Response Time Distributions for Microservices

Spigo Histogram Results

    name: storage.*.*.load00....load.denominator_resp
    count: 1978
    gauges: map[50:126975 99:278527]
    From, To, Count, Prob, Bar
    28672, 29695, 1, 0.0005, :
    31744, 32767, 1, 0.0005, :
    34816, 36863, 2, 0.0010, :#
    36864, 38911, 8, 0.0040, |######
    38912, 40959, 13, 0.0066, |##########
    40960, 43007, 18, 0.0091, |##############
    43008, 45055, 12, 0.0061, |#########
    45056, 47103, 26, 0.0131, |####################
    47104, 49151, 24, 0.0121, |##################
    49152, 51199, 33, 0.0167, |#########################
    51200, 53247, 29, 0.0147, |######################
    53248, 55295, 35, 0.0177, |###########################
    55296, 57343, 39, 0.0197, |##############################
    57344, 59391, 35, 0.0177, |###########################
    59392, 61439, 43, 0.0217, |#################################
    61440, 63487, 31, 0.0157, |########################
    63488, 65535, 39, 0.0197, |##############################
    65536, 69631, 74, 0.0374, |#########################################################
    69632, 73727, 65, 0.0329, |##################################################
    73728, 77823, 57, 0.0288, |############################################
    77824, 81919, 37, 0.0187, |############################
    81920, 86015, 37, 0.0187, |############################
    86016, 90111, 30, 0.0152, |#######################
    90112, 94207, 39, 0.0197, |##############################
    94208, 98303, 28, 0.0142, |#####################
    98304, 102399, 30, 0.0152, |#######################
    102400, 106495, 31, 0.0157, |########################
    106496, 110591, 20, 0.0101, |###############
    110592, 114687, 26, 0.0131, |####################
    114688, 118783, 44, 0.0222, |##################################
    118784, 122879, 41, 0.0207, |###############################
    122880, 126975, 54, 0.0273, |##########################################
    126976, 131071, 51, 0.0258, |#######################################
    131072, 139263, 114, 0.0576, |########################################################################################
    139264, 147455, 123, 0.0622, |###############################################################################################
    147456, 155647, 127, 0.0642, |###################################################################################################
    155648, 163839, 102, 0.0516, |###############################################################################
    163840, 172031, 90, 0.0455, |######################################################################
    172032, 180223, 65, 0.0329, |##################################################
    180224, 188415, 43, 0.0217, |#################################
    188416, 196607, 60, 0.0303, |##############################################
    196608, 204799, 54, 0.0273, |##########################################
    204800, 212991, 29, 0.0147, |######################
    212992, 221183, 21, 0.0106, |################
    221184, 229375, 25, 0.0126, |###################
    229376, 237567, 18, 0.0091, |##############
    237568, 245759, 15, 0.0076, |###########
    245760, 253951, 9, 0.0046, |#######
    253952, 262143, 8, 0.0040, |######
    262144, 278527, 10, 0.0051, |#######
    278528, 294911, 6, 0.0030, |####
    294912, 311295, 2, 0.0010, |#
    327680, 344063, 2, 0.0010, :#
    344064, 360447, 1, 0.0005, |
    376832, 393215, 1, 0.0005, :

Normalized probability

Response time distribution measured in nanoseconds using High Dynamic Range Histogram

:# Zero counts skipped

|# Contiguous buckets

Total count, median and 99th percentile values
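The Prob column and the two gauges can be recomputed from the From/To/Count columns. The Go sketch below is a simplified illustration with hypothetical names, not the HdrHistogram or go-kit implementation. It reports a percentile as the upper bound of the bucket where the cumulative count first reaches the target, which is an assumption, although it is consistent with the gauges shown (126975 and 278527 are both bucket upper bounds in the listing).

```go
package main

import (
	"fmt"
	"math"
)

// Bucket is one histogram row: an inclusive value range and a count.
type Bucket struct {
	From, To, Count int64
}

// Total sums the counts of all buckets.
func Total(bs []Bucket) int64 {
	var n int64
	for _, b := range bs {
		n += b.Count
	}
	return n
}

// Probabilities returns Count/Total for each bucket (the Prob column).
func Probabilities(bs []Bucket) []float64 {
	total := float64(Total(bs))
	ps := make([]float64, len(bs))
	if total == 0 {
		return ps
	}
	for i, b := range bs {
		ps[i] = float64(b.Count) / total
	}
	return ps
}

// Percentile returns the upper bound of the first bucket at which the
// cumulative count reaches p percent of the total. Buckets must be in
// ascending order. It returns 0 for an empty histogram.
func Percentile(bs []Bucket, p float64) int64 {
	target := int64(math.Ceil(p / 100 * float64(Total(bs))))
	var cum int64
	for _, b := range bs {
		cum += b.Count
		if b.Count > 0 && cum >= target {
			return b.To
		}
	}
	return 0
}

func main() {
	// A tiny made-up histogram, not the data from the slide.
	bs := []Bucket{{0, 9, 1}, {10, 19, 2}, {20, 29, 1}}
	fmt.Println("probabilities:", Probabilities(bs))
	fmt.Println("median bucket bound:", Percentile(bs, 50))
	fmt.Println("99th percentile bucket bound:", Percentile(bs, 99))
}
```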

Page 48: Microxchg Analyzing Response Time Distributions for Microservices

Go Guesstimate Export

https://github.com/adrianco/goguesstimate

    {
      "space": {
        "name": "gotest",
        "description": "Testing",
        "is_private": "true",
        "graph": {
          "metrics": [
            {"id": "AB", "readableId": "AB", "name": "memcached", "location": {"row": 2, "column":4}},
            {"id": "AC", "readableId": "AC", "name": "memcached percent", "location": {"row": 2, "column":3}},
            {"id": "AD", "readableId": "AD", "name": "staash cpu", "location": {"row": 3, "column":3}},
            {"id": "AE", "readableId": "AE", "name": "staash", "location": {"row": 3, "column":2}}
          ],
          "guesstimates": [
            {"metric": "AB", "input": null, "guesstimateType": "DATA", "data": [119958,6066,13914,9595,6773,5867,2347,1333,9900,9404,13518,9021,7915,3733,10244,5461,12243,7931,9044,11706,5706,22861,9022,48661,15158,28995,16885,9564,17915,6610,7080,7065,12992,35431,11910,11465,14455,25790,8339,9991]},
            {"metric": "AC", "input": "40", "guesstimateType": "POINT"},
            {"metric": "AD", "input": "[1000,4000]", "guesstimateType": "NORMAL"},
            {"metric": "AE", "input": "=100+((randomInt(0,100)>AC)?AB:AD)", "guesstimateType": "FUNCTION"}
          ]
        }
      }
    }

Page 49: Microxchg Analyzing Response Time Distributions for Microservices

See http://www.getguesstimate.com

Page 50: Microxchg Analyzing Response Time Distributions for Microservices

See http://www.getguesstimate.com

Response time distributions exported directly from Spigo as 500 samples to json_metrics/storage.guess then posted to guesstimate.

Conference driven development not quite complete, go-kit PR in place to provide full names of histograms

Relationship between services will also be exported soon.

Page 51: Microxchg Analyzing Response Time Distributions for Microservices

What’s Next?

Page 52: Microxchg Analyzing Response Time Distributions for Microservices

Trends to watch for 2016:

Serverless Architectures - AWS Lambda

Teraservices - using terabytes of memory

Page 53: Microxchg Analyzing Response Time Distributions for Microservices

Teraservices

Page 54: Microxchg Analyzing Response Time Distributions for Microservices

Terabyte Memory Directions

Engulf dataset in memory for analytics

Balanced config for memory intensive workloads

Replace high end systems at commodity cost point

Explore non-volatile memory implications

Page 55: Microxchg Analyzing Response Time Distributions for Microservices

Terabyte Memory Options

Now: Diablo DDR4 DIMM containing flash
64/128/256GB
Migrates pages to/from companion DRAM DIMM
Shipping now as volatile memory, future non-volatile

Announced but not shipped for 2016
AWS X1 Instance Type - over 2TB RAM
Easy availability should drive innovation

Page 56: Microxchg Analyzing Response Time Distributions for Microservices

Diablo Memory1: Flash DIMM

NO CHANGES to CPU or Server

NO CHANGES to Operating System

NO CHANGES to Applications

✓ UP TO 256GB DDR4 MEMORY PER MODULE

✓ UP TO 4TB MEMORY IN 2 SOCKET SYSTEM

Page 57: Microxchg Analyzing Response Time Distributions for Microservices

Q&A

Adrian Cockcroft @adrianco

http://slideshare.com/adriancockcroft

Technology Fellow - Battery Ventures

See www.battery.com for a list of portfolio investments

Page 58: Microxchg Analyzing Response Time Distributions for Microservices

Security

Visit http://www.battery.com/our-companies/ for a full list of all portfolio companies in which all Battery Funds have invested.

Palo Alto Networks

Enterprise IT

Operations & Management

Big Data

Compute

Networking

Storage