performance monitoring and call tracing in microservice environments

Performance Analysis and Call Tracing in Microservice Environments

Martin Gutenbrunner

Dynatrace Innovation Lab

@MartinGoodwell

Microservice Meetup Berlin – 2016-06-30

About me

Started with Commodore 8-bit (VC-20 and C-64)

Built Null-Modem connections for playing Doom and WarCraft I

Went on to IPX/SPX networks between MS-DOS 6.22 and WfW 3.11

Did DevOps before it was a thing (mainly Java and Web) for ~10 years

Now at Dynatrace Innovation Lab

Tech Lead for Azure and Microservices

Find me on Twitter: @MartinGoodwell

Passionate about life, technology and the people behind both of them.

Agenda

Traditional monitoring

What's wrong with it?

Performance in your code

The dramatic dilemma

Happy end

Questions

Please, ask and interrupt anytime!

What's your occupation?

Dev, Ops, BinExec?

What's your technology stack?

Java, .NET, Node.js

Who of you knows what APM is/does?

A lil' bit o' history

Traditional monitoring was for Ops only

APM (incl. Call Tracing) is also for devs, debugging, pre-prod

Monitoring

Host performance

CPU-usage

Memory-usage

Disk IO

Network performance

Nagios

What's wrong with it?

Nothing is wrong

Some things might just be out of scope

No insight into your application's performance

Performance in your code

a.k.a. Application Performance Management

Add monitoring code

Use statsd

statsd real quick

http://www.slideshare.net/DatadogSlides/dev-opsdays-tokyo2013effectivestatsdmonitoring
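
To make "statsd real quick" concrete: statsd accepts small plain-text datagrams over UDP in the form `name:value|type`. The following is a minimal sketch of building and sending such datagrams. The metric names, the helper names and the host are assumptions for illustration; port 8125 is the statsd default.

```python
import socket


def format_metric(name, value, metric_type, sample_rate=None):
    """Build one statsd datagram: <name>:<value>|<type>[|@<sample_rate>]."""
    line = f"{name}:{value}|{metric_type}"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; the application never waits for statsd."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metrics: count a request and record its duration in ms.
counter = format_metric("checkout.requests", 1, "c")
timer = format_metric("checkout.response_time", 320, "ms")
```

Calling `send_metric(counter)` would ship the datagram to a local statsd daemon. Because UDP is connectionless, the monitored code is not slowed down or broken when the daemon is unavailable.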

Use JMX

Aspect oriented programming

http://veerasundar.com/blog/2010/01/spring-aop-example-profiling-method-execution-time-tutorial/
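
The linked tutorial uses Spring AOP in Java. As a rough analogue of the same idea (timing a method without touching its body), here is a sketch using a Python decorator. It is not the tutorial's code; `timed`, `timings` and `load_order` are made-up names, and the list stands in for a real metrics backend.

```python
import functools
import time

# Collected (function name, elapsed seconds) pairs; a stand-in for statsd etc.
timings = []


def timed(func):
    """Wrap func so every call records its execution time."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even if func raises, so failures are measured too.
            timings.append((func.__name__, time.perf_counter() - start))
    return wrapper


@timed
def load_order(order_id):
    # The business code itself contains no monitoring statements.
    return {"id": order_id}


order = load_order(7)
```

The measurement lives in one place and is applied declaratively, which is the point of the aspect-oriented approach: the business method stays free of monitoring code.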

Graphite Visualization

Any downsides here?

Basic approaches are subject to polluting your code

AOP is the better choice, but requires advanced skills

If you're not using something like statsd, it's hard to have a central spot for all the performance data of your different components

Great for performance insights of single components

What about 3rd parties?

Or distributed systems?

Like, microservices, maybe

What about components which we can't modify?

Like databases, message queues, ...

Best case: use readily available APIs or integrations (statsd, JMX, etc)

For open-source: apply same technique as to your own code

Keeping in sync with original code can become tedious

try to make your changes part of the original project

Use dedicated monitoring tools

Very common for databases

BUT even the best tool is an additional tool

How long does it take to get a new team member up-to-speed?

Microservices

Microservices vs SOA

Microservices

fit the scope of a single application

Service Oriented Architecture

is scoped to fit enterprises / environments / infrastructures

For a dev, microservices hardly pose any downsides

On the upside, the code size and the scope of the domain become smaller

Any best practices for analyzing performance of a single microservice are still valid

The real challenge of microservices is proper operation

What's the challenge in monitoring microservices?

The big challenge of well-performing microservices is the communication between the microservices

Not the high performance of a single microservice

Tracing calls between services is very difficult

Source: http://theburningmonk.com/2015/05/a-consistent-approach-to-track-correlation-ids-through-microservices/

Call Tracing

Source: http://theburningmonk.com/2015/05/a-consistent-approach-to-track-correlation-ids-through-microservices/
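
The core of the correlation-ID approach can be sketched in a few lines: the first service to see a request creates an ID, and every service copies it onto its outgoing calls. This is a minimal sketch, not the code from the linked article. The header name `X-Correlation-Id` and both function names are assumptions.

```python
import uuid

# Assumed header name; any name works as long as all services agree on it.
CORRELATION_HEADER = "X-Correlation-Id"


def ensure_correlation_id(incoming_headers):
    """Reuse the caller's correlation ID, or start a new one at the edge."""
    return incoming_headers.get(CORRELATION_HEADER) or str(uuid.uuid4())


def outgoing_headers(correlation_id, extra=None):
    """Headers for a downstream call, always carrying the correlation ID."""
    headers = dict(extra or {})
    headers[CORRELATION_HEADER] = correlation_id
    return headers


# Service A receives a request without an ID and calls service B.
cid_a = ensure_correlation_id({})
to_b = outgoing_headers(cid_a, {"Accept": "application/json"})

# Service B finds the same ID on its incoming request.
cid_b = ensure_correlation_id(to_b)
```

Real HTTP header names are case-insensitive, so a production version would use the header abstraction of its web framework rather than a plain dict.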

In Java

https://taidevcouk.wordpress.com/category/experiments/

C#

http://theburningmonk.com/2015/05/a-consistent-approach-to-track-correlation-ids-through-microservices/

Leverage existing tools

https://github.com/ordina-jworks/microservices-dashboard

Spring Cloud Sleuth

Sleuth: https://github.com/spring-cloud/spring-cloud-sleuth

Spring Cloud Sleuth implements a distributed tracing solution for Spring Cloud.

http://cloud.spring.io/

Zipkin

Trace

https://trace.risingstack.com/

So, do we have everything we need now?

Usually, one tracing solution only covers a single technology

Besides visualization, you'll also want log analysis

The ELK stack does this really well, especially in connection with correlation IDs

But the ELK stack does no visualization

And your visualization does no log analysis

Yet another tool

Don't get me started about integrating all this with host monitoring...

The trace ends where your code ends

No correlation IDs for database calls
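
Log analysis with correlation IDs only works if every log line actually carries the ID. One way to achieve that is sketched below, assuming Python's standard `logging` module. The logger name, the log format and the ID value are made up, and the in-memory stream stands in for whatever ships logs to the ELK stack.

```python
import contextvars
import io
import logging

# Holds the ID of the request currently being handled; "-" means "none".
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True  # never drop the record, only enrich it


stream = io.StringIO()  # stand-in for a real log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s cid=%(correlation_id)s %(message)s"))
handler.addFilter(CorrelationIdFilter())

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Set once per incoming request, e.g. from the request header.
correlation_id.set("abc-123")
logger.info("order accepted")
```

With the ID in a fixed field, a single search for `cid=abc-123` returns the log lines of all services that took part in the request. It still stops at components that do not log the ID, such as the database.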

What's next?

Considerations for custom implementations

Multitude of languages

Open-source tools can get expensive

Manual configuration

Often only applicable to a single technology

Keeping pace with new technology

Serverless code (e.g. AWS Lambda, Azure Functions)

http://de.slideshare.net/InfoQ/netflix-built-its-own-monitoring-system-and-why-you-probably-shouldnt

The Ops' dilemma

how to handle all this in production

how to identify production issues

how to tell the devs what they should look into without tearing everything down

All fine?

While the Dev can leverage a huge number of tools, libs and frameworks, it's still up to the Ops to integrate them into a single, unified solution that allows drawing the right conclusions

From Dev to Prod

Dev

Single transaction

Deal with a specific problem

No impact on real users and business

Can concentrate on single component

"perfect world"

A dev's deadline is made of Sprints

A couple of weeks, usually

Ops

100s or 1000s of transactions

No idea what the problem is

Slow or bad requests impact real users and business

Lots of components that might not be under your control

Ops' deadline is made of SLAs

Hours, maybe just minutes

The Dev-Ops-Dev-Ops-Dev-Ops dilemma

Dev: Sprint (days / weeks)

Ops: SLA (hours / minutes)

From Prod to Dev

Dev

Single transaction

Deal with a specific problem

No impact on real users and business

Can concentrate on single component

"perfect world"

Ops

100s or 1000s of transactions

No idea what the problem is

Slow or bad requests impact real users and business

Lots of components that might not be under your control

Which?

Which?

Time!

Reproduce?

Commercial solutions

Dynatrace Ruxit

Set-up in 5 minutes

Install a single monitoring agent per host

Everything is auto-detected

No changes to your source-code

No changes to runtime configuration

Supports a wide array of technologies

http://www.dynatrace.com/en/ruxit/technologies/

Traditional metrics

Service metrics

Does not end at your custom components

Baselining

Automatically detects and correlates problems without setting thresholds
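
To illustrate what "without setting thresholds" can mean in general, the sketch below derives the alert limit from the metric's own recent history instead of a hand-picked number. This is a generic textbook-style approach under simple assumptions and not a description of how Ruxit works; the function name, parameters and sample data are all made up.

```python
import statistics


def is_anomalous(history, latest, min_samples=5, sigmas=3.0):
    """Flag latest if it deviates strongly from the baseline in history."""
    if len(history) < min_samples:
        return False  # not enough data to form a baseline yet
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    # Floor the spread so a very flat history does not flag tiny jitter.
    spread = max(stdev, 0.05 * abs(mean), 1e-9)
    return abs(latest - mean) > sigmas * spread


# Hypothetical response times of one service in milliseconds.
response_times_ms = [102, 98, 105, 99, 101, 97, 103]

normal = is_anomalous(response_times_ms, 104)
spike = is_anomalous(response_times_ms, 250)
```

Here 104 ms stays within the learned band while 250 ms is flagged, without anyone having configured a limit for this particular service.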

Includes the Client-side

Browser auto-injection

Includes client-side JavaScript in traces and problem-correlation

Visualization

Call Tracing

Solving a dilemma

Include this URL in a trouble ticket and the Dev can jump in right away

Supporting most popular technologies

• Java

• .NET

• Node.js

• PHP

• Databases via JDBC, ADO.NET, PDO

• Message Queues

• Caches

• Cloud Infrastructure Metrics

• See more at

Dynatrace Ruxit

2016 hours for free

http://bit.ly/monitoring-2016

References

https://www.nagios.org

https://github.com/etsy/statsd/wiki

http://veerasundar.com/blog/2010/01/spring-aop-example-profiling-method-execution-time-tutorial/

http://theburningmonk.com/2015/05/a-consistent-approach-to-track-correlation-ids-through-microservices/

http://apmblog.dynatrace.com/2014/06/17/software-quality-metrics-for-your-continuous-delivery-pipeline-part-iii-logging/

https://blog.buoyant.io/2016/05/17/distributed-tracing-for-polyglot-microservices/

https://blog.init.ai/distributed-tracing-the-most-wanted-and-missed-tool-in-the-micro-service-world-c2f3d7549c47#.93r1dj6ah
