driving service ownership with distributed tracing

40
Daniel “Spoons” Spoonhower, CTO and Co-founder Driving Service Ownership with Distributed Tracing

Upload: others

Post on 22-Apr-2022

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Driving Service Ownership with Distributed Tracing

Daniel “Spoons” Spoonhower, CTO and Co-founder

Driving Service Ownership with Distributed Tracing

Page 2: Driving Service Ownership with Distributed Tracing

spoons (aka Daniel Spoonhower)CTO and Co-founder, Lightstep

2

@save_spoons

[email protected]

Who am I?

Page 3: Driving Service Ownership with Distributed Tracing

What Changed?

3

Cloud Native, Microservices & DevOps

Page 4: Driving Service Ownership with Distributed Tracing

4

Page 5: Driving Service Ownership with Distributed Tracing

Observe

Controlkustomize

? ??

Page 6: Driving Service Ownership with Distributed Tracing

What you can control

What you areresponsible for

Stress (n): responsibility without control

6

Page 7: Driving Service Ownership with Distributed Tracing

Closing the gap between control and responsibility

Responsibility for delivering performance and reliability

Like many problems, the solution requires:- Having the right data- Setting the right goals- Giving teams ownership

Distributed tracing is essential to closing this gap

7

Page 8: Driving Service Ownership with Distributed Tracing

8

Accountability+

Agency

Distributed Tracing

Ownership means...

Loosely coupled services require…

Page 9: Driving Service Ownership with Distributed Tracing

Distributed Tracing

9

Page 10: Driving Service Ownership with Distributed Tracing
Page 11: Driving Service Ownership with Distributed Tracing

Relationships matter

11

Traces encode causal relationships between callers and callees

callsreturns

Page 12: Driving Service Ownership with Distributed Tracing

Traces are the raw material, not the finished product

Distributed traces – basically just structs

Distributed tracing – the art and science of deriving value from traces

12

Page 13: Driving Service Ownership with Distributed Tracing

Service Ownership

13

Benefits & Risks

Page 14: Driving Service Ownership with Distributed Tracing

Service ownership, defined

Ownership means teams are responsible for delivery of their software and services

Includes responsibilities like:- Incident response- Cost management- Fixing bugs

14

Compare to DevOps…- As a engineering culture- As a set of tools- As a feedback loop between

developers and their users

Page 15: Driving Service Ownership with Distributed Tracing

Benefits

Team are more independent ⟹ higher developer velocityEngineering team performance tied to real business metrics- Application developers- Platform engineers

15

Page 16: Driving Service Ownership with Distributed Tracing

Risks

Teams are more independent ⟹ more frameworks & tools- Higher vendor costs- More training new team members & internal transfers- Harder to get a “big picture” view of your application

16

Page 17: Driving Service Ownership with Distributed Tracing

Managing trade-offs around benefits and risks

Allowing for independence

17

Define clear responsibilities and goals

…while ensuring consistencyMeasure progress toward goals and hold teams accountable

Page 18: Driving Service Ownership with Distributed Tracing

Driving Toward Ownership

18

Page 19: Driving Service Ownership with Distributed Tracing

3 pieces to the puzzle

19

DocsOncall

SLOs

Page 20: Driving Service Ownership with Distributed Tracing

Start with expertise then ownership

Make it easy to find related…

- Telemetry & dashboards- Alert definitions- Playbooks

Use a template!

Track last-modified dates

- Require periodic audits & updates

Centralized documentation

20

Page 21: Driving Service Ownership with Distributed Tracing

Make documentation machine-readable

Use it to generate:

- Dashboard config- Escalation policy config- Deployment pipeline config- …

Make documentation necessary for day-to-day work

Centralized documentation

21

Machine-readable

Page 22: Driving Service Ownership with Distributed Tracing

It’s hard to keep service dependencies up to date manually…

So don’t!

- Use telemetry from the application

Centralized documentation

22

Machine-readable

& Dynamic!

Page 23: Driving Service Ownership with Distributed Tracing

Record who is accountable

Automate many mundane tasks

Train new team members

Build confidence

Why is documentation important?

23

Page 24: Driving Service Ownership with Distributed Tracing

Oncall is (often) responsible for…

- Incident response- Communicating status internally &

externally- Production change management

- Deploying new code - Pushing infrastructure changes

- Monitoring dashboards- Low-urgency alert triage- Customer requests

- And other interrupt driven work- Shift handoffs- Writing postmortems

Oncall

24

Page 25: Driving Service Ownership with Distributed Tracing

How to improve incident response:

- Make pages more actionable- Deliver pages to the right teams- Reduce the number of pages

Handling alerts

25

Page 26: Driving Service Ownership with Distributed Tracing

More context means faster root cause analysis

26

Page 27: Driving Service Ownership with Distributed Tracing

Send alerts directly to the teams that are responsible for taking action!

Dynamic alert delivery

27

Are We All on the Same Page? Let’s Fix That Luis Mineiro @voidmaze SRE Zalando, SREcon EMEA 2019https://www.usenix.org/conference/srecon19emea/presentation/mineiro

Page 28: Driving Service Ownership with Distributed Tracing

“How will we do better next time?”

- Make sure root causes are fixed- Improve responses for novel issues

Make sure postmortems are blameless

- By establishing what happened objectively using traces

Improving postmortems

28

Page 29: Driving Service Ownership with Distributed Tracing

Direct impact on customer experience (revenue, reputation, etc.)Time spent handing pages, writing postmortems, handling interrupts is... time not spent building new features, proactive optimizationStress of oncall has major impact on job satisfaction

Why is improving oncall important?

29

Page 30: Driving Service Ownership with Distributed Tracing

Service Level Objectives

- Promises service owners make to their customers

- Both external and internal customers- Stated in a way that can be

measured on short time scales- Days, hours, or even minutes

SLOs

30

p99 latency < 5s over the last 5 minutes

threshold

service level indicator (SLI measurement window

Common indicators

- Latency (p50, p99, etc.)- Error rate- Availability

Page 31: Driving Service Ownership with Distributed Tracing

31

“instantaneous” latency

latency over 5 minute window

Page 32: Driving Service Ownership with Distributed Tracing

Ask:

- What do your customers expect?- What can you provide today?- How do you expect that to change?

Determining SLOs

32

Page 33: Driving Service Ownership with Distributed Tracing

Derive internal SLOs using tracing

33

A

B

C

A p99 latency < 5s

B p99 latency < 5s

C p99 latency < 2.5s

D

~p99.5 latency < 2.5s

Page 34: Driving Service Ownership with Distributed Tracing

They measure success in delivering serviceTeams use them as a guide to prioritize workConsistency across your org- Hold teams accountable in a consistent way- Do it transparently

Why are SLOs important?

34

Page 35: Driving Service Ownership with Distributed Tracing

3-piece puzzle review

Documentation- Establishes ownership: who you will hold accountable- Also provides critical information for building confidence

Oncall- Oncall (and service ownership) are more than incident response

SLOs- How you measure success!

Tracing provides key information for each

35

DocsOncall

SLOs

Page 36: Driving Service Ownership with Distributed Tracing

Giving teams agency to improve reliability

- Will help them hit their goals- Lower their stress

But having agency requires

- The right information- The time to act on it

Ownership doesn’t come for free!

Budgeting for ownership

36

Page 37: Driving Service Ownership with Distributed Tracing

Next steps…

37

Page 38: Driving Service Ownership with Distributed Tracing

Making changes

Rolling out new processes/tools in a DevOps org is hard- Process/tools must provide value to app dev teams- Ideally, they are necessary parts of their day-to-day work

To establish and maintain service ownership- Use a combination of docs, oncall process, and SLOs- Manufacture a need for those process/tools where necessary

38

Page 39: Driving Service Ownership with Distributed Tracing

Ownership = Accountability + Agency

Accountability- Set deliverables and goals for service owners- Judge their performance based on those deliverables and goals

Agency- Offer the information, confidence, and budget to improve

39

Page 40: Driving Service Ownership with Distributed Tracing

Daniel “Spoons” Spoonhower, CTO and Co-founder

Thank you

@save_spoons lightstep.com