Trumpet: Timely and Precise Triggers in Data Centers Masoud Moshref, Minlan Yu Ramesh Govindan, Amin Vahdat


Page 1: Trumpet: Timely and Precise Triggers in Data Centersconferences.sigcomm.org/sigcomm/...Paper02-Trumpet... · Trumpet: Timely and Precise Triggers in Data Centers Masoud Moshref, Minlan

Trumpet: Timely and Precise Triggers in Data Centers

Masoud Moshref, Minlan Yu, Ramesh Govindan, Amin Vahdat

The Problem

Human-in-the-loop failure assessment and repair

Long failure repair times in large networks

Evolve or Die, SIGCOMM 2016

Humans in the Loop

Detect

Locate

Inspect

Fix

Programs in the Loop

Detect

Locate

Inspect

Fix

Programs in the loop

Our Focus

Detect

A framework for programmed detection of events in large datacenters

Events

Link failure, DDoS, Traffic surge, Packet delay, Lost packet, Packet burst, Switch failure, Incast, Load imbalance, Blackhole, Congestion, Traffic hijack, Loop, Middlebox failure, Burst loss

❖ Availability
❖ Performance
❖ Security

Our Focus

Detect

Aggregated, often sampled measures of network health

Fine Timescale Events

40 ms burst

Timeouts lasting several 100 ms

Detecting Transient Congestion

Fine Timescale Events

Did this tenant see a sudden increase in traffic over the last few milliseconds?

Detecting Attack Onset

Inspect Every Packet

Link failure, DDoS, Traffic surge, Packet delay, Lost packet, Packet burst, Switch failure, Incast, Load imbalance, Blackhole, Congestion, Traffic hijack, Loop, Middlebox failure, Burst loss

Some event definitions may require inspecting every packet

Eventing Framework Requirements

Expressivity
▸ Set of possible events not known a priori

Fine timescale eventing
▸ Capture transient and onset events

Per-packet processing
▸ Precise event determination

Because data centers will require high availability and high utilization

A Key Architectural Question

Where do we place eventing functionality?

Switches, Hosts, NICs

❖ Are programmable
❖ Have processing power for fine-timescale eventing
❖ Already inspect every packet

We explore the design of a host-based eventing framework

Research Questions

What eventing architecture permits programmability and visibility?

How can we achieve precise eventing at fine timescales?

What is the performance envelope of such an eventing framework?

Research Questions

What eventing architecture permits programmability and visibility?

How can we achieve precise eventing at fine timescales?

What is the performance envelope of such an eventing framework?

Trumpet has a logically centralized event manager that aggregates local events from per-host packet monitors
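To make the aggregation step concrete, here is a minimal sketch of a central manager that combines per-host reports into network-wide events. The report format (`TriggerReport`), the function name `aggregate_events`, the use of a plain sum, and the threshold value are all assumptions made for illustration; the slides do not specify them.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (host, event name, interval index, locally measured bytes): a made-up report format.
TriggerReport = Tuple[str, str, int, int]


def aggregate_events(reports: Iterable[TriggerReport],
                     thresholds: Dict[str, int]) -> List[Tuple[str, int, int]]:
    """Sum per-host reports for each (event, interval); keep sums above the threshold."""
    totals: Dict[Tuple[str, int], int] = defaultdict(int)
    for _host, event, interval, value in reports:
        totals[(event, interval)] += value
    return sorted(
        (event, interval, total)
        for (event, interval), total in totals.items()
        if total > thresholds[event]
    )


# Two hosts each see part of a job's traffic; only the combined view crosses the limit.
demo_reports = [
    ("host-a", "job_volume", 0, 60_000_000),
    ("host-b", "job_volume", 0, 50_000_000),
    ("host-a", "job_volume", 1, 30_000_000),
]
events = aggregate_events(demo_reports, {"job_volume": 100_000_000})
```

In this sketch neither host alone exceeds the assumed 100 MB limit in interval 0, which is the kind of case where a logically centralized view is needed.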

Event Definition

For each packet matching [Filter], group by [Flow-granularity], and report every [Time-interval] each group that satisfies [Predicate]

Flow volumes, loss rate, loss pattern (bursts), delay
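The four-part event definition can be sketched as a small data structure plus an evaluator. This is only an illustration under stated assumptions: the `Packet` and `Trigger` classes, the `evaluate` function, and the demo threshold are hypothetical and are not Trumpet's actual API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List


@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int
    size: int            # bytes
    timestamp_ms: float


@dataclass(frozen=True)
class Trigger:
    """Hypothetical container for the four parts of an event definition."""
    packet_filter: Callable[[Packet], bool]          # which packets to consider
    flow_granularity: Callable[[Packet], Hashable]   # how to group them
    interval_ms: float                               # how often to report
    predicate: Callable[[List[Packet]], bool]        # condition on each group


def evaluate(trigger: Trigger, packets: List[Packet]) -> Dict[int, List[Hashable]]:
    """Return, per interval index, the groups that satisfy the predicate."""
    buckets = defaultdict(lambda: defaultdict(list))
    for pkt in packets:
        if trigger.packet_filter(pkt):
            epoch = int(pkt.timestamp_ms // trigger.interval_ms)
            buckets[epoch][trigger.flow_granularity(pkt)].append(pkt)
    return {
        epoch: sorted(key for key, group in groups.items() if trigger.predicate(group))
        for epoch, groups in buckets.items()
    }


# Illustrative trigger: every 10 ms, report source IPs in 10.1.1.0/24
# that sent more than 2000 bytes (threshold chosen only for the demo).
volume_trigger = Trigger(
    packet_filter=lambda p: p.src_ip.startswith("10.1.1."),
    flow_granularity=lambda p: p.src_ip,
    interval_ms=10,
    predicate=lambda group: sum(p.size for p in group) > 2000,
)

demo_packets = [
    Packet("10.1.1.5", "10.2.0.1", 1234, 80, 6, 1500, 1.0),
    Packet("10.1.1.5", "10.2.0.1", 1234, 80, 6, 1500, 4.0),
    Packet("10.1.1.9", "10.2.0.1", 5555, 80, 6, 1500, 6.0),
    Packet("10.9.9.9", "10.2.0.1", 7777, 80, 6, 9000, 7.0),   # filtered out
    Packet("10.1.1.5", "10.2.0.1", 1234, 80, 6, 1500, 12.0),  # next interval
]
reports = evaluate(volume_trigger, demo_packets)
```

Keeping the filter, grouping key, and predicate as separate callables mirrors the slide's template: any of the four parts can be swapped without touching the others.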

Event Example

For each packet matching Service IP Prefix, group by 5-tuple, and report every 10ms any flow whose sum (is_lost & is_burst) > 10%

Is there any flow sourced by a service that sees a burst of losses in a small interval?

Event Example

For each packet matching Cluster IP Prefix and Port, group by Job IP Prefix, and report every 10ms any job whose sum (volume) > 100MB

Is there a job in a cluster that sees abnormal traffic volumes in a small interval?

Trumpet Design

[Diagram labels: Controller with Trumpet Event Manager; Servers with VMs, Hypervisor, Software switch, and Trumpet Packet Monitor; arrows labeled Triggers, Trigger Reports, Event, Report]

Trumpet Event Manager

Congestion?

Congestion Triggers

Contains event attributes, detects local events


Trumpet Event Manager

Large flow?

Large Flow Triggers

Trumpet can be used by programs to drill down to potential root causes

Research Questions

What eventing architecture permits programmability and visibility?

How can we achieve precise eventing at fine timescales?

What is the performance envelope of such an eventing framework?

The monitor optimizes packet processing to inspect every packet and evaluate predicates at fine timescales

The Packet Monitor

[Diagram labels: Server, VM, VM, Hypervisor, Trumpet Packet Monitor, Software switch]

A Key Assumption

Piggyback on CPU core used by software switch
❖ Conserves server CPU resources
❖ Avoids inter-core synchronization

Can a single core monitor thousands of triggers at full packet rate (14.8 Mpps) on a 10G NIC?
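The quoted packet rate follows from standard Ethernet framing, as the short calculation below shows. It assumes minimum-size 64-byte frames plus the usual 20 bytes of per-frame wire overhead (8 bytes of preamble and start-of-frame delimiter, 12 bytes of inter-frame gap); the variable names are illustrative.

```python
LINK_BPS = 10e9
FRAME_BYTES = 64        # minimum Ethernet frame
OVERHEAD_BYTES = 20     # preamble + start-of-frame delimiter (8) + inter-frame gap (12)

packets_per_second = LINK_BPS / ((FRAME_BYTES + OVERHEAD_BYTES) * 8)
ns_per_packet = 1e9 / packets_per_second

print(f"{packets_per_second / 1e6:.2f} Mpps, {ns_per_packet:.1f} ns per packet")
# prints "14.88 Mpps, 67.2 ns per packet"
```

A single core therefore has roughly 67 ns per packet at line rate, which is in line with the approximately 70 ns per-packet figure quoted in these slides.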

Two Obvious Tricks

Use kernel bypass
▸ Avoid kernel stack overhead

Use polling to have tighter scheduling
▸ Trigger time intervals at 10ms

Necessary, but far from sufficient…

Monitor Design

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

Filter: Source IP = 10.1.1.0/24; Source IP = 20.2.2.0/24
Flow granularity: Service IP prefix; 5-tuple
Time interval: 10ms; 100ms
Predicate: Sum(loss) > 10%; Sum(size) < 10MB

With 1000s of triggers

Design Challenges

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

Which of these should be performed
❖ On-path
❖ Off-path

Design Challenges

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

Which operations to do on-path?
❖ 70ns to forward and inspect packet

Design Challenges

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

How to schedule off-path operations?
❖ Off-path on same core, can delay packets
❖ Bound delay to a few µs

Strawman Design

[Diagram: the pipeline Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval, with a Packet History stage, divided into On-Path and Off-Path parts]

Doesn’t scale to large numbers of triggers

Strawman Design

[Diagram: the pipeline Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval, divided into On-Path and Off-Path parts]

Still cannot reach goal
❖ Memory subsystem becomes a bottleneck

Trumpet Monitor Design

On-Path: Packet → Match filters → Update statistics at 5-tuple granularity

Off-Path: Gather statistics at flow granularity → Check predicate at time-interval
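The split between cheap per-packet updates and periodic aggregation can be sketched as follows. This is a simplified illustration, not Trumpet's implementation: the names `on_path` and `off_path`, the byte-count statistic, and the use of dotted-octet prefixes as the coarser flow granularity are assumptions.

```python
from collections import defaultdict
from typing import Dict, Tuple

FiveTuple = Tuple[str, str, int, int, int]  # src IP, dst IP, src port, dst port, protocol

five_tuple_bytes: Dict[FiveTuple, int] = defaultdict(int)


def on_path(five_tuple: FiveTuple, size: int) -> None:
    """Per-packet work: a single counter update keyed by the 5-tuple."""
    five_tuple_bytes[five_tuple] += size


def off_path(prefix_octets: int, threshold_bytes: int) -> Dict[str, int]:
    """Periodic work: roll 5-tuple counters up to a source prefix, then test the predicate."""
    per_prefix: Dict[str, int] = defaultdict(int)
    for (src_ip, *_rest), total in five_tuple_bytes.items():
        prefix = ".".join(src_ip.split(".")[:prefix_octets])
        per_prefix[prefix] += total
    return {p: b for p, b in per_prefix.items() if b > threshold_bytes}


on_path(("10.1.1.5", "10.2.0.1", 1234, 80, 6), 1500)
on_path(("10.1.1.9", "10.2.0.1", 999, 80, 6), 1500)
on_path(("10.3.0.2", "10.2.0.1", 42, 80, 6), 500)
violations = off_path(prefix_octets=3, threshold_bytes=2000)
```

The per-packet path touches one table entry regardless of how many triggers exist; the cost of combining entries for coarser granularities is paid once per interval instead of once per packet.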

Optimizations

On-Path: Packet → Match filters → Update statistics at 5-tuple granularity

❖ Use tuple-space search for matching
❖ Match on first packet, cache match
❖ Lay out tables to enable cache prefetch
❖ Use TLB huge pages for tables
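The first two optimizations can be illustrated with a small sketch of tuple-space search combined with a per-flow match cache. It is deliberately simplified: filters here are only (source prefix, destination prefix) pairs over IPv4, and the class `TupleSpaceMatcher` and its methods are hypothetical names, not Trumpet's code.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, List, Tuple


def mask(ip: str, prefix_len: int) -> int:
    """Keep the top prefix_len bits of an IPv4 address."""
    addr = int(ipaddress.IPv4Address(ip))
    shift = 32 - prefix_len
    return (addr >> shift) << shift


class TupleSpaceMatcher:
    def __init__(self) -> None:
        # One hash table per (src prefix length, dst prefix length) "tuple".
        self.tables: Dict[Tuple[int, int], Dict[Tuple[int, int], List[str]]] = defaultdict(dict)
        self.flow_cache: Dict[Tuple[str, str], List[str]] = {}

    def add_trigger(self, name: str, src_cidr: str, dst_cidr: str) -> None:
        src = ipaddress.IPv4Network(src_cidr)
        dst = ipaddress.IPv4Network(dst_cidr)
        key = (int(src.network_address), int(dst.network_address))
        self.tables[(src.prefixlen, dst.prefixlen)].setdefault(key, []).append(name)
        self.flow_cache.clear()  # cached matches may now be stale

    def match(self, src_ip: str, dst_ip: str) -> List[str]:
        flow = (src_ip, dst_ip)
        if flow in self.flow_cache:      # later packets of a flow: one lookup
            return self.flow_cache[flow]
        matched: List[str] = []
        # First packet of a flow: one hash lookup per distinct mask combination.
        for (src_len, dst_len), table in self.tables.items():
            matched += table.get((mask(src_ip, src_len), mask(dst_ip, dst_len)), [])
        self.flow_cache[flow] = sorted(matched)
        return self.flow_cache[flow]


matcher = TupleSpaceMatcher()
matcher.add_trigger("svc_loss", "10.1.1.0/24", "0.0.0.0/0")
matcher.add_trigger("svc_size", "20.2.2.0/24", "0.0.0.0/0")
matcher.add_trigger("pair", "10.1.0.0/16", "10.2.0.0/16")
first = matcher.match("10.1.1.7", "10.2.3.4")
```

The lookup cost for a new flow grows with the number of distinct mask combinations rather than the number of triggers, and the cache removes even that cost for subsequent packets of the same flow.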

Optimizations

Off-Path: Gather statistics at flow granularity → Check predicate at time-interval

❖ Lazy cleanup of statistics across intervals
❖ Lay out tables to enable cache prefetch
❖ Bounded-delay cooperative scheduling
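One common way to realize lazy cleanup is to stamp each statistics entry with the interval in which it was last written and reset it on the next touch, so that an interval boundary never requires a sweep over the whole table. The sketch below shows that idea under this assumption; the slides do not spell out the mechanism, and the class name `LazyCounters` is hypothetical.

```python
from typing import Dict, Hashable, List


class LazyCounters:
    """Per-flow byte counters that are reset on next touch instead of eagerly."""

    def __init__(self) -> None:
        self.epoch = 0
        self.entries: Dict[Hashable, List[int]] = {}  # key -> [epoch stamp, bytes]

    def next_interval(self) -> None:
        self.epoch += 1            # O(1): no sweep over the table

    def add(self, key: Hashable, size: int) -> None:
        entry = self.entries.setdefault(key, [self.epoch, 0])
        if entry[0] != self.epoch:  # stale value from an earlier interval
            entry[0], entry[1] = self.epoch, 0
        entry[1] += size

    def read(self, key: Hashable) -> int:
        entry = self.entries.get(key)
        return entry[1] if entry and entry[0] == self.epoch else 0


counters = LazyCounters()
counters.add("flow-1", 100)
counters.add("flow-1", 50)
before = counters.read("flow-1")
counters.next_interval()
after_boundary = counters.read("flow-1")
counters.add("flow-1", 10)
after_touch = counters.read("flow-1")
```

The trade-off is one extra comparison per update in exchange for avoiding a burst of reset work at every interval boundary.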

Bounded Delay Cooperative Scheduling

[Diagram labels: Off-Path, On-Path, Bounded Delay]

Bound delay to a few µs
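Cooperative scheduling with a bounded delay can be sketched with a generator that yields after each small unit of off-path work and a runner that stops resuming it once a time budget is spent. The function names, the budget value, and the injected fake clock (used so the example is deterministic) are assumptions for illustration only.

```python
import time
from typing import Callable, Iterator, List


def sweep_flows(flows: List[str], checked: List[str]) -> Iterator[None]:
    """Off-path job written as a generator: yield after each small unit of work."""
    for flow in flows:
        checked.append(flow)   # stand-in for gathering statistics and checking predicates
        yield


def run_off_path(job: Iterator[None], budget: float,
                 clock: Callable[[], float] = time.perf_counter) -> bool:
    """Resume the job until the budget (in clock units) is spent; True when it finished."""
    deadline = clock() + budget
    while clock() < deadline:
        try:
            next(job)
        except StopIteration:
            return True
    return False


checked: List[str] = []
job = sweep_flows(["a", "b", "c", "d", "e"], checked)
ticks = iter(range(100))     # fake clock: advances one unit per reading
rounds = 0
done = False
while not done:
    # ... on-path packet processing would run here between off-path slices ...
    done = run_off_path(job, budget=3, clock=lambda: next(ticks))
    rounds += 1
```

With this structure, packets wait for at most the budget plus the duration of one work unit, so each unit between yields has to be kept small for the bound to hold.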

Research Questions

What eventing architecture permits programmability and visibility?

How can we achieve precise eventing at fine timescales?

What is the performance envelope of such an eventing framework?

Trumpet can monitor thousands of triggers at full packet rate on a 10G NIC

Evaluation

Trumpet is expressive
❖ Transient congestion
❖ Burst loss
❖ Attack onset

Trumpet scales to thousands of triggers

Trumpet is DoS-Resilient

Detecting Transient Congestion

[Plot labels: Congestion, Large Flow (Reactive), 40 ms]

Trumpet can detect millisecond scale congestion events

Scalability

Trumpet can process❉ 14.8 Mpps
❖ 64 byte packets at 10G
❖ 650 byte packets at 4x10G

… while evaluating 16K triggers at 10ms granularity

❉ Xeon E5-2650, 10-core 2.3 GHz, Intel 82599 10G NIC

Performance Envelope

[Plot axes: Triggers matched by each flow; How often each predicate is checked]

Above this rate, Trumpet would miss events

Performance Envelope

At moderate packet rates, can detect events at 1ms

Number of <trigger, flow> pairs increases statistics gathering overhead

Performance Envelope

Need to profile and provision Trumpet deployment

Above 10ms, CPU can sustain full packet rate

Conclusion

Future datacenters will need fast and precise eventing
▸ Trumpet is an expressive system for host-based eventing

Trumpet can process 16K triggers at full packet rate
▸ … without delaying packets by more than 10 µs

Future work: scale to 40G NICs
▸ … perhaps with NIC or switch support

https://github.com/USC-NSL/Trumpet

A Big Discrepancy

Outage budget for five 9s availability

24 seconds per month

99.999% uptime

Long failure durations due to time to root-cause failures

Every optimization is necessary❉

❉ Details in the paper