
Low-latency Network Monitoring via Oversubscribed Port Mirroring

Jeff Rasley, Brent Stephens, Colin Dixon, Eric Rozner, Wes Felter, Kanak Agarwal, John Carter, Rodrigo Fonseca

Self-Tuning Networks

Measurement → Decision → Control

Today: measurement takes 100 ms – 1 sec+, while decision and control take roughly ~100 µs and ~10 ms.

Control Loop Examples

• Traffic Engineering
• Failure Detection

How fast can we do this?

Motivation for Faster Control Loops

Background TCP flows, Microsoft data center (DCTCP, Alizadeh et al., SIGCOMM '10)

[Figure: CDF of flow duration at 1 Gbps (ms), 0–600 on the x-axis; % of flows shorter than x, 0–100 on the y-axis.]

At 100 ms, 90% of flows are over before you see them!

This gets much worse as you go from 1 Gbps → 10 Gbps.

Why is Measurement Slow?

❖ Traditionally, this didn't need to be fast

❖ Control plane CPUs are typically slow

❖ Sampling or port counter polling

❖ Is this likely to get better? Maybe

❖ Faster control plane CPUs could help, but there is still a big gap between CPUs and ASICs

Our Solution: Abuse Port Mirroring

❖ Modern switches support port mirroring

❖ Mirroring copies all packets (e.g. all packets going out a port) to a designated mirror port

❖ We abuse port mirroring to radically increase the number of samples per second we get from a switch

❖ We mirror all ports to a single mirror port

❖ Oversubscription approximates sampling (in the data plane) at much higher rates; see the sketch after this slide

[Figure: production traffic crossing a switch, with every port mirrored to a single mirror port.]
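A toy sketch of the intuition behind this slide, not the authors' implementation: if every port's traffic is funneled into one mirror port and the excess is dropped, the packets that do reach the collector behave roughly like a random sample of all traffic. The port counts, packet counts, and the assumption of random interleaving below are all illustrative; real switches tail-drop the mirror queue, so the sample is not perfectly uniform.

```python
# Toy simulation (not the authors' code): why an oversubscribed mirror
# port approximates data-plane sampling. NUM_PORTS, PKTS_PER_PORT and
# MIRROR_CAPACITY are made-up parameters.
import random
from collections import Counter

NUM_PORTS = 8            # production ports all mirrored to one mirror port
PKTS_PER_PORT = 10_000   # packets offered by each port in one interval
MIRROR_CAPACITY = 10_000 # packets the single mirror port can carry

# Aggregate traffic from all ports, tagged with the originating port.
offered = [(port, i) for port in range(NUM_PORTS) for i in range(PKTS_PER_PORT)]
random.shuffle(offered)  # models interleaving at the mirror queue

# The mirror port forwards only what fits; the excess is dropped.
mirrored = offered[:MIRROR_CAPACITY]

# The surviving stream looks like a ~1/NUM_PORTS sample of every port.
seen = Counter(port for port, _ in mirrored)
for port in range(NUM_PORTS):
    frac = seen[port] / PKTS_PER_PORT
    print(f"port {port}: sampled fraction ~= {frac:.3f} (expected {1/NUM_PORTS:.3f})")
```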

Architecture

❖ A set of collectors receives a stream of samples from the mirror ports

❖ Netmap or Intel DPDK is used for fast packet processing

❖ Collectors reconstruct flow information across all flows in the network

❖ e.g. flow throughput and port congestion

❖ Collectors can interact with an SDN controller to implement various applications

❖ e.g. traffic engineering

[Figure: hosts (H) attached to a fabric of switches (S); each switch mirrors to the collector(s), which feed an SDN controller.]
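A simplified collector sketch under stated assumptions: the real collectors pull raw frames off the mirror port with netmap or DPDK, while here the samples arrive pre-parsed as tuples, and `topology` stands in for the state the controller would share. The names, sample format, and window size are hypothetical.

```python
# Simplified collector sketch: aggregate mirrored samples into per-output-port
# byte counts per time window, using (hypothetical) controller topology state.
from collections import defaultdict

# Hypothetical topology state from the controller: destination IP -> output port.
topology = {"10.0.0.2": 1, "10.0.0.3": 2}

def collect(samples, window=0.001):
    """samples: iterable of (timestamp_sec, src_ip, dst_ip, length) tuples."""
    port_bytes = defaultdict(int)
    window_start = None
    for ts, src, dst, length in samples:
        if window_start is None:
            window_start = ts
        port = topology.get(dst)
        if port is not None:
            port_bytes[port] += length
        if ts - window_start >= window:
            yield ts, dict(port_bytes)  # congestion estimate for this window
            port_bytes.clear()
            window_start = ts

# Synthetic example: 500 mirrored 1500-byte packets, 10 µs apart.
fake = [(i * 1e-5, "10.0.0.1", "10.0.0.2", 1500) for i in range(500)]
for ts, counts in collect(fake):
    print(f"t = {ts * 1e3:.2f} ms, bytes per output port: {counts}")
```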

What Can Go Wrong?

"What, are you crazy?! By oversubscribing a mirror port there's a possibility of degrading production traffic." — Cisco

❖ We lose input/output port information from packets

❖ Recover packet meta-data by sharing topology state from the controller

❖ When the mirror port fills, its drop policy is unknown, making it hard to calculate throughput

❖ Rate estimation via TCP sequence numbers (see the sketch after this slide)

❖ The oversubscribed port may occupy switch buffer space, taking it away from production traffic

❖ Indeed, buffers were reduced; the latency of production traffic decreased, with a negligible increase in packet loss (~0.1%)
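The slide's answer to the unknown drop policy is rate estimation from TCP sequence numbers: because sequence numbers count bytes sent, any two samples of the same flow bound the bytes transferred between them, regardless of how many mirrored packets were dropped. The sketch below shows the idea under simplifying assumptions (one direction of one flow, no retransmissions, an illustrative sample format), and is not the authors' code.

```python
# Hedged sketch of rate estimation from TCP sequence numbers.

def estimate_rate(samples):
    """samples: list of (timestamp_sec, tcp_seq) for one flow, any subset
    of its packets. Returns estimated throughput in bits per second."""
    if len(samples) < 2:
        return 0.0
    (t0, seq0), (t1, seq1) = samples[0], samples[-1]
    byte_count = (seq1 - seq0) % (1 << 32)  # handle 32-bit sequence wraparound
    elapsed = t1 - t0
    return 8 * byte_count / elapsed if elapsed > 0 else 0.0

# Even if the mirror port drops most packets, the surviving first and last
# samples still give the right answer: 12 MB in 10 ms is ~9.6 Gbps.
samples = [(0.000, 1_000), (0.010, 12_001_000)]
print(f"{estimate_rate(samples) / 1e9:.1f} Gbps")
```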

Results: Sample Latency (high congestion)

[Figure: CDF of measurement latency (ms), 0–6 ms, for the IBM G8264 (10 Gb) and the Pronto 3290 (1 Gb).]

Low congestion sample latency: 75–150 µs

[Figure: measurement setup. A sender host transmits through the switch to a receiver while the switch mirrors to the collector; t_1 marks the packet leaving the sender and t_2 the collector seeing its mirrored copy. Latency = t_2 - t_1.]
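The measurement itself is just t_2 - t_1 per mirrored packet. A minimal sketch of how one could summarize such latency samples into CDF points follows; the harness, function name, and numbers are illustrative, not the authors' measurement code.

```python
# Summarize per-packet latency samples (t_2 - t_1) into CDF percentiles.
import statistics

def latency_cdf_points(latencies_us, percentiles=(50, 90, 99)):
    """latencies_us: list of (t_2 - t_1) values in microseconds."""
    cuts = statistics.quantiles(latencies_us, n=100)  # 1st..99th percentile
    return {p: cuts[p - 1] for p in percentiles}

# Illustrative numbers in the range reported on the slide (75–150 µs
# under low congestion):
fake_latencies = [75 + (i % 76) for i in range(1000)]
print(latency_cdf_points(fake_latencies))
```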

What Can You Do With This? Traffic Engineering

Stride(8), 100 MiB workload: CDF of flow throughput

[Figure: CDF of per-flow throughput comparing ECMP, Optimal, TE driven by 100 ms polling, and our TE.]

Setup
• 16 hosts
• k=4 fat tree
• IBM G8264 (10 Gb) hardware

Stride(8): a permutation of hosts such that each flow traverses the core.
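The slide does not spell out the TE algorithm, so the sketch below is only an illustration of the kind of decision such a fast control loop could make: using per-flow rates from the collectors, greedily move flows off congested links onto the least-loaded equal-cost alternative. The data model (flow id, rate, current path, candidate paths) and the heuristic are assumptions, not the authors' method.

```python
# Illustrative greedy traffic-engineering step driven by collector rates.
from collections import defaultdict

def reroute(flows, candidates, link_capacity_gbps=10.0):
    """flows: {flow_id: (rate_gbps, current_path)}; candidates: {flow_id: [paths]}.
    A path is a tuple of link ids. Returns {flow_id: new_path} for moved flows."""
    load = defaultdict(float)
    for rate, path in flows.values():
        for link in path:
            load[link] += rate

    moves = {}
    # Consider the largest flows first, moving any that cross an overloaded link.
    for fid, (rate, path) in sorted(flows.items(), key=lambda kv: -kv[1][0]):
        if all(load[link] <= link_capacity_gbps for link in path):
            continue
        def bottleneck(p):
            # Most-loaded link on p, not counting this flow's own contribution.
            return max(load[link] - (rate if link in path else 0.0) for link in p)
        best = min(candidates[fid], key=bottleneck)
        if best != path and bottleneck(best) < bottleneck(path):
            for link in path:
                load[link] -= rate
            for link in best:
                load[link] += rate
            moves[fid] = best
    return moves

# Two 8 Gbps flows colliding on core link "c1"; one gets moved to "c2".
flows = {"f1": (8.0, ("a1", "c1", "b1")), "f2": (8.0, ("a2", "c1", "b2"))}
candidates = {"f1": [("a1", "c1", "b1"), ("a1", "c2", "b1")],
              "f2": [("a2", "c1", "b2"), ("a2", "c2", "b2")]}
print(reroute(flows, candidates))  # e.g. {'f1': ('a1', 'c2', 'b1')}
```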

Conclusion

❖ Using oversubscribed port mirroring, we get ~1 million samples/sec.

❖ We get sampling latencies between 100 µs and 6 ms on real hardware, today.

❖ This improves on the state of the art (100 ms – 1 sec+) by 3–4 orders of magnitude.

Questions? Thank you!

Production Traffic Throughput

[Figure: average per-flow throughput (Gbps) vs. number of congested output ports (1–10), with and without mirroring.]

Production Traffic Latency

[Figure: average latency (ms) vs. number of congested output ports (1–10), with and without mirroring.]

Production Traffic Packet Loss

[Figure: percent of packets lost (0–0.16%) vs. number of congested output ports (1–9), with and without mirroring.]

Flow Rate Estimation

[Figure: estimated throughput (Gbps) over time (1–3 s) for flows sent at 8 Gbps and 6 Gbps.]

Rate Estimation Error

[Figure: rate estimation error.]