TRANSCRIPT
NetSoft 2015
Dependability Evaluation and Benchmarking of Network Function Virtualization Infrastructures
1st IEEE Conference on Network Softwarization (NetSoft) April 13-17, 2015, UCL, London
Best Conference Paper Award
D. Cotroneo, L. De Simone, A.K. Iannillo, A. Lanzaro, R. Natella Critiware s.r.l. and Federico II University of Naples, Italy
http://dx.doi.org/10.1109/NETSOFT.2015.7116123
Network Function Virtualization: A new paradigm
Source: “Network Functions Virtualisation – Introductory White Paper”, Issue 1
- Reduced costs, improved manageability, faster innovation
- Comparable performance and reliability?
Why is engineering reliable NFV challenging?
- Complex stack of off-the-shelf hardware and software components
- Exposure to several sources of hardware and software faults
- Lack of tools and methodologies for testing fault tolerance
[Figure: three virtualization stacks, each built from off-the-shelf hardware, a hypervisor/virtualization layer, VMs, and guest OSes, with question marks at every layer]

As a result, it is hard to trust the reliability of NFV services.
In this presentation:
- An experimental methodology for dependability benchmarking of NFV, based on fault injection
- A case study on a virtual IP Multimedia Subsystem (IMS), analyzing:
  - The impact of faults on performance and availability
  - The sensitivity to different types of faults
  - The pitfalls in the design of NFVIs
What is a dependability benchmark?
- A dependability benchmark evaluates a system in the presence of (deliberately injected) faults
  - Are NFV services still available and high-performing even when a fault is injected?
- The dependability benchmark includes:
  - measures (KPIs) for characterizing performance and availability
  - procedures, tools, and conditions under which the measures are obtained
Overview of the benchmarking process
1. Definition of workload, faultload, and measures
2. Fault injection experiments, iterated over several different faults:
   - Deployment of VNFs over the NFVI
   - Workload and VNF execution
   - Injection of the i-th fault
   - Data collection
   - Testbed clean-up
3. Computation of measures and reporting
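The experiment loop above can be sketched in code. This is a hypothetical outline, not an API from the paper; the deploy/inject/collect helpers are placeholder callables for real testbed operations:

```python
# Hypothetical sketch of the benchmarking loop; the helper callables are
# placeholders for real testbed operations, not an API from the paper.

def run_benchmark(faults, deploy, run_workload, inject, collect, cleanup):
    """Run one fault-injection experiment per fault and gather raw data."""
    raw = {}
    for fault in faults:          # iterated over several different faults
        deploy()                  # deployment of VNFs over the NFVI
        run_workload()            # workload and VNF execution
        inject(fault)             # injection of the i-th fault
        raw[fault] = collect()    # data collection
        cleanup()                 # testbed clean-up
    return raw                    # measures are then computed and reported

# Usage with no-op stubs:
noop = lambda *args: None
data = run_benchmark(["net_drop", "cpu_hog"],
                     noop, noop, noop, lambda: {"latency_ms": []}, noop)
```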
Benchmark measures
- The dependability benchmark measures the quality of service as perceived by NFV users:
  1. VNF latency
  2. VNF throughput
  3. VNF experimental availability
  4. Risk Score
- We compare fault-injected experiments against the QoS objectives and the fault-free experiment (the benchmark baseline)
VNF Latency and Throughput
[Figure: end points exchanging traffic with VNFs running on the virtualization layer over off-the-shelf hardware and software, under fault injection; latency measured from t_request to t_response]

VNF Latency: the time required to process a unit of traffic (such as a packet or a service request)

VNF Throughput: the rate of processed traffic (packets or service requests) per second
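As a minimal illustration of the two definitions (the timestamps and counts below are synthetic, not measurements from the paper):

```python
# Minimal illustration of the two KPI definitions; numbers are synthetic.

def vnf_latency_ms(t_request, t_response):
    """Time required to process one traffic unit, in milliseconds."""
    return (t_response - t_request) * 1000.0

def vnf_throughput(processed_units, duration_s):
    """Rate of processed traffic units per second."""
    return processed_units / duration_s

print(vnf_latency_ms(10.000, 10.120))   # roughly 120 ms for one request
print(vnf_throughput(3000, 60.0))       # 50 requests/s
```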
Characterization of VNF latency

[Figure: cumulative distribution of response latency (ms), comparing the fault-free run against fault-injected runs with good and bad performance; 50th and 90th percentiles marked, with the gap from QoS objectives highlighted]

Percentiles of the distribution are compared against QoS objectives, e.g.:
- 50th percentile ≤ 150 ms
- 90th percentile ≤ 250 ms
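This percentile check can be sketched with Python's statistics module; the latency samples below are made up for illustration:

```python
import statistics

# Sketch of the percentile-vs-QoS check; the latency samples are made up.

def percentile(samples, p):
    """p-th percentile (linear interpolation over the sorted sample)."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[p - 1]

def meets_qos(latencies_ms, p50_max=150.0, p90_max=250.0):
    return (percentile(latencies_ms, 50) <= p50_max and
            percentile(latencies_ms, 90) <= p90_max)

latencies = [80, 95, 110, 120, 140, 160, 180, 210, 240, 300]  # ms
print(meets_qos(latencies))  # this sample satisfies both objectives
```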
VNF Experimental Availability
[Figure: end points sending traffic through VNFs on the virtualization layer over off-the-shelf hardware and software, with fault injection applied]

Experimental availability: the percentage of traffic units that are successfully processed
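The definition translates directly into code (the counts below are illustrative):

```python
# Experimental availability: percentage of traffic units (e.g., SIP
# requests) successfully processed during the experiment.
# The counts below are illustrative.

def experimental_availability(successful_units, total_units):
    if total_units == 0:
        return 0.0
    return 100.0 * successful_units / total_units

print(experimental_availability(930, 1000))  # 93.0 (%)
```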
Risk Score
- The Risk Score is a single summary measure of the risk of experiencing service unavailability and/or performance failures
RS = weighted average, over all injected faults, of (% performance failures + % availability failures)
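One way to read the formula in code; the weights and failure percentages below are purely illustrative:

```python
# Risk Score as a weighted average, over all injected faults, of the
# percentage of performance failures plus availability failures.
# Weights and percentages below are purely illustrative.

def risk_score(experiments):
    """experiments: list of (weight, perf_fail_pct, avail_fail_pct)."""
    total_weight = sum(w for w, _, _ in experiments)
    return sum(w * (perf + avail)
               for w, perf, avail in experiments) / total_weight

rs = risk_score([
    (1.0, 10.0, 20.0),   # fault A: 10% perf failures, 20% avail failures
    (3.0, 0.0, 40.0),    # fault B, weighted 3x
])
print(rs)  # 37.5
```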
Benchmark faultload
I/O faults (injected at Host or VM level):
- Network frame receive/transmit: corruption, drop, delay
- Storage block read/write: corruption, drop, delay

Compute faults (injected at Host or VM level):
- CPU and memory: hogs, termination, code corruption, data corruption

- Faults in virtualized environments include disruptions in network and storage I/O traffic, in CPUs, and in memory
- A fault injector has been implemented as a set of kernel modules for VMware ESXi and Linux
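As a toy illustration of one compute fault type, a "CPU hog" can be emulated in user space. This is not the paper's injector (which consists of kernel modules for VMware ESXi and Linux), and Python's GIL limits a single process to roughly one core; the sketch only shows the idea of saturating CPU for a fixed interval:

```python
import threading
import time

# Toy user-space emulation of a "CPU hog" compute fault; the paper's
# actual injector is implemented as kernel modules. This sketch just
# burns CPU cycles for a fixed interval, then stops on its own.

def _burn(stop_at):
    while time.monotonic() < stop_at:
        pass  # busy-loop consuming CPU

def inject_cpu_hog(duration_s, workers=2):
    """Start `workers` busy-looping threads for `duration_s` seconds."""
    stop_at = time.monotonic() + duration_s
    threads = [threading.Thread(target=_burn, args=(stop_at,))
               for _ in range(workers)]
    for t in threads:
        t.start()
    return threads

hogs = inject_cpu_hog(0.1)
for t in hogs:
    t.join()  # experiment window over; the hog terminates itself
```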
Benchmark workload
- The VNFs should be exercised using a representative workload
- Our dependability benchmarking methodology is not tied to a specific choice of workload
- Realistic workloads can be generated using load-testing and performance-benchmarking tools (e.g., Netperf)
Case study: Clearwater IMS
- Clearwater: an open-source, NFV-oriented implementation of the IP Multimedia Subsystem (IMS)
- In a first round of experiments, we test a replicated, load-balanced deployment over several VMs
- In a second round of experiments, we add automated recovery of VMs (a VMware HA cluster) to the setup
- We use SIPp to generate SIP call set-up requests
[Figure: replicated Clearwater servers deployed over VMware ESXi, with fault injection applied]
Fault injection test plan
- We inject faults in one of the physical host machines and in a subset of the VMs (Sprout and Homestead)
- We inject both I/O faults (network, storage) and compute faults (memory, CPU), both intermittent and permanent
- Each fault injection experiment has been repeated three times
- In total, 93 fault injection experiments have been performed
Experimental availability
- We computed performance and availability KPIs from the logs of the SIPp workload generator
- Faults have a strong impact on availability
- Compute faults and Sprout-VM faults have the strongest impact
VNF latency (by fault type)
[Figure: cumulative distribution of latency (ms, logarithmic scale) under I/O faults, compute faults, and fault-free conditions, with the QoS thresholds T50 = 150 ms and T90 = 250 ms marked]

More than 10% of requests exhibit a latency much higher than 250 ms.
Risk Score and problem determination
- The overall Risk Score (55%) is quite high and reflects the strong impact of faults
- The infrastructure was affected by a capacity problem: once a VM or host fails, the remaining replicas are not able to handle the SIP traffic
NFVI design choices have a big impact on reliability, e.g., the placement of VMs across hosts, the topology of virtual networks and storage, the allocation of CPUs and memory for VMs, etc.
Evaluating automated recovery mechanisms
[Figure: network throughput at the Sprout VM over time (0-300 s); the fault is injected and the VM recovers about one minute later; curves compare the fault-free run, a faulty run with load balancing only, and a faulty run with load balancing plus automated recovery]

Fault tolerance mechanisms require careful tuning, based on experimentation: in our experiments, automated VM recovery was too slow, and availability remained low.
Conclusion
- Performance and availability are critical concerns for NFV
- NFVIs are very complex, and making design choices is difficult
- We proposed a dependability benchmark that points out dependability issues and guides designers
- Future work will extend the evaluation to alternative virtualization technologies