
Evaluation of CPU cgroup

Jun 7th, 2012

Taku Izumi

Fujitsu Limited

LinuxCon Japan 2012

Table of Contents

Outline of CPU cgroup

Evaluation of CFS bandwidth control

CPU bandwidth control of KVM guests


Motivation

Fujitsu sells a cloud computing service for mission-critical customers. In that service, QoS and accounting are very important.

The CPU bandwidth control feature provides CPU QoS:

CPU resource provisioning according to the service level

The CPU cgroup has been enhanced to support the CFS bandwidth control feature:

merged into the Linux 3.2 kernel

RHEL 6.2 bundles a kernel with CFS bandwidth control


Outline of CPU cgroup


CPU cgroup subsystem

The CPU cgroup is one of the cgroup subsystems; it provides a control interface to the scheduler.

Facilities:

"share"

• provisioning of proportional CPU resource through weights

• lower-bound provisioning of CPU resource

bandwidth control for the completely fair scheduler (the main topic of this talk)

• aka CFS bandwidth control

• provides an upper limit on CPU resource

• a very new feature

bandwidth control for the real-time scheduler

“share” facility

cpu.shares specifies the weight used to apportion CPU time

e.g. configure the "Gold" group to receive 1.5x the CPU bandwidth of the "Silver" group

• # echo 3072 > Gold/cpu.shares

• # echo 2048 > Silver/cpu.shares

[Figure: cgroup hierarchy. ROOT (100%) splits into Gold (shares=3072, 60%) and Silver (shares=2048, 40%); Gold holds tasks A and B (1024 each, 30% of total each), Silver holds C and D (1024 each, 20% each).]
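A minimal sketch of the setup above, assuming a cgroup v1 hierarchy mounted at /sys/fs/cgroup/cpu (the mount point and $PID_A are assumptions for illustration):

# mount -t cgroup -o cpu none /sys/fs/cgroup/cpu   # if not already mounted
# mkdir /sys/fs/cgroup/cpu/Gold /sys/fs/cgroup/cpu/Silver
# echo 3072 > /sys/fs/cgroup/cpu/Gold/cpu.shares
# echo 2048 > /sys/fs/cgroup/cpu/Silver/cpu.shares
# echo $PID_A > /sys/fs/cgroup/cpu/Gold/tasks      # attach a task to a group by PID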


Known issue with share facility

CPU time is shared among groups that have runnable tasks

The amount of CPU time provided to a group depends on the state of its neighboring groups

This makes performance hard to estimate, and it is not good for selling CPU time in an enterprise system

[Figure: the same hierarchy. If B and C go idle, A's share of CPU time jumps from 30% to 60% and D's from 20% to 40%, while B and C drop to 0%.]

What we ask for is CFS bandwidth control!

The "share" facility is not suitable for our service

Our requirement is a feature that limits CPU usage according to service class

The service level of low-class users must never exceed that of high-class users, no matter what happens.


CFS Bandwidth control

The bandwidth allowed for a group is specified by a quota and a period.

Within each "period" (microseconds), a group is allowed to consume only up to "quota" microseconds of CPU time.

"quota" is the maximum run-time within the specified "period"

[Figure: analogy. Fuel: 10 L/week; CPU: 500 ms/s. Both are "quota/period" limits.]

CFS Bandwidth control (cont.)

When a group exhausts its quota of CPU time within a period, the tasks in the cgroup will not be scheduled again until the next period.

[Figure: the group's quota is drained to empty within each period.]

How CFS bandwidth control works

Example

If period is 250ms and quota is also 250ms, the group will get 1 CPU worth of runtime every 250ms.

[Figure: timeline across CPU#0-CPU#3 with period=250ms; once the group's quota is consumed within a period it is "throttled" until the next period begins.]
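Throttling can be observed through the group's cpu.stat file, which CFS bandwidth control exports (a sketch; <group> is a placeholder and the sample values are illustrative):

# cat /sys/fs/cgroup/cpu/<group>/cpu.stat
nr_periods 1234          # enforcement periods that have elapsed
nr_throttled 567         # periods in which the group was throttled
throttled_time 89012345  # total time throttled, in nanoseconds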


cgroup interface

Quota and period are managed within the cpu subsystem of cgroupfs.

cpu.cfs_quota_us:

• The total available run-time within a period (in microseconds; minimum 1 ms)

• "-1" (meaning no restriction) is the default

cpu.cfs_period_us:

• The length of a period (in microseconds; from 1 ms to 1 s)

• "100000" (100 msec) is the default

e.g. 50% bandwidth (quota/period = 125 ms / 250 ms):

# echo 125000 > cpu.cfs_quota_us /* quota = 125ms */
# echo 250000 > cpu.cfs_period_us /* period = 250ms */
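Putting the interface together, a minimal end-to-end sketch, again assuming cgroup v1 mounted at /sys/fs/cgroup/cpu (the group name "limited" and the workload command are illustrative):

# mkdir /sys/fs/cgroup/cpu/limited
# echo 250000 > /sys/fs/cgroup/cpu/limited/cpu.cfs_period_us  # period = 250ms
# echo 125000 > /sys/fs/cgroup/cpu/limited/cpu.cfs_quota_us   # quota  = 125ms (50%)
# echo $$ > /sys/fs/cgroup/cpu/limited/tasks                  # move this shell into the group
# ./workload                                                  # now capped at 50% of one CPU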


Hierarchical considerations

The interface enforces that a child cgroup's quota/period ratio can never exceed its parent's

However, the aggregate quota of the children is allowed to exceed the parent's, to support work-conserving semantics

[Figure: two example hierarchies. NG: a parent with quota 1000 and children with quotas 1200 and 600; a child's bandwidth cannot exceed the parent's. OK: a parent with quota 1000 and children with quotas 600, 300, and 400; the total of the children's bandwidths may exceed the parent's.]
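For example (a sketch with the default 100 ms period; per the constraint above the last write should be rejected, though how the rejection is reported depends on the kernel):

# echo 100000 > parent/cpu.cfs_quota_us          # parent ratio: 100ms / 100ms = 1.0
# echo 60000  > parent/child1/cpu.cfs_quota_us   # OK: 0.6 <= 1.0
# echo 60000  > parent/child2/cpu.cfs_quota_us   # OK even though children total 1.2 > 1.0
# echo 120000 > parent/child1/cpu.cfs_quota_us   # NG: child ratio 1.2 would exceed parent's 1.0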


Hierarchical considerations (cont.)

There are two ways in which a group can be throttled: a) it fully consumes its own quota within a period

b) its parent's quota is fully consumed within the parent's period

In case b), even though the child may have runtime remaining, it is not allowed to run until the parent reaches its next period.

[Figure: a parent with quota 1000 and children with quotas 600, 300, and 400. Case a): one child consumes its full 600 and is throttled while its siblings have used only 100 and 150. Case b): the children have consumed 450 + 350 + 200 = 1000, exhausting the parent's quota, so all of them are throttled even though each child has quota remaining.]

Evaluation of CFS bandwidth control

Experimental setup

Hardware: Fujitsu PRIMEQUEST 1800 E2

CPU: Intel Xeon E7-8870 (10 cores / 2.4 GHz) x 2

Memory: 128 GB

NIC: Intel 82576NS

OS:

kernel: 3.1.0-rc7-tip (CONFIG_CFS_BANDWIDTH=y)

qemu-kvm: 0.14.0-7

libvirt: 0.9.4-1

About benchmarks

Himeno Benchmark M (256x128x128)

Benchmark for measuring FLOPS, popular in Japan

Unixbench version 5.1.2

Measures the performance of UNIX-based systems

Hackbench

Estimates the time of a chat-like workload

• # hackbench 150 process 500

SysBench (0.4.12-5) oltp test

Benchmarks real database performance

• # sysbench --test=oltp --num-threads=10 --db-driver=mysql --mysql-table-engine=innodb --oltp-table-size=1000000

• using mysqld on localhost

About benchmarks (cont.)

Super Pi benchmark ver. 2.0

Estimates the time to calculate π to 1 million (2^20) digits

• # time superpi 20

Original sleep-work-wakeup benchmark

Measures process wake-up latency on a not-busy host

Many pairs of threads wake each other up, each doing a random sleep and a fixed job; because of the sleeps, the CPUs are not fully used.

Used to emulate a customer's job-queuing pipeline latency (waiting for events and queuing jobs).

[Figure: Thread 1 works and then wakes up Thread 2, which is sleeping; the measured latency is from the wake-up to when Thread 2 starts running.]

"quota" test

When "quota" is changed, how does the benchmark score change?

Procedure

Create a CPU cgroup named "benchmark"

Specify cpu.cfs_quota_us (cpu.cfs_period_us is left at the default)

Run the benchmark in the "benchmark" cgroup

Benchmarks

Himeno Benchmark

Unixbench

Hackbench

SysBench oltp test

Sleep-work-wakeup benchmark

[Figure: CPU cgroup hierarchy. The benchmark process runs in /benchmark with cpu.cfs_quota_us / cpu.cfs_period_us = x / 100000.]
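A sketch of this procedure, assuming the cgroup v1 mount point used earlier and a placeholder run_benchmark command (cgexec comes from libcgroup; the quota values and all names are illustrative):

# mkdir /sys/fs/cgroup/cpu/benchmark
# for q in 10000 20000 40000 60000 80000 100000; do
>   echo $q > /sys/fs/cgroup/cpu/benchmark/cpu.cfs_quota_us   # period stays at 100000
>   cgexec -g cpu:benchmark ./run_benchmark                   # run one benchmark pass
> done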


Himeno Benchmark result

[Chart: score (MFLOPS) vs. quota (msec), period=100msec. The score grows with quota, reaching 2151 MFLOPS at the largest quota; the score when unlimited is 2148.3 MFLOPS.]

Unixbench result

[Chart: Unixbench index vs. quota (msec), period=100msec, for each test: Dhrystone 2 using register variables; Double-Precision Whetstone; Execl Throughput; File Copy (1024 bufsize / 2000 maxblocks, 256 / 500, and 4096 / 8000); Pipe Throughput; Pipe-based Context Switching; Process Creation; Shell Scripts (1 and 8 concurrent); System Call Overhead; and the System Benchmarks Index Score.]

Hackbench result

[Chart: elapsed time (sec) and tasks per sec vs. quota (msec), period=100msec. Elapsed time falls from 3377.899 sec at the smallest quota to 82.612 sec, following roughly y = 71495x^-1.29; the elapsed time when unlimited is 3.578 sec.]

SysBench oltp test result

[Chart: transactions (TPS) and average execution time (ms) vs. quota (msec), period=100msec. As quota grows, TPS rises from 29.71 to 174.41 and average execution time falls from 335.5776 ms to 57.3293 ms; when unlimited, 183.03 TPS with a 54.61 ms average execution time.]

Sleep-work-wakeup benchmark result

[Chart: wake-up latency (msec) vs. quota (msec), period=100msec. Latency falls steeply with quota, from 519.547914 msec at the smallest quota through 8.078458 msec down to 0.078542 msec; the latency when unlimited is 0.0697 msec.]


CPU bandwidth control of KVM guests

Virtual CPU bandwidth control of KVM guests can be achieved by using CFS bandwidth control

libvirt already supports this feature

Specify the "vcpu_period" and "vcpu_quota" parameters for each guest

libvirt converts them into cpu.cfs_period_us and cpu.cfs_quota_us

[Figure: a virtual guest whose CPU time is limited to quota (vcpu_quota) within each period (vcpu_period).]
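For example, the parameters can be set through virsh's schedinfo command (a sketch; the domain name "VM1" is illustrative, and the exact option syntax varies by libvirt version):

# virsh schedinfo VM1 --set vcpu_period=100000
# virsh schedinfo VM1 --set vcpu_quota=25000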

CPU bandwidth control of KVM guests (cont.)

[Figure: CPU cgroup hierarchy /libvirt/qemu/{VM1, VM2, VM3}, with vcpu0 and vcpu1 child groups under VM1. The libvirt and qemu levels, and VM2 and VM3, are unlimited. With vcpu_period=100000 and vcpu_quota=25000, each vcpu group gets cpu.cfs_quota_us / cpu.cfs_period_us = 25000 / 100000, and VM1's own quota is the sum of its vcpus' quotas: 50000 / 100000.]

Comparison with "share" facility

Preconditions

2 virtual guests (1 vcpu each)

Each vcpu is pinned to the same physical CPU

Run the Super Pi benchmark in VM1 in the following cases:

A) Without CPU bandwidth control

i. VM2 is shut off

ii. VM2 is running but idle

iii. VM2 is running and busy

B) With CPU bandwidth control

i. VM2 is shut off

ii. VM2 is running but idle

iii. VM2 is running and busy

[Figure: on the host, VM1's vCPU#0 and VM2's vCPU#0 are both pinned to pCPU#0.]
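One way to reproduce this pinning (a sketch using virsh's vcpupin command; domain names are illustrative):

# virsh vcpupin VM1 0 0   # pin VM1's vcpu 0 to physical CPU 0
# virsh vcpupin VM2 0 0   # pin VM2's vcpu 0 to the same physical CPU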

Comparison with "share" facility (cont.)

A) Without CPU bandwidth control on the guests ("share"): CPU time is allocated proportionally according to the "share" values

B) With CPU bandwidth control: CPU time is limited according to each quota value

[Chart: Super Pi calculation time (sec) in VM1 by state of VM2 (elapsed time of the calculation, so shorter is better):

VM2 shut off: 15.656 (A, w/o bandwidth control) vs. 37.648 (B, w/ 50% quota)

VM2 idling: 15.721 (A) vs. 37.4 (B)

VM2 busy: 31.457 (A) vs. 37.778 (B)

With "share", VM1's result depends on VM2's state; with bandwidth control it stays nearly constant.]

"quota" test on KVM guest

Preconditions

KVM guest: 1 vcpu, 2 GB memory, RHEL 6.1 x86_64

Procedure

Specify the guest's vcpu_quota (vcpu_period is left at the default)

Run the benchmark on the guest

Benchmarks

Super Pi benchmark (1 million digits)

SysBench oltp test

Sleep-work-wakeup benchmark

Super Pi benchmark result

[Chart: calculation time (sec) and calculation speed (digits/msec) vs. vcpu_quota (msec), period=100msec. Calculation time falls from 341.301 sec at the smallest quota to 17.453 sec, following roughly y = 5632.6x^-1.271; the time when unlimited is 15.722 sec.]

SysBench oltp test result

[Chart: transactions (TPS) and average execution time (msec) vs. vcpu_quota (msec), period=100msec. As quota grows, average execution time falls from 2348.2446 msec to 117.12 msec while TPS rises from 4.26 to 85.36; when unlimited, 105.71 TPS with a 94.57 msec average execution time.]

Sleep-work-wakeup benchmark result

[Chart: wake-up latency (msec) vs. vcpu_quota (msec), period=100msec. Latency falls from 130.825721 msec at the smallest quota to 3.892596 msec; the latency when unlimited is 2.611 msec.]

Consideration of vcpu_quota

When vcpu_quota is specified, not only the vcpus but the whole virtual machine is throttled

Even if a vcpu's CPU time is less than the quota value, it may be throttled

e.g. in case of heavy I/O

[Figure: CPU cgroup hierarchy for a 1-vcpu VM. The qemu process consists of a vcpu0 thread, a main thread, and I/O threads. With vcpu_quota / vcpu_period = q / p, the vcpu0 child group gets cpu.cfs_quota_us / cpu.cfs_period_us = q / p, and the VM-level group gets cpu.cfs_quota_us / cpu.cfs_period_us = (q x 1) / p for 1 vcpu, so qemu's main and I/O threads share the same budget as the vcpu.]

Consideration of vcpu_quota (cont.)

Even if vcpu_quota is set to 50% of vcpu_period and vcpu0 is placed under full load in the guest, the CPU usage of vcpu0 does not reach 50%

[Figure: guest vCPU#0 pinned to host pCPU#0. Chart: mpstat of pCPU#0 over time, broken down into %usr, %nice, %sys, %iowait, %irq, %soft, %steal, %guest, and %idle; %guest stays below 50%.]
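The per-CPU breakdown can be collected with sysstat's mpstat (a sketch; the CPU number and sample count are illustrative):

# mpstat -P 0 1 19   # report CPU 0 every second, 19 samples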

Consideration of vcpu_quota (cont.)

When the CPU usage of the qemu main thread and the qemu I/O threads is added in, the total reaches 50%

[Figure: guest vCPU#0 pinned to pCPU#0, with qemu's main and I/O threads on the same CPU. Chart: mpstat of pCPU#0 over time; %guest plus the qemu threads' usage adds up to 50%.]

Impact investigation on network I/O

Preconditions

KVM guest: 1 vcpu (pinned to pCPU#0), 2 GB memory, RHEL 6.1 x86_64, vhost_net NIC using macvtap

Place a network load on the KVM guest by running netperf between guest and host

# netperf -t TCP_STREAM -- -m 32768 -s 57344 -S 57344

Impact investigation on network I/O (cont.)

CPU usage of pCPU#0 during netperf (w/o CPU bandwidth control of the guest)

[Chart: mpstat of pCPU#0 over time. The VM uses 80%+ of the CPU while the vcpu itself (%guest) uses only 25%+.]

Impact investigation on network I/O (cont.)

[Chart: network throughput (Mbps) on the guest vs. vcpu_quota (msec), period=100msec.]

Impact investigation on disk I/O

Preconditions

KVM guest: 1 vcpu (pinned to pCPU#0), 4 GB memory, RHEL 6.1 x86_64, file-based virtual HDD (16 GB)

Place a disk load on the KVM guest by running the dd command

# dd if=/dev/zero of=testfile bs=1024M count=2

Impact investigation on disk I/O (cont.)

CPU usage of pCPU#0 during the dd command (w/o CPU bandwidth control of the guest)

[Chart: mpstat of pCPU#0 over time. The VM uses 30%+ of the CPU while the vcpu itself (%guest) uses about 20%.]

Impact investigation on disk I/O (cont.)

[Chart: disk I/O throughput (MB/sec) on the guest vs. vcpu_quota (msec), period=100msec.]

Comparison with other hypervisors

vCPU bandwidth control in other hypervisors (Xen, VMware)

[virtual CPU usage] <= [limit value]

• limits virtual CPU usage

• does NOT limit hypervisor CPU usage

vCPU bandwidth control in KVM

[virtual CPU usage] + [hypervisor CPU usage] <= [limit value]

• limits the total of virtual CPU usage and hypervisor CPU usage

Both methods have pros and cons.

We at Fujitsu are now enhancing libvirt to support "virtual CPU usage only" limiting on KVM

Summary

Outline of CPU cgroup

CFS bandwidth control provides an upper limit on CPU resource

Evaluation of CFS bandwidth control

The effect of bandwidth control depends on the workload:

• tasks that use the CPU full-time are affected directly, while I/O-bound tasks are somewhat less sensitive

CPU bandwidth control of KVM guests

useful when building a KVM-based cloud system

necessary to take the impact on guests' I/O into account