Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013


DESCRIPTION

In this session, we explain how to measure the key performance-impacting metrics in a cloud-based application. With specific examples of good and bad tests, we make it clear how to get reliable measurements of CPU, memory, and disk performance, and how to map benchmark results to your application. We also cover the importance of selecting tests wisely, repeating tests, and measuring variability.

TRANSCRIPT

Page 1: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Best Practices for Benchmarking and

Performance Analysis in the Cloud

Robert Barnes, Amazon Web Services

November 15, 2013

Page 2: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Benchmarks: Measurement Demo

How many ways to measure? At least 20…

Page 3: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Cloud Benchmarks: Prequel

• The best benchmark
• Absolute vs. relative measures
• Fixed time or fixed work
• What’s different?
• Use a good AMI

[Charts: average CPU result (0.00–30.00) and Coefficient of Variance (0%–60%) across five AMIs: Ubuntu 12.4 ami-…, AWS CentOS 5.4 ami-…, and three other CentOS 5.4 AMIs]

Page 4: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Scenario: CPU-based Instance Selection

• Application runs on premises

• Primary requirement is integer CPU performance

• Application is complex to set up, no benchmark tests exist, limited time

• What instance would work best?

1. Choose a synthetic benchmark

2. Baseline: Build, configure, tune, and run it on premises

3. Run the same test (or tests) on a set of instance types

4. Use results from the instance tests to choose the best match
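The comparison in step 4 is easiest when every score is normalized to the on-premises baseline, which is how the ratio columns on the following slides are built. A minimal sketch of that normalization, assuming one numeric score per line in baseline.txt and per-type files such as m3.xlarge.txt (all hypothetical file names):

# Normalize each instance type's score to the on-premises baseline score.
BASE=`cat baseline.txt`
for f in m3.xlarge.txt c3.xlarge.txt; do
  awk -v base="$BASE" -v type="${f%.txt}" \
    '{ printf "%s ratio=%.2f\n", type, $1 / base }' "$f"
done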

Page 5: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Testing CPU

• Choose a benchmark: geekbench, UnixBench, sysbench (cpu), and SPEC CPU2006 Integer

• How do you know when you have a good result?

• Tests run on 9 instance types
  – 10 instances of each of the 9 types launched
  – Tests run a minimum of 4 times on each instance
  – Ubuntu 13.04 base AMI
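A driver that satisfies the minimum-of-4-runs rule can be a few lines of shell. A minimal sketch, assuming the per-run geekbench script shown on a later slide is saved as run-gb.sh (a hypothetical file name) and takes the run's sequence number as its argument:

# Repeat the benchmark enough times to quantify run-to-run variability.
RUNS=4
for SEQNO in `seq 1 $RUNS`; do
  ./run-gb.sh "$SEQNO"   # hypothetical wrapper around the geekbench script
done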

Page 6: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

geekbench Overview

• Workloads in 3 categories
  – 13 Integer tests
  – 10 Floating Point tests
  – 4 Memory tests
• Commercial product (64-bit)
• No source code
• Runs single and multi-CPU
• Fast setup, fast runtime

Integer: AES, Twofish, SHA1, SHA2, BZip2 compress, BZip2 decompress, JPEG compress, JPEG decompress, PNG compress, PNG decompress, Sobel, LUA, Dijkstra

Floating Point: Black-Scholes, Mandelbrot, Sharpen image, Blur image, SGEMM, DGEMM, SFFT, DFFT, N-Body, Ray trace

Memory: STREAM copy, STREAM scale, STREAM add, STREAM triad

Page 7: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

geekbench Script

# Tag each result file with instance ID, type, run number, and elapsed time.
SEQNO=$1
GBTXT=gbtest.txt
DL=+
ID="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`"
TYPE="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-type`"
OUTID=$ID$DL$TYPE$DL
START=$(date +%s.%N)
./geekbench_x86_64 --no-upload >$GBTXT
END=$(date +%s.%N)
DIFF=$(echo "$END - $START" | bc)
OUTNAME=$OUTID$SEQNO$DL$DIFF$DL$GBTXT
mv $GBTXT $OUTNAME
# Collect the overall scores from all runs into a semicolon-separated CSV.
grep "Geekbench Score" i-*$GBTXT >gbresults.txt
cat gbresults.txt | sed s/:// | awk '/i-/ {print $1";"$4";"$5}' >gbresults.csv
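The C.O.V. columns in the tables that follow are the coefficient of variation: standard deviation divided by mean. A minimal sketch of computing it with awk, assuming scores.txt (a hypothetical intermediate file, e.g. one score column cut from gbresults.csv) holds one numeric score per line:

# C.O.V. = standard deviation / mean, reported as a percentage.
awk '{ s += $1; ss += $1 * $1; n++ }
     END { m = s / n; sd = sqrt(ss / n - m * m);
           printf "mean=%.2f cov=%.2f%%\n", m, 100 * sd / m }' scores.txt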

Page 8: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

geekbench

Geekbench     1CPU ratio   C.O.V.   NCPU ratio   C.O.V.   RT (min)
m3.xlarge     0.93         1.04%    2.04         2.31%    2.06
m3.2xlarge    0.93         1.40%    3.80         1.46%    2.08
m2.xlarge     0.80         2.84%    1.54         4.06%    1.99
m2.2xlarge    0.80         1.34%    2.82         1.21%    2.04
m2.4xlarge    0.76         2.28%    5.11         1.71%    2.01
c3.large      1.13         0.93%    1.32         0.71%    1.76
c3.xlarge     1.13         0.39%    2.51         1.81%    1.74
c3.2xlarge    1.13         0.19%    4.88         0.25%    1.70
cc2.8xlarge   1.00         0.71%    15.46        1.93%    2.21

Page 9: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

geekbench – Run Variance (m3.xlarge)

instance      1CPU ratio   C.O.V.
instance-1    0.93         0.31%
instance-2    0.97         0.23%
instance-3    0.94         0.17%
instance-4    0.94         0.10%
instance-5    0.94         0.32%
instance-6    0.94         0.10%
instance-7    0.93         0.25%
instance-8    0.93         0.38%
instance-9    0.94         0.11%
instance-10   0.94         0.09%

Page 10: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

geekbench – Integer Portion

gb-integer    1CPU ratio   C.O.V.   NCPU ratio   C.O.V.   RT (min)
c3.large      1.12         0.50%    1.37         0.43%    NA
c3.xlarge     1.13         0.38%    2.72         0.41%    NA
c3.2xlarge    1.12         0.38%    5.35         0.51%    NA
cc2.8xlarge   1.00         0.20%    17.88        3.31%    NA

geekbench
c3.large      1.13         0.93%    1.32         0.71%    1.76
c3.xlarge     1.13         0.39%    2.51         1.81%    1.74
c3.2xlarge    1.13         0.19%    4.88         0.25%    1.70
cc2.8xlarge   1.00         0.71%    15.46        1.93%    2.21

Page 11: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

UnixBench Overview

• Default: the BYTE Index
  – 12 workloads, run 2 times (roughly 29 minutes each time)
    • Integer computation
    • Floating point computation
    • System calls
    • File system calls
  – Geometric mean of results relative to a baseline produces the System Benchmarks Index Score
• Open source – must be built
  – Must be patched for > 16 CPUs

Page 12: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

UnixBench Script

SEQNO=$1
UBTXT=ubtest.txt
DL=+
ID="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`"
TYPE="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-type`"
FN=$ID$DL$TYPE$DL$SEQNO$DL$UBTXT
# Run one single-CPU pass and one pass with one copy per CPU.
COPIES=`cat /proc/cpuinfo | grep processor | wc -l`
./Run -c 1 -c $COPIES >$FN
# Collect the index scores from all runs into a semicolon-separated CSV.
grep "System Benchmarks Index Score" i-*$UBTXT >ubresults.txt
cat ubresults.txt | sed 's/.txt:System Benchmarks Index Score//' | \
awk '/i-/ {print $1";"$2}' >ubresults.csv

Page 13: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

UnixBench

UnixBench     1CPU ratio   C.O.V.   NCPU ratio   C.O.V.   RT (min)
m3.xlarge     1.38         1.90%    2.49         1.36%    28.25
m3.2xlarge    1.42         1.85%    4.21         1.99%    28.29
m2.xlarge     0.40         5.82%    0.76         1.28%    28.30
m2.2xlarge    0.42         1.71%    1.23         1.75%    28.32
m2.4xlarge    0.48         3.31%    2.02         1.71%    28.34
c3.large      1.10         1.33%    1.91         1.54%    28.17
c3.xlarge     1.06         1.48%    2.85         1.26%    28.21
c3.2xlarge    1.10         0.54%    4.50         1.02%    28.96
cc2.8xlarge   1.00         2.97%    6.44         2.65%    30.20

Page 14: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

UnixBench – Dhrystone 2

UB-Integer    1CPU ratio   C.O.V.   NCPU ratio   C.O.V.   RT (min)
c3.large      1.05         0.24%    1.10         0.30%    0.17
c3.xlarge     1.05         0.27%    2.20         0.28%    0.17
c3.2xlarge    1.05         0.07%    4.34         0.23%    0.17
cc2.8xlarge   1.00         0.10%    15.54        0.95%    0.17

UnixBench
c3.large      1.10         1.33%    1.91         1.54%    28.17
c3.xlarge     1.06         1.48%    2.85         1.26%    28.21
c3.2xlarge    1.10         0.54%    4.50         1.02%    28.96
cc2.8xlarge   1.00         2.97%    6.44         2.65%    30.20

Page 15: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

SPEC CPU2006 Overview

• Competitive (reviewed)

• Commercial (site) license required

• Source code provided, must be built

• Highly customizable

• Full “reportable” run takes 5+ hours

• Published results on www.spec.org

Page 16: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

SPEC CPU2006 Overview

Benchmark        Language   Category
400.perlbench    C          Programming language
401.bzip2        C          Compression
403.gcc          C          C compiler
429.mcf          C          Combinatorial optimization
445.gobmk        C          Artificial intelligence
456.hmmer        C          Search gene sequence
458.sjeng        C          Artificial intelligence
462.libquantum   C          Physics / quantum computing
464.h264ref      C          Video compression
471.omnetpp      C++        Discrete event simulation
473.astar        C++        Path-finding algorithms
483.xalancbmk    C++        XML processing

Page 17: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

SPEC CPU2006 Integer Script

SEQNO=$1
CPATH="/cpu2006/result"
COPIES=`cat /proc/cpuinfo | grep processor | wc -l`
SITXT=estspecint.txt
DL=+
ID="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`"
TYPE="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-type`"
FN=$ID$DL$TYPE$DL$SEQNO$DL$SITXT
# One non-reportable rate run of the integer suite, one copy per CPU.
runspec --noreportable --tune=base --size=ref --rate=$COPIES --iterations=1 \
  400 403 445 456 458 462 464 471 473 483
grep "_base" $CPATH/CINT*.ref.csv | cut -d, -f1-2 > $FN
grep "total seconds elapsed" $CPATH/CPU*.log | awk '/finished/ {print $9}' >>$FN
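SPEC's composite numbers are geometric means of per-benchmark ratios, so the estimated SPECint figures on the next slide can be reproduced from the per-benchmark ratios grepped out above. A minimal sketch, assuming ratios.txt (a hypothetical file) holds one per-benchmark ratio per line:

# Geometric mean of per-benchmark ratios: exp(arithmetic mean of logs).
awk '{ logsum += log($1); n++ }
     END { printf "estimated geomean = %.2f\n", exp(logsum / n) }' ratios.txt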

Page 18: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Estimated SPEC CPU2006 Integer

Est. SPECint   1CPU ratio   C.O.V.   RT (min)   NCPU ratio   C.O.V.   RT (min)
m3.xlarge      1.01         1.06%    54.39      2.24         1.15%    104.18
m3.2xlarge     1.01         1.67%    54.49      4.25         1.63%    109.22
m2.xlarge      0.76         1.97%    70.83      1.39         2.45%    85.37
m2.2xlarge     0.79         0.94%    68.85      2.76         1.24%    85.42
m2.4xlarge     0.78         0.16%    68.73      5.21         1.26%    89.91
c3.large       1.11         1.95%    50.00      1.25         1.47%    94.22
c3.xlarge      1.10         1.96%    50.29      2.39         1.28%    97.66
c3.2xlarge     1.08         0.87%    50.87      4.67         0.25%    100.22
cc2.8xlarge    1.00         0.29%    54.92      14.92        0.52%    125.74

Page 19: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Sysbench Overview

• Designed as a quick system test for MySQL servers
• Test categories:
  – fileio
  – cpu
  – memory
  – threads
  – mutex
  – oltp
• Source code provided, must be built
• Very simplistic defaults – tuning recommended (see the sketch below)
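A minimal sketch of the default-versus-tuned contrast that shows up in the results two slides ahead; the tuned parameters are taken from the script on the next slide, and the default run simply omits them:

# Default run: sysbench's built-in cpu parameters.
sysbench --test=cpu run
# Tuned run: two threads per vCPU, a fixed request count, and a much
# larger prime limit so each request does real work.
sysbench --num-threads=$((`grep -c processor /proc/cpuinfo` * 2)) \
  --max-requests=30000 --test=cpu --cpu-max-prime=100000 run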

Page 20: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Sysbench Script

COPIES=`cat /proc/cpuinfo | grep processor | wc -l`
TDS=$(($COPIES * 2))
STXT=sysbenchcpu.txt
DL=+
ID="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`"
TYPE="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-type`"
FN=$ID$DL$TYPE$DL$TDS$DL$STXT
sysbench --num-threads=$TDS --max-requests=30000 --test=cpu \
  --cpu-max-prime=100000 run > $FN
# Collect elapsed times into a separate file so the raw output is not overwritten.
grep "total time:" i-*$STXT | cut -d, -f1-2 > sbresults.txt

Page 21: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Sysbench – CPU

sysbench      Default   C.O.V.   RT (min)   tuned ratio   C.O.V.   RT (min)
m3.xlarge     3.21      1.44%    0.06       1.69          1.29%    3.86
m3.2xlarge    6.41      1.38%    0.03       3.38          1.41%    1.93
m2.xlarge     1.59      0.75%    0.11       0.80          0.23%    8.16
m2.2xlarge    3.19      0.64%    0.06       1.60          0.76%    4.07
m2.4xlarge    8.83      0.62%    0.02       4.71          0.20%    1.38
c3.large      1.78      0.26%    0.10       0.91          0.09%    7.13
c3.xlarge     3.55      0.53%    0.05       1.83          0.02%    3.57
c3.2xlarge    6.55      8.45%    0.03       3.54          3.31%    1.85
cc2.8xlarge   25.34     2.30%    0.01       13.69         1.10%    0.48

Page 22: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Summary: CPU Comparison

              GB      GB Int   UB     UB Int   Est. SPECInt   sysbench default   sysbench tuned
m3.xlarge     2.04    2.01     2.49   1.88     2.24           3.21               1.69
m3.2xlarge    3.80    3.96     4.21   3.77     4.25           6.41               3.38
m2.xlarge     1.54    1.52     0.76   1.59     1.38           1.59               0.80
m2.2xlarge    2.82    3.02     1.23   3.19     2.76           3.19               1.60
m2.4xlarge    5.11    5.54     2.02   6.48     5.21           8.83               4.71
c3.large      1.32    1.37     1.91   1.10     1.25           1.78               0.91
c3.xlarge     2.51    2.72     2.85   2.20     2.39           3.55               1.83
c3.2xlarge    4.88    5.35     4.50   4.34     4.67           6.55               3.54
cc2.8xlarge   15.46   17.88    6.44   15.54    14.92          25.34              13.69

Page 23: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Scenario: Memory Instance Selection

• Application runs on premises

• Primary requirement: memory throughput of 20K MB/sec

• What instance would work best?

1. Choose a synthetic benchmark

2. Baseline: Build, configure, tune, and run it on premises

3. Run the same test (or tests) on a set of instance types

4. Use results from the instance tests to choose the best match

Page 24: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Testing Memory

• Choose a benchmark: stream, geekbench, sysbench (memory)

• How do you know when you have a good result?

• Tests run on 9 instance types
  – Minimum of 10 instances launched
  – Tests run a minimum of 3 times on each instance
  – Ubuntu 13.04 base AMI

Page 25: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Stream* Overview

• Synthetic measure of sustainable memory bandwidth
  – Published results at www.cs.virginia.edu/stream/top20/Bandwidth.html
  – Must be built
  – By default, runs 1 thread per CPU
  – Use stream-scaling to automate array sizing and thread scaling
    • https://github.com/gregs1104/stream-scaling

name     kernel                 bytes/iter   FLOPS/iter
COPY:    a(i) = b(i)            16           0
SCALE:   a(i) = q*b(i)          16           1
SUM:     a(i) = b(i) + c(i)     24           1
TRIAD:   a(i) = b(i) + q*c(i)   24           2

* McCalpin, John D.: "STREAM: Sustainable Memory Bandwidth in High Performance Computers"
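Since STREAM must be built, a minimal build-and-run sketch, assuming stream.c in the current directory and gcc with OpenMP (the array-size macro and value vary by STREAM version; the value here is illustrative and should be large enough to overflow the caches, which is exactly what stream-scaling automates):

# Build with OpenMP so STREAM runs one thread per CPU by default.
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=60000000 stream.c -o stream
# Pin the thread count explicitly, as the memory script on the next slide does.
OMP_NUM_THREADS=`grep -c processor /proc/cpuinfo` ./stream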

Page 26: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Memory Scripts

TDS=`cat /proc/cpuinfo | grep processor | wc -l`
export OMP_NUM_THREADS=$TDS
MTXT=stream.txt
DL=+
ID="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`"
TYPE="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-type`"
FN=$ID$DL$TYPE$DL$TDS$DL$MTXT
./stream | egrep \
"Number of Threads requested|Function|Triad|Failed|Expected|Observed" > $FN

MTXT=sysbench-mem.txt
FN=$ID$DL$TYPE$DL$TDS$DL$MTXT
./sysbench --num-threads=$TDS --test=memory run >$FN

Page 27: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Memory Comparison (MB/sec)

              Stream-Triad   Geekbench Memory-Triad   sysbench (default)
m3.xlarge     23640.56       15375.64                 302.95
m3.2xlarge    26046.17       14999.27                 603.40
m2.xlarge     18766.58       17365.76                 528.16
m2.2xlarge    22421.91       17600.00                 1019.08
m2.4xlarge    19634.50       14405.82                 1576.30
c3.large      11434.83       9967.96                  2116.84
c3.xlarge     21141.30       13972.65                 2643.33
c3.2xlarge    30235.78       20657.49                 2944.91
cc2.8xlarge   55200.86       37067.32                 1195.90

sysbench memory defaults:
--memory-block-size [1K]
--memory-total-size [100G]
--memory-scope {global,local} [global]
--memory-hugetlb [off]
--memory-oper {read, write, none} [write]
--memory-access-mode {seq,rnd} [seq]
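Those defaults help explain why the sysbench column lands so far below Stream: 1K blocks measure per-operation overhead more than sustained bandwidth. A hedged sketch of a run closer to a bandwidth test (the 1M block size is an illustrative choice, not from the deck):

# Bigger blocks shift the test from per-op overhead toward raw bandwidth.
./sysbench --num-threads=`grep -c processor /proc/cpuinfo` --test=memory \
  --memory-block-size=1M --memory-total-size=100G run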

Page 28: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Testing Disk I/O

• Storage options:
  – Amazon EBS
  – Amazon EBS PIOPs
  – Ephemeral
  – hi1.4xlarge local storage

• I/O metrics:
  – IOPs
  – Throughput
  – Latency

• Test parameters:
  – Read %
  – Write %
  – Sequential
  – Random
  – Queue depth

• Storage configuration:
  – Volume(s)
  – RAID
  – LVM

Page 29: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Benchmarking PIOPs

• Launch an Amazon EBS-optimized instance
• Create provisioned IOPS volumes
• Attach the volumes to the Amazon EBS-optimized instance
• Pre-warm volumes (see the sketch below)
• Tune queue depth and latency against IOPs
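Pre-warming matters because, on EBS volumes of this era, the first touch of each block carried an I/O penalty that would otherwise pollute the measurements. A minimal sketch of a read pre-warm, assuming the volume is attached as /dev/xvdf (a hypothetical device name):

# Touch every block once so first-access overhead is paid before the benchmark.
sudo dd if=/dev/xvdf of=/dev/null bs=1M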

[Chart: latency (usec, 0–1200) for PIOPs 2K volumes by queue depth, across Seq. Read, Seq. Write, Mixed Seq Read, Mixed Seq Write, Rand Read, Rand Write, Mixed Rand Read, Mixed Rand Write; series: 1D PIOPS 2K, 1D PIOPS 2K QD2, 2D PIOPS 2K, 2D PIOPS 2K QD2]

Page 30: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Testing Disk I/O Examples

fio job file:

[global]
clocksource=cpu
randrepeat=0
ioengine=libaio
direct=1
group_reporting
size=1G

[xvdd-fill]
filename=/data1/testfile1
refill_buffers
scramble_buffers=1
iodepth=4
rw=write
bs=2m
stonewall

[xvdd-1disk-write-1k-1]
time_based
ioscheduler=deadline
iodepth=1
rate_iops=4080
ramp_time=10
filename=/data1/testfile1
runtime=30
bs=1k
rw=write

Simple alternatives:
• disk copy: cp file1 /disk1/file1
• dd: dd if=/dev/zero of=/data1/testfile1 bs=1048 count=1024000
• fio – flexible I/O tester: fio simple.cfg
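The deck does not show simple.cfg itself; purely as an illustration of the job-file format, a hypothetical minimal version in the same style as the one above (every section name and parameter here is an assumption, not the presenter's actual file):

# simple.cfg (hypothetical): a single sequential-write job.
[global]
ioengine=libaio
direct=1
size=1G

[seq-write]
filename=/data1/testfile1
rw=write
bs=1m
iodepth=4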

Page 31: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Summary: Disk I/O

Command                                           Seconds   MB/sec
cp f1 f2                                          17.248    59.37
rm -rf f2; cp f1 f2                               .853      1200.47
cp f1 f3                                          .880      1164.96
dd if=/dev/zero bs=1048 count=1024000 of=d1       .722      1419.01
dd if=/dev/urandom bs=1048 count=1024000 of=d2    79.710    12.84
fio simple.cfg                                    NA        61.55
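The spread in the table above is mostly the page cache at work: the first cp pays for cold reads, the repeat copies run from cached data, and the /dev/urandom dd is throttled by random-number generation rather than the disk. One way to keep the cache out of a quick dd test is direct I/O, sketched below (oflag=direct is standard GNU dd, though this exact command is not in the deck):

# Bypass the page cache so dd measures the device, not memory.
dd if=/dev/zero of=/data1/ddtest bs=1M count=1024 oflag=direct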

Page 32: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Beyond Simple Disk I/O

Random 1M I/O   PIOPs 16-disk MBps
read            1006.73
write           904.03
r70w30          1005.91

Page 33: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Summary

If benchmarking your application is not practical, synthetic benchmarks can be used if you are careful.

• Choose the benchmark that best represents your application
• Analysis – what does “best” mean?
• Run enough tests to quantify variability
• Baseline – what is a “good result”?
• Samples – keep all of your results – more is better!

Page 34: Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Please give us your feedback on this presentation. As a thank you, we will select prize winners daily for completed surveys!

ENT305