whitepaper: where did my cpu go?

"WHERE DID MY CPU GO?" -- MONITORING AND CAPACITY PLANNING ADVENTURES ON A CONSOLIDATED ENVIRONMENT

Karl Arao, Enkitec

ABSTRACT This paper focuses on CPU monitoring and capacity planning, and on the scenarios typically encountered in a massively consolidated environment where, say, you have 30+ databases and want to know how many CPUs they are using at a particular time interval. Before touching on the cool tricks, a deep dive into the important CPU topics and metrics is a MUST. In the Oracle world the CPU is not just the "Green Thing" in Enterprise Manager. It's actually much more than that. We will discuss the usual CPU monitoring tools in OEM, the visualization enhancements made possible by AWR analytics, and how these can be applied to critical capacity planning scenarios.

TARGET AUDIENCE The target audience is DBAs, Architects, Performance Engineers, and Capacity Planners. The learner will be able to:

• Compare CPU speeds between hardware platforms

• Identify the different database CPU events

• Identify the consolidated CPU load of a cluster environment

• Take quantitative CPU information and make use of visualization tools for solid capacity planning solutions

BACKGROUND When you migrate a database, or move instances from one server to another, you have to know how those instances will behave on the new server and whether their CPU requirements will fit within the server's CPU capacity. Knowing the end utilization before doing the migration is even better, because it lets you plan for CPU resource management and project the resource utilization of that machine over the coming months. Every DBA should have a deep understanding of how the CPU works and how to administer it, because it is a critical resource and a big driver of server migration and acquisition decisions.

PART 1: COMPARING CPU SPEEDS Let's start with how Oracle calculates the default settings for CPU resources, which is dictated by the database parameter CPU_COUNT.

• Essentially, CPU_COUNT is the number of CPUs that your database is using

• By default that boils down to the number of Logical CPUs on your DB server

• It is also the parameter you alter to allocate a specific number of CPUs to a particular database, which is what you do with "Instance Caging"

Let's dive a little deeper into the CPU architecture, specifically the sockets, cores, and threads, so you'll have an idea of how Oracle calculates the default settings for CPU resources.

The above image shows the CPU architecture of an Exadata V2 machine with the following specifications:

• A total of 2 sockets, equivalent to 2 physical CPUs

• A total of 8 cores, 4 cores per socket

• A total of 16 threads, because with Hyper-Threading (HT) each core has 2 Logical CPUs

The 16 threads are the Logical CPUs shown in the CPU_COUNT parameter. Another view of this architecture is shown below, from the cpu_topology [1] script output.

The cpu_topology script parses dmidecode and /proc/cpuinfo, and it is very useful for a quick characterization of the CPU architecture plus the make and model:

• The script gives you the hardware make and model when executed as root

• "Processors" are numbered sequentially across the two sockets

• If you grep for "Physical ID", the number of distinct values is the number of sockets

• If you grep for "Siblings", that's the number of Logical CPUs (per socket)

• "Core ID" is just the ID numbering for each core

• "CPU cores" is the number of cores (per socket)

Read RedHat DOC-7715 [3] for more details on how to determine if your Intel CPU is multi-processor, multi-core, or supports Hyper-Threading.
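As an illustration of the kind of parsing described above, the sketch below counts sockets, cores, and Logical CPUs from /proc/cpuinfo-style text. It is not the cpu_topology script itself. The function names `summarize_cpuinfo` and `fake_cpuinfo` are hypothetical, and the sample input is synthetic, shaped like the 2-socket, 8-core, 16-thread machine described earlier.

```python
def summarize_cpuinfo(cpuinfo_text):
    """Count sockets, cores and Logical CPUs in /proc/cpuinfo-style text."""
    sockets = set()   # distinct "physical id" values
    cores = set()     # distinct (physical id, core id) pairs
    threads = 0       # one "processor" entry per Logical CPU
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between processor stanzas
        key, value = key.strip(), value.strip()
        if key == "processor":
            threads += 1
            physical_id = None
        elif key == "physical id":
            physical_id = value
            sockets.add(value)
        elif key == "core id":
            cores.add((physical_id, value))
    return {"sockets": len(sockets), "cores": len(cores), "threads": threads}


def fake_cpuinfo(sockets, cores_per_socket, threads_per_core):
    """Build synthetic /proc/cpuinfo-style text (hypothetical test helper)."""
    stanzas = []
    processor = 0
    for _ in range(threads_per_core):
        for socket in range(sockets):
            for core in range(cores_per_socket):
                stanzas.append(
                    f"processor\t: {processor}\n"
                    f"physical id\t: {socket}\n"
                    f"siblings\t: {cores_per_socket * threads_per_core}\n"
                    f"core id\t\t: {core}\n"
                    f"cpu cores\t: {cores_per_socket}\n"
                )
                processor += 1
    return "\n".join(stanzas)


sample = fake_cpuinfo(sockets=2, cores_per_socket=4, threads_per_core=2)
print(summarize_cpuinfo(sample))
```

On an x86 Linux host the same function can be fed the real file contents, for example `summarize_cpuinfo(open("/proc/cpuinfo").read())`.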

To monitor the utilization of each Logical CPU and how it maps to a core and physical socket, you can use an Intel tool called "turbostat" [4]. The important columns are as follows:

• The TSC column is the published clock rate of the CPU

• The GHz column is the turbo boost clock rate (similar to overclocking)

In the image you can see a single-threaded, highly active process making one CPU 99% utilized, and whenever possible the turbo boost feature increases the frequency of that particular processor core. By using turbostat you'll be able to tell if turbo boost is kicking in, or at least if it's enabled. In the later sections of the paper I'll explain more about this feature and show the performance matrix when we turn Hyper-Threading and Turbo Boost on and off.

Different methods of measuring CPU speeds

There are different ways to measure the CPU speed of one platform against another, but the bottom line is to come up with a single currency system between hardware platforms so you can quickly measure the speed differences. Below are the methods:

• Published benchmarks

o TPC-C

o SPECint_rate2006

• Actual Benchmarking

o cputoolkit

o SLOB (Logical IO test)

TPC-C

This is probably the best-known benchmark.

• Published by a non-profit organization called the Transaction Processing Performance Council (TPC)

• Performance is measured by tpmC and price/performance, which you usually see in the Full Disclosure Report (FDR)

• To get the CPU performance we need to derive the tpmC/core, which is easy because the CSV data can be pulled from the site

• We then compare this number to another platform's. It is a good yardstick for comparing the ability of different CPUs to crank through database code path (Logical IOs mostly), because the benchmarks are executed on an Oracle database

Here's the snippet from the Full Disclosure Report:

Here's the CSV output [5] of the TPC-C benchmark:

The yardstick of CPU performance is the tpmC/core, and the Cisco UCS server above has a value of => 1609186.39 / 16 = 100574
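The tpmC/core derivation is a single division. A minimal sketch, using the Cisco UCS figures quoted above (the function name `per_core` is a hypothetical helper, not part of any TPC tooling):

```python
def per_core(benchmark_score, total_cores):
    """Return the benchmark score per core (for example tpmC/core)."""
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    return benchmark_score / total_cores


# Figures quoted above for the Cisco UCS TPC-C result.
cisco_ucs_tpmc_per_core = per_core(1609186.39, 16)
print(f"{cisco_ucs_tpmc_per_core:,.0f} tpmC/core")
```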

SPECint_rate2006

• Published by a non-profit organization called SPEC

• The organization is more diverse, with a suite of benchmarks that measure the performance of, say, Java, virtual machines, energy efficiency, web servers, etc.

• For the purposes of measuring CPU performance we use SPECint_rate2006

• It measures integer performance, and all software makes heavy use of integer instructions

• The opposite of this is floating point performance, which is useful for video games and digital content creation

• A good thing about this benchmark is that all CPUs are used. Old versions of SPECint measured just 1 CPU, which brought some inconsistencies as microprocessors advanced, so the benchmark had to be revised

• Used by OEM for CPU sizing (stored in the SYSMAN.EMCT_SPEC_RATE_LIB table)

• To get the CPU performance we compute SPECint_rate2006/core

• On the website you'll see this table, and there's a way to download it as a CSV dump so you can hack the data and quickly grep for data sets

Here's the snippet from the SPEC.org website:

The yardstick of CPU performance is SPECint_rate2006/core, and the X3-2 server above has a value of => 702 / 16 = 43.875. For easy comparison across platforms you can also pull the data points as CSV [6] and derive the SPECint_rate2006/core value for each row:
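Deriving the per-core value for every row of a CSV dump could look like the sketch below. The column names `System`, `Cores`, and `Result` are assumptions for illustration (the real SPEC export uses its own headers), and the second sample row is made up; only the 702 / 16 figures come from the text above.

```python
import csv
import io

# Hypothetical CSV layout; the real SPEC export uses its own column names.
SAMPLE_CSV = """System,Cores,Result
X3-2 (figures quoted above),16,702
Hypothetical older server,8,146
"""


def rank_by_per_core(csv_text, score_col="Result", cores_col="Cores"):
    """Add a per-core score to each row and sort fastest-per-core first."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["per_core"] = float(row[score_col]) / int(row[cores_col])
        rows.append(row)
    return sorted(rows, key=lambda r: r["per_core"], reverse=True)


ranked = rank_by_per_core(SAMPLE_CSV)
for row in ranked:
    print(f'{row["per_core"]:8.3f}  {row["System"]}')
```

Putting the derived value in the first printed column mirrors the layout the paper uses for quick visual comparison.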

Case #1: Comparing 2007 vs 2012 CPU speeds

To give you a better idea of how to use these published benchmarks, we can use the numbers to compare CPUs from 2007 and 2012.

• The rows highlighted in yellow are the servers we are comparing

• The tpmC/core and SPECint_rate2006/core numbers are already derived and shown in the first column, so for quick comparison just focus your eyes on the first column

• The following is the rest of the header info for each data set

tpmC/core: tpmC/core, System, tpmC, Price/Perf, Total System Cost, Currency, Database Software, Server CPU Type, Total Server Cores, Cluster, Date Submitted

SPECint_rate2006/core: SPECint_rate2006/core, #Cores, #Chips, #Cores/Chip, #Threads/Core, Base, Peak, Hardware Vendor, System

• In 2007, the IBM p570 was the fastest CPU around with 101,116 tpmC/core, while the ProLiant ML370 G5 was in the range of 34K tpmC/core

• This performance gap is also evident in the SPECint_rate2006/core numbers, which are 30.5 vs 18.25 respectively

• For 2012 there is no TPC-C data for Power7 servers, but assuming the SPECint_rate2006 numbers reflect TPC-C the way they did in 2007, we can say that Intel CPUs are now closing in in terms of performance

• The SPECint_rate2006 numbers of the IBM, Sun X3-2, and Cisco UCS servers below are pretty much in the same range, which is 43-45 SPECint_rate2006/core. This means that for server migrations across these servers I can size at a 1:1 ratio; say, 16 CPUs of an X3-2 are equivalent to 16 CPUs of an IBM p780

• The Solaris M9000 servers (at the bottom) are slow per thread/core but have a lot of CPU cores. So if you have high logical IO capacity requirements, go for this box, or opt to spread out across multiple UCS blades or Sun X3-2 servers

• With the knowledge of how to play around with the data points of published benchmarks, you'll be able to perform what-if capacity planning scenarios quickly and easily
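A what-if sizing of the kind described above can be sketched as a simple ratio of per-core scores. This is an assumption-laden illustration, not a sizing method the paper prescribes: `equivalent_cores` is a hypothetical helper, it treats the per-core benchmark score as a linear measure of capacity, and the 8-core source size is a made-up example value.

```python
import math


def equivalent_cores(source_cores, source_per_core, target_per_core):
    """Cores needed on the target to match the source's total benchmark capacity."""
    return source_cores * source_per_core / target_per_core


# Same per-core range on both sides (about 44): the sizing comes out 1:1.
same_range = equivalent_cores(16, 44.0, 44.0)
print(same_range)

# 2007 figures quoted above: 30.5 vs 18.25 SPECint_rate2006/core.
needed = equivalent_cores(8, 30.5, 18.25)
print(math.ceil(needed))
```

Rounding up with `math.ceil` is a deliberate choice here, since a fractional core cannot be provisioned.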

Actual Benchmarking

• Actual benchmarking really boils down to how many LIOs you can do per second

• There are two tools that can do an Oracle Logical IO micro benchmark: cputoolkit and SLOB

• The main difference is in their workloads

• The cputoolkit [7] runs multiple cycles, depending on the start and end CPUs you put in the parameters

• So running ./runcputoolkit-auto 1 2 dw will make 1 CPU 100% busy on the first cycle and 2 CPUs busy on the 2nd cycle

• And when you do 1 to 8, it steps through that behavior on each cycle

• BTW, the driver SQL of this tool takes 1sec per execute, doing more LIOs than SLOB

• SLOB [8], on the other hand, spreads out the work on all the Logical CPUs, so ./runit.sh 0 2 will drive the same workload as cputoolkit but spread out across CPUs

• cputoolkit allows you to control the saturation of a specific number of CPUs, whereas in SLOB you need to use numactl to pin it to specific CPUs

• Having varying CPU workloads [9] helps a lot in analyzing the Cores vs Threads behavior that I will discuss later in this paper

• The actual benchmarking (LIO micro benchmark) is the most accurate way of comparing CPU speeds across database servers

Case #2: Exadata V2 and X2 performance comparison and migration

In this example we use the results of cputoolkit to compare the CPU speeds of Exadata V2 and X2, and use the speed numbers to accurately estimate the CPU utilization on the new server even before doing the migration. Below is the speed comparison between the V2 and X2, the former doing 2.1 million LIOs/sec and the latter 3.6 million LIOs/sec. Take note that these LIOs/sec numbers are at peak utilization, where all of the CPUs are very active.

You can make use of the actual benchmark numbers when you are migrating a database from V2 to X2. Below is a Swingbench [10] run on both machines, running the same ~1200 Transactions per second (TPS)

• Look closely at the CPU utilization of the V2 and the elapsed times of the SQLs

• Even without the actual migration, I can estimate the CPU utilization on the target machine by making use of the actual benchmark numbers

• First, we need to consider how fast the V2 CPUs are compared to the X2. I call it the chip efficiency factor

Chip efficiency factor = (source LIOs/sec) / (destination LIOs/sec)
                       = 2.1M / 3.6M
                       = 0.5833

• Second, we apply that factor to the current CPU usage of the V2 (16 CPUs at 46% utilization) to get the requirement on the target machine

X2 CPU requirement = source host CPUs * utilization * chip efficiency factor
                   = 16 * 0.46 * 0.5833
                   = 7.36 * 0.5833
                   = 4.29 CPUs

• Then divide the requirement by the CPU capacity of the X2 (24 CPUs)

X2 CPU utilization = CPU requirement / CPU capacity
                   = 4.29 / 24
                   = 17.9 %

• Now when we do the actual migration, this is what happens (look closely at the CPU % and the elapsed times of the SQLs)

• When migrating from a slower to a faster CPU, each execution gets faster because of the CPU speed gains, resulting in lower CPU utilization

• With faster CPUs you are able to process the same number of Transactions per second (TPS) in less time and at lower CPU utilization

• The lower CPU utilization gives you more headroom for more Transactions per second (TPS), resulting in more work being done
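The three-step estimate used in this case (chip efficiency factor, CPU requirement on the target, resulting utilization) can be sketched as follows. The function names are hypothetical; the inputs are the numbers from the V2 to X2 example, with 16 source CPUs at 46% utilization and 24 target CPUs.

```python
def chip_efficiency_factor(source_lios_per_sec, target_lios_per_sec):
    """Ratio of source to target LIOs/sec from the micro benchmark."""
    return source_lios_per_sec / target_lios_per_sec


def estimate_target(source_cpus, source_utilization, factor, target_cpus):
    """Return (CPUs required on the target, projected target utilization)."""
    required_cpus = source_cpus * source_utilization * factor
    return required_cpus, required_cpus / target_cpus


factor = chip_efficiency_factor(2.1e6, 3.6e6)
required, utilization = estimate_target(16, 0.46, factor, 24)
print(f"factor={factor:.4f} required={required:.2f} CPUs "
      f"utilization={utilization:.1%}")
```

Keeping the factor as its own function makes it easy to swap in LIOs/sec numbers from a different pair of platforms.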

PART 2: MEASURING CORES VS THREADS

About the Cores vs Threads test case:

• This test case shows the CPU behavior as you saturate up to the max # of CPUs, and the effects of having Hyper-Threading and Turbo Boost on or off

• It's a CPU-centric (LIOs) workload, and as the server gets saturated up to max CPU there are diminishing returns on LIOs performance

• Above are the session level numbers

• Below are the workload level numbers

• With Hyper-Threading turned on, once the load gets past the core count, my tests show only about a 20% increase up to the max # of threads. So it's really not doubling the CPU LIO capacity by any means, but I would always leave it enabled.

HTon-TurboOn

• At session level, the more CPUs you use, the fewer LIOs/elapsed each session can do (the more you are sharing the LIO capacity with others), and once you reach the max CPU count you start to see some wait on the run queue, which ultimately affects session response times

• The narration of the graph goes like this: in the HTon-TurboOn test case, 14 saturated CPUs on the x axis have a y axis value of about 100,000 LIOs/Elap, which means that at workload level the LIOs/sec is about 14 x 100,000 = 1,400,000. But at this point I'm spending part of my response time on CPU wait (run queue): 1.2secs per execution out of 2.76secs total elapsed (not shown on the graph)
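The arithmetic in that narration can be written out as a short sketch. Both function names are hypothetical, and the inputs are the approximate readings quoted above (14 saturated CPUs, about 100,000 LIOs/Elap per session, 1.2secs of CPU wait out of 2.76secs elapsed).

```python
def workload_lios_per_sec(saturated_cpus, lios_per_elapsed):
    """Workload-level LIOs/sec from the session-level LIOs/Elap reading."""
    return saturated_cpus * lios_per_elapsed


def cpu_wait_fraction(cpu_wait_secs, total_elapsed_secs):
    """Share of each execution's elapsed time spent on the run queue."""
    return cpu_wait_secs / total_elapsed_secs


total_lios = workload_lios_per_sec(14, 100_000)
wait_share = cpu_wait_fraction(1.2, 2.76)
print(total_lios)
print(f"{wait_share:.0%} of elapsed time on CPU wait")
```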

Turbo boost

• With turbo boost, notice that it only helps when you are using just 1 or 2 CPUs (with either HTon or HToff). The overclocking effect does not really matter once all or most of the CPUs are working, even if turbostat (not shown on the graph) shows that the clock rate is still being increased

Hyper-Threading turned off

• With Hyper-Threading off, in the section HToff-TurboOn, look at the 320K range LIOs on CPU#1, which is higher than HTon-TurboOn. It may seem faster to be on this config, but this is what I mean by Hyper-Threading helping, in a way, with scalability. Look at the effect of "CPU wait" (time spent on the run queue): it starts to show up in the response time of the SQLs when the load reaches the max of 4 CPUs (HToff), while with HTon it starts at CPU 8. So with HToff you experience the queueing earlier, and on the OS side a 2 CPU load immediately shows as 50% OS CPU utilization and a 4 CPU load as 100%. With HTon (8 CPUs), even if you are not getting linear LIO performance once you get past the number of cores, completely avoiding "CPU wait" until the max # of threads is still a good thing in terms of performance.

The session level numbers

The workload level numbers

• Also, the 30% LIOs/sec performance increase is variable

• As I mentioned earlier, cputoolkit and SLOB have different workloads

• cputoolkit has more sustained 1sec executions

• SLOB has frequent millisecond-level executions that tend to spread the work out across all CPUs

• Just treat the cputoolkit result as the minimum LIOs/sec increase that you can get, and 30% as the ceiling

• There's also an Intel whitepaper [18] that explicitly says you will only get a 30% increase from Hyper-Threading

• Knowing this CPU behavior is useful when you do instance caging

• So when you do instance caging, make sure to treat 75% of the total Logical CPUs as the ceiling
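One way to apply the 75% rule of thumb is sketched below. It rests on an interpretation that the text does not spell out: that the sum of the CPU_COUNT settings of all caged instances on a host should stay at or under the ceiling. The function names and the database names in `plan` are hypothetical.

```python
import math


def caging_ceiling(logical_cpus, ceiling_fraction=0.75):
    """Logical CPUs treated as usable capacity under the 75% rule of thumb."""
    return math.floor(logical_cpus * ceiling_fraction)


def fits_under_ceiling(cpu_counts, logical_cpus):
    """True if the summed CPU_COUNT settings stay at or under the ceiling."""
    return sum(cpu_counts.values()) <= caging_ceiling(logical_cpus)


# Hypothetical CPU_COUNT plan for three instances on a 16 Logical CPU host.
plan = {"db1": 4, "db2": 4, "db3": 6}
print(caging_ceiling(16))
print(fits_under_ceiling(plan, 16))
```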

PART 3: THE DIFFERENT CPU EVENTS

There are three main CPU events in the database:

• CPU – the real CPU cycles

• CPU wait – the CPU time spent on run-queue

• CPU scheduler – the CPU time spent above the specified CPU_COUNT

The three events above are all measured in AAS (Average Active Sessions), and when talking about CPU requirements, 1 AAS CPU is equivalent to 1 Logical CPU being fully utilized. The examples below show a database server with 1 socket, 4 cores, and 8 Logical CPUs, with a CPU_COUNT value of 8

The AAS CPU

• The workload below comes from cputoolkit driving 2 CPUs and is well within the CPU capacity, which means all the sessions are getting served and none are waiting on the run-queue

The AAS CPU wait

• The workload is increased to 10 CPUs. With only 8 CPUs on the server, the workload is asking for 25% more CPU than the capacity, which results in some of the sessions waiting on the run-queue

The AAS CPU scheduler

• Instance caging is set to a CPU_COUNT of 4, which caused the CPU bound workload to drop. In this state the workload is still asking for 10 CPUs, but instance caging limits the Oracle CPU cycles to just 4 CPUs.

• Putting it all together, here's a scenario where a sudden SQL plan change caused the server to go CPU bound, then instance caging was implemented and a SQL Profile fix was applied
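The three events described in this part can be illustrated with a deliberately simplified toy model that splits a CPU demand (in AAS) into CPU, CPU wait, and CPU scheduler. This is an assumption made for illustration, not how the database accounts for time internally: real measurements are noisier, and the function name `split_cpu_demand` and its `caging` flag are hypothetical.

```python
def split_cpu_demand(demand, host_cpus, cpu_count, caging=False):
    """Toy model: return (cpu, cpu_wait, cpu_scheduler), all in AAS.

    demand    -- CPUs the workload is asking for
    host_cpus -- Logical CPUs on the server
    cpu_count -- CPU_COUNT of the instance
    caging    -- True when instance caging is enforcing cpu_count
    """
    if caging:
        allowed = min(cpu_count, host_cpus)
        # Demand above the cage is throttled by the scheduler.
        return min(demand, allowed), 0.0, max(0.0, demand - allowed)
    # Without caging, demand above the host capacity waits on the run-queue.
    return min(demand, host_cpus), max(0.0, demand - host_cpus), 0.0


# The three scenarios above on an 8 Logical CPU server.
within_capacity = split_cpu_demand(2, host_cpus=8, cpu_count=8)
over_capacity = split_cpu_demand(10, host_cpus=8, cpu_count=8)
caged = split_cpu_demand(10, host_cpus=8, cpu_count=4, caging=True)
print(within_capacity, over_capacity, caged)
```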

PART 4: MONITORING THE CPU OF MASSIVELY CONSOLIDATED ENVIRONMENTS

In a massively consolidated environment, having a single view of utilization across servers is critical

• On the Operating System side there are a couple of tools that help diagnose the source of a CPU surge

o collectl has a module called colmux [12], which collects data from different servers and presents it in one streamlined output

o On Exadata, the dcli command can be used together with the vmstat and uptime commands to quickly characterize the load of the servers

• On the database side, Enterprise Manager Grid Control is a must-have

o The Oracle Load Map can be a stethoscope for finding which database is causing the CPU surge

o Clicking on the database drills down to the Performance Page, where you can monitor CPU usage both in real time and historically

• Another useful way of monitoring the CPU resource usage is mining the AWR and extracting the data points as CSV

• Then you can use an analytics tool [13] to slice and dice, aggregate, and investigate the data across all of the instances as a time series
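For readers without an analytics tool at hand, the aggregation step can be approximated in a few lines. The sketch below sums AAS CPU across instances per snapshot. The CSV layout (`snap_time`, `instance`, `aas_cpu`) and all sample values are made up for illustration and do not reflect the actual output format of any AWR extract script.

```python
import csv
import io
from collections import defaultdict

# Hypothetical extract layout: one row per instance per snapshot.
SAMPLE_EXTRACT = """snap_time,instance,aas_cpu
2013-04-01 10:00,inst1,1.5
2013-04-01 10:00,inst2,2.25
2013-04-01 11:00,inst1,4.0
2013-04-01 11:00,inst2,0.5
"""


def cluster_aas_by_time(csv_text):
    """Sum AAS CPU across all instances for each snapshot time."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["snap_time"]] += float(row["aas_cpu"])
    return dict(sorted(totals.items()))


series = cluster_aas_by_time(SAMPLE_EXTRACT)
for snap_time, aas in series.items():
    print(snap_time, aas)
```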

• With the analytics tool you'll be able to create visualizations [16] that are not available in current Oracle monitoring tools

• Below is a visualization of CPU usage across a half rack Exadata. It clearly shows that when the BIPRD database was implemented, the CPU usage of the entire cluster went through the roof. This enabled more focused tuning on just the BIPRD database until the workload stabilized.

• Having an instance dimension in the visualization lets you show the CPU usage per instance. This is very useful for server capacity planning, as you can quickly estimate the CPU utilization of each server by taking the AAS CPU and dividing it by the CPU capacity of the server where that instance resides. Here instance #3 is at an AAS CPU of 21.94 and the total CPU of the server is 32, so the CPU utilization of that server is about 68.6% (21.94 / 32)

• To get the cluster-wide CPU utilization, take the sum of all AAS CPU and divide it by the sum of the server CPU capacities

• If you are migrating from slower to faster CPUs, you need to take the CPU speed differences into account, as shown in "Case #2: Exadata V2 and X2 performance comparison and migration" in this paper

• Visualizing the workload is also very helpful for capacity planning decisions. Here the FSPRD database had workload growth in the month of April, and distributing the instances across three nodes will help balance the workload across the cluster
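The per-server and cluster-wide utilization arithmetic from this part can be sketched as follows. The function names are hypothetical, and in the three-node example only the first pair (AAS CPU of 21.94 on a 32 CPU server) comes from the text; the other two nodes are made-up values.

```python
def server_utilization(aas_cpu, cpu_capacity):
    """CPU utilization of one server: AAS CPU divided by its Logical CPUs."""
    return aas_cpu / cpu_capacity


def cluster_utilization(servers):
    """servers: iterable of (aas_cpu, cpu_capacity) pairs, one per server."""
    servers = list(servers)
    total_aas = sum(aas for aas, _ in servers)
    total_capacity = sum(capacity for _, capacity in servers)
    return total_aas / total_capacity


node3 = server_utilization(21.94, 32)
print(f"{node3:.1%}")

# Hypothetical three-node cluster; only the first pair comes from the text.
nodes = [(21.94, 32), (8.0, 32), (2.0, 32)]
cluster = cluster_utilization(nodes)
print(f"{cluster:.1%}")
```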

REFERENCES

1) cpu_topology - http://goo.gl/EUDG7

2) Intel architecture topology script - http://goo.gl/50tRM

3) RedHat kbase DOC-7715 multi-processor/core or supports HT - http://goo.gl/NI0nn

4) turbostat.c - http://goo.gl/jDUKg

5) cpu - TPC-C tpmC/core - http://goo.gl/L4RXw

6) cpu - SPECint_rate2006 - http://goo.gl/doBI5

7) cputoolkit - http://karlarao.wordpress.com/scripts-resources/

8) SLOB - http://goo.gl/yKa45

9) CPU centric benchmark comparisons - http://goo.gl/nR9Yy

10) Swingbench - http://www.dominicgiles.com/swingbench.html

11) Cores vs Threads - http://goo.gl/1MLFf

12) collectl colmux - http://collectl-utils.sourceforge.net/colmux.html

13) Tableau Analytics - http://www.tableausoftware.com/

14) AAS investigation - http://goo.gl/5WaAg

15) Kyle Hailey - http://dboptimizer.com/2011/07/21/oracle-cpu-time/

16) AWR Tableau and R toolkit Visualization Examples - http://goo.gl/xZHHY

17) The mindmap of the “Where did my CPU go?” presentation - http://goo.gl/XeY0e

18) Intel Hyper-Threading technical user’s guide - http://goo.gl/FcMf2

19) Book: Computer Architecture: A Quantitative Approach 5th Ed - Chapter 1 Section 1.10 Putting it all together: Perf, Price, Power - http://goo.gl/MXigAQ

20) Book: The Art of Scalability - Ch11 "Headroom" - http://theartofscalability.com

21) Performance Page - CPU_COUNT, threads, cores line - http://goo.gl/CunHN

22) run_awr-quickextract - http://goo.gl/7uCk7w
