Benchmarking Slides


  • 8/12/2019 Benchmarking Slides


BE (CIS) Computer Systems Modeling (CS-417)

    Spring 2014
    S. Zaffar Qasim, Assistant Prof (CIS)

    BENCHMARKING COMPUTER SYSTEMS

Benchmarking (in computing) is the act of running a set of computer programs, or other operations, in order to assess the relative performance of an object in the computer system.

Usually associated with assessing performance characteristics of computer hardware, e.g., the floating-point operation performance of a CPU,
    o but there are circumstances when the technique is also applicable to software.
    o Software benchmarks are, e.g., run against compilers or database management systems.

The term benchmark is mostly used for the elaborately designed benchmarking programs themselves.

In a selection scenario, it is usually not possible to actually access the machine for detailed evaluation;
    o instead, one must depend on published data on the machine.
    o Benchmarks are usually run by vendors for typical configurations and workloads, and not by the user interested in the selection process.
    o This leaves a lot of room for misinterpretation and misuse of the measures, both by the vendors and the users.

    Purpose of Benchmarking

Benchmarking provides a method of comparing the performance of various systems (or subsystems) across different architectures.
    o As computer architecture advanced, it became more difficult to compare the performance of various computer systems simply by looking at their specifications.
    o Tests were developed for different architectures; e.g., Pentium 4 processors generally operate at a higher clock frequency than Athlon XP processors, but this doesn't mean more computational power for the former.

    A slower processor, w.r.t. clock frequency, can perform as well as a processor operating at a higher frequency.

Benchmarks are designed to mimic a particular type of workload on a component or system.
    o Synthetic benchmarks do this with specially created programs that impose the workload on the component.
    o Application benchmarks run real-world programs on the system.


Benchmarking also helps to choose effective system upgrades and to identify potential bottlenecks.
    o Particularly important in CPU design:
    o it gives processor architects the ability to measure and make tradeoffs in micro-architectural decisions.
    o e.g., if a benchmark extracts the key algorithms of an application, it will contain the performance-sensitive aspects of that application.
    o Running this much smaller snippet on a cycle-accurate simulator can give clues on how to improve performance.

    Types of benchmarks

I. Application benchmarks (or Real programs)
    A small subset of real-world programs that are representative of the application domain of interest.
    o e.g., word-processing software, CAD tool software, user's application software (i.e., MIS)

II. Synthetic benchmarks
    Artificial programs that mimic a real program execution environment by using statistical data about real high-level-language programs.

    Some examples of the relevant statistical data are:
    o Fraction of statements of each type (assignment, conditionals, for-loops, etc.).
    o Fraction of variables of each type (integer, real, character, etc.) and locality (local or global).
    o Fraction of expressions with a certain number and type of operators and operands.

These characteristics vary widely over different application domains and programming constructs,
    o so it is more useful to have a separate benchmark (application or synthetic) for each application domain and programming-language class.

    Procedure for programming a synthetic benchmark:
    a. take statistics of all types of operations from many application programs
    b. get the proportion of each operation
    c. write a program based on the proportions above
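The three-step procedure above can be sketched in Python. The statement-type mix below is hypothetical, chosen only for illustration; a real generator would emit compilable code for each statement type rather than labels:

```python
import random

# Hypothetical statement-type proportions measured from many
# application programs (steps a and b of the procedure above).
OP_MIX = {"assignment": 0.50, "conditional": 0.25, "for-loop": 0.15, "call": 0.10}

def generate_synthetic_program(n_statements, mix=OP_MIX, seed=0):
    """Step c: emit a statement sequence matching the measured proportions."""
    rng = random.Random(seed)
    ops = list(mix)
    weights = [mix[op] for op in ops]
    # Each generated "statement" is drawn according to the measured mix.
    return [rng.choices(ops, weights=weights)[0] for _ in range(n_statements)]

program = generate_synthetic_program(10_000)
```

Over a long enough program, the generated proportions converge to the measured ones, which is exactly the property a synthetic benchmark relies on.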


III. Component benchmark / Microbenchmark
    Consists of a relatively small, specific piece of code to measure the performance of a computer's basic components.

    The most important component in a particular case depends on the individual person's usage patterns.
    o A gamer seeking the best possible frame rates will probably be better served by a faster GPU than by more memory.
    o A casual user seeking a more responsive system may benefit from upgrading a slow hard drive to a fast solid-state drive.
    o One must decide which aspects of system performance are most important, tailor the suite of benchmarks to those specific needs, and then weigh the individual test results accordingly.

IV. Kernel benchmarks
    Low-level benchmark codes designed to measure basic architectural features of parallel machines,
    o normally abstracted from an actual program.
    o A popular kernel is the Livermore loops: 24 do-loops, some of which can be vectorized and some of which cannot.
    o The Linpack benchmark (contains basic linear-algebra subroutines written in the FORTRAN language).
    o Results are represented in MFLOPS.

    V. I/O benchmarks

    VI. Database benchmarks: to measure the throughput and response times of database management systems (DBMS).

    VII. Parallel benchmarks: used on machines with multiple cores, multiple processors, or systems consisting of multiple machines.
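A kernel benchmark of the Linpack/Livermore kind boils down to timing a small loop and reporting MFLOPS. A minimal Python sketch; the kernel shape follows the Linpack A(I) = B(I) + C x D(I) pattern, but the sizes and repetition counts are illustrative, not the actual Livermore loops:

```python
import time

def daxpy(c, b, d):
    # Linpack-style kernel: A(I) = B(I) + C x D(I)
    return [bi + c * di for bi, di in zip(b, d)]

def measure_mflops(n=100_000, reps=10):
    """Time the kernel and report its rate in MFLOPS."""
    b = [1.0] * n
    d = [2.0] * n
    t0 = time.perf_counter()
    for _ in range(reps):
        a = daxpy(3.0, b, d)
    elapsed = time.perf_counter() - t0
    flops = 2.0 * n * reps            # one multiply + one add per element
    return flops / elapsed / 1e6      # report in MFLOPS
```

The same pattern (count floating-point operations, divide by wall-clock time) underlies the MFLOPS figures quoted for these kernels.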

A synthetic benchmark can be thought of as a control program
    o that takes some parameters to define the nature of the desired workload and
    o then generates such a workload.
    o Thus, a synthetic program can be used to generate both CPU-intensive and I/O-intensive workloads.
    o For example, in a Unix-like environment,

the synthetic program could be a shell-script that generates some controlled mixture of commands, i.e., editing, compilation, file/directory manipulation, CPU-intensive computing, process creation/destruction, etc.
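A minimal sketch of such a parameterized control program, written in Python rather than as a shell-script; the parameter names and the particular CPU and I/O activities are illustrative stand-ins for the command mixture described above:

```python
import os
import tempfile

def synthetic_workload(cpu_iters, io_files, io_bytes):
    """Control program: the parameters define the CPU/I-O mixture."""
    # CPU-intensive part: pure integer computation.
    acc = 0
    for i in range(cpu_iters):
        acc += i * i
    # I/O-intensive part: create, write, and destroy temporary files,
    # standing in for file/directory-manipulation commands.
    for _ in range(io_files):
        fd, path = tempfile.mkstemp()
        os.write(fd, b"x" * io_bytes)
        os.close(fd)
        os.remove(path)
    return acc

result = synthetic_workload(cpu_iters=10_000, io_files=3, io_bytes=4096)
```

Raising `cpu_iters` skews the workload toward CPU-bound behavior; raising `io_files` or `io_bytes` skews it toward I/O-bound behavior.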

Synthetic programs have a few disadvantages as well:
    a. Such programs are complex and are usually OS-dependent;
    b. It may be difficult to choose parameters such that they approximate the real workload well; and
    c. They generate much raw data that needs to be reduced and interpreted.

Since there are many different capabilities of the machine to be tested (integer arithmetic, floating-point arithmetic, cache management, paging, I/O, etc.),
    o we can either have a suite of benchmarks for them,
    o or come up with a single synthetic program that can exercise all these capabilities.

    Application vs Synthetic benchmarks

The main advantage of a synthetic benchmark over an application one:
    o overall application-domain characteristics can be closely matched by a single program.

    While application benchmarks give a much better measure of real-world performance on a given system,
    o synthetic benchmarks are useful for testing individual components, like a hard disk or networking device.

Since a synthetic benchmark is not designed to do anything meaningful, it may show the following strange characteristics:

    1. Since the expressions controlling conditional statements are chosen randomly,
    o a significant amount of unreachable (or dead) code may be created,
    o much of which can be eliminated by a smart compiler. Code size becomes highly dependent on the quality of the compiler.

    2. Random selection of statements may result in unusual locality properties, defeating well-designed paging algorithms.

    3. The benchmark may be small enough to fit entirely in the cache and result in unusually good performance.


    Some popular benchmarks

    Whetstone synthetic benchmark

o Based upon the characteristics of FORTRAN programs doing extensive floating-point computation.

    o A single Whetstone instruction is defined as a mix of basic high-level operations, and the results are reported as mega-Whetstones per second.

    o This benchmark, although still very popular, represents outdated and arbitrary instruction mixes and is thus not very useful for machines with a high degree of internal parallelism (pipelining, vectored computation, etc.), particularly in conjunction with optimizing and parallelizing compilers.

    Dhrystone synthetic benchmark

o written in C,
    o designed to represent applications involving primarily integer arithmetic and string manipulation in a block-structured language.
    o This benchmark is only 100 statements long, and thus would fit in the instruction cache of most systems.
    o Also, since it calls only two simple procedures, strcpy() and strcmp(), a compiler that replaces them with inline code can cause a very significant speed-up.
    o The results are quoted as Dhrystones/sec, i.e., the rate at which the individual benchmark statements are executed.

    Linpack application benchmark

Solves a dense 100 x 100 linear system of equations using the Linpack library package.

    Two problems with this benchmark:
    a. Over 80% of the time is spent doing A(I) = B(I) + C x D(I)-type calculations, making results highly dependent on how well such a computation is handled;
    b. The problem is too small.


Other popular benchmarks:
    o Spice, a large analog-circuit simulation package, mostly written in FORTRAN, uses both integer and floating-point arithmetic,
    o gcc, based on the GNU C compiler,
    o li, a Lisp interpreter written in C, and
    o nasa7, a synthetic benchmark consisting of a set of seven kernels doing double-precision arithmetic in FORTRAN.

Given a benchmark program, we can compare various machines in terms of their running times.

    o For historical reasons, the VAX11/780 is still a popular reference machine; moreover, it is regarded as a typical 1-MIPS machine.

    o Thus, if an integer benchmark takes 80 seconds of CPU time on a VAX11/780, and 4 seconds on machine A, we can claim that A is an 80/4 = 20 MIPS machine.
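The arithmetic of this relative rating is just a ratio against the 1-MIPS reference, which can be written out as:

```python
def relative_mips(vax_time, machine_time, vax_mips=1.0):
    """MIPS rating relative to the VAX11/780 (taken as a 1-MIPS machine):
    the speedup over the reference scales the reference's rating."""
    return vax_mips * vax_time / machine_time

# The example above: 80 s on the VAX11/780, 4 s on machine A.
rating = relative_mips(80.0, 4.0)  # 80/4 = 20 MIPS
```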

Recognizing the need for high-quality standardized benchmarks and benchmark data,
    o a number of vendors have collectively established an organization, the System Performance Evaluation Cooperative (SPEC).

For CPU performance, SPEC has defined a suite of ten benchmarks,
    o four of which (gcc, espresso, li, and eqntott) do primarily integer arithmetic, and
    o the other six (spice, doduc, nasa7, matrix, fpppp, and tomcatv) primarily floating point.

The reference machine used is the VAX11/780.
    The geometric mean of the integer benchmark results is known as SPECint, and that of the others as SPECfp.

    The geometric mean of SPECint and SPECfp is known as SPECmark, and has become a popular measure.
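The SPEC composites are plain geometric means of the per-benchmark speedup ratios over the reference machine. A sketch; the ratio values below are made up for illustration and are not real published results:

```python
import math

def geometric_mean(values):
    """nth root of the product of n values, computed via logs for stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical speedups over the VAX11/780 (illustrative only).
int_ratios = [20.0, 18.0, 22.0, 21.0]                # gcc, espresso, li, eqntott
fp_ratios = [25.0, 19.0, 23.0, 24.0, 22.0, 26.0]     # spice, doduc, nasa7, ...

spec_int = geometric_mean(int_ratios)
spec_fp = geometric_mean(fp_ratios)
spec_mark = geometric_mean([spec_int, spec_fp])
```

The geometric mean is used (rather than the arithmetic mean) so that the composite is independent of which machine is chosen as the reference.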

Table 1 lists these three measures for a few systems (denoted as MARK, INT, and FP).


    TABLE 1: Benchmark data for some selected workstations

The SPARCstation-2 is about 20.7 times faster than a VAX11/780 on integer operations.

    The data shown here is for illustrative purposes only,
    o not intended for a direct comparison,
    o since the selected systems do not necessarily fall in the same price/performance category.

    Also, some important information, such as cache size, compiler version, background load, etc., is not indicated.

    Precautions in Benchmarking

A number of hardware and software factors need to be considered before running benchmarks.

1. Many benchmarks place significant stress on specific components;
    o ensure all such components are in good working order,
    o properly cooled (if necessary), and receiving adequate power.

2. When comparing machines using benchmarks, some care is needed to eliminate the effect of extraneous factors.

    o e.g., a highly optimizing compiler can often speed up the code by a factor of two compared with an ordinary compiler.
    o Therefore, it's necessary that benchmarks for all machines be compiled using compilers with similar features and settings.
    o Also, the details of the machine configuration should be spelled out clearly; e.g., one should not be able to claim much shorter times by just increasing the cache size.

3. On the software front, parameters for the OS, applications, and drivers must be satisfied to ensure accurate, repeatable benchmark results.


o Windows (and other) OSs proactively prefetch data and store numerous temporary files that could interfere with a benchmark, so it's best to clear out any temporary files and prefetch data before running a test.

    o In Windows 7, you can find prefetch data in C:\Windows\Prefetch, and temporary files in C:\Windows\Temp and C:\Users\[username]\AppData\Local\Temp.

4. Use the correct (or latest) drivers for a component to ensure it is operating and performing optimally.

    o This is especially true of graphics boards and motherboards/chipsets, where the wrong driver can significantly worsen the system's frame rates or transfer speeds and latency.
    o Finally, confirm that the OS is fully updated and patched to ensure optimal compatibility and to reflect the current, real-world OS configuration.

    Challenges

Interpretation of benchmarking data is extraordinarily difficult.

Here is a partial list of common challenges:

    1. Manufacturers commonly report only those benchmarks (or aspects of benchmarks) that show their products in the best light.
    o They have also been known to misrepresent the significance of benchmarks, again to show their products in the best possible light.

2. Many benchmarks focus entirely on the speed of computational performance,
    o neglecting other important features such as qualities of service.
    o Unmeasured qualities of service include security, availability, reliability, serviceability, scalability, etc.
    o Transaction Processing Performance Council (TPC) benchmark specifications partially address these concerns by specifying ACID-property tests, database scalability rules, and service-level requirements.

3. Users can have very different perceptions of performance than benchmarks may suggest.
    o In particular, users appreciate predictability: servers that always meet or exceed service-level agreements.


o Benchmarks tend to emphasize mean scores (IT perspective), rather than maximum worst-case response times (real-time computing perspective), or low standard deviations (user perspective).

4. Many server architectures degrade dramatically at high (near 100%) levels of usage.
    o Vendors, in particular, tend to publish server benchmarks at a continuous load of about 80% usage (an unrealistic situation)
    o and do not document what happens to the overall system when demand spikes beyond that level.

5. Benchmarking institutions often disregard or do not follow basic scientific method.
    o This includes, but is not limited to: small sample sizes, lack of variable control, and the limited repeatability of results.

    Benchmark Strategies

Most benchmark programs base the measure of performance on the time required to execute the benchmark.

    Several other strategies can be employed, however.

    Specifically, the three different strategies for using a benchmark program to measure the performance of a computer system are the following.

1. Measure the time required to perform a fixed amount of computation.
    o a fixed workload
    o e.g., the SPEC CPU benchmarks

    2. Measure the amount of computation performed within a fixed period of time.
    o Argument: users with large problems are willing to wait a fixed amount of time to obtain a solution.
    o At the end of the allotted execution time, the total amount of computation completed is used as the measure of the relative speeds of different systems.
    o e.g., the TPC benchmarks

    3. Allow both the amount of computation performed and the time to vary.
    o These types of benchmarks instead try to measure some other aspect of performance that is a function of both the computation and the execution time.
    o e.g., the HINT benchmark
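The first two strategies can be contrasted in a short sketch; the workload function here is a stand-in, not any real benchmark:

```python
import time

def work(n):
    """Stand-in workload: n integer multiply-mod steps."""
    acc = 1
    for i in range(1, n + 1):
        acc = (acc * i) % 1_000_003
    return acc

def fixed_work(amount):
    """Strategy 1 (SPEC-style): time a fixed amount of computation."""
    t0 = time.perf_counter()
    work(amount)
    return time.perf_counter() - t0          # result is a time; lower is better

def fixed_time(unit, budget_seconds):
    """Strategy 2 (TPC-style): count work completed in a fixed time."""
    done = 0
    t0 = time.perf_counter()
    while time.perf_counter() - t0 < budget_seconds:
        work(unit)
        done += unit
    return done                              # result is throughput; higher is better
```

Strategy 3 would combine both measurements into a derived figure of merit, as the HINT benchmark does.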
