System measurement with the hardware performance counters

Advanced Architecture (7810) Project Proposal

Anton Burtsev [email protected]

May 23, 2006

1 Introduction

This work studies the costs and rates of various operating system events. Through this study we try to improve our understanding of operating system workloads, their influence on overall system performance, and approaches to possible optimizations. Historically, hardware designers have paid too little attention to operating system support. Furthermore, the inherent complexity of operating systems forces hardware developers to make simplifying assumptions while emulating operating system behaviour in software. We use the performance monitoring interface provided by contemporary processors to study operating system workloads on real hardware. We hope to gather undistorted results which will be beneficial for both the systems and hardware research communities.

2 Motivation

Most experiments in this work are motivated by two recent trends in the research of hardware architectures and systems. Hardware platforms are moving towards multicore processors with simpler heterogeneous cores. Systems research, in turn, is trying to adopt the microkernel approach to system construction in the form of virtual machines. We try to optimize the relationship between these emerging hardware and software architectures.

We divide our experiments into two groups: micro- and macrobenchmarks. Microbenchmarks measure the cost of basic low-level events used by the kernel to implement higher system abstractions. Macrobenchmarks provide a high-level picture of the system workload, the frequency of the low-level events, and the effects of interference between user-level applications and the operating system.

Combining results from these groups we can predict how system performance changes if either hardware architectures improve support for microoperations or systems are restructured to run as virtual machines on multicore processors.

3 Microbenchmarks

The set of microbenchmarks in this section aims to explore the cost of basic kernel operations. In the experiments below we keep Amdahl's law in mind: to be able to make any reasonable suggestions about system optimizations, we measure not only the cost of the operations but their frequency as well.

3.1 Cost of entering the kernel

This experiment measures the cost of switching the execution context from user level to the kernel. In other words, it's the cost of doing a null system call.

The most common implementation of a system call raises a processor exception, which is delivered to a kernel handler. The system call doesn't require an address space switch; however, it changes the privilege level, saves the execution context and switches to a new stack.

Note that only the change of privilege level is required for system call correctness. In general, a system call can proceed on the same stack and does not have to save the context (this implementation is suggested by the sysenter, syscall and sysexit instructions introduced in the later Intel Pentium models). Therefore, it's interesting to measure how much overhead is introduced specifically by context saving. This can be done by comparing system calls implemented with the sysenter and int instructions.

The Linux implementation of the system call introduces some overhead compared to the abstract pure system call. Therefore, it's interesting to measure both the abstract and the Linux system calls and see how the cost is distributed along the invocation path.

Section 4.1.2 measures the cost of the most used Linux system calls to figure out the minimal hardware resources they need. The frequency of the system calls also has to be measured.

Experiment plan

To undertake this experiment the Linux kernel should be modified in the following way: we extend the kernel with a test system call, read the time-stamp counter before invoking this system call, and read it again right after entering the kernel from the body of the system call, or even earlier if we want to avoid measuring the system call overhead introduced by the Linux kernel.

The first time we read the time-stamp counter at user level; the second time we do it in the kernel. Therefore, we have to pass the value read at user level into the kernel. The most straightforward way to do that is to use one of the registers. Note, though, that the time-stamp counter is 64 bits wide, so we need two registers on the IA-32 architecture to save and pass it.

The time-stamp counter can be read with the RDTSC instruction (see Section 18.8 of the Intel Architecture Programming Manual, Volume 3B [3]). Note also that we have to check that the Linux kernel does not restrict invocation of the RDTSC instruction to ring zero through the TSD (time-stamp counter disable) flag, which is part of the CR4 register.
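
As an illustration of the two-register point above, a full 64-bit read of the counter can be sketched as follows; rdtsc64 is just an illustrative name, and the RDTSC instruction itself returns the low half in EAX and the high half in EDX.

/* Read the full 64-bit time-stamp counter. */
static inline unsigned long long rdtsc64(void)
{
        unsigned int lo, hi;

        asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
        return ((unsigned long long)hi << 32) | lo;
}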

Generally there are two possible approaches to this experiment: we can either extend the Linux kernel with a test system call or instrument several other Linux system calls, hoping that such a measurement will deliver a more realistic cost value.

Extending a Linux kernel with a new system call

The Linux kernel can be extended with a system call in the following way:

1. File ./arch/i386/kernel/syscall_table.S: Add a new system call handler to the sys_call_table like:

   .long sys_your_sys_call

2. File ./include/asm-i386/unistd.h: Define a new system call number like:

   #define __NR_your_sys_call 311

   In the same file redefine the overall number of system calls in the kernel by incrementing it by one:

   #define NR_syscalls 312

3. File ./<some kernel file.c>: Implement the system call like:

   asmlinkage long sys_your_sys_call(void)
   {
           /* ... your code ... */
           return 0;
   }

System call invocation

The system call can be invoked through one of the syscall{x} functions, which are user-level wrappers capable of packing x parameters and trapping into the kernel with the int 0x80 instruction. The syscall{x} defines are invoked by the high-level system call wrappers from the libc library. An example of a system call invocation is presented in Figure 1.

Reading of the time-stamp counter register

We read the time-stamp counter with the GNU inline assembler sequence presented in Figure 2, which the g++ compiler at the -O2 optimization level optimizes into a plain rdtsc invocation.

Experiment implementation

If we decide to instrument only one system call and measure its cost right after entering the kernel, we have to provide a way to distinguish it from other system calls.

#include <syscall.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/types.h>

_syscall0(int, your_sys_call);

int main(void)
{
        long res;

        res = syscall(SYS_your_sys_call);

        return 0;
}

Figure 1: An example of a Linux system call invocation

inline volatile unsigned long int RDTSCL()
{
        unsigned long int x;
        asm volatile ("rdtsc" : "=a" (x) : : "edx");
        return x;
}

Figure 2: Sample code for reading the time-stamp register

The most obvious way to do that is to pass the system call to the kernel through one of the free interrupt vectors (for example, vector 0x81). Only the test system call will enter the kernel through this interrupt, and therefore we can safely undertake our measurements. To do that we have to create copies of the syscall{x} functions that trap through the 0x81 interrupt. We also have to create a copy of the interrupt handler ENTRY(system_call) in ./arch/i386/kernel/entry.S and register it as the 0x81 interrupt handler with the set_system_gate function during system initialization in ./arch/i386/kernel/traps.c.

After that we instrument the ENTRY(system_call) kernel entry point to read the time-stamp counter right after entering the kernel. The actual ENTRY(system_call) instrumentation and code sequence are presented in Figures 3 and 4.

So the kernel entry point is instrumented to read the time-stamp counter register and exit the kernel immediately. This measurement gives us the cost of entering the kernel via the 0x81 interrupt.

Similar instrumentation for the sysenter instruction is slightly more complex. I'm not sure yet how it works and have to read up on it.

/* Read time before invoking a system call */
tl0 = RDTSCL();

/* Invoke a system call; it will return a time from the kernel */
tl1 = cost_entering_kernel();

printf("Time to enter: %u", tl1 - tl0);

Figure 3: User level system call invocation

/* System call kernel entry point */
ENTRY(system_call_0x81)
        rdtsc
        iret

Figure 4: Kernel entry point

3.2 Measuring the cost of a NULL system call

The cost of doing a NULL system call can be measured by reading the time-stamp counter register before and after returning from the kernel (Figure 5).

tl0 = RDTSCL();
cost_entering_kernel();
tl1 = RDTSCL();

printf("Time to do a NULL system call: %u", tl1 - tl0);

Figure 5: Reading the time-stamp counter

By modifying the definition of syscall{x} we can force the system call to enter the kernel either through the int 0x81 or the sysenter instruction.
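
A minimal sketch of such a modified wrapper is shown below. It assumes the __NR_your_sys_call number defined earlier and the 0x81 gate registered as described in Section 3.1, and it carries over the ordinary int 0x80 convention of passing the call number in EAX and receiving the result in EAX.

/* Trap into the kernel through vector 0x81 instead of 0x80. */
static inline long cost_entering_kernel(void)
{
        long ret;

        asm volatile ("int $0x81"
                      : "=a" (ret)
                      : "a" (__NR_your_sys_call)
                      : "memory");
        return ret;
}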

We repeat the experiment 10000 times. After the first three to five iterations the cost of entering the kernel converges. We measured the cost of entering the kernel via the int 0x81 instruction to be about 390 cycles. The costs of a null system call entering the kernel via the int 0x81 and sysenter instructions are 1065 and 577 cycles respectively.

3.3 Measuring the Linux system call overhead (not implemented)

If we decide to instrument several (or all?) Linux system calls to gather statistics for this experiment, we just have to instrument the syscall{x} functions and the ENTRY(system_call) entry.

To measure the overhead introduced to system call handling by the Linux kernel, we can take the second measurement right after entering the system call body. This will give us the length of the real system call invocation path instead of the cost paid just to enter kernel mode.

The simplest way to report the measured time is to store the time-stamp counter values in a memory array allocated in kernel space and then dump them on the console with the printk function, although more sophisticated methods could be elaborated (is writing to a file harder, or should we use the syslog daemon?).
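
A minimal sketch of this reporting path is shown below; record_sample and dump_samples are illustrative names, not existing kernel functions.

#include <linux/kernel.h>

#define MAX_SAMPLES 500

static unsigned long samples[MAX_SAMPLES];
static int nsamples;

/* called from the instrumented system call path */
static inline void record_sample(unsigned long tsc)
{
        if (nsamples < MAX_SAMPLES)
                samples[nsamples++] = tsc;
}

/* dump the accumulated values on the console */
static void dump_samples(void)
{
        int i;

        for (i = 0; i < nsamples; i++)
                printk(KERN_INFO "sample %d: %lu\n", i, samples[i]);
        nsamples = 0;
}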

3.4 Cost of doing a user thread switch

This experiment measures the cost of performing a thread switch. A thread switch requires saving and loading the execution context, a change of stack, and kernel bookkeeping related to thread management.

Again, as in the previous experiment, only saving the execution context is strictly required. In general, a thread can proceed on the same stack.

In order to make this experiment realistic we have to measure the cost of the thread switch in some existing thread library implementation. It's not clear which implementation would be the best choice.

The frequency of thread switching can vary under different workloads; therefore, we will study at least two cases: server and desktop workloads.

Experiment implementation

In this experiment two threads run with a high priority and yield the CPU to each other. One thread reads the time-stamp register before yielding the CPU; the second thread reads the time-stamp right after getting the CPU (i.e. returning from the previous yield). The threads are synchronized through the iteration variable. The experiment's pseudocode is presented in Figures 6 and 7.

We yield the CPU between the threads 5000 times. After the first three to five iterations the cost of a thread switch converges. We found that the cost of performing a thread switch is about 2182 cycles. The cost of yielding the CPU to another thread and receiving control back (i.e. reading the time-stamp counter before and right after returning from sched_yield) is measured to be about 4485 cycles.

3.5 Cost of performing an address space switch

On Intel architectures an address space switch is done by reloading the base page table register (CR3). As I understand it, this flushes all TLB entries unless they are marked as persistent. The refill of the TLB entries constitutes the most significant part of the cost of the address space switch.

On architectures providing address space identifiers, the TLB does not have to be flushed, and therefore address space switches are virtually free, incurring only some bookkeeping cost.

pthread_mutex_lock(&mut);
iteration = 1;
pthread_cond_broadcast(&cond);
pthread_mutex_unlock(&mut);

sched_yield();

for ( i = 0; i < DEPTH - 1; ) {

        sched_yield();

        /* Read completion of thread switch */
        t1 = RDTSCL();
        pthread_mutex_lock(&mut);
        i = iteration;
        (*calldata_launch)[i] = t1;
        pthread_mutex_unlock(&mut);
};

Figure 6: Launch thread

pthread_mutex_lock(&mut);
while ( iteration == 0 ) {
        pthread_cond_wait(&cond, &mut);
}
pthread_mutex_unlock(&mut);

sched_yield();

for ( i = 0; i < DEPTH; i ++ ) {

        pthread_mutex_lock(&mut);
        iteration = i;
        pthread_mutex_unlock(&mut);

        /* Read beginning of thread switch */
        (*calldata_yield)[i] = RDTSCL();
        sched_yield();
};

Figure 7: Yield thread

3.5.1 Cost of doing a process switch

The most widely used case of an address space switch is a process switch. Compared to an abstract pure address space switch, a process switch incurs significant bookkeeping overheads related to process management in the kernel. The bookkeeping overhead differs between operating systems.

Experiment implementation

This experiment is almost the same as measuring the cost of a thread switch. Two processes run at a high priority and yield the CPU to each other; they are synchronized through a shared memory region.
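
A minimal user-level sketch of this setup follows. The priority setup and the exact measurement points of the thread experiment are omitted, and a shared turn flag in an anonymous shared mapping stands in for the shared-memory synchronization described above.

#define _GNU_SOURCE
#include <sched.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        /* one shared page holding the "whose turn is it" flag */
        volatile int *turn = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        int me = (fork() == 0) ? 1 : 0;
        int i;

        for (i = 0; i < 1000; i++) {
                while (*turn != me)
                        sched_yield();  /* give the CPU to the other process */
                /* read the time-stamp counter here, as in the thread experiment */
                *turn = 1 - me;         /* hand the turn to the peer */
        }
        return 0;
}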

Again we yield the CPU between the processes 1000 times. After the first several iterations the cost of a process switch converges. We measured it to be about 3105 - 3225 cycles.

3.5.2 Cost of refilling a single TLB entry

This simple experiment measures the average cost of refilling one TLB entry. The page table entry used for the refill can be either in the processor cache or in main memory. To undertake this experiment we will periodically delete one entry from the TLB and measure the cost of the hardware page table lookup.

Experiment plan

To undertake this experiment, we will flush the TLB entry with the INVLPG instruction, read the time-stamp counter, invoke some operation which touches the page pointed to by the flushed TLB entry, and read the time-stamp counter again.

The only complexity in this experiment is choosing the page-touching operation. We can try different operations, for example passing control to the flushed page or reading data from it. In both cases we can probably achieve an effect where the instructions from the flushed page are already in the trace cache or in the processor cache, so that we measure only the pure cost of the TLB entry refill.

On the other hand, it's also interesting to split the experiment into two parts, flushing and refilling, interleaved with some code, to see whether Pentium processors actually flush the TLB entry or apply some optimization to keep it in internal memory or in the L1 or L2 caches.

Experiment implementation

INVLPG is a privileged instruction, therefore we have to invoke it from the kernel. To do that we implement a loadable kernel module and invoke the test code from the module function.
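
A minimal module skeleton for this purpose might look as follows; the tlb_test_* names are illustrative.

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>

static int __init tlb_test_init(void)
{
        /* run the measurement once at module load time;
           the INVL_OP/OP sequences defined below go here */
        printk(KERN_INFO "tlb_test: running TLB refill measurement\n");
        return 0;
}

static void __exit tlb_test_exit(void)
{
        printk(KERN_INFO "tlb_test: unloaded\n");
}

module_init(tlb_test_init);
module_exit(tlb_test_exit);
MODULE_LICENSE("GPL");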

We define two macros for a TLB entry invalidation and memory access:

#define INVL_OP(x) asm volatile ("invlpg %0" : : "m" (*((char *)p + (x)*SHIFT)))
#define OP(x) asm volatile ("movl (%0), %%eax" : : "r"(((char *)p) + (x)*SHIFT) : "eax")

We also define a macro invoking a series of operations on subsequent memory locations to avoid the overhead of invoking them in a loop:

#define OP10(_x) OP((_x*10 + 0)); OP((_x*10 + 1)); OP((_x*10 + 2)); OP((_x*10 + 3)); \
                 OP((_x*10 + 4)); OP((_x*10 + 5)); OP((_x*10 + 6)); OP((_x*10 + 7)); \
                 OP((_x*10 + 8)); OP((_x*10 + 9));

In order to measure the cost of invalidating a page table entry we should first access the pages to ensure that the translations are in the TLB. The Xeon CPU has separate data and instruction TLBs. The data TLB has 64 entries. We declare a global SHIFT variable equal to the PAGE_SIZE constant defined by Linux and access 64 pages linearly with the help of the OPxx macros.

After that we invalidate these 64 TLB entries with the INVL_OP macro, reading the time-stamp counter register before and after the invalidation.

In order to measure the cost of a hardware TLB lookup we access the region we just invalidated with the help of the OP macro, reading the time-stamp counter before and after completion.

An example code sequence we invoke is presented in Figure 8.

We repeat the whole experiment 100 times. After the first two or three iterations the costs of all operations converge. We measured the cost of invalidating 64 TLB entries to be about 33255 cycles and the cost of refilling 64 TLB entries to be 1335 cycles.

tl0 = RDTSCL();

INVL_OP10(0);
INVL_OP10(1);
INVL_OP10(2);
INVL_OP10(3);
INVL_OP10(4);
INVL_OP10(5);
INVL_OP(60);
INVL_OP(61);
INVL_OP(62);
INVL_OP(63);

tl1 = RDTSCL();

(*calldata2)[i] = tl1 - tl0;

Figure 8: Invalidating 64 TLB entries

Trying to measure the overhead introduced by the OP macro itself (i.e. by the mov instruction and the address computation), we measured the cost of accessing 64 memory locations whose translations are preserved in the TLB to be about 277 cycles.

We also performed the same experiments but accessed the memory and invalidated the TLB entries from a loop. In that case the cost of invalidating 64 TLB entries was around 32940 cycles, and .... (have to check).

3.6 Slowdown caused by the TLB flush (not implemented)

This experiment measures the slowdown of a process caused by a full TLB flush. We compare the performance of a process running on a hot TLB with that of a process whose TLB is flushed periodically. The experiment shows how much execution performance degrades due to running on a cold TLB. Furthermore, it clarifies how much the TLB flush contributes to the cost of a process switch.

In some sense this experiment is related to the experiments in Section 4.2. We distinguish it as a separate experiment, however, because the TLB flush probably has the largest effect on process switch performance (maybe this is a bad assumption, and branch predictor and cache pollution incur a comparable effect, in which case we have to undertake similar experiments for the caches and branch predictors).

Experiment plan

We will run a benchmark instrumented to flush the TLB periodically (maybe only once, actually) and compare the time spent to accomplish the benchmark in both cases.

To do the TLB flush we have to extend the Linux kernel with an appropriate system call. We implement the system call similarly to the kernel entering experiment in Section 3.1.

Theoretically, on the Intel Pentium architecture, we can flush the TLB by writing to the CR3 control register. CR3 is the register which stores the base page table address. Therefore, we have to save the CR3 value and then immediately write it back, thus causing the TLB flush while keeping the same page table.
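
A sketch of the kernel-side flush, to be invoked from the test system call described above, might be (flush_tlb_by_cr3 is an illustrative name):

/* Flush the whole TLB by rewriting CR3 with its current value. */
static inline void flush_tlb_by_cr3(void)
{
        unsigned long cr3;

        asm volatile ("movl %%cr3, %0" : "=r" (cr3));
        asm volatile ("movl %0, %%cr3" : : "r" (cr3) : "memory");
}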

To find the time required to accomplish a benchmark, we have to read the time-stamp counter before starting the benchmark and right after it completes. To do that elegantly, we have to find out how to compile, link and run the benchmark code directly from our test code. I hope that at least some Spec benchmarks are ported to Linux and are compilable and runnable in this way. A possibly useful resource about running Spec benchmarks on Linux is: http://lbs.sourceforge.net/.

In order to achieve cleaner measurements, we might need to ensure non-interruptible execution of the tested process. To do that we can apply the techniques described in Section 6.3.

Another approach to refining the experiment results is statistical: we can run the experiments many times under the same workload and obtain the result by statistical approximation.

3.7 Cost of synchronization

There are several common ways to implement thread synchronization: interrupt disabling (applicable only in the single-CPU case), use of atomic instructions to implement critical sections, and use of lock-free algorithms. Efficient implementation of a synchronization library therefore requires knowledge of the overhead introduced by the low-level synchronization primitives.

A conventional Linux kernel usually uses a combination of interrupt disabling and atomic instructions to implement locks.

3.7.1 Cost of interrupt disabling

This experiment measures the cost of switching a thread to a non-interruptible state. On Intel processors this is usually done by invoking the CLI instruction.

Experiment plan

This is a simple experiment: we just read the time-stamp counter before and after invoking the CLI and STI instructions.

Further, we can also measure the cost of masking external interrupts in the APIC. The cost of communicating with the APIC is expected to be greater because it requires going off-chip.

Experiment implementation

Since CLI and STI are privileged instructions, we invoke the code of this experiment from a loadable kernel module.

Again we declare the following macros:

#define local_irq_disable() asm volatile ("cli" : : : "memory")
#define local_irq_enable()  asm volatile ("sti" : : : "memory")

#define OP(x) local_irq_disable(); local_irq_enable()

With the help of the serial invocation macros (OP100) we invoke the interrupt disable/enable sequence 100 times. We repeat this experiment 100 times. After the first few iterations the cost of disabling interrupts converges. We measured the cost of 100 subsequent interrupt disable/enable invocations to be 11700 cycles on the Intel Xeon processor.

3.7.2 Cost of snooping a variable from a remote cache

In the multiprocessor case, interrupt disabling is not sufficient to implement synchronization. The most widely used technique in conventional OS kernels relies on compare-and-swap instructions to implement spin-lock critical sections. Synchronization between different processors incurs additional overheads introduced by the cache consistency protocol.

Experiment plan

The most widely used implementation of a critical section relies on an atomic hardware primitive (equivalent to one of test-and-set, compare-and-swap or load-store conditional) to build a spin lock. Two threads trying to acquire the lock read a shared variable (most often from the private L1 cache) and try to modify it atomically. When the two threads compete for the lock intensively, the shared variable is frequently moved between the private caches by the cache consistency protocol. Therefore, the cost of synchronization is equal to the time needed to read and write a variable which is either cached locally or resides in the remote cache of another core. Thus, in the case of a multicore CPU, the cost of acquiring the lock is virtually equal to the time required to read a variable from the memory shared by the two cores (most often the L2 cache). In the case of an SMP system this cost includes the cost of the memory coherence protocol.
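
For reference, the spin-lock pattern described above can be sketched with a simple xchg-based test-and-set; this is a deliberately simplified stand-in for the kernel's real spinlock primitives.

/* Acquire: atomically exchange 1 into the lock word and retry until
   the previous value was 0 (i.e. the lock was free). */
static void test_spin_lock(volatile int *lock)
{
        int old;

        do {
                old = 1;
                asm volatile ("xchgl %0, %1"
                              : "+r" (old), "+m" (*lock)
                              :
                              : "memory");
        } while (old != 0);
}

/* Release: a plain store is sufficient on IA-32. */
static void test_spin_unlock(volatile int *lock)
{
        asm volatile ("movl $0, %0" : "=m" (*lock) : : "memory");
}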

Experiment implementation

The code in Figures 9 and 10 represents two processes which share a memory region and are additionally synchronized with the help of the signal variable, in such a way that the memory accessed by the measuring (master) thread is always evicted from its local cache between subsequent attempts to access it. The s_area is a pointer to the shared memory region. The example in Figure 9 shows the code measuring the cost of snooping 32 variables between the caches.

In this example OP is defined as

#define OP(x) asm volatile ("movl (%0), %%eax" : : "r"(((char *)var) + (x)*SHIFT) : "eax")

In order to undertake this experiment cleanly we have to consider the cache configurations of the tested CPUs. The Opteron has a 64KB, 2-way associative L1 data cache. Its cache line is 64 bytes (it is able to store eight 64-bit values; thus the cache has 1024 lines or 512 sets).

var = ((long long *)s_area);
signal = (int *)((char *)(s_area) + S_SIZE - sizeof(int));

for ( int i = 0; i < DEPTH; i ++ ) {

        tl0 = RDTSCL();

        /* sequential access, 32 sets */
        OP10(0);
        OP10(1);
        OP10(2);
        OP(30);
        OP(31);

        tl1 = RDTSCL();

        calldata[i] = tl1 - tl0;

        *signal = 0;
        std::cout << "Waiting for slave";
        do {
                asm volatile ("nop");
        } while ( *signal != 1 );
};

Figure 9: Master thread

#define SHIFT 64*2
#define SNOOP_DEPTH 256

var = ((long long *)s_area);
signal = (int *)((char *)s_area + S_SIZE - sizeof(int));

for ( int i = 0; i < DEPTH; i ++ ) {

        std::cout << "Waiting for master";
        do {
                asm volatile ("nop");
        } while ( *signal != 0 );

        /* Snoop variable */
        var = ((long long *)s_area);
        for ( int j = 0; j < SNOOP_DEPTH; j ++ ) {
                *var = i;
                var = (long long *)((char *)var + SHIFT);
        };

        *signal = 1;
};

Figure 10: Slave thread

The SHIFT variable in OP is defined to be 64*2 to place the variables in separate cache sets (although for our experiment it would be enough to place them in separate cache lines).

In each experiment we do 1000 snoops. After the first several iterations the cost of snooping becomes stable. We measured the cost of snooping 32 and 256 memory locations. The results of our measurements are presented in Table 1. The loop overhead is the cost of accessing the same memory location 32 or 256 times.

                      Opteron CMP         Opteron SMP         Xeon CMP              Xeon SMP HT
32 snoops
  Sequential access   1069-1272/32        1295-1424/32        1290/32-4747/32       262/32-330/32
                      = 34.5-39.7         = 40.4-44.5         = 40.31-148           = 8.1-10.3
  Pseudorandom access 1135-1354/32        1416-1843/32        2580/32-5445/32       270/32-345/32
                      = 35.46-42.3        = 44.5-57.5         = 80.6-170            = 8.4-10.7
  Loop overhead       387/32 = 12         341/32 = 10.65      285-487 = 8.9-15      210/32 = 6.56
256 snoops
  Sequential access   11968-13060/256     12934-14189/256     14977/256-21593/256   1943-2303/256
                      = 46-51             = 50.5-57.7         = 58.5-84.34          = 7.58-8.9
  Pseudorandom access 11886-14792/256     1416-1843/32        17610/256-20992/256   2085-2400/256
                      = 46.4-57.8         = 44.5-57.5         = 68.8-82             = 8.1-9.37
  Loop overhead       1385/256 = 5.4      1236/256 = 4.8      930-1245 = 3.6-4.86   1177-1402 = 4.59-5.47

Table 1: Cost of snooping variable

3.8 Cost of L1 migration between the cores

In this experiment we measure the cost of migrating the contents of the L1 cache between the cores. The idea behind this experiment is to approximate the cost of migrating a thread between the cores (note that this is actually a poor approximation; it would be better to have a separate experiment evaluating the thread slowdown due to migration).

Experiment implementation

Similarly to the cache snoop experiment, we have two processes which share a memory region and are synchronized with the help of the signal variable, in such a way that the whole cache is always evicted from the cache of the measuring (master) thread between subsequent attempts to access it. In this experiment we define the OP macro to be:

#define OP(x) asm volatile ("movl (%0), %%eax" : : "r"(((char *)var) + (x)*SHIFT) : "eax")

The SHIFT variable is defined to be equal to the size of a CPU word. Similarly to the cache snoop example we access the whole cache with the help of the macros. The pseudocode of the two threads is presented in Figures 11 and 12.

During this experiment we migrate the cache 1000 times. In order to ensure that a line is actually evicted, the slave thread writes to the memory. The results of our measurements are presented in Table 2. The loop overhead is the cost of accessing the same memory location as many times as required to cover the cache (for example 4096 times in the case of accessing a 4-byte value and the 16KB Xeon cache).

                      Opteron (64KB cache)            Xeon (16KB cache)
                      CMP            SMP              CMP            SMP HT
Sequential access     68836          69950            24308          23025
Loop overhead         34143          34143            18007
Pseudorandom access   80447          81530-83853      35745          21817
Loop overhead         32950          32950            18007          18270
Sequential in a loop  62292          67705            26092
Loop overhead         33049          32931            22283

Table 2: L1 cache transfer cost

3.9 Cost of interprocessor IPC

This experiment measures the cost of sending data between cores in the multicore and SMP cases. Basically, interprocessor communication consists of two parts: sending an interprocessor interrupt (IPI) notifying the receiver about the message, and the actual copying of data between the cores.

Experiment plan

for ( int i = 0; i < DEPTH; i ++ ) {

        tl0 = RDTSCL();

        /*
         * Access cache
         * Opteron: 1024 lines each storing 8 values
         */
        OP1000(0);
        OP1000(1);
        OP1000(2);
        OP1000(3);
        OP1000(4);
        OP1000(5);
        OP1000(6);
        OP1000(7);
        OP100(80);
        OP100(81);

        tl1 = RDTSCL();

        (*calldata)[i] = tl1 - tl0;

        *signal = 0;

        std::cout << "Waiting for the slave";

        do {
                asm volatile ("nop");
        } while ( *signal != 1 );
};

Figure 11: Master thread

for ( int i = 0; i < DEPTH; i ++ ) {

        std::cout << "Waiting for master";
        do {
                asm volatile ("nop");
        } while ( *signal != 0 );

        /* Snoop the cache */
        var = ((test_t *)s_area);
        for ( int j = 0; j < SNOOP_DEPTH; j ++ ) {
                *var = i;
                var += 1;
        };

        *signal = 1;
};

Figure 12: Slave thread

We will measure the cost of a pure IPI in this experiment. It's a reasonable benchmark, which measures the unavoidable cost of inter-processor communication. Later we can arrange higher-level benchmarks to measure the costs of the IPC mechanisms implemented in Linux (pipes, FIFO queues, POSIX message queues, System V IPC, and sockets).

In general, inter-processor IPIs are sent with the send_IPI_mask function. send_IPI_mask can be implemented differently depending on the system architecture and the APIC. In some sense it's fair to start measuring the IPI cost right before invoking the send_IPI_mask function.

On the i386 architecture and on the desktop machines we use every day and have in Flux, send_IPI_mask is implemented in the include/asm-i386/mach-default/mach_ipi.h file by the send_IPI_mask_bitmask function. send_IPI_mask_bitmask is in turn implemented in the arch/i386/kernel/smp.c file.

Experiment implementation

The send_IPI_mask function sends an interprocessor interrupt to a specific interrupt vector. To measure the cost of sending an IPI, we extend the Linux kernel with two IPI vectors. The first vector (smp_km_test_interrupt) receives the IPI on a remote CPU and immediately sends a reply back to the sender. The second vector (smp_km_test_reply_interrupt) waits on the sender's CPU for the reply and marks the end of the IPI's round trip.

The Linux kernel can be extended with an IPI handler in the following way:

1. File include/asm-i386/mach-default/irq_vectors.h: Define a new IPI interrupt vector.

2. File arch/i386/kernel/smpboot.c: Register a new interrupt handler by adding the following line:

   set_intr_gate(KM_TEST_VECTOR, km_ipi_interrupt)

3. File include/asm-i386/hw_irq.h: Add an interrupt handler declaration like km_ipi_interrupt.

4. File include/asm-i386/mach-default/entry_arch.h: Construct an interrupt entry:

   BUILD_INTERRUPT(km_test_interrupt, KM_IPI_VECTOR)

5. File arch/i386/kernel/smp.c: Implement an interrupt handler like smp_km_ipi_reply_interrupt.

Since the cores of a CMP/SMP system are not synchronized, we will measure the round-trip time of the IPI. In the sending function we read the time-stamp counter before sending the IPI through send_IPI_mask. In the remote handler (smp_km_test_interrupt) we measure the time spent in the handler before sending the reply IPI back; later we subtract this time to obtain a cleaner round-trip value. In the reply handler (smp_km_test_reply_interrupt) we read the time-stamp counter again to find the time of completion of the round-trip message.
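
A rough sketch of the sender side follows. It assumes the send_IPI_mask(cpumask, vector) form of the i386 SMP code mentioned above; KM_TEST_VECTOR, km_ipi_t0 and km_send_test_ipi are illustrative names, and the reply-handler prototype may differ from the real BUILD_INTERRUPT-generated one.

unsigned long km_ipi_t0;        /* start of the round trip */

static void km_send_test_ipi(int cpu)
{
        km_ipi_t0 = RDTSCL();
        send_IPI_mask(cpumask_of_cpu(cpu), KM_TEST_VECTOR);
}

/* Runs on the sending CPU when the remote side answers. */
fastcall void smp_km_test_reply_interrupt(struct pt_regs *regs)
{
        unsigned long t1 = RDTSCL();

        ack_APIC_irq();
        printk(KERN_INFO "IPI round trip: %lu cycles\n", t1 - km_ipi_t0);
}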

To make our results cleaner we also measure the time we spend on-chip sending the IPI (i.e. the time needed to invoke the send_IPI_mask function). During the experiment we send 100 inter-processor interrupts. We measured the round-trip time to be about 29520 cycles, out of which 98 cycles were spent in the receiver's body and 375 cycles were spent sending the IPI through the send_IPI_mask function. The remaining 29047 cycles were spent off-chip. The one-way cost was therefore about 14524 cycles.

We also measured the cost of sending an IPI to a hyper-threaded core to be about 4770 cycles, out of which 98 cycles were spent in the receiver's body and 428 cycles were spent sending the IPI through the send_IPI_mask function.

The cost of sending an IPI to the sending core itself turned out to be 6892 cycles, with 375 cycles spent inside the receiver and 98 cycles spent sending.

3.10 Cost of thread migration between the cores (not implemented)

Migration of a thread or process between processors in the SMP case, or between cores in the multicore case, will most probably become an important and frequent operation on asymmetric (heterogeneous) architectures. Process migration consists of migrating the process execution context, the process address space, the code, data and stack segments of the threads migrated along with it, and some kernel bookkeeping.

Migration of a thread most probably assumes that the address space has already been created and requires only migrating the thread execution context.

After migration, the process starts executing on cold TLBs, caches and branch predictors. In contrast to a process, a migrated thread incurs this effect only partially.

3.10.1 Experiment plan

The Linux kernel maintains a queue of running tasks on each CPU. Task is a vague term used in Linux; generally it is a thread with an associated address space, as Linux does not distinguish threads and processes.

Periodically the kernel compares the load of every CPU and tries to re-balance the tasks. Migration of a task is nothing more than moving the task structure between the run queues of the two CPUs. Since all data is communicated through memory, the task can be safely started on the new CPU, and the hardware memory protocols complete the migration transparently.

We will measure the cost of task migration similarly to measuring the cost of a process switch (Section 3.5.1). In other words, we instrument the Linux kernel with a system call initiating migration of the invoking task. We will read the time-stamp counter before invoking the system call and right after it returns on another CPU.

These two measurements happen on different cores, so we have to ensure that the time-stamp counters are synchronized between the cores. The Linux kernel performs time-stamp counter synchronization during the SMP boot (see the function synchronize_tsc_bp in ./arch/i386/kernel/smpboot.c). However, we should study this code and related sources [1] to figure out the degree of synchronization accuracy which can be achieved. Personally, I doubt that the Linux code is able to synchronize time-stamp counters with nanosecond accuracy.

Possessing little knowledge about the task migration code, I suggest the following implementation of the system call which we need to initiate migration.

In order to migrate the task, our system call should invoke the sched_migrate_task function. sched_migrate_task migrates the task to the destination CPU with the help of the migrate_task function. Before calling migrate_task, sched_migrate_task acquires the lock on the task's run queue. Note that migrate_task checks that the task is allowed to migrate to the destination CPU, therefore we have to patch the code to skip this check for us. Alternatively, we can explicitly allow the migration by calling set_cpus_allowed before invoking the sched_migrate_task function.

The actual migration is performed by the migrate_task function. If the task is not in the running state, migrate_task just moves it between the scheduling queues. This is not true in our case: unless we were interrupted in the kernel, we migrate a running task. Therefore, migrate_task relies on assistance from the migration thread.

The migration thread is a high-priority kernel thread running on each CPU, looping in the migration_thread function and waiting for migration requests.

To activate the migration thread, the sched_migrate_task function wakes it up. After that, the migration is completed on the destination CPU by the migration thread. I need to read the SMP wake-up code and eliminate any unnecessary work to achieve cleaner experiment results.

The migration thread accomplishes the migration with the help of the migrate_task function. The migrate_task function checks whether the priority of the current task on the new CPU is lower than the priority of the migrated task, and if so reschedules the tasks. Therefore, by simply running our test task with the highest priority we ensure that it will be scheduled immediately after migration.

After migrating, our task will continue on the new CPU from the point where it entered the kernel, so we will be able to read the time-stamp counter on the new CPU right after returning from the test system call.

3.11 Cost of a procedure invocation

This experiment measures the cost of a procedure invocation. Although a procedure invocation looks like a tiny operation, its frequency can have a significant impact on overall performance. Some architectures use a register stack to reduce the cost of procedure invocation, paying, however, an additional cost during the context switch, which in that case requires saving a larger register stack.

The cost of a procedure invocation can vary depending on the type of code executed. Position-dependent, position-independent, and object-oriented code have different ABIs and therefore incur different procedure invocation overheads.

The simplest procedure invocation saves the complete or partial context of the caller and the return address to the stack, and passes control to the invocation point. Therefore, the invocation cost mostly consists of context saving.

Experiment plan

To undertake this experiment, we read the time-stamp counter before invoking and right after entering a procedure. Passing the time-stamp counter value can be done through a region of static memory.

I don't believe we need to use handcrafted assembler procedure invocation sequences; we can rely on the general code generated by the compiler.

We can undertake measurements on procedures with different numbers and types of parameters. We can also instrument code of different types; Bench++ can serve as an example of object-oriented code.

We can compile the same code as position dependent and position independent and see how this affects the cost of procedure invocation.

Question 1: which flags compile position independent and position dependent code? Is it only -pic, or also -pie, -PIE and -PIC, and what's the difference?

Question 2: should we test position independent code only on libraries, or does it also make sense for executable files?

4 Macrobenchmarks

This section discusses macrobenchmarks exploring typical overheads incurred by operating system kernels.

4.1 Breakdown of the kernel workload

The experiments in this section measure the fraction of CPU time which is spent in kernel mode serving external interrupts and user requests. This time can vary greatly between different workloads. It probably makes sense to study two main kinds of workloads: desktop and server.

4.1.1 Overall time spent by the kernel serving user requests

This first experiment measures the fraction of time which the CPU spends in kernel mode under different workloads. Further experiments identify how the CPU load is distributed across kernel methods and user-level components.

4.1.2 Costs of Linux system calls

This experiment measures the CPU load and frequency of the most commonly used Linux system calls (e.g. file open, process create, etc.). Different system calls incur different overheads. The obtained information can be used to define groups of system calls with common CPU requirements. Further, these groups could be executed on the least powerful CPU core in a heterogeneous system.

4.1.3 Cost of IRQ handling

The following experiments measure the load introduced by interrupts from external devices.

Network I/O overhead

Network I/O can be the most important contributor to the kernel load. Contemporary network cards require frequent, CPU-intensive interrupt handling. For example, [4] shows that three intensively loaded 1Gb network cards fully load the CPU.

Network I/O incurs different overheads during send and receive operations. Incoming traffic usually requires more CPU activity. Therefore, we have to measure these two cases separately.

Disk I/O overhead

Disk I/O is usually less intensive than network operations. The implementation of disk I/O is more straightforward and incurs less overhead. However, as in the networking case, it's worth studying two cases: disk read and disk write.

Graphical subsystem

The graphical subsystem is essential for a desktop workload. Unfortunately, I have too little knowledge to study it deeply; however, some simple experiments can be arranged.

4.2 Warm up periods

Every interruption of a thread by a kernel exception handler introduces pollution into the processor caches, TLBs and branch predictors. The following set of experiments is aimed at measuring the warm-up periods of these structures. The experiments should also show the fraction of the execution time during which a thread runs on warm, or at least close to warm, caches.

In all experiments in this section we will periodically measure the number of corresponding hardware events to determine the degree of warmth. We also measure the number of committed instructions in all experiments to identify the slowdown caused by cold structures.

Some of these experiments incur communication with main memory, for example to serve cache and TLB misses. Therefore, it would be interesting to undertake them with different CPU to memory clock speed ratios.

4.2.1 I-cache warm up

The warm up of the instruction cache.

4.2.2 D-cache and L2 warm up

The warm up of the L1 data cache. If the Intel performance monitoring facilities permit, we also measure the L2 cache.

4.2.3 Trace cache warm up

Warm up period of the trace cache, which is used by the Intel architecture.

4.2.4 TLB warm up

Warm up period of the TLBs used by the Intel architecture. Note that, if I remember correctly, Intel CPUs use several TLBs.

4.2.5 Branch predictors warm up

Warm up of branch predictors.

4.2.6 Performance comparison on cold and warm structures

This experiment measures the performance difference on cold and warm structures. We run the same benchmark on realistically cold and on fully warm structures.

4.3 Impact of kernel and user-level workload interaction on overall system performance

The following two experiments are aimed at answering the question of how much performance degradation is caused by the interaction of kernel and user-level workloads.

In both experiments we will isolate the kernel from a user-level application and vice versa, and compare performance with the traditional non-isolated case. To undertake these experiments we will try to create an execution environment where user-level applications are never interrupted by the kernel. This means that they never encounter any exceptions, and interrupts from external devices are either disabled or redirected to another CPU.

4.3.1 User-level application slowdown

How much faster is a user-level application if it is not interrupted by the kernel?

4.3.2 Kernel slowdown

How much faster are kernel processes if they do not share the CPU with user-level applications? Examples of kernel processes are: external interrupt handlers, the page-fault handler, process management operations, address space management, and maybe some others which I have missed.

4.4 Defining an active working set of the system

The experiments in this section are targeted at identifying the size of the active working set of the system. The active working set is the set of pages or data ranges most frequently accessed by the system. If the rate of accesses to the active working set is noticeably higher compared to the rest of the system, we say that the active working set is distinguishable. If the working set changes relatively slowly, it is stable.

There is a good chance that both the operating system kernel and user-level applications have small, distinguishable and stable active sets. Knowledge of the active working set can influence the design of memory subsystems. For example, the complete active set could be placed in fast internal CPU memory. Presumably this approach can deliver better performance than blindly growing cache sizes.

4.4.1 Active working set of an operating system

Different operating systems have different active sets. The separation of the active set into kernel and user parts is generally artificial: conventional system components can be placed either in kernel space or left at user level, most often as a performance vs. isolation trade-off. Therefore, it is not very wise to pay too much attention to this separation. In both cases, i.e. microkernel and monolithic design, the combined active set should be of approximately the same size. However, it is still worth identifying it and exploring its variation between designs. I suggest studying at least the two most widely used cases: a conventional monolithic design (Linux) and a microkernel virtual machine (Xen).

A system has a different working set under different workloads. Therefore, it makes sense to study at least server and desktop workloads.

Linux

The working set of the Linux operating system will be studied as an example of a conventional monolithic design.

Virtual Machine (Linux on Xen)

The paravirtualized Linux on Xen will be studied as an example of microkernel design.

4.4.2 Active working set of a user-level code

User-level applications share code via libraries. Most probably, similarly to the kernel, user-level applications have a small, identifiable working set.

4.5 Kernel event frequency (Amdahl law study)

In all the microbenchmarks above we study the performance costs of particular kernel events along with their frequency. The frequency gives us an estimation of the impact of each event on system performance.

Developing this study further, we investigate the frequency of kernel events under two types of workloads: server and desktop.

4.5.1 Desktop workload

We assume that the most common example of a desktop environment is a window manager (e.g. KDE or Gnome) under control of an X server (for example, Xorg).

4.5.2 Server workload

A server experiences completely different frequencies of external interrupts under different workloads. Therefore, we have to study several examples of server workloads: network intensive (ttcp), web server (Apache), database, and make.

5 Virtual machine benchmarks

Virtual machine monitors are an example of the microkernel system architecture. Providing excellent means of server consolidation, user isolation, access control and security enforcement, virtual machines will most probably become widespread in server environments in the near future. Furthermore, virtual machines are an excellent example of asymmetric system design: usually only one trusted component of the virtual machine system has the right to access the hardware resources of the entire machine. Therefore, VMM systems are interesting with respect to emerging asymmetric multiprocessor hardware architectures.

Microkernel architectures are still challenging for the operating system community. Therefore, comparison between experiments undertaken in monolithic and microkernel environments is important for future software and hardware optimizations.

The most popular virtual machine monitor, which we can use in our experiments, is Xen [2]. Xen is a small microkernel providing simple memory management, scheduling, and IPC mechanisms. Xen runs a paravirtualized Linux kernel. We will be able to recreate most of the Linux experiments without significant modifications under the Xen virtual machine monitor.

6 Methodology

We undertake all experiments on real hardware, gathering statistics through the hardware performance monitoring interface provided by Intel processors [3]. An obvious advantage of this approach is that execution on real hardware provides more realistic results compared to software emulation and does not restrict the complexity of the software system under investigation.

On the other hand, we undertake measurements operating directly inside the experiment and therefore introduce some error into it. Furthermore, additional error is introduced by the hardware working in the performance gathering mode. Therefore, we have to try to minimize our influence on the running system and provide some estimation of the introduced error.

6.1 Static and dynamic experiments

All the experiments we undertake can be divided into two categories. In the first category we measure the cost of a particular event (for example, the cost of entering kernel mode). Such experiments are easy to arrange: we modify the code to read the time at the beginning and the end of the event. Although the code used to take measurements interferes with the system, we can reduce the rate of measurement to a level where the effect of interference dies out between subsequent measurements.

The second group of experiments studies the evolution of some event during system execution (for example, how the processor cache warms up after a context switch). In order to undertake such experiments we have to periodically read the value of the corresponding hardware performance counter, interfering with execution. This introduces an unavoidable error into the measurements, which we have to estimate to make the experiment data cleaner.

6.1.1 Dynamic measurement of event frequency with respect to the number of committed instructions and time

We want to measure how the frequency of a hardware event evolves in time: for example, how fast a cache warms up after a context switch, or how trace cache misses slow down the execution of a thread.

There are two possible approaches to measuring the frequency of events with the help of hardware performance counters: polling and handling asynchronous interrupts on overflow. The polling approach assumes that we periodically read a hardware performance counter along with the time-stamp counter to find the number of events that have occurred since the last measurement. Handling an asynchronous interrupt on overflow, in contrast, assumes that we configure the performance counter to overflow after a certain number of events. On overflow the CPU raises a nonmaskable interrupt (NMI) and passes control to the corresponding interrupt handler. From the handler body we will be able to read the hardware counters and find the time and the number of committed instructions since the last overflow.

Note: Intel's save-something-to-somewhere method. Can we use it and be happy?

Both approaches have disadvantages. The polling method requires instrumentation of the benchmark code, which is costly if done naively by hand. The overflow method introduces a noticeable error into the measurements due to the cost of handling NMIs. Since we do not have an automatic technique to instrument the benchmark code, we will use the overflow method and try to estimate the error introduced into the measurement by the NMI handler.

An alternative approach: instead of a hardware event counter (e.g. a TLB or cache miss counter), we can let either the committed-instructions counter or the time-stamp counter drive the overflow and read the number of hardware events that happened since the last overflow. The latter approach provides us with a homogeneous time axis.

Therefore, to measure the frequency of TLB misses, we instrument a nonmaskable interrupt handler to read the time-stamp counter, and configure the hardware TLB miss counter to overflow after a given number of events. We have to save the time-stamp counter value between measurements; the most straightforward way to do that is to preallocate memory space for N measurements. I hope that one 4 KB page of memory, capable of storing 500 single-word measurements, will provide enough space for our experiments.
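
A minimal sketch of this scheme follows. The buffer layout and the reload_tlb_miss_counter() helper are our assumptions; the actual MSR programming of the counter is omitted and would follow the Intel manual [3]:

#include <linux/types.h>

#define MAX_SAMPLES 500            /* one 4 KB page of 8-byte samples     */

static u64 samples[MAX_SAMPLES];   /* preallocated sample buffer          */
static unsigned int nsamples;      /* number of samples collected so far  */

/* Read the time-stamp counter. */
static inline u64 read_tsc(void)
{
    u32 lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((u64)hi << 32) | lo;
}

/* Hypothetical helper: re-arm the TLB-miss counter so that it overflows
 * again after 'period' misses.  The MSR writes programming the counter
 * would go here.                                                        */
static void reload_tlb_miss_counter(unsigned int period)
{
    (void)period;
}

/* Called from the NMI path on every counter overflow: record the time of
 * the overflow and re-arm the counter for the next period.              */
static void tlb_miss_overflow_handler(void)
{
    if (nsamples < MAX_SAMPLES)
        samples[nsamples++] = read_tsc();

    reload_tlb_miss_counter(1000);  /* e.g. one sample per 1000 misses */
}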

6.1.2 Error estimation

To estimate the error introduced into the experiments by NMI handling and by the operations required to save measurement values, we will undertake a series of similar experiments, configuring them with and without performance monitoring enabled, and with and without storing experiment data between measurements.

There are four experiment configurations we will run to quantify the error:

Clean run. A run of the benchmark code without any performance monitoring. We only read the time-stamp counter before and after the benchmark to measure the time required to complete the experiment.

Hardware overhead. We run the same benchmark with hardware performance monitoring enabled, but configure the performance counter to overflow after a very large (virtually infinite) number of events. In that way we avoid NMIs and the overheads associated with them, and can measure the overhead introduced solely by the CPU performance monitoring interface.

NMI overhead. In this run we configure the performance counter to overflow exactly as in the real experiment, but we do not save the time-stamp counter values and thus avoid the overhead of managing measurement data. Subtracting the time of the clean run from the time of this run gives the overhead introduced solely by NMI handling; dividing this overhead by the number of NMI handler invocations gives the approximate cost of one invocation.

Measurement data overhead. This run is similar to the hardware overhead run, except that we save experiment data just as in the real experiment. Analogously to the NMI overhead run, this lets us figure out the cost of saving the data of one measurement.

Note that the last two runs (NMI overhead and measurement data overhead) can exhibit a nonlinear dependency between the frequency of events (i.e. NMIs) and their cost. Therefore, to approximate this relation more accurately, we can vary the frequency of NMI handler invocations by configuring the performance counter to overflow more or less frequently. A sketch of the overhead computation is given below.
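
Assuming the overheads compose roughly linearly, the per-invocation costs could be derived from the four run times as in the following sketch (the variable names are ours):

#include <stdio.h>
#include <stdint.h>

/* Derive per-invocation overheads from the four run times (in cycles).
 * t_clean: clean run, t_hw: hardware overhead run,
 * t_nmi: NMI overhead run, t_data: measurement data overhead run.
 * n_nmi is the number of NMI handler invocations,
 * n_save the number of saved measurements.                              */
void derive_overheads(uint64_t t_clean, uint64_t t_hw,
                      uint64_t t_nmi, uint64_t t_data,
                      uint64_t n_nmi, uint64_t n_save)
{
    uint64_t hw_overhead   = t_hw   - t_clean;   /* monitoring mode alone */
    uint64_t nmi_overhead  = t_nmi  - t_clean;   /* all NMI handling      */
    uint64_t save_overhead = t_data - t_hw;      /* all data saving       */

    printf("hardware monitoring overhead : %llu cycles total\n",
           (unsigned long long)hw_overhead);
    printf("cost of one NMI invocation   : %llu cycles\n",
           (unsigned long long)(nmi_overhead / n_nmi));
    printf("cost of saving one sample    : %llu cycles\n",
           (unsigned long long)(save_overhead / n_save));
}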

Code instrumentation

Steps required to implement the experiment code (this I don't know how to do without looking at the kernel source code):

• We have to implement an NMI handler.

• Register the handler.

• Allocate some memory for experiment data, either statically or dynamically. If dynamically, we have to figure out how to use the kernel memory allocation primitives. Most probably we will do some preallocation at system initialization anyway.

• Implement a system call that starts performance monitoring. We will invoke this system call before running our benchmark code.

The system call has to configure the hardware performance monitoring interface.

• Implement a system call that stops performance monitoring and returns the saved data from the kernel.

The system call has to stop the hardware monitoring interface and map the pages with experiment data into the invoking user-level process (a skeleton of both calls is sketched below this list).
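
A deliberately schematic skeleton of these two entry points follows. The function names, the counter-programming helpers, and the decision to copy the samples out rather than remap the pages are our assumptions for illustration, not an existing interface:

#include <linux/types.h>
#include <linux/errno.h>
#include <asm/uaccess.h>

extern u64 samples[];               /* preallocated sample buffer        */
extern unsigned int nsamples;

/* Hypothetical helpers wrapping the MSR programming described in the
 * Intel manual [3]; their implementation is not shown here.            */
void program_counter_for_overflow(unsigned int period);
void stop_performance_counters(void);

/* Entry point invoked before the benchmark: reset the sample buffer and
 * arm the counter so that it overflows (and raises an NMI) periodically. */
long perfmeasure_start(unsigned int period)
{
    nsamples = 0;
    program_counter_for_overflow(period);
    return 0;
}

/* Entry point invoked after the benchmark: stop the counters and copy the
 * collected samples back to the caller-supplied user buffer.             */
long perfmeasure_stop(u64 __user *ubuf, unsigned int max)
{
    unsigned int n = (nsamples < max) ? nsamples : max;

    stop_performance_counters();
    if (copy_to_user(ubuf, samples, n * sizeof(u64)))
        return -EFAULT;
    return n;
}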

Questions:

• Kernel file interface? Do we really need to remap the data to user level, or is it easier to write it directly to a file from the kernel?

6.1.3 Measuring the event cost

We would like to measure the performance slowdown experienced by a process due to some hardware events (e.g. cache misses, TLB misses, or branch mispredictions). To do that we can arrange for a counter overflow at regular time intervals and, from the NMI handler, read the number of committed instructions from the corresponding hardware counter. We can also read the number of hardware events that occurred during the last interval to find out how much they slowed down the execution.
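
Inside the handler, the additional counters can be read with the RDPMC instruction. The sketch below is only an illustration; the counter indices are placeholders whose meaning depends on how the counters were programmed:

#include <stdint.h>

/* Read performance-monitoring counter 'index' with RDPMC.  The mapping of
 * indices to events (committed instructions, cache misses, ...) depends on
 * how the counters were programmed beforehand.                            */
static inline uint64_t read_pmc(uint32_t index)
{
    uint32_t lo, hi;
    asm volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(index));
    return ((uint64_t)hi << 32) | lo;
}

/* Per-interval bookkeeping done from the overflow handler: compute how many
 * instructions were committed and how many events (e.g. cache misses)
 * happened since the previous overflow.  The indices are placeholders.    */
static uint64_t last_insn, last_event;

static void sample_interval(void)
{
    uint64_t insn  = read_pmc(0);   /* placeholder: committed instructions */
    uint64_t event = read_pmc(1);   /* placeholder: e.g. L2 cache misses   */

    uint64_t insn_delta  = insn  - last_insn;
    uint64_t event_delta = event - last_event;

    last_insn  = insn;
    last_event = event;

    /* insn_delta vs. event_delta characterizes the slowdown per interval */
    (void)insn_delta;
    (void)event_delta;
}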

Questions:

• How do we pollute/flush the branch predictor?

• How do we pollute/flush the caches (L1 instruction and data, L2, L3 if present, trace cache)? A simple data-cache pollution loop is sketched after this list.

• We know how to flush the TLB.
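
For the data caches, one common option (an assumption on our part, not something the proposal prescribes) is to walk a buffer larger than the largest cache so that every previously resident line is evicted:

#include <stdlib.h>
#include <stddef.h>

#define POLLUTE_SIZE  (8 * 1024 * 1024)   /* larger than the largest cache */
#define LINE_SIZE     64                  /* typical cache line size       */

/* Touch every cache line of a large buffer, writing so that the lines are
 * brought in and evict whatever the benchmark had cached before.          */
static void pollute_data_caches(void)
{
    static volatile char *buf;

    if (!buf)
        buf = malloc(POLLUTE_SIZE);

    for (size_t i = 0; i < POLLUTE_SIZE; i += LINE_SIZE)
        buf[i]++;
}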

6.2 An error introduced by the execution of the RDTSC instruction

The Intel Manual describes the operation of the RDTSC instruction as follows:

The RDTSC instruction is not serializing or ordered with other instructions. It does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the RDTSC instruction operation is performed.

Taking this into account, the rest of this section may be of limited value. Nevertheless, we will try to gather some meaningful statistics about the behaviour of the RDTSC instruction.

In order to estimate the error introduced into the measurements by the invocation of the RDTSC instruction, we have to measure its own cost. Usually we invoke RDTSC twice: once to read the time-stamp counter at the beginning of the measured code and once at the end. The measurement at the beginning requires storing the value read. RDTSC returns the time-stamp value in the EDX:EAX register pair, so we need to save these registers somewhere. Most experiments will allow us to store EDX:EAX in two other registers, but sometimes we will need to save them to memory.

In order to estimate the delay introduced by RDTSC we undertake the following experiment: we invoke RDTSC two (or more) times, one invocation right after another, saving EDX:EAX between invocations. Furthermore, to create a realistic picture, we will take several measurements, interleaving them with meaningful benchmark-derived code in order to create different processor cache and pipeline conditions between invocations.
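
A back-to-back measurement of this kind might look like the sketch below; the difference between the two readings gives the apparent cost of one RDTSC plus the moves that save its result:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t lo1, hi1, lo2, hi2;

    /* Two RDTSC invocations back to back; the first result is saved into
     * other registers ("mov") before the second invocation, mirroring the
     * way measurements are taken around real benchmark code.              */
    __asm__ __volatile__(
        "rdtsc\n\t"
        "movl %%eax, %0\n\t"
        "movl %%edx, %1\n\t"
        "rdtsc"
        : "=&r"(lo1), "=&r"(hi1), "=a"(lo2), "=d"(hi2));

    uint64_t t1 = ((uint64_t)hi1 << 32) | lo1;
    uint64_t t2 = ((uint64_t)hi2 << 32) | lo2;

    printf("back-to-back RDTSC delta: %llu cycles\n",
           (unsigned long long)(t2 - t1));
    return 0;
}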

We still have to decide how best to interleave RDTSC with code sequences: should we deliberately introduce cache misses, etc.?

6.3 Techniques for providing uninterrupted execution of a process

In some experiments we might want to ensure that the process runs without being interrupted. I see three causes of process interruption: synchronous exceptions (e.g. division by zero, page faults), asynchronous interrupts from external devices (e.g. timer interrupts and other IRQs from network and disk devices), and exhaustion of the scheduling quantum.

Page faults can be prevented by instructing the Linux kernel not to swap out the pages of the process (i.e. by wiring the process pages). Naturally, this technique is impossible if the process memory is larger than the physical memory in the system. To prevent page faults on first reference, we have to touch all process pages and load them into memory before running the process code.

The most straightforward way to prevent descheduling of the process is to run it with the highest priority and an effectively infinite quantum.

Eliminating interrupts from external devices is the most complex part. If we can guarantee that the tested process performs no disk or network I/O, we can configure the APIC to disable external interrupts for the duration of the experiment. On a multiprocessor machine we can instead instruct the APIC to redirect all external interrupts to another CPU. The first two techniques can be approximated from user level, as sketched below.
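
From user level, wiring the pages and claiming the highest scheduling priority can be approximated with standard Linux primitives, as in the sketch below; the APIC reconfiguration would still require kernel support:

#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include <sys/mman.h>

/* Pin the current process: lock all present and future pages in memory and
 * switch to the FIFO real-time scheduling class with the highest priority,
 * so that neither page faults nor quantum expiration interrupt the run.   */
static void pin_process(void)
{
    struct sched_param sp = { 0 };

    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        exit(1);
    }

    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        exit(1);
    }
}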

7 Notes on serializing instructions

See the IA-32 Intel Manual, Volume 3A.

• MOV to a control register (for example, reloading CR3 flushes the TLB)

• INVLPG - invalidates the TLB entry for a given page (both operations are illustrated in the snippet below)
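
Both operations can be issued from kernel code with a couple of lines of inline assembly; the snippet below is only an illustration of the two instructions:

/* Flush the entire (non-global) TLB by reloading CR3, and flush the TLB
 * entry covering a single address with INVLPG.  Both are privileged and
 * serializing, so they must run in kernel mode.                         */
static inline void flush_tlb_all(void)
{
    unsigned long cr3;

    asm volatile("mov %%cr3, %0\n\t"
                 "mov %0, %%cr3"
                 : "=r"(cr3) : : "memory");
}

static inline void flush_tlb_one(void *addr)
{
    asm volatile("invlpg (%0)" : : "r"(addr) : "memory");
}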

8 OProfile Internals

OProfile internals are documented at: http://oprofile.sourceforge.net/doc/internals/index.html

The kernel-related parts of OProfile can be found under the arch/<architecture>/oprofile directory of the Linux source tree; the OProfile driver source code lives in drivers/oprofile. The hardware monitoring events available for each architecture are listed in the events/<architecture> directory of the OProfile source tree.

9 Related work

To be written

References

[1] TSC, power management events on AMD processors, and Open Solaris. http://www.opensolaris.org/jive/thread.jspa?messageID=14402.

[2] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In SOSP '03: Proceedings of the nineteenth ACM symposium on Operating systems principles, pages 164–177, New York, NY, USA, 2003. ACM Press.

[3] Intel Corporation. The IA-32 Intel Architecture Software Developer's Manual, Volume 3B: System Programming Guide, Part 2, January 2006.

[4] Aravind Menon, Jose Renato Santos, Yoshio Turner, G. (John) Janakiraman, and Willy Zwaenepoel. Diagnosing performance overheads in the Xen virtual machine environment. In VEE '05: Proceedings of the 1st ACM/USENIX international conference on Virtual execution environments, pages 13–23, New York, NY, USA, 2005. ACM Press.
