Special Course on Computer Architecture

Hiroki Matsutani and Hideharu Amano

June 3rd, 2011

#7 Simulation of Multi-Processors


Outline: Simulation of Multi-Processors

• Background
  – Recent multi-core and many-core processors
  – Network-on-Chip
• Shared-memory chip multi-processors
  – Architecture
  – Coherence protocols
• Simulation environment: GEMS/Simics
• Exercises [50min]
  – Performance evaluation of parallel applications
  – Performance evaluation of coherence protocols

Multi- and many-core architectures

[Figure: number of PEs (caches are not included), on a scale from 2 to 256, plotted against year from 2004 to 2011. Chips shown: MIT RAW, STI Cell BE, Sun T1, Sun T2, TILERA TILE64, Intel Core, IBM Power7, AMD Opteron, Intel 80-core, ClearSpeed CSX600, ClearSpeed CSX700, picoChip PC102, picoChip PC205, UT TRIPS (OPN), Fujitsu SPARC64, Intel SCC]


Network-on-Chip (NoC)

• Interconnection network to connect many cores

[Figure: 16-Core Tile Architecture, with labels Router and Core]


On-chip router architecture

[Figure: input ports and output ports X+, X-, Y+, Y-, and CORE; five FIFOs; ARBITER (GRANT); 5x5 CROSSBAR]

• Routing, arbitration, and switch traversal are performed in a pipelined manner
  1) Selecting an output channel
  2) Arbitration for the selected output channel
  3) Sending the packet to the output channel


Today’s target architecture

• Chip multi-processors (CMPs)
  – Multiple processors (each has a private L1 cache)
  – Shared L2 cache divided into multiple banks (SNUCA)

[Figure: processor tiles (UltraSPARC, L1 cache (I & D)) and cache tiles (L2 cache bank)]


Today’s target architecture

• Chip multi-processors (CMPs)
  – Multiple processors (each has a private L1 cache)
  – Shared L2 cache divided into multiple banks (SNUCA)
  – Processors and L2 cache banks are connected via a NoC

[Figure: processor tiles with L1 cache (I & D), L2 cache banks, and on-chip routers]


Cache coherence is maintained

• Write-back policy
  – A cache write updates the main memory only when the block is evicted
• Write-invalidate policy
  – A cache write invalidates all copies held by the other sharers

[Figure: processor tiles, cache tiles, and main memory]


Cache coherence is maintained

• A CPU wants to read a block cached at another tile
  – The CPU sends a read request to the memory controller
  – The controller forwards the request to the current owner
  – The owner sends the block to the requestor

[Figure: processor tiles, cache tiles, and main memory]


Cache coherence: MOESI protocol class

• Modified (M)
  – Modified (i.e., dirty)
  – Valid in one cache
• Shared (S)
  – Shared by multiple CPUs
• Exclusive (E)
  – Clean
  – Exists in one cache
• Invalid (I)
• Owned (O)
  – May or may not be clean
  – Exists in multiple caches
  – Owned by one cache
• Owner
  – Responsible for responding to any requests
• MOESI protocols
  – MSI, MOSI, MESI, MOESI, …

The status of each cache block is represented with M/O/E/S/I


Cache coherence protocols

• MSI/MOSI directory protocol
  – E state is not implemented
  – S-to-M transition always updates the main memory
• MESI directory protocol
  – O state is not implemented; dirty sharing is not allowed
  – M-to-S transition always updates the main memory
• MOESI directory protocol
• MOESI token protocol [Martin ISCA03]
  – Each block has as many tokens as there are CPUs
  – A CPU that holds one or more tokens can read the block
  – A CPU that holds all the tokens can modify (write) the block

MSI Protocol: State transition

[Figure: two state diagrams over the states M, S, and I. Processor-side transition labels: CpuRd/---, CpuWr/---, CpuWr/BusWr, CpuRd/BusRd. Bus-side transition labels: BusRd/---, BusWr/---, BusRd/Flush, BusWr/Flush]

S-to-M transitions flush (update) the main memory

Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).

MESI Protocol: State transition

[Figure: two state diagrams over the states M, E, S, and I. Processor-side transition labels: CpuRd/---, CpuWr/---, CpuWr/BusWr, CpuRd/BusRd(!C), CpuRd/BusRd(C), CpuWr/BusUpgr. Bus-side transition labels: BusRd/---, BusWr/---, BusUpgr/---, BusRd/Flush, BusWr/Flush, BusRd/FlushOpt, BusWr/FlushOpt]

M-to-S transitions flush (update) the main memory

Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).

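MOESI Protocol: State transition (1/2)

[Figure: processor-side state diagram over the states M, O, E, S, and I. Transition labels: CpuRd/---, CpuWr/---, CpuWr/BusWr, CpuRd/BusRd(!C), CpuRd/BusRd(C), CpuWr/BusUpgr; at state O: CpuRd/---, CpuWr/BusUpgr]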


MOESI Protocol: State transition (2/2)

[Figure: bus-side state diagram over the states M, O, E, S, and I. Transition labels: BusRd/---, BusWr/---, BusUpgr/---, BusRd/Flush, BusWr/Flush, BusRd/FlushOpt, BusWr/FlushOpt]



Full-system simulation: GEMS/Simics

• Wind River’s Simics
  – Commercial detailed processor simulator
• Univ. of Wisconsin’s GEMS
  – Cache, memory, and network module for Simics

[Figure: processor tiles with L1 cache (I & D), L2 cache banks, on-chip routers, and main memory]



• Today’s simulation target
  – Solaris 9 OS on eight UltraSPARC processors
  – Parallel application examples: Pi and Integer Sort
  – Various coherence protocols are supported

[Figure: processor tiles with L1 cache (I & D), L2 cache banks, on-chip routers, and main memory]



• Simulation target
  – Solaris 9 OS on eight UltraSPARC processors
  – Parallel application example: Integer Sort (IS)

[Figure: processor tiles with L1 cache (I & D), L2 cache banks, on-chip routers, and main memory. Solaris 9 is running on the 8-core UltraSPARC. A parallel program is compiled and then executed with 8 cores]

Parallel application example: OpenMP

    #include <stdio.h>
    #include <omp.h>

    int main() {
        /* Every thread in the parallel region runs this printf. */
        #pragma omp parallel
        printf("hello world from %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
        return 0;
    }

Hello from all threads

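Parallel application example: OpenMP

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000        /* array size; assumed, not given on the slide */
    static double A[N];      /* shared array, zero-initialized */
    static int num = 8;      /* number of threads; assumed */

    int main() {
        int i;
        double start_time, end_time;
        start_time = omp_get_wtime();
        omp_set_num_threads(num);
        /* The loop iterations are divided among the threads. */
        #pragma omp parallel shared(A) private(i)
        {
            #pragma omp for
            for (i = 0; i < N; i++)
                A[i] = A[i] * A[i] - 3.0;
        }
        end_time = omp_get_wtime();
        printf("Elapsed time: %f sec\n", end_time - start_time);
        return 0;
    }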

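Parallel application example: OpenMP

    #include <stdio.h>
    #include <omp.h>

    #define N 100000000      /* number of terms; assumed, not given on the slide */

    int main() {
        int i;
        double s = 0.0;
        double start_time, end_time;
        start_time = omp_get_wtime();
        /* Each thread sums into a private s; the copies are added at the end. */
        #pragma omp parallel private(i) reduction(+:s)
        {
            #pragma omp for
            for (i = 0; i < N; i++)
                /* 4.0 * i keeps the arithmetic in double and avoids int overflow */
                s += (4.0 / (4.0 * i + 1) - 4.0 / (4.0 * i + 3));
        }
        printf("pi = %lf\n", s);
        end_time = omp_get_wtime();
        printf("Elapsed time: %f sec\n", end_time - start_time);
        return 0;
    }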


The first step: How to use the simulator

• Please pick up your account information
• Log in to one of the ICS cluster machines (id = 01…15)
    ssh -X <username>@cluster<id>.ics.keio.ac.jp
• Copy the sample scripts and configuration files
    cp -r ~matutani/comparch2011/files work
    cd work


The first step: How to use the simulator

• Start Simics
    ./start_ideal_memory.sh
• You can use the gray window as a console of the target system (i.e., Solaris 9 on 8-core UltraSPARCs).


The first step: How to use the simulator

• In the target machine, for example, you can check the number of processors as follows.
    bash-2.05# /usr/sbin/psrinfo -v

You will see that there are eight processors


Parallel application: “pi” calculation

• You can execute the "pi" calculation program using eight, four, and one threads.
    bash-2.05# export OMP_NUM_THREADS=8
    bash-2.05# ./pi
    bash-2.05# export OMP_NUM_THREADS=4
    bash-2.05# ./pi
    bash-2.05# export OMP_NUM_THREADS=1
    bash-2.05# ./pi


Parallel application: Integer Sort (IS)

• You can execute the Integer Sort (IS) program using eight, four, and one threads.
    bash-2.05# export OMP_NUM_THREADS=8
    bash-2.05# ./IS
    bash-2.05# export OMP_NUM_THREADS=4
    bash-2.05# ./IS
    bash-2.05# export OMP_NUM_THREADS=1
    bash-2.05# ./IS


Exercise 1

• Report the execution time of “pi” using 1, 4, 8, and 16 threads. Does the execution time decrease linearly as the number of threads increases? Discuss the results.


Coherence protocols: Integer Sort (IS)

• The following scripts automatically run the IS program with different cache coherence protocols.
    ./start_moesi_directory.sh
    ./start_mesi_directory.sh
    ./start_msi_mosi_directory.sh
    ./start_moesi_token.sh
• Each simulation takes five to ten minutes. Do not run more than one script at the same time!


Exercise 2

• Report the execution time of MSI/MOSI directory, MESI directory, MOESI directory, and MOESI token. Discuss the results. For more detail about the protocols, see pages 14 to 19.
