Special Course on Computer Architecture

Hiroki Matsutani and Hideharu Amano

June 3rd, 2011

#7 Simulation of Multi-Processors

Outline: Simulation of Multi-Processors

• Background
  – Recent multi-core and many-core processors
  – Network-on-Chip
• Shared-memory chip multi-processors
  – Architecture
  – Coherence protocols
• Simulation environment: GEMS/Simics
• Exercises [50min]
  – Performance evaluation of parallel applications
  – Performance evaluation of coherence protocols

Multi- and many-core architectures

[Figure: number of PEs (caches are not included), on an axis from 2 to 256, plotted against year, 2004 to 2011. Chips shown: MIT RAW, STI Cell BE, Sun T1, Sun T2, TILERA TILE64, Intel Core, IBM Power7, AMD Opteron, Intel 80-core, ClearSpeed CSX600, ClearSpeed CSX700, picoChip PC102, picoChip PC205, UT TRIPS (OPN), Fujitsu SPARC64, Intel SCC.]

Network-on-Chip (NoC)

• Interconnection network that connects many cores

[Figure: 16-core tile architecture, with the labels "Router" and "Core".]

On-chip router architecture

[Figure: on-chip router with input ports and output ports X+, X-, Y+, Y-, and CORE, five FIFOs, an ARBITER (GRANT signal), and a 5x5 CROSSBAR.]

Routing, arbitration, and switch traversal are performed in a pipelined manner:

1) selecting an output channel
2) arbitration for the selected output channel
3) sending the packet to the output channel
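
The first two pipeline steps can be sketched in C. This is only an illustration under assumptions: the slides do not say which routing algorithm or arbiter the router uses, so dimension-order (XY) routing and a fixed-priority arbiter are assumed here, and the names select_output_port and arbitrate are hypothetical.

```c
#define NUM_PORTS 5

/* Ports of a five-port mesh router, named as on the slide. */
typedef enum { PORT_XPLUS, PORT_XMINUS, PORT_YPLUS, PORT_YMINUS, PORT_CORE } port_t;

/*
 * Step 1 (selecting an output channel), sketched with dimension-order
 * (XY) routing. XY routing is an assumption; the slides do not name
 * the routing algorithm. Coordinates are hypothetical tile positions.
 */
port_t select_output_port(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x > cur_x) return PORT_XPLUS;   /* correct the X offset first */
    if (dst_x < cur_x) return PORT_XMINUS;
    if (dst_y > cur_y) return PORT_YPLUS;   /* then correct the Y offset */
    if (dst_y < cur_y) return PORT_YMINUS;
    return PORT_CORE;                       /* arrived: eject to the local core */
}

/*
 * Step 2 (arbitration for the selected output channel).
 * requests[i] is nonzero if input port i wants this output.
 * Fixed priority (lowest index wins) is assumed for brevity.
 * Returns the granted input port, or -1 if there is no request.
 */
int arbitrate(const int requests[NUM_PORTS])
{
    for (int i = 0; i < NUM_PORTS; i++)
        if (requests[i])
            return i;
    return -1;
}
```

Step 3 (switch traversal) would then move the packet of the granted input port through the crossbar to the selected output port.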

Today’s target architecture

• Chip multi-processors (CMPs)
  – Multiple processors (each has a private L1 cache)
  – Shared L2 cache divided into multiple banks (SNUCA)
  – Processors and L2 cache banks are connected via a NoC

[Figure labels: Processor tile (UltraSPARC, L1 cache (I & D)), Cache tile (L2 cache bank), On-chip router.]

Cache coherence is maintained

• Write-back policy
  – A cache write updates the main memory when the block is evicted
• Write-invalidate policy
  – A cache write invalidates all copies held by the other sharers

[Figure labels: Processor tile, Cache tile, Main Memory.]

Cache coherence is maintained

• A CPU wants to read a block cached at [location shown only in the figure]
  – The CPU sends a read request to the memory controller
  – The controller forwards the request to the current owner
  – The owner sends the block to the requestor

[Figure labels: Processor tile, Cache tile, Main Memory.]

Cache coherence: MOESI protocol class

• Modified (M)
  – Modified (i.e., dirty)
  – Valid in one cache
• Shared (S)
  – Shared by multiple CPUs
• Exclusive (E)
  – Clean
  – Exists in one cache
• Invalid (I)
• Owned (O)
  – May or may not be clean
  – Exists in multiple caches
  – Owned by one cache
• Owner
  – Responsible for responding to any requests
• MOESI protocols
  – MSI, MOSI, MESI, MOESI, …

The status of each cache block is represented with M/O/E/S/I.

Cache coherence protocols

• MSI/MOSI directory protocol
  – E state is not implemented
  – M-to-S transition always updates the main memory
• MESI directory protocol
  – O state is not implemented; dirty sharing is not allowed
  – M-to-S transition always updates the main memory
• MOESI directory protocol
• MOESI token protocol [Martin ISCA03]
  – There are as many tokens as the number of CPUs
  – A CPU that holds one or more tokens can read the block
  – A CPU that holds all tokens can modify (write) the block
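
The token rule can be sketched in a few lines of C. This is a simplified illustration, not the protocol of [Martin ISCA03] in full: it only counts tokens for one block and ignores data transfer, the owner token, and request handling. The type block_tokens_t and the function names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical token bookkeeping of one cache for one block. */
typedef struct {
    int held;   /* tokens this cache currently holds for the block */
    int total;  /* total tokens of the block (number of CPUs on the slide) */
} block_tokens_t;

/* One or more tokens: the block may be read. */
bool can_read(const block_tokens_t *b)
{
    return b->held >= 1;
}

/* All tokens: the block may be modified (written). */
bool can_write(const block_tokens_t *b)
{
    return b->held == b->total;
}

/* Move up to n tokens from src to dst; returns how many were moved. */
int transfer_tokens(block_tokens_t *src, block_tokens_t *dst, int n)
{
    if (n > src->held) n = src->held;
    if (n < 0) n = 0;
    src->held -= n;
    dst->held += n;
    return n;
}
```

With eight tokens, a writer that gives away a single token keeps read permission but loses write permission until it collects all eight again.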

MSI Protocol: State transition

[Figure: two MSI state diagrams over the states M, S, and I, with edges labeled event/action. Processor-side labels: CpuRd/---, CpuWr/---, CpuRd/BusRd, CpuWr/BusWr. Bus-side labels: BusRd/---, BusWr/---, BusRd/Flush, BusWr/Flush.]

M-to-S transitions flush (update) the main memory.

Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).
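
The MSI diagram can also be written as a small transition function. The edge assignment below follows the usual textbook MSI protocol and uses the event and action names that appear on the slide (CpuRd, CpuWr, BusRd, BusWr, Flush); because the diagram itself is not recoverable from this transcript, treat the mapping as an assumption. The type and function names are hypothetical.

```c
typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;
typedef enum { CPU_RD, CPU_WR, BUS_RD, BUS_WR } msi_event_t;
typedef enum { ACT_NONE, ACT_BUS_RD, ACT_BUS_WR, ACT_FLUSH } msi_action_t;

typedef struct {
    msi_state_t  next;    /* state after the event */
    msi_action_t action;  /* bus action generated ("---" is ACT_NONE) */
} msi_result_t;

/* One step of the per-block MSI state machine in a single cache. */
msi_result_t msi_step(msi_state_t state, msi_event_t event)
{
    msi_result_t r = { state, ACT_NONE };   /* default: stay, no action */

    switch (state) {
    case MSI_I:
        if (event == CPU_RD) { r.next = MSI_S; r.action = ACT_BUS_RD; }
        else if (event == CPU_WR) { r.next = MSI_M; r.action = ACT_BUS_WR; }
        break;                              /* bus events are ignored in I */
    case MSI_S:
        if (event == CPU_WR) { r.next = MSI_M; r.action = ACT_BUS_WR; }
        else if (event == BUS_WR) { r.next = MSI_I; }
        break;                              /* CpuRd and BusRd keep S */
    case MSI_M:
        if (event == BUS_RD) { r.next = MSI_S; r.action = ACT_FLUSH; }
        else if (event == BUS_WR) { r.next = MSI_I; r.action = ACT_FLUSH; }
        break;                              /* CpuRd and CpuWr hit in M */
    }
    return r;
}
```

In this sketch the Flush action, which updates the main memory, is generated only on bus-side transitions out of M.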

MESI Protocol: State transition

[Figure: two MESI state diagrams over the states M, E, S, and I, with edges labeled event/action. Processor-side labels: CpuRd/---, CpuWr/---, CpuRd/BusRd(C), CpuRd/BusRd(!C), CpuWr/BusWr, CpuWr/BusUpgr. Bus-side labels: BusRd/---, BusWr/---, BusUpgr/---, BusRd/Flush, BusWr/Flush, BusRd/FlushOpt, BusWr/FlushOpt.]

M-to-S transitions flush (update) the main memory.

Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).

MOESI Protocol: State transition (1/2)

[Figure: processor-side MOESI state diagram over the states M, O, E, S, and I, with edges labeled event/action. Labels: CpuRd/---, CpuWr/---, CpuRd/BusRd(C), CpuRd/BusRd(!C), CpuWr/BusWr, CpuWr/BusUpgr.]

Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).

MOESI Protocol: State transition (2/2)

[Figure: bus-side MOESI state diagram over the states M, O, E, S, and I, with edges labeled event/action. Labels: BusRd/---, BusWr/---, BusUpgr/---, BusRd/Flush, BusWr/Flush, BusRd/FlushOpt, BusWr/FlushOpt.]

Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).

Full-system simulation: GEMS/Simics

• Wind River’s Simics
  – Commercial detailed processor simulator
• Univ. of Wisconsin’s GEMS
  – Cache, memory, and network module for Simics

[Figure labels: Processor tile (UltraSPARC, L1 cache (I & D)), Cache tile (L2 cache bank), On-chip router, Main Memory.]

Full-system simulation: GEMS/Simics

• Today’s simulation target
  – Solaris 9 OS on eight UltraSPARC processors
  – Parallel application examples: Pi and Integer Sort
  – Various coherence protocols are supported

Full-system simulation: GEMS/Simics

• Simulation target
  – Solaris 9 OS on eight UltraSPARC processors
  – Parallel application example: Integer Sort (IS)

[Figure annotations: "Solaris 9 is running on 8-core UltraSPARC", "A parallel program", "Compile", "Execute it with 8-core".]

Parallel application example: OpenMP

    #include <stdio.h>
    #include <omp.h>

    int main() {
        /* Every thread of the parallel region executes the printf. */
        #pragma omp parallel
        printf("hello world from %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
        return 0;
    }

Result: hello from all threads.

Parallel application example: OpenMP

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000      /* array size: assumed, not shown on the slide */
    static double A[N];    /* shared array: declaration assumed */
    static int num = 8;    /* number of threads: assumed */

    int main() {
        int i;
        double start_time, end_time;

        start_time = omp_get_wtime();
        omp_set_num_threads(num);
        #pragma omp parallel shared(A) private(i)
        {
            /* The loop iterations are divided among the threads. */
            #pragma omp for
            for (i = 0; i < N; i++)
                A[i] = A[i] * A[i] - 3.0;
        }
        end_time = omp_get_wtime();
        printf("Elapsed time: %f sec\n", end_time - start_time);
        return 0;
    }

Parallel application example: OpenMP

    #include <stdio.h>
    #include <omp.h>

    #define N 100000000    /* number of loop iterations: assumed, not shown on the slide */

    int main() {
        int i;
        double s = 0.0;
        double start_time, end_time;

        start_time = omp_get_wtime();
        #pragma omp parallel private(i) reduction(+:s)
        {
            /* Each thread sums into a private copy of s; the copies are added at the end. */
            #pragma omp for
            for (i = 0; i < N; i++)
                s += (4.0 / (4.0 * i + 1) - 4.0 / (4.0 * i + 3));  /* 4.0 * i avoids int overflow for large N */
        }
        printf("pi = %lf\n", s);
        end_time = omp_get_wtime();
        printf("Elapsed time: %f sec\n", end_time - start_time);
        return 0;
    }
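
What reduction(+:s) computes can be emulated in plain C without OpenMP: each worker sums its own part of the loop into a private partial sum, and the partial sums are added at the end. The sketch below runs the workers one after another and assumes a contiguous block distribution of the iterations; how OpenMP actually distributes them depends on the schedule and the implementation. The function names are hypothetical.

```c
/* One loop iteration of the slide's series: 4/(4i+1) - 4/(4i+3). */
static double pi_term(long i)
{
    return 4.0 / (4.0 * i + 1) - 4.0 / (4.0 * i + 3);
}

/*
 * Sum n iterations using `workers` private partial sums, the way a
 * "+" reduction combines per-thread results. The workers run
 * sequentially here, so this shows only the arithmetic, not the
 * parallel speedup.
 */
double pi_by_partial_sums(long n, int workers)
{
    double total = 0.0;
    long chunk;

    if (workers < 1)
        workers = 1;
    chunk = (n + workers - 1) / workers;    /* ceiling division */

    for (int w = 0; w < workers; w++) {
        long lo = (long)w * chunk;
        long hi = (lo + chunk < n) ? lo + chunk : n;
        double partial = 0.0;               /* private copy of s */

        for (long i = lo; i < hi; i++)
            partial += pi_term(i);
        total += partial;                   /* combine step of the reduction */
    }
    return total;
}
```

Because floating-point addition is not associative, the result may differ in the last digits between worker counts; the same holds for the OpenMP version with different thread counts.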

The first step: How to use the simulator

• Please pick up your account information

• Log in to one of the ICS cluster machines (id = 01…15)

    ssh -X <username>@cluster<id>.ics.keio.ac.jp

• Copy the sample scripts and configuration files

    cp -r ~matutani/comparch2011/files work
    cd work

The first step: How to use the simulator

• Start Simics

    ./start_ideal_memory.sh

• You can use the gray window as a console of the target system (i.e., Solaris 9 on 8-core UltraSPARCs).

The first step: How to use the simulator

• In the target machine, for example, you can check the number of processors as follows.

    bash-2.05# /usr/sbin/psrinfo -v

You will see that there are eight processors.

Parallel application: “pi” calculation

• You can execute a "pi" calculation program using eight, four, and one threads.

    bash-2.05# export OMP_NUM_THREADS=8
    bash-2.05# ./pi
    bash-2.05# export OMP_NUM_THREADS=4
    bash-2.05# ./pi
    bash-2.05# export OMP_NUM_THREADS=1
    bash-2.05# ./pi

Parallel application: Integer Sort (IS)

• You can execute an Integer Sort (IS) program using eight, four, and one threads.

    bash-2.05# export OMP_NUM_THREADS=8
    bash-2.05# ./IS
    bash-2.05# export OMP_NUM_THREADS=4
    bash-2.05# ./IS
    bash-2.05# export OMP_NUM_THREADS=1
    bash-2.05# ./IS

Exercise 1

• Report the execution time of “pi” using 1, 4, 8, and 16 threads. Does the execution time decrease linearly as the number of threads increases? Discuss the results.
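
One way to discuss the Exercise 1 results is to turn the measured times into speedup and parallel efficiency. The helpers below are a minimal sketch; the function names are hypothetical and the times must come from your own runs. Linear scaling corresponds to an efficiency close to 1.0.

```c
/* Speedup of a parallel run relative to the 1-thread run. */
double speedup(double t_one_thread, double t_parallel)
{
    return t_one_thread / t_parallel;
}

/* Parallel efficiency: speedup divided by the number of threads. */
double efficiency(double t_one_thread, double t_parallel, int threads)
{
    return speedup(t_one_thread, t_parallel) / threads;
}
```

Note that the target system has eight processors, so the 16-thread run uses more threads than processors.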

Coherence protocols: Integer Sort (IS)

• The following scripts automatically run the IS program with different cache coherence protocols.

    ./start_moesi_directory.sh
    ./start_mesi_directory.sh
    ./start_msi_mosi_directory.sh
    ./start_moesi_token.sh

• Each simulation takes five to ten minutes. Do not run more than one script at the same time!

Exercise 2

• Report the execution time of MSI/MOSI directory, MESI directory, MOESI directory, and MOESI token. Discuss the results. For more detail about the protocols, see the cache coherence protocol slides earlier in this lecture.