Special Course on Computer Architecture
Hiroki Matsutani and Hideharu Amano
June 3rd, 2011
#7 Simulation of Multi-Processors
Outline: Simulation of Multi-Processors
• Background
– Recent multi-core and many-core processors
– Network-on-Chip
• Shared-memory chip multi-processors
– Architecture
– Coherence protocols
• Simulation environment: GEMS/Simics
• Exercises [50min]
– Performance evaluation of parallel applications
– Performance evaluation of coherence protocols
Multi- and many-core architectures
[Figure: number of PEs per chip (caches are not included; axis from 2 to 256) versus year, 2004 to 2011. Chips plotted: MIT RAW, STI Cell BE, Sun T1, Sun T2, TILERA TILE64, Intel Core, IBM Power7, AMD Opteron, Intel 80-core, ClearSpeed CSX600, ClearSpeed CSX700, picoChip PC102, picoChip PC205, UT TRIPS (OPN), Fujitsu SPARC64, Intel SCC.]
Network-on-Chip (NoC)
• Interconnection network to connect many cores
[Figure: 16-Core Tile Architecture, with the labels "Router" and "Core".]
On-chip router architecture
[Figure: router with input ports and output ports (X+, X-, Y+, Y-, CORE on each side), five FIFO buffers, an ARBITER that issues a GRANT, and a 5x5 CROSSBAR.]
Routing, arbitration, and switch traversal are performed in a pipelined manner:
1) selecting an output channel
2) arbitration for the selected output channel
3) sending the packet to the output channel
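The first two pipeline steps can be sketched in C as follows. This is only an illustration: the slides do not name the routing algorithm or the arbitration policy, so dimension-order (XY) routing and a round-robin arbiter are assumptions, and all identifiers (`route_xy`, `arbitrate_rr`, the `port` enum) are hypothetical. Step 3 (switch traversal) would simply move the granted packet through the crossbar.

```c
#include <assert.h>

/* The five router ports named on the slide. */
enum port { PORT_XPLUS, PORT_XMINUS, PORT_YPLUS, PORT_YMINUS, PORT_CORE, NUM_PORTS };

/* Step 1: select an output channel.
 * Assumption: dimension-order (XY) routing on a 2D mesh; the X offset is
 * resolved first, then the Y offset. */
enum port route_xy(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x > cur_x) return PORT_XPLUS;
    if (dst_x < cur_x) return PORT_XMINUS;
    if (dst_y > cur_y) return PORT_YPLUS;
    if (dst_y < cur_y) return PORT_YMINUS;
    return PORT_CORE;              /* arrived: eject to the local core */
}

/* Step 2: arbitrate among the input ports that request one output channel.
 * requests[i] != 0 means input port i wants the channel.
 * Assumption: round-robin priority, searching from the port after the
 * previous winner. Returns the granted input port, or -1 if none requests. */
int arbitrate_rr(const int requests[NUM_PORTS], int last_grant)
{
    assert(last_grant >= -1 && last_grant < NUM_PORTS);
    for (int k = 1; k <= NUM_PORTS; k++) {
        int candidate = (last_grant + k) % NUM_PORTS;
        if (requests[candidate])
            return candidate;
    }
    return -1;
}
```

Round-robin is used here only because it is simple and starvation-free; a real router may use a different arbiter.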
Today’s target architecture
• Chip multi-processors (CMPs)
– Multiple processors (each has a private L1 cache)
– Shared L2 cache divided into multiple banks (SNUCA)
– Processors and L2 cache banks are connected via NoC
[Figure labels: Processor tile (UltraSPARC, L1 cache (I & D)); Cache tile (L2 cache bank); On-chip router.]
Cache coherence is maintained
• Write-back policy
– A cache write updates the main memory only when the block is evicted
• Write-invalidate policy
– A cache write invalidates all copies held by the other sharers
[Figure labels: Processor tile; Cache tile; Main Memory.]
Cache coherence is maintained
• A CPU wants to read a block cached at another cache (the current owner)
– The CPU sends a read request to the memory controller
– The controller forwards the request to the current owner
– The owner sends the block to the requestor
[Figure labels: Processor tile; Cache tile; Main Memory.]
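The forwarding sequence above can be sketched as a small directory lookup. This is a simplified illustration, not GEMS code: the `dir_entry` layout, the `NO_OWNER` marker, and the message counts (two messages when memory answers, three when the request is forwarded to an owner) are assumptions made for the example.

```c
#include <assert.h>

#define NO_OWNER (-1)   /* hypothetical marker: only main memory holds the block */

/* Per-block directory state kept at the memory controller (assumed layout). */
struct dir_entry { int owner; };

enum source { FROM_MEMORY, FROM_OWNER };

struct read_reply {
    enum source src;    /* who supplies the data */
    int supplier;       /* tile id of the owner, or NO_OWNER */
    int messages;       /* network messages on the critical path */
};

/* Handle a read request from tile `requestor` for the block described by e. */
struct read_reply handle_read(const struct dir_entry *e, int requestor)
{
    struct read_reply r;
    assert(e != 0 && requestor >= 0 && e->owner != requestor);
    if (e->owner == NO_OWNER) {
        /* request -> controller, data -> requestor */
        r.src = FROM_MEMORY; r.supplier = NO_OWNER; r.messages = 2;
    } else {
        /* request -> controller, forward -> owner, data -> requestor */
        r.src = FROM_OWNER; r.supplier = e->owner; r.messages = 3;
    }
    return r;
}
```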
Cache coherence: MOESI protocol class
Status of each cache block is represented with M/O/E/S/I
• Modified (M)
– Modified (i.e., dirty)
– Valid in one cache
• Shared (S)
– Shared by multiple CPUs
• Exclusive (E)
– Clean
– Exists in one cache
• Invalid (I)
• Owned (O)
– May or may not be clean
– Exists in multiple caches
– Owned by one cache
• Owner
– Responsible for responding to any requests
• MOESI protocols
– MSI, MOSI, MESI, MOESI, …
Cache coherence protocols
• MSI/MOSI directory protocol
– E state is not implemented
– M-to-S transition always updates the main memory
• MESI directory protocol
– O state is not implemented; dirty sharing is not allowed
– M-to-S transition always updates the main memory
• MOESI directory protocol
• MOESI token protocol [Martin ISCA03]
– There are as many tokens per block as there are CPUs
– A CPU that has one or more tokens can read the block
– A CPU that has all tokens can modify (write) the block
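The token rules can be written down directly. The sketch below is a minimal illustration of the counting rule only (no messages, timeouts, or persistent requests); the `block_tokens` struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Tokens one CPU holds for one block; `total` is the fixed token count
 * of that block (equal to the number of CPUs, per the slide). */
struct block_tokens {
    int held;
    int total;
};

/* One or more tokens: the CPU may read the block. */
bool can_read(const struct block_tokens *b)  { return b->held >= 1; }

/* All tokens: the CPU may modify (write) the block. */
bool can_write(const struct block_tokens *b) { return b->held == b->total; }

/* Move n tokens of the same block from one CPU to another.
 * Tokens are never created or destroyed, only transferred. */
void transfer_tokens(struct block_tokens *src, struct block_tokens *dst, int n)
{
    assert(src->total == dst->total);
    assert(n >= 0 && n <= src->held);
    src->held -= n;
    dst->held += n;
}
```

Because a writer must hold every token, no other CPU can hold even one at the same time, which is what rules out a reader coexisting with a writer.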
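MSI Protocol: State transition
[Figure: two state diagrams over the states I, S, and M. Edges are labeled "event / action"; "---" means no action.]
• Processor-initiated edge labels: CpuRd / ---, CpuWr / ---, CpuRd / ---, CpuWr / BusWr, CpuWr / BusWr, CpuRd / BusRd
• Bus-initiated edge labels: BusRd / ---, BusWr / ---, CpuRd / ---, BusRd / Flush, BusWr / Flush, BusWr / ---
M-to-S transitions flush (update) the main memory
Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).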
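As a rough sketch, the MSI transitions can be coded as two small functions, one for processor events and one for snooped bus events. The edge endpoints are not recoverable from the flattened figure, so the mapping below follows the usual textbook form of MSI and should be read as an assumption; the type and function names are hypothetical.

```c
#include <assert.h>

enum state     { ST_I, ST_S, ST_M };
enum cpu_event { CPU_RD, CPU_WR };
enum bus_event { BUS_RD, BUS_WR };
enum action    { ACT_NONE, ACT_BUS_RD, ACT_BUS_WR, ACT_FLUSH };

struct step { enum state next; enum action act; };

/* Processor-initiated transitions (event / bus action). */
struct step on_cpu(enum state s, enum cpu_event e)
{
    struct step r = { s, ACT_NONE };      /* default: hit, stay, no action */
    assert(s == ST_I || s == ST_S || s == ST_M);
    if (s == ST_I) {                      /* miss: fetch the block */
        r.next = (e == CPU_RD) ? ST_S : ST_M;
        r.act  = (e == CPU_RD) ? ACT_BUS_RD : ACT_BUS_WR;
    } else if (s == ST_S && e == CPU_WR) {
        r.next = ST_M;                    /* invalidate the other sharers */
        r.act  = ACT_BUS_WR;
    }
    return r;
}

/* Bus-initiated (snooped) transitions (event / action). */
struct step on_bus(enum state s, enum bus_event e)
{
    struct step r = { s, ACT_NONE };
    assert(s == ST_I || s == ST_S || s == ST_M);
    if (s == ST_M) {                      /* dirty copy: write it back */
        r.act  = ACT_FLUSH;
        r.next = (e == BUS_RD) ? ST_S : ST_I;
    } else if (s == ST_S && e == BUS_WR) {
        r.next = ST_I;                    /* another cache wants to write */
    }
    return r;
}
```

In this sketch the only transitions that produce a Flush are the ones leaving M on a snooped request, which matches the caption about updating main memory.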
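MESI Protocol: State transition
M-to-S transitions flush (update) the main memory
[Figure: two state diagrams over the states I, S, E, and M. Edges are labeled "event / action"; "---" means no action; (C) and (!C) mean that a copy exists or does not exist in another cache.]
• Processor-initiated edge labels: CpuRd / ---, CpuWr / ---, CpuRd / ---, CpuWr / ---, CpuWr / BusWr, CpuRd / BusRd(!C), CpuRd / BusRd(C), CpuWr / BusUpgr, CpuRd / ---
• Bus-initiated edge labels: BusWr / FlushOpt, BusRd / Flush, BusRd / FlushOpt, BusWr / Flush, BusRd / FlushOpt, BusRd / ---, BusWr / ---, BusUpgr / ---
Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).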
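What the E state adds on the processor side can be sketched as below. The edge endpoints are assumed from the usual textbook form of MESI, since the flattened figure keeps only the labels; `copies_exist` stands for the (C) / (!C) condition, and all names are hypothetical.

```c
#include <assert.h>

enum mesi_state  { MESI_I, MESI_S, MESI_E, MESI_M };
enum mesi_action { MESI_NONE, MESI_BUS_RD, MESI_BUS_WR, MESI_BUS_UPGR };

struct mesi_step { enum mesi_state next; enum mesi_action act; };

/* Processor-initiated MESI transitions.
 * is_write:     0 for CpuRd, non-zero for CpuWr
 * copies_exist: non-zero if another cache holds the block (C), else (!C);
 *               only consulted on a read miss. */
struct mesi_step mesi_on_cpu(enum mesi_state s, int is_write, int copies_exist)
{
    struct mesi_step r = { s, MESI_NONE };
    assert(s == MESI_I || s == MESI_S || s == MESI_E || s == MESI_M);
    switch (s) {
    case MESI_I:
        if (is_write) {
            r.next = MESI_M; r.act = MESI_BUS_WR;
        } else {
            r.next = copies_exist ? MESI_S : MESI_E;
            r.act  = MESI_BUS_RD;
        }
        break;
    case MESI_S:                  /* other copies may exist: announce upgrade */
        if (is_write) { r.next = MESI_M; r.act = MESI_BUS_UPGR; }
        break;
    case MESI_E:                  /* sole clean copy: upgrade without bus traffic */
        if (is_write) r.next = MESI_M;
        break;
    case MESI_M:                  /* hits need no action */
        break;
    }
    return r;
}
```

The E-to-M case is the point of the extra state: a block read by only one CPU can later be written without any bus transaction.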
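MOESI Protocol: State transition (1/2)
[Figure: processor-initiated state diagram over the states I, S, E, O, and M. Edges are labeled "event / action"; "---" means no action.]
• Edge labels: CpuRd / ---, CpuWr / ---, CpuRd / ---, CpuWr / ---, CpuWr / BusWr, CpuRd / BusRd(!C), CpuRd / BusRd(C), CpuWr / BusUpgr, CpuRd / ---, CpuRd / ---, CpuWr / BusUpgr
Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).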
MOESI Protocol: State transition (2/2)
[Figure: bus-initiated state diagram over the states I, S, E, O, and M. Edges are labeled "event / action"; "---" means no action.]
• Edge labels: BusRd / Flush, BusRd / Flush, BusWr / FlushOpt, BusRd / FlushOpt, BusWr / Flush, BusRd / FlushOpt, BusRd / ---, BusWr / ---, BusUpgr / ---, BusWr / Flush, BusUpgr / ---
Y. Solihin, "Fundamentals of Parallel Computer Architecture" (2009).
Full-system simulation: GEMS/Simics
• Wind River’s Simics
– Commercial detailed processor simulator
• Univ. of Wisconsin’s GEMS
– Cache, memory, and network module for Simics
[Figure labels: Processor tile (UltraSPARC, L1 cache (I & D)); Cache tile (L2 cache bank); On-chip router; Main Memory.]
Full-system simulation: GEMS/Simics
• Today’s simulation target
– Solaris 9 OS on eight UltraSPARC processors
– Parallel application examples: Pi and Integer Sort
– Various coherence protocols are supported
• Simulation target
– Solaris 9 OS on eight UltraSPARC processors
– Parallel application example: Integer Sort (IS)
[Figure: Solaris 9 is running on the 8-core UltraSPARC. Workflow shown: a parallel program → compile → execute it with 8 cores.]
Parallel application example: OpenMP
#include <stdio.h>
#include <omp.h>

int main() {
    /* Hello from all threads: each thread in the team runs the printf once. */
    #pragma omp parallel
    printf("hello world from %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
}
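Parallel application example: OpenMP
#include <stdio.h>
#include <omp.h>

#define N 1000000           /* array size: assumed, not given on the slide */
static double A[N];         /* array shared by all threads */

int main() {
    int i;
    int num = 8;            /* thread count: assumed (the target has eight cores) */
    double start_time, end_time;

    start_time = omp_get_wtime();
    omp_set_num_threads(num);
    #pragma omp parallel shared(A) private(i)
    {
        /* The loop iterations are divided among the threads. */
        #pragma omp for
        for (i = 0; i < N; i++)
            A[i] = A[i] * A[i] - 3.0;
    }
    end_time = omp_get_wtime();
    printf("Elapsed time: %f sec\n", end_time - start_time);
    return 0;
}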
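Parallel application example: OpenMP
#include <stdio.h>
#include <omp.h>

#define N 1000000           /* number of series terms: assumed, not given on the slide */

int main() {
    int i;
    double s = 0.0;
    double start_time, end_time;

    start_time = omp_get_wtime();
    /* Each thread accumulates a private partial sum; reduction(+:s) adds them up. */
    #pragma omp parallel private(i) reduction(+:s)
    {
        #pragma omp for
        for (i = 0; i < N; i++)
            s += (4.0 / (4 * i + 1) - 4.0 / (4 * i + 3));
    }
    printf("pi = %lf\n", s);
    end_time = omp_get_wtime();
    printf("Elapsed time: %f sec\n", end_time - start_time);
    return 0;
}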
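The loop body above pairs up consecutive terms of the Leibniz series for π. The identity and the truncation bound below are standard results, written out here as a reading aid; $s_N$ denotes the value of `s` after `N` iterations (a symbol introduced for this note).

```latex
\pi \;=\; 4\sum_{k=0}^{\infty}\frac{(-1)^k}{2k+1}
    \;=\; \sum_{i=0}^{\infty}\left(\frac{4}{4i+1}-\frac{4}{4i+3}\right),
\qquad
s_N \;=\; \sum_{i=0}^{N-1}\left(\frac{4}{4i+1}-\frac{4}{4i+3}\right),
\qquad
0 \;<\; \pi - s_N \;<\; \frac{4}{4N+1}.
```

The bound follows from the alternating-series estimate: the error is smaller than the first omitted term. The series converges slowly (the error shrinks roughly like 1/N), which is why the program needs many iterations and makes a convenient parallel workload.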
The first step: How to use the simulator
• Please pick up your account information
• Log in to one of the ICS cluster machines (id = 01…15)
ssh -X <username>@cluster<id>.ics.keio.ac.jp
• Copy sample scripts and configuration files
cp -r ~matutani/comparch2011/files work
cd work
• Start Simics
./start_ideal_memory.sh
• You can use the gray window as a console of the target system (i.e., Solaris 9 on 8-core UltraSPARCs).
• In the target machine, for example, you can check the number of processors as follows.
bash-2.05# /usr/sbin/psrinfo -v
You will see that there are eight processors.
Parallel application: “pi” calculation
• You can execute a "pi" calculation program using eight, four, and one threads.
bash-2.05# export OMP_NUM_THREADS=8
bash-2.05# ./pi
bash-2.05# export OMP_NUM_THREADS=4
bash-2.05# ./pi
bash-2.05# export OMP_NUM_THREADS=1
bash-2.05# ./pi
Parallel application: Integer Sort (IS)
• You can execute an Integer Sort (IS) program using eight, four, and one threads.
bash-2.05# export OMP_NUM_THREADS=8
bash-2.05# ./IS
bash-2.05# export OMP_NUM_THREADS=4
bash-2.05# ./IS
bash-2.05# export OMP_NUM_THREADS=1
bash-2.05# ./IS
Exercise 1
• Report the execution time of “pi” using 1, 4, 8, and 16 threads. Does the execution time decrease linearly as the number of threads increases? Discuss the results.
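One way to organize the discussion is to turn the measured times into speedup and parallel efficiency, and to compare them with Amdahl's law. The helpers below are a small sketch for that purpose; the function names are hypothetical, and the inputs are whatever times you measure (no measured values are assumed here).

```c
#include <assert.h>

/* Speedup of an n-thread run over the 1-thread run: T(1) / T(n). */
double speedup(double t1, double tn)
{
    assert(t1 > 0.0 && tn > 0.0);
    return t1 / tn;
}

/* Parallel efficiency: speedup divided by the thread count.
 * A value of 1.0 means the execution time decreased linearly. */
double efficiency(double t1, double tn, int n)
{
    assert(n > 0);
    return speedup(t1, tn) / n;
}

/* Amdahl's law: upper bound on speedup with n threads when a fraction p
 * (0 <= p <= 1) of the single-thread execution time can be parallelized. */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n > 0);
    return 1.0 / ((1.0 - p) + p / n);
}
```

Note that the target system has eight processors, so the 16-thread run cannot give each thread its own core; an efficiency well below 1.0 is expected there.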
Coherence protocols: Integer Sort (IS)
• The following scripts automatically run the IS program with different cache coherence protocols.
./start_moesi_directory.sh
./start_mesi_directory.sh
./start_msi_mosi_directory.sh
./start_moesi_token.sh
• Each simulation takes five to ten minutes. Do not run more than one script at the same time!
Exercise 2
• Report the execution time of the MSI/MOSI directory, MESI directory, MOESI directory, and MOESI token protocols. Discuss the results. For more detail about the protocols, see the coherence protocol slides (pages 14 to 19).