
Multi-threaded processors

Hung-Wei Tseng

OoO SuperScalar Processor

2

• Fetch instructions into the instruction window
• Register renaming to eliminate false dependencies
• Schedule an instruction to the execution stage (issue) whenever all of its data inputs are ready
• Put the instruction in the reorder buffer and commit it once (1) it is not mis-predicted and (2) all instructions prior to it have committed

Simplified OOO pipeline

3

[Figure: simplified OoO pipeline. Instruction Fetch → Instruction Decode → Register renaming logic → Schedule → Execution Units → Data Memory → Reorder Buffer/Commit; a branch predictor steers Instruction Fetch.]

AMD K10 architecture

4

3-issue integer pipeline 3-issue floating point pipeline

AMD FX (Bulldozer)

5

4-issue floating point pipeline

4-issue integer pipeline

Intel Nehalem (1st-gen Core i7)

6

3-issue integer pipeline

3-issue floating point pipeline

3-issue memory pipeline

Intel SkyLake architecture

7


2.1 THE SKYLAKE MICROARCHITECTURE

The Skylake microarchitecture builds on the successes of the Haswell and Broadwell microarchitectures. The basic pipeline functionality of the Skylake microarchitecture is depicted in Figure 2-1.

The Skylake microarchitecture offers the following enhancements:
• Larger internal buffers to enable deeper OOO execution and higher cache bandwidth.
• Improved front end throughput.
• Improved branch predictor.
• Improved divider throughput and latency.
• Lower power consumption.
• Improved SMT performance with Hyper-Threading Technology.
• Balanced floating-point ADD, MUL, FMA throughput and latency.

The microarchitecture supports flexible integration of multiple processor cores with a shared uncore sub-system consisting of a number of components including a ring interconnect to multiple slices of L3 (an off-die L4 is optional), processor graphics, integrated memory controller, interconnect fabrics, etc. A four-core configuration can be supported similar to the arrangement shown in Figure 2-3.

Figure 2-1. CPU Core Pipeline Functionality of the Skylake Microarchitecture

[Figure content: the BPU and 32K L1 instruction cache feed the legacy decode pipeline, the decoded icache (DSB), and the MSROM into the instruction decode queue (IDQ, or micro-op queue); the legacy decode pipeline delivers 5 uops/cycle, the DSB 6 uops/cycle, and the MSROM 4 uops/cycle. Allocate/Rename/Retire/Move Elimination/Zero Idiom feeds a scheduler with ports 0, 1, 5, and 6 (Int ALU, Int MUL, Int Shft, fast/slow LEA, Vec FMA, Vec MUL, Vec Add, Vec ALU, Vec Shft, Vec SHUF, CVT, Divide, and two branch units) plus ports 2 and 3 (LD/STA), port 4 (STD), and port 7 (STA), backed by a 32K L1 data cache and a 256K unified L2 cache.]

4-issue integer pipeline; 4-issue memory pipeline

Dynamic execution with register renaming
• Consider the following dynamic instructions:

1: lw   $t1, 0($a0)
2: lw   $a0, 4($a0)
3: add  $v0, $v0, $t1
4: bne  $a0, $zero, LOOP
5: lw   $t1, 0($a0)
6: lw   $t2, 4($a0)
7: add  $v0, $v0, $t1
8: bne  $t2, $zero, LOOP

Assume a superscalar processor with unlimited issue width and physical registers that can fetch up to 4 instructions per cycle and takes 2 cycles to execute a memory instruction. How many cycles does it take to issue all the instructions?

8

A. 1
B. 2
C. 3
D. 4
E. 5

[Slide animation: cycle #1 issues instructions 1 and 2; cycle #3 issues instructions 3, 4, 5, and 6 once the loads complete; cycle #5 issues instructions 7 and 8; cycles #2 and #4 are wasted.]

Your code looks like this when performing "linked list" traversal

7

[The same schedule, with the wasted cycles highlighted: the pointer-chasing loads serialize the loop.]

Announcement
• CAPE (Course Evaluation)
• Final review tomorrow
• Hung-Wei's office hours
  • Thursday: 10:30a-11:30a
  • Friday: 11:00a-12:00p

9

Outline
• Simultaneous multithreading
• Chip multiprocessor
• Parallel programming

10

Simultaneous Multi-Threading (SMT)

11

Simultaneous Multi-Threading (SMT)
• Fetch instructions from different threads/processes to fill the underutilized parts of the pipeline
• Exploit "thread-level parallelism" (TLP) to solve the problem of insufficient ILP in a single thread
• Keep separate architectural state for each thread
  • PC
  • Register files
  • Reorder buffer
• Create an illusion of multiple processors for OSs
• The rest of the superscalar processor hardware is shared
• Invented by Dean Tullsen
  • Now a professor in UCSD CSE!
  • You may take his CSE 148 in Spring 2015

12

Simplified SMT-OOO pipeline

13

[Figure: simplified SMT-OoO pipeline. Four fetch units (Instruction Fetch: T0, T1, T2, T3) feed a shared Instruction Decode, register renaming logic, scheduler, execution units, and data cache; each thread keeps its own reorder buffer (ROB: T0, T1, T2, T3).]

Simultaneous Multi-Threading (SMT)

• Fetch 2 instructions from each thread/process at each cycle to fill the underutilized parts of the pipeline
• Issue width is still 2; commit width is still 4

T1 1: lw   $t1, 0($a0)
T1 2: lw   $a0, 0($t1)
T2 1: sll  $t0, $a1, 2
T2 2: add  $t1, $a0, $t0
T1 3: addi $a1, $a1, -1
T1 4: bne  $a1, $zero, LOOP
T2 3: lw   $v0, 0($t1)
T2 4: addi $t1, $t1, 4
T2 5: add  $v0, $v0, $t2
T2 6: jr   $ra

[Pipeline diagram: the two threads' instructions interleave through the IF, ID, Ren, Sch, EXE/MEM, and C (commit) stages, filling slots that a single thread would leave empty.]

Can execute 6 instructions before bne is resolved.

14

SMT
• Improves the throughput of execution
  • May increase the latency of a single thread
• Less branch penalty per thread
• Increases hardware utilization
• Simple hardware design: only need to duplicate the PC and register files
• Real cases:
  • Intel Hyper-Threading (supports up to two threads per core)
    • Intel Pentium 4, Intel Atom, Intel Core i7
  • AMD Zen

15

SMT
• How many of the following statements about SMT are correct?
  • SMT makes processors with deep pipelines more tolerant of mis-predicted branches
  • SMT can improve the throughput of a single-threaded application
  • SMT processors can better utilize hardware during cache misses compared with superscalar processors of the same issue width
  • SMT processors can have higher cache miss rates compared with superscalar processors with the same cache sizes when executing the same set of applications

16

A. 0
B. 1
C. 2
D. 3
E. 4

A single thread can be hurt, because it is sharing resources with other threads.

Simultaneous Multithreading
• SMT helps cover the long memory latency problem
• But an SMT processor is still a "superscalar" processor
  • Power consumption / hardware complexity can still be high
  • Think about the Pentium 4

17

A wide-issue processor or multiple narrower-issue processors

18

What can you do within a 21 mm × 21 mm area?

…prevents the instruction fetch mechanism from becoming a bottleneck since the 6-way execution engine requires a much higher instruction fetch bandwidth than the 2-way processors used in the MP architecture.

The on-chip memory hierarchy is similar to the Alpha 21164 — a small, fast level one (L1) cache backed up by a large on-chip level two (L2) cache. The wide issue width requires the L1 cache to support wide instruction fetches from the instruction cache and multiple loads from the data cache during each cycle. The two-way set associative 32 KB L1 data cache is banked eight ways into eight small, single-ported, independent 4 KB cache banks, each of which handles one access every 2 ns processor cycle. However, the additional overhead of the bank control logic and crossbar required to arbitrate between the multiple requests sharing the 8 data cache banks adds another cycle to the latency of the L1 cache, and increases the area by 25%. Therefore, our modeled L1 cache has a hit time of 2 cycles. Backing up the 32 KB L1 caches is a large, unified, 256 KB L2 cache that takes 4 cycles to access. These latencies are simple extensions of the times obtained for the L1 caches of current Alpha microprocessors [4], using a 0.25 µm process technology.

4.2 4 x 2-way Superscalar Multiprocessor Architecture

The MP architecture is made up of four 2-way superscalar processors interconnected by a crossbar that allows the processors to share the L2 cache. On the die, the four processors are arranged in a grid with the L2 cache at one end, as shown in Figure 3. Internally, each of the processors has a register renaming buffer that is much more limited than the one in the 6-way architecture, since each CPU only has an 8-entry instruction buffer. We also quartered the size of the branch prediction mechanisms in the fetch units, to 512 BTB entries and 8 call-return stack entries. After the area adjustments caused by these factors are accounted for, each of the four processors is less than one-fourth the size of the 6-way SS processor, as shown in Table 3. The number of execution units actually increases in the MP because the 6-way processor had three units of each type, while the 4-way MP must have four — one for each CPU. On the other hand, the issue logic becomes dramatically smaller, due to the decrease in instruction buffer ports and the smaller number of entries in each instruction buffer. The scaling factors of these two units balance each other out, leaving the entire processor very close to one-fourth of the size of the 6-way processor.

The on-chip cache hierarchy of the multiprocessor is significantly different from the cache hierarchy of the 6-way superscalar processor. Each of the 4 processors has its own single-banked and single-ported 8 KB instruction and data caches that can both be accessed in a single 2 ns cycle. Since each cache can only be accessed by a single processor with a single load/store unit, no additional overhead is incurred to handle arbitration among independent memory-access units. However, since the four processors now share a single L2 cache, that cache requires an extra cycle of latency during every access to allow time for interprocessor arbitration and crossbar delay. We model this additional L2 delay by penalizing the MP an additional cycle on every L2 cache access, resulting in a 5 cycle L2 hit time.

5 Simulation Methodology

Accurately evaluating the performance of the two microarchitectures requires a way of simulating the environment in which we would expect these architectures to be used in real systems. In this section we describe the simulation environment and the applications used in this study.

5.1 Simulation Environment

We execute the applications in the SimOS simulation environment [18]. SimOS models the CPUs, memory hierarchy and I/O devices…

Figure 2. Floorplan for the six-issue dynamic superscalar microprocessor. [Figure: a 21 mm × 21 mm die containing the external interface, instruction fetch, 32 KB instruction cache, 32 KB data cache, TLB, instruction decode & rename, reorder buffer/instruction queues/out-of-order logic, floating point unit, integer unit, 256 KB on-chip L2 cache, and clocking & pads.]

Figure 3. Floorplan for the four-way single-chip multiprocessor. [Figure: a 21 mm × 21 mm die containing Processors #1-#4, each with an 8K I-cache and an 8K D-cache, an L2 communication crossbar, a 256 KB on-chip L2 cache, the external interface, and clocking & pads.]

A 6-issue superscalar processor:
• 3 integer ALUs
• 3 floating point ALUs
• 3 load/store units


Four 2-issue superscalar processors:
• 4×1 integer ALUs
• 4×1 floating point ALUs
• 4×1 load/store units

You will have more ALUs if you choose this!

Chip multiprocessor (CMP)

19

Die photo of a CMP processor

20

CMP advantages
• How many of the following are advantages of CMP over a traditional superscalar processor?
  • CMP can provide better energy efficiency within the same area
  • CMP can deliver better instruction throughput within the same die area (chip size)
  • CMP can achieve better ILP for each running thread
  • CMP can improve the performance of a single-threaded application without modifying code

21

A. 0
B. 1
C. 2
D. 3
E. 4

CMP vs. SMT
• Assume both application X and application Y have a similar instruction mix, say 60% ALU, 20% load/store, and 20% branches. Consider two processors:

P1: a CMP with a 2-issue pipeline on each core; each core has a private 32KB L1 D-cache

P2: an SMT processor with a 4-issue pipeline and a 64KB L1 D-cache

Which one do you think is better?

22

A. P1
B. P2

Speeding up a single application on multi-threaded processors

23

Parallel programming
• To exploit CMP/SMT parallelism you need to break your computation into multiple "processes" or multiple "threads"
• Processes (in OS/software systems)
  • Separate programs actually running (not sitting idle) on your computer at the same time
  • Each process has its own virtual memory space, and you must explicitly exchange data using inter-process communication APIs (see the fork() sketch after this slide)
• Threads (in OS/software systems)
  • Independent portions of your program that can run in parallel
  • All threads share the same virtual memory space
• We will refer to these collectively as "threads"
  • A typical user system might have 1-8 actively running threads
  • Servers can have more if needed (the sysadmins will hopefully configure it that way)

24
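To make the process model concrete, here is a minimal sketch using the standard POSIX fork()/waitpid() calls; the printed messages are illustrative:

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();                /* create a child process */
    if (pid == 0) {
        /* Child: runs in its own virtual memory space, so changes
           made here are invisible to the parent without explicit IPC. */
        printf("child: pid %d\n", (int)getpid());
        return 0;
    }
    waitpid(pid, NULL, 0);             /* parent waits for the child */
    printf("parent: child %d finished\n", (int)pid);
    return 0;
}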

Create threads/processes
• The only way to improve a single application's performance on CMP/SMT
• You can use fork() to create a child process (CSE120)
• Or you can use pthread or OpenMP to compose multi-threaded programs
• Threads from "the same process" share the same virtual memory address space (i.e., only one page table for all threads)

25

/* Do matrix multiplication */
for (i = 0; i < NUM_OF_THREADS; i++) {
    tids[i] = i;
    pthread_create(&thread[i], NULL, threaded_blockmm, &tids[i]);
}
for (i = 0; i < NUM_OF_THREADS; i++)
    pthread_join(thread[i], NULL);

Spawn a thread

Synchronize: wait for a thread to terminate
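To show how the fragment above fits into a whole program, here is a self-contained sketch. The matrix size, the row-partitioned work split, and this particular threaded_blockmm body are illustrative assumptions, not the course's actual blocked kernel:

#include <pthread.h>
#include <stdio.h>

#define N 512                 /* matrix dimension (assumed) */
#define NUM_OF_THREADS 4

static double A[N][N], B[N][N], C[N][N];
static pthread_t thread[NUM_OF_THREADS];
static int tids[NUM_OF_THREADS];

/* Each thread computes a contiguous band of rows of C = A * B. */
static void *threaded_blockmm(void *arg)
{
    int tid = *(int *)arg;
    int rows = N / NUM_OF_THREADS;
    int start = tid * rows;
    int end = (tid == NUM_OF_THREADS - 1) ? N : start + rows;
    for (int i = start; i < end; i++)
        for (int k = 0; k < N; k++)       /* i-k-j order for cache locality */
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];
    return NULL;
}

int main(void)
{
    for (int i = 0; i < NUM_OF_THREADS; i++) {
        tids[i] = i;
        pthread_create(&thread[i], NULL, threaded_blockmm, &tids[i]);
    }
    for (int i = 0; i < NUM_OF_THREADS; i++)
        pthread_join(thread[i], NULL);
    printf("C[0][0] = %f\n", C[0][0]);
    return 0;
}

Because the threads write disjoint rows of C, no locking is needed; compile with gcc -pthread.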

Supporting the shared memory model
• Provide a single memory space that all processors can share
• All threads within the same program share the same address space
• Threads communicate with each other using shared variables in memory
• Provide the same memory abstraction as single-threaded programming

26

Simple idea...
• Connect all processors and shared memory to a bus
• Processor speed will be slow because all devices on a bus must run at the same speed

27

[Figure: Core 0, Core 1, Core 2, and Core 3 connected over a bus to a shared cache (Shared $).]

Memory hierarchy on CMP
• Each processor has its own local cache

28

[Figure: each of Core 0-Core 3 has its own local cache (Local $); the cores connect over a bus to a shared cache (Shared $).]

Cache on multiprocessors
• Coherency
  • Guarantees that all processors see the same value for a variable/memory address when they need the value at the same time
  • What value should be seen
• Consistency
  • All threads see changes to data in the same order
  • When a memory operation should be done

29

Simple cache coherency protocol
• Snooping protocol
  • Each processor broadcasts / listens to cache misses
• State associated with each block (cacheline):
  • Invalid
    • The data in the current block is invalid
  • Shared
    • The processor can read the data
    • The data may also exist in other processors' caches
  • Exclusive
    • The processor has full permission for the data
    • The processor is the only one that has the up-to-date data
(A small state-machine sketch follows.)

30
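To make the transitions concrete, here is a hedged sketch of the three states as a per-cache-line state machine in C. The event names mirror the transitions on the next slide; this is an illustrative model, not how the hardware is built:

#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } State;
typedef enum {                         /* events seen by one cache */
    PROC_READ_MISS, PROC_WRITE_MISS, PROC_WRITE_REQUEST,
    BUS_READ_MISS, BUS_WRITE_MISS
} Event;

/* Returns the next state; *writeback is set when the up-to-date
   data must be written back to memory. */
State next_state(State s, Event e, int *writeback)
{
    *writeback = 0;
    switch (s) {
    case INVALID:
        if (e == PROC_READ_MISS)  return SHARED;
        if (e == PROC_WRITE_MISS) return EXCLUSIVE;
        return INVALID;                /* bus traffic: stay invalid */
    case SHARED:
        if (e == PROC_WRITE_REQUEST) return EXCLUSIVE;  /* upgrade */
        if (e == BUS_WRITE_MISS)     return INVALID;    /* invalidated */
        return SHARED;                 /* read hits/misses stay shared */
    case EXCLUSIVE:
        if (e == BUS_READ_MISS)  { *writeback = 1; return SHARED; }
        if (e == BUS_WRITE_MISS) { *writeback = 1; return INVALID; }
        return EXCLUSIVE;              /* local read/write hits */
    }
    return s;
}

int main(void)
{
    int wb;
    State s = INVALID;
    s = next_state(s, PROC_WRITE_MISS, &wb);  /* Invalid -> Exclusive */
    s = next_state(s, BUS_READ_MISS, &wb);    /* Exclusive -> Shared, write back */
    printf("state=%d writeback=%d\n", s, wb);
    return 0;
}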

Simple cache coherency protocol

[State diagram, reconstructed as a transition list:]
• Invalid → Shared: read miss (processor)
• Invalid → Exclusive: write miss (processor)
• Invalid → Invalid: read/write miss (bus)
• Shared → Shared: read miss/hit
• Shared → Exclusive: write request (processor)
• Shared → Invalid: write miss (bus)
• Exclusive → Shared: read miss (bus); write back data
• Exclusive → Invalid: write miss (bus); write back data
• Exclusive → Exclusive: write hit

31

Cache coherency practice
• What happens when core 0 modifies 0x1000?

32

[Figure: four cores with local caches over a bus to a shared cache. Before: all four cores hold 0x1000 in the Shared state. Core 0 broadcasts "Write miss 0x1000" on the bus, which invalidates the other copies. After: core 0 holds 0x1000 Exclusive; cores 1, 2, and 3 are Invalid.]

Cache coherency practice
• Then, what happens when core 2 reads 0x1000?

33

[Figure: core 2 broadcasts "Read miss 0x1000" on the bus; core 0 writes back 0x1000 and downgrades to Shared; core 2 fetches the block. After: cores 0 and 2 hold 0x1000 Shared; cores 1 and 3 are Invalid.]

Cache coherency practice
• Now, what happens when core 2 writes 0x1004, which belongs to the same block as 0x1000?

34

[Figure: core 2 broadcasts "Write miss 0x1004" on the bus; every copy of the block is invalidated, because 0x1000 and 0x1004 belong to the same cache block. After: core 2 holds the block Exclusive; cores 0, 1, and 3 are Invalid.]

• Then, if core 0 accesses 0x1000, it will be a miss!

Cache coherency
• Assuming we are running the following code on a CMP with some cache coherency protocol, which output is NOT possible? (a is initialized to 0)

35

thread 1:
    while(1) printf("%d ", a);

thread 2:
    while(1) a++;

A. 0 1 2 3 4 5 6 7 8 9
B. 1 2 5 9 3 6 8 10 12 13
C. 1 1 1 1 1 1 1 1 1 1
D. 1 1 1 1 1 1 1 1 1 100

It's show time!
• Demo!

36

thread 1:
    while(1) printf("%d ", a);

thread 2:
    while(1) a++;
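A self-contained approximation of the demo, compilable with gcc -pthread. The bounded reader loop and the volatile qualifier are additions so the program terminates and the reader actually rereads memory; the slide's version omits both, and the data race on a is the point of the demo:

#include <pthread.h>
#include <stdio.h>

static volatile int a = 0;   /* volatile: force the reader to reread memory */

static void *reader(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10; i++)     /* bounded, unlike the slide's while(1) */
        printf("%d ", a);
    printf("\n");
    return NULL;
}

static void *writer(void *arg)
{
    (void)arg;
    while (1)
        a++;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, reader, NULL);
    pthread_create(&t2, NULL, writer, NULL);
    pthread_join(t1, NULL);          /* returning from main kills the writer */
    return 0;
}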

Cache coherency practice
• Now, what happens when core 2 writes 0x1004, which belongs to the same block as 0x1000?

37

[Figure: core 2 broadcasts "Write miss 0x1004" on the bus; every copy of the block is invalidated, because 0x1000 and 0x1004 belong to the same cache block. After: core 2 holds the block Exclusive; cores 0, 1, and 3 are Invalid.]

• Then, if core 0 accesses 0x1000, it will be a miss!

4C model
• 3Cs:
  • Compulsory, Conflict, Capacity
• Coherency miss:
  • A "block" invalidated because of sharing among processors
  • True sharing
    • Processor A modifies X; processor B also wants to access X
  • False sharing
    • Processor A modifies X; processor B wants to access Y. However, Y is invalidated because X and Y are in the same block! (See the sketch after this list.)

38
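As a concrete illustration of false sharing, here is a minimal sketch; the 64-byte block size and the expected speedup from padding are typical assumptions, not measured results:

#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

/* x and y fall in the same cache block, so each thread's writes keep
   invalidating the block in the other core's cache: false sharing.
   Inserting padding between them, e.g. "char pad[56];" for a 64-byte
   block, typically makes this loop run several times faster. */
static struct { long x; long y; } counters;

static void *bump_x(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERS; i++) counters.x++;
    return NULL;
}

static void *bump_y(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERS; i++) counters.y++;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_x, NULL);
    pthread_create(&t2, NULL, bump_y, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x=%ld y=%ld\n", counters.x, counters.y);
    return 0;
}

Note that the two threads never touch each other's variable; the coherency misses come purely from block-granularity invalidations.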

Hard to debug

39

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* thread 1 runs main's spin loop; thread 2 runs modifyloop. With
   optimizations enabled, the compiler may cache `loop` in a register
   inside the spin loop and never see the other thread's update,
   which is what makes this bug hard to debug. */
int loop;

void* modifyloop(void *x)
{
    sleep(1);
    printf("Please input a number:\n");
    scanf("%d", &loop);
    return NULL;
}

int main()
{
    pthread_t thread;
    loop = 1;
    pthread_create(&thread, NULL, modifyloop, NULL);
    while (loop == 1) {
        continue;
    }
    pthread_join(thread, NULL);
    fprintf(stderr, "User input: %d\n", loop);
    return 0;
}

Performance of multi-threaded programs

• Multi-threaded block algorithm for matrix multiplication

• Demo!

40

Conclusion
• In the past, software engineers and hardware engineers worked on different sides of the ISA abstraction
  • Software engineers had no idea about what happens in processors/hardware
  • Hardware engineers had no sense of the demands of applications
  • This worked fine while we could keep accelerating CPUs, but that is no longer true
• We need new execution & programming models to better utilize these hardware components
• We need innovative computer architectures to address the challenges from process technologies and the demands of applications

47