CS 161 Review for Test 2

TRANSCRIPT

Page 1: CS 161 Review for Test 2

CS 161 Review for Test 2

Instructor: L.N. Bhuyan (www.cs.ucr.edu/~bhuyan)

Adapted from notes by Dave Patterson (http.cs.berkeley.edu/~patterson)

1999 ©UCB

Page 2: CS 161 Review for Test 2

How to Study for Test 2: Chap. 5

° Single-cycle (CPI=1) processor
• Know how to reason about processor organization (datapath, control)
  - e.g., how do you add another instruction? (may require changes to the datapath, the control, or both)
  - How to add multiplexors to the datapath
  - How to design the hardware control unit
° Multicycle (CPI>1) processor
  - Changes to the single-cycle datapath
  - Control design through an FSM
  - How to add a new instruction to the multicycle design?

Page 3: CS 161 Review for Test 2

Putting Together a Datapath for MIPS

[Figure: datapath blocks connected left to right: PC, instruction memory (Imem), register file, ALU, and data memory (Dmem), labeled Steps 1 through 5.]

° Question: Which instruction uses which steps, and what is the execution time?

Page 4: CS 161 Review for Test 2

Datapath Timing: Single-cycle vs. Pipelined

° Suppose the following delays for the major functional units:
• 2 ns for a memory access or ALU operation
• 1 ns for a register file read or write
° Total datapath delay for single-cycle:

Insn     Insn   Reg   ALU   Data    Reg    Total
Type     Fetch  Read  Oper  Access  Write  Time
beq      2ns    1ns   2ns                  5ns
R-form   2ns    1ns   2ns           1ns    6ns
sw       2ns    1ns   2ns   2ns            7ns
lw       2ns    1ns   2ns   2ns     1ns    8ns

° What about the multi-cycle datapath?
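
A quick way to check these totals is to sum the per-stage delays that each instruction class actually uses. The sketch below is a small illustration (not part of the original slides); the stage delays and the stage usage per instruction follow the table above.

    # Stage delays from the slide, in nanoseconds.
    STAGE_NS = {"fetch": 2, "reg_read": 1, "alu": 2, "mem": 2, "reg_write": 1}

    # Which stages each instruction class uses (taken from the table above).
    USES = {
        "beq":    ["fetch", "reg_read", "alu"],
        "R-form": ["fetch", "reg_read", "alu", "reg_write"],
        "sw":     ["fetch", "reg_read", "alu", "mem"],
        "lw":     ["fetch", "reg_read", "alu", "mem", "reg_write"],
    }

    for insn, stages in USES.items():
        print(insn, sum(STAGE_NS[s] for s in stages), "ns")   # 5, 6, 7, 8 ns

    # A single-cycle clock must fit the slowest case (lw, 8 ns), so every
    # instruction effectively takes 8 ns; the multicycle design avoids this.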

Page 5: CS 161 Review for Test 2

Implementing Main Control

[Figure: Main Control block. Input: op (6 bits). Outputs: RegDst, Branch, MemRead, MemtoReg, ALUOp (2 bits), MemWrite, ALUSrc, RegWrite.]

Main Control has one 6-bit input (op) and 9 output bits: 7 one-bit signals plus the 2-bit ALUOp.

To build Main Control as a sum of products:

(1) Construct a minterm for each different instruction (or for R-type as a group); each minterm corresponds to a single instruction (or to all of the R-type instructions), e.g., M_R-format, M_lw

(2) Determine each Main Control output by forming the logical OR of the relevant minterms (instructions), e.g., RegWrite = M_R-format OR M_lw
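
As a concrete illustration of this sum-of-products recipe, here is a sketch in which each minterm decodes one opcode and each output ORs the minterms of the instructions that assert it. The opcode values are the standard MIPS encodings and the signal settings follow the usual single-cycle control table; treat it as a sketch rather than the slides' exact equations.

    # Opcode field (bits 31:26) for the MIPS-lite subset (standard MIPS encodings).
    OPCODES = {"R-format": 0b000000, "lw": 0b100011, "sw": 0b101011, "beq": 0b000100}

    def minterms(op):
        """One minterm per instruction class: true iff op matches that class."""
        return {name: (op == code) for name, code in OPCODES.items()}

    def main_control(op):
        m = minterms(op)
        return {
            "RegDst":   m["R-format"],
            "ALUSrc":   m["lw"] or m["sw"],
            "MemtoReg": m["lw"],
            "RegWrite": m["R-format"] or m["lw"],   # the OR of the relevant minterms
            "MemRead":  m["lw"],
            "MemWrite": m["sw"],
            "Branch":   m["beq"],
            # 2-bit ALUOp: 2 (binary 10) for R-format, 1 (01) for beq, 0 (00) for lw/sw.
            "ALUOp":    2 if m["R-format"] else (1 if m["beq"] else 0),
        }

    print(main_control(0b100011))   # lw: ALUSrc, MemtoReg, RegWrite, MemRead asserted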

Page 6: CS 161 Review for Test 2

Single-Cycle MIPS-lite CPU

[Figure: the complete single-cycle datapath: PC, Imem, register file, sign-extend unit, ALU with its ALU Control (driven by ALUOp and funct bits 5:0), Dmem, the PC+4 and branch-target adders, and the multiplexors selected by RegDst, ALUSrc, MemtoReg, and PCSrc. Main Control decodes op = instruction bits 31:26; instruction fields 25:21, 20:16, 15:11, and 15:0 feed the register file and the sign-extend unit.]

Page 7: CS 161 Review for Test 2

R-format Execution Illustration (step 4)

[Figure: the same single-cycle datapath with the R-format control settings shown: RegDst=1, ALUSrc=0, MemtoReg=1, PCSrc=0; the ALU computes [r1] + [r2].]

Page 8: CS 161 Review for Test 2

Multicycle Datapath (overview)

[Figure: MIPS-lite multicycle datapath: PC, a single memory (holding instructions or data), the Instruction Register, the Memory Data Register, the register file, one ALU, and the temporary registers A, B, and ALUOut.]

• One ALU (no extra adders)
• One memory (no separate Imem and Dmem)
• New temporary registers ("clocked", i.e., they require a clock input)

Page 9: CS 161 Review for Test 2

Cycle 3 Datapath (R-format)

[Figure: the MIPS-lite multicycle datapath in cycle 3 of an R-format instruction: the ALU operands come from the A and B registers, ALU Control decodes the funct field (bits 5:0), and the result is latched into ALUOut.]

ALUOut = A op B

Page 10: CS 161 Review for Test 2

FSM diagram for Multicycle Machine

[Figure: finite-state machine for the multicycle control; after an instruction completes, the machine starts the next instruction back in state 0.]

Cycle 1, state 0 (instruction fetch): MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 1, ALUOp = 0, PCWrite, PCSrc = 0
Cycle 2, state 1: ALUSrcA = 0, ALUSrcB = 3, ALUOp = 0
Cycle 3 dispatches on the opcode:
  state 2 (memory access, lw/sw): ALUSrcA = 1, ALUSrcB = 2, ALUOp = 0
  state 6 (R-format execution): ALUSrcA = 1, ALUSrcB = 0, ALUOp = 2
  state 8 (branch completion, beq): ALUSrcA = 1, ALUSrcB = 0, ALUOp = 1, PCWriteCond, PCSrc = 1
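
Written out as a state table, the part of the FSM shown above looks roughly like the sketch below. This is my own illustration, not the slides' notation; only the states in the figure are included, and the later memory-access and write-back states are omitted.

    # Control outputs asserted in each state (anything not listed is 0 / deasserted).
    STATE_OUTPUTS = {
        0: dict(MemRead=1, ALUSrcA=0, IorD=0, IRWrite=1, ALUSrcB=1, ALUOp=0, PCWrite=1, PCSrc=0),
        1: dict(ALUSrcA=0, ALUSrcB=3, ALUOp=0),
        2: dict(ALUSrcA=1, ALUSrcB=2, ALUOp=0),                          # lw/sw
        6: dict(ALUSrcA=1, ALUSrcB=0, ALUOp=2),                          # R-format
        8: dict(ALUSrcA=1, ALUSrcB=0, ALUOp=1, PCWriteCond=1, PCSrc=1),  # beq
    }

    def next_state(state, opcode):
        """Next-state function for the part of the FSM shown in the figure."""
        if state == 0:
            return 1                              # fetch is always followed by decode
        if state == 1:                            # cycle 3 dispatches on the opcode
            return {"lw": 2, "sw": 2, "R-format": 6, "beq": 8}[opcode]
        return 0                                  # (simplified) back to fetch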

Page 11: CS 161 Review for Test 2

Implementing the FSM controller (C.3)

[Figure: a PLA or ROM implements both the next-state and output functions. Inputs: the instruction register opcode field (Op5..Op0) and the state register (S3..S0). Outputs: the datapath control points (PCWrite, PCWriteCond, IorD, MemtoReg, PCSrc, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, IRWrite, MemRead, MemWrite) plus the next-state bits (NS3..NS0), which feed back into the state register.]
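
One way to picture the ROM version is sketched below: the current state and the opcode together form the ROM address, and each ROM word holds the datapath control points followed by the next-state bits. The bit ordering (state in the high bits) is my own assumption for illustration; the slides only say that a PLA or ROM can implement both functions.

    def rom_address(state, opcode):
        """Concatenate the 4-bit state register and the 6-bit opcode field."""
        assert 0 <= state < 16 and 0 <= opcode < 64
        return (state << 6) | opcode

    # Each of the 2**10 ROM words would hold the datapath control points
    # (PCWrite, PCWriteCond, IorD, ..., MemWrite) followed by NS3..NS0.
    print(bin(rom_address(state=1, opcode=0b100011)))   # state 1 looking at an lw opcode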

Page 12: CS 161 Review for Test 2

Micro-programmed Control (Chap. 5.5)

° In microprogrammed control, the FSM states become microinstructions of a microprogram ("microcode")

• one FSM state=one microinstruction

• usually represent each micro-instruction textually, like an assembly instruction

° The FSM current-state register becomes the microprogram counter (micro-PC)

• normal sequencing: add 1 to micro-PC to get next micro-instruction

• microprogram branch: separate logic determines next microinstruction
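
A rough sketch of this sequencing logic is shown below; it is my own illustration, with placeholder addresses, and the dispatch tables correspond to the [D1]/[D2] entries in the microprogram on the next slide.

    def next_micro_pc(micro_pc, sequencing, opcode=None, dispatch_tables=None):
        """Pick the next microinstruction address.

        sequencing is one of "seq" (fall through), "fetch" (restart at the Fetch
        microinstruction), or "dispatch1"/"dispatch2" (branch through a dispatch
        table indexed by the opcode)."""
        if sequencing == "seq":
            return micro_pc + 1          # normal sequencing: add 1 to the micro-PC
        if sequencing == "fetch":
            return 0                     # assume the Fetch microinstruction is at address 0
        return dispatch_tables[sequencing][opcode]   # microprogram branch: separate logic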

Page 13: CS 161 Review for Test 2

Micro-program for Multi-cycle Machine

Label    ALUop  In1  In2    RegFile  MemOp  MemSrc  PCWrite  Next Instr
Fetch:   Add    PC   4               Rd     PC      ALU
         Add    PC   SE*4   Rd                               [D1]
Mem:     Add    A    SE                                      [D2]
LW:                                  Rd     ALU
         Wr                                                  Fetch
SW:                                  Wr     ALU              Fetch
Rform:   funct  A    B
         Wr                                                  Fetch
BEQ:     Sub    A    B                              Equ      Fetch

D1 = { Mem, Rform, BEQ }    D2 = { LW, SW }

Page 14: CS 161 Review for Test 2

How to Study for Test 2: Chap. 6

° Pipelined processor
• How do the pipelined datapath and control differ from the Chapter 5 architectures?
  - All instructions execute the same 5 cycles
  - Pipeline registers separate the stages of the datapath and control
• Problems for pipelining
  - Pipeline hazards: structural, data, control (how is each solved?)

Page 15: CS 161 Review for Test 2

Pipelining Lessons

° Pipelining doesn't help the latency (execution time) of a single task; it helps the throughput of the entire workload
° Multiple tasks operate simultaneously using different resources
° Potential speedup = number of pipe stages
° What is the real speedup?
° Time to "fill" the pipeline and time to "drain" it reduce the speedup (a worked example follows the figure below)

[Figure: laundry analogy: tasks A, B, C, and D overlapped in 30-minute steps along a time axis running from 6 PM to 9 PM, with task order down the page.]
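
To make the fill/drain point concrete, here is a small calculation (my own sketch, assuming one cycle per stage and no hazards): with k stages and n tasks, the pipeline takes about k + (n - 1) cycles instead of n * k, so the speedup only approaches k when n is large.

    def pipeline_speedup(n_tasks, n_stages):
        """Ideal speedup of an n_stages-deep pipeline over a non-pipelined unit,
        assuming every stage takes one cycle and there are no hazards."""
        unpipelined = n_tasks * n_stages
        pipelined = n_stages + (n_tasks - 1)   # fill time, then one result per cycle
        return unpipelined / pipelined

    print(pipeline_speedup(4, 4))      # 4 laundry loads, 4 stages: about 2.3x
    print(pipeline_speedup(1000, 5))   # a long run of instructions: about 4.98x, near 5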

Page 16: CS 161 Review for Test 2

Space-Time Diagram

° To simplify the pipeline, every instruction takes the same number of steps, called stages
° One clock cycle per stage

[Figure: space-time diagram: successive instructions each pass through IFtch, Dcd, Exec, Mem, WB, offset by one clock cycle, with program flow down the page and time across.]

Page 17: CS 161 Review for Test 2

Problems for Pipelining

° Hazards prevent the next instruction from executing during its designated clock cycle, limiting speedup

• Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)

• Control hazards: conditional branches & other instructions may stall the pipeline delaying later instructions (must check detergent level before washing next load)

• Data hazards: Instruction depends on result of prior instruction still in the pipeline (matching socks in later load)

Page 18: CS 161 Review for Test 2

Control Hazard: Solution 1

° Guess the branch outcome, then back up if wrong: "branch prediction"
• For example, predict not taken
• Impact: 1 clock cycle per branch instruction if right, 2 if wrong (a static scheme is right ~50% of the time); a worked example follows the figure below
• More dynamic scheme: keep a history for each branch instruction (right ~90% of the time)

[Figure: pipeline diagram for add, beq, Load; each instruction flows through IM, Reg, ALU, DM, Reg, one clock cycle apart.]
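
As a quick check on the impact numbers above (the only inputs are the 50% and 90% prediction accuracies quoted on the slide):

    def avg_cycles_per_branch(prediction_accuracy):
        """1 cycle when the prediction is right, 2 when it is wrong."""
        return 1 * prediction_accuracy + 2 * (1 - prediction_accuracy)

    print(avg_cycles_per_branch(0.50))   # static predict-not-taken: 1.5 cycles per branch
    print(avg_cycles_per_branch(0.90))   # simple dynamic history:   1.1 cycles per branch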

Page 19: CS 161 Review for Test 2

Control Hazard: Solution 2

° Redefine branch behavior (the branch takes effect after the next instruction): "delayed branch"
° Impact: 1 clock cycle per branch instruction if an instruction can be found to put in the "delay slot" (~50% of the time)

[Figure: pipeline diagram for add, beq, Misc (the delay-slot instruction), Load; each instruction flows through IM, Reg, ALU, DM, Reg, one clock cycle apart.]

Page 20: CS 161 Review for Test 2

Data Hazard on $1: Illustration

Dependencies backwards in time are hazards:

add $1,$2,$3

sub $4,$1,$3

and $6,$1,$7

or $8,$1,$9

xor $10,$1,$11

[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) for the five instructions above; sub and and read $1 from the register file before add has written it back, so those dependence arrows point backwards in time.]

Page 21: CS 161 Review for Test 2

Data Hazard: Solution

• "Forward" the result from one stage to another (a sketch of the forwarding check follows the figure below)
• "or" is OK if the register file is implemented properly

add $1,$2,$3

sub $4,$1,$3

and $6,$1,$7

or $8,$1,$9

xor $10,$1,$11

[Figure: the same pipeline diagram, with the ALU result of add forwarded directly to the ALU inputs of sub and and instead of waiting for the register write-back.]
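
Below is a sketch of the forwarding check for the first ALU operand, in the spirit of the textbook's EX-hazard condition. The pipeline-register field names are my own shorthand, since the slides do not spell them out.

    def forward_a(ex_mem, mem_wb, id_ex):
        """Choose the first ALU operand: forwarded from EX/MEM, from MEM/WB,
        or read normally from the register file.  Each argument is a dict
        standing in for the corresponding pipeline register."""
        if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == id_ex["Rs"]:
            return "EX/MEM.ALUOut"         # forward the result computed last cycle
        if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == id_ex["Rs"]:
            return "MEM/WB.WriteData"      # forward the value about to be written back
        return "ID/EX.ReadData1"           # no hazard: use the register-file value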

Page 22: CS 161 Review for Test 2

Data Hazard Even with Forwarding

• Must stall the pipeline 1 cycle (insert 1 bubble); a sketch of the stall condition follows the figure below

lw $1, 0($2)

sub $4,$1,$6

and $6,$1,$7

or $8,$1,$9

[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) for the sequence above; lw does not have $1 until the end of its MEM stage, so sub is stalled for one cycle (a bubble is inserted) before the value can be forwarded to its EX stage.]
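
Here is a sketch of the load-use check that triggers this one-cycle stall; as before, the pipeline-register field names are my own shorthand.

    def must_stall(id_ex, if_id):
        """Stall (insert a bubble) when the instruction in EX is a load whose
        destination register is needed by the instruction being decoded."""
        return (id_ex["MemRead"] and
                id_ex["Rt"] in (if_id["Rs"], if_id["Rt"]))

    # Example: lw $1, 0($2) is in EX while sub $4, $1, $6 is in ID.
    print(must_stall({"MemRead": True, "Rt": 1}, {"Rs": 1, "Rt": 6}))   # True -> bubble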

Page 23: CS 161 Review for Test 2

How to Study for Test 2: Chap. 7

° Processor-memory performance gap: a problem for hardware designers and software developers alike
° Memory hierarchy, the goal: create the illusion of a single large, fast memory
• Accesses that hit in the highest level are processed most quickly
• Exploit the principle of locality to obtain a high hit rate
° Caches vs. virtual memory: how are they similar? How are they different?

Page 24: CS 161 Review for Test 2

Memory Hierarchy: Terminology

° Hit Time: time to access the upper level, which consists of
• Time to determine hit/miss + memory access time
° Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
° Note: Hit Time << Miss Penalty
  [Note: "<<" here means "much less than"]

Page 25: CS 161 Review for Test 2

Issues with Direct-Mapped

° If block size > 1, the rightmost bits of the index are really the offset within the indexed block

  ttttttttttttttttt iiiiiiiiii oooo
  tag: to check whether we have the correct block
  index: to select the block
  offset: byte offset within the block

Q: How do Set-Associative and Fully-Associative Designs Look?

Page 26: CS 161 Review for Test 2

Read from cache at offset, return word b

° Address: 000000000000000000 0000000001 0100  (tag field | index field | offset)

[Figure: a direct-mapped cache with 1024 entries (index 0-1023), each holding Valid, Tag, and four data words at byte offsets 0x0-3, 0x4-7, 0x8-b, 0xc-f. Index 1 is valid with tag 0 and data words a, b, c, d; the tag matches, and offset 0100 selects bytes 0x4-7, so word b is returned.]
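
The sketch below shows how the tag/index/offset split falls out of the cache geometry in this example (1024 blocks of 16 bytes, i.e., a 16 KB direct-mapped cache; the field widths are derived from those two numbers).

    NUM_BLOCKS = 1024      # 10-bit index field
    BLOCK_BYTES = 16       # 4-bit byte-offset field

    def split_address(addr):
        offset = addr % BLOCK_BYTES
        index = (addr // BLOCK_BYTES) % NUM_BLOCKS
        tag = addr // (BLOCK_BYTES * NUM_BLOCKS)
        return tag, index, offset

    # The address from the slide: tag 0, index 1, byte offset 4 (word b).
    addr = 0b000000000000000000_0000000001_0100
    print(split_address(addr))    # (0, 1, 4)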

Page 27: CS 161 Review for Test 2

Miss Rate Versus Block Size

[Figure 7.12: miss rate (0% to 40%) versus block size in bytes, for direct-mapped caches with total sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB.]

Page 28: CS 161 Review for Test 2

Compromise: N-way Set Associative Cache

° N-way set associative: N cache blocks for each cache index
• Like having N direct-mapped caches operating in parallel
° Example: 2-way set associative cache
• The cache index selects a "set" of 2 blocks from the cache
• The 2 tags in the set are compared in parallel
• Data is selected based on the tag result (whichever tag matched the address)
• Where is the data written? Decided by the replacement policy: FIFO, LRU, or random
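
A minimal lookup sketch for an N-way set-associative cache (my own illustration; each set is reduced to a list of valid/tag/data tuples):

    def lookup(sets, addr, block_bytes, n_sets):
        """sets[index] is a list of N (valid, tag, data) tuples, one per way."""
        index = (addr // block_bytes) % n_sets
        tag = addr // (block_bytes * n_sets)
        for valid, way_tag, data in sets[index]:   # the N tags compared "in parallel"
            if valid and way_tag == tag:
                return data                        # hit: data selected by the matching tag
        return None                                # miss: pick a victim by FIFO/LRU/random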

Page 29: CS 161 Review for Test 2

Improving Cache Performance

° In general, we want to minimize the average access time:

  Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate

  (recall Hit Time << Miss Penalty)

° Generally, two ways to improve it:
• Reduce the miss rate: larger block size, larger cache, higher associativity
• Reduce the miss penalty: reduce DRAM latency, or add an L2 cache
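
A quick worked example of the formula (the numbers are assumptions for illustration, not from the slides):

    def avg_access_time(hit_time, miss_rate, miss_penalty):
        """Average Access Time = Hit Time * (1 - Miss Rate) + Miss Penalty * Miss Rate."""
        return hit_time * (1 - miss_rate) + miss_penalty * miss_rate

    # Assumed values: 1-cycle hit time, 5% miss rate, 40-cycle miss penalty.
    print(avg_access_time(1, 0.05, 40))   # 0.95 + 2.0 = 2.95 cycles on average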

Page 30: CS 161 Review for Test 2

Virtual Memory has its own terminology

° Each process has its own private "virtual address space" (e.g., 2^32 bytes); the CPU actually generates "virtual addresses"
° Each computer has a "physical address space" (e.g., 128 megabytes of DRAM); also called "real memory"
° Library analogy:
• A virtual address is like the title of a book
• A physical address is the location of the book in the library, as given by its Library of Congress call number

Page 31: CS 161 Review for Test 2

Mapping Virtual to Physical Address

[Figure: address translation with a 1 KB page size. The 32-bit virtual address splits into a virtual page number (bits 31..10) and a page offset (bits 9..0). Translation maps the virtual page number to a physical page number (bits 29..10 of the physical address); the page offset is copied unchanged.]
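
A sketch of that translation for the 1 KB page size shown above (the page-table dictionary is a stand-in for whatever structure actually holds the mapping):

    PAGE_SIZE = 1024                     # 1 KB pages: a 10-bit page offset

    def translate(virtual_addr, page_table):
        vpn = virtual_addr // PAGE_SIZE          # virtual page number (bits 31..10)
        offset = virtual_addr % PAGE_SIZE        # page offset (bits 9..0), unchanged
        ppn = page_table[vpn]                    # look up the physical page number
        return ppn * PAGE_SIZE + offset

    print(hex(translate(0x2C04, {0xB: 0x3F})))   # VPN 0xB maps to PPN 0x3F: 0xfc04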

Page 32: CS 161 Review for Test 2

How to Translate Fast?

° Observation: since there is locality in pages of data, there must be locality in the virtual addresses of those pages!
° Why not create a cache of virtual-to-physical address translations to make translation fast? (smaller is faster)
° For historical reasons, such a "page table cache" is called a Translation Lookaside Buffer, or TLB
° TLB organization is the same as an Icache or Dcache: direct-mapped or set-associative

Page 33: CS 161 Review for Test 2

Access TLB and Cache in Parallel?

° Recall: address translation applies only to the virtual page number, not the page offset
° If the cache index bits of the PA "fit within" the page offset of the VA, then the index is not translated, so the cache block can be read while the TLB is accessed simultaneously
° "Virtually indexed, physically tagged cache" (avoids the aliasing problem)

[Figure: the VA splits into virtual page number | page offset; the PA, as seen by the cache, splits into tag | index | offset, with the index and offset lying entirely within the untranslated page-offset bits.]
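
A small sketch of the "fit within" check (parameter names are mine; the condition amounts to cache size / associativity <= page size):

    def can_index_in_parallel(cache_bytes, associativity, page_bytes):
        """True if the cache index and block-offset bits all lie within the page
        offset, so the cache can be indexed with untranslated address bits."""
        bytes_per_way = cache_bytes // associativity   # spans the index + offset bits
        return bytes_per_way <= page_bytes

    print(can_index_in_parallel(16 * 1024, 4, 4096))   # 4 KB per way, 4 KB pages: True
    print(can_index_in_parallel(16 * 1024, 1, 4096))   # 16 KB direct-mapped: False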