UC Regents Spring 2005 © UCB CS 152 L26: Synchronization 2005-5-3 John Lazzaro (www.cs.berkeley.edu/~lazzaro) CS 152 Computer Architecture and Engineering Lecture 27 Mid-Term II Review www-inst.eecs.berkeley.edu/~cs152/ TAs: Ted Hong and David Marquardt


Page 1: CS 152 Computer Architecture and Engineering Lecture 27 Mid-Term …cs152/sp05/lecnotes/lec16-1... · 2005-05-03 · Thursday 5/5: Midterm II, 6 PM to 9 PM, 320 Soda. Tuesday 5/10:

UC Regents Spring 2005 © UCB. CS 152 L26: Synchronization

2005-5-3 John Lazzaro

(www.cs.berkeley.edu/~lazzaro)

CS 152 Computer Architecture and Engineering

Lecture 27 – Mid-Term II Review

www-inst.eecs.berkeley.edu/~cs152/

TAs: Ted Hong and David Marquardt

Page 2:

CS 152: What’s left ...

Today: Mid-term Review, HKN ...

Thursday 5/5: Midterm II, 6 PM to 9 PM, 320 Soda.

Tuesday 5/10: Final presentations.

This time, more of an overview style ...

No class on Thursday.

Deadline to bring up grading issues: Tues 5/10 @ 5 PM. Contact John at lazzaro@eecs

Peer Review: For final project. Please send by Friday at 5 PM.

No electronic devices, no notes ...

Page 3:

2005-3-31 John Lazzaro

(www.cs.berkeley.edu/~lazzaro)

CS 152 Computer Architecture and Engineering

Lecture 19 – Error Correcting Codes

www-inst.eecs.berkeley.edu/~cs152/

TAs: Ted Hong and David Marquardt

Page 4:

Understand how Hamming Codes work

Cosmic ray hit D₁. But how do we know that?

Bit layout, positions 7 down to 1: D₃ D₂ D₁ P₂ D₀ P₁ P₀

We write: 0 1 1 0 0 1 1

Later, we read: 0 1 0 0 0 1 1

On readout we compute:
P₀ xor D₃ xor D₁ xor D₀ = 1 xor 0 xor 0 xor 0 = 1
P₁ xor D₃ xor D₂ xor D₀ = 1 xor 0 xor 1 xor 0 = 0
P₂ xor D₃ xor D₂ xor D₁ = 0 xor 0 xor 1 xor 0 = 1

P₂P₁P₀ = b101 = 5

What does “5” mean? The position of the flipped bit! To repair, just flip it back ...

Positions: D₃=7, D₂=6, D₁=5, P₂=4, D₀=3, P₁=2, P₀=1

Note: we number the least significant bit with 1, not 0! 0 is reserved for “no errors”.
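The slide's three check equations can be run directly; a minimal Python sketch using the slide's bit layout (the function and variable names are mine):

```python
# Hamming(7,4) sketch matching the slide's layout:
# position: 7  6  5  4  3  2  1
# bit:      D3 D2 D1 P2 D0 P1 P0
# The syndrome P2'P1'P0' is the 1-based position of a single flipped bit.

def syndrome(word):
    """word: dict mapping position (1..7) -> bit."""
    p0 = word[1] ^ word[7] ^ word[5] ^ word[3]  # P0 check (odd positions)
    p1 = word[2] ^ word[7] ^ word[6] ^ word[3]  # P1 check (positions 2,3,6,7)
    p2 = word[4] ^ word[7] ^ word[6] ^ word[5]  # P2 check (positions 4,5,6,7)
    return (p2 << 2) | (p1 << 1) | p0           # 0 means "no errors"

def repair(word):
    s = syndrome(word)
    if s:
        word[s] ^= 1   # flip the bit back
    return word

# Slide example: we write 0 1 1 0 0 1 1, then a cosmic ray flips D1 (position 5).
written = {7: 0, 6: 1, 5: 1, 4: 0, 3: 0, 2: 1, 1: 1}
read = dict(written)
read[5] ^= 1
s = syndrome(read)
print(s)                          # 5: the position of the flipped bit
print(repair(read) == written)    # True
```

The syndrome doubles as the repair address, which is why position 0 must be reserved for "no errors".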

Page 5:

Understand Parity Code math ...

Simple case: Two 1KB blocks of data (A and B)

Create a third block, C:

C = A xor B (do xor on each bit of block)

Read all three blocks. If A or B is not available but C is, regenerate A or B:

A = C xor B B = C xor A

The math is easy: the trick is system design! Examples: RAID, voice-over-IP parity FEC.

“Parity codes”
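The A = C xor B recovery rule above is one xor away in code; a toy sketch using 16-byte stand-ins for the 1KB blocks:

```python
# Parity-code sketch: C = A xor B, done bytewise, so either data block
# can be regenerated from the other two (the RAID idea in miniature).

def xor_blocks(x, y):
    return bytes(a ^ b for a, b in zip(x, y))

A = bytes(range(16))               # stand-in for a 1KB data block
B = bytes(reversed(range(16)))     # stand-in for the second block
C = xor_blocks(A, B)               # parity block

# The disk holding A returns an error code; regenerate A from C and B:
assert xor_blocks(C, B) == A
# Symmetrically, B can be rebuilt from C and A:
assert xor_blocks(C, A) == B
```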

Page 6:

Understand Parity Code system design

The disk will tell you “this block does not exist” or “the disk is dead”, by returning an error code when you do a read.

Often, applications number packets as they send them, by adding a “sequence number” to packet header. Receivers detect a “break” in the number sequence ...

If we know this will happen in advance, what can we do, at the OS or application level?

Page 7:

Understand checksum “big picture” ...

Can checksums detect every possible error?

Answer: No -- for a 16-bit checksum, there are many possible packets that have the same checksum. If you are unlucky enough to have your transmission errors convert a block into another block with the same checksum value, you will not detect the error!
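A tiny demonstration of that pigeonhole argument, using a simple sum-mod-2^16 checksum (the slide does not specify a checksum algorithm, so this one is an assumption for illustration):

```python
# A 16-bit checksum maps many possible packets onto the same value,
# so equal checksums cannot prove equal data.

def checksum16(data):
    return sum(data) % (1 << 16)   # toy checksum: byte sum mod 2^16

p1 = b"\x01\x02\x03"
p2 = b"\x03\x02\x01"   # a different packet with the same byte sum

assert p1 != p2
assert checksum16(p1) == checksum16(p2)   # the error would go undetected
```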

Page 8:

2005-4-5 John Lazzaro

(www.cs.berkeley.edu/~lazzaro)

CS 152 Computer Architecture and Engineering

Lecture 20 – Advanced Processors I

www-inst.eecs.berkeley.edu/~cs152/

TAs: Ted Hong and David Marquardt

Page 9:

Understand superpipeline performance

Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
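Plugging numbers into the equation makes the units visible; a one-liner sketch (the workload and clock values below are made-up assumptions, not from the slide):

```python
# The iron law as arithmetic: CPU time is the product of instruction
# count, CPI, and cycle time.

insts = 1_000_000      # instructions / program (assumed workload)
cpi = 1.4              # cycles / instruction (assumed)
clock_hz = 500e6       # cycles / second (assumed 500 MHz clock)

seconds = insts * cpi / clock_hz
print(seconds)         # 0.0028
```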

1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001

Fig. 1. Process SEM cross section.

The process threshold voltage was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the threshold-voltage versus poly-length dependence, and source-to-body bias is used to electrically limit transistor leakage in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.

III. ARCHITECTURE

The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance of a pipeline deeper than earlier high-performance ARM designs [2], [3].

A. Pipeline Organization

To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in Thumb mode an extra pipe stage is inserted after the last fetch stage to convert Thumb instructions into ARM instructions. Since Thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing Thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.

Fig. 2. Microprocessor pipeline organization.

The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.

Decoupled Instruction Fetch. A two-instruction deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.

Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.

One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re- ...

Q. Could adding pipeline stages reduce CPI for an application? (ARM XScale: 8 stages)

A. Yes, due to these problems:

CPI Problem: Taken branches cause longer stalls. Possible Solution: Branch prediction, loop unrolling.
CPI Problem: Cache misses take more clock cycles. Possible Solution: Larger caches, add prefetch opcodes to ISA.

Page 10:

Pipeline diagram: instructions I1-I6 enter IF (Fetch), ID (Decode), EX (ALU), MEM, and WB at times t1-t8; the PC (updated by +0x4) addresses the Instr Mem, and IR registers separate the stages.

The EX stage computes if the branch is taken. If we predicted incorrectly, the instructions fetched behind the branch MUST NOT complete!

We update the PC based on the outputs of the branch predictor. If it is perfect, the pipe stays full!

Dynamic Predictors: a cache of branch history.

Understand branch prediction in-depth

The Branch Predictor sits beside the I-Cache and makes predictions: A control instr? Taken or Not Taken? The PC a branch “targets”?

Page 11:

“In-Depth” means down to this level ...

D Q D Q

Prediction for next branch (1 = take, 0 = not take)

We do not change the prediction the first time it is incorrect. Why?

Was last prediction correct? (1 = yes, 0 = no)

loop: SUBI R4,R4,-1
      BNE  R4,R0,loop

This branch taken 10 times, then not taken once (end of loop). The next time we enter the loop, we would like to predict “take” the first time through.

ADDI R4,R0,11
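A sketch of the two-flip-flop predictor described above, run on the slide's loop. The update policy is my reading of the slide: keep a prediction bit plus a "last prediction was correct" bit, and only flip the prediction after two wrong predictions in a row.

```python
# Two-flop branch predictor: one bit of prediction, one bit of history.
# A single miss (the loop exit) does not change the prediction, so the
# next pass through the loop still predicts "take" the first time.

def simulate(outcomes):
    pred, last_ok = 1, 1        # initial state assumed: predict "taken"
    hits = 0
    for actual in outcomes:
        correct = (pred == actual)
        hits += correct
        if not correct and not last_ok:
            pred ^= 1           # two misses in a row: change prediction
        last_ok = 1 if correct else 0
    return hits

# BNE taken 10 times, then not taken once (end of loop); two loop entries:
outcomes = ([1] * 10 + [0]) * 2
print(simulate(outcomes))       # 20 of 22 branches predicted correctly
```

Only the two loop-exit branches mispredict; a one-bit predictor would also mispredict the first branch of the second loop entry.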

Page 12:

Function Unit Characteristics

busy / accept handshake; fully pipelined units (1 cyc per stage) vs. partially pipelined units (2 cyc stages).

Function units have internal pipeline registers: operands are latched when an instruction enters a function unit, so inputs to a function unit (e.g., register file) can change during a long latency operation.

Multiple Function Units: IF and ID feed an Issue stage that dispatches to ALU, Mem, Fadd, Fmul, and Fdiv units, which write back (WB) to the GPRs and FPRs.

Example: Superscalar MIPS. Fetches 2 instructions at a time. If first integer and

second floating point, issue in same cycle

Understand lockstep superscalar concept

Integer instruction    FP instruction
LD  F0,0(R1)
LD  F6,-8(R1)
LD  F10,-16(R1)        ADDD F4,F0,F2
LD  F14,-24(R1)        ADDD F8,F6,F2
LD  F18,-32(R1)        ADDD F12,F10,F2
SD  0(R1),F4           ADDD F16,F14,F2
SD  -8(R1),F8          ADDD F20,F18,F2
SD  -16(R1),F12
SD  -24(R1),F16

Two issues per cycle vs. one issue per cycle.

Page 13:

2005-4-7 John Lazzaro

(www.cs.berkeley.edu/~lazzaro)

CS 152 Computer Architecture and Engineering

Lecture 21 – Advanced Processors II

www-inst.eecs.berkeley.edu/~cs152/

TAs: Ted Hong and David Marquardt

Out of order CPU design will NOT appear on exam: HW was sufficient.

Page 14:

Understand Precise Interrupts ...

Effectiveness?

Renaming and out-of-order execution was first implemented in 1969 in the IBM 360/91, but did not show up in the subsequent models until the mid-nineties.

Why?
1. Exceptions not precise!
2. Effective on a very small class of programs

One more problem needed to be solved.

Precise Interrupts

Definition: it must appear as if an interrupt (or exception) is taken between two instructions (say Ii and Ii+1):
- the effect of all instructions up to and including Ii is totally complete
- no effect of any instruction after Ii has taken place
The interrupt handler either aborts the program or restarts it at Ii+1.

Follows from the “contract” between the architect and the programmer ...

Page 15:

... in the context of Static Pipelines

Exceptions in a 5-stage pipeline: precise interrupts are difficult to implement at high speed. We want to start execution of later instructions before exception checks have finished on earlier instructions.

Exception Handling in the 5-Stage Pipeline:
- Hold exception flags in the pipeline until the commit point (M stage)
- Exceptions in earlier pipe stages override later exceptions
- Inject external (asynchronous) interrupts at the commit point (override others)
- If exception at commit: update Cause and EPC registers, kill all stages, inject handler PC into fetch stage

Diagram: the pipeline (PC, Inst. Mem, Decode, Data Mem, writeback) carries exception flag and PC registers (ExcD/PCD, ExcE/PCE, ExcM/PCM) down each stage. Exception sources: PC Address Exceptions, Illegal Opcode, Overflow, Data Addr Except. At the commit point: Select Handler PC, write Cause and EPC, and Kill F Stage, Kill D Stage, Kill E Stage, Kill Writeback.

Key observation: architected state only changes in the memory and register write stages.

Page 16:

2005-4-12 John Lazzaro

(www.cs.berkeley.edu/~lazzaro)

CS 152 Computer Architecture and Engineering

Lecture 22 – Advanced Processors III

www-inst.eecs.berkeley.edu/~cs152/

TAs: Ted Hong and David Marquardt

Page 17:

Krste

November 10, 2004

6.823, L18--3

Multithreading

How can we guarantee no dependencies between instructions in a pipeline?

-- One way is to interleave execution of instructions from different program threads on same pipeline

F D X M W

t0 t1 t2 t3 t4 t5 t6 t7 t8

T1: LW r1, 0(r2)

T2: ADD r7, r1, r4

T3: XORI r5, r4, #12

T4: SW 0(r7), r5

T1: LW r5, 12(r1)

t9

F D X M W

F D X M W

F D X M W

F D X M W

Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe

Last instruction in a thread always completes writeback before the next instruction in the same thread reads the regfile.
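The interleaving in the table above can be sketched as a round-robin schedule; a toy model (the stage names match the table, the helper is mine):

```python
# Round-robin interleave of 4 threads on a 5-stage pipe: at any cycle the
# stages hold instructions from different threads, so no instruction ever
# depends on another instruction still in flight from the same thread.

STAGES = ["F", "D", "X", "M", "W"]
THREADS = 4

def occupancy(cycle):
    """Which thread's instruction sits in each stage at a given cycle."""
    return {s: (cycle - i) % THREADS for i, s in enumerate(STAGES)}

occ = occupancy(cycle=10)
print(occ)
# F and W are 4 stages apart, so they hold the same thread: the writeback
# of one instruction overlaps the fetch of that thread's next instruction,
# which reads the regfile only after writeback completes.
```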

Krste, November 10, 2004

6.823, L18--5

Simple Multithreaded Pipeline

Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage

Diagram: a 2-bit thread-select counter (+1, mod 4) chooses among four PCs (PC1..PC4); the shared I$ feeds the IR, and four register files (GPR1..GPR4) are selected by the thread-select bits carried down the pipe to the X and Y operand ports and the shared D$.

Understand static pipeline multithreading

4 CPUs, each run at 1/4 clock

Many variants ...

Page 18:

Understand pros/cons of shared L2 ...

... supports a 1.875-Mbyte on-chip L2 cache. Power4 and Power4+ systems both have 32-Mbyte L3 caches, whereas Power5 systems have a 36-Mbyte L3 cache. The L3 cache operates as a backdoor with separate buses for reads and writes that operate at half processor speed. In Power4 and Power4+ systems, the L3 was an inline cache for data retrieved from memory. Because of the higher transistor density of the Power5’s 130-nm technology, we could move the memory controller on chip and eliminate a chip previously needed for the memory controller function. These two changes in the Power5 also have the significant side benefits of reducing latency to the L3 cache and main memory, as well as reducing the number of chips necessary to build a system.

Chip overview

Figure 2 shows the Power5 chip, which IBM fabricates using silicon-on-insulator (SOI) devices and copper interconnect. SOI technology reduces device capacitance to increase transistor performance [5]. Copper interconnect decreases wire resistance and reduces delays in wire-dominated chip-timing paths. In 130-nm lithography, the chip uses eight metal levels and measures 389 mm². The Power5 processor supports the 64-bit PowerPC architecture. A single die contains two identical processor cores, each supporting two logical threads. This architecture makes the chip appear as a four-way symmetric multiprocessor to the operating system. The two cores share a 1.875-Mbyte (1,920-Kbyte) L2 cache. We implemented the L2 cache as three identical slices with separate controllers for each. The L2 slices are 10-way set-associative with 512 congruence classes of 128-byte lines. The data’s real address determines which L2 slice the data is cached in. Either processor core can independently access each L2 controller. We also integrated the directory for an off-chip 36-Mbyte L3 cache on the Power5 chip. Having the L3 cache directory on chip allows the processor to check the directory after an L2 miss without experiencing off-chip delays. To reduce memory latencies, we integrated the memory controller on the chip. This eliminates driver and receiver delays to an external controller.

Processor core

We designed the Power5 processor core to support both enhanced SMT and single-threaded (ST) operation modes. Figure 3 shows the Power5’s instruction pipeline, which is identical to the Power4’s. All pipeline latencies in the Power5, including the branch misprediction penalty and load-to-use latency with an L1 data cache hit, are the same as in the Power4. The identical pipeline structure lets optimizations designed for Power4-based systems perform equally well on Power5-based systems. Figure 4 shows the Power5’s instruction flow diagram. In SMT mode, the Power5 uses two separate instruction fetch address registers to store the program counters for the two threads. Instruction fetches (IF stage) alternate between the two threads. In ST mode, the Power5 uses only one program counter and can fetch instructions for that thread every cycle. It can fetch up to eight instructions from the instruction cache (IC stage) every cycle. The two threads share the instruction cache and the instruction translation facility. In a given cycle, all fetched instructions come from the same thread.

HOT CHIPS 15 / IEEE MICRO

Figure 2. Power5 chip (FXU = fixed-point execution unit, ISU = instruction sequencing unit, IDU = instruction decode unit, LSU = load/store unit, IFU = instruction fetch unit, FPU = floating-point unit, and MC = memory controller).

(1) Threads on two cores that use shared libraries conserve L2 memory.

(2) Threads on two cores share memory via L2 cache operations. Much faster than 2 CPUs on 2 chips.

Also see the Lecture 27 slides on these related topics!

Page 19:

Understand Niagara design choices

8 cores: single-issue, 6-stage pipeline, 4-way multi-threaded, fast crypto support.

Shared resources: 3MB on-chip cache, 4 DDR2 interfaces, 32G DRAM at 20 Gb/s, 1 shared FP unit, GB Ethernet ports.

Die size: 340 mm² in 90 nm. Power: 50-60 W.

Sources: Hot Chips, via EE Times, Infoworld. J Schwartz weblog (Sun COO)

Page 20:

256 KB Local Store -- 128 128-bit registers.
SPU issues 2 inst/cycle (in order) to 7 execution units.
SPU fills Local Store using DMA to DRAM and network.

Programmers manage caching explicitly

Understand Cell design choices ...

Page 21:

2005-4-14 John Lazzaro

(www.cs.berkeley.edu/~lazzaro)

CS 152 Computer Architecture and Engineering

Lecture 23 – Buses, Disks, and RAID

www-inst.eecs.berkeley.edu/~cs152/

TAs: Ted Hong and David Marquardt

Page 22:

Understand “bus master” concept ...

Figure 2-1 Simplified block diagram

Diagram: dual 64-bit PowerPC G5 processor modules connect to the U3H memory controller and PCI bus bridge over processor interface buses running at half the processor speed. The U3H connects to the DIMM slots over a 400 MHz ECC DDR memory bus, to a PCI-X bridge (two PCI-X slots; PCI, PCI-X, or graphics support) over 16-bit 4.8 GBps and 8-bit 1.6 GBps HyperTransport links, and to the K2 I/O device and disk controller, which serves the 1.5 Gbps Serial ATA buses and internal hard drive connectors, the ATA/100 bus and internal optical drive connector, the FireWire PHY (FireWire 400 front port, FireWire 800 rear ports), the PCI USB controller (USB 2.0 ports, 480 Mbps), the Ethernet controller (10/100/1000 Ethernet ports), the serial port, the 33 MHz PCI bus, BootROM, the PMU99 power controller, and the system activity lights.

Xserve G5 has the following separate buses.

- Processor bus: running at half the speed of the processor, 64-bit data throughput per processor, connecting the processor module to the U3H IC
- Dual processor systems have two independent, 64-bit processor buses, each running at half the speed of the processors
- Memory bus: 400 MHz, 128-bit bus connecting the main ECC DDR SDRAM memory to the U3H IC
- PCI-X bridge bus: supports two 64-bit PCI-X slots

Block Diagram and Buses | 2005-01-04 | © 2002, 2005 Apple Computer, Inc. All Rights Reserved. (Chapter 2: Architecture)

The Apple Xserve G5 has 8 DIMM slots, to support 8 GB.

Memory controller is the only “bus master” - it can start transactions on the bus, but the DIMMs cannot.

DIMMs respond to transaction requests. Since memory controller is only bus master, and there are a small number of DIMM slots, bus sharing is easy: use dedicated wires to each slot.

Page 23:

Understand “bus vs switch” issues ...


+++ Low cost. One set of wires from memory controller can support up to 8 DIMMs.

--- Latency of bus increases with length of wires (needed to reach all 8 DIMM sockets), and the loading of 8 DIMMs. Must design for worst-case (8 DIMMs), even if only 1 DIMM is present.

--- Shared wires limit maximum bandwidth from memory. If memory controller had 8 sets of dedicated wires, one per DIMM, memory bandwidth would be much better (but more expensive).

Page 24: CS 152 Computer Architecture and Engineering Lecture 27 Mid-Term …cs152/sp05/lecnotes/lec16-1... · 2005-05-03 · Thursday 5/5: Midterm II, 6 PM to 9 PM, 320 Soda. Tuesday 5/10:

UC Regents Spring 2005 © UCBCS 152 L26: Synchronization

Understand serial bus pros and cons

Serial: Data is sent “bit by bit” over one logical wire (USB, FireWire, Ethernet).

Serial pros and cons:

+++ Low cost: a small number of wires cost less. Also, cheap wires and connectors can be used, since skew is less of a problem.

+++ Sending data over many wires introduces “skew”: signals travel on each wire at a slightly different speed. Skew limits the speed and length of a bus. Serial buses have fewer skew issues, because they only use one logical wire.

--- When only using one wire, there is a bandwidth limit. Thus, DIMMs use many wires (a “parallel” bus, not “serial”).

Page 25:

Understand disk block organization ...

Outer tracks hold more sectors.

2005 desktop rotation speed: 7200 RPM.

Each ring is a “track”. A track is divided into “sectors”. A sector codes a fixed # of bytes (ex: 4K blocks).

Many more tracks and sectors than shown!

Page 26:

Understand the Disk Latency Equation

Latency of a disk block read =
Queueing Time (zero if no other accesses pending)
+ Controller Time (usually short)
+ Seek Time (2005: about 8 ms)
+ Rotation Time (1/2 full rotation time; 4.2 ms @ 7200 RPM)
+ Transfer Time (1 ms @ 7200 RPM)
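The equation as arithmetic, with the slide's 2005-era numbers plugged in (queueing and controller time taken as zero for the estimate):

```python
# Disk read latency = queue + controller + seek + half-rotation + transfer.

def read_latency_ms(seek_ms=8.0, rpm=7200, transfer_ms=1.0,
                    queue_ms=0.0, controller_ms=0.0):
    half_rotation_ms = 0.5 * 60_000 / rpm   # 1/2 full rotation, in ms
    return queue_ms + controller_ms + seek_ms + half_rotation_ms + transfer_ms

print(round(read_latency_ms(), 2))   # 13.17 ms at 7200 RPM
```

Half a rotation at 7200 RPM is 60000/7200/2 ≈ 4.17 ms, which matches the slide's 4.2 ms figure.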

Page 27:

Understand how to reason about RAID

RAID 5: Interleaved Parity Disks. A 5-disk Recovery Group: D0, D1, D2, D3, Parity.

Logical blocks Bn on the array, with the parity block position rotating across stripes:
B0 B1 B2 B3 P0
B4 B5 B6 B7 P1
B8 B9 B10 B11 P2
. . .

+++ Writes of parity blocks distributed across 5 disks
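A sketch of one possible rotating-parity layout for the 5-disk array. The exact rotation convention is an assumption (the slide does not pin it down); the point is that parity blocks spread across all disks instead of loading one parity disk:

```python
# RAID 5 layout sketch: one parity block per stripe of 4 data blocks,
# with the parity disk rotating from stripe to stripe.

DISKS = 5

def locate(block):
    """Map logical block Bn to (disk index, stripe index)."""
    stripe = block // (DISKS - 1)
    parity_disk = (DISKS - 1 - stripe) % DISKS   # assumed rotation: P0 on
    disk = block % (DISKS - 1)                   # disk 4, P1 on disk 3, ...
    if disk >= parity_disk:
        disk += 1                                # skip over the parity disk
    return disk, stripe

# Stripe 0: B0..B3 land on disks 0..3, parity P0 on disk 4.
assert [locate(b)[0] for b in range(4)] == [0, 1, 2, 3]
# Stripe 1: parity moves to disk 3, so B7 lands on disk 4.
assert locate(7) == (4, 1)
```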

COD/3e, pages 574-580, for RAID details. Will be responsible for level of detail in book for Mid-term II.

Page 28:

2005-4-19 John Lazzaro

(www.cs.berkeley.edu/~lazzaro)

CS 152 Computer Architecture and Engineering

Lecture 24 – Networks

www-inst.eecs.berkeley.edu/~cs152/

TAs: Ted Hong and David Marquardt

Know this material well ...

Page 29:

Understand bottom-up networking ...

IP Packet inside an 802.11b WiFi packet: for this “hop”, the IP packet is sent “inside” of a wireless 802.11b packet.

IP Packet inside a cable modem packet: for this “hop”, the IP packet is sent “inside” of a cable modem DOCSIS packet.

ISO Layer Names:
IP packet: “Layer 3”
WiFi and Cable Modem packets: “Layer 2”
Radio/cable waveforms: “Layer 1”

Page 30:

email WWW phone...

SMTP HTTP RTP...

TCP UDP…

IP

Ethernet Wi-Fi…

CSMA async sonet...

copper fiber radio...

Diagram Credit: Steve Deering

Protocol Complexity

Understand the IP abstraction ...

Internet Protocol (IP): An abstraction for applications to target, and for link networks to support. Very simple, very successful.

Link layer is not expected to be perfect.

IP presents link network errors/losses in an abstract way (not a link-specific way).

Page 31:

Understand how IP numbers work

198.211.61.22 ??? A user-friendly form of the 32-bit unsigned value 3335732502, which is: 198*2^24 + 211*2^16 + 61*2^8 + 22

IP4 number for this computer: 198.211.61.22. Every directly connected host has a unique IP number.

Upper limit of 2^32 IP4 numbers (some are reserved for other purposes).

Next-generation IP (IP6) limit: 2^128.
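The arithmetic above is easy to check in a short Python sketch (the function names `quad_to_u32` and `u32_to_quad` are ours, not from the slides):

```python
# Convert between dotted-quad notation and the 32-bit unsigned value,
# as in the slide's 198.211.61.22 example.

def quad_to_u32(quad):
    a, b, c, d = (int(x) for x in quad.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

def u32_to_quad(n):
    return '.'.join(str((n >> s) & 0xFF) for s in (24, 16, 8, 0))

print(quad_to_u32('198.211.61.22'))   # the slide's 32-bit value
print(u32_to_quad(3335732502))
```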

Page 32:

Understand the IP header fields ...

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  IHL  |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|     Fragment Offset     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Time to Live |    Protocol   |        Header Checksum        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Source Address                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Destination Address                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|   Payload data (size implied by Total Length header field)    |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

To: IP number

From: IP number Note: Could be a lie ...

IHL field: # of words in header. The typical header (IHL = 5 words) is shown. Longer headers add extra fields after the destination address.

Header

Data

Bitfield numbers

Version field: IP4, IP6, etc. Protocol field: how the destination should interpret the payload data.
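A sketch of pulling these fields out of a minimal 20-byte header with Python's standard `struct` module; the header bytes below are hand-built for illustration, not a captured packet:

```python
# Unpack the fixed IPv4 header fields shown in the diagram.
# '!BBHHHBBHII' = big-endian: Ver/IHL byte, ToS, Total Length,
# Identification, Flags/Fragment Offset, TTL, Protocol, Header
# Checksum, Source Address, Destination Address (20 bytes total).

import struct

def parse_ipv4_header(hdr):
    ver_ihl, tos, total_len, ident, flags_frag, ttl, proto, cksum, src, dst = \
        struct.unpack('!BBHHHBBHII', hdr[:20])
    return {
        'version': ver_ihl >> 4,       # 4 for IP4
        'ihl_words': ver_ihl & 0xF,    # header length in 32-bit words
        'total_length': total_len,     # header + payload, in bytes
        'ttl': ttl,
        'protocol': proto,             # how to interpret the payload
        'src': src, 'dst': dst,        # 32-bit IP numbers
    }

# Example header: IP4, IHL=5, 40-byte total length, TTL 64, protocol 6,
# source and destination both 198.211.61.22 (3335732502).
hdr = struct.pack('!BBHHHBBHII', (4 << 4) | 5, 0, 40, 0, 0, 64, 6, 0,
                  3335732502, 3335732502)
f = parse_ipv4_header(hdr)
print(f['version'], f['ihl_words'], f['total_length'])
```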

Page 33:

2005-4-21 John Lazzaro

(www.cs.berkeley.edu/~lazzaro)

CS 152 Computer Architecture and Engineering

Lecture 25 – Routers

www-inst.eecs.berkeley.edu/~cs152/

TAs: Ted Hong and David Marquardt

Page 34:

Understand the MGR Router ...

[Slide shows the first page of the paper: C. Partridge et al., “A 50-Gb/s IP Router,” IEEE/ACM Transactions on Networking, vol. 6, no. 3, June 1998, p. 237. Abstract: aggressive research on gigabit-per-second networks has put pressure on router technology to keep pace; the paper describes a router with a 50-Gb/s backplane that can forward tens of millions of packets per second. The introduction gives a rule of thumb: for every gigabit per second of bandwidth, a router needs about 1 MPPS of forwarding power. The MGR achieves up to 32 MPPS with 50 Gb/s of full-duplex backplane capacity, roughly two to ten times faster than the high-performance routers of its day.]

The “MGR” Router was a research project in the late 1990s. It kept up with the “line rate” of the fastest links of its day (OC-48c, 2.4 Gb/s optical).

Architectural approach is still valid today ...

At the level we presented it in class. However, it will be much easier to understand it at that level if you read the paper (on website).

Page 35:

Know the life of a packet in a router ...

[Slide shows Fig. 1 (“MGR outline”) and the Design Summary from the paper: the MGR consists of multiple line cards and forwarding engine cards plugged into a high-speed switch. When a packet arrives at a line card, its header is passed through the switch to a forwarding engine while the rest of the packet stays on the inbound line card; the engine decides how to forward the packet, updates the header, and returns it with forwarding instructions; the line card reintegrates the header and sends the whole packet to the outbound line card. Key innovations: complete forwarding tables in every forwarding engine, a switched backplane instead of a shared bus, and forwarding engines on boards separate from the line cards.]

1. Packet arrives at line card. Line card sends the packet header to a forwarding engine for processing.

Note: We can balance the number of line cards and forwarding engines for efficiency: this is how packet routing parallelizes.

Page 36:

Understand the forwarding problem ...

[IP header diagram repeated from Page 32; the field of interest here is the “To:” (destination) IP number.]

Forwarding engine looks at the destination address, and decides which outbound line card will get the packet closest to its destination. How?

Page 37:

And how network structure affects it. Routers route to a “network”, not a “host”. /xx means the top xx bits of the 32-bit address identify a single network.

Thus, all of UCB only needs 6 routing table entries. Today, the Internet routing table has about 100,000 entries.
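A minimal Python sketch of the resulting lookup, longest-prefix match: the route entries and line-card names below are invented examples (128.32.0.0/16 stands in for a campus-sized network):

```python
# Longest-prefix match: of all route entries whose network prefix
# covers the destination, pick the one with the most prefix bits.

def quad(s):
    a, b, c, d = (int(x) for x in s.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

routes = [
    ('128.32.0.0', 16, 'line-card-2'),    # one /16 entry covers 2^16 hosts
    ('128.32.128.0', 24, 'line-card-5'),  # a more specific /24 inside it
]

def lookup(dst):
    best = None
    for net, bits, hop in routes:
        mask = ((1 << bits) - 1) << (32 - bits)   # top `bits` bits set
        if quad(dst) & mask == quad(net) & mask:
            if best is None or bits > best[0]:
                best = (bits, hop)
    return best[1] if best else 'default'

print(lookup('128.32.128.7'))   # matches both entries; the /24 wins
print(lookup('128.32.5.1'))     # only the /16 matches
```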

Page 38:

Understand how switches work ...

[Figure: a crossbar switch connecting line cards and forwarding engines.]

A pipelined arbitration system decides how to connect up the switch. The connections for the transfer at epoch N are computed in epochs N-3, N-2, and N-1, using dedicated switch allocation wires.

Page 39:

2005-4-26 John Lazzaro

(www.cs.berkeley.edu/~lazzaro)

CS 152 Computer Architecture and Engineering

Lecture 26 – Synchronization

www-inst.eecs.berkeley.edu/~cs152/

TAs: Ted Hong and David Marquardt

Page 40:

Understand Sequential Consistency

Sequential Consistency: As if each thread takes turns executing, and instructions in each thread execute in program order.

Sequentially consistent architectures get the right answer, but give up many optimizations.

T1 code (producer):

      ORi  R1, R0, x     ; Load x value into R1
      LW   R2, tail(R0)  ; Load queue tail into R2
      SW   R1, 0(R2)     ; (1) Store x into queue
      ADDi R2, R2, 4     ; Shift tail by one word
      SW   R2, tail(R0)  ; (2) Update tail memory addr

T2 code (consumer):

      LW   R3, head(R0)  ; Load queue head into R3
spin: LW   R4, tail(R0)  ; (3) Load queue tail into R4
      BEQ  R4, R3, spin  ; If queue empty, wait
      LW   R5, 0(R3)     ; (4) Read x from queue into R5
      ADDi R3, R3, 4     ; Shift head by one word
      SW   R3, head(R0)  ; Update head memory addr

Legal orders: 1, 2, 3, 4 or 1, 3, 2, 4 or 3, 4, 1, 2 ... but not 2, 3, 1, 4!
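A small Python sketch makes the legal-order claim checkable: with operations 1 and 2 in one thread and 3 and 4 in the other, sequential consistency allows exactly the interleavings that preserve both program orders:

```python
# Enumerate global orders of operations 1..4 and keep only those
# where each thread's operations appear in program order:
# 1 before 2 (producer) and 3 before 4 (consumer).

from itertools import permutations

def legal(order):
    return order.index(1) < order.index(2) and order.index(3) < order.index(4)

legal_orders = [p for p in permutations((1, 2, 3, 4)) if legal(p)]
print(len(legal_orders))     # 6 legal interleavings of two 2-op threads
print(legal((2, 3, 1, 4)))   # the slide's illegal example: 2 precedes 1
```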

Page 41:

Understand memory fences

[Queue figure: addresses grow upward. Before: the queue holds y. After: it holds y and x, with Head and Tail marking the ends.]

T2 code (consumer):

      LW   R3, head(R0)  ; Load queue head into R3
spin: LW   R4, tail(R0)  ; (3) Load queue tail into R4
      BEQ  R4, R3, spin  ; If queue empty, wait
      MEMBAR
      LW   R5, 0(R3)     ; (4) Read x from queue into R5
      ADDi R3, R3, 4     ; Shift head by one word
      SW   R3, head(R0)  ; Update head memory addr

T1 code (producer):

      ORi  R1, R0, x     ; Load x value into R1
      LW   R2, tail(R0)  ; Load queue tail into R2
      SW   R1, 0(R2)     ; (1) Store x into queue
      MEMBAR
      ADDi R2, R2, 4     ; Shift tail by one word
      SW   R2, tail(R0)  ; (2) Update tail memory addr

The MEMBARs ensure 1 happens before 2, and 3 happens before 4.

Page 42:

      LW   R3, head(R0)  ; Load queue head into R3
spin: LW   R4, tail(R0)  ; Load queue tail into R4
      BEQ  R4, R3, spin  ; If queue empty, wait
      ; --- critical section ---
      LW   R5, 0(R3)     ; Read x from queue into R5
      ADDi R3, R3, 4     ; Shift head by one word
      SW   R3, head(R0)  ; Update head memory addr

Assuming sequential consistency: 3 MEMBARs not shown ...

Understand Test and Set ...

Test&Set(m, R):
  R = M[m];
  if (R == 0) then M[m] = 1;

An example atomic read-modify-write ISA instruction:

What if the OS swaps a process out while in the critical section? “High-latency locks”, a source of Linux audio problems (and others)

P: Test&Set R6, mutex(R0)  ; Mutex check
   BNE R6, R0, P           ; If not 0, spin

V: SW R0, mutex(R0)        ; Give up mutex

Note: With Test&Set(), the M[m]=1 state corresponds to last slide’s s=0 state!

Page 43:

and non-blocking synchronization ...

Compare&Swap(Rt, Rs, m):
  if (Rt == M[m]) then
    M[m] = Rs; Rs = Rt; status = success;
  else
    status = fail;

Another atomic read-modify-write instruction:

If thread swaps out before Compare&Swap, no latency problem;this code only “holds” the lock for one instruction!

try:  LW   R3, head(R0)  ; Load queue head into R3
spin: LW   R4, tail(R0)  ; Load queue tail into R4
      BEQ  R4, R3, spin  ; If queue empty, wait
      LW   R5, 0(R3)     ; Read x from queue into R5
      ADDi R6, R3, 4     ; Shift head by one word
      Compare&Swap R3, R6, head(R0) ; Try to update head
      BNE  R3, R6, try   ; If not success, try again

If R3 != R6, another thread got here first, so we must try again.

Assuming sequential consistency: MEMBARs not shown ...
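A Python model of Compare&Swap and the non-blocking dequeue retry loop (again a simulation of the instruction's semantics; the word size of 4 and the head/tail layout mirror the assembly above):

```python
# M models memory as a dict; queue slots are word addresses.

def compare_and_swap(M, m, expected, new):
    """Atomic in hardware: update M[m] only if it still equals expected."""
    if M[m] == expected:
        M[m] = new
        return True
    return False

def dequeue(M):
    while True:
        head = M['head']
        if head == M['tail']:
            continue                 # queue empty: spin, as in the slide
        x = M[head]                  # read the element at head
        if compare_and_swap(M, 'head', head, head + 4):
            return x                 # success: we advanced head
        # else another thread advanced head first; try again

M = {'head': 100, 'tail': 104, 100: 'x'}
print(dequeue(M), M['head'])
```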

Page 44:

2005-4-28 John Lazzaro

(www.cs.berkeley.edu/~lazzaro)

CS 152 Computer Architecture and Engineering

Lecture 27 – Multiprocessors

www-inst.eecs.berkeley.edu/~cs152/

TAs: Ted Hong and David Marquardt

Page 45:

Understand the coherency problem ...

[Write-through caches. Shared main memory: Addr 16 holds value 5.]

CPU0: LW R2, 16(R0) → CPU0's cache now holds (16, 5).
CPU1: LW R2, 16(R0) → CPU1's cache now holds (16, 5).
CPU1: SW R0, 16(R0) → CPU1's cache and main memory now hold (16, 0), but CPU0's cache still holds (16, 5).

View of memory no longer “coherent”.

Loads of location 16 from CPU0 and CPU1 see different values!

Today: What to do ...

Page 46:

and cache placement impact ...

[Figure: CPU0 and CPU1 reach a shared multi-bank cache through a memory switch, backed by shared main memory.]

For modern clock rates, access to the shared cache through the switch takes 10+ cycles. Using the shared cache as the L1 data cache is tantamount to slowing down the clock 10X for LWs. Not good.

This approach was a complete solution in the days when DRAM row access time and CPU clock period were well matched.

Page 47:

Using different architectures ...

[Figure: CPU0 and CPU1 each have private L1 caches; a memory switch or bus connects them to a shared multi-bank L2 cache, backed by shared main memory.]

Thus, we need to solve the cache coherency problem for the L1 caches.

Advantages of shared L2 over private L2s:
Processors communicate at cache speed, not DRAM speed.
Constructive interference, if both CPUs need the same data/instructions.

Disadvantage: CPUs share bandwidth to the L2 cache ...

Page 48:

Understand the write-thru solution ...

[Figure: CPU0 and CPU1 each have a cache with a snooper attached to a shared memory bus, which connects to the shared main memory hierarchy.]

1. Writing CPU takes control of the bus.
2. Address to be written is invalidated in all other caches. Reads will no longer hit in cache and get stale data.
3. Write is sent to main memory. Reads will cache miss, and retrieve the new value from main memory.

For write-thru caches ...

To a first-order, reads will “just work” if write-thru caches implement this policy.

A “two-state” protocol (cache lines are “valid” or “invalid”).
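The two-state protocol can be sketched as a toy simulation in Python, replaying the CPU0/CPU1 example from the earlier coherency slide (the class and method names are ours, invented for the sketch):

```python
# Two-state (valid/invalid) write-through invalidation protocol.
# Each cache maps addr -> value for its valid lines; a write invalidates
# the line in every other cache and goes through to main memory.

class System:
    def __init__(self, mem, n_cpus=2):
        self.mem = dict(mem)
        self.caches = [{} for _ in range(n_cpus)]

    def load(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:             # miss: fetch from main memory
            cache[addr] = self.mem[addr]
        return cache[addr]

    def store(self, cpu, addr, value):
        for i, cache in enumerate(self.caches):
            if i != cpu:
                cache.pop(addr, None)     # snoopers invalidate the line
        self.caches[cpu][addr] = value    # write-through ...
        self.mem[addr] = value            # ... to main memory

sys16 = System({16: 5})
sys16.load(0, 16)         # CPU0 caches (16, 5)
sys16.load(1, 16)         # CPU1 caches (16, 5)
sys16.store(1, 16, 0)     # CPU1 writes 0: CPU0's copy is invalidated
print(sys16.load(0, 16))  # CPU0 misses and re-reads 0: coherent again
```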

Page 49:

Understand the NUMA concept ...

[Figure: CPU 0 through CPU 1023, each with a cache and a local DRAM, connected by an interconnection network.]

Each CPU has part of main memory attached to it. To access other parts of main memory, use the interconnection network.

For best results, applications take the non-uniform memory latency into account.

Good for applications that match the machine model ...

Page 50:

And the cluster concept ... In some applications, each machine can handle a net query by itself.

Example: serving static web pages. Each machine has a copy of the website.

[Slide background: excerpt from Eric Brewer's “Giant-Scale Services” article (IEEE Internet Computing, July/August 2001). It lists the advantages of infrastructure services (access anywhere, availability via multiple devices, groupware support, lower overall cost, simplified service updates) and the six components of the basic model: clients, a best-effort IP network, a load manager, servers, a persistent data store, and an optional backplane. It also notes that read-only queries greatly outnumber updates, and that all giant-scale sites use clusters.]

Figure 1. The basic model for giant-scale services. Clients connect via the Internet and then go through a load manager that hides down nodes and balances traffic.

The load manager is a special-purpose computer that assigns incoming HTTP connections to a particular machine. Image from Eric Brewer's IEEE Internet Computing article.

Page 51:

Good luck on the mid-term!

Today: Mid-term Review, HKN ...

Thursday 5/5: Midterm II, 6 PM to 9 PM, 320 Soda.

Tuesday 5/10: Final presentations.

This time, more of an overview style ...

No class on Thursday.

Deadline to bring up grading issues: Tues 5/10 @ 5 PM. Contact John at lazzaro@eecs

Peer Review: For final project. Please send by Friday at 5 PM.

No electronic devices, no notes ...