

Lecture 2: Processor Design, Single-Processor Performance

G63.2011.002/G22.2945.001 · September 14, 2010

Intro Basics Assembly Memory Pipelines


Outline

Intro

The Basic Subsystems

Machine Language

The Memory Hierarchy

Pipelines



Admin Bits

• Lec. 1 slides posted

• New here? Welcome! Please send in survey info (see lec. 1 slides) via email

• PASI

• Please subscribe to mailing list

• Near end of class: 5-min, 3-question ‘concept check’




Introduction

Goal for Today

High Performance Computing: discuss the actual computer end of this, and its influence on performance.



What’s in a computer?

Processor

Intel Q6600 Core2 Quad, 2.4 GHz

Die

(2×) 143 mm², 2× 2 cores, 582,000,000 transistors, ∼100 W

Memory




A Basic Processor

[Diagram: internal bus connecting the Register File (with Flags), Data ALU, Address ALU, Control Unit (with PC), and Memory Interface (with instruction fetch); the Memory Interface drives the external Data Bus and Address Bus.]

(loosely based on Intel 8086)

Bonus question: What's a bus?




How all of this fits together

Everything synchronizes to the Clock.

Control Unit (“CU”): The brains of the operation. Everything connects to it.

Bus entries/exits are gated and (potentially) buffered.

CU controls gates, tells other units about ‘what’ and ‘how’:

• What operation?

• Which register?

• Which addressing mode?



What is. . . an ALU?

Arithmetic Logic Unit. One or two operands A, B; operation selector (Op):

• (Integer) Addition, Subtraction

• (Logical) And, Or, Not

• (Bitwise) Shifts (equivalent to multiplication by a power of two)

• (Integer) Multiplication, Division

Specialized ALUs:

• Floating Point Unit (FPU)

• Address ALU

Operates on binary representations of numbers. Negative numbers represented by two's complement.

[Diagram: ALU with operand inputs A and B, operation selector Op, and result output R.]



What is. . . a Register File?

Registers are On-Chip Memory

• Directly usable as operands in Machine Language

• Often “general-purpose”

• Sometimes special-purpose: Floating point, Indexing, Accumulator

• Small: x86-64: 16×64-bit GPRs

• Very fast (near-zero latency)

[Diagram: register file %r0–%r7.]



How does computer memory work?

One (reading) memory transaction (simplified):

[Timing diagram: processor–memory transaction with signals CLK, R/W̄, address lines A0..15, and data lines D0..15.]

Observation: Access (and addressing) happens in bus-width-size “chunks”.




What is. . . a Memory Interface?

Memory Interface gets and stores binary words in off-chip memory.

Smallest granularity: Bus width

Tells outside memory

• “where” through address bus

• “what” through data bus

Computer main memory is “Dynamic RAM” (DRAM): slow, but big and cheap.




A Very Simple Program

int a = 5;
int b = 17;
int z = a * b;

 4: c7 45 f4 05 00 00 00   movl $0x5,-0xc(%rbp)
 b: c7 45 f8 11 00 00 00   movl $0x11,-0x8(%rbp)
12: 8b 45 f4               mov  -0xc(%rbp),%eax
15: 0f af 45 f8            imul -0x8(%rbp),%eax
19: 89 45 fc               mov  %eax,-0x4(%rbp)
1c: 8b 45 fc               mov  -0x4(%rbp),%eax

Things to know:

• Addressing modes (Immediate, Register, Base plus Offset)

• Hexadecimal notation (the 0x prefix)

• “AT&T Form” (we'll use this): <opcode><size> <source>, <dest>



Another Look

[Diagram: the processor block diagram again (internal bus, register file/flags, ALUs, control unit/PC, memory interface), with the six instructions of the disassembly from the previous slide traced through it.]




A Very Simple Program: Intel Form

 4: c7 45 f4 05 00 00 00   mov  DWORD PTR [rbp-0xc],0x5
 b: c7 45 f8 11 00 00 00   mov  DWORD PTR [rbp-0x8],0x11
12: 8b 45 f4               mov  eax,DWORD PTR [rbp-0xc]
15: 0f af 45 f8            imul eax,DWORD PTR [rbp-0x8]
19: 89 45 fc               mov  DWORD PTR [rbp-0x4],eax
1c: 8b 45 fc               mov  eax,DWORD PTR [rbp-0x4]

• “Intel Form” (you might see this on the net): <opcode> <sized dest>, <sized source>

• Goal: Reading comprehension.

• Don’t understand an opcode? Google “<opcode> intel instruction”.



Machine Language Loops

int main()
{
  int y = 0, i;
  for (i = 0; y < 10; ++i)
    y += i;
  return y;
}

 0: 55                      push %rbp
 1: 48 89 e5                mov  %rsp,%rbp
 4: c7 45 f8 00 00 00 00    movl $0x0,-0x8(%rbp)
 b: c7 45 fc 00 00 00 00    movl $0x0,-0x4(%rbp)
12: eb 0a                   jmp  1e <main+0x1e>
14: 8b 45 fc                mov  -0x4(%rbp),%eax
17: 01 45 f8                add  %eax,-0x8(%rbp)
1a: 83 45 fc 01             addl $0x1,-0x4(%rbp)
1e: 83 7d f8 09             cmpl $0x9,-0x8(%rbp)
22: 7e f0                   jle  14 <main+0x14>
24: 8b 45 f8                mov  -0x8(%rbp),%eax
27: c9                      leaveq
28: c3                      retq

Things to know:

• Condition Codes (Flags): Zero, Sign, Carry, etc.

• Call Stack: Stack frame, stack pointer, base pointer

• ABI: Calling conventions

Want to make those yourself? Write myprogram.c, then:

$ cc -c myprogram.c
$ objdump --disassemble myprogram.o




We know how a computer works!

All of this can be built in about 4000 transistors (e.g. MOS 6502 in Apple II, Commodore 64, Atari 2600).

So what exactly is Intel doing with the other 581,996,000 transistors?

Answer: Make things go faster!

Goal now: Understand sources of slowness, and how they get addressed.

Remember: High Performance Computing




The High-Performance Mindset

Writing high-performance Codes

Mindset: What is going to be the limiting factor?

• ALU?

• Memory?

• Communication? (if multi-machine)

Benchmark the assumed limiting factor right away.

Evaluate

• Know your peak throughputs (roughly)

• Are you getting close?

• Are you tracking the right limiting factor?




Source of Slowness: Memory

Memory is slow.

Distinguish two different versions of “slow”:

• Bandwidth

• Latency

→ Memory has long latency, but can have large bandwidth.

Size of die vs. distance to memory: big!

Dynamic RAM: long intrinsic latency!

Idea:

Put a look-up table of recently-used data onto the chip.

→ “Cache”




The Memory Hierarchy

Hierarchy of increasingly bigger, slower memories:

Registers: 1 kB, 1 cycle

L1 Cache: 10 kB, 10 cycles

L2 Cache: 1 MB, 100 cycles

DRAM: 1 GB, 1000 cycles

Virtual Memory (hard drive): 1 TB, 1 M cycles

How might data locality factor into this?

What is a working set?




Cache: Actual Implementation

Demands on cache implementation:

• Fast, small, cheap, low-power

• Fine-grained

• High “hit”-rate (few “misses”)

[Diagram: main memory (index/data: 0 xyz, 1 pdq, 2 abc, 3 rgf) alongside a small cache memory whose entries carry index, tag, and data (e.g. abc tagged 2, xyz tagged 1).]

Problem: Goals at odds with each other: access matching logic expensive!

Solution 1: More data per unit of access matching logic → larger “Cache Lines”

Solution 2: Simpler/less access matching logic → less than full “Associativity”

Other choices: Eviction strategy, size



Cache: Associativity

Direct Mapped vs. 2-way set associative:

[Diagram: memory locations 0, 1, 2, … mapped onto cache slots 0–3, once direct-mapped and once 2-way set associative.]

[Plot: miss rate (log scale, 1e-6 to 0.1) versus cache size (1K to 1M and Inf) for Direct, 2-way, 4-way, 8-way, and Full associativity. Caption: Miss rate versus cache size on the Integer portion of SPEC CPU2000 [Cantin, Hill 2003].]




Cache Example: Intel Q6600/Core2 Quad

--- L1 data cache ---

fully associative cache = false

threads sharing this cache = 0x0 (0)

processor cores on this die= 0x3 (3)

system coherency line size = 0x3f (63)

ways of associativity = 0x7 (7)

number of sets - 1 (s) = 63

--- L1 instruction ---

fully associative cache = false

threads sharing this cache = 0x0 (0)

processor cores on this die= 0x3 (3)

system coherency line size = 0x3f (63)

ways of associativity = 0x7 (7)

number of sets - 1 (s) = 63

--- L2 unified cache ---

fully associative cache = false

threads sharing this cache = 0x1 (1)

processor cores on this die= 0x3 (3)

system coherency line size = 0x3f (63)

ways of associativity = 0xf (15)

number of sets - 1 (s) = 4095

More than you care to know about your CPU:http://www.etallen.com/cpuid.html



Measuring the Cache I

void go(unsigned count, unsigned stride)
{
  const unsigned arr_size = 64 * 1024 * 1024;
  int *ary = (int *) malloc(sizeof(int) * arr_size);

  for (unsigned it = 0; it < count; ++it)
  {
    for (unsigned i = 0; i < arr_size; i += stride)
      ary[i] *= 17;
  }

  free(ary);
}

[Plot: run time (0.02–0.16 s) versus stride (2^0–2^10).]




Measuring the Cache II

void go(unsigned array_size, unsigned steps)
{
  int *ary = (int *) malloc(sizeof(int) * array_size);
  unsigned asm1 = array_size - 1;

  for (unsigned i = 0; i < steps; ++i)
    ary[(i*16) & asm1]++;

  free(ary);
}

[Plot: effective bandwidth (0–6 GB/s) versus array size (2^12–2^26 bytes).]




Measuring the Cache III

void go(unsigned array_size, unsigned stride, unsigned steps)
{
  char *ary = (char *) malloc(sizeof(int) * array_size);

  unsigned p = 0;
  for (unsigned i = 0; i < steps; ++i)
  {
    ary[p]++;
    p += stride;
    if (p >= array_size)
      p = 0;
  }

  free(ary);
}

[Heat map: array size (5–20 MB) versus stride (100–600 bytes).]




Programming for the Cache

How can we rearrange programs to be cache-friendly?

Examples:

• Large vectors x, a, b: compute x ← x + 3a − 5b.

• Matrix-Matrix Multiplication → Homework 1, posted.





Source of Slowness: Sequential Operation

IF Instruction fetch

ID Instruction Decode

EX Execution

MEM Memory Read/Write

WB Result Writeback



Solution: Pipelining



Pipelining

[Diagram: pipelined processor datapath (MIPS, 110,000 transistors).]



Issues with Pipelines

Pipelines generally help performance, but not always.

Possible issues:

• Stalls

• Dependent Instructions

• Branches (+Prediction)

• Self-Modifying Code

“Solution”: Bubbling, extra circuitry

[Diagram: waiting instructions entering a four-stage pipeline (Stage 1: Fetch, Stage 2: Decode, Stage 3: Execute, Stage 4: Write-back) and emerging as completed instructions over clock cycles 0–9.]



Intel Q6600 Pipeline

[Diagram: instruction fetch (6 instructions) → 32-byte pre-decode/fetch buffer → 18-entry instruction queue → one complex and three simple decoders (plus micro-code) → 7+-entry µop buffer → register alias table and allocator (4 µops/cycle) → 96-entry reorder buffer (ROB) with retirement register file (program-visible state) → 32-entry reservation station, issuing over Ports 0–5 to ALUs, SSE/shuffle units, 128-bit FMUL/FDIV, 128-bit FADD, a branch unit, and store-address/store-data/load-address units feeding the memory ordering buffer (MOB) and the memory interface (128-bit load/store paths, internal results bus).]

New concept: Instruction-level parallelism (“Superscalar”)




Programming for the Pipeline

How to upset a processor pipeline:

for (int i = 0; i < 1000; ++i)
  for (int j = 0; j < 1000; ++j)
  {
    if (j % 2 == 0)
      do_something(i, j);
  }

. . . why is this bad?



A Puzzle

int steps = 256 * 1024 * 1024;
int[] a = new int[2];

// Loop 1
for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }

// Loop 2
for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }

Which is faster?

. . . and why?


Page 62: Two useful Strategies

Two useful Strategies

Loop unrolling:

for (int i = 0; i < 1000; ++i)
  do_something(i);

→

for (int i = 0; i < 1000; i += 2)
{
  do_something(i);
  do_something(i + 1);
}

Software pipelining:

for (int i = 0; i < 1000; ++i)
{
  do_a(i);
  do_b(i);
}

→

for (int i = 0; i < 1000; i += 2)
{
  do_a(i);
  do_a(i + 1);
  do_b(i);
  do_b(i + 1);
}


Page 63: SIMD

SIMD

Control Units are large and expensive.

Functional Units are simple and cheap.

→ Increase the Function/Control ratio: control several functional units with one control unit.

All execute same operation.

[Diagram: SIMD — a single instruction pool driving four processing units (PUs), each working on its own slice of a shared data pool.]

GCC vector extensions:

typedef int v4si __attribute__ ((vector_size (16)));

v4si a, b, c;
c = a + b;
// +, -, *, /, unary minus, ^, |, &, ~, %

Will revisit for OpenCL, GPUs.


Page 64: About HW1

About HW1

• Open-ended! Want: coherent thought about caches, memory access ordering, latencies, bandwidths, etc.
  See Berkeley CS267 (lec. 3, . . . ) for more hints.

• Also: Introduction to the machinery.
  (Clusters, running on them, SSH, git, forge)
  Why so much machinery?

• Linux lab machines: WWH 229/230


Page 65: Rearranging Matrix-Matrix Multiplication

Rearranging Matrix-Matrix Multiplication

Matrix Multiplication:

C_ij = ∑_k A_ik B_kj

[Figure: the matrices A, B, and C]



Page 71: Questions?

Questions?

?


Page 72: Image Credits

Image Credits

• Q6600: Wikimedia Commons
• Mainboard: Wikimedia Commons
• DIMM: sxc.hu/gobran11
• Q6600 back: Wikimedia Commons
• Core 2 die: Intel Corp. / lesliewong.us
• Basic cache: Wikipedia
• Cache associativity: based on Wikipedia
• Cache associativity vs miss rate: Wikipedia
• Q6600: Wikimedia Commons
• Cache Measurements: Igor Ostrovsky
• Pipeline stuff: Wikipedia
• Bubbly Pipeline: Wikipedia
• Q6600 Pipeline: Wikipedia
• SIMD concept picture: Wikipedia
