Finishing out EECS 470: A few snapshots of the real world

Upload: scot-watson

Post on 05-Jan-2016

Real processors: How they differ from your project

• What we’ve talked about so far isn’t grounded in the real world in any meaningful way.
  – That is, we haven’t really looked at how real processors do things.
• Today we’ll look at two processors.
  – We’ll start with a 2003 core from AMD.
    • Lots of details available, close to your project.
  – Then jump to the latest Intel core.
    • Look at a performance issue.

AMD 64-bit core

Most taken from http://www.chip-architect.com/

Bit-interleaved busses running “North-South”

Integer Decode/Dispatch

• 3 types of instructions
  – Direct path
    • RISC-like
  – Vector path
    • Broken into smaller instructions via microcode.
  – Double
    • 128-bit instructions which can be broken into 2 independent 64-bit instructions (these are called Doubles)
    • Others are done via microcode
    • Most 128-bit SSE and SSE2 instructions are made into doubles.

RS

• Each cycle an instruction is issued into one of 3 lanes.
  – Each lane has
    • 8 RSs
    • 1 ALU
    • 1 AGU (Address Generation Unit)
  – Each RS sees broadcasts from all ALUs, AGUs, L/S units, etc.
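The lane and broadcast organization above can be sketched in C. This is only a rough illustration: the two-source entry format, the lowest-index select policy, and all type and function names (`RSEntry`, `Scheduler`, `rs_insert`, `rs_broadcast`, `rs_select`) are assumptions made here, not details taken from the AMD design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_LANES   3   /* lanes, per the slide */
#define RS_PER_LANE 8   /* reservation stations per lane, per the slide */

/* Hypothetical RS entry: two source operands, each waiting on a tag. */
typedef struct {
    bool    busy;
    bool    src_ready[2];
    uint8_t src_tag[2];
} RSEntry;

typedef struct {
    RSEntry rs[NUM_LANES][RS_PER_LANE];
} Scheduler;

/* Put an instruction into the first free RS of a lane.
 * Returns the slot index, or -1 if the lane is full. */
int rs_insert(Scheduler *s, int lane, const uint8_t tags[2], const bool ready[2]) {
    assert(lane >= 0 && lane < NUM_LANES);
    for (int i = 0; i < RS_PER_LANE; i++) {
        RSEntry *e = &s->rs[lane][i];
        if (!e->busy) {
            e->busy = true;
            for (int k = 0; k < 2; k++) {
                e->src_tag[k] = tags[k];
                e->src_ready[k] = ready[k];
            }
            return i;
        }
    }
    return -1;
}

/* A result tag is seen by every RS in every lane. */
void rs_broadcast(Scheduler *s, uint8_t tag) {
    for (int lane = 0; lane < NUM_LANES; lane++)
        for (int i = 0; i < RS_PER_LANE; i++)
            for (int k = 0; k < 2; k++) {
                RSEntry *e = &s->rs[lane][i];
                if (e->busy && !e->src_ready[k] && e->src_tag[k] == tag)
                    e->src_ready[k] = true;
            }
}

/* Pick one entry whose sources are all ready (lowest index first, an
 * assumed policy) and free it. Returns the slot index, or -1 if none. */
int rs_select(Scheduler *s, int lane) {
    assert(lane >= 0 && lane < NUM_LANES);
    for (int i = 0; i < RS_PER_LANE; i++) {
        RSEntry *e = &s->rs[lane][i];
        if (e->busy && e->src_ready[0] && e->src_ready[1]) {
            e->busy = false;
            return i;
        }
    }
    return -1;
}
```

An entry inserted with one source outstanding is not selectable until `rs_broadcast` is called with the matching tag, whichever lane produced it.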

Rename

• Break the physical register file into 2 parts (sort of like the P6 scheme with ARF/RoB)
  – 72 in-flight instructions are kept in the RoB
• The other structure is the IFFRF: Integer Future File and Register File
  – 16 registers of committed state
  – 16 “future registers”
  – 8 scratch-pad registers

Future file

• In the P6 scheme we had to look in 3 places for the data
  – The ARF
  – The RoB
  – The CDB (later)
• Here we look in the FF, or on the CDB-like things later.
  – The FF holds the speculative value if it is known.
  – At execution complete, instructions check to see if they were the last thing to dispatch that writes to a given architectural register.
    • This is done by tagging the FF with the RoB number.
  – If they were the last to have that AR as a destination, they update the FF.

How does the

• At issue we:
  – Check the FF for source operands
  – Reserve a spot in the RoB
  – Place our tag (RoB number) in the FF
  – Mark the FF entry as invalid
• At EX complete we:
  – Send RoB number and data to the CDB
  – Send data to the RoB
  – Update FF if tag matches
• At retire
  – Update ARF value (from RoB)
• At mispredict
  – Copy ARF value into FF.
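The four events above (issue, EX complete, retire, mispredict) can be walked through in a small C model. It is a minimal sketch under stated assumptions: a single linear 72-entry RoB, one destination per instruction, and made-up names (`Core`, `FFEntry`, `RobEntry`, `issue`, `complete`, `retire`, `recover`, `read_source`). None of these come from the AMD documentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_AR   16   /* committed registers, per the IFFRF slide */
#define ROB_SIZE 72   /* in-flight instructions, per the Rename slide */

typedef struct {      /* hypothetical future-file entry */
    uint64_t value;   /* speculative value, meaningful only when valid */
    uint8_t  tag;     /* RoB number of the last dispatched writer */
    bool     valid;   /* false while that writer is still executing */
} FFEntry;

typedef struct {      /* hypothetical RoB entry */
    uint64_t value;
    uint8_t  dest;    /* destination architectural register */
    bool     done;
} RobEntry;

typedef struct {
    uint64_t arf[NUM_AR];
    FFEntry  ff[NUM_AR];
    RobEntry rob[ROB_SIZE];
    unsigned head, tail, count;
} Core;

void core_init(Core *c) {
    c->head = c->tail = c->count = 0;
    for (int i = 0; i < NUM_AR; i++) {
        c->arf[i] = 0;
        c->ff[i] = (FFEntry){ .value = 0, .tag = 0, .valid = true };
    }
}

/* Source lookup: returns true with *value set if the FF holds the data,
 * otherwise false with *tag set to the RoB number to wait for. */
bool read_source(const Core *c, uint8_t src, uint64_t *value, uint8_t *tag) {
    assert(src < NUM_AR);
    if (c->ff[src].valid) { *value = c->ff[src].value; return true; }
    *tag = c->ff[src].tag;
    return false;
}

/* Issue: reserve a RoB slot, place its number in the FF, mark FF invalid. */
unsigned issue(Core *c, uint8_t dest) {
    assert(dest < NUM_AR && c->count < ROB_SIZE);
    unsigned r = c->tail;
    c->rob[r] = (RobEntry){ .value = 0, .dest = dest, .done = false };
    c->tail = (c->tail + 1) % ROB_SIZE;
    c->count++;
    c->ff[dest].tag = (uint8_t)r;
    c->ff[dest].valid = false;
    return r;
}

/* EX complete: data goes to the RoB; the FF is updated only on a tag match. */
void complete(Core *c, unsigned r, uint64_t data) {
    assert(r < ROB_SIZE);
    RobEntry *e = &c->rob[r];
    e->value = data;
    e->done = true;
    FFEntry *f = &c->ff[e->dest];
    if (f->tag == r) { f->value = data; f->valid = true; }
}

/* Retire: the oldest finished instruction updates the ARF from the RoB. */
void retire(Core *c) {
    assert(c->count > 0 && c->rob[c->head].done);
    RobEntry *e = &c->rob[c->head];
    c->arf[e->dest] = e->value;
    c->head = (c->head + 1) % ROB_SIZE;
    c->count--;
}

/* Mispredict: squash everything in flight and copy the ARF into the FF. */
void recover(Core *c) {
    c->head = c->tail = c->count = 0;
    for (int i = 0; i < NUM_AR; i++)
        c->ff[i] = (FFEntry){ .value = c->arf[i], .tag = 0, .valid = true };
}
```

When two in-flight instructions write the same register, only the younger one's completion makes the FF entry valid. The older result still reaches the ARF through the RoB at retire.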

What did the FF buy us?

• P6-like advantages
  – No free-list for PRF
  – Can just clear the RAT on mis-predict.
• But no need to access the RoB looking for data
  – RoB data only written once (EX complete) and only read once (Commit)
• Some pain
  – Early branch resolution looks hard

ROB

• It uses an 8-bit descriptor for 72 entries.

Re-Order-Buffer Tag definition

[Figure: 8-bit tag layout, bit 7 down to bit 0. Labels: wrap bit; Instruction In Flight Number; re-order buffer index 0...23; sub-index 0..2.]

1) A sub-index 0, 1 or 2 which identifies from which of the three lanes the instruction was dispatched.
2) A value 0..23 that identifies the "cycle" in which the instruction was dispatched. The "cycle counter" wraps to 0 after reaching 23.
3) A wrap bit. When two instructions have different wrap bits then the cycle counter has wrapped between the dispatches.
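A short C sketch of packing and comparing such tags follows. The exact bit positions are an assumption here (wrap bit in bit 7, index in bits 6..2, sub-index in bits 1..0), chosen because 1 + 5 + 2 bits fills the 8-bit descriptor. The age comparison further assumes that lane 0 is oldest within a dispatch cycle and that both tags belong to instructions in flight at the same time. All function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ROB_ROWS  24  /* "cycle" values 0..23 */
#define ROB_LANES 3   /* sub-index values 0..2 */

/* Pack a tag. Assumed layout: bit 7 = wrap, bits 6..2 = row, bits 1..0 = lane. */
uint8_t tag_make(unsigned wrap, unsigned row, unsigned lane) {
    assert(wrap < 2 && row < ROB_ROWS && lane < ROB_LANES);
    return (uint8_t)((wrap << 7) | (row << 2) | lane);
}

unsigned tag_wrap(uint8_t t) { return t >> 7; }
unsigned tag_row(uint8_t t)  { return (t >> 2) & 0x1F; }
unsigned tag_lane(uint8_t t) { return t & 0x3; }

/* Position among the 72 entries: 24 rows of 3 lanes. */
unsigned tag_slot(uint8_t t) { return tag_row(t) * ROB_LANES + tag_lane(t); }

/* True if a was dispatched before b. With equal wrap bits the smaller slot
 * is older; with different wrap bits the counter wrapped between the two
 * dispatches, so the larger slot is the older one. */
bool tag_older(uint8_t a, uint8_t b) {
    if (tag_wrap(a) == tag_wrap(b))
        return tag_slot(a) < tag_slot(b);
    return tag_slot(a) > tag_slot(b);
}

/* Tag for the same lane in the next dispatch cycle; the wrap bit toggles
 * when the cycle counter goes from 23 back to 0. */
uint8_t tag_next_row(uint8_t t) {
    unsigned row = tag_row(t) + 1;
    unsigned wrap = tag_wrap(t);
    if (row == ROB_ROWS) { row = 0; wrap ^= 1u; }
    return tag_make(wrap, row, tag_lane(t));
}
```

The wrap bit is what lets a tag from row 23 be recognized as older than a tag from row 0 dispatched just after it, without needing a wider counter.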

More on the RoB

• What is basically happening is that we have three RoBs
  – Each one is of size 24
  – We cycle through each one so that none gets ahead of the others.
  – Reduces read/write ports!

Mispredictions

• It looks like they wait until retirement to resolve all exceptions.
  – Mispredictions are treated as exceptions!
• They just clear everything and have the retired registers overwrite the speculative ones in the IFFRF.

More details.

• Each x86 instruction can launch both an ALU and an AGU operation
  – Because x86 has lots of memory operations this makes sense.
• ALUs broadcast the result tag one cycle early
  – So the RS can launch a dependent instruction to the ALU before the data arrives.

Intel’s Haswell

• Latest Intel microarchitecture
  – 22nm process
  – 4-wide OoO processor
  – x86
• An evolution, not a revolution
  – Very similar to architectures from the last 8 years.

http://www.anandtech.com/show/6355/intels-haswell-architecture

Intel

Basics

• Converts x86 instructions into micro-ops
  – RISC-like instructions
  – Even more basic than RISC in some cases
    • Loads and stores generally turn into two instructions
      – Address compute and memory access

What’s interesting?

• Seeing how things have changed compared to previous microarchitectures

• Transactional support

• Power issues

The three recent frontends

Buffer sizes

• 192 RoB entries

• 60 RS

• 72 Loads

• 42 stores

Other key features

• Transactional synchronization
  – Execute lock-protected section
  – Don’t acquire lock
  – If someone else is doing the same thing at the same time
    • Undo all memory accesses
    • Do again with locks.
• Why?
• New sleep states
  – More like handheld devices.
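The transactional synchronization steps listed above (run the section without the lock, fall back to the lock on a conflict) look roughly like this in C. It is a hedged sketch, not Intel's reference code. It uses the RTM intrinsics `_xbegin`, `_xend` and `_xabort` from `<immintrin.h>` only when the compiler defines `__RTM__` (for example GCC or Clang with `-mrtm`, on a CPU that supports the instructions); otherwise it always takes the lock. The spinlock, `shared_total`, and `add_elided` are made up for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#ifdef __RTM__
#include <immintrin.h>   /* _xbegin, _xend, _xabort, _XBEGIN_STARTED */
#endif

static atomic_int lock_word;   /* hypothetical fallback lock: 0 = free, 1 = held */
static long shared_total;      /* data protected by the lock */

static void lock_acquire(void) {
    while (atomic_exchange(&lock_word, 1)) { /* spin until the lock is free */ }
}

static void lock_release(void) {
    atomic_store(&lock_word, 0);
}

/* Adds amount to shared_total. Returns 1 if the critical section committed
 * as a transaction without taking the lock, 0 if it ran under the lock. */
int add_elided(long amount) {
#ifdef __RTM__
    if (_xbegin() == _XBEGIN_STARTED) {
        /* Reading the lock word puts it in the transaction's read set, so a
         * thread that takes the lock later aborts this transaction. */
        if (atomic_load(&lock_word) != 0)
            _xabort(0xff);       /* lock already held: give up speculation */
        shared_total += amount;  /* the lock-protected section, lock not taken */
        _xend();                 /* commit: the update becomes visible at once */
        return 1;
    }
    /* On abort the hardware undoes the memory accesses and resumes here. */
#endif
    lock_acquire();              /* do it again, this time with the lock */
    shared_total += amount;
    lock_release();
    return 0;
}
```

A production version would normally retry the transaction a few times before falling back; this sketch falls back after the first abort to stay close to the bullets.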

Microarchitecture and performance

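/* N matches the loop bound 0x17d78400 in the disassembly;
   counter is volatile, so every iteration loads and stores it. */
const unsigned N = 400 * 1000 * 1000;
volatile unsigned long long counter = 0;

void tightloop() {
    unsigned j;
    for (j = 0; j < N; ++j)
        counter += j;
}

void foo() { }

void loop_with_extra_call() {
    unsigned j;
    for (j = 0; j < N; ++j) {
        __asm__("call foo");
        counter += j;
    }
}

http://eli.thegreenplace.net/2013/12/03/intel-i7-loop-performance-anomaly/

tightloop() runs in .68 sec

loop_with_extra_call() runs in .60 sec

Why?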

0000000000400530 <tightloop>:
  400530: xor    %eax,%eax
  400532: nopw   0x0(%rax,%rax,1)
  400538: mov    0x200b01(%rip),%rdx        # 601040 <counter>
  40053f: add    %rax,%rdx
  400542: add    $0x1,%rax
  400546: cmp    $0x17d78400,%rax
  40054c: mov    %rdx,0x200aed(%rip)        # 601040 <counter>
  400553: jne    400538 <tightloop+0x8>
  400555: repz retq
  400557: nopw   0x0(%rax,%rax,1)

0000000000400560 <foo>:
  400560: repz retq

0000000000400570 <loop_with_extra_call>:
  400570: xor    %eax,%eax
  400572: nopw   0x0(%rax,%rax,1)
  400578: callq  400560 <foo>
  40057d: mov    0x200abc(%rip),%rdx        # 601040 <counter>
  400584: add    %rax,%rdx
  400587: add    $0x1,%rax
  40058b: cmp    $0x17d78400,%rax
  400591: mov    %rdx,0x200aa8(%rip)        # 601040 <counter>
  400598: jne    400578 <loop_with_extra_call+0x8>
  40059a: repz retq
  40059c: nopl   0x0(%rax)