Computer architectures M
Latest processors
• First example (2009) Nehalem: mono/dual/quad/8-core processor, 2-way multithreaded (2/4/8/16 virtual processors), 64-bit parallelism, 4-wide superscalar (4 CISC instructions to the decoders in parallel => 4+3 = 7 u-ops per clock, 4 u-ops to the RAT per clock, 6 u-ops to the EUs per clock), 40-bit physical address => 1 TeraByte (one million MBytes), 45 nm technology. From 700M to 2G transistors on chip. Sandy Bridge: 32 nm technology -> 2.3G transistors
Introduction – new requirements
• No more clock frequency increase (power consumption – temperature – heat dissipation)
• Power consumption and configurations totally scalable
• QPI (QuickPath Interconnect)
• Larger caches (integrated L1, L2 and L3)
Nehalem Processor
[Figure: Nehalem processor block diagram — four cores (Core architecture but with Hyperthreading), each with first- and second-level caches, connected through a queue to the shared L3 cache; the "uncore" also contains the integrated memory controller (IMC, to DRAM), two QPI links (QPI 0 / QPI 1, plus miscellaneous I/O) and the power & clock unit.]
• Server: high core count, very high bandwidth and low latency for many different tasks. RAS (Reliability, Availability, Serviceability) of paramount importance
Cores, Caches and Links
[Figure: socket diagram — cores with 3 integrated dynamic DDR3 memory controllers.]
• Reconfigurable architecture
• Notebook: 1- or 2-core systems which must have reduced cost, low power consumption and low execution latency for single tasks
• Desktop: similar characteristics, but power consumption matters less. High bandwidth for graphical applications
http://www.agner.org/optimize/microarchitecture.pdf
Nehalem characteristics
• The cache sizes change according to the model
• QPI bus.
• Each processor (core) has a 32 KB instruction cache, 4-way set associative; a 32 KB data cache, 8-way set associative; and a unified (data and instructions) second-level cache of 256 KB, 8-way set associative. The second-level cache is not inclusive
• Each quad-core socket (node) relies on a maximum of three DDR3 channels, which reach up to 32 GB/s peak bandwidth. Each channel operates in independent mode and the controller handles the requests out of order (OOO) so as to minimize the total latency
• Each core can handle up to 10 data cache misses and up to 16 transactions concurrently (i.e. instruction and data retrievals from a higher-level cache). In comparison, Core 2 could handle 8 data cache misses and 14 transactions.
• Third level cache (inclusive cache).
• There is a central queue which handles the interconnection and arbitration between the cores and the "uncore" region (common to all cores), that is, the L3, the memory controller and the QPI interface
• From the performance point of view an inclusive L3 is the ideal configuration, since it makes it possible to handle the chip's coherence problems efficiently (see later) and avoids data replication. Since it is inclusive, any datum present in any core is present in L3 too (although possibly in a non-coherent state)
More powerful core
[Figure: Nehalem core pipeline — Instruction Fetch (with ITLB and 32 kB instruction cache) → Instruction Queue → Decoder (1 complex + 3 simple, 7 u-ops) → Rename/Allocate (4 u-ops) → Retirement Unit (ReOrder Buffer) → Reservation Station (6 u-ops) → Execution Units; on the memory side, DTLB, 2nd-level TLB, 32 kB data cache, 256 kB 2nd-level cache, then L3 and beyond. The two higher cache levels are unified.]
Macrofusion
• All Core macrofusion cases plus…
– CMP+Jcc macrofusion added for these branch conditions too (a sketch follows the list):
• JL/JNGE
• JGE/JNL
• JLE/JNG
• JG/JNLE
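Where this matters is in ordinary counted loops. Below is a minimal C sketch; the assembly in the comment is illustrative of what a typical compiler emits for the bound check, not taken from the slides:

/* The signed compare-and-branch at the bottom of this loop compiles
 * to a CMP + JL pair.  On Core this pair did not fuse (JL was not a
 * supported condition); on Nehalem CMP+JL is macrofused into a
 * single u-op, so the loop body shrinks by one u-op per iteration.
 *
 *   .loop:
 *       add  eax, [rsi + rcx*4]
 *       inc  ecx
 *       cmp  ecx, edi        ; i < n (signed)
 *       jl   .loop           ; fuses with the cmp on Nehalem
 */
int sum(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}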
Core loop detector
• Exploits hardware loop detection. The loop detector analyzes the branches and determines whether they form a loop (a fixed-direction jump with the same target in the reverse direction)
– Avoids the repetitive fetch and branch prediction
– … but still requires decoding each cycle
[Figure: Core front end — Fetch → Branch Prediction → Loop Stream Detector (18 CISC instructions) → Decode.]
Nehalem loop detector
• Similar concept, but a higher number of instructions is considered
• After the Loop Stream Detector, the last step is a separate stack engine which removes all u-ops regarding the stack. The u-ops which speculatively update the stack pointer are handled by a separate adder which writes to a "delta" register (a register apart – RSB) which is periodically synchronized with the architectural register containing the non-speculated stack pointer. U-ops which only manipulate the stack pointer therefore do not enter the execution units.
[Figure: Nehalem front end — Fetch → Branch Prediction → Decode → Loop Stream Detector (28 micro-ops); similar to the trace cache.]
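As a concrete illustration (a sketch — whether a given compiler's output actually fits is an assumption), a loop like the following decodes to only a handful of u-ops per iteration, so it can stream from the detector with fetch, branch prediction and (on Nehalem) decode idle:

/* Small hot loop: few enough instructions per iteration to be
 * captured by the loop detector (up to 18 CISC instructions before
 * decode on Core, up to 28 u-ops after decode on Nehalem).         */
void scale(float *x, float k, int n) {
    for (int i = 0; i < n; i++)
        x[i] *= k;
}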
Two-level branch predictor
• Two levels in order to include an ever increasing number of predictions. The mechanism is similar to that of the caches
• The first-level BTB has 256/512 entries (according to the model): if the data is not found there, a second level (activated upon access) with 2-8K entries is interrogated. This reduces the power consumption, since the second-level BTB is very seldom activated
• Increased number of RSB (Return Stack Buffer) slots (deeper speculative execution). A sketch of the two-level BTB lookup follows.
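A hedged C model of the two-level BTB lookup described above. Only the table sizes come from the text; the direct-mapped indexing, the tagging and the promotion of second-level hits into the first level are assumptions for illustration:

#include <stdint.h>
#include <stdbool.h>

#define L1_BTB_ENTRIES  256     /* 256/512 depending on the model  */
#define L2_BTB_ENTRIES  4096    /* 2K-8K depending on the model    */

typedef struct { uint64_t tag; uint64_t target; bool valid; } btb_entry;

static btb_entry l1_btb[L1_BTB_ENTRIES];
static btb_entry l2_btb[L2_BTB_ENTRIES];

/* Returns true and fills *target on a hit; the large second-level
 * table is touched (and burns power) only on a first-level miss.   */
bool btb_lookup(uint64_t pc, uint64_t *target) {
    btb_entry *e1 = &l1_btb[pc % L1_BTB_ENTRIES];
    if (e1->valid && e1->tag == pc) { *target = e1->target; return true; }

    btb_entry *e2 = &l2_btb[pc % L2_BTB_ENTRIES];  /* rarely reached */
    if (e2->valid && e2->tag == pc) {
        *e1 = *e2;              /* promote into the small fast level */
        *target = e2->target;
        return true;
    }
    return false;               /* no prediction: assume fall-through */
}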
More powerful memory subsystem
• Hierarchical TLB (4k pages):
– 1st-level instruction TLB: 128 slots, 4-way
– 1st-level data TLB: 64 slots, 4-way
– 2nd-level unified TLB: 512 slots, 4-way
• Fast access to non-aligned data. Greater freedom for the compiler
Nehalem internal architecture
• Nehalem's two TLB levels are dynamically allocated between the two threads.
• A very big difference from Core is the degree of cache coverage. Core had a 6 MB L2 cache and a TLB with 136 slots (4-way). Considering 4k pages, the coverage was 136x4x4KB = 2176 KB, about a third of the cache.
• Nehalem has a 576-slot DTLB (512 second level + 64 first level), which means 4x576 = 2304 entries, amounting to 2304x4KB = 9216 KB of covered memory — more than the entire L3 (8 MB). (The meaning is that an address translation has a great probability of targeting data in L3.)
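The coverage arithmetic, made explicit with the slide's own counting (slots x ways x 4 KB pages):

#include <stdio.h>

/* TLB reach = slots x ways x page size (4 KB pages assumed).       */
int main(void) {
    unsigned core_kb    = 136 * 4 * 4;          /* Core:    2176 KB */
    unsigned nehalem_kb = (512 + 64) * 4 * 4;   /* Nehalem: 9216 KB */
    printf("Core TLB coverage:    %u KB\n", core_kb);
    printf("Nehalem TLB coverage: %u KB (covers the 8 MB L3)\n",
           nehalem_kb);
    return 0;
}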
(Smart) caches
[Figure: each core has a 32 kB L1 instruction cache, a 32 kB L1 data cache and a 256 kB L2 cache; all cores share the L3 cache.]
• The L3 has a private power plane and operates at a private frequency (not the same as the cores') and at a higher voltage. This is because power must be spared, and big caches very often produce errors if their voltage is too low.
• Three hierarchical levels
• First level: 32 KB instructions (4-way) and 32 KB data (8-way)
• Second level: unified, 256 KB (8-way, 10 cycles access time)
• Third level: shared among the various cores. The size depends on the number of cores: for a quad-core, 8 MB, 16-way, latency 30-40 cycles
• Designed for future expansion
• Inclusive: all addresses in L1 and L2 are present in L3 too (possibly with different states)
• Each L3 line has n "core valid" bits (4 in the quad-core case) which indicate which cores, if any, have a copy of the line. (A datum in L1 or L2 is certainly in L3 too, but not vice versa.)
Inclusive cache vs. exclusive
[Figure: two quad-core diagrams (Core0-Core3 above an L3 cache), one for the exclusive case and one for the inclusive case.]
An example: the datum requested by Core 0 is not present in its L1 and L2 and is therefore requested from the common L3.
Inclusive cache vs. exclusive
[Figure: the lookup misses in L3 in both the exclusive and the inclusive case — MISS! MISS!]
• The requested datum cannot be retrieved from L3 either.
Inclusive cache vs. exclusive (miss)
[Figure: L3 miss in both cases — MISS! MISS!]
Exclusive: a request must be sent to the other cores. Inclusive: the datum is certainly not on the chip.
Inclusive cache vs. exclusive (hit)
[Figure: L3 hit in both cases — HIT! HIT!]
Exclusive: no further requests to the other cores — if the datum is in L3 it cannot be in the other cores. Inclusive: the datum could be in other cores too, but…
Inclusive cache vs. exclusive
[Figure: inclusive L3 hit with all four core-valid bits at 0 — HIT!]
The L3 cache has a directory (one bit per core) which indicates whether, and in which cores, the datum is present. A snoop is necessary only if exactly one bit is set (the datum may have been modified); if two or more bits are set, the line in L3 is "clean" and can be forwarded from L3 to the requesting core. Directory-based coherence.
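A hedged C sketch of that snoop-filtering rule for a quad-core; the data layout and helper names are invented for illustration:

#include <stdint.h>

typedef struct {
    uint8_t core_valid;    /* bit i set => core i may hold the line */
    /* ... tag, state, data ... */
} l3_line;

/* Number of core-valid bits set (GCC/Clang builtin). */
static int bits_set(unsigned v) { return __builtin_popcount(v & 0xF); }

/* On an L3 hit for a read from 'requester':
 * - no other bit set: serve directly from L3;
 * - two or more other bits set: the line is shared, hence clean,
 *   so it can be forwarded from L3 with no snoop;
 * - exactly one other bit set: that core may have modified the
 *   line, so only that single core must be snooped.               */
int core_to_snoop(const l3_line *line, int requester) {
    unsigned others = line->core_valid & ~(1u << requester) & 0xF;
    if (bits_set(others) == 1)
        return __builtin_ctz(others);  /* index of the single owner */
    return -1;                         /* no snoop needed           */
}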
Inclusive cache vs. exclusive
[Figure: exclusive L3 — MISS!, all cores must be tested; inclusive L3 — HIT! with core-valid bits 0 0 1 0, so only the core holding the datum (core 2) must be tested.]
Execution unit
Execute 6 u-ops/cycle:
• 3 memory u-ops: 1 load, 1 store address, 1 store data
• 3 "computational" u-ops
[Figure: Unified Reservation Station feeding 6 issue ports —
Port 0: Integer ALU & Shift, FP Multiply, Divide, SSE Integer ALU / Integer Shuffles;
Port 1: Integer ALU & LEA, FP Add, Complex Integer, SSE Integer Multiply;
Port 2: Load;
Port 3: Store Address;
Port 4: Store Data;
Port 5: Integer ALU & Shift, Branch, FP Shuffle, SSE Integer ALU / Integer Shuffles.]
Unified Reservation Station:
• Schedules operations to the execution units
• Single scheduler for all execution units
• Can be used by all-integer, all-FP workloads, etc.
• 6 ports, as in Core
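A hedged C sketch of the port binding in the figure. The "one fixed port per u-op kind" policy is a simplification: plain integer ALU u-ops can actually issue on ports 0, 1 or 5, and the real scheduler balances across them.

typedef enum { UOP_LOAD, UOP_STORE_ADDR, UOP_STORE_DATA,
               UOP_INT_ALU, UOP_FP_ADD, UOP_FP_MUL,
               UOP_BRANCH } uop_kind;

/* Map a u-op class to an issue port, per the figure above. */
int issue_port(uop_kind k) {
    switch (k) {
    case UOP_LOAD:       return 2;  /* one load per cycle           */
    case UOP_STORE_ADDR: return 3;  /* one store address per cycle  */
    case UOP_STORE_DATA: return 4;  /* one store data per cycle     */
    case UOP_FP_ADD:     return 1;
    case UOP_FP_MUL:     return 0;
    case UOP_BRANCH:     return 5;
    case UOP_INT_ALU:    return 0;  /* also available on 1 and 5    */
    }
    return -1;
}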
Loop Stream Detector
• Each fetch retrieves 16 bytes from the cache, which are inserted in a predecode buffer; from there, 6 instructions at a time are sent to an 18-instruction queue. 4 CISC instructions at a time are sent to the 4 decoders (when possible) and the decoded u-ops are sent to a 28-slot Loop Stream Detector
• A new technology is implemented (Unaligned Cache Accesses) which grants the same execution speed to aligned and non-aligned accesses (i.e. those crossing a cache line). Before, non-aligned accesses carried a big execution penalty, which very often prevented the use of particular instructions: from Nehalem on, this is no longer the case
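For illustration, the portable C idiom for an unaligned load; on pre-Nehalem parts the generated access paid a heavy penalty when it crossed a cache line, while from Nehalem on it runs essentially at aligned speed:

#include <string.h>
#include <stdint.h>

/* Read a 32-bit value at an arbitrary byte offset.  memcpy is the
 * portable way to express an unaligned load; compilers turn it into
 * a single load instruction on x86.                                */
uint32_t load_u32(const uint8_t *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}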
Nehalem internal architecture
EU for non-memory instructions
• The RAT can rename up to 4 u-ops per cycle (the number of physical registers differs according to the implementation). The renamed instructions are placed in the ROB and, when their operands are ready, inserted in the Unified Reservation Station
• The ROB and the RS are shared between the two threads. The ROB is statically subdivided: identical speculative execution "depth" for the two threads
• The RS stage is "competitively" shared according to the situation. A thread could be waiting for a memory datum and therefore be using few or no RS entries: it would be senseless to reserve entries for a blocked thread
Nehalem internal architecture
EU for memory instructions
• Up to 48 loads and 32 stores in the MOB (Memory Order Buffer)
• From the load and store buffers the u-ops access the caches hierarchically. As in PIV, caches and TLBs are dynamically subdivided between the two threads. Each core accepts up to 16 outstanding misses, for the best use of the increased memory bandwidth
Nehalem full picture
[Figure: Nehalem full picture — L3 shared among the chip cores; uncore.]
• Other improvements: SSE 4.2 instructions: string manipulation (very important for XML handling)
• Instructions for CRC manipulation (important for transmissions)
Execution parallelism improvement
• Increased ROB size (33% – 128 slots)
• Improvement of the related structures:

Structure              Core (Merom)   Core (Nehalem)   Comment
Reservation Stations   32             36               Dispatches operations to execution units
Load Buffers           32             48               Tracks all load operations allocated
Store Buffers          20             32               Tracks all store operations allocated
Nehalem vs Core
• It must be noted that the static sharing between threads provides each thread with a reduced number of ROB slots (64 = 128/2, instead of 96), but Intel states that in case of a single thread all resources are given to it…
• Core internal architecture modified for QPI
• The execution engine is the same. Some blocks were added for integer and FP optimisation. In practice the increased number of RS entries allows a full use of the EUs, which in Core were sometimes starved.
• For the same reason multithreading was reintroduced (and therefore the greater ROB – 128 slots vs 96 of Core – and the larger number of RS entries – 36 vs the 32 of Core)
• Load buffers are now 48 (32 in Core) and store buffers 32 (20 in Core). The buffers are partitioned between the threads (fairness).
Power consumption control
[Figure: each core has its own Vcc, frequency, sensors and PLL; the uncore/LLC has its own PLL; the Power Control Unit (PCU) receives BCLK and Vcc and drives a power supply gate per core.]
Power Control Unit:
• Current, temperature and power controlled in real time
• Flexible: sophisticated hardware algorithm for power consumption optimisation
• PLL = Phase-Locked Loop: from a base (quartz) frequency all the requested frequencies are generated
Core power consumption
[Figure: breakdown of the total power consumption.]
• Local clocks and logic (red): clocks and gates
• Clock distribution (blue): a high-frequency design requires an efficient global clock distribution
• Leakage currents (green): high-frequency systems are affected by unavoidable losses
Power minimisation
• Idle CPU states are called Cn
[Figure: idle power (W) vs. exit latency (ms) for states C0, C1, …, Cn.]
• The higher n is, the lower the power consumption, BUT the longer the time needed to exit the "idle" state
• The OS informs the CPU when no processes must be executed – privileged instruction MWAIT(Cn)
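A hedged sketch of how a kernel's idle loop would use the privileged MONITOR/MWAIT pair (ring 0 only; the C-state hint encoding in EAX is model-specific, so it is passed through here as an opaque parameter):

#include <stdint.h>

/* Park the CPU in the requested C-state until a write to the
 * monitored address or an interrupt wakes it.                      */
static inline void cpu_idle(uint32_t cstate_hint,
                            volatile void *wake_flag) {
    /* Arm the monitor on an address the scheduler writes to wake us. */
    __asm__ volatile("monitor" : : "a"(wake_flag), "c"(0), "d"(0));
    /* Enter the idle state selected by the hint in EAX.              */
    __asm__ volatile("mwait" : : "a"(cstate_hint), "c"(0));
}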
C states (before Nehalem)
[Figure: total power consumption split into clock distribution, leakage currents, and local clocks and logic.]
• C0 state: active CPU
• C1 and C2 states: the pipeline clock and the majority of the other clocks are blocked
• C3 state: all clocks blocked
• C4, C5 and C6 states: progressive reduction of the operating voltage
• The higher C values were the most expensive in exit time (voltage increase, state restore, pipeline restart, etc.)
• Cores had only one power plane, and all devices had to be idle before the voltage could be reduced
C states behaviour (Core)
[Figure sequence: Core 0 and Core 1 activity and core power over time.]
Task completed on Core 1. No task ready. Instruction MWAIT(C6).
C states behaviour (Core)
Core 1 execution stopped, its state saved and its clocks stopped. Core 0 keeps executing.
C states behaviour (Core)
Task completed on Core 0. No task ready. Instruction MWAIT(C6).
C states behaviour (Core)
Only now is it possible to reduce voltage and power.
C states behaviour (Core)
Core 1 interrupt: voltage increased, Core 1 clocks reactivated, state restored, and the instruction following MWAIT(C6) is executed. Core 0 stays idle. The voltage is increased for both processors.
C states behaviour (Core)
Core 0 interrupt: Core 0 goes to state C0 and the instruction following MWAIT(C6) is executed. Core 1 keeps executing.
C6 Nehalem
[Figure sequence: Cores 0-3 activity and core power over time.]
Cores 0, 1, 2, and 3 active. Separate power supply for each core!!
C6 Nehalem
Core 2 task completed. No task ready. MWAIT(C6).
C6 Nehalem
Core 2 stopped, its clocks stopped. Cores 0, 1, and 3 keep executing.
C6 Nehalem
Core 2 power gate turned off: its voltage is 0 and its state is C6. Cores 0, 1, and 3 keep executing.
C6 Nehalem
Core 0 task completed. No task ready. MWAIT(C6). Core 0 goes to C6. Cores 1 and 3 keep executing.
C6 Nehalem
Core 2 interrupt: Core 2 returns to C0 and resumes execution from MWAIT(C6). Cores 1 and 3 keep executing.
C6 Nehalem
Core 0 interrupt: power gate turned back on, clocks reactivated, state restored, execution resumes from MWAIT(C6). Cores 1, 2, and 3 keep executing.
Uncore losses (package C6)
[Figure: power breakdown for the cores (x N): local clocks and logic, clock distribution, losses; and for the uncore: uncore logic, uncore clock distribution, I/O.]
• All cores in state C6:
o Core power to ~0
o Uncore clocks blocked
o Uncore clock distribution blocked
o I/O in low power
• The entire package is in C6
Further power reduction
• When a core enters low-power C states its operating voltage is reduced, while that of the other cores is unmodified
• Memory:
o Memory clocks are blocked between requests when usage is low
o Memory refresh goes on in the package C3 (clocks blocked) and C6 (power down) states too
• Links:
o Low power when the processor increases its Cx
• The Power Control Unit monitors the interrupt frequency and changes the C states accordingly:
o The C states requested by the operating system depend on processor utilization
o With some low workloads the use rate can be low but latency can be of paramount importance (e.g. real-time systems)
o The CPU can implement complex behaviour optimisation algorithms
• The system changes the operating clock frequency according to the requirements in order to minimize the power consumption (processor P states)
• The Power Control Unit selects the operating voltage for each clock frequency, operating condition and silicon characteristics (see the sketch below)
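A minimal, hypothetical sketch of such a policy. The P-state table and the thresholds are invented for illustration; the real PCU algorithm also weighs temperature, current and silicon characteristics:

/* Pick a P-state (frequency/voltage pair) from recent utilization. */
typedef struct { unsigned mhz; unsigned mv; } pstate;

static const pstate table[] = {         /* P0..P3, fastest first    */
    { 3200, 1200 }, { 2800, 1100 }, { 2400, 1000 }, { 1600, 900 },
};

const pstate *select_pstate(unsigned utilization_pct) {
    if (utilization_pct > 80) return &table[0];  /* full speed      */
    if (utilization_pct > 50) return &table[1];
    if (utilization_pct > 20) return &table[2];
    return &table[3];                            /* deepest P-state */
}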
Turbo pre-Nehalem (Core)
[Figure: frequency bars for Core 0 and Core 1 with a lightly threaded workload, no Turbo — the inactive core's clock is stopped: power reduction in inactive cores.]
Turbo pre-Nehalem (Core)
[Figure: same workload with Turbo Mode — Core 1's clock is stopped and, in response to the workload, additional performance bins are added to Core 0 within the headroom.]
Turbo Nehalem
[Figure: four-core frequency bars, no Turbo — with a lightly threaded workload or one below TDP, the inactive cores are power gated: zero power for inactive cores.]
TDP: Thermal Design Power. An indication of the heat (energy) produced by a processor, which is also the maximum power the cooling system must dissipate. Measured in Watts.
• Turbo uses the available clock frequency headroom to maximize performance, both multi-threaded and single-threaded
Turbo Nehalem
[Figure: Cores 2 and 3 power gated (zero power for inactive cores); Turbo Mode adds additional performance bins (frequency increase) to Cores 0 and 1 within the headroom.]
Turbo Nehalem
[Figure: as above — in response to the workload, Turbo Mode adds additional performance bins within the headroom while the inactive cores are power gated.]
Turbo Nehalem
[Figure: all four cores active, running workloads below the Thermal Design Power — no Turbo.]
Turbo Nehalem
[Figure: all four active cores run workloads < TDP — Turbo Mode adds additional performance bins within the headroom to all of them. TDP = Thermal Design Power.]
Turbo Nehalem
[Figure: lightly threaded workload or < TDP — the inactive cores are power gated (zero power) while Turbo Mode adds additional performance bins within the headroom to the active cores.]
Turbo enabling
• Turbo Mode is transparent
– Frequency transitions are handled in hardware
– The operating system asks for P-state changes (frequency and voltage) in a transparent way, activating Turbo mode only when needed for better performance
– The Power Control Unit keeps the silicon within the required boundaries
Westmere
• Westmere is the name of the 32 nm Nehalem and is the basis of Core i3, Core i5 and the multi-core i7 line.
Characteristics
• 2-12 native cores (multithreaded => up to 24 logical processors)
• 12 MB L3 cache
• 1 GB page support
• Some versions have an integrated graphics controller
• A new instruction set (AES-NI) for the Advanced Encryption Standard (AES), and a new instruction PCLMULQDQ which executes carry-less multiplication as required by cryptography (e.g. disk encryption) — see the example below
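A short example of the carry-less multiply through the compiler intrinsic (compile with -mpclmul; the wrapper name clmul64 is ours):

#include <wmmintrin.h>   /* PCLMULQDQ intrinsic                     */
#include <stdint.h>

/* Carry-less multiplication of two 64-bit polynomials over GF(2),
 * the primitive Westmere adds for cryptography (e.g. GCM, CRC).
 * The immediate selects which 64-bit halves to multiply; 0x00
 * takes the low halves of both operands.  The 128-bit product is
 * returned in an XMM register.                                     */
__m128i clmul64(uint64_t a, uint64_t b) {
    return _mm_clmulepi64_si128(_mm_cvtsi64_si128((int64_t)a),
                                _mm_cvtsi64_si128((int64_t)b), 0x00);
}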
Roadmap
Hexa-Core Gulftown – 12 Threads
Westmere: 32nm Nehalem
EVGA W555 - Dual processor Westmere – (2x6x2) = 24 threads !!
[Board photo: video controller; processors; 6x2 DIMM DDR3 slots; 7 PCI slots; non-standard E-ATX size, 1 or 2 processors and overclocking; 8 SATA ports (2 at 6 Gb/s and 6 at 3 Gb/s); Intel 5520 chipset and two nForce 200 controllers under the heat sink.]
Cooling towers
Roadmap
Haswell: 22 nm technology – 3D tri-gate transistors – 14-stage pipeline – dual-channel DDR3 – 64 KB L1 (32 KB instructions + 32 KB data) and 256 KB L2 per core, reduced cache latency – can use DDR4 – three possible GPUs: the most powerful (GT3) has 20 EUs – integrated voltage regulator (moved from the motherboard to the chip) – better power consumption – up to 100 W TDP – 10% improved performance – 5 decoded CISC instructions, macrofused, produce 4 u-ops per clock – up to 8 u-ops dispatched per clock
Haswell front end
[Figure: Haswell front-end diagrams.]
Haswell execution unit
• Increasing the OoO window allows the execution units to extract more parallelism and thus improve single-threaded performance. Priority is given to extracting instruction-level parallelism
• 8 ports!!
[Figure: Haswell execution-unit diagram.]