Computer architectures M
Latest processors
• First example (2009) Nehalem: mono/dual/quad/8-core processor, 2-way multithreaded (2/4/8/16 virtual processors), 64-bit parallelism, 4-wide superscalar (4 CISC instructions to the decoders in parallel => 4+3 = 7 u-ops per clock, 4 u-ops to the RAT per clock, 6 u-ops to the EUs per clock), 40-bit physical address => 1 TeraByte (one million MBytes), 45 nm technology. From 700M to 2G transistors on chip. Sandy Bridge: 32 nm technology -> 2.3G transistors
Introduction – new requirements
• No more clock frequency increase (power consumption – temperature – heat dissipation)
• Power consumption and configurations totally scalable
• QPI (QuickPath Interconnect)
• Larger caches (integrated L1, L2 and L3)
Nehalem Processor
[Figure: Nehalem processor block diagram — four cores (Core architecture but with Hyperthreading), each with first- and second-level caches, connected through a queue to the shared L3 cache; the "uncore" also contains the integrated memory controller (IMC, to DRAM), two QPI links (QPI 0 / QPI 1, plus miscellaneous I/O) and the power & clock unit.]
• Server: high core count, very high bandwidth and low latency for many different tasks. RAS (Reliability, Availability, Serviceability) of paramount importance
Cores, Caches and Links
[Figure: socket diagram — cores with 3 integrated dynamic DDR3 memory controllers.]
• Reconfigurable architecture
• Notebook: 1- or 2-core systems which must have reduced cost, low power consumption and low execution latency for single tasks
• Desktop: similar characteristics, but power consumption matters less. High bandwidth for graphical applications
http://www.agner.org/optimize/microarchitecture.pdf
Nehalem characteristics
• The cache sizes change according to the model
• QPI bus.
• Each processor (core) has a 32 KB instruction cache, 4-way set associative; a 32 KB data cache, 8-way set associative; and a unified (data and instructions) second-level cache of 256 KB, 8-way set associative. The second-level cache is not inclusive
• Each quad-core socket (node) relies on a maximum of three DDR3 channels, which reach up to 32 GB/s peak bandwidth. Each channel operates in independent mode and the controller handles the requests out of order (OOO) so as to minimize the total latency
• Each core can handle up to 10 data cache misses and up to 16 transactions concurrently (i.e. instruction and data retrievals from a higher-level cache). In comparison, Core 2 could handle 8 data cache misses and 14 transactions.
• Third level cache (inclusive cache).
• There is a central queue which handles the interconnection and arbitration between the cores and the "uncore" region (common to all cores), that is, the L3, the memory controller and the QPI interface
• From the performance point of view an inclusive L3 is the ideal configuration, since it makes it possible to handle the chip's coherence problems efficiently (see later) and avoids data replication. Since it is inclusive, any datum present in any core is present in L3 too (although possibly in a non-coherent state)
More powerful core
[Figure: Nehalem core pipeline — Instruction Fetch (with ITLB and 32 kB instruction cache) → Instruction Queue → Decoder (1 complex + 3 simple, 7 u-ops) → Rename/Allocate (4 u-ops) → Retirement Unit (ReOrder Buffer) → Reservation Station (6 u-ops) → Execution Units; on the memory side, DTLB, 2nd-level TLB, 32 kB data cache, 256 kB 2nd-level cache, then L3 and beyond. The two higher cache levels are unified.]
Macrofusion
• All Core macrofusion cases plus…
– CMP+Jcc macrofusion added for these branch conditions too (a sketch follows the list):
• JL/JNGE
• JGE/JNL
• JLE/JNG
• JG/JNLE
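Where this matters is in ordinary counted loops. Below is a minimal C sketch; the assembly in the comment is illustrative of what a typical compiler emits for the bound check, not taken from the slides:

/* The signed compare-and-branch at the bottom of this loop compiles
 * to a CMP + JL pair.  On Core this pair did not fuse (JL was not a
 * supported condition); on Nehalem CMP+JL is macrofused into a
 * single u-op, so the loop body shrinks by one u-op per iteration.
 *
 *   .loop:
 *       add  eax, [rsi + rcx*4]
 *       inc  ecx
 *       cmp  ecx, edi        ; i < n (signed)
 *       jl   .loop           ; fuses with the cmp on Nehalem
 */
int sum(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}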
Core loop detector
• Exploits hardware loop detection. The loop detector analyzes the branches and determines whether they form a loop (a fixed-direction jump with the same target in the reverse direction)
– Avoids the repetitive fetch and branch prediction
– … but still requires decoding each cycle
[Figure: Core front end — Fetch → Branch Prediction → Loop Stream Detector (18 CISC instructions) → Decode.]
Nehalem loop detector
• Similar concept, but a higher number of instructions is considered
• After the Loop Stream Detector, the last step is a separate stack engine which removes all u-ops regarding the stack. The u-ops which speculatively update the stack pointer are handled by a separate adder which writes to a "delta" register (a register apart – RSB) which is periodically synchronized with the architectural register containing the non-speculated stack pointer. U-ops which only manipulate the stack pointer therefore do not enter the execution units.
[Figure: Nehalem front end — Fetch → Branch Prediction → Decode → Loop Stream Detector (28 micro-ops); similar to the trace cache.]
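As a concrete illustration (a sketch — whether a given compiler's output actually fits is an assumption), a loop like the following decodes to only a handful of u-ops per iteration, so it can stream from the detector with fetch, branch prediction and (on Nehalem) decode idle:

/* Small hot loop: few enough instructions per iteration to be
 * captured by the loop detector (up to 18 CISC instructions before
 * decode on Core, up to 28 u-ops after decode on Nehalem).         */
void scale(float *x, float k, int n) {
    for (int i = 0; i < n; i++)
        x[i] *= k;
}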
Two-level branch predictor
• Two levels in order to include an ever increasing number of predictions. The mechanism is similar to that of the caches
• The first-level BTB has 256/512 entries (according to the model): if the data is not found there, a second level (activated upon access) with 2-8K entries is interrogated. This reduces the power consumption, since the second-level BTB is very seldom activated
• Increased number of RSB (Return Stack Buffer) slots (deeper speculative execution). A sketch of the two-level BTB lookup follows.
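A hedged C model of the two-level BTB lookup described above. Only the table sizes come from the text; the direct-mapped indexing, the tagging and the promotion of second-level hits into the first level are assumptions for illustration:

#include <stdint.h>
#include <stdbool.h>

#define L1_BTB_ENTRIES  256     /* 256/512 depending on the model  */
#define L2_BTB_ENTRIES  4096    /* 2K-8K depending on the model    */

typedef struct { uint64_t tag; uint64_t target; bool valid; } btb_entry;

static btb_entry l1_btb[L1_BTB_ENTRIES];
static btb_entry l2_btb[L2_BTB_ENTRIES];

/* Returns true and fills *target on a hit; the large second-level
 * table is touched (and burns power) only on a first-level miss.   */
bool btb_lookup(uint64_t pc, uint64_t *target) {
    btb_entry *e1 = &l1_btb[pc % L1_BTB_ENTRIES];
    if (e1->valid && e1->tag == pc) { *target = e1->target; return true; }

    btb_entry *e2 = &l2_btb[pc % L2_BTB_ENTRIES];  /* rarely reached */
    if (e2->valid && e2->tag == pc) {
        *e1 = *e2;              /* promote into the small fast level */
        *target = e2->target;
        return true;
    }
    return false;               /* no prediction: assume fall-through */
}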
More powerful memory subsystem
• Hierarchical TLB (4k pages):
– 1st-level instruction TLB: 128 slots, 4-way
– 1st-level data TLB: 64 slots, 4-way
– 2nd-level unified TLB: 512 slots, 4-way
• Fast access to non-aligned data. Greater freedom for the compiler
Nehalem internal architecture
• Nehalem's two TLB levels are dynamically allocated between the two threads.
• A very big difference from Core is the degree of cache coverage. Core had a 6 MB L2 cache and a TLB with 136 slots (4-way). Considering 4k pages, the coverage was 136x4x4KB = 2176 KB, about a third of the cache.
• Nehalem has a 576-slot DTLB (512 second level + 64 first level), which means 4x576 = 2304 entries, amounting to 2304x4KB = 9216 KB of covered memory — more than the entire L3 (8 MB). (The meaning is that an address translation has a great probability of targeting data in L3.)
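The coverage arithmetic, made explicit with the slide's own counting (slots x ways x 4 KB pages):

#include <stdio.h>

/* TLB reach = slots x ways x page size (4 KB pages assumed).       */
int main(void) {
    unsigned core_kb    = 136 * 4 * 4;          /* Core:    2176 KB */
    unsigned nehalem_kb = (512 + 64) * 4 * 4;   /* Nehalem: 9216 KB */
    printf("Core TLB coverage:    %u KB\n", core_kb);
    printf("Nehalem TLB coverage: %u KB (covers the 8 MB L3)\n",
           nehalem_kb);
    return 0;
}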
(Smart) caches
[Figure: each core has a 32 kB L1 instruction cache, a 32 kB L1 data cache and a 256 kB L2 cache; all cores share the L3 cache.]
• The L3 has a private power plane and operates at a private frequency (not the same as the cores') and at a higher voltage. This is because power must be spared, and big caches very often produce errors if their voltage is too low.
• Three hierarchical levels
• First level: 32 KB instructions (4-way) and 32 KB data (8-way)
• Second level: unified, 256 KB (8-way, 10 cycles access time)
• Third level: shared among the various cores. The size depends on the number of cores: for a quad-core, 8 MB, 16-way, latency 30-40 cycles
• Designed for future expansion
• Inclusive: all addresses in L1 and L2 are present in L3 too (possibly with different states)
• Each L3 line has n "core valid" bits (4 in the quad-core case) which indicate which cores, if any, have a copy of the line. (A datum in L1 or L2 is certainly in L3 too, but not vice versa.)
Inclusive cache vs. exclusive
[Figure: two quad-core diagrams (Core0-Core3 above an L3 cache), one for the exclusive case and one for the inclusive case.]
An example: the datum requested by Core 0 is not present in its L1 and L2 and is therefore requested from the common L3.
Inclusive cache vs. exclusive
[Figure: the lookup misses in L3 in both the exclusive and the inclusive case — MISS! MISS!]
• The requested datum cannot be retrieved from L3 either.
Inclusive cache vs. exclusive (miss)
[Figure: L3 miss in both cases — MISS! MISS!]
Exclusive: a request must be sent to the other cores. Inclusive: the datum is certainly not on the chip.
Inclusive cache vs. exclusive (hit)
[Figure: L3 hit in both cases — HIT! HIT!]
Exclusive: no further requests to the other cores — if the datum is in L3 it cannot be in the other cores. Inclusive: the datum could be in other cores too, but…
Inclusive cache vs. exclusive
[Figure: inclusive L3 hit with all four core-valid bits at 0 — HIT!]
The L3 cache has a directory (one bit per core) which indicates whether, and in which cores, the datum is present. A snoop is necessary only if exactly one bit is set (the datum may have been modified); if two or more bits are set, the line in L3 is "clean" and can be forwarded from L3 to the requesting core. Directory-based coherence.
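A hedged C sketch of that snoop-filtering rule for a quad-core; the data layout and helper names are invented for illustration:

#include <stdint.h>

typedef struct {
    uint8_t core_valid;    /* bit i set => core i may hold the line */
    /* ... tag, state, data ... */
} l3_line;

/* Number of core-valid bits set (GCC/Clang builtin). */
static int bits_set(unsigned v) { return __builtin_popcount(v & 0xF); }

/* On an L3 hit for a read from 'requester':
 * - no other bit set: serve directly from L3;
 * - two or more other bits set: the line is shared, hence clean,
 *   so it can be forwarded from L3 with no snoop;
 * - exactly one other bit set: that core may have modified the
 *   line, so only that single core must be snooped.               */
int core_to_snoop(const l3_line *line, int requester) {
    unsigned others = line->core_valid & ~(1u << requester) & 0xF;
    if (bits_set(others) == 1)
        return __builtin_ctz(others);  /* index of the single owner */
    return -1;                         /* no snoop needed           */
}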
Inclusive cache vs. exclusive
[Figure: exclusive L3 — MISS!, all cores must be tested; inclusive L3 — HIT! with core-valid bits 0 0 1 0, so only the core holding the datum (core 2) must be tested.]
Execution unit
Execute 6 u-ops/cycle:
• 3 memory u-ops: 1 load, 1 store address, 1 store data
• 3 "computational" u-ops
[Figure: Unified Reservation Station feeding 6 issue ports —
Port 0: Integer ALU & Shift, FP Multiply, Divide, SSE Integer ALU / Integer Shuffles;
Port 1: Integer ALU & LEA, FP Add, Complex Integer, SSE Integer Multiply;
Port 2: Load;
Port 3: Store Address;
Port 4: Store Data;
Port 5: Integer ALU & Shift, Branch, FP Shuffle, SSE Integer ALU / Integer Shuffles.]
Unified Reservation Station:
• Schedules operations to the execution units
• Single scheduler for all execution units
• Can be used by all-integer, all-FP workloads, etc.
• 6 ports, as in Core
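A hedged C sketch of the port binding in the figure. The "one fixed port per u-op kind" policy is a simplification: plain integer ALU u-ops can actually issue on ports 0, 1 or 5, and the real scheduler balances across them.

typedef enum { UOP_LOAD, UOP_STORE_ADDR, UOP_STORE_DATA,
               UOP_INT_ALU, UOP_FP_ADD, UOP_FP_MUL,
               UOP_BRANCH } uop_kind;

/* Map a u-op class to an issue port, per the figure above. */
int issue_port(uop_kind k) {
    switch (k) {
    case UOP_LOAD:       return 2;  /* one load per cycle           */
    case UOP_STORE_ADDR: return 3;  /* one store address per cycle  */
    case UOP_STORE_DATA: return 4;  /* one store data per cycle     */
    case UOP_FP_ADD:     return 1;
    case UOP_FP_MUL:     return 0;
    case UOP_BRANCH:     return 5;
    case UOP_INT_ALU:    return 0;  /* also available on 1 and 5    */
    }
    return -1;
}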
Loop Stream Detector
• Each fetch retrieves 16 bytes from the cache, which are inserted in a predecode buffer; from there, 6 instructions at a time are sent to an 18-instruction queue. 4 CISC instructions at a time are sent to the 4 decoders (when possible) and the decoded u-ops are sent to a 28-slot Loop Stream Detector
• A new technology is implemented (Unaligned Cache Accesses) which grants the same execution speed to aligned and non-aligned accesses (i.e. those crossing a cache line). Before, non-aligned accesses carried a big execution penalty, which very often prevented the use of particular instructions: from Nehalem on, this is no longer the case
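For illustration, the portable C idiom for an unaligned load; on pre-Nehalem parts the generated access paid a heavy penalty when it crossed a cache line, while from Nehalem on it runs essentially at aligned speed:

#include <string.h>
#include <stdint.h>

/* Read a 32-bit value at an arbitrary byte offset.  memcpy is the
 * portable way to express an unaligned load; compilers turn it into
 * a single load instruction on x86.                                */
uint32_t load_u32(const uint8_t *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}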
Nehalem internal architecture
EU for non-memory instructions
• The RAT can rename up to 4 u-ops per cycle (the number of physical registers differs according to the implementation). The renamed instructions are placed in the ROB and, when their operands are ready, inserted in the Unified Reservation Station
• The ROB and the RS are shared between the two threads. The ROB is statically subdivided: identical speculative execution "depth" for the two threads
• The RS stage is "competitively" shared according to the situation. A thread could be waiting for a memory datum and therefore be using few or no RS entries: it would be senseless to reserve entries for a blocked thread
Nehalem internal architecture
EU for memory instructions
• Up to 48 loads and 32 stores in the MOB (Memory Order Buffer)
• From the load and store buffers the u-ops access the caches hierarchically. As in PIV, caches and TLBs are dynamically subdivided between the two threads. Each core accepts up to 16 outstanding misses, for the best use of the increased memory bandwidth
Nehalem full picture
[Figure: Nehalem full picture — L3 shared among the chip cores; uncore.]
• Other improvements: SSE 4.2 instructions: string manipulation (very important for XML handling)
• Instructions for CRC manipulation (important for transmissions)
Execution parallelism improvement
• Increased ROB size (33% – 128 slots)
• Improvement of the related structures:

Structure              Core (Merom)   Core (Nehalem)   Comment
Reservation Stations   32             36               Dispatches operations to execution units
Load Buffers           32             48               Tracks all load operations allocated
Store Buffers          20             32               Tracks all store operations allocated
Nehalem vs Core
• It must be noted that the static sharing between threads provides each thread with a reduced number of ROB slots (64 = 128/2, instead of 96), but Intel states that in case of a single thread all resources are given to it…
• Core internal architecture modified for QPI
• The execution engine is the same. Some blocks were added for integer and FP optimisation. In practice the increased number of RS entries allows a full use of the EUs, which in Core were sometimes starved.
• For the same reason multithreading was reintroduced (and therefore the greater ROB – 128 slots vs 96 of Core – and the larger number of RS entries – 36 vs the 32 of Core)
• Load buffers are now 48 (32 in Core) and store buffers 32 (20 in Core). The buffers are partitioned between the threads (fairness).
Power consumption control
[Figure: each core has its own Vcc, frequency, sensors and PLL; the uncore/LLC has its own PLL; the Power Control Unit (PCU) receives BCLK and Vcc and drives a power supply gate per core.]
Power Control Unit:
• Current, temperature and power controlled in real time
• Flexible: sophisticated hardware algorithm for power consumption optimisation
• PLL = Phase-Locked Loop: from a base (quartz) frequency all the requested frequencies are generated
Core power consumption
[Figure: breakdown of the total power consumption.]
• Local clocks and logic (red): clocks and gates
• Clock distribution (blue): a high-frequency design requires an efficient global clock distribution
• Leakage currents (green): high-frequency systems are affected by unavoidable losses
Power minimisation
• Idle CPU states are called Cn
[Figure: idle power (W) vs. exit latency (ms) for states C0, C1, …, Cn.]
• The higher n is, the lower the power consumption, BUT the longer the time needed to exit the "idle" state
• The OS informs the CPU when no processes must be executed – privileged instruction MWAIT(Cn)
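A hedged sketch of how a kernel's idle loop would use the privileged MONITOR/MWAIT pair (ring 0 only; the C-state hint encoding in EAX is model-specific, so it is passed through here as an opaque parameter):

#include <stdint.h>

/* Park the CPU in the requested C-state until a write to the
 * monitored address or an interrupt wakes it.                      */
static inline void cpu_idle(uint32_t cstate_hint,
                            volatile void *wake_flag) {
    /* Arm the monitor on an address the scheduler writes to wake us. */
    __asm__ volatile("monitor" : : "a"(wake_flag), "c"(0), "d"(0));
    /* Enter the idle state selected by the hint in EAX.              */
    __asm__ volatile("mwait" : : "a"(cstate_hint), "c"(0));
}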
C states (before Nehalem)
[Figure: total power consumption split into clock distribution, leakage currents, and local clocks and logic.]
• C0 state: active CPU
• C1 and C2 states: the pipeline clock and the majority of the other clocks are blocked
• C3 state: all clocks blocked
• C4, C5 and C6 states: progressive reduction of the operating voltage
• The higher C values were the most expensive in exit time (voltage increase, state restore, pipeline restart, etc.)
• Cores had only one power plane, and all devices had to be idle before the voltage could be reduced
C states behaviour (Core)
[Figure sequence: Core 0 and Core 1 activity and core power over time.]
Task completed on Core 1. No task ready. Instruction MWAIT(C6).
C states behaviour (Core)
Core 1 execution stopped, its state saved and its clocks stopped. Core 0 keeps executing.
C states behaviour (Core)
Task completed on Core 0. No task ready. Instruction MWAIT(C6).
C states behaviour (Core)
Only now is it possible to reduce voltage and power.
C states behaviour (Core)
Core 1 interrupt: voltage increased, Core 1 clocks reactivated, state restored, and the instruction following MWAIT(C6) is executed. Core 0 stays idle. The voltage is increased for both processors.
C states behaviour (Core)
Core 0 interrupt: Core 0 goes to state C0 and the instruction following MWAIT(C6) is executed. Core 1 keeps executing.
C6 Nehalem
[Figure sequence: Cores 0-3 activity and core power over time.]
Cores 0, 1, 2, and 3 active. Separate power supply for each core!!
C6 Nehalem
Core 2 task completed. No task ready. MWAIT(C6).
C6 Nehalem
Core 2 stopped, its clocks stopped. Cores 0, 1, and 3 keep executing.
C6 Nehalem
Core 2 power gate turned off: its voltage is 0 and its state is C6. Cores 0, 1, and 3 keep executing.
C6 Nehalem
Core 0 task completed. No task ready. MWAIT(C6). Core 0 goes to C6. Cores 1 and 3 keep executing.
C6 Nehalem
Core 2 interrupt: Core 2 returns to C0 and resumes execution from MWAIT(C6). Cores 1 and 3 keep executing.
C6 Nehalem
Core 0 interrupt: power gate turned back on, clocks reactivated, state restored, execution resumes from MWAIT(C6). Cores 1, 2, and 3 keep executing.
Uncore losses (package C6)
[Figure: power breakdown for the cores (x N): local clocks and logic, clock distribution, losses; and for the uncore: uncore logic, uncore clock distribution, I/O.]
• All cores in state C6:
o Core power to ~0
o Uncore clocks blocked
o Uncore clock distribution blocked
o I/O in low power
• The entire package is in C6
Further power reduction
• When a core enters low-power C states its operating voltage is reduced, while that of the other cores is unmodified
• Memory:
o Memory clocks are blocked between requests when usage is low
o Memory refresh goes on in the package C3 (clocks blocked) and C6 (power down) states too
• Links:
o Low power when the processor increases its Cx
• The Power Control Unit monitors the interrupt frequency and changes the C states accordingly:
o The C states requested by the operating system depend on processor utilization
o With some low workloads the use rate can be low but latency can be of paramount importance (e.g. real-time systems)
o The CPU can implement complex behaviour optimisation algorithms
• The system changes the operating clock frequency according to the requirements in order to minimize the power consumption (processor P states)
• The Power Control Unit selects the operating voltage for each clock frequency, operating condition and silicon characteristics (see the sketch below)
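A minimal, hypothetical sketch of such a policy. The P-state table and the thresholds are invented for illustration; the real PCU algorithm also weighs temperature, current and silicon characteristics:

/* Pick a P-state (frequency/voltage pair) from recent utilization. */
typedef struct { unsigned mhz; unsigned mv; } pstate;

static const pstate table[] = {         /* P0..P3, fastest first    */
    { 3200, 1200 }, { 2800, 1100 }, { 2400, 1000 }, { 1600, 900 },
};

const pstate *select_pstate(unsigned utilization_pct) {
    if (utilization_pct > 80) return &table[0];  /* full speed      */
    if (utilization_pct > 50) return &table[1];
    if (utilization_pct > 20) return &table[2];
    return &table[3];                            /* deepest P-state */
}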
Turbo pre-Nehalem (Core)
[Figure: frequency bars for Core 0 and Core 1 with a lightly threaded workload, no Turbo — the inactive core's clock is stopped: power reduction in inactive cores.]
Turbo pre-Nehalem (Core)
[Figure: same workload with Turbo Mode — Core 1's clock is stopped and, in response to the workload, additional performance bins are added to Core 0 within the headroom.]
Turbo Nehalem
[Figure: four-core frequency bars, no Turbo — with a lightly threaded workload or one below TDP, the inactive cores are power gated: zero power for inactive cores.]
TDP: Thermal Design Power. An indication of the heat (energy) produced by a processor, which is also the maximum power the cooling system must dissipate. Measured in Watts.
• Turbo uses the available clock frequency headroom to maximize performance, both multi-threaded and single-threaded
Turbo Nehalem
[Figure: Cores 2 and 3 power gated (zero power for inactive cores); Turbo Mode adds additional performance bins (frequency increase) to Cores 0 and 1 within the headroom.]
Turbo Nehalem
[Figure: as above — in response to the workload, Turbo Mode adds additional performance bins within the headroom while the inactive cores are power gated.]
Turbo Nehalem
[Figure: all four cores active, running workloads below the Thermal Design Power — no Turbo.]
Turbo Nehalem
[Figure: all four active cores run workloads < TDP — Turbo Mode adds additional performance bins within the headroom to all of them. TDP = Thermal Design Power.]
Turbo Nehalem
[Figure: lightly threaded workload or < TDP — the inactive cores are power gated (zero power) while Turbo Mode adds additional performance bins within the headroom to the active cores.]
Turbo enabling
• Turbo Mode is transparent
– Frequency transitions are handled in hardware
– The operating system asks for P-state changes (frequency and voltage) in a transparent way, activating Turbo mode only when needed for better performance
– The Power Control Unit keeps the silicon within the required boundaries
Westmere
• Westmere is the name of the 32 nm Nehalem and is the basis of Core i3, Core i5 and the multi-core i7 line.
Characteristics
• 2-12 native cores (multithreaded => up to 24 logical processors)
• 12 MB L3 cache
• 1 GB page support
• Some versions have an integrated graphics controller
• A new instruction set (AES-NI) for the Advanced Encryption Standard (AES), and a new instruction PCLMULQDQ which executes carry-less multiplication as required by cryptography (e.g. disk encryption) — see the example below
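A short example of the carry-less multiply through the compiler intrinsic (compile with -mpclmul; the wrapper name clmul64 is ours):

#include <wmmintrin.h>   /* PCLMULQDQ intrinsic                     */
#include <stdint.h>

/* Carry-less multiplication of two 64-bit polynomials over GF(2),
 * the primitive Westmere adds for cryptography (e.g. GCM, CRC).
 * The immediate selects which 64-bit halves to multiply; 0x00
 * takes the low halves of both operands.  The 128-bit product is
 * returned in an XMM register.                                     */
__m128i clmul64(uint64_t a, uint64_t b) {
    return _mm_clmulepi64_si128(_mm_cvtsi64_si128((int64_t)a),
                                _mm_cvtsi64_si128((int64_t)b), 0x00);
}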
Roadmap
Hexa-Core Gulftown – 12 Threads
Westmere: 32nm Nehalem
EVGA W555 - Dual processor Westmere – (2x6x2) = 24 threads !!
[Board photo: video controller; processors; 6x2 DIMM DDR3 slots; 7 PCI slots; non-standard E-ATX size, 1 or 2 processors and overclocking; 8 SATA ports (2 at 6 Gb/s and 6 at 3 Gb/s); Intel 5520 chipset and two nForce 200 controllers under the heat sink.]
Cooling towers
Roadmap
Haswell: 22 nm technology – 3D tri-gate transistors – 14-stage pipeline – dual-channel DDR3 – 64 KB L1 (32 KB instructions + 32 KB data) and 256 KB L2 per core, reduced cache latency – can use DDR4 – three possible GPUs: the most powerful (GT3) has 20 EUs – integrated voltage regulator (moved from the motherboard to the chip) – better power consumption – up to 100 W TDP – 10% improved performance – 5 decoded CISC instructions, macrofused, produce 4 u-ops per clock – up to 8 u-ops dispatched per clock
Haswell front end
[Figure: Haswell front-end diagrams.]
Haswell execution unit
• Increasing the OoO window allows the execution units to extract more parallelism and thus improve single-threaded performance. Priority is given to extracting instruction-level parallelism
• 8 ports!!
[Figure: Haswell execution-unit diagram.]