haswell:the fourth-generation i core...

.................................................................................................................................................................................................................

HASWELL: THE FOURTH-GENERATIONINTEL CORE PROCESSOR

.................................................................................................................................................................................................................

HASWELL, THE FOURTH-GENERATION INTEL CORE PROCESSOR ARCHITECTURE, DELIVERS A

RANGE OF CLIENT PARTS, A CONVERGED CORE FOR THE CLIENT AND SERVER, AND

TECHNOLOGIES USED ACROSS MANY PRODUCTS. IT USES AN OPTIMIZED VERSION OF INTEL

22-NM PROCESS TECHNOLOGY. HASWELL PROVIDES ENHANCEMENTS IN POWER-

PERFORMANCE EFFICIENCY, POWER MANAGEMENT, FORM FACTOR AND COST, CORE AND

UNCORE MICROARCHITECTURE, AND THE CORE’S INSTRUCTION SET.

......Haswell, the fourth-generationIntel Core Processor, delivers a family of pro-cessors with new innovations.1,2 Haswelldelivers a range of client parts, a convergedmicroprocessor core for the client and server,and technologies used across many products.

Many of Haswell’s innovations are in theareas of improving power-performance effi-ciency and power management. Power-performance efficiency has been enhanced toincrease the processor’s operating range andimprove its inherent performance in power-limited scenarios and its battery life.Improvements in power management in-clude additional idle states, specifically thenew active idle state S0ix, which enables 20�reduction in idle power. One key enabler forpower-performance improvements is thefully integrated voltage regulator (FIVR),which also improves board space and cost.

Performance improvements in the core andgraphics come with corresponding improve-ments in cache hierarchies; the first two cachelevels have twice the bandwidth. For thetop graphics configurations, Intel Iris ProGraphics, Haswell also introduces a newfourth-level, 128-Mbyte on-package cachethat enables a new level of integrated graphicsperformance.

Haswell is a “tock”—a significant micro-architecture change over the previous-generation Ivy Bridge. Haswell is built withan SoC design approach that allows fast andeasy creation of derivatives and variations onthe baseline. Graphics and media come withmore scalability that lets designers build effi-cient configurations from the lowest to highestend. The core comes with power-performanceenhancements and a set of new instructions,such as floating-point fused multiply-add(FMA) and transactional synchronizationextensions (TSX).

Haswell uses an enhanced version of Intel’s22-nm process technology, which has en-hanced tri-gate transistors to reduce leakagecurrent by a factor of 2� to 3� with the samefrequency capability. Haswell’s version of the22-nm process has 11 metal interconnectlayers, compared to nine for Ivy Bridge, to opti-mize for better performance, area, and cost.

Power efficiency and managementCurrent processors operate in power-

constrained modes; they must maximize theperformance they deliver inside a fixed powerenvelope. This power constraint is true forboth server and mobile applications. One of

Per Hammarlund

Alberto J. Martinez

Atiq A. Bajwa

David L. Hill

Erik Hallnor

Hong Jiang

Martin Dixon

Michael Derr

Mikal Hunsaker

Rajesh Kumar

Randy B. Osborne

Ravi Rajwar

Ronak Singhal

Reynold D’Sa

Robert Chappell

Shiv Kaushik

Srinivas Chennupaty

Stephan Jourdan

Steve Gunther

Tom Piazza

Ted Burton

Intel.......................................................

6 Published by the IEEE Computer Society 0272-1732/14/$31.00�c 2014 IEEE

the most important goals of a new processorgeneration is to dramatically improve power-performance efficiency. In Figure 1, the basicnonlinear relationship between power andperformance is shown in the solid line. To im-prove power-performance efficiency across thevoltage-frequency scaling range, we mustachieve three goals, as shown in the dashed line:

� extending the operating range down-ward to allow the processor to go intosmaller form factors that are evenmore power constrained,

� improving the basic power-perform-ance efficiency of the processor bypushing each operating point to theright and down, and

� extending the operating range upwardfor more burst and Turbo headroom.

In Haswell, we employ multiple techniquesto improve power-performance efficiency. Wecan describe them in three categories: low-levelimplementation, high-level architecture, andplatform power management.

Examples of low-level implementationimprovements include the following:

� Optimized manufacturing, process tech-nology, and circuits help achieve allthree goals just listed. These improve-ments are enabled by Intel’s manu-facturing capability and a deepcollaboration across the different Intelteams.

� Optimized microarchitecture andalgorithms. In each generation, weevaluate for sufficient power-performance efficiency. Areas that fallbelow our goals will be reimple-mented in ways that improve thepower-performance efficiency.

� Optimization of design and imple-mentation through continued focuson gating unused logic and usinglow-power modes.

An example of a high-level architectureimprovement in Haswell is extending the useof independent voltage-frequency domains.Figure 2 shows a conceptual block diagram ofthe different voltage-frequency domains.Cores, caches, graphics, and the system agentare all running at dedicated, individually con-trolled voltage-frequency points. A power con-trol unit (PCU) dynamically allocates thepower budget among the domains to maxi-mize performance. Prioritization based on

Pow

er

Performance

Figure 1. Power and performance voltage-

frequency scaling improvements. The

baseline (solid line) is improved (dashed

line) by being lowered and by being

extended for better burst and Turbo

headroom.

DMI

Systemagent

Core

Core

Core

Core LLC

LLC

LLC

LLC

Processor graphics

IMCDisplay

PCI Express*

Figure 2. Conceptual block diagram of the

Haswell processor showing the different

independent voltage domains. The figure

also shows Haswell’s cache hierarchy and

memory controller, which features

bandwidth, load balancing, and DRAM

efficiency improvements.

.............................................................

MARCH/APRIL 2014 7

runtime characteristics select the domain withthe highest-performance return. For example,for a graphics-focused workload, most of theprocessor power is allocated to the graphicsdomain. Sufficient power is allocated to therest of the blocks that the graphics domaindepends on for performance, such as the sys-tem agent to provide memory bandwidth.

At a platform level, we improved batterylife to deliver “all-day experiences.” To achievethis, we focused both on active workloads,such as media playback, and on idle power.Haswell achieves a 20� improvement inidle power. Haswell has evolutionary power-management improvements, such as improve-ments in C-states (CPU idle states). Haswellhas both new, deeper C-states and improve-ments in the entry-exit latencies to C-states.These latency improvements let Haswell moreaggressively enter deep C-states.

Haswell also has revolutionary power-management improvements—for example,the introduction of a new active idle-powerstate, S0ix. We leverage learnings from pastphone and tablet development to deliver 20�improvements in idle power compared to theprior generation. This improvement enablessignificant improvements in realizable batterylife. S0ix appears to software as an active state,while in actuality the hardware autonomouslyenters and exits deep idle states with lowlatency. The new power state is transparent towell-written software. Power management ofplatform components is continuous and finegrained; everything that is not needed is indi-vidually turned off.

Fully integrated voltage regulatorPower delivery to higher-performance

processors comes with many conflictingrequirements, such as the need for higherpower for extended burst capability, a greaternumber of individually controlled voltagerails, and the need for a physically smallerfootprint for new form factors. In responseto these requirements, Haswell processorsare powered by a 140-MHz, multiphaseFIVR.2-4 The industry’s first large-scaledeployment of high current switching regula-tors integrated into a VLSI die and package.FIVR is the enabling technology behind keyHaswell improvements, including a 2� to3� increase in peak available power (which

converts into burst performance), a substan-tial battery life increase, and a 70 to 80 per-cent platform footprint reduction.

Figure 3 gives an overview of FIVR. Afirst-stage voltage regulator (VR), which is onthe motherboard, converts from the powersupply or battery voltage (12 to 20 V) toapproximately 1.8 V, and the second conver-sion stage is provided by parallel FIVRs (onefor each major architectural domain). Asillustrated, FIVR eliminates four VRs fromthe prior platform. To support the new IntelIris Pro Graphics variants of Haswell, thoseplatform VRs would have grown in both sizeand number. With FIVR, a platform-sizereduction opportunity was achieved insteadof what would have been a substantialgrowth. That platform space can be used toadd platform features, increase the batterysize, and reduce the platform dimensions inmany Haswell mobile products.

At the onset of the Haswell design, FIVR’sexpected benefits fell into half a dozencategories:

� Battery life increase. FIVR’s 140-MHzswitching frequency enables severalorders of magnitude less output decou-pling and much lower input decouplingthan the prior generation’s voltage rails,allowing input and output voltages tobe quickly reduced or powered off tosave power, and quickly ramped backup for brief high-performance bursts.

� Increased available power for in-creased burst performance, whereFIVR can direct the entire packagepower to the unit that needs the mostpower, compared to separate VRs forseparate units in the previous platform.

� Decreased power required for a givenlevel of performance or, almostequivalently, increased performancefor a given power consumed.

� Decreased platform cost and sizefrom removal of components andexternal power rails.

� Improved product flexibility andscalability; for example, new unitscan be added with little impact onthe platform power-delivery systems.

FIVR delivered benefits in every category,some larger than expected.

..............................................................................................................................................................................................

HOT CHIPS

............................................................

8 IEEE MICRO

Haswell Peripheral Controller HubSignificant power reduction was achieved

by the power-management improvements inthe Peripheral Controller Hub (PCH).The Haswell PCH is responsible for

providing I/O management, and its activity

depends on direct-memory-access traffic to

and from peripheral devices in the PCI,

USB, Serial ATA, Wi-Fi and audio-voice-

speech subsystems.

VRVR

VccInVccIn

DDRx

Vccin

3.5 to 1

surface area

Core VRvariablevoltage

PLL VR1.8 V

0 V-1.2 V

Ivy Bridgeprocessor

Ivy Bridge platform

(a)

(b)

GraphicsVRvariablevoltage0 V-1.2 V

System agentVR

DDRx

DDRx

DDRVR

Haswell platform

Haswell processor

FIVR VRs:VccsaVccio

0 V-1.8 V

Logic

VccioaVccCore 1VccCore 2VccCore 3

blocksVccCore 4VccCacheGraphics0Graphics1

DDRx

DDR

VccEDRAMVccOPIO

Example voltage planes

input/outputVR 1.0 V

Figure 3. Example of possible platform improvements. Haswell’s fully integrated voltage

regulator (FIVR), shown at the bottom, enables substantial board space and cost savings

compared to the Ivy Bridge platform, shown at the top (a). Multiple voltage regulators (VRs) in

the previous platform are combined into one VR (b).

.............................................................

MARCH/APRIL 2014 9

A significant innovation is the introduc-tion of exit latency timers that are pro-grammed with the latency demands of devicesattached to the I/O links. The power manage-ment controller (PMC) in the PCH uses theselatencies to calculate and gracefully adapt theI/O subsystem from “high performance withlow latency” to “low power with higherlatency” by stopping clocks, shutting downphase-locked loops and, finally, locally power-gating the I/O control modules of each systemindependently. These changes allowed for a40� power reduction in the I/O subsystemfrom the previous-generation PCH.

Haswell graphics and mediaHaswell graphics and media are built to

scale across a wide range of processor config-urations, from low to very high integratedgraphics and media performance. To achievethis scale, Haswell graphics were built fromthe start with scale in mind (see Figure 4).

Haswell graphics and media are roughlysplit into six domains:

� Global assets, including geometryfront-end up to setup.

� Slice common—shared functions,including a rasterizer, level-3 cache(internal to graphics), and pixel back-end.

� Subslice, including shaders (execu-tion units [EUs]), instruction caches,and texture samplers. These subslicesare scalable in number to achieve thedesired performance.

� Multiformat video codec engine.� Video quality enhancement engine.� Display pipelines.

EUs are general-purpose programmablecores that support a rich instruction set thathas been optimized to support various 3DAPI shader languages as well as media func-tions (primarily video) processing.

Shared functions are hardware units thatprovide specialized supplemental functional-ity for the EUs. A shared function is imple-mented where the demand for a givenspecialized function is insufficient to justifythe costs on a per-EU basis. Instead, a singleinstantiation of that specialized function isimplemented as a stand-alone entity outsidethe EUs and shared among the EUs.

The graphics and media configurationscan vary by the number of subslices, videodecoders, and samplers to vary power andperformance profiles.

The generic and traditional renderingpipeline—fetch!shade (vertex, hull, do-main, geometry), rasterize, pixel shade—maps onto the global assets. A global (acrossgraphics) cache provides storage and coher-ence between shared elements and allowsspecific cache configurations for GPGPU(general-purpose computing on GPUs)computing.

The media functions are woven into thegraphics architecture and provide a scalableand programmable option for video encod-ing, decoding, and postprocessing. The func-tions include the following:

� fixed function, such as multiformatvideo decoders sized for ultra high-power efficiency;

� scalable assets, such as motion esti-mation hardware; and

� programmable postprocessing filters.

Haswell graphics are designed to belatency tolerant, because they share the mem-ory subsystem with latency-sensitive CPUcores. Graphics and media share last-levelcache in a programmable fashion with theCPU, allowing for a software-tunablecache-sharing policy for graphics andCPU hardware for optimal performance.This is further enhanced in the presence ofthe larger embedded DRAM (eDRAM)cache, which lets graphics access a highbandwidth.

Cache hierarchy and the eDRAM cacheHaswell delivers substantial performance

improvements in cores, media, and graphics,and needs a corresponding improvement inmemory bandwidth. In addition to the tradi-tional double-data-rate (DDR) memory-speed enhancements, Haswell has furtherimprovements in cache hierarchy pipelines,single-thread streaming write bandwidth, loadbalancing, and memory-scheduling efficiency.

We optimized all of the pipelines of thecache and memory hierarchy for efficiencyimprovements. Microarchitecture work im-proved pipeline efficiency. For example,

..............................................................................................................................................................................................

HOT CHIPS

............................................................

10 IEEE MICRO

optimizing the concurrency of requests—that is, handling different kinds of requests inseparate pipelines—resulted in efficiencyimprovement of up to 40 percent. Further-more, the cache hierarchy exploits the weaklyordered nature of write-combining stores toincrease the maximum number of concurrentrequests from a single Intel Architecture (IA)core from 10 to 40þ. This change greatlyincreased the streaming write bandwidth tomemory for single-thread workloads.

Load balancing between request agentswas improved. The credit-based bandwidthmanagement system was optimized to effi-ciently share resources. The management

system achieves fairness between reliablelow-latency accesses to resources with thepossibility of high-bandwidth access for,for example, the graphics engine.

Haswell also has improvements in thememory schedulers for better write through-put. The Haswell memory controllers havedeeper pending queues, more decoupling,and better scheduling algorithms.

These cache hierarchy improvements aregood for most of the Haswell configurations.For the Haswell top-end graphics configura-tion, Intel Iris Pro Graphics, we needed evenmore bandwidth and developed a dedicatedsolution.

Commandstreamer (CS)

Vertexfetch (VF)

Vertexshader (VS)

Video frontend (VFE) Video

qualityengine

Multi-format

CODECBlitter Display

3D samplerEU

EU

EU

EU

Mediasampler

Data port

Tex$

3D samplerEU

EU

EU

EU

Mediasampler

Data port

Tex$

L1 IC$

L1 IC$

Rasterizer/depth

L3$ Pixelops

Render$depth$

Hull shader(HS)

Tessellator

Rin

g b

us/L

LC/m

emor

y

Domainshader

(DS)

Geometryshader(GS)

Stream-out (SOL)

Clip/setup

Thre

ad d

isp

atch

Figure 4. Haswell graphics block diagram. Haswell graphics and media are built with a modular approach, which enables scale

across a range of system configurations.

.............................................................

MARCH/APRIL 2014 11

eDRAM memory bandwidth solutionMeeting Intel Iris Pro Graphics perform-

ance targets required much more bandwidththan the two channels of DDRx for clients.Additional external memory channels wouldhave been very costly and had adverse effectson the physical size in the platform. Thebandwidth solution we implemented is a128-Mbyte L4 (fourth-level) cache providing102 Gbytes/second peak bandwidth(51 Gbytes/second read and simultaneouslyup to 51 Gbytes/second write). The L4 cachedata store is a discrete die made using InteleDRAM process technology, providing bothhigh-density memory and high-speed logicfor high-bandwidth I/Os.5-7 The eDRAM

die is on-package to exploit close proximityfor low latency and low-power interconnect.

The L4 cache architecture with eDRAMgives the following benefits compared toother candidates:

� superior bandwidth per watt overDDRx and GDDRx,

� a unified memory bandwidth solu-tion for both IA and graphics,

� minimized motherboard real estatefor small form factors, and

� an in-package memory solution toenable package-based performanceupgrade opportunities.

The Intel Iris Pro Graphics CPU andeDRAM are in a multichip package (MCP)connected using a full-duplex on-packageI/O (OPIO), as shown in Figure 5. The CPUhosts the L4 cache controller, and the tagsand the enhancements needed in the powercontrol unit. A low-power, high-bandwidthlink connects the CPU to the L4 andeDRAM cache die.

The eDRAM was architected to be inte-grated with the existing cache hierarchy withminimal impact. The eDRAM acts primarilyas a victim cache for the L3, being filled byevictions. Unlike the on-die L3 cache, theeDRAM is not inclusive of the core caches,allowing graphics data to be read and writtendirectly without the need to fill into the on-die L3, saving the L3 storage for morelatency-critical IA core accesses.

The eDRAM tags are stored in traditionalstatic RAM (SRAM) on the processor die. Tosave power and area, each tag entry, called asuperline, represents a 1-Kbyte region madeup of sixteen 64-byte cache lines. All incom-ing requests look up the on-die L3 andeDRAM in parallel.

Intel Iris Pro Graphics adds several cachecontrols that help graphics use the L3 andeDRAM caches. Either cache can be parti-tioned between IA or graphics traffic to pro-vide quality of service. Graphics can preventsurfaces from allocating in the caches if theywould not benefit from caching. Graphics isalso able to use the eDRAM to cache display-able surfaces for the first time.

OPIO exploits the short trace lengthswithin the MCP to simplify I/O and clock-ing circuits to significantly reduce power,

CPU/eDRAM

Multichip package

(a)

(b)

OPIOinterface

Rx

Tx

CPU

Controlle

CLK

REQ

Rx

Tx

Rx

Tx

data

data

request

r

Sideband

Rx

Tx

Rx

Tx

eDRAMCLK

REQ

Rx

Tx

Rx

Tx

Sideband

Rx

Tx

Figure 5. Intel Iris Pro Graphics multichip package ball grid array: photo of

the package view (a) and block diagram (b).

..............................................................................................................................................................................................

HOT CHIPS

............................................................

12 IEEE MICRO

area, and latency while providing high band-width. OPIO uses only 1 W of total power todeliver 102 Gbytes/second. This is 3� band-width at 1/10 total power compared to the32 Gbytes/second DDR3.

The Intel Iris Pro Graphics eDRAM die isIntel’s first product instantiation of eDRAMtechnology. The ability to combine denseeDRAM memory technology and high-speedlogic gave the capability to have a single diewith both large capacity storage and highbandwidth.

The eDRAM die is implemented in Intel22-nm technology. The eDRAM array is atotal of 128 Mbytes and architected for high-bandwidth efficiency (also called low-loadedlatency). There are 128 banks, each with a rowcycle time of 4 ns, minimizing both the proba-bility of a bank conflict and the penalty whenit occurs. The array operates at 1.6 GHz, proc-essing a command per clock. The entire data-path to the array and within the array is fullduplex, enabling write data transfers simulta-neous with read data transfers. Each datatransfer takes two clocks.

The Intel Iris Pro Graphics Power ControlUnit (PCU) dynamically manages the eDRAMdevice state depending on runtime evaluationof workload characteristics, performance goals,and energy efficiency considerations. The PCU

maintains the eDRAM subsystem in one ofthree states: on, off, or self-refresh. In the“on” state, both eDRAM and OPIO are attheir active voltage, and the eDRAM control-ler is sending refreshes to the eDRAM deviceacross the OPIO link. In the “off” state, botheDRAM and OPIO clocks are stopped, andvoltage is reduced to 0 V. In “self-refresh,”EDRAM is kept at its active voltage, butclocks in the OPIO domain are stopped andvoltage is dropped to 0 V. While the pro-cessor is active, the PCU can choose betweenon and off depending on the workload andpower and performance goals. While the pro-cessor is idle, the PCU can choose to turneDRAM off or put it into the self-refreshstate. The PCU periodically evaluates theeDRAM’s potential performance benefitsand decides whether to power on theeDRAM subsystem.

eDRAM performanceThe Haswell graphics performance im-

provements are the main bandwidth driverthat motivates eDRAM. Figure 6 shows theperformance gain for 128-Mbyte eDRAMover a baseline of two-channel DDR3-1600only, for a broad set of graphics benchmarksand game titles (including a beta version of

1.00

DirectX

*11 C

iviliz

ation

5*

LateV

iew19

20x1

200

DirectX

*10 H

awx*

1920

x120

0

DirectX

*10 F

ar C

ry 2*

1920

x120

0

Left4

Dead2*

2048

x153

6

Left4

Dead2*

1920

x120

0

DOTA2*

1920

x120

0

DOTA2*

2048

x153

6

DirectX

*10 R

esiden

tEvil

5* 19

20x1

200

3DMar

k06*

scor

e

DirectX

*11 A

lien v

s Pre

dator*

2560

x160

0

DirectX

*11 C

iviliz

ation

5* Le

ader

s 192

0x12

00

Left4

Dead2*

1920

x120

0 Anti‐A

liasin

g

DirectX

*10 A

ssas

sin’s

Creed

* 192

0x12

00

DirectX

*11 S

tone G

iant 1

920x

1200

DirectX

*11 S

tone G

iant 2

560x

1600

PCMark V

antag

e* sc

ore

1.10

1.20

1.30

1.40

1.50

1.60

1.70

1.80Speedup vs. no eDRAM

Figure 6. Performance for graphics workloads using the Intel Iris Pro Graphics eDRAM cache.

(See the disclaimer in the Acknowledgments section.)

.............................................................

MARCH/APRIL 2014 13

DOTA2). On average, eDRAM can increaseperformance by 30 to 40 percent.

The 128M eDRAM cache is large enoughto exploit both inter- and intraframe datareuse. It’s well known that graphics data, suchas textures, are reused multiple times insideof a single frame. Extensive presilicon simula-tion, confirmed by silicon measurements,showed that data can also be reused betweenframes.

In addition to providing high-bandwidthpower and energy efficiently to boost graphicsperformance, eDRAM is also carefully de-signed to minimize latency, as the outstand-ingly small latency sensitivity with loadsustainable for random traffic shown inFigure 7, which is especially important foreDRAM to benefit not only graphics but alsogeneral CPU workloads.

Haswell coreHaswell’s core is the next major evolution

in general-purpose out-of-order microarchi-tecture, delivering both higher performanceand improved power efficiency across a broadrange of workloads and form factors.

Like its predecessor, Ivy Bridge, theHaswell front-end pipeline supplies micro-

operations (lops) for execution from two pri-mary sources. The first source, a traditionalinstruction cache/decoder pipeline, supplieslops by decoding up to 16 bytes per cycle ofcomplex instructions into up to four com-pound lops. The second source, a lop cache,stores decoded lops natively and suppliesthem at a rate equivalent to 32 bytes per cycle.

Up to four compound lops per cycle allo-cate resources for out-of-order execution andare split into simple lops. Up to eight simplelops per cycle can be executed by heteroge-neous execution ports. Once complete, up tofour compound lops can be retired per cycle.Each Haswell core shares its executionresources between two threads of executionvia Intel Hyperthreading.

Haswell’s core contains several innova-tions for performance and power, includingthe following:

� Intelligent speculation. Haswell decou-ples branch prediction, instructiontag accesses, and instruction transla-tion look-aside buffer (TLB) accessesfrom the supply of lops to execution.State-of-the-art advances in branchprediction algorithms enable accuratefetch requests to “run ahead” of lop

Iris pro (eDRAM) Iris (no eDRAM)

CRW loaded latency

105110

859095

100

70758085

55606570

Late

ncy

(ns)

40455055

3035

0 5 10 15 20 25 30 35 40 45 50

100% Rd BW (GBps)

Figure 7. Loaded latency of the Intel Iris Pro Graphics eDRAM cache. (See the disclaimer in

the Acknowledgments section.)

..............................................................................................................................................................................................

HOT CHIPS

............................................................

14 IEEE MICRO

supply to hide instruction TLB andcache misses. A new data TLB pre-fetcher keeps page walk latency fromdelaying many memory accesses.

� A large out-of-order window. Haswellmaintains 192 lops in flight in itsreorder buffer and the supportingstructures, such as load buffers (72),store buffers (42), reservation stations(60), and physical register files (168).This is approximately a 15 percentincrease over Ivy Bridge to extractmore parallelism.

� Raw execution horsepower. Haswellprovides eight heterogeneous execu-tion ports, an increase over the six inIvy Bridge. The two new ports andadditions to existing ports combineto provide a fourth integer ALU, asecond branch unit, a second float-ing-point multiplier, and a third storeaddress-generation unit. The addi-tional resources provide higher peakthroughput and fewer false resourceconflicts. See Figure 8 for details.

� Aggressive clock and data gating, andnew dynamic power-saving modes. TheHaswell core achieves power efficiencyby aggressively gating idle logic. Newpower-saving modes adapt to work-load phases to save power without sac-rificing performance.

With these improvements, the Haswell coredelivers improvements across a range ofworkloads.

Instruction set enhancementsHaswell delivers performance improve-

ments on legacy and unchanged codes, andalso adds new instructions for even more per-formance on suitable workloads.

Advanced Vector Extensions 2With Advanced Vector Extensions 2

(AVX2), Haswell adds instructions for FMA,256-bit integer vector computation, full-width element permute, and vector gather tobenefit high-performance computing, audioand video processing, and games.

Unified reservation station

Port 0

Port 1

Port 2

Port 3

Port 4

Port 5

Port 6

Port 7

IntergerALU and shift

IntergerALU and LEA

Load andstore address

Storedata

2xFMA• Doubles peak flops• Two FP multiples benefit legacy

• Great for integer workloads

• Reduces Port 0 conflicts

• 2nd EU for high branch code• Leaves Ports 2 and 3 open for loads

New AGU for stores

Branch

Vectorlogicals

Vector intALU

Vectorshuffle

IntegerALU and LEA

IntegerALU and shift

Storeaddress

• Frees Ports 0 and 1 for vector

New branch unit

4th ALU

FMAFP multiply

FMA FP multFP add

Vector intmultiply

Vector intALU

Vectorlogicals

Vectorlogicals

Branch

Divide

Vectorshifts

Figure 8. Haswell’s execution ports and execution units. The figure shows which functional units are mapped onto which

port, and which new units and ports have been added. (LEA: load effective address.)

.............................................................

MARCH/APRIL 2014 15

Each Haswell core provides up to 32single-precision or 16 double-precision float-ing-point operations per cycle using AVX2’sFMA instructions and Haswell’s two FMAhardware units. FMA operations can be donewith the same latency as a floating-pointmultiply (five cycles) and are fully pipelined.This achieves a 1.6� latency reduction versusthe previous generation. Haswell’s FMA canbe used for both computational throughputenhancement and for latency reductions.

The Haswell core complements the addi-tion of AVX2 with doubled L1 and L2 cachebandwidth. Each L1 data cache port, twoloads and one store, natively supports full-width AVX2 operations at 32 bytes. The L2cache can return a full cache line of data, 64bytes, to the L1 data cache. Despite thesebandwidth increases, the L1 and L2 cache sizesand latencies remain the same as Ivy Bridge.

New instructions for hashing and cryptographyHaswell provides several packages of new

instructions to benefit high-value workloadsthat rely on bit-field manipulation and arbi-trary precision arithmetic. These new instruc-tions benefit many algorithms ranging fromfast indexing, to cyclical redundancy checking(CRC), to widespread encryption algorithms.Figure 9 summarizes the performance.

Intel Transactional SynchronizationExtensions

With Intel Transactional SynchronizationExtensions (Intel TSX), Haswell adds hard-ware support to improve the performance oflock-based synchronization commonly usedin multithreaded applications. These applica-tions take advantage of the increasingnumber of cores to improve performance.However, when writing such applications,programmers must use concurrency-controlprotocols to ensure threads properly coordi-nate access to shared data. Otherwise, threadsmight observe unexpected data values, result-ing in failures. These protocols ensure thatthreads serialize access to shared data, oftenby using a critical section. A software con-struct, referred to as a lock, protects access tothe critical section.

Because serialized access to a critical sec-tion limits parallelism, programmers try toreduce the impact of critical sections, eitherby minimizing synchronization or by usingfine-granularity locks, where multiple locksprotect different data. Unfortunately, this is adifficult and error-prone process and a com-mon source of bugs. Furthermore, pro-grammers must use information availableonly at the time of program development todecide whether synchronization is required

2.4

SHA-256

SHA-256MultiBuffer

RSA-2048

AES-GCM

2.2

2.0

1.8

1.6

1.4

1.2

1.0Westmere Sandy Bridge Ivy Bridge Haswell

SHA-256:RORX – Rotates, AVX2

RSA 2048:MULX – Multiply

AES GSM:AES-NI

PCLMULQDQ

SHA-256 MBAVX2 – Wider

Figure 9. Examples of performance improvements with new instructions in the Haswell core.

(See the disclaimer in the Acknowledgments section.)

..............................................................................................................................................................................................

HOT CHIPS

............................................................

16 IEEE MICRO

and when to serialize. This often leads pro-grammers to use synchronization conserva-tively, thus limiting parallelism.

To address this challenge, Intel TSX pro-vides hardware support for the processor todetermine dynamically if it should serializecritical section execution and expose any hid-den concurrency.

OverviewWith Intel TSX, the processor executes

critical sections, also referred to as transac-tional regions, transactionally using a techni-que known as lock elision. Such an executiononly reads the lock, and does not acquire it.This makes the critical section available toother threads executing at the same time.

However, because the lock no longer seri-alizes access, hardware now ensures thatthreads access shared data correctly by main-taining the illusion of exclusive access to thedata. It does so by tracking memory addressesaccessed and by buffering any memoryupdates performed within the transactionalexecution. The processor also checks theseaccesses against conflicting accesses fromother threads. A conflicting access occurs ifanother thread reads a location that thistransactional thread wrote, or another threadwrites a location that was accessed (eitherusing a read or a write) by the transactionalthread. When such an access occurs, thehardware alone cannot maintain the illusionof exclusive access.

To commit a transactional execution, thehardware ensures that all memory operationsperformed during the execution appear tooccur instantaneously when viewed fromother threads. Furthermore, any memoryupdates during execution become visible toother threads only after a successful commit.

Not all transactional executions can be suc-cessfully committed. For example, the execu-tion could encounter conflicting data accessesfrom other threads. In that event, the processorperforms a transactional abort. This processdiscards all updates, memory, and registersduring the execution and makes it appear as ifthe transactional execution never occurred.The subsequent re-execution can retry lock eli-sion or fall back to acquiring the lock.

Because a successful transactional execu-tion ensures an atomic commit, the processor

can execute the programmer-specified codesection optimistically without synchronizingthrough the lock. If synchronization wasunnecessary, the execution can commit with-out any cross-thread serialization.

Programming interfaceIntel TSX provides two programming

interfaces to specify transactional regions.The first interface, called Hardware Lock Eli-sion (HLE), is a pair of legacy-compatibleprefixes called XACQUIRE and XRELEASE.These prefixes appear as NOPs to previous-generation cores. The second interface, calledRestricted Transactional Memory (RTM), isa pair of new instructions called XBEGINand XEND. Programmers who also want torun Intel TSX-enabled software on hardwarewithout Intel TSX support would use theHLE interface to implement lock elision.Programmers who do not have legacy hard-ware requirements and who deal with morecomplex locking primitives would use theRTM interface to implement lock elision.

Programmers can use the HLE interfaceby adding prefixes to the instructions thatperform the lock acquire and release. Pro-grammers can use the RTM interface by pro-viding an additional code path to the existingsynchronization routines. In this path,instead of acquiring the lock, the routine usesthe XBEGIN instruction and provides a fall-back handler to execute if a transactionalabort occurs. The code also tests the lock var-iable inside the transactional region to ensurethat it’s free and to enable the hardware tolook for subsequent conflicts. These changesto enable lock elision are localized to syn-chronization routines; the application itselfdoes not need to be changed. This substan-tially eases enabling of applications to takeadvantage of Intel TSX.

Intel TSX provides two additional in-structions—XTEST and XABORT. TheXTEST instruction tests if a logical processoris executing transactionally, whereas the XA-BORT instruction can explicitly abort atransactional region.

Implementation on HaswellThe first implementation of Intel TSX on

the fourth-generation core processor uses thefirst-level 32-Kbyte data cache (L1) to track

.............................................................

MARCH/APRIL 2014 17

the memory addresses accessed (both read andwritten) during a transactional execution andto buffer any transactional updates performedto memory. The implementation makes theseupdates visible to other threads only on a suc-cessful commit. The implementation uses thecache coherence protocol to detect conflictingaccesses from other threads. Because hardwareis finite, transactional regions that access exces-sive state can exceed hardware buffering. Evict-ing a transactionally written line from the datacache will cause a transactional abort. How-ever, evicting a transactionally read line doesnot immediately cause an abort. The hardwaremoves the line to a secondary structure forsubsequent tracking.

The Intel 64 Architecture Software Devel-oper Manual has a detailed specification forIntel TSX,8 and the Intel 64 Architecture Opti-mization Reference Manual provides detailedguidelines for program optimization withIntel TSX.9 The Intel TSX web resources site(www.intel.com/software/tsx) presents informa-tion on various tools and practical guidelines.

H aswell uses an optimized version ofIntel 22-nm process technology to

provide comprehensive enhancements inpower-performance efficiency, power man-agement, form factor and cost, core anduncore microarchitecture enhancements, andnew instructions in the core. For example,the core delivers performance enhancementsfor high-performance computing with thenew FMA instructions and for parallel work-loads with the new Intel TSX synchroniza-tion primitives.10,11

During Haswell’s design, we focused onperformance, power, and form factors toensure that Haswell delivers a compellinguser experience in new form factors. MICR O

AcknowledgmentsWe thank Kevin Zhang, Fatih Hamzaoglu,

Eric Wang, Ruth Brain and the Intel TMGOrganization, Dave Dimarco, Manoj Lal,Steve Kulick, the Haswell architecture team,and the entire Intel CCDO team, and PattyKummrow and the SDG Org for their signifi-cant contributions to the success of eDRAMand the Intel Iris Pro Graphics product.

Disclaimer for Figures 6, 7, and 9: Softwareand workloads used in performance tests may

have been optimized for performance only onIntel microprocessors. Performance tests, suchas SYSmark and MobileMark, are measuredusing specific computer systems, components,software, operations, and functions. Anychange to any of those factors may cause theresults to vary. You should consult other infor-mation and performance tests to assist you infully evaluating your contemplated purchases,including the performance of that productwhen combined with other products. Resultshave been simulated and are provided forinformational purposes only. Results werederived using simulations run on an architec-ture simulator or model. Any difference insystem hardware or software design or con-figuration may affect actual performance.Requires a system with Intel Turbo BoostTechnology. Intel Turbo Boost Technology andIntel Turbo Boost Technology 2.0 are onlyavailable on select Intel processors. Consultyour system manufacturer. Performance variesdepending on hardware, software, and systemconfiguration. For more information, visithttp://www.intel.com/go/turbo. Iris graphics isavailable on select systems. Consult your systemmanufacturer.

Optimization notice: Intel’s compilersmay or may not optimize to the same degreefor non-Intel microprocessors for optimiza-tions that are not unique to Intel microproc-essors. These optimizations include SSE2,SSE3, and SSE3 instruction sets and otheroptimizations. Intel does not guarantee theavailability, functionality, or effectiveness ofany optimization on microprocessors notmanufactured by Intel. Microprocessor-dependent optimizations in this product areintended for use with Intel microprocessors.Certain optimizations not specific to Intelmicroarchitecture are reserved for Intelmicroprocessors. Please refer to the applica-ble product User and Reference Guides formore information regarding the specificinstruction sets covered by this notice.

....................................................................References1. P. Hammarlund, “Intel 4th Generation Core

Processor (Haswell),” Hot Chips 25, 2013.

2. N. Siddique et al., “Haswell: A Family of IA

22nm Processors,” to be published in Proc.

IEEE Int’l Solid-State Circuits Conf., 2014.

..............................................................................................................................................................................................

HOT CHIPS

............................................................

18 IEEE MICRO

3. E. Burton et al., “FIVR—Fully Integrated Volt-

age Regulators on 4th Generation Intel Core

SoCs,” to be published in Proc. IEEE Applied

Power Electronics Conf. and Exposition, 2014.

4. D. Kanter, “Haswell’s FIVR Extends Battery

Life,” Microprocessor Report, 30 July 2013.

5. R. Brain et al., “eDRAM process in 22nm

Technology,” Proc. Symp. VLSI Circuits,

2013, pp. T16-T17.

6. Y. Wang et al., “Retention Time Optimiza-

tion for eDRAM in 22nm Tri-Gate CMOS

Technology,” Proc. IEEE Int’l Electron Devi-

ces Meeting, 2013, pp. 240-243.

7. F. Hamzaoglu et al., “A 1Gb 2GHz

Embedded DRAM in 22nm Tri-Gate CMOS

Technology,” to be published in Proc. IEEE

Int’l Solid-State Circuits Conf., 2014.

8. Intel 64 and IA-32 Architectures Software

Developer Manual, Intel, 2013.

9. Intel 64 and IA-32 Architectures Optimiza-

tion Reference Manual, Intel, 2013.

10. R. Yoo et al., “Performance Evaluation of Intel

Transactional Synchronization Extensions for

High Performance Computing,” Proc. Int’l

Conf. High Performance Computing, Net-

working, Storage and Analysis, 2013,

doi:10.1145/2503210.2503232.

11. T. Karnagel et al., “Improving In-Memory

Database Index Performance with Intel Trans-

actional Synchronization Extensions,” to be

published in Proc. 20th Int’l Symp. High-

Performance Computer Architecture, 2014.

Per Hammarlund is an Intel fellow. Hisresearch interests include performance mod-eling, microarchitecture, SMT, power-performance efficiency and modeling, inte-gration, and SoC design. Hammarlund hasa PhD in computer science from the RoyalInstitute of Technology (KTH), Stockholm.

Alberto J. Martinez is a senior principalengineer at Intel and the chief architect forthe Embedded Subsystems and IP Group.His research focuses on embedded subsystemssoftware and hardware and PC I/O architec-tures. Martinez has an MS in electrical engi-neering from Sacramento State University.

Atiq A. Bajwa is the director of micro-processor architecture at Intel. He is respon-

sible for the architectural definition anddevelopment of microprocessors for Intel’smobile, desktop, workstation, and servercomputing segments. Bajwa has an MS inelectrical engineering from Yale University.

David L. Hill is a senior principal engineerat Intel. His research interests include mod-ular high-performance caches, coherentinterconnects, and system memory technol-ogies. His team was responsible for themodular uncore architecture sections of theHaswell and Broadwell product families.Hill has a BS in electrical engineering fromthe University of Minnesota.

Erik Hallnor is a cache and coherent fabricarchitect, focusing on client and SoC prod-ucts at Intel. He was the lead architect forthe Haswell coherent fabric, consisting ofthe ring interconnect, LLC, and eDRAMcache integration. Hallnor has a PhD incomputer science and engineering from theUniversity of Michigan.

Hong Jiang is an Intel Fellow, the chief mediaarchitect for the Platform Engineering Group,and the director of the Visual and ParallelComputing Group’s Media ArchitectureTeam at Intel. He leads the media architectureof processor graphics and its derivatives. Jianghas a PhD in electrical engineering from theUniversity of Illinois at Urbana-Champaign.

Martin Dixon is a principal engineer in theIntel Product Development Group, where he’sworking to develop and enhance the overallinstruction set and SoC architecture. Dixonhas a BS in electrical and computer engineer-ing from Carnegie Mellon University.

Michael Derr is a principal engineer atIntel. His work focuses on power managingPC I/O architectures. Derr has an MS inelectrical engineering from the GeorgiaInstitute of Technology.

Mikal Hunsaker is a senior principal engi-neer at Intel. His research interests focus onchipset high-speed serial I/O design, includ-ing PCI Express, SATA, and USB3. Hun-saker has an MS in electrical engineeringfrom Utah State University.

.............................................................

MARCH/APRIL 2014 19

Rajesh Kumar is a senior fellow, director ofcircuit and power technologies, and the leadinterface to process technology at Intel. ForHaswell, he guided the development of powerdelivery integration (FIVR), on-package I/O,and the novel process-technology needs. Kumarhas a master’s degree in electrical engineeringfrom the California Institute of Technology.

Randy B. Osborne is a principal engineerworking to improve performance of mem-ory hierarchies at Intel. His research inter-ests include memory controllers, memoryinterconnects, memory devices, and largecaches, including the introduction of eDRAMinto Intel products. Osborne has a PhD inelectrical engineering from the MassachusettsInstitute of Technology.

Ravi Rajwar is a principal engineer in theIntel Product Development Group, workingon various aspects of SoC architecture anddevelopment. His research interests includethe IA synchronization architecture. Rajwarhas a PhD in computer science from theUniversity of Wisconsin–Madison.

Ronak Singhal is a senior principal engi-neer at Intel. His research interests includeserver architecture development, ISA devel-opment, and performance analysis and mo-deling. Singhal has an MS in electrical andcomputer engineering from Carnegie Mel-lon University.

Reynold D’Sa is vice president of the Plat-form Engineering Group and general man-ager of the Devices Development Group atIntel. He leads the design engineering teamsresponsible for designing and developingSoC products for Intel’s next-generation cli-ent and mobile platforms, including tabletsand smartphones. D’Sa has an MS in electri-cal engineering from Cornell University.

Robert Chappell is a CPU architect at Intel,where he works to define the performance,power, reliability, and ISA features of theCPU cores used in various products rangingfrom phones to servers. He was the lead archi-tect for the Haswell memory execution cluster,the Haswell core in its later stages, and theHaswell follow-on core (Broadwell). Chappell

has a PhD in computer science from the Uni-versity of Michigan.

Shiv Kaushik is a fellow at Intel, where heleads the Windows OS Division in Intel’s Soft-ware and Services Group. His research interestsinclude the design of platform hardware andfirmware interfaces to operating systems andvirtualization software for power management,scaling, performance, and reliability. Kaushikhas a PhD in computer science and engineeringfrom Ohio State University.

Srinivas Chennupaty is a CPU and SoCarchitect at Intel. His research interestsinclude the Intel microprocessor architec-ture and instruction set development. Chen-nupaty has an MS in computer engineeringfrom the University of Texas at Austin.

Stephan Jourdan is a senior principal engi-neer at Intel, where he leads the architectureengineering teams responsible for definingand developing SoCs for device products.He was the chief architect on HSW ULT.Jourdan has a PhD in computer sciencefrom the University of Toulouse.

Steve Gunther is a senior principal engineerand a lead power architect for Intel, wherehe leads a team responsible for defining thepower management architecture for Intel’smicroprocessor product line. His researchinterests include power analysis, powermanagement, and power reduction.Gunther has a BS in electrical engineeringfrom Oregon State University.

Tom Piazza is a senior fellow and directorof graphics architecture at Intel. His researchinterests include computer graphics. Piazzahas a BS in electrical engineering from thePratt Institute.

Ted Burton is a senior principal engineerworking on advanced technologies in theDevices Development Group at Intel. Hisresearch interests include power delivery.Burton has a BS in physics from BrighamYoung University.

Direct questions and comments aboutthis article to Per Hammarlund, Intel, 2111NE 25th Ave., Hillsboro, OR 97124; [email protected].

..............................................................................................................................................................................................

HOT CHIPS

............................................................

20 IEEE MICRO

haswell:the fourth-generation i core...

Documents