
KNIGHTS LANDING: SECOND-GENERATION INTEL XEON PHI PRODUCT

The Knights Landing processor targets high-performance computing and other parallel workloads. It provides significantly higher performance and memory bandwidth than the previous generation. It is a standard, self-booting CPU that is binary compatible with prior Intel processors. Its innovations include a power-efficient core, a 512-bit vector instruction set, a new memory architecture, a high-bandwidth on-die interconnect, and an on-package network fabric.

Avinash Sodani, Roger Gramunt, Jesus Corbal, Ho-Seop Kim, Krishna Vinod, Sundaram Chinthamani, Steven Hutsell, Rajat Agarwal, and Yen-Chen Liu, Intel

Knights Landing (KNL) is the code name for the second-generation Intel Xeon Phi product family, which targets high-performance computing (HPC) and other parallel computing segments. It is a brand-new many-core architecture that delivers massive thread parallelism, data parallelism, and memory bandwidth in a CPU form factor for high-throughput workloads. KNL brings many innovations to market. It is a standard, stand-alone processor that can boot an off-the-shelf operating system. This is a big change from the previous-generation Intel Xeon Phi coprocessor, Knights Corner (KNC),[1] which was a PCI Express (PCIe)-connected coprocessor. In addition, KNL is binary compatible with prior Intel Xeon processors, which ensures that legacy software will run on KNL. Furthermore, KNL improves scalar and vector performance over KNC by about three times, packing more than 3 Teraflops of double-precision and 6 Teraflops of single-precision peak floating-point performance into the chip. It also introduces an innovative memory architecture comprising two types of memory, which provides both the high bandwidth and the large capacity needed to run large HPC workloads. Finally, KNL integrates the Intel Omni-Path Fabric on the package to provide a more scalable and cost-effective fabric solution for large systems.

KNL will be available in three product varieties: a KNL self-booting processor, a KNL self-booting processor with integrated fabric, and a KNL PCIe-connected coprocessor card as a follow-on product to KNC. The primary focus of this article is the KNL self-booting processor, which is the main product and the baseline for the other two products.

KNL Architecture Overview

We begin by providing an overview of the KNL architecture. Figure 1a shows the KNL CPU's block diagram, and Figure 2 shows its die photo. The KNL CPU comprises 38 physical tiles, of which at most 36 are active; the remaining two tiles are for yield recovery. Each tile (see Figure 1b) comprises two cores, two vector processing units (VPUs) per core, and a 1-Mbyte level-2 (L2) cache that is shared between the two cores.



The core is a new two-wide, out-of-order core derived from the Intel Atom core (based on the Silvermont microarchitecture).[2] It includes heavy modifications over the inherited microarchitecture to incorporate features necessary for HPC workloads, such as four threads per core, deeper out-of-order buffers, higher cache bandwidth, new instructions, better reliability, larger translation look-aside buffers (TLBs), and larger caches. KNL introduces the new Advanced Vector Extensions instruction set, AVX-512, which provides 512-bit-wide vector instructions and more vector registers.[3] In addition, it continues to support all legacy x86 instructions, making it completely binary compatible with prior Intel processors.

KNL introduces a new 2D, cache-coherent mesh interconnect that connects the tiles, memory controllers, I/O controllers, and other agents on the chip. The mesh provides the high-bandwidth pathways necessary to deliver the huge amount of memory bandwidth provisioned on the chip to its different parts. The mesh supports the MESIF (modified, exclusive, shared, invalid, forward) cache-coherence protocol.[4] It employs a distributed tag directory to keep the L2 caches in all tiles coherent with one another. Each tile contains a caching/home agent that holds a portion of the distributed tag directory and also serves as the connection point between the tile and the mesh.

KNL has two types of memory: multichannel DRAM (MCDRAM) and double data rate (DDR) memory. MCDRAM is 16 Gbytes of high-bandwidth memory comprising eight devices (2 Gbytes each) integrated on-package and connected to the KNL die via a proprietary on-package I/O. Together, the eight MCDRAM devices provide an aggregate Stream triad benchmark bandwidth of more than 450 Gbytes per second (GBps; see www.cs.virginia.edu/stream).

Figure 1. Knights Landing (KNL) block diagram: (a) the CPU, (b) an example tile, and (c) KNL with Omni-Path Fabric integrated on the CPU package. (CHA: caching/home agent; DDR MC: DDR memory controller; DMI: Direct Media Interface; EDC: MCDRAM controller; MCDRAM: multichannel DRAM; PCIe: PCI Express; VPU: vector processing unit.)

Figure 2. KNL CPU die photo.


KNL has six DDR4 channels running at up to 2,400 MHz, with three channels on each of two memory controllers, providing an aggregate bandwidth of more than 90 GBps. Each channel can support at most one memory DIMM. The total DDR memory capacity supported is up to 384 Gbytes. The two types of memory are presented to users in three memory modes: cache mode, in which MCDRAM is a cache for DDR; flat mode, in which MCDRAM is treated as standard memory in the same address space as DDR; and hybrid mode, in which a portion of MCDRAM is cache and the remainder is flat.
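For concreteness, the Stream triad kernel behind the bandwidth figures quoted above is just a scaled vector update over arrays too large for the caches. Below is a minimal sketch in C with OpenMP; the array size and timing harness are our illustration, not the benchmark's official source:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (64L * 1024 * 1024)  /* illustrative size, large enough to spill all caches */

    int main(void) {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];      /* triad: two reads and one write per element */
        double t1 = omp_get_wtime();

        /* three arrays times 8 bytes moved per element */
        printf("triad bandwidth: %.1f GBps\n", 3.0 * 8.0 * N / (t1 - t0) / 1e9);
        free(a); free(b); free(c);
        return 0;
    }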

KNL supports a total of 36 lanes of PCIe Gen3 for I/O, split into two x16 links and one x4 link. It also has four lanes of proprietary Direct Media Interface to connect to the Southbridge chip, just like Intel Xeon processors. The Southbridge chip provides support for legacy features necessary for a self-booting system.

KNL integrates the Intel Omni-Path Fabric[5] on-package in one of the product incarnations (see Figure 1c). The Omni-Path Fabric is connected to the KNL die via the two x16 links of PCIe and provides two 100-Gbits-per-second ports out of the package. The typical power (thermal design power) for KNL, including MCDRAM memory, when running a computationally intensive workload is 215 W without the fabric and 230 W with the fabric.

Vision and Motivation

In this section, we provide the motivations behind some of the choices made in the KNL architecture. The most fundamental choice was to make KNL a self-booting, standard CPU. The primary motivation was to get away from the PCIe bottlenecks inherent in a coprocessor and to provide the computation and bandwidth necessary for HPC workloads in a standard CPU form factor. Once this choice was made, several others followed. As a standard CPU, KNL is expected to run all kinds of software, not just highly parallel code: standard operating systems, system infrastructure code, general libraries, and debug tools. This motivated binary compatibility with existing Intel processors, because requiring that every piece of support software be recompiled to run on KNL would be a big overhead for users. Being a standard CPU also implied that KNL would run entire applications. This motivated us to develop a new core with significantly higher scalar performance than KNC, to ensure that KNL also runs the serial portions of applications well.

Binary compatibility motivated the need to define the 512-bit vector extensions in a manner consistent with the existing AVX and AVX2 instruction sets. This resulted in the introduction of the AVX-512 instructions.

We designed KNL's memory architecture to support its large computational capability and its stand-alone processor status. We chose MCDRAM to provide the high bandwidth needed to feed the large computational capability, and we chose DDR to provide the large memory capacity needed to run an entire application, with all the necessary support software, on KNL. We did not include a shared on-die L3 cache because we found that the targeted HPC workloads benefited less from it than from adding more cores; many HPC workloads either could be blocked in each per-tile L2 or were too large to fit in any reasonably sized on-die L3 cache.

Finally, the mesh interconnect had two motivations: the need for a network that could easily support the huge memory bandwidth provisioned on the chip, and the need for lower-latency connections (that is, fewer average hops from one point to another) between the different agents on the large chip.

Tile Architecture

We now provide more details on the main components within a tile: the core, the VPU, and the L2 cache.

Core and VPU Architecture

The KNL core differs from the parent Intel Atom core (based on the Silvermont microarchitecture) because of the HPC modifications. Figure 3 shows the block diagram of the core and the VPU, which is divided roughly into five units:


the front-end unit (FEU), the allocation unit, the integer execution unit (IEU), the memory execution unit (MEU), and the VPU. Although the core is two-wide from the point of view of decoding, allocating, and retiring operations, it can execute up to six operations per cycle in the middle.

Front-end unit. The core's FEU comprises a 32-Kbyte instruction cache (IL1) and a 48-entry instruction TLB. On a hit, the instruction cache can deliver up to 16 bytes per cycle. These bytes are sent to a two-wide decoder. Most instructions are decoded into a single micro-op, but a few complex ones that produce more micro-ops are handled by a two-wide microsequencer engine. Fetch direction is provided by a gskew-style branch predictor. Decoded micro-ops are placed into a 32-entry instruction queue.

Allocation unit. The allocation unit reads two micro-ops per cycle from the instruction queue. It assigns the pipeline resources required by the micro-ops, such as reorder buffer (ROB) entries (72), rename buffer entries (72), store data buffers (16), gather-scatter table entries (4), and reservation station entries. It also renames the register sources and destinations in the micro-ops. The rename buffer stores the results of in-flight micro-ops until they retire, at which point the results are transferred to the architectural register file. After the allocation unit, micro-ops are sent to one of three execution units (the IEU, MEU, or VPU), depending on their opcode types. Some micro-ops can be sent to more than one execution unit. For example, an Add instruction with a memory source is sent first to the MEU to read the memory source and then to the IEU to execute the Add operation.

Integer execution unit. The IEU executes integer micro-ops, defined as those that operate on general-purpose registers R0 through R15. There are two IEUs in the core. Each IEU contains one 12-entry reservation station that issues one micro-op per cycle. The integer reservation station is fully out of order in its scheduling. Most operations have one-cycle latency and are supported by both IEUs, but a few (for example, multiplies) have three- or five-cycle latency and are supported by only one of the IEUs.

Memory execution unit. The MEU executes memory micro-ops and also services fetch requests for instruction cache misses and instruction TLB misses.

Up to two memory operations, either loads or stores, can be executed in the MEU in a given cycle. Memory operations are issued in order from the 12-entry memory reservation station, but they can execute and complete out of order. Micro-ops that do not complete successfully are allocated into the recycle buffer and reissued to the MEU pipeline once their conflict conditions are resolved. Completed loads are kept in the memory ordering queue until they retire, to maintain consistency. Stores are kept in the store buffer after address translation, from which they can forward data to dependent loads. Stores are committed to memory in program order, one per cycle.

The 64-entry, eight-way set-associative L1 micro TLB is backed up by the 256-entry, eight-way set-associative L2 data TLB.

Figure 3. Core and VPU dataflow block diagram showing important microarchitectural structures and the data bus connections between them. (ALU: arithmetic logic unit; iTLB: instruction translation look-aside buffer; RF: register file; RS: reservation station.)


The data TLB also contains an eight-way, 128-entry table for 2-Mbyte pages and a fully associative, 16-entry table for 1-Gbyte pages. The writeback, nonblocking, 32-Kbyte, eight-way set-associative L1 data cache (DL1) supports two simultaneous 512-bit reads and one 512-bit write, with a load-to-use latency of four cycles for integer loads and five cycles for floating-point loads. The L1 hardware prefetcher monitors memory address patterns and generates data prefetch requests to the L2 cache to bring cache lines in ahead of demand.

The MEU supports unaligned memory accesses without any penalty and supports accesses that split across two cache lines with a two-cycle penalty. This helps HPC workloads, where memory accesses might not always be aligned to the natural data size. In addition, the MEU supports 48 virtual and 46 physical address bits (up to 39 physical bits for addressing cacheable memory and the rest for uncacheable memory) to provide the large memory-addressing capability commensurate with a stand-alone, standard CPU.

Finally, the MEU contains specialized logic to handle gather-scatter instructions efficiently. A single gather-scatter instruction can access many memory locations. The generation of these multiple accesses is done very close to the L1 cache pipeline, which allows maximum utilization of the two memory pipelines while consuming minimal resources in the rest of the core (such as the FEU, allocation unit, reservation stations, and reorder buffer).
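To make this concrete, a single AVX-512 gather pulls 16 single-precision elements from arbitrary offsets with one instruction, and the MEU expands it into the individual accesses near the DL1 pipeline. A minimal sketch using compiler intrinsics (the function and parameter names are ours):

    #include <immintrin.h>

    /* Gather 16 floats from table[offsets[0..15]] with one AVX-512 instruction. */
    __m512 gather16(const float *table, const int *offsets) {
        __m512i idx = _mm512_loadu_si512((const void *)offsets);
        return _mm512_i32gather_ps(idx, table, sizeof(float));  /* scale = 4 bytes */
    }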

Vector processing unit. The VPU is KNL's vector and floating-point execution unit and provides support for x87, MMX, Streaming SIMD Extensions (SSE), AVX, and AVX-512 instructions, as well as integer divides. Two VPUs are connected to the core. They are tightly integrated into the pipeline, with the allocation unit dispatching instructions directly into the VPUs. The VPUs are mostly symmetric, and each can provide a steady-state throughput of one AVX-512 instruction per cycle, for a peak of 64 single-precision or 32 double-precision floating-point operations per cycle from both VPUs. One of the VPUs is extended to provide support for legacy floating-point instructions, such as x87, MMX, and a subset of byte and word SSE instructions.
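These per-core figures tie back to the chip-level peak quoted in the introduction: 36 tiles × 2 cores × 32 double-precision flops per cycle gives 2,304 flops per cycle for the whole chip, so at a core clock of roughly 1.3 GHz (our assumption; the article does not state a frequency) the peak is about 3 Teraflops double precision, and twice that in single precision.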

Each VPU contains a 20-entry floating-point reservation station that issues one micro-op per cycle out of order. The floating-point reservation station differs from the IEU and MEU reservation stations in that it does not hold source data, which helps reduce its size; floating-point micro-ops read their source data from the floating-point rename buffer and the floating-point register file after they issue from the reservation station, spending an extra cycle between the reservation station and execution compared to integer and memory micro-ops. Most floating-point arithmetic operations have a latency of six cycles, whereas the remaining operations have a latency of two or three cycles, depending on the operation type.

The VPU also supports the new transcendental and reciprocal instructions (AVX-512ER)[3] and the vector conflict-detection instructions (AVX-512CD)[3] introduced in KNL.

L2 Architecture

On a KNL tile, the two cores share a 16-way set-associative, 1-Mbyte unified L2 cache. The bus interface unit (BIU) maintains intra-tile coherency (see Figure 4) and also acts as the local shared L2 cache management unit. Lines in the L2 cache are maintained in one of the MESIF states.

Figure 4. Cache hierarchy inside a tile. The figure shows the level-1 instruction (IL1) and data (DL1) caches in each core and the level-2 (L2) cache with its hardware prefetcher (HWP) between the two cores. (BIU: bus interface unit.)


Requests are made by the cores to the BIU via a dedicated request interface to each core. Cacheable requests look up the L2 tags for hit, miss, and line-state evaluation, whereas other requests are bypassed directly out to the mesh to be serviced by the targeted caching/home agent.

KNL implements a unique cache topology to minimize coherency-maintenance traffic. First, the L2 cache is inclusive of DL1 but not of IL1. Lines brought into IL1 are filled into the L2 cache, but when those lines are evicted from L2, the corresponding IL1 lines are not invalidated. This avoids hot-IL1/cold-L2 scenarios, in which a line in active use in IL1 would be invalidated because the corresponding L2 line was evicted for inactivity. Second, the L2 cache stores per-line "presence" bits to track which lines are actively used in DL1. This information is used to filter coherency probes to the inclusive DL1 when they are not required. It also factors into the L2 victim-selection algorithm to minimize eviction of in-use lines.

The BIU also contains an L2 hardware prefetcher that is trained on requests coming from the cores. It supports up to 48 independent prefetch streams. Once a stream is detected to be stable, either forward or backward, prefetch requests are issued to successive cache lines in that stream (that is, at the unit cache-line stride).

Threading

A KNL core supports up to four hardware contexts, or threads, using hyperthreading techniques.[6] Core resources can be dynamically partitioned, shared, or replicated, and pipelines are regulated by thread selectors. The goal is to maximize resource utilization among the active threads.

In general terms, threads become inactive after they execute a halt or monitor wait (mwait) instruction;[6] otherwise, they are considered active. Three threading modes are defined, depending on the number of active threads: single-thread mode when only one thread (any one) is active, dual-thread mode when any two threads are active, and quad-thread mode when any three or all four threads are active.

Dynamically partitioned resources are divided equally among threads, depending on the thread mode. For example, in single-thread mode, the active thread is allowed to use the full 72 entries in the ROB; in dual-thread mode, each thread gets 36 entries; and in quad-thread mode, each active thread gets 18 entries. Besides the ROB, other noteworthy structures that are dynamically partitioned are the rename buffers, reservation stations, store data buffers, instruction queue, and gather-scatter table entries.
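The partitioning rule is simple enough to state as code; the helper below is a hypothetical illustration of the mode-dependent share, not Intel's implementation:

    /* Per-thread share of a dynamically partitioned structure,
       such as the 72-entry ROB, under KNL's three threading modes. */
    int entries_per_thread(int total_entries, int active_threads) {
        if (active_threads <= 1) return total_entries;      /* single-thread mode: 72 */
        if (active_threads == 2) return total_entries / 2;  /* dual-thread mode: 36 */
        return total_entries / 4;          /* quad-thread mode (3 or 4 active): 18 */
    }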

Shared resources do not enforce partitioning, although some structures have a small number of per-thread reserved entries to avoid deadlocks. Threads get shared resources on a first-come, first-served basis. Noteworthy shared resources are the caches, TLBs, most MEU structures, branch predictors, and hardware prefetcher tables. Except in the caches (where lines are thread unaware), a particular entry in a shared structure is owned by only a single thread.

Replicated structures are kept to the bare minimum. For replicated structures, each thread has a dedicated copy that is not taken away when the thread becomes inactive. These structures include the rename tables, architectural registers, and other control registers.

The core pipeline has thread selectors at several points to maximize utilization and throughput of the pipeline and to maintain fairness among threads. These exist mainly in the in-order portion of the pipeline. A thread selector considers the availability of resources down the pipeline and tries to make informed selections. The important thread selectors are located in the FEU, allocation unit, retire pipeline, and MEU reservation stations. The out-of-order parts of the machine are thread agnostic and pick instructions based on readiness.

Instruction-Set Architecture

Figure 5 shows how KNL's instruction-set architecture compares to those of recent Intel Xeon processors.[6] KNL supports all legacy instructions, including the 128-bit SSE and SSE4.2 and the 256-bit AVX and AVX2 technologies.

KNL introduces the Intel AVX-512 instructions. AVX-512 provides 512-bit SIMD support, 32 logical registers, native support for true vector predication via eight new mask registers, and a generalization of indirect vector accesses via gathers (for loading sparse data) and scatters (for storing sparse data).


A single AVX-512 instruction can pack eight double-precision multiply-add operations (16 flops) or 16 single-precision multiply-add operations (32 flops), with an optional memory location used as an operand that can be either a vector or a scalar.

Intel AVX-512 instructions fall into four categories: foundation instructions (AVX-512F), which are the base 512-bit extensions described earlier; conflict-detection instructions (AVX-512CD); exponential and reciprocal instructions (AVX-512ER); and prefetch instructions (AVX-512PF). These capabilities provide efficient conflict detection to allow more loops (such as histogram updates) to be vectorized, fast exponential and reciprocal operations to speed up transcendental functions, and new sparse prefetch capabilities, respectively.

We defined AVX-512 with programmability in mind. Most AVX-512 programming occurs in high-level languages such as C/C++ and Fortran (with the help of vectorizing compilers and pragmas to guide them), or in libraries with optimized instruction sequences, which may use intrinsics.
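As a small illustration of the mask registers, the sketch below uses AVX-512F intrinsics to run a predicated multiply-add over a vector of arbitrary length, letting a mask handle the loop remainder instead of a scalar tail (the function name and structure are ours):

    #include <immintrin.h>
    #include <stddef.h>

    /* y[i] = a * x[i] + y[i], 16 floats per iteration. */
    void saxpy512(size_t n, float a, const float *x, float *y) {
        __m512 va = _mm512_set1_ps(a);
        for (size_t i = 0; i < n; i += 16) {
            size_t rem = n - i;
            __mmask16 m = (rem >= 16) ? (__mmask16)0xFFFF
                                      : (__mmask16)((1u << rem) - 1);
            __m512 vx = _mm512_maskz_loadu_ps(m, x + i);  /* inactive lanes load as 0 */
            __m512 vy = _mm512_maskz_loadu_ps(m, y + i);
            __m512 vr = _mm512_fmadd_ps(va, vx, vy);      /* packed fused multiply-add */
            _mm512_mask_storeu_ps(y + i, m, vr);          /* only active lanes stored */
        }
    }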

KNL does not implement Intel's Transactional Synchronization Extensions (TSX). Software is expected to confirm hardware support using a CPUID feature bit, to ensure that it will run correctly on machines both with and without these extensions.
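One conventional way to perform that check is to read the CPUID leaf that reports Restricted Transactional Memory (RTM) support; a minimal sketch for GCC or Clang:

    #include <cpuid.h>

    /* Nonzero if the CPU reports TSX RTM support
       (CPUID.(EAX=7,ECX=0):EBX bit 11); on KNL this is 0. */
    int cpu_has_rtm(void) {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
            return 0;
        return (ebx >> 11) & 1;
    }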

Mesh Architecture

The Intel mesh on-die interconnect architecture (see Figure 6a) is based on the ring architecture that has been widely deployed in the Intel Xeon processor family. The mesh features four parallel networks, each of which delivers different types of packets (for example, commands, data, and responses) and is highly optimized for KNL's traffic flows and protocols. The mesh can deliver more than 700 GBps of total aggregate bandwidth.

The mesh consists of rows and columns of half rings that fold upon themselves at the endpoints. The mesh enforces a YX routing rule: a transaction always travels vertically first until it hits the target row, makes a turn, and then travels horizontally until it reaches its destination. Messages arbitrate with existing traffic on the mesh at injection points, as well as when making a turn, with the existing traffic taking higher priority. The static YX routing helps reduce deadlock cases and thereby simplifies the protocol. One hop on the mesh takes one clock in the Y direction and two clocks in the X direction.
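Given these rules, the unloaded traversal cost between two mesh stops follows from the coordinates alone. A small sketch (our illustration; it ignores arbitration delays and any turn cost, which the article does not quantify):

    #include <stdlib.h>

    /* Estimated unloaded YX-routed latency in clocks:
       vertical hops cost 1 clock each, horizontal hops 2 clocks each. */
    int mesh_clocks(int src_x, int src_y, int dst_x, int dst_y) {
        return abs(dst_y - src_y) * 1 + abs(dst_x - src_x) * 2;
    }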

Cache Coherency and Distributed Tag Directory

The entire KNL chip is cache coherent, implemented with a MESIF protocol. The F (forward) state was introduced to allow efficient sharing: the core holding a line in the F state can provide shared data directly to a requesting core without going to memory. A distributed tag directory serves as a snoop filter that tracks lines currently owned or shared by the cores. It is distributed among all the tiles; the directory in each tile owns a portion of the address space based on an address hash. Figure 6 shows a simple L2 cache miss flow.
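The address-to-directory mapping can be pictured as a hash of the cache-line address onto the tiles; the sketch below is purely hypothetical (KNL's actual hash function is not described here):

    #include <stdint.h>

    #define NUM_TILES 36

    /* Toy mapping from a physical address to the tile holding its
       directory entry: hash the 64-byte-line address across the tiles. */
    unsigned int directory_tile(uint64_t phys_addr) {
        uint64_t line = phys_addr >> 6;   /* cache-line number */
        line ^= line >> 17;               /* illustrative mixing only */
        return (unsigned int)(line % NUM_TILES);
    }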

Figure 5. KNL's instruction-set architecture as compared to the architectures of prior Xeon processors. It shows that KNL supports all legacy instructions present on Xeon processors while adding new instructions. (AVX: Advanced Vector Extensions; BMI: Bit Manipulation Instructions; SSE: Streaming SIMD Extensions; TSX: Transactional Synchronization Extensions.)


Cluster Modes

The Intel mesh supports three modes of clustered operation that provide different levels of address affinity to improve overall performance. These cluster modes aim to lower latencies and improve bandwidth by reducing the distance that protocol flows traverse on the chip. The modes are selectable from the basic input/output system (BIOS) at boot time.

All-to-all mode. This mode lacks any affinity between the tile, the directory, and the memory. It is the most general mode and places no specific requirements on software or memory configurations, but it typically has lower performance than the other cluster modes. Figure 6b shows how an L2 miss can traverse the mesh to fill a cache line from memory in all-to-all mode.

Quadrant mode. This mode divides the KNL chip into four virtual quadrants, each of which provides affinity between the directory and the memory. There is no affinity between a tile and the directory or the memory (that is, a request from any tile can land on any directory), but a directory will access only the memory in its own quadrant. This cluster mode requires symmetric memory (that is, the same total capacity on both DDR memory controllers). It provides better latency than all-to-all mode and is transparent to software. Figure 6c shows how L2 miss transactions flow in quadrant mode.

Figure 6. Example of an L2 miss flow: (1) tile with L2 miss encountered, (2) request sent to the tile with the tag directory, (3) request forwarded to memory after a miss in the tag directory, (4) data read from memory and sent to the requester. (a) Mesh architecture, (b) all-to-all cluster mode, (c) quadrant cluster mode, and (d) sub-NUMA (nonuniform memory access) clustering (SNC) cluster mode.


Sub-NUMA clustering (SNC). This mode extends quadrant mode by also affinitizing the tile with the directory and the memory. In this mode, the KNL chip is divided and exposed to the OS as two or four nonuniform memory access (NUMA) domains, or clusters. For NUMA-optimized software (see http://man7.org/linux/man-pages/man7/numa.7.html), there is affinity between the tile, the directory, and the memory; that is, a request from a tile accesses a directory in its cluster, and the directory in turn accesses the memory controllers that are also within that cluster (the MCDRAM controllers are labeled EDC, and the DDR controllers are labeled MC in Figure 6). This cluster mode has the best latency profile of all the modes, especially under load, because most traffic is contained within the local cluster. To exploit this mode's performance, software must be NUMA optimized; that is, it needs to allocate memory in the same NUMA cluster where it runs, as the sketch below illustrates. Figure 6d shows the L2 miss transactions in SNC mode.
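Under Linux, that optimization amounts to allocating from the NUMA node on which the thread is running. A minimal sketch using libnuma (our illustration, not code from the article; link with -lnuma):

    #define _GNU_SOURCE
    #include <numa.h>
    #include <sched.h>
    #include <stdlib.h>

    /* Allocate from the NUMA node (SNC cluster) of the calling CPU so
       that subsequent accesses stay within the local cluster. */
    void *alloc_local_cluster(size_t bytes) {
        if (numa_available() < 0)
            return malloc(bytes);                  /* no NUMA support: fall back */
        int node = numa_node_of_cpu(sched_getcpu());
        return numa_alloc_onnode(bytes, node);     /* free with numa_free(p, bytes) */
    }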

Memory Architecture

KNL supports two levels of memory: MCDRAM and DDR. MCDRAM is a stacked DRAM architecture (see Figure 7) comprising multiple channels, or vaults, on each DRAM die, connected vertically with through-silicon vias.[7] Note that Figure 7 is a conceptual picture, not an actual layout. Each DRAM die in the stack is partitioned, with multiple banks in each partition. Corresponding partitions from all the DRAM dies, taken together, form a channel, or vault. Each channel has its own set of command, address, and data pins. All of the channels can be accessed in parallel, providing high bandwidth for HPC applications. For each four-high stack, there are a total of 16 internal channels, with eight banks per channel, for a total of 128 banks. The large number of banks allows more parallelism. The page size is 2,048 bytes.

The DRAM dies are stacked on top of a logic buffer die that connects to the CPU through a proprietary on-package I/O, which runs at a higher frequency than the DRAM core, allowing for a narrower but much faster interface to the CPU. MCDRAM supports self-refresh and power-down modes.

Memory Modes

KNL features three primary memory modes: cache, flat, and hybrid (see Figure 8). These modes are selectable through the BIOS at boot time.

Cache mode. In this mode, MCDRAM is configured as a memory-side cache for the entire DDR memory. It is a direct-mapped cache with 64-byte cache lines. The tags are stored in the error-correcting code bits corresponding to the cache lines and are read at the same time as the cache lines from the MCDRAM, allowing hit or miss to be determined without additional tag accesses to the MCDRAM. Because MCDRAM is much larger than traditional caches, in most cases the direct-mapped policy causes no significant change in conflict misses.

In this mode, all requests first go to MCDRAM for a cache lookup and then, in case of a miss, are sent to the DDR. The state of a line in the MCDRAM cache follows a modified, exclusive, garbage, invalid (MEGI) protocol. The garbage state means the line has been reserved for a future write that is guaranteed to occur, but the current data is not valid.
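The direct-mapped organization is easy to picture: with 16 Gbytes of MCDRAM and 64-byte lines there are 2^28 cache slots, and every physical address maps to exactly one of them, so addresses 16 Gbytes apart contend for the same slot. A small illustrative sketch:

    #include <stdint.h>

    #define LINE_BYTES  64ULL
    #define CACHE_BYTES (16ULL << 30)               /* 16-Gbyte MCDRAM cache */
    #define NUM_SLOTS   (CACHE_BYTES / LINE_BYTES)  /* 2^28 lines */

    /* Direct-mapped placement: each DDR line has exactly one MCDRAM slot. */
    uint64_t mcdram_slot(uint64_t phys_addr) {
        return (phys_addr / LINE_BYTES) % NUM_SLOTS;
    }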

Cache mode does not require any software change and works well for many applications.

Figure 7. MCDRAM architecture. It shows a conceptual picture (not a layout) of DRAM dies stacked over a logic die and their organization into multi-bank partitions, vaults, and vault controllers.


For applications that do not achieve a good hit rate in MCDRAM, the other two memory modes provide more control to use MCDRAM better.

Flat mode. This mode presents both MCDRAM and DDR as regular memory mapped into the same system address space. Flat mode is ideal for applications that can separate their data into a larger, lower-bandwidth region and a smaller, higher-bandwidth region. Accesses to MCDRAM in flat mode see guaranteed high bandwidth, unlike in cache mode, where bandwidth depends on hit rates. Flat mode requires software support to enable the application to take advantage of it, unless the workload fits entirely in MCDRAM.

Hybrid mode. In this mode, either half or a quarter of the MCDRAM is used as cache, and the rest is used as flat memory. The cache portion serves all of the DDR memory. This mode is ideal for a mixture of applications that benefit from general caching but can also take advantage of storing critical or frequently accessed data in flat memory. As with flat mode, software enabling is required to access the flat section of MCDRAM when the workload does not fit entirely into it. The cache section does not require any software support.

Flat MCDRAM software architecture. Flat mode requires software to explicitly allocate memory into MCDRAM. We created a software architecture for exposing MCDRAM as memory by relying on mechanisms that are already supported in the existing software stack. This minimizes the enabling effort and ensures that applications written for flat MCDRAM remain portable to systems without flat MCDRAM. The software architecture is based on the NUMA memory support (see http://man7.org/linux/man-pages/man7/numa.7.html) that exists in current operating systems and is widely used to optimize software for current multisocket systems, as shown in Figure 9b. We use the same mechanism to expose the two types of memory on KNL as two separate NUMA nodes (see Figure 9a). This provides software with a way to address the two types of memory using NUMA mechanisms. By default, the BIOS sets KNL cores to have higher affinity to DDR than to MCDRAM. This affinity helps direct all default and noncritical memory allocations to DDR, keeping them out of MCDRAM. To allocate critical memory in MCDRAM, we provide the HBW malloc library (https://github.com/memkind/memkind), with memory-allocation function calls, and a Fortran language annotation called FASTMEM. Figure 9c shows how these two methods are used.

Code written using the HBW malloc library and the FASTMEM attribute remains portable: on systems that do not have MCDRAM, these mechanisms default to standard memory-allocation policies. For all purposes, MCDRAM behaves just like regular NUMA memory in a two-socket system.
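That portability can be made explicit in code: the memkind library's hbwmalloc interface exposes hbw_check_available(), so an application can fall back to plain malloc where high-bandwidth memory is not exposed. A minimal sketch:

    #include <hbwmalloc.h>   /* from memkind; link with -lmemkind */
    #include <stdlib.h>

    /* Place bandwidth-critical data in MCDRAM when available,
       in regular DDR otherwise. */
    void *alloc_hot(size_t bytes) {
        if (hbw_check_available() == 0)   /* 0 means HBW memory is present */
            return hbw_malloc(bytes);     /* release with hbw_free() */
        return malloc(bytes);             /* release with free() */
    }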

Figure 8. Three memory modes in KNL. These modes (cache, flat, and hybrid) are selectable through the BIOS at boot time. In cache mode, the 16-Gbyte MCDRAM caches DDR; in flat mode, it occupies its own range of the physical address space alongside DDR; in hybrid mode, 4 or 8 Gbytes of MCDRAM serve as cache and the remaining 8 or 12 Gbytes are flat.


Performance

Figure 10 compares the relative performance and performance per watt of preproduction KNL versus a one-socket Intel Xeon processor E5-2697 v3 (codenamed Haswell) for a set of applications covering bandwidth benchmarks (Stream), dense linear algebra benchmarks (Linpack), general-purpose workloads (SpecInt and SpecFP), scientific algorithms (QCD-Wilson Dslash, MILC, AMG, SNAP, MiniFE, HPCG, and N-body), weather-modeling code (WRF), earthquake-modeling code (SeisSol), machine learning (CNN), and financial-services code (American Option Approximation). We use cache mode for AMG, SNAP, MILC, MiniFE, SpecInt, and SpecFP, and flat mode for the rest. We use quadrant mode for Linpack, Stream, and Interest Rate Simulations, and all-to-all mode for the rest.

On Linpack and Stream, which represent compute-bound and bandwidth-bound workloads, KNL provides 3.7 and 8 times the performance, and 2.5 and 5.5 times the performance per watt, of the one-socket Haswell, respectively. For other categories of workloads, which are compute and bandwidth bound to different degrees, early KNL silicon data, mostly unoptimized for KNL, shows 2 to 5.3 times the performance and 1.3 to 3.6 times the performance per watt of one-socket Haswell. On SpecInt and SpecFP rates (which exemplify out-of-box, general-purpose performance), early KNL silicon data shows 1.2 to 1.7 times the performance and 0.8 to 1.1 times the performance per watt of one-socket Haswell. The performance of KNL production silicon will generally be higher than these numbers. This shows that KNL delivers respectable performance and performance per watt for general-purpose, out-of-box, unoptimized parallel code, while providing a significant performance boost for optimized compute- and bandwidth-bound workloads.

Knights Landing's innovations include its power-efficient core, AVX-512 vector instruction set, dual MCDRAM and DDR memory architecture, high-bandwidth on-die interconnect, and integrated on-package network fabric. These innovations enable the processor to significantly improve performance for computationally intensive and bandwidth-bound workloads while still providing good performance on unoptimized legacy workloads, without requiring any special way of programming other than the standard CPU programming model.

Figure 9. Flat MCDRAM analogy with two-socket Xeon and MCDRAM memory allocation example. (a) KNL in flat MCDRAM mode boots up showing two NUMA domains. (b) Two-socket Xeon with its own memory showing two NUMA domains. (c) Simple code examples showing the use of functions in the HBW malloc library (https://github.com/memkind) and the FASTMEM attribute in Intel Fortran:

C/C++, allocate into DDR:

    float *fv;
    fv = (float *)malloc(sizeof(float) * 100);

C/C++, allocate into MCDRAM:

    float *fv;
    fv = (float *)hbw_malloc(sizeof(float) * 100);

Intel Fortran, allocate array A from MCDRAM:

    c Declare arrays to be dynamic
          REAL, ALLOCATABLE :: A(:)
    !DEC$ ATTRIBUTES FASTMEM :: A
          NSIZE=1024
          ALLOCATE (A(1:NSIZE))


Figure 10. Performance and performance-per-watt comparison of a one-socket KNL at 215 W (preproduction part) versus a one-socket Xeon E5-2697 v3 at 145 W, across the workloads listed above (Linpack, Stream, SpecInt rate, SpecFP rate, HPCG, QCD-Wilson Dslash, WRF 12 km, SeisSol Mt. Merapi, American Option Approximation, AMG, SNAP, MILC, MiniFE, CNN AlexNet, and N-body). Numbers may vary with production parts. Measurements are on early KNL silicon and are subject to change on production silicon; the SpecInt and SpecFP estimates are early A0 silicon runs and are not SPEC compliant; the N-body code and measurement were done by colfaxresearch.com.


References

1. G. Chrysos, Intel Xeon Phi X100 Family Coprocessor: The Architecture, white paper, Intel, Nov. 2012.
2. B. Kuttanna, Technology Insight: Intel Silvermont Microarchitecture, Intel, 2013.
3. Intel Architecture Instruction Set Extensions Programming Reference, no. 319433-023, Intel, Aug. 2015.
4. An Introduction to the Intel QuickPath Interconnect, white paper, Intel, Jan. 2009.
5. M.S. Birrittella et al., Intel Omni-Path Architecture Technology Overview, Intel, Aug. 2015.
6. Intel 64 and IA-32 Architectures Software Developer's Manual: Combined Volumes 1, 2A, 2B, 2C, 3A, 3B, 3C, and 3D, no. 325462-057US, Intel, June 2015.
7. M. Motoyoshi, "Through-Silicon Via (TSV)," Proc. IEEE, vol. 97, no. 1, 2009, pp. 43–48.

Avinash Sodani is a senior principal engineer at Intel and the chief architect of the Knights Landing processor. His research interests include computer architecture, high-performance computing, parallel programming, and applications. Sodani received a PhD in computer science from the University of Wisconsin–Madison. Contact him at [email protected].

Roger Gramunt is a senior computer architect at Intel and one of the primary architects of the Knights Landing core. His research interests include core architecture, graphics architecture, binary translation, and vector instruction sets. Gramunt received an MS in computer engineering from the Universitat Politecnica de Catalunya. Contact him at [email protected].

Jesus Corbal is a principal engineer at Intel and one of the primary architects of the Knights Landing vector processing unit. His research interests include vector microarchitectures and instruction-set architecture SIMD extensions. Corbal received a PhD in computer engineering from the Universitat Politecnica de Catalunya. Contact him at [email protected].

Ho-Seop Kim is a senior staff computer architect at Intel. His research interests include core architecture, caching and virtualization, performance monitoring, and dynamic binary translation. Kim received a PhD in electrical and computer engineering from the University of Wisconsin–Madison. Contact him at [email protected].

Krishna Vinod is a senior computer architect at Intel and the architect of the Knights Landing L2 cache and core-to-mesh interconnect. His research interests include caching, coherency, and multicore messaging. Vinod received an MS in electrical engineering from Case Western Reserve University. Contact him at [email protected].

Sundaram Chinthamani is a senior performance architect at Intel and the performance lead of the Knights Landing processor. His research interests include high-performance computing, computer performance modeling using simulation, and queuing networks. Chinthamani received a PhD in computer engineering from the University of Saskatchewan. Contact him at [email protected].

Steven Hutsell is a staff microprocessor architect at Intel and one of the primary memory architects for the Knights Landing processor. His research interests include CPU and graphics architecture, memory and caching systems, multichip interconnects, and quantum computing. Hutsell received a PhD in electrical and computer engineering from Portland State University. Contact him at [email protected].

Rajat Agarwal is a principal engineer at Intel. His research focuses on high-memory-bandwidth solutions for high-performance computing products. Agarwal received a BTech in electronics and communications engineering from the National Institute of Technology, India. Contact him at [email protected].

Yen-Chen Liu is a senior server architect at Intel. His research interests include on-chip/off-chip interconnect fabric, multisocket cache coherency, and chip multiprocessor multicore processor designs. Liu received an MS in electrical and computer engineering from the University of Wisconsin–Madison. Contact him at [email protected].
