intel xeon phi hotchips architecture presentation
Intel® Xeon Phi™ coprocessor (codename Knights Corner)
George Chrysos Senior Principal Engineer Hot Chips, August 28, 2012
Legal Disclaimers Copyright © 2012 Intel Corporation. All rights reserved.
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.
EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF
INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU
SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND
REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL
OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall
have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
Intel, the Intel logo, Xeon, Intel Core and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. Other names and brands may be claimed as the property of others.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and
functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other
products.
For more complete information about performance and benchmark results, visit Performance Test Disclosure
This document contains information on products in the design phase of development.
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not
guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides
for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
WARNING: Altering clock frequency and/or voltage may: (i) reduce system stability and useful life of the system and processor; (ii) cause the processor and other system components to fail; (iii) cause reductions in system performance; (iv) cause additional heat or other
damage; and (v) affect system data integrity. Intel has not tested, and does not warranty, the operation of the processor beyond its specifications. Intel assumes no responsibility that the processor, including if used with altered clock frequencies and/or voltages, will be fit
for any particular purpose. For more information, visit Overclocking Intel Processors
Warning: Altering PC memory frequency and/or voltage may (i) reduce system stability and useful life of the system, memory and processor; (ii) cause the processor and other system components to fail; (iii) cause reductions in system performance; (iv) cause additional
heat or other damage; and (v) affect system data integrity. Intel assumes no responsibility that the memory, including if used with altered clock frequencies and/or voltages, will be fit for any particular purpose. Check with memory manufacturer for warranty and additional
details.
Available on select Intel® Core™ Intel® Xeon® and Intel® Xeon Phi™ processors. Requires an Intel® HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. For more information
including details on which processors support HT Technology, visit http://www.intel.com/info/hyperthreading.
Requires a system with a 64-bit enabled processor, chipset, BIOS and software. Performance will vary depending on the specific hardware and software you use. Consult your PC manufacturer for more information. For more information, visit
http://www.intel.com/info/em64t
Requires a system with Intel® Turbo Boost Technology. Intel Turbo Boost Technology and Intel Turbo Boost Technology 2.0 are only available on select Intel® processors. Consult your PC manufacturer. Performance varies depending on hardware, software, and
system configuration. For more information, visit http://www.intel.com/go/turbo
ENERGY STAR is a system-level energy specification, defined by the Environmental Protection Agency, that relies on all system components (such as processor, chipset, power supply, etc.). For more information, visit http://www.intel.com/technology/epa/index.html
Intel® Many Integrated Core (Intel MIC) Architecture
Targeted at highly parallel HPC workloads
• Physics, Chemistry, Biology, Financial Services
Power efficient cores, support for parallelism
• Cores: less speculation, threads, wider SIMD
• Scalability: high BW on die interconnect and memory
General Purpose Programming Environment
• Runs Linux (full service, open source OS)
• Runs applications written in Fortran, C, C++, …
• Supports X86 memory model, IEEE 754
• x86 collateral (libraries, compilers, Intel® VTune™ debuggers, etc)
Copyright © 2012 Intel Corporation. All rights reserved. 3 Visual and Parallel Computing Group
Knights Corner Coprocessor
[Diagram: Intel® Xeon® processor with system memory, linked by PCIe x16 to KNC cards; the link is labeled TCP/IP.]
Each KNC card:
• > 50 Cores
• >= 8GB GDDR5 memory, on multiple GDDR5 channels
• Linux OS
Knights Corner – Power Efficient
Performance per Watt of a prototype Knights Corner Cluster compared to the 2 Top Graphics Accelerated Clusters
MFLOPS/Watt (higher is better). Source: www.green500.org
• Intel Corp, Knights Corner, Top500 #150, 72.9 kW: 1381
• Nagasaki Univ., ATI Radeon, Top500 #456, 47 kW: 1380
• Barcelona Supercomputing Center, Nvidia Tesla 2090, Top500 #177, 81.5 kW: 1266
Knights Corner Micro-architecture
[Diagram: cores, each with its own L2, and tag directories (TD) on the on-die interconnect, together with GDDR memory controllers (GDDR MC) and PCIe client logic.]
Knights Corner Core
X86 specific logic < 2% of core + L2 area
[Diagram: core block diagram. Labeled elements:]
• 4 threads, in-order, with one instruction pointer per thread (T0 IP to T3 IP)
• L1 TLB and 32KB Code Cache; L1 TLB and 32KB Data Cache
• Decode and uCode, 16B/Cycle (2 IPC)
• Two pipes: Pipe 0 and Pipe 1
• Register files: X87 RF, Scalar RF, VPU RF
• Execution units: X87, ALU 0, ALU 1, VPU (512b SIMD)
• L2 TLB and TLB Miss Handler
• 512KB L2 Cache with L2 Ctl and HWP, connected to the on-die interconnect
• Pipeline stages: PPF PF D0 D1 D2 E WB
Vector Processing Unit
[Diagram: vector processing unit. Labeled elements:]
• Core pipeline stages: PPF PF D0 D1 D2 E WB
• Vector pipeline stages shown: D2, E, VC1, VC2, V1-V4, WB
• DEC, VPU RF (3R, 1W), Mask RF
• Scatter/Gather, ST, LD, EMU
• Vector ALUs: 16 Wide x 32 bit, 8 Wide x 64 bit, Fused Multiply Add
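The vector ALU figures on this slide (16 lanes of 32 bits, fused multiply add, a mask register file) can be illustrated with a scalar emulation. This is a minimal sketch in portable C, not KNC instructions or intrinsics. The names `vec16f`, `mask16` and `fmadd_masked` are hypothetical, and the policy that masked-off lanes keep their previous destination value is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define VLEN 16  /* 512 bits / 32-bit lanes, per the slide */

typedef struct { float lane[VLEN]; } vec16f;  /* hypothetical type */
typedef uint16_t mask16;                      /* one bit per lane */

/* dst[i] = a[i] * b[i] + c[i] for every lane whose mask bit is set.
 * Lanes with a clear bit keep their previous dst value (assumed policy).
 * Note: a hardware fused multiply add rounds once; this plain C
 * expression may round twice, which is fine for an illustration. */
static void fmadd_masked(vec16f *dst, const vec16f *a, const vec16f *b,
                         const vec16f *c, mask16 k)
{
    assert(dst && a && b && c);
    for (int i = 0; i < VLEN; i++) {
        if (k & (1u << i)) {
            dst->lane[i] = a->lane[i] * b->lane[i] + c->lane[i];
        }
    }
}
```

Calling `fmadd_masked` with mask `0x00FF` updates lanes 0 to 7 and leaves lanes 8 to 15 untouched. The same loop over 8 lanes of `double` would correspond to the 8 Wide x 64 bit case.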
Interconnect
[Diagram: cores, each with its own L2, and tag directories (TD) connected by three kinds of ring:]
• BL: Data, 64 Bytes
• AD: Command and Address
• AK: Coherence and Credits
Distributed Tag Directories
Tag Directories track cache-lines in all L2s
Tag directory entry fields: TAG, Core Valid Mask, State
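The three entry fields named on this slide can be pictured as a small record. The following is only a sketch: the slide does not give field widths, the state encoding, the number of directories, or how an address is mapped to a directory. The names `td_entry`, `td_home`, `td_add_sharer` and `td_other_sharers`, the three-value state enum, the count `NUM_TDS`, the 64-byte line size, and the modulo mapping are all assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 64u  /* assumed cache-line size */
#define NUM_TDS    8u   /* hypothetical directory count */

typedef enum { TD_INVALID, TD_SHARED, TD_OWNED } td_state;  /* assumed states */

typedef struct {
    uint64_t tag;              /* identifies the cache line */
    uint64_t core_valid_mask;  /* bit n set: core n's L2 holds the line */
    td_state state;
} td_entry;

/* Pick the directory responsible for an address. Assumed mapping:
 * line number modulo the directory count. The real hash is not given. */
static unsigned td_home(uint64_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_TDS);
}

/* Record that `core` now holds the line in its L2. */
static void td_add_sharer(td_entry *e, unsigned core)
{
    assert(e && core < 64);
    e->core_valid_mask |= (uint64_t)1 << core;
    if (e->state == TD_INVALID)
        e->state = TD_SHARED;
}

/* Nonzero if any L2 other than `core` holds the line. */
static int td_other_sharers(const td_entry *e, unsigned core)
{
    assert(e && core < 64);
    return (e->core_valid_mask & ~((uint64_t)1 << core)) != 0;
}
```

On an L2 miss, a lookup of this kind lets a directory tell whether some other core's L2 already holds the line, which is the tracking role the slide describes.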
Interleaved Memory Access
[Diagram: cores with L2s, tag directories (TD), and GDDR memory controllers (GDDR MC) arranged around the interconnect.]
Interconnect: 2X AD/AK
[Diagram: the same interconnect of cores, L2s, and tag directories (TD), with one BL ring (64 Bytes) and 2x AD and AK rings.]
Multi-threaded Triad – Saturation for 1 AD/AK Ring
Results measured in development labs at Intel on Knights Corner prototype hardware and systems. For more information go to http://www.intel.com/performance
[Chart: performance vs. cores running (0 to 50).]
Simulation data indicates saturation for a single AD/AK ring.
Multi-threaded Triad – Benefit of Doubling AD/AK
Results measured in development labs at Intel on Knights Corner prototype hardware and systems. For more information go to http://www.intel.com/performance
Silicon data for 2 AD + AK rings: > 40% gain
[Chart: performance vs. cores running (0 to 50).]
Simulation data indicates saturation for a single AD/AK ring.
Streaming Stores
Streams Triad: for (i=0; i<HUGE; i++) A[i] = k*B[i] + C[i];
• Without Streaming Stores: Read A, B, C, Write A. 256 Bytes transferred to/from memory per iteration
• With Streaming Stores: Read B, C, Write A. 192 Bytes transferred to/from memory per iteration
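The 256 and 192 byte figures follow from counting cache-line transfers. The sketch below assumes that an "iteration" on the slide moves one 64-byte cache line per array access; that line size and the helper name `triad_bytes_per_line` are assumptions, while the `triad` loop is the kernel from the slide.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_BYTES 64  /* assumed cache-line size */

/* Streams Triad kernel from the slide. */
static void triad(double *A, const double *B, const double *C,
                  double k, size_t n)
{
    assert(A && B && C);
    for (size_t i = 0; i < n; i++)
        A[i] = k * B[i] + C[i];
}

/* Memory traffic per cache line of A produced.
 * A normal store first reads the destination line before writing it;
 * a streaming store skips that read. */
static int triad_bytes_per_line(int streaming_stores)
{
    int reads  = streaming_stores ? 2 : 3;  /* B, C (+ A without streaming) */
    int writes = 1;                         /* A */
    return (reads + writes) * LINE_BYTES;
}
```

With these assumptions, `triad_bytes_per_line(0)` gives 4 x 64 = 256 and `triad_bytes_per_line(1)` gives 3 x 64 = 192, so streaming stores remove a quarter of the memory traffic for this kernel.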
Multi-threaded Triad — with Streaming Stores
Results measured in development labs at Intel on Knights Corner prototype hardware and systems. For more information go to http://www.intel.com/performance
Silicon data: Streaming Stores > 30%
[Chart: performance vs. cores running (0 to 50).]
Cache Hierarchy Micro-architecture Choices
• L2 TLB: 64 entry, holds PTEs and PDEs (vs. no L2 TLB)
• Dcache Capability: simultaneous 512b load and 512b store (vs. 1 load or store per cycle)
• L2 Cache: 512 KB (vs. 256 KB)
• Hardware Prefetcher: 16 stream detectors, prefetch into the L2 (vs. no HWP, relying only on software prefetching)
Per-Core ST Performance Improvement (per cycle)
[Chart: performance impact of KNC core uArch improvements on Spec FP 2006; vertical axis 0.0 to 3.0.]
Results measured in development labs at Intel on Knights Corner and Knights Ferry prototype hardware and systems. For more information go to http://www.intel.com/performance
>1.8x Average Performance/Cycle Improvement – 1 Core, 1 Thread
Caches – For or Against?
[Chart: Relative BW and Relative BW/Watt for Memory BW, L2 Cache BW, and L1 Cache BW; vertical axis 0 to 50.]
Coherent Caches are a key MIC Architecture Advantage
Results have been simulated and are provided for informational purposes only. Results were derived using simulations run on an architecture simulator or model. Any difference in system hardware or software design or configuration may affect actual performance.
Caches:
• high data BW
• low energy per byte of data supplied
• programmer friendly (coherence just works)
Example: Stencils
• Spatial time-step simulation of a physical system
• Blocks are L2$ sized
Cache blocking promotes much higher performance and performance/watt vs. memory streaming
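The contrast between cache blocking and memory streaming can be sketched with a small 1D stencil. The slide does not say which blocking scheme was used; the example below swaps in a simple overlapped (ghost-zone) scheme, where each block is copied with a halo, advanced several time steps while it is resident in a small buffer, and then written back. The 3-point averaging stencil, the fixed end cells, and all function names are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One time step of a 3-point averaging stencil; end cells are copied. */
static void sweep(const double *in, double *out, size_t len)
{
    out[0] = in[0];
    out[len - 1] = in[len - 1];
    for (size_t j = 1; j + 1 < len; j++)
        out[j] = (in[j - 1] + in[j] + in[j + 1]) / 3.0;
}

/* Streaming version: every time step walks the whole array. */
static void stencil_streaming(double *u, double *tmp, size_t n, int steps)
{
    for (int s = 0; s < steps; s++) {
        sweep(u, tmp, n);
        memcpy(u, tmp, n * sizeof *u);
    }
}

/* Blocked version: advance one block by `steps` time steps while it sits
 * in a small buffer, then move on. Each block carries a halo of `steps`
 * cells per side, because one halo cell goes stale per time step. */
static void stencil_blocked(const double *u, double *out, size_t n,
                            int steps, size_t block)
{
    assert(n >= 2 && block > 0 && steps >= 0);
    size_t halo = (size_t)steps;
    double *a = malloc((block + 2 * halo) * sizeof *a);
    double *b = malloc((block + 2 * halo) * sizeof *b);
    assert(a && b);

    for (size_t lo = 0; lo < n; lo += block) {
        size_t hi = lo + block < n ? lo + block : n;
        size_t g0 = lo > halo ? lo - halo : 0;      /* clamp at array start */
        size_t g1 = hi + halo < n ? hi + halo : n;  /* clamp at array end */
        size_t len = g1 - g0;

        memcpy(a, u + g0, len * sizeof *a);
        for (int s = 0; s < steps; s++) {
            sweep(a, b, len);
            double *t = a; a = b; b = t;
        }
        /* Only the block interior is still valid; the halo is discarded. */
        memcpy(out + lo, a + (lo - g0), (hi - lo) * sizeof *out);
    }
    free(a);
    free(b);
}
```

Both functions compute the same result. The streaming version touches the whole array once per time step, while the blocked version reads each block from memory once per `steps` time steps, at the cost of recomputing the halo. Choosing `block` so that both buffers fit in the 512KB L2 is the kind of sizing the "L2$ Sized" label suggests.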
Power Management: All On and Running
[Diagram: the full chip (cores, L2s, TDs, GDDR MCs, PCIe client logic, PCIe IO, GDDR IO) and the GDDR5 devices, all on and running.]
Core C1: Clock Gate Core
When all 4T on a core have halted, core clock gates itself
Core C6: Power Gate Core
C1 time-out: power gate the core to save leakage; requires core re-init
Package Auto C3
Timeout when all cores have been in C6, clock gate the L2 and interconnect
Package C6
Host Driver can initiate Package C6 – Uncore Voltage Off, requires partial restart
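The core and package states on these slides form a small escalation ladder, which can be sketched as two transition functions. This is an illustrative model only, not driver or firmware code. The enum and function names are hypothetical, and two behaviors are assumptions: that a core returns to running as soon as any thread is no longer halted, and that a host request for Package C6 is honored only once every core is in C6.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { CORE_C0, CORE_C1, CORE_C6 } core_state;
typedef enum { PKG_ACTIVE, PKG_AUTO_C3, PKG_C6 } pkg_state;

/* Core: C1 (clock gated) when all 4 threads have halted,
 * C6 (power gated) after a C1 time-out. */
static core_state core_next(core_state s, int halted_threads, bool c1_timeout)
{
    assert(halted_threads >= 0 && halted_threads <= 4);
    if (halted_threads < 4) return CORE_C0;  /* assumed wake rule */
    if (s == CORE_C0) return CORE_C1;
    if (s == CORE_C1 && c1_timeout) return CORE_C6;
    return s;
}

/* Package: Auto C3 on a time-out once all cores are in C6;
 * Package C6 when the host driver asks for it. */
static pkg_state pkg_next(pkg_state s, const core_state *cores, int n,
                          bool all_c6_timeout, bool host_requests_c6)
{
    assert(cores && n > 0);
    for (int i = 0; i < n; i++)
        if (cores[i] != CORE_C6) return PKG_ACTIVE;
    if (host_requests_c6) return PKG_C6;     /* assumed precondition */
    if (s == PKG_ACTIVE && all_c6_timeout) return PKG_AUTO_C3;
    return s;
}
```

Each step down the ladder saves more power and costs more to leave: C1 only stops clocks, core C6 requires core re-init, and Package C6 requires a partial restart.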
Summary
Intel® Xeon Phi™ coprocessor provides:
• Performance and performance/watt for highly parallel HPC, with cores, threads, wide SIMD, caches, and memory BW
• Intel Architecture general purpose programming environment
• Advanced power management technology
KNC delivers programmability and performance/watt for highly parallel HPC
Thank You
Knights Corner brought to you by:
IAG (Intel Architecture Group)
• DCSG (Data Center and Systems Group)
• VPG (Visual and Parallel Group) MIC
– HW Architecture
– HW Design
– SW
SSG (Software and Services Group) MIC
IL PCL (Intel Labs – Parallel Computing Lab)
Vector Processor: 512b SIMD Width
Shared Multiplier Circuit for SP/DP
[Diagram: register file banks RF3 to RF0 above 16 SP lanes (SP15 to SP0); each pair of SP lanes is drawn with one DP lane (DP7 to DP0).]
16 wide SP SIMD, 8 wide DP SIMD. The 2:1 ratio is good for circuit optimization.
Gather/Scatter Address Machinery
gather-prime
loop: gather-step; jump-mask-not-zero loop
[Diagram: gather instruction loop. Each index from the vector register (Index0 to Index7) is added to the base address from the scalar register to form Addr0 to Addr7. A find-first over the mask register selects the access address sent to the TLB/DCACHE, and comparators (=) against that address drive the clearing of mask bits.]
Gather/Scatter machine takes advantage of cache-line locality
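The loop on this slide (gather-step, then jump-mask-not-zero) can be emulated in scalar C to show why cache-line locality helps. This is a sketch under assumptions: each step is taken to serve the first pending lane plus every other pending lane in the same cache line, the line size is taken as 64 bytes, and 8 lanes are used because the diagram shows Index0 to Index7. The function names are hypothetical, and the gather-prime instruction is not modeled.

```c
#include <assert.h>
#include <stdint.h>

#define LANES      8   /* the diagram shows Index0..Index7 */
#define LINE_BYTES 64  /* assumed cache-line size */

/* One gather-step: find the first pending lane, access its cache line,
 * and satisfy every pending lane whose address falls in that same line.
 * Returns the mask with the completed lanes cleared. */
static uint8_t gather_step(int32_t dst[LANES], const int32_t *base,
                           const int32_t index[LANES], uint8_t mask)
{
    int first = -1;
    for (int i = 0; i < LANES; i++) {
        if (mask & (1u << i)) { first = i; break; }
    }
    if (first < 0) return mask;  /* nothing pending */

    uintptr_t line = (uintptr_t)(base + index[first]) / LINE_BYTES;
    for (int i = first; i < LANES; i++) {
        if (!(mask & (1u << i))) continue;
        if ((uintptr_t)(base + index[i]) / LINE_BYTES == line) {
            dst[i] = base[index[i]];
            mask &= (uint8_t)~(1u << i);
        }
    }
    return mask;
}

/* loop: gather-step; jump-mask-not-zero loop. Returns the step count. */
static int gather(int32_t dst[LANES], const int32_t *base,
                  const int32_t index[LANES], uint8_t mask)
{
    assert(dst && base && index);
    int steps = 0;
    while (mask != 0) {
        mask = gather_step(dst, base, index, mask);
        steps++;
    }
    return steps;
}
```

In this model the number of loop iterations equals the number of distinct cache lines touched, not the number of lanes, so indices that cluster in a few lines finish in a few steps.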
Package Deep C3
Host Driver Initiated – L2/Ring/TDs dropped to retention V, memory in self refresh