Intel® Xeon Phi™ Product Family - An Overview - ScicomP

ScicomP 2013 Tutorial

Intel® Xeon Phi™ Product Family Architecture Overview

Klaus-Dieter Oertel, May 28th 2013

Software and Services Group Intel Corporation

Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Intel Architecture Multicore and Manycore

More cores. Wider vectors. Co-Processors.

Processor                                     Cores              Threads
Intel® Xeon® processor 64-bit                 1                  2
Intel Xeon processor 5100 series              2                  2
Intel Xeon processor 5500 series              4                  8
Intel Xeon processor 5600 series              6                  12
Intel Xeon processor E5 Product Family        8                  16
Intel Xeon processor code name Ivy Bridge     10                 20
Intel Xeon processor code name Haswell        To be determined   To be determined
Intel® Xeon Phi™ Coprocessor                  61                 244

Intel® Xeon Phi™ Coprocessor extends established CPU architecture and programming concepts to highly parallel applications

Images do not reflect actual die sizes. Actual production die may differ from images.


© 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

Vision: span from few cores to many cores with consistent models, languages, tools, and techniques


http://software.intel.com/en-us/articles/optimization-notice


Consistent Tools & Programming Models

[Figure: the same code, compilers, libraries, and parallel models serve both multicore systems (Intel® Xeon® processors) and manycore systems (Intel® Xeon® processor plus Intel® Xeon Phi™ coprocessor).]

Standards Programming Models Vectorize, Parallelize, & Optimize


Each Intel® Xeon Phi™ Coprocessor core is a fully functional multi-thread execution unit

Scalar unit based on Intel® Pentium® processor family

• Two pipelines (U and V)

– Dual-issue on scalar instructions

• Scalar pipeline 1 clock latency

• 64-bit data path

4 hardware threads per core

• “Smart” round-robin scheduling

– Prefetch buffers 2 instr-bundles / context

– Next ready context selected in order
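The "next ready context selected in order" rule can be sketched as a round-robin scan over the four hardware contexts. This is a minimal illustration only: the function name and the `ready` flags are hypothetical, and the sketch models selection order alone, not what makes the hardware scheduling "smart".

```python
def next_ready_context(ready, last):
    """Return the index of the next ready context after `last`, or None.

    `ready` is a list of booleans, one per hardware context (4 per core).
    Contexts are scanned in fixed order, wrapping around, starting with
    the one after `last`; contexts that are not ready are skipped.
    """
    n = len(ready)
    for step in range(1, n + 1):
        candidate = (last + step) % n
        if ready[candidate]:
            return candidate
    return None  # no context can issue this cycle


# Context 1 is stalled, so the scan after context 0 lands on context 2.
print(next_ready_context([True, False, True, True], last=0))
```

With all four contexts ready, the scan degenerates into plain round robin: 0, 1, 2, 3, 0, and so on.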

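[Figure: core block diagram showing the instruction decoder, the scalar unit (U and V pipelines) with scalar registers, the vector unit with vector registers, the 32K L1 I-cache and 32K L1 D-cache, the 512K L2 cache, and the connection to the ring.]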


Each Intel® Xeon Phi™ Coprocessor core is a fully functional multi-thread vector unit

A new vector unit: 512 bits wide!

• 32 512-bit vector registers per context

– Each holds 16 floats or 8 doubles

– ALUs support int32/float32 operations, float64 arithmetic, int64 logic ops

– Ternary ops including Fused-Multiply-Add

– Broadcast/swizzle support

– 8 vector mask registers for per lane conditional operations

– Most ops: 4-cycle latency, 1-cycle throughput

– Matches 4-cycle round robin of integer unit

– Mostly IEEE 754-2008 compliant

– Not supported: MMX™ technology, Streaming SIMD Extensions (SSE), Intel® Advanced Vector Extensions (Intel AVX)
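The lane counts and the per-lane conditional operations listed above can be illustrated with a small model. This is a sketch under stated assumptions: Python lists stand in for vector registers, an integer stands in for a mask register, and the helper names are hypothetical. Lanes whose mask bit is clear keep the old destination value in this model.

```python
VECTOR_BITS = 512  # vector register width from the slide


def lanes(element_bits):
    """Number of elements of the given width that fit in one register."""
    return VECTOR_BITS // element_bits


def masked_add(dst, a, b, mask):
    """Per-lane conditional add.

    Lane i receives a[i] + b[i] when bit i of `mask` is set;
    otherwise it keeps the value already in dst[i].
    """
    return [
        x + y if (mask >> i) & 1 else d
        for i, (d, x, y) in enumerate(zip(dst, a, b))
    ]


print(lanes(32), lanes(64))  # floats per register, doubles per register

n = lanes(32)
dst = [0.0] * n
a = [float(i) for i in range(n)]
b = [1.0] * n
# Only the low 8 lanes are enabled by the mask.
print(masked_add(dst, a, b, mask=0x00FF))
```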



Individual cores are tied together via fully coherent caches into a bidirectional ring

[Figure: cores connected by the ring, together with four GDDR blocks and the PCIexp interface.]

L1 32K I/D-cache per core

• 1-cycle access latency

• 3-cycle address-generation interlock latency

• 8-way associativity

• 64-byte cache line

• ~38 concurrent accesses/core

L2 512K cache per core

• 11-cycle raw latency

• 8-way associativity

• 64-byte cache line

• Streaming HW prefetcher

• ~38 concurrent accesses/core

GDDR5 Memory

• 16 32-bit channels

• Up to 5.5 GT/sec

• 8 GB

• 300ns access

Bidirectional ring

• 200 GB/sec

• Distributed Tag Directory (DTD) reduces ring snoop traffic

Gen2x16 PCI Express*

• 64-256 byte packets

• Peer-to-peer R/W
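Two quantities follow directly from the figures above: the number of sets in each cache and the theoretical peak GDDR5 bandwidth. The arithmetic below is a sketch that assumes all 16 channels transfer at the full 5.5 GT/sec at the same time; the function names are illustrative only.

```python
def cache_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * line_bytes)


def peak_bandwidth_gb_s(channels, channel_bits, transfers_per_sec):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = channels * channel_bits // 8
    return bytes_per_transfer * transfers_per_sec / 1e9


print(cache_sets(32 * 1024, ways=8, line_bytes=64))    # L1 data or instruction cache
print(cache_sets(512 * 1024, ways=8, line_bytes=64))   # L2 cache
print(peak_bandwidth_gb_s(16, 32, 5.5e9))              # GDDR5 peak
```

The result is an upper bound on paper; sustained bandwidth measured by a benchmark will be lower.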


Ring Interconnect Distributed Tag Directories

[Figure: eight cores, each with its own L2 cache, and eight tag directories (TD) attached to the ring. Each TD entry holds a TAG, a Core Valid Mask, and a State. The ring carries Data, Command/Address, and Coherence traffic.]

Tag directories track the cache lines held in all L2 caches
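How a distributed tag directory might track which L2 caches hold a line can be sketched as follows. Everything beyond the entry fields named on the slide (tag, core valid mask, state) is an assumption made for illustration: the address-to-directory mapping, the state names, and all class and function names are hypothetical and do not describe the actual hardware.

```python
from dataclasses import dataclass

LINE_BYTES = 64  # cache-line size from the slides
NUM_TDS = 8      # assumption: one directory per TD block in the figure


@dataclass
class Entry:
    core_valid_mask: int = 0  # bit i set means core i's L2 holds the line
    state: str = "I"          # hypothetical state names: "I" invalid, "S" shared


def home_td(addr):
    """Hypothetical mapping: spread cache lines over the directories."""
    return (addr // LINE_BYTES) % NUM_TDS


class DistributedTagDirectory:
    def __init__(self):
        # One tag -> Entry table per directory.
        self.tds = [dict() for _ in range(NUM_TDS)]

    def record_fill(self, addr, core):
        """Note that `core` has loaded the line containing `addr` into its L2."""
        tag = addr // LINE_BYTES
        entry = self.tds[home_td(addr)].setdefault(tag, Entry())
        entry.core_valid_mask |= 1 << core
        entry.state = "S"

    def sharers(self, addr):
        """Cores whose L2 holds the line, found without snooping every cache."""
        entry = self.tds[home_td(addr)].get(addr // LINE_BYTES)
        if entry is None:
            return []
        return [c for c in range(entry.core_valid_mask.bit_length())
                if (entry.core_valid_mask >> c) & 1]


dtd = DistributedTagDirectory()
dtd.record_fill(0x1000, core=0)
dtd.record_fill(0x1000, core=5)
print(dtd.sharers(0x1000))
```

The point of the lookup is that a miss only has to consult one directory to learn which caches to contact, which is how a directory can cut snoop traffic on the ring.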


Vector Processing Unit and Intel® IMCI

The Vector Processing Unit executes Intel® IMCI

• Intel® Initial Many Core Instructions

512-bit Vector Execution Engine

• 16 lanes of 32-bit single precision and integer operations

• 8 lanes of 64-bit double precision and integer operations

• 32 512-bit general-purpose vector registers for each of the 4 threads

• 8 16-bit mask registers for each of the 4 threads, used for predicated execution

Read/Write

• One vector length (512-bits) per cycle from/to Vector Registers

• One operand can come from memory at no extra cost

IEEE 754 Standard Compliance

• 4 rounding modes: to nearest even, toward 0, toward +∞, toward -∞

• Hardware support for SP/DP denormal handling

• Sets flags in the VXCSR status register but does not raise hardware traps
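The four rounding directions can be illustrated with Python's `decimal` module. This is only an analogy: it rounds decimal values to integers, whereas the hardware rounds binary floating-point results to the nearest representable float, but the direction of each mode is the same. The mode labels and the helper name are illustrative.

```python
from decimal import (Decimal, ROUND_HALF_EVEN, ROUND_DOWN,
                     ROUND_CEILING, ROUND_FLOOR)

# Label -> decimal rounding constant with the same direction.
MODES = {
    "to nearest even": ROUND_HALF_EVEN,
    "toward 0": ROUND_DOWN,
    "toward +inf": ROUND_CEILING,
    "toward -inf": ROUND_FLOOR,
}


def round_all(value):
    """Round a decimal string to an integer under each of the four modes."""
    return {
        name: int(Decimal(value).quantize(Decimal("1"), rounding=mode))
        for name, mode in MODES.items()
    }


print(round_all("2.5"))
print(round_all("-2.5"))
```

A tie such as 2.5 shows the difference most clearly, because each mode resolves it in its own direction.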



Examples of Intel® IMCI

Ternary Operands

• vop zmm1, zmm2, zmm3        zmm1 = zmm2 vop zmm3

• vop zmm1, zmm2, [ptr]       zmm1 = zmm2 vop MEM[ptr]

Fused operation Multiply-Add, Multiply-subtract

• vfmadd132ps zmm1, zmm2, zmm3        zmm1 = zmm1 × zmm3 + zmm2

• IEEE 754-2008: a fused operation rounds once, giving an error of 0.5 ulp rather than the 1 ulp of two separate operations
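The effect of rounding once instead of twice can be demonstrated numerically. The sketch below uses double precision rather than the single precision of `vfmadd132ps`, and it emulates a fused multiply-add with exact rational arithmetic followed by one rounding; it does not call any hardware FMA. The function names are hypothetical.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """a*b + c computed exactly, then rounded once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def separate_multiply_add(a, b, c):
    """a*b rounded to a double, then + c rounded again."""
    return a * b + c


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# Exactly, a*b = 1 - 2**-60. A double cannot hold that value, so the
# separate multiply rounds it to 1.0 and the small term is lost.
print(separate_multiply_add(a, b, c))
print(fused_multiply_add(a, b, c))
```

The separate version returns zero while the fused version keeps the small residual, which is the practical meaning of the tighter error bound.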

Prefetching

• Memory prefetching minimizes the likelihood of L1 and L2 cache misses

• Intel® Xeon Phi Coprocessor has a hardware prefetcher

• L1 prefetch: vprefetch1 ptr, hint

• L2 prefetch: vprefetch2 ptr, hint



[Figure: core pipeline (stages PPF, PF, D0, D1, D2, E, WB) extended by the Vector Processing Unit pipeline (D2, E, VC1, VC2, V1-V4, WB). VPU blocks shown: DEC, VPU RF (3R, 1W), Mask RF, LD, ST, Scatter/Gather, EMU, and the vector ALUs (16 × 32-bit wide, 8 × 64-bit wide) with fused multiply-add.]


EMU - Extended Math Unit

Single-precision transcendental functions

Minimax quadratic polynomial approximation

Directly implements 4 elementary functions

• vrcp23ps v1 {k1}, v0 // Reciprocal

• vrsqrt23ps v1 {k1}, v0 // Reciprocal square root

• vlog223ps v1 {k1}, v0 // Base-2 logarithm

• vexp223ps v1 {k1}, v2 // Base-2 exponential

Benefits other derived functions

• pow(x,y), sqrt(), div(), ln()

Function name   Latency   Throughput
exp2()          8         2
log2()          4         1
rcp()           4         1
rsqrt()         4         1
sqrt()          8         2
pow()           16        4
div()           8         2
ln()            8         2
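One plausible way the derived functions could be built from the four elementary ones is sketched below. The slides do not state the actual decomposition, so this is an assumption; it is offered because adding a 4-cycle multiply (the latency quoted for most vector ops) to the elementary latencies happens to reproduce the derived latencies in the table. Python's `math` functions stand in for the hardware approximations, and all names are illustrative.

```python
import math

# Stand-ins for the four elementary EMU operations.
def rcp(x):   return 1.0 / x
def rsqrt(x): return 1.0 / math.sqrt(x)
def log2(x):  return math.log2(x)
def exp2(x):  return 2.0 ** x

# Assumed decompositions of the derived functions (valid for x > 0).
def div(a, b): return a * rcp(b)             # rcp + multiply
def sqrt(x):   return x * rsqrt(x)           # rsqrt + multiply
def ln(x):     return log2(x) * math.log(2)  # log2 + multiply
def pow_(x, y): return exp2(y * log2(x))     # log2 + multiply + exp2

MUL = 4  # latency of "most ops" from the vector unit slide
ELEMENTARY = {"exp2": 8, "log2": 4, "rcp": 4, "rsqrt": 4}
DERIVED = {
    "div": ELEMENTARY["rcp"] + MUL,
    "sqrt": ELEMENTARY["rsqrt"] + MUL,
    "ln": ELEMENTARY["log2"] + MUL,
    "pow": ELEMENTARY["log2"] + MUL + ELEMENTARY["exp2"],
}
print(DERIVED)
print(pow_(2.0, 10.0), sqrt(9.0), div(1.0, 4.0), ln(math.e))
```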


Synthetic Benchmark Summary

[Figure: four bar charts, higher is better in each. Values recovered from the charts:]

Benchmark             E5-2670 baseline         5110P                    SE10P                    Speedup
                      (2x 2.6GHz, 8C, 115W)    (60C, 1.053GHz, 225W)    (61C, 1.1GHz, 300W)
SGEMM (GF/s)          640                      1,729 (85% efficient)    1,860 (86% efficient)    Up to 2.9X
DGEMM (GF/s)          309                      833 (82% efficient)      883 (82% efficient)      Up to 2.8X
SMP Linpack (GF/s)    303                      722 (71% efficient)      803 (75% efficient)      Up to 2.6X
STREAM Triad (GB/s)   80                       159                      174                      Up to 2.2X

The SMP Linpack chart labels its baseline as 2x 2.7GHz. The charts also carry two "ECC On" labels.

Coprocessor results: Benchmark run 100% on coprocessor, no help from Intel® Xeon® processor host (aka native)
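The efficiency and speedup figures on this slide can be cross-checked from the core counts and clock rates. The sketch below assumes each core completes one 512-bit fused multiply-add per cycle, counted as 2 flops per lane (32 single-precision or 16 double-precision flops per cycle per core); that peak model is an assumption, not a statement from the slides. It lands within about one percentage point of the quoted efficiencies.

```python
CARDS = {"5110P": (60, 1.053), "SE10P": (61, 1.1)}  # (cores, GHz)
FLOPS_PER_CYCLE = {"SP": 32, "DP": 16}  # assumption: lanes x 2 for FMA

BASELINE = {"SGEMM": 640, "DGEMM": 309, "SMP Linpack": 303, "STREAM Triad": 80}
RESULTS = {
    "5110P": {"SGEMM": 1729, "DGEMM": 833, "SMP Linpack": 722, "STREAM Triad": 159},
    "SE10P": {"SGEMM": 1860, "DGEMM": 883, "SMP Linpack": 803, "STREAM Triad": 174},
}
PRECISION = {"SGEMM": "SP", "DGEMM": "DP", "SMP Linpack": "DP"}


def peak_gflops(card, precision):
    """Theoretical peak under the one-FMA-per-cycle assumption."""
    cores, ghz = CARDS[card]
    return cores * ghz * FLOPS_PER_CYCLE[precision]


def efficiency(card, benchmark):
    """Measured GF/s as a fraction of theoretical peak."""
    return RESULTS[card][benchmark] / peak_gflops(card, PRECISION[benchmark])


def speedup(benchmark):
    """Best coprocessor result relative to the E5-2670 baseline."""
    best = max(RESULTS[card][benchmark] for card in RESULTS)
    return best / BASELINE[benchmark]


for bench in PRECISION:
    for card in CARDS:
        print(f"{bench:12s} {card}: {efficiency(card, bench):.1%}")
for bench in BASELINE:
    print(f"{bench:12s} speedup: {speedup(bench):.2f}x")
```

STREAM Triad is a bandwidth test, so only its speedup is computed, not a flops efficiency.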


http://software.intel.com/mic-developer



Intel® Xeon Phi™ Coprocessor High Performance Programming, Jim Jeffers, James Reinders, (c) 2013, publisher: Morgan Kaufmann

It all comes down to PARALLEL PROGRAMMING! (applicable to processors and Intel® Xeon Phi™ coprocessors)

Foreword, Preface

Chapters:

1. Introduction
2. High Performance Closed Track Test Drive!
3. A Friendly Country Road Race
4. Driving Around Town: Optimizing A Real-World Code Example
5. Lots of Data (Vectors)
6. Lots of Tasks (not Threads)
7. Offload
8. Coprocessor Architecture
9. Coprocessor System Software
10. Linux on the Coprocessor
11. Math Library
12. MPI
13. Profiling and Timing
14. Summary

Glossary, Index

Available NOW

This book belongs on the bookshelf of every HPC professional. Not only does it successfully and accessibly teach us how to use and obtain high performance on the Intel MIC architecture, it is about much more than that. It takes us back to the universal fundamentals of high-performance computing including how to think and reason about the performance of algorithms mapped to modern architectures, and it puts into your hands powerful tools that will be useful for years to come.

—Robert J. Harrison Institute for Advanced

Computational Science, Stony Brook University

Learn more about this book:

lotsofcores.com

© 2013, James Reinders & Jim Jeffers, book image used with permission


INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimer & Optimization Notice



5/28/2013
