
Post on 06-Feb-2018


Page 1

High Performance Embedded Computing

Arnon Friedmann

Texas Instruments

Design is a strategic asset

Page 2

Overview
• What is embedded?
• How did we get here?
  – Shannon DSP
• Brief history of TI DSP for HPC
• What makes a DSP?
• Where are we now?
  – Benchmarks
  – Hawking DSP, Brown Dwarf, Moonshot
• It’s about the software
• Where are we headed?
• Summary

Page 3

Embedded Markets

Market categories: radar & communications, computing, media processing, industrial electronics

• Video and audio infrastructure
• Wireless and Networking
• DVR / NVR & smart camera
• Test and Measurement
• Industrial control
• High-performance and cloud computing
• Home AVR and automotive audio
• Portable mobile radio
• Medical imaging
• Mission critical systems
• Industrial imaging
• Analytics

Page 4

DSP Roadmap

Timeline: 2013, 2014, 2015. Status legend: Concept, Development, Sampling, Production. Tiers: DSP High, DSP Mid, DSP Low.

• C6657
  – 1x/2x C66x
  – 1.25 GHz, 3MB L2
  – PCIe, USB, GigE, SRIO
  – 21x21mm
• C6678/4/2/1
  – 1x/2x/4x/8x C66x
  – 1.25GHz, 8MB of L2
  – PCIe, GigE, SRIO
  – 24x24mm
• 66AK2H12/06
  – 2x/4x ARM A15, 1.4 GHz
  – 4x/8x C66x, 1.2GHz
  – 1.4GHz, up to 8MB L2
  – PCIe, USB, GigE, SRIO
  – 40x40mm
• AM5K2H04
  – 4x ARM A15, 1.4 GHz
  – 1.4GHz, up to 8MB L2
  – PCIe, USB, GigE, SRIO
• OMAP L138
  – 1x ARM9, 456 MHz
  – 1x C674x, 456 MHz
  – EMAC, USB2, TDM
  – 13x13mm, 16x16mm
• C6748
  – 1x C674x, 456 MHz
  – EMAC, USB2, McASP
  – 13x13mm, 16x16mm
• Next High End Multicore
  – High performance multicore ARM + DSP
  – Large L2, 2x DDR4
  – High speed serial I/O
• Next Mid-range
  – Multicore ARM and DSP
  – Industrial control and communications
• Next DSP Low
  – Multicore ARM and DSP devices
  – Industrial, Audio and Communications

Page 5

TVP5154 TVP5158

DM64x

DM6467

Technology for Video Security Analytics from the Core to the Edge

Analog Camera

DM33x ISP+ARM

Coax cable

Smart Analytics IP Camera

DMVA2 DMVA3

Advanced Analytics IP Camera

DM644x DM812x

Main Stream

IP Camera

DM36x DM385

3G/Edge

IP DM812x C665x

C665x DM385

Additional processing With C665x Multicore

DVR : Digital Video Recorder NVR : Network Video Recorder DVS : Digital Video Server

DM810x DM814x DM816x

TI’s DSP & vision solutions: - All DM81xx DVR solutions with embedded analytics capabilities

- Analytics at the edge with DMVAx & DM812x

C667x

C667x Multicore

Page 6

Unleashing TI multicore DSPs @ SC’11
• Innovative new DSP core
• Most powerful multicore DSPs
• Lowest power per MHz/GMAC/GFLOP

Page 7

Evolution of the C66x

C64x Core (C64xx), Fixed Point
• Industry’s Lowest Power Fixed-point DSP Core
• Industry’s Highest Performance DSP Core
• Current base for multi-core product line

C67x Core (C67xx), Floating Point
• Industry’s Lowest Power Floating-point DSP Core
• High precision and wide dynamic range
• Easy and flexible programming

NEW MultiCore DSP: C66x DSP Core
• Highest Performance Fixed and Floating Point DSP
• 40 GMACs / 20 GFLOPs per core
• 320 GMACs / 160 GFLOPs total
• >10 GFLOPs/Watt
• TI Optimizing Compiler (GCC support, C/C++)
• Scientific Computing Libraries
• Multicore Tools and Code Composer Studio IDE

Most Power Efficient Scientific Computing Engine in the Industry!

Page 8

Shannon (TMS320C6678) – Block Diagram

[Block diagram: 8 x CorePac, each a C66x DSP with L1 and L2, connected through TeraNet and the Multicore Navigator, with a HyperLink 50 port]
• Memory Subsystem: Multicore Shared Memory Controller (MSMC), Shared Memory 4MB, DDR3-64b
• System Elements: Power Management, Debug, EDMA, SysMon
• Network CoProcessors: Crypto, Packet Accelerator
• IP Interfaces: GbE Switch, SGMII x2
• Peripherals & IO: SRIO x4, PCIe x2, EMIF 16, TSIP 2x, I2C, SPI, UART

• Multi-Core KeyStone SoC
• Fixed/Floating CorePac
  – 8 CorePac @ 1.25 GHz
  – 0.5MB L2/core, 4.0 MB Shared L2
  – 320G MAC, 160G FLOP, 60G DFLOPS
  – 10W
• Navigator
  – Hardware Queue Manager with DMA
• Multicore Shared Memory Controller
  – Low latency, high bandwidth memory access
• Network Coprocessor
  – IPv4/IPv6 Network interface solution
  – IPSec, SRTP, Encryption fully offloaded
• HyperLink
  – 50G Baud Expansion Port
  – Transparent to Software
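The device totals follow from the per-core figures by simple arithmetic. A minimal sketch in C, assuming the numbers quoted on the slides (8 cores at 1.25 GHz, 20 GFLOPs and 40 GMACs per core, 10 W); the struct and function names are hypothetical and serve only to illustrate the calculation.

```c
#include <assert.h>

/* Figures quoted on the slides; the type and field names are illustrative. */
typedef struct {
    int    cores;
    double clock_ghz;
    double gflops_per_core; /* single precision */
    double gmacs_per_core;
    double watts;
} DeviceSpec;

/* Peak single-precision GFLOPs summed over all cores. */
double total_gflops(const DeviceSpec *d) { return d->cores * d->gflops_per_core; }

/* Peak GMACs summed over all cores. */
double total_gmacs(const DeviceSpec *d) { return d->cores * d->gmacs_per_core; }

/* Floating-point operations per core per clock cycle. */
double flops_per_cycle(const DeviceSpec *d)
{
    assert(d->clock_ghz > 0.0);
    return d->gflops_per_core / d->clock_ghz;
}

/* Peak power efficiency in GFLOPs per watt. */
double gflops_per_watt(const DeviceSpec *d)
{
    assert(d->watts > 0.0);
    return total_gflops(d) / d->watts;
}
```

With the slide values this gives 160 GFLOPs and 320 GMACs in total, 16 floating-point operations per core per cycle, and a peak of 16 GFLOPs/W, which is consistent with the ">10 GFLOPs/Watt" claim.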

Page 9

Telecom ATCA Blade
• 1T DFLOPS / blade
• 240W (board power)
• 256GB/s memory bandwidth
• 20GB memory
• 100Gbit/s interconnect total bandwidth
• Dual 10Gbit/s Ethernet uplink
• 20 devices, 8 cores each
• 50Gbit/s links pairing devices

Page 10

Quad/Octal-Shannon PCIe Cards

Quad
• 512 Gflops
• 50 W

Octal
• ~1 Teraflop
• 110 W
• 16 GByte DDR3

Page 11

High Level Comparison
• TI Quad Shannon PCIe, ~50 W (2011)
  – ~12.8 Gflops/W SP, ~3.2 Gflops/W DP
• Nvidia Kepler, ~250 W (2012)
  – Dominates acceleration today
  – Powers #2 Supercomputer (Titan)
  – ~12 Gflops/W SP, ~4 Gflops/W DP
• Intel Xeon Phi (MIC), ~250 W (2012)
  – Unveiled at SC’12
  – Powers #1 Supercomputer (Tianhe-2)
  – ~8 Gflops/W SP, ~4 Gflops/W DP

Page 12

QCDSP: A Teraflop Scale Massively Parallel Supercomputer

We discuss the work of the QCDSP collaboration to build an inexpensive Teraflop scale massively parallel computer suitable for computations in Quantum Chromodynamics (QCD). The computer is a collection of nodes connected in a four dimensional toroidal grid with nearest neighbor bit serial communications. A node is composed of a Texas Instruments Digital Signal Processor (DSP), memory, and a custom made communications and memory controller chip. An 8192 node computer with a peak speed of 0.4 Teraflops is being constructed at Columbia University for a cost of $1.8 Million. A 12,288-node machine with a peak speed of 0.6 Teraflops is being constructed for the RIKEN Brookhaven Research Center. Other computers have been built including a 50 Gigaflop version for Florida State University.

Keywords: parallel, supercomputer, digital signal processor, QCD

Introduction: The atoms and nuclei of everyday matter are now known to be made up of still tinier particles known as quarks and leptons.

Researchers at Brookhaven developed a DSP-based system in the mid-to-late 1990s

Page 13

TI then forgot all about this...

Page 14

Page 15

Linpack Results from KTH
• Data from study performed at KTH Supercomputing Center
• LINPACK running on C6678 achieves 25.6 GFlops, ~2.1 GFlops/W
• Single precision performance ~4x better, ~8 GFlops/W
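The two efficiency figures are mutually consistent, which a short calculation shows. A minimal sketch in C, assuming only the numbers on this slide (25.6 GFlops at about 2.1 GFlops/W, single precision about 4x faster); the function names are hypothetical.

```c
#include <assert.h>

/* Power implied by a measured rate and a measured efficiency. */
double implied_watts(double gflops, double gflops_per_watt)
{
    assert(gflops_per_watt > 0.0);
    return gflops / gflops_per_watt;
}

/* Efficiency after a speedup, assuming the power draw stays the same. */
double scaled_efficiency(double gflops, double speedup, double watts)
{
    assert(watts > 0.0);
    return gflops * speedup / watts;
}
```

Under the assumption that power is unchanged between the two runs, 25.6 GFlops at 2.1 GFlops/W implies roughly 12 W, and a 4x speedup at that power gives about 8.4 GFlops/W, in line with the "~8 GFlops/W" on the slide.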

Page 16

Comparison – Algorithm Level
• GPU benchmark: Nvidia Tesla C1060 (Bisceglie’10, ref[2])
  – Core clock = 1.296GHz; processor cores = 240; memory = 4GB @ 800MHz
  – Testing algorithm: range-azimuth algorithm, FFT size 4096
• FPGA: Xilinx VIRTEX-5 (Pfitzner’11, ref[3])
• Comparison (ns/pixel)
  – DSP: 23.0
  – GPU: 14.9
  – FPGA: 53.3
• DSP > 20x in power/performance

Page 17

Running on OpenMP today

Page 18

Video Analytics Comparison between KeyStone I and x86 processors

[Chart: Watts consumed / Cost per Channel (QVGA). X axis: Processors (i7-2600, Xeon E5620, Dual E5645, Xeon X5675, single Shannon, Quad Shannon, Octo Shannon). Y axes: Cost (USD) / channel and Watts consumed / channel. Series: QVGA Watt per channel, QVGA Cost per channel.]

Page 19

FINALLY SOME DETAILS...

Page 20

C66x Core Overview

[Block diagram of the C66x core]
• C66x DSP: Fetch, Dispatch, Execute
• Two sets of L, M, S, D functional units, with Register file A and Register file B (paths labeled 64)
• L1P SRAM/Cache 32KB
• L1D SRAM/Cache 32KB
• L2 SRAM/Cache 1MB
• Prefetch, DMA
• Embedded Debug, Emulation, Power Management, Interrupt controller

Page 21

C6X High Performance at low power
• The C6X architecture is designed to provide the highest performance DSP processing
• Superscalar DSP is capable of executing 8 instructions per clock cycle
• VLIW engine works in concert with compiler technology to provide superscalar performance without the power overhead of general purpose superscalar CPUs

[Diagram: GPP versus C6X]
• GPP: ALUs fed in hardware by Instruction Dispatch, Instruction Scheduler, Reservation Stations and Re-order Buffers
• C6X: ALUs fed by Instruction Dispatch in the C6X VLIW Engine; Instruction Scheduler and Register Allocation sit in the C6X Compiler

Page 22

C6X VLIW Power Optimization
• Traditional exposed-pipeline VLIW machines have some drawbacks with respect to power
• Instruction RAM usage is high due to
  – No instruction scheduler: if an ALU (or other functional unit) is not used, a NOP must be issued to it
  – Loops must be unrolled by the compiler

Pure VLIW machine code for low IPC code: 6 instructions encoded in 24 instruction words

NOP ADD NOP NOP MPY NOP NOP NOP SUB NOP NOP NOP MPY NOP NOP NOP NOP NOP LD NOP MPY NOP NOP NOP

Page 23

C6X VLIW Instruction Dispatch
• To reduce instruction fetch power and code size, a simple instruction dispatch unit is introduced

[Diagram: Pure VLIW Machine versus C6X VLIW]
• Pure VLIW Machine: Instruction RAM feeds the Decoder and ALU of each slot directly
• C6X VLIW: Instruction RAM feeds an Instruction Dispatch unit, which feeds the Decoder and ALU of each slot

Page 24

C6X VLIW Instruction Dispatch (2)
• Execution unit and parallelism encoded in machine code
  – Simplified dispatch unit “unpacks” the machine code.

C6X Machine Code:

ADD MPY SUB MPY LD MPY

After Instruction Dispatch, as seen by the C6X VLIW Core:

NOP ADD NOP NOP MPY NOP NOP NOP SUB NOP NOP NOP MPY NOP NOP NOP NOP NOP LD NOP MPY NOP NOP NOP
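The packing idea can be sketched in portable C. This is an illustration under stated assumptions, not the actual C6X encoding: the 6-slot issue width, the `PackedInsn` layout, and the function names `vliw_pack` and `vliw_unpack` are all hypothetical. Each stored instruction carries its target unit and a flag saying whether the next instruction issues in the same cycle, so NOP slots never reach instruction RAM and the dispatch step restores them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLOTS 6 /* assumed issue width; the slide does not state one */

typedef struct {
    const char *op; /* mnemonic, e.g. "ADD" */
    int unit;       /* functional-unit slot the instruction targets */
    int parallel;   /* 1 if the next instruction issues in the same cycle */
} PackedInsn;

static int is_nop(const char *op) { return strcmp(op, "NOP") == 0; }

/* Drop NOP slots, keeping the unit and a parallel flag per instruction. */
size_t vliw_pack(const char *wide[][SLOTS], size_t rows, PackedInsn *out)
{
    size_t n = 0;
    for (size_t r = 0; r < rows; r++) {
        size_t first = n;
        for (int u = 0; u < SLOTS; u++) {
            if (!is_nop(wide[r][u])) {
                out[n].op = wide[r][u];
                out[n].unit = u;
                out[n].parallel = 1;
                n++;
            }
        }
        if (n == first) { /* idle cycle: keep one explicit NOP */
            out[n].op = "NOP";
            out[n].unit = 0;
            n++;
        }
        out[n - 1].parallel = 0; /* last instruction closes the cycle */
    }
    return n;
}

/* Rebuild the wide form, as a dispatch unit would; returns rows written. */
size_t vliw_unpack(const PackedInsn *in, size_t n,
                   const char *wide[][SLOTS], size_t max_rows)
{
    size_t r = 0;
    for (size_t i = 0; i < n && r < max_rows; r++) {
        for (int u = 0; u < SLOTS; u++)
            wide[r][u] = "NOP";
        for (;;) {
            assert(in[i].unit >= 0 && in[i].unit < SLOTS);
            wide[r][in[i].unit] = in[i].op;
            if (!in[i++].parallel || i >= n)
                break;
        }
    }
    return r;
}
```

Applied to the 24-word example laid out as four rows of six slots, `vliw_pack` stores six instructions in the order ADD, MPY, SUB, MPY, LD, MPY, and `vliw_unpack` reproduces the original four rows.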

Page 25

C6X VLIW Loop Unrolling
• Traditional exposed pipeline VLIW machines have additional instruction overhead
  – Loops are unrolled by the compiler
  – A loop that really only has 4 unique instructions can easily need 12-16 instructions after unrolling
• C64x+ generation introduced a loop construct which unrolls the loop within the CPU
  – Code size reduction for loops
  – Power savings in CPU instruction pipeline

Page 26

C6X VLIW Loop Unrolling (2)
• A “while” loop in C is depicted below
• The resulting assembly code is shown for
  – Traditional VLIW and
  – C6X VLIW

C pseudo-source:

    while (A1--) A8 += *A0++;

Traditional VLIW:

            B LOOP
            LDW *A0++, A7 || B LOOP
            LDW *A0++, A7 || B LOOP
            LDW *A0++, A7 || B LOOP
            LDW *A0++, A7 || B LOOP
            LDW *A0++, A7 || B LOOP
    LOOP:   LDW *A0++, A7 ||[A1] SUB A1,1,A1 || ADD A7,A8,A8 ||[A1] B LOOP

C6X VLIW with Software Pipelined Loop Unroller:

    MVC A1, ILC || SPLOOP 1
    LDW *A0++,A7
    NOP 4
    ADD A7,A8,A8 || SPKERNEL

• 20% overall dynamic power reduction
• Same performance using less than ½ the instructions

Page 27

C6X VLIW plus SIMD
• Strategy in evolving from C674x core to C66x
  – Increase datapath width, leave “overhead” the same
  – C66x increased most 32-bit instructions into 64-bit SIMD versions of the same instructions
  – Instruction decode overhead is the same, but processing power goes up by 2x
  – Overall energy consumption is lower for a given benchmark.
  – When only 32 bits of the unit are required, clock-gating eliminates the dynamic power of the unused 32 bits
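What a "64-bit SIMD version" of a 32-bit instruction means can be shown by emulating one in portable C. This is a sketch of the concept only: it does not use C66x intrinsics, and the names `pack2x32` and `add2x32` are hypothetical. One call performs two independent 32-bit additions on the halves of a 64-bit word, so the per-instruction overhead is paid once for twice the work.

```c
#include <stdint.h>

/* Pack two 32-bit lanes into one 64-bit word. */
uint64_t pack2x32(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Lane-wise add: each 32-bit half wraps on its own, no carry crosses lanes. */
uint64_t add2x32(uint64_t a, uint64_t b)
{
    uint32_t lo = (uint32_t)a + (uint32_t)b;
    uint32_t hi = (uint32_t)(a >> 32) + (uint32_t)(b >> 32);
    return pack2x32(hi, lo);
}
```

The lane independence is what separates this from a plain 64-bit add: an overflow in the low half must not spill into the high half.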

Page 28

WHERE ARE WE NOW

Page 29

KeyStone Innovation

5 generations of multicore
• Lowers development effort
• Speeds time to market
• Leverages TI’s investment
• Optimal software reuse

[Roadmap diagram. Status legend: Concept, Development, Sampling, Production. Timeline labels: 2003, 2006, 2011, 2012, 2013 / 2014, 2014 / 2015.]
• Generations: Janus 130nm, Faraday 65nm, KeyStone 40nm, KeyStone II 28nm, KeyStone III 20nm
• Feature labels: 6 core DSP; C64x+; Wireless Accelerators; Network and Security AccelerationPacs; C66x fixed and floating point, FPi, VSPi; ARM A8; ARM A15; 10G Networking; Multicore cache coherency; 64 bit ARM v8; C66x+; 40G Networking

Page 30

K2H Platform: 66AK2H12/06 Functional Diagram

[Functional diagram, 40mm x 40mm package, 28 nm: eight C66x DSP cores with 1MB each, four ARM A15 cores with 4MB, MSMC 6MB, Multicore Navigator, TeraNet]
• Network AccelerationPacs: Packet Accelerator, Security Accelerator, 5 port 1GbE Switch
• System Elements: Power Mgr, SysMon, Debug, EDMA
• EMIF and I/O: 64/72b DDR3 x2, 16b EMIF, UART x2, SPI x3, I2C x3, USB3
• High Speed SERDES: 1GbE, SRIO, HyperLink x2 (lane labels 4x, 4x, 8x), PCIe 2x

C66x Fixed or Floating Point DSP
• 4x/8x C66x DSP cores up to 1.4GHz
• 2x/4x ARM Cortex-A15
• 1MB of local L2 cache RAM per C66x DSP core
• 4MB shared across all ARM cores

Large on chip and off chip memory
• Multicore Shared Memory Controller provides low latency & high bandwidth memory access
• 6MB Shared L2 on-chip
• 2 x 72 bit DDR3 (with ECC), 16 GB total addressable, DIMM support (4 ranks total)

KeyStone multicore architecture and acceleration
• Multicore Navigator, TeraNet, HyperLink
• 1GbE Network coprocessor (IPv4/IPv6)
• Crypto Engine (IPSec, SRTP)

Peripherals
• 4 Port 1G Layer 2 Ethernet Switch
• 2x PCIe, 1x4 SRIO 2.1, EMIF16, USB 3.0, UART x2, SPI, I2C
• 15-25W depending upon DSP cores, speed, temp & other factors

Page 31

HP Moonshot – KeyStone II software-defined server

The essential foundation for the new style of IT

45 hot-plug cartridges
• Compute, Storage, or Combination
• x86, ARM, or Accelerator
• Single-server = 45 servers per chassis
• Quad-server = 180 servers per chassis (future capability)

Dual low-latency switches
• HP Moonshot-45G Switch Module (180 x 1Gb downlinks)

Page 32

TI KeyStone II HPC System
• ATCA-Based
• Rapid IO Switching
• 8 TFlop/Blade
• 100 GByte/Blade
• Up to 14 blades/chassis

Page 33

SOFTWARE AND TOOLS

Page 34

Multicore Software Vision

Multicore ARM: “Same User Experience as x86 devices”
• Mainline SMP Linux
• Standard Linux tools and distributions
• MPI
• Augment with TI differentiation

Multicore DSP: “Make easy native DSP experience”
• OpenMP
• Libraries
• Navigator run-time
• RTOS & Drivers
• IPC
• And more…

Multicore ARM + Multicore DSP: Leverage standard accelerator models
• OpenCL
• OpenMP Accel

Extensive tool box for advanced programmers

Development Environment:
• GDB, etc. Apps Debug Environment for ARM & DSP
• Eclipse “Embedded” Development & Debug Environment
• Instrumentation and Trace leveraging embedded hardware capability

Page 35

Fast, Effective, Open Tools

Optimized C Compiler
• Multicore parallel programming models

Productive IDE – Code Composer Studio™
• Eclipse based, hosts TI and 3rd party tools for easy debug
• Advanced analysis and visualization, speeds SW development

Efficient Multicore Software Development Kit
• Available on both DSP and ARM with free source code
• HLOS/RTOS, optimized libraries, algorithms and drivers, multicore runtime, protocol stack and application demos.

Page 36

Parallel computing strategies with TI DSPs

Discrete to Integrated: getting from here to here

[Diagram labels: x86, DSP, Hawking 28nm]

1 Get started quickly with
• Optimized libraries
• Simple host/DSP interface
Diagram labels: TI LIBS (BLAS, DSPLIB, FFT); User LIBS (TI Tools); 3rd Party LIBS (VSIPL)

2 Offload code simply with
• Directive-based programming
• OpenMP Accelerator Model
Diagram labels: Accelerator Model; User LIBS; Custom fxn

3 Create optimized functions using
• Standard Programming
• Vector Programming
• OpenMP Programming

Page 37

There’s No Single Answer

Different parallelism models:
o Task parallel or data parallel
o Coarse or fine grained parallelism
o Large or small data sets

Variety of System Baselines
o Current multicore systems
o Variety of methods of expressing and managing parallelism.
o Already or easily partitioned

Page 38

Hierarchy of Multicore Engagement Options

[Diagram: options arranged along two axes, “Increasing abstraction and productivity” and “Increasing control and performance”]
• Explicit IPC
• OpenMP (Homogeneous)
• OpenCL for Accelerators
• OpenMP (Accelerator Model)
• Multicore Libraries
• MPI

Development Approach:
• Engage at the most abstract level
• Incrementally optimize to achieve required performance or power efficiency

Page 39

Cooperative Parallel Programming (brief history of expression APIs/languages)

[Diagram: Node 0, Node 1 ... Node N connected by MPI Communication APIs]

Page 40

Cooperative Parallel Programming (brief history of expression APIs/languages)

[Diagram: Node 0, Node 1 ... Node N connected by MPI Communication APIs; each node has four CPUs running OpenMP Threads]
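The OpenMP layer inside one node can be illustrated with a small shared-memory kernel. This is a generic sketch, not code from the talk: the function name `dot` is hypothetical, and the MPI layer that would combine results across nodes is deliberately left out. The pragma splits the loop across the CPUs of the node and the reduction clause merges the partial sums; without OpenMP enabled the pragma is ignored and the loop runs serially with the same result.

```c
#include <stddef.h>

/* Dot product shared across the CPUs of one node with OpenMP threads. */
double dot(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    long i;

    /* Each thread accumulates a private partial sum, combined at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < (long)n; i++)
        sum += x[i] * y[i];

    return sum;
}
```

In a hybrid program each node would run a kernel like this on its own slice of the data and then exchange the per-node results over MPI.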

Page 41

Cooperative Parallel Programming (brief history of expression APIs/languages)

[Diagram: Node 0, Node 1 ... Node N connected by MPI Communication APIs; each node has four CPUs running OpenMP Threads plus a GPU programmed with CUDA/OpenCL]

Page 42

Cooperative Parallel Programming: on KeyStone II as an example

[Diagram: Node 0, Node 1 ... Node N connected by MPI Communication APIs; each node has four CPUs running OpenMP Threads plus a DSP programmed with OpenCL]

Page 43

Cooperative Parallel Programming: on KeyStone II as an alternative example

[Diagram: Node 0, Node 1 ... Node N connected by MPI Communication APIs; each node has four CPUs and OpenCL]

Page 44

Cooperative Parallel Programming: on KeyStone II as an alternative example

[Diagram: Node 0, Node 1 ... Node N connected by MPI Communication APIs; each node has four CPUs, OpenCL and OpenMP]

Page 45

Cooperative Parallel Programming: on KeyStone II as an alternative example

[Diagram: Node 0, Node 1 ... Node N connected by MPI Communication APIs; each node has four CPUs and OpenMP Accel]

Page 46

OpenMP Accelerator model: Target Construct

Pragma based model to dispatch computation from host to accelerator (K2H ARMs to DSPs)

04.09.2013

TI Confidential - NDA Restrictions

Extends OpenMP by adding
• A ‘target’ construct to indicate regions to be dispatched
• Map clause to indicate data transfer between host & accelerator
  – Does not have to be a copy (e.g. shared memory)
• Clauses to indicate that variables/functions reside on host/device/both
• Target regions can contain OpenMP constructs

    void foo(int *in1, int *in2, int *out1, int count)
    {
        /* In published OpenMP 4.0 syntax an array section is
           [lower-bound : length], so count elements is [0:count],
           and "to" and "from" go in separate map clauses. */
        #pragma omp target map(to: in1[0:count], in2[0:count], count) \
                           map(from: out1[0:count])
        {
            #pragma omp parallel shared(in1, in2, out1)
            {
                int i;
                #pragma omp for
                for (i = 0; i < count; i++)
                    out1[i] = in1[i] + in2[i];
            }
        }
    }

TI co-chairs the OpenMP accelerator model sub-committee
– Played significant role in spec definition

Page 47

EPCC Micro benchmark data

Parallel-For Overheads (cycles)

                          1      2      3      4      5      6      7      8
    OpenMP Runtime 1.2    6506   9519   10587  11600  12695  13857  15079  16423
    OpenMP Runtime 2.0    900    5788   6035   6161   6250   6368   6554   6804

Barrier Construct Overheads (cycles)

                          1      2      3      4      5      6      7      8
    OpenMP Runtime 1.2    2573   4461   4919   5431   6117   6619   7177   7842
    OpenMP Runtime 2.0    1667   1840   2009   2170   2366   2539   2733   2948

OpenMP Runtime 2.0
• Significantly reduces (2.5x) overhead of OpenMP constructs such as parallel for and barrier
  – Makes it feasible to use OpenMP for parallel regions with smaller granularity, i.e. fewer cycles
• Optimized OpenMP runtime built on OpenEM and libgomp (GCC OpenMP library)
• Does not require BIOS/IPC/XDC
  – However, the runtime will co-exist with BIOS etc. if present in the user application
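The "2.5x" figure can be checked against the cycle counts on this slide. A minimal sketch in C: the array and function names are hypothetical, the data are the values from the two charts, and the columns 1 to 8 are taken as array indices 0 to 7.

```c
#include <assert.h>
#include <stddef.h>

#define POINTS 8 /* columns 1..8 of the charts */

/* Cycle counts transcribed from the slide. */
static const double parfor_v12[POINTS]  = {6506, 9519, 10587, 11600, 12695, 13857, 15079, 16423};
static const double parfor_v20[POINTS]  = {900, 5788, 6035, 6161, 6250, 6368, 6554, 6804};
static const double barrier_v12[POINTS] = {2573, 4461, 4919, 5431, 6117, 6619, 7177, 7842};
static const double barrier_v20[POINTS] = {1667, 1840, 2009, 2170, 2366, 2539, 2733, 2948};

/* Overhead of the old runtime relative to the new one at one column. */
double overhead_ratio(const double *old_cycles, const double *new_cycles, size_t idx)
{
    assert(idx < POINTS);
    assert(new_cycles[idx] > 0.0);
    return old_cycles[idx] / new_cycles[idx];
}
```

At the last column the ratio is about 2.4 for parallel-for (16423 / 6804) and about 2.7 for barrier (7842 / 2948), which brackets the quoted 2.5x.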

Page 48

WHERE ARE WE GOING

Page 49

High Performance Compute – moving to mainstream

Oil and Gas Exploration

Bioscience

Big data mining

Weather forecast

Financial trading

Electronics design automation

Defense

Page 50

HPC System and Architecture Evolution

Compute
• Heterogeneous processing
• High level of parallelism

Data Movement
• Reducing memory and IO bottleneck

Connectivity
• Efficient networking
• Higher IO bandwidth

Performance/W
• Increasing power efficiency

[Diagram labels: More computation capacity; More memory and memory BW; More networking and IO capability; Less power consumption]

Reliability, Real time, Safety, Scalability, High Performance, Power Efficiency

Page 51

TI Continues to Invest in DSP

DSP Leadership Innovation
• 1982: C1x, first 16 bit commercial DSP
• 1995: C5000, 16 bit fixed point, ultra low power
• 1997: C6000, 32 bit fixed and/or floating point
• 2010: C66xx, 12.8 GFLOPS/W
• Next Gen DSP

Page 52

High Performance Memory Interfaces

Hybrid Memory Cube (HMC)
• High BW serialized interface
• Large DRAM memory space
• Suitable for networking and applications that are latency tolerant
• Lower mW/Gbps

High Bandwidth Memory Interface (HBM)
• Interposer/TSV stacks memory into the SoC package
• Wide interface to SoC cores
• Suitable for core-centric access that requires large BW and low latency
• Higher mW/Gbps

Page 53

High BW IO and network on chip

[Diagram: Multicore Navigator, PKT DMA, HyperLink, Packet Accelerator, Ethernet Switch, Security Accelerator and other IO (PCIe, JESD204B, SRIO, USB…) connected by TeraNet]

• Multicore Navigator enables zero copy and a common multicore programming model
• Modular, scalable networking solution
• HyperLink enables 50Gbps throughput with minimum latency and SW overhead
• TeraNet enables a high throughput non-blocking network on chip

Page 54

Holistic Power Optimization: from board to transistor level

Board level
• Memory integration (e.g. HMC, HBM)

Device level
• Low voltage operation
• DVFS
• Retention
• Bias
• In-package voltage regulation
• Interposer/TSV
• Signal transport
  – on-die SerDes

Transistor level
• FinFET
• Significant leakage current reduction with lower Vdd

[Figure: ASIC 1 and ASIC 2 on a Si interposer]

Page 55

Overcoming Integration Complexities

Industry Standard Ecosystem
• In-house and 3rd party IP: interface IP, core IP and soft IP

Dynamic Power Management
• IO voltages, core logic AVS and DVFS domains, SRAM supplies.

Static Power Management
• Power domains on processor cores, accelerators, and I/O

Board-Level Feature Integration
• Asynchronous clocking, scalable clocks, fixed frequency clocks.
• A/D & D/A converters, RF integration, voltage regulation.

System Management
• Reset, Clocking, DFT, Interrupts, Interconnect fabric

Page 56

Thank you

Texas Instruments

Design is a strategic asset