llmore: mapping and optimization frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf ·...

38
NOT APPROVED FOR PUBLIC RELEASE Michael Wolf, MIT Lincoln Laboratory 11 September 2012 LLMORE: Mapping and Optimization Framework This work is sponsored by Defense Advanced Research Projects Agency (DARPA) under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government. Distribution Statement A: Approved for public release, distribution is unlimited. (9/10/2012).

Upload: others

Post on 18-Mar-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

NOT APPROVED FOR PUBLIC RELEASE

Michael Wolf, MIT Lincoln Laboratory

11 September 2012

LLMORE: Mapping and Optimization Framework

This work is sponsored by Defense Advanced Research Projects Agency (DARPA) under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Distribution Statement A: Approved for public release, distribution is unlimited. (9/10/2012).

Page 2: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 2 MMW 09/11/2012

Overview of Mapping and Optimization Challenges

Challenges: •  Realistic simulations of applications •  Support for diverse languages/numerical libraries •  Support for diverse devices and architectures

Supercomputers Data warehouses

Small hybrid systems Clusters FPGA Chips Arc

hite

ctur

es

App

licat

ions

SAR Secure Communication Cyber Database operations

Mapping and Optimization

Page 3: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 3 MMW 09/11/2012

•  LLMORE is MIT Lincoln Laboratory’s Mapping and Optimization Runtime Environment

•  Parallel framework/environment for –  Optimizing data to processor mapping for parallel applications –  Simulating and optimizing new (and existing) architectures –  Generating performance data (runtime, power, etc.) –  Code generation and execution for target architectures

LLMORE

LLMORE: multiple language support, sparse/dense operations,

architecture model, executor, parallel, robust software

pMapper: Matlab, dense operations, simple architecture model,

executor, serial

SMaRT/MORE: Matlab, sparse operations

(limited data size), architecture model, no executor, serial

LLMORE SMaRT/MORE pMapper

2012 2004 2006 2011

Three generations of mapping and optimization

pMapper patent issued: “Method and apparatus performing automatic mapping for multiprocessor system”

Page 4: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 4 MMW 09/11/2012

Key Features of LLMORE

•  Support for multiple languages and numerical libraries •  Ability to solve large problems

– Written in C++ and runs in parallel –  Fit larger problems into memory, reduces time to solution

•  Support for dense and sparse linear algebra operations •  Production quality research software

– Easy to use interfaces – Designed to support future algorithms/packages/languages

•  LLMORE is NOT an autoparallelizing compiler –  Will not generate optimized parallel code for any set of (serial or

parallel) instructions –  Data layouts optimized in context of maps

Page 5: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 5 MMW 09/11/2012

•  Motivation/Overview •  Design and Usage

–  Usage 1: Map Optimization –  Usage 2: Performance Evaluation

•  LLMORE and POEM •  Preliminary POEM Results: 2D FFT •  Next Steps and Summary

Outline

Page 6: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 6 MMW 09/11/2012

LLMORE Framework Overview

LLMORE input LLMORE output

Architectural  Model  

LLMORE  Parameters  

User  Code  1. Set  of  op=mized  maps    

or  2.  Performance  data  

or  3.  Op=mized  architecture(s)  

or  4.  Generated  code  

or  5.  Results  from  run  on  target  architecture  Key/Novel Features

•  Application code to simulator •  Data mapping optimization for user code •  Support for multiple languages and libraries •  Ability to solve large problems •  Production quality software

LLMORE

Output: One or more

Page 7: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 7 MMW 09/11/2012

LLMORE Design Overview

Core Functionality

Language Interface

Run=me  Engine  

Code  Generator  

Machine Interface

LLMORE Input

LLMORE Output

LLMORE Output

Analyzer  and  Op=mizer  

Map Converter Parser

Parse  Manager  

Map  Manager  

Map  Builder  

AST  Builder  

AST = abstract syntax tree

Page 8: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 8 MMW 09/11/2012

•  Motivation/Overview •  Design and Usage

–  Usage 1: Map Optimization –  Usage 2: Performance Evaluation

•  LLMORE and POEM •  Preliminary POEM Results: 2D FFT •  Next Steps and Summary

Outline

Page 9: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 9 MMW 09/11/2012

LLMORE Usage 1: Map Optimization

LLMORE optimizes data mapping to improve parallel performance of key computational kernels

•  LLMORE produces set of optimized maps for parallel variables specified in user code

•  Matrix-vector product example –  LLMORE computes map for dense matrix –  LLMORE computes map for two vectors

= A x y = A x y LLMORE

Page 10: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 10 MMW 09/11/2012

Map Optimization: Input

y=Ax

= A x y

Input from application: user code for dense matrix-vector product

Dense Matrix-Vector Product

LLMORE

Optimized Maps

Analyzer and Optimizer

Mapper

Mapped AST

Parser User code

User Code:

AST

Page 11: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 11 MMW 09/11/2012

Map Optimization: AST Representation

Parser converts user code into abstract syntax tree (AST), which is input language/numerical library neutral

SB

MV

PVar: A PVar: x PVar: y

AST

y=Ax

LLMORE

Optimized Maps

Analyzer and Optimizer

Mapper

Mapped AST

Parser

AST  User code

Dense Matrix-Vector Product

Page 12: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 12 MMW 09/11/2012

Map Optimization: Mapping

LLMORE computes map for each parallel variable in AST

SB

MV

PVar: A PVar: x PVar: y

Vector Map P0: {0,1} P1: {2,3}

Vector Map P0: {0,1} P1: {2,3}

Matrix Map P0: {0,1} P1: {2,3}

LLMORE

Optimized Maps

Analyzer and Optimizer

Mapper

Mapped AST

Parser

AST  User code

Dense Matrix-Vector Product

Page 13: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 13 MMW 09/11/2012

Map Optimization: Output

A y x

•  LLMORE output: optimized maps for parallel variables •  New maps used to redistribute vector and matrix data •  Optimized matrix-vector product calculated with new data

distributions

A y x

LLMORE

Optimized Maps

Analyzer and Optimizer

Mapper

Mapped AST

Parser

AST  User code

Optimized y=Ax

= A y x

Dense Matrix-Vector Product

Page 14: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 14 MMW 09/11/2012

•  Motivation/Overview •  Design and Usage

–  Usage 1: Map Optimization –  Usage 2: Performance Evaluation

•  LLMORE and POEM •  Preliminary POEM Results: 2D FFT •  Next Steps and Summary

Outline

Page 15: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 15 MMW 09/11/2012

LLMORE Usage 2: Performance Evaluation

LLMORE simulates user code on specified architecture to produce performance evaluation metrics

= A x y

3040

5060

7080

90 23

45

67

8

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

Memory: Static Power(W)

GFLOPs/W Nehalem Model − Dense

Processor: Static Power(W)

Performance data

LLMORE

M2 P1 P2 M1

Page 16: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 16 MMW 09/11/2012

Performance Evaluation: Input

y=Ax

= A x y

LLMORE

Performance Data

Analyzer and Optimizer

Mapper

Mapped AST

User Code, Arch. Model

Parser AST  

Performance Evaluator

MI Code Generator

MI code

Simulator

MI = Machine Independent M2 P1 P2 M1

Dense Matrix-Vector Product

Input from application: user code for dense matrix-vector product, architecture model

User Code:

Architecture Model:

Page 17: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 17 MMW 09/11/2012

Performance Evaluation: AST Representation

y=Ax

LLMORE

Performance Data

Analyzer and Optimizer

Mapper

Mapped AST

User Code, Arch. Model

Parser AST  

Performance Evaluator

MI Code Generator

MI code

Simulator

MI = Machine Independent

SB

MV

PVar: A PVar: x PVar: y

AST

Dense Matrix-Vector Product

Parser converts user code into abstract syntax tree (AST), which is input language/numerical library neutral

Page 18: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 18 MMW 09/11/2012

Performance Evaluation: Mapping

LLMORE computes map for each parallel variable in AST

SB

MV

PVar: A PVar: x PVar: y

Vector Map P0: {0,1} P1: {2,3}

Vector Map P0: {0,1} P1: {2,3}

Matrix Map P0: {0,1} P1: {2,3}

LLMORE

Performance Data

Analyzer and Optimizer

Mapper

Mapped AST

User Code, Arch. Model

Parser AST  

Performance Evaluator

MI Code Generator

MI code

Simulator

MI = Machine Independent

Dense Matrix-Vector Product

Page 19: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 19 MMW 09/11/2012

Performance Evaluation: Machine Independent Code

Mapped AST and architecture model used to generate machine independent code

LLMORE

Output

Analyzer and Optimizer

Mapper

Mapped AST

Input Parser AST  

Performance Evaluator

MI Code Generator

MI code

Simulator

Mapped AST

SB

MV

PVar: A PVar: y Vector Map

P0: {0,1} P1: {2,3}

PVar: x Vector Map

P0: {0,1} P1: {2,3}

Matrix Map P0: {0,1} P1: {2,3}

MI Code

read A1,1 read A0,0 read x0 read y0

y0 = A0,0x0

read A0,1

y0 += A0,1x1

write y0

send x1 send x0

read x1 read y1

y1 = A1,1x1

read A1,0

y1 += A1,0x0

write y1

MI = Machine Independent

M2 P1 P2 M1

Architecture Model

MI Code Generator

Page 20: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 20 MMW 09/11/2012

Machine Independent Code

# rows = 2 # processors = 2

read A1,1 read A0,0 read x0 read y0

y0 = A0,0x0

read A0,1

y0 += A0,1x1

write y0

send x1 send x0

read x1 read y1

y1 = A1,1x1

read A1,0

y1 += A1,0x0

write y1

Machine independent code for matrix-vector product

P0

P1

Computation

Memory access

Communication

Flow graph of operations

Page 21: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 21 MMW 09/11/2012

Performance Evaluation: Output

Simulation of user code on target architecture

LLMORE

Performance Data

Analyzer and Optimizer

Mapper

Mapped AST

Input Parser AST  

Performance Evaluator

MI Code Generator

MI code

Simulator

MI Code

read A1,1 read A0,0 read x0 read y0

y0 = A0,0x0

read A0,1

y0 += A0,1x1

write y0

send x1 send x0

read x1 read y1

y1 = A1,1x1

read A1,0

y1 += A1,0x0

write y1

3040

5060

7080

90 23

45

67

8

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

Memory: Static Power(W)

GFLOPs/W Nehalem Model − Dense

Processor: Static Power(W)

Performance Data

Simulator

Page 22: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 22 MMW 09/11/2012

•  Motivation/Overview •  Design and Usage

–  Usage 1: Map Optimization –  Usage 2: Performance Evaluation

•  LLMORE and POEM •  Preliminary POEM Results: 2D FFT •  Next Steps and Summary

Outline

Page 23: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 23 MMW 09/11/2012

POEM

Architecture Design Application Analysis

POEM will bridge the gap between innovations in chip-scale photonics and fielded military-critical systems. It will offer a complete architecture design

and analysis for numerous real-world military-critical applications.

Photonic Innovation

Fielded Systems

P hotonically O ptimized E mbedded M icroprocessor

Handheld DNA Sequencing Device UAV Surveillance

Ring Resonator Technology

Photonics Layered on Silicon

Photonic Switching Technology

Handheld PDA Field Devices

Proposed Photonic Architectures

Processing Chains for Military-Critical Applications

Simulation, Mapping, and Optimization

Framework

Architecture Parameter Characterization

Page 24: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 24 MMW 09/11/2012

LLMORE’s Role in POEM Program

Architecture Design

LLMORE used to study chip-scale photonics and its impact on applications

Photonic Innovation

Fielded Systems

Handheld DNA Sequencing Device UAV Surveillance

Ring Resonator Technology

Photonics Layered on Silicon

Photonic Switching Technology

Handheld PDA Field Devices

Proposed Photonic Architectures

Architecture Parameter Characterization

Application Analysis

P hotonically O ptimized E mbedded M icroprocessor

Processing Chains for Military-Critical Applications

Simulation, Mapping, and Optimization

Framework

Page 25: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 25 MMW 09/11/2012

LLMORE and POEM: Applications

•  LLMORE provides framework for analyzing POEM applications –  LLMORE supports many key numerical kernels (FFT, sparse

matrix-vector product, vector updates, etc.) – Applications supported through composition of these kernels – Easy to extend to analyze new applications

•  Initially analyzing synthetic aperture radar (SAR) application

LLMORE enables the analysis of many applications on many different architectures (existing and proposed)

SAR Secure Communication Cyber Database operations

LLMORE

Page 26: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 26 MMW 09/11/2012

LLMORE and POEM: Architectures

•  LLMORE supports simulation of applications on POEM architectures (e.g., electronic mesh and photonic bus)

•  Framework for simulating user code –  LLMORE simulator for understanding big picture trends –  Interface to third party simulators (e.g., PhoenixSim) for higher

fidelity performance data

LLMORE enables the analysis of many applications on many different architectures (existing and proposed)

LLMORE

Electronic Mesh

Photonic Bus

Application

Page 27: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 27 MMW 09/11/2012

LLMORE and POEM: Optimization of Maps

Optimization of maps – Good application data to processor mapping crucial to

achieving peak parallel performance on target machines – Not difficult for SAR applications (simple maps sufficient) – Challenging for applications with irregular communication

(DNA sequence analysis and sparse matrix computations)

LLMORE provides automatic map optimization

= A x y = A x y LLMORE

Page 28: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 28 MMW 09/11/2012

•  Motivation/Overview •  Design and Usage

–  Usage 1: Map Optimization –  Usage 2: Performance Evaluation

•  LLMORE and POEM •  Preliminary POEM Results: 2D FFT •  Next Steps and Summary

Outline

Page 29: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 29 MMW 09/11/2012

LLMORE and Performance Evaluation

LLMORE

LLMORE output

Analyzer and Optimizer

LLMORE input

Mapper

Mapped AST

Performance Evaluator

MI Code Generator

MI code

Simulator Performance data: time (s), GFLOPS

Architectural  Model  

User  Code  

LLMORE used to produce performance evaluation data for 2D FFT, an important kernel in SAR processing chain

Parser

AST

Page 30: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 30 MMW 09/11/2012

POEM Architecture Models

LLMORE framework allows direct comparison of electronic mesh and photonic bus architectures

M2 P1 LM

P2 LM

P3 LM

P11 LM

P10 LM

P9 LM

P7 LM

P6 LM

P5 LM

P4 LM

P8 LM

P12 LM

P13 LM

P14 LM

P15 LM

P16 LM

M1

M3 M4

P1 LM

P2 LM

P3 LM

P11 LM

P10 LM

P9 LM

P7 LM

P6 LM

P5 LM

P4 LM

P8 LM

P12 LM

P13 LM

P14 LM

P15 LM

P16 LM

M1 M2 M3 M4

Electronic Mesh Photonic Bus

•  Two architectures modeled (electronic mesh and photonic bus •  Processors same, shared memory same •  Network parameters (latency, bandwidth) set to allow for apple

to apples comparison between networks

P=processor/core M=Shared memory LM=local memory

Page 31: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 31 MMW 09/11/2012

Preliminary POEM Simulations

Read FFT Write

Read FFT Write

Read Transposed Write

Read FFT

Read FFT Write

SCA Write

FFT

CT

FFT

2D FFT: FFT, CT, FFT 2D FFT: FFT, SCA, FFT

FFT

CT

FFT

Architectures simulated: EM Architectures simulated: PB EM=Electronic mesh PB=Photonic bus SCA = Synchronous coalesced access

1D FFT for each row of dense matrix

1D FFT for each column of dense matrix

For data locality of FFT

•  Initially, simulated 2D FFT kernel •  Important kernel for SAR applications

Page 32: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 32 MMW 09/11/2012

•  Utilizes distance independence to rapidly re-organize spatially separate data

•  Large gains in efficiency even when bandwidth is equalized

•  SCA write –  Synthesizes matrix row-to-column

transpose/write into a single transaction by interleaving data from spatially separate data producers

–  Novel ISA construct enabled by a highly synchronous photonic waveguide

Photonic Synchronous Coalesced Access Network (PSCAN)

R R

R R

R R

R R

R R

R R

R R

R MI M1

PSCAN leverages the differences between electronic and photonic interconnect to achieve large efficiency gains in critical operations

0 0 0 0 1 1Address info driven by proc 3

00

0

0

11

Data is not driven

1

DRAM P-CLK

DRIVE3 DRIVE2

DRIVE1

DRIVE0

time

Credit: Dave Whelihan, MITLL

Page 33: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 33 MMW 09/11/2012

Preliminary LLMORE Results

Experiment: •  Number of memory controllers: 4 •  Matrix size: 4096 x 4096 •  Equalize bandwidth to memory in

photonic and electronic models •  Run traditional 2D FFT with block

transpose for electronic mesh •  Run 2D FFT with SCA for PSCAN •  Scale number of processors

Examine performance of photonics and electronics in an increasingly work-

starved environment

PSCAN architecture with SCA yields significant speed-up over electronic mesh architectures using block transpose

4 8 16 32 64 128 256 512 1024 2048 4096109

1010

1011Performance of 2D FFT Operation

Number of Cores

GO

PS

ElectronicPSCANIdeal

43% of Ideal

Page 34: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 34 MMW 09/11/2012

Data Reorganization Bottleneck in Modern Manycore Systems

As the number of cores in modern CMPs grows, the data reorganization becomes an

increasing performance bottleneck. Performance is limited by the inability to

keep processors supplied with data.

FFT

CT

FFT

Read FFT Write Read Transpose Write Read FFT Write

Block Transpose

FFT

FFT

CT

SCA

Read FFT SCA Read FFT Write

SCA operation performs data reorganization step of 2D FFT more efficiently as number of processors grows, alleviating traditional block transpose bottleneck

4 8 16 32 64 128 256 512 1024 2048 40960

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Number of Cores

Perc

enta

ge

Time Spent Reorganizing Data for 2D FFT

Block TransposeSCA

Dat

a R

eorg

aniz

atio

n

Page 35: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 35 MMW 09/11/2012

•  Motivation/Overview •  Design and Usage

–  Usage 1: Map Optimization –  Usage 2: Performance Evaluation

•  LLMORE and POEM •  Preliminary POEM Results: 2D FFT •  Next Steps and Summary

Outline

Page 36: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 36 MMW 09/11/2012

•  Extend architecture model, LLMORE simulator –  Support memory hierarchy –  Model network contention more accurately

•  Support for external simulators –  E.g., PhoenixSim (Columbia), SST (Sandia) –  Needed for higher fidelity simulations

•  Better power modeling (e.g., dynamic power) •  Additional parallel numerical library/language support

–  Additional languages: Matlab, Python –  Additional libraries: e.g., VSIPL++/PVTOL –  Additional kernels: e.g., Sparse matrix operations

•  Code generator and runtime engine for execution/emulation on target architectures

LLMORE Next Steps

Page 37: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 37 MMW 09/11/2012

•  LLMORE: Parallel framework/environment for –  Optimizing data to processor mapping for parallel applications –  Simulating and optimizing new (and existing) architectures

•  LLMORE used to compare photonic and electronic architectures of interest to POEM project –  Support for many applications (through composition of kernels) –  Support for different architectures

•  Preliminary results for 2D FFT kernel –  Support thesis that photonics can improve performance for these

applications –  SCA instruction mitigates performance impact of corner turn in 2D

FFT operation

Summary

LLMORE allows for exploration of architecture design space in context of real application constraints

Page 38: LLMORE: Mapping and Optimization Frameworkieee-hpec.org/2012/index_htm_files/wolfslides.pdf · pMapper SMaRT/MORE LLMORE 2004 2006 2011 2012 Three generations of mapping and optimization

Michael Wolf - 38 MMW 09/11/2012

•  LLMORE –  Michelle Beard –  Anna Klein –  Sanjeev Mohindra –  Julie Mullen –  Eric Robinson –  Nadya Bliss (Manager) –  Minna Song (MIT student intern)

•  POEM –  Dave Whelihan –  Jeff Hughes –  Scott Sawyer

Acknowledgements