the fresh breeze memory model status: linear algebra and plans guang r. gao jack dennis mit csail...

21
The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant CCF-0937832 Joshua Slocum Xiaoxuan Meng Brian Lucas

Upload: cecilia-gardner

Post on 05-Jan-2016

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant

The Fresh Breeze Memory ModelStatus: Linear Algebra and Plans

Guang R. GaoJack Dennis

MIT CSAIL University of Delaware

Funded in part by NSF HECURA Grant CCF-0937832

Joshua Slocum Xiaoxuan MengBrian Lucas

Page 2: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant

Problem: I/O Performance

Culprits:

• Large Units of Data Transfer

• Operating System Overhead/Noise

Managed by hardwareManaged by software (OS)

$Main

Memory

L3

$

P

Disk

Other

FileMemory

• Few Concurrent Transfers (OS Limits) 2

$P

$P

Page 3: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant

Solution: Integrated Memory Hierarchy

Features:

• Many Concurrent Transactions – High Bandwidth

• Global Virtual Store

Managed by hardware

$Main

Memory

L3

$

P

Disk

Other

FileMemory

• Superior Basis for Security / Privacy 3

$P

$P

Page 4: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant

The Fresh Breeze Memory Model

• Fan-out as large as 16

Data

Chunks

e.g. 128 Bytes

Master

Chunk

• Write-Once then Read Only

Cycle-Free Heap Arrays as Trees of Chunks

4

• Arrays: Three levels yields 4096 elements (longs)

Page 5: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant

Fresh Breeze System Vision

Many Core Processing Chips

Switch Switch

Main Memory:Associative Directoriesand DRAM Chips

Backing Store:Access Controllersand Flash Devices

AD DRAM

Switch

AD DRAMAD DRAM

AC Flash AC Flash AC Flash AC FlashAC Flash

L2 Cache

PL1

PL1

PL1

PL1

L2 Cache

PL1

PL1

PL1

PL1

L2 Cache

PL1

PL1

PL1

PL1

Page 6: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant

Unique Features of the Processor

• Cache Lines are 128-byte Chunks.• Registers are tagged to indicate those

holding handles of chunks.• Hardware task scheduler: Active Task

List and Pending Task Queue.

6

Page 7: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant

Spawn and Join

7

Master Task

spawn (n)

Join_fetch Join

Worker 0 Worker n-1

join_update join_update

Spawned Worker Tasks

Continuation Task

Page 8: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant

The Dot Product

A

B

* Sum

A B

5 levels:Vector length =165 = 1,048,576

* +

scalar result

* *

Each Leaf Task: Dot Product of two 16-element vectors: 16 multiplies; 15 adds

Page 9: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant

Linear Algebra: Three Algorithms

• Dot Product• Matrix Multiply• Fast Fourier Transform

9

Let’s consider the special characteristics of each.

Page 10: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant

Dot Product

16

10

SegmentA

15

Multiplies

Adds

31 Operations

• No data reuse

• No intermediate data

• No chunks written

• Large volume of input data

Leaf Task: Dot Product of 16-element segments A and B

SegmentB

+*

Page 11: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant

Matrix Multiply

64

11

48

112

Multiplies

Adds

Operations

• Each input chunk used many times

• Result chunks written to memory

• No chunks written

• Relatively small input data

Leaf Task: Product of two 4-by-4 matrices

16 dot products of four-element vectors

+*

Page 12: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant

Fast Fourier Transform

4

EightData

Samples

FourTwiddleFactors

6

Multiplies

Adds

10Operations

• Log2 (n) stages

• Intermediate data

• Chunks written and read

Leaf Task: Group of Four Butterfly Computations

BFLYEight

ResultsBFLY

BFLY

BFLY

16

40

One Butterfly Four Butterflies

24

Page 13: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant

Simulated System Model

L1P

Memory

Switch

Processors – 16, 24, 32, 40

Up to 64IndependentStorage Units

L1P

Page 14: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant

Simulation Modes

Non-Blocking:

A task executing a read simply waits for operation to complete.

Models an L2 cache with short access time.

Models a main memory with long access time.

Blocking:

A task executing a read suspends to permit other tasks to run.

Page 15: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant

Estimated PerformanceL2 Cache Model - NonBlocking

Dot ProductA: 163

B: 164

C: 165

Matrix MultiplyA: 16 x 16B: 32 x 32C: 48 x 48

Fourier TransformA: 1024B: 2048C: 4096

AA

A

A

A A A A

A A A AB

B

B

B

B

B

BB

B B B B

C

C

C

C C

C

C

C

C C C C

0

2

4

6

8

10

12

16 24 32 40 16 24 32 40 16 24 32 40

Number of Processors

To

tal P

erfo

rman

ce -

GF

LO

PS

Page 16: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant

Estimated PerformanceL2 Cache Model - NonBlocking

Dot ProductA: 163

B: 164

C: 165

Matrix MultiplyA: 16 x 16B: 32 x 32C: 48 x 48

Fourier TransforA: 1024B: 2048C: 4096

0

2

4

6

8

10

12

16 24 32 40 16 24 32 40 16 24 32 40

Number of Processors

To

tal P

erfo

rman

ce -

GF

LO

PS

Page 17: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant

Findings

• Data locality of chunks works well.• L1 Cache is not very important; only a buffer

for input and result chunks of tasks.• Task switching for chunk reads is costly;

percolation would make a big difference.

17

Percolation: Ensuring that input chunks of a task have been retrieved before the task is scheduled.

Page 18: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant

Further Work

• Model a Three-Level or Four-Level Storage System

• Develop Hierarchical Work Stealing to improve Load Distribution for Massively Parallel Systems.

• Compiler Development: Automatic Mapping of Objects to Trees of Chunks.

• Expand Test Programs to include Transaction Processing and Database Operations .

18

Page 19: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant

Distributed Discrete-Event Simulation

• Accurate timing for interconnected communicating components.

• Packet Communication Architecture is a model for distributed systems especially amenable to efficient distributed simulation.

• UDel and MIT have begun a project to develop a PCA-based Simulation Sandbox for evaluating alternate PXMs for massively parallel computing.

19

Page 20: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant

Relevance to FSIO

• The Fresh Breeze Memory Model extends to 264 chunks, about 1021 chunks or 100,000 exabytes.

• The tree-structure is an excellent basis for advanced security and data object sharing.

• Can simplify check-point / restart.

• Seamless transition from “in-core” to “out-of-core” operation of arbitrary parallel programs.

• Provides ability to use any program as a component of new programs – including any parallel program.

20

Page 21: The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant

Conclusion

• Making the best of many-core technology requires study of new program execution models (PXMs).

• The FSIO challenge can be met by integrating the file system into the system memory hierarchy as a global virtual store.

• The Fresh Breeze PXM demonstrates some of the benefits from departing from conventional system organization.

• More exploration of new PXMs is needed!

21