Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures

Min Li, Sudharshan S. Vazhkudai, Ali R. Butt, Fei Meng, Xiaosong Ma, Youngjae Kim, Christian Engelmann, and Galen Shipman

Virginia Tech, Oak Ridge National Laboratory, North Carolina State University

OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY NC State University

Many-cores are driving HPC

The HPC software stack needs to be redesigned to benefit from the increasing number of cores

Can’t apps simply use more cores?

[Figure: speedup vs. number of cores (0 to 30) for mpiBLAST and FLASH]

Simply assigning more cores to applications does not scale.

Growing computation-I/O gap degrades performance

[Figure: performance growth trend, 2000 to 2010, for server/CPUs versus disk-based storage systems; annotated growth factors 25X and 2X; the gap is labeled "Storage Wall". Source: storagetopic.com]

Research question: Can the underutilized cores be leveraged to bridge the compute-I/O gap?

Observation: All workflow activities (not just compute) affect overall performance

[Figure: a workflow script with preprocessing, computation, and checkpointing stages, and two timelines of alternating computation and I/O phases: one marking the end-to-end performance, the other marking the performance gain]

Our Contribution: Functional Partitioning (FP) of Cores

•  Idea: Partition the app core allocation
    •  Dedicate partitions to different app activities
        •  Compute, checkpoint, format transformation, etc.
    •  Bring app support services into the compute node
    •  Transforms the compute node into a scalable unit for service composition
•  A generalized FP-based I/O runtime environment
    •  SSD-based checkpointing
    •  Adaptive checkpoint draining
    •  Deduplication
    •  Format transformation
•  A thorough evaluation of FP using a 160-core testbed

Functional Partitioning (FP)

[Figure: FP overview for configuration FP(2,8). Labeled elements: launch script, application data, checkpointing core, deduplication core, aggregate buffer, parallel file system]

Agenda

•  Motivation
•  Functional Partitioning (FP)
•  FP Case Studies
•  Evaluation
•  Conclusion

Challenges in FP design

•  How to co-execute the support services with the app?
•  How to assign cores for the support activities?
•  How to share data between compute and support activities?
•  How to make the FP runtime transparent?
•  How to have a flexible API for different support activities?
•  How to adapt support partitions based on progress?
•  How to minimize the overhead of FP runtime?

FP runtime design

•  Uses app-specific instances set up as part of job startup
•  Uses interpositioning strategy for data management:
    •  Initiates after core allocation by the scheduler and before application startup (mpirun)
    •  Pins the admin software to a core
    •  Sets up a FUSE-based mount point for data sharing between compute and support services
    •  Initiates the support services and the application’s main compute to use the shared mount space

Aux-apps: Capturing support activities

•  Provide an API for writing code for support activities
•  Describe actions to take when data is accessed

Advantages:
•  Decouple application design from support activity design
•  Provide a flexible, reusable interface
•  Support recycling of common activities across apps
•  Reduce application development time

int dedup_write(void *output_buffer, int size) {
    int result = SUCCESS;
    void *chunk;

    // process output in chunks
    while ((chunk = get_chunk(&output_buffer, size)) != NULL) {
        // compute hash on the current chunk
        char *hash = sha1(chunk);

        // write the chunk only if its hash has not been seen before
        if (!hashtable_get(hash))
            result = data_write(chunk);

        // update de-dup hash table
        hashtable_update(&result, chunk, hash);
    }
    return result;
}
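The dedup_write aux-app on this slide calls runtime helpers (get_chunk, sha1, hashtable_get, data_write, hashtable_update) whose definitions are not shown. The following is a minimal, self-contained sketch of the same chunk-hash-lookup flow, not the authors' implementation. It assumes fixed-size chunks, substitutes the non-cryptographic FNV-1a hash for SHA-1, uses a small open-addressing table, and counts bytes instead of writing them. All names (dedup_state, dedup_write_sketch, CHUNK_SIZE, TABLE_SLOTS) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE  4096   /* assumed fixed chunk size */
#define TABLE_SLOTS 1024   /* assumed capacity; should exceed the unique chunk count */

typedef struct {
    uint64_t hashes[TABLE_SLOTS];
    unsigned char used[TABLE_SLOTS];
    size_t bytes_written;    /* stand-in for data actually sent to storage */
    size_t chunks_skipped;   /* duplicates that only need a metadata update */
} dedup_state;

void dedup_init(dedup_state *st)
{
    assert(st != NULL);
    memset(st, 0, sizeof(*st));
}

/* FNV-1a: a simple hash used here in place of SHA-1 (assumption).
 * Unlike a cryptographic hash, collisions are plausible at scale, so a
 * real system would use SHA-1 or compare chunk bytes on a hash match. */
static uint64_t fnv1a(const unsigned char *data, size_t len)
{
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Insert the hash if absent. Returns 1 if it was new, 0 if already present. */
static int table_insert_if_new(dedup_state *st, uint64_t h)
{
    size_t slot = (size_t)(h % TABLE_SLOTS);
    for (size_t probes = 0; probes < TABLE_SLOTS; probes++) {
        if (!st->used[slot]) {
            st->used[slot] = 1;
            st->hashes[slot] = h;
            return 1;
        }
        if (st->hashes[slot] == h)
            return 0;
        slot = (slot + 1) % TABLE_SLOTS;   /* linear probing */
    }
    return 1;   /* table full: fall back to treating the chunk as new */
}

/* Walk the buffer chunk by chunk; "write" only chunks not seen before. */
void dedup_write_sketch(dedup_state *st, const void *buf, size_t size)
{
    assert(st != NULL && (buf != NULL || size == 0));
    const unsigned char *p = buf;
    while (size > 0) {
        size_t len = size < CHUNK_SIZE ? size : CHUNK_SIZE;
        if (table_insert_if_new(st, fnv1a(p, len)))
            st->bytes_written += len;   /* new chunk: would be written */
        else
            st->chunks_skipped++;       /* duplicate: metadata only */
        p += len;
        size -= len;
    }
}
```

Because the table persists across calls, a second checkpoint that repeats earlier chunks adds nothing to bytes_written, which mirrors the "update only the metadata" behaviour described for duplicates.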

Assigning cores to aux-apps

•  Per-activity partition: dedicate a core to each aux-app
    •  Intra-Node: Dedicated cores are co-located with the main app
    •  Inter-Node: Dedicated cores are on specialized nodes
•  Shared partition: multiple cores for multiple aux-apps
    •  One service runs on multiple cores
    •  One core runs multiple services

Key FP runtime components for managing aux-apps

•  Benefactor: Software that runs on each node
    •  Manages a node’s contributions: SSD, memory, core
    •  Serves as a basic management unit in the system
    •  Provides services and communication layer between nodes
    •  Uses FUSE to provide a special transparent mount point
•  Manager: Software that runs on a dedicated node
    •  Manages and coordinates benefactors
    •  Schedules aux-apps and orchestrates data transfers
•  Manager and benefactors are application specific and utilize cores from the application’s allotment

Minimizing FP overhead

•  Minimize main memory consumption
    •  Use non-volatile memory, e.g. SSD, instead of DRAM
•  Minimize cache contention
    •  Schedule aux-apps based on socket boundaries
•  Minimize interconnection bandwidth consumption
    •  Coordinate the application and FP aux-apps
    •  Extend the ioctl call to the runtime to define blackout periods
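The slide does not show how blackout periods are represented once the application has declared them. As a rough sketch only, an aux-app could consult a list of time windows before it uses the interconnect. The blackout_window type and aux_io_allowed function below are hypothetical, and the ioctl request that would deliver the windows to the runtime is deliberately omitted because its format is not given.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one blackout period, in seconds since job start. */
typedef struct {
    double start_s;   /* inclusive */
    double end_s;     /* exclusive */
} blackout_window;

/* Returns 1 if an aux-app may use the interconnect at time now_s,
 * or 0 if now_s falls inside any declared blackout window. */
int aux_io_allowed(const blackout_window *w, size_t n, double now_s)
{
    assert(w != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (now_s >= w[i].start_s && now_s < w[i].end_s)
            return 0;
    return 1;
}
```

An aux-app that drains data would call aux_io_allowed before each transfer and defer the transfer while the application's own communication phase is in progress.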

Agenda

•  Motivation
•  Functional Partitioning (FP)
•  FP Case Studies
•  Evaluation
•  Conclusion

SSD-based checkpointing

•  FP can help compose a scalable service out of node-local checkpointing
•  Why SSD checkpointing: More efficient than memory-checkpointing
    •  Does not compete with app main-memory demands
    •  Provides fault tolerance
    •  Costs less
•  How: Aggregate SSD on multiple nodes as an aggregate buffer
    •  Provide faster transfer of checkpoint data to Parallel FS
    •  Utilize dedicated core memory for I/O speed matching

SSD-based checkpointing aux-app

[Figure: SSD-based checkpointing aux-app, configuration FP(1,8). Labeled elements: script that launches the application; compute nodes/benefactors, each running the application over a FUSE mount with a checkpointing core and an SSD; the aggregate SSD store; the manager with its meta-information; and the parallel file system]

Deduplication of checkpoint data

•  FP cores can be used to perform compute-intensive de-duplication, in-situ, on the node
•  Why: Reduce the data written and improve I/O throughput
•  How: Identify similar data across checkpoints
    •  If data is duplicate, update only the metadata
    •  Co-located with SSD-checkpointing app on the same core

Deduplication aux-app

[Figure: deduplication aux-app, configuration FP(1,8). Labeled elements: script that launches the application; compute nodes/benefactors, each running the application over a FUSE mount with a combined checkpointing and deduplication core; the aggregate SSD store; the manager with its meta-information; and the parallel file system]

Agenda

•  Motivation
•  Functional Partitioning (FP)
•  FP Case Studies
•  Evaluation
•  Conclusion

Evaluation objectives

•  How effective is functional partitioning?
•  How efficient is the SSD-checkpointing aux-app?
    •  Real world workload
    •  Synthetic workload
•  How efficient is the deduplication aux-app?
    •  Synthetic workload

Experimentation methodology

•  Testbed:
    •  20 nodes, 160 cores, 8 GB memory/node, Linux 2.6.27.10
    •  HDD model: WD3200AAJS SATA II 320 GB, 85 MB/s
    •  SSD model: Intel X25-E Extreme, sequential read 250 MB/s, sequential write 175 MB/s, capacity 32 GB
•  Workloads:
    •  FLASH: Real-world astrophysics simulation application
    •  Synthetic benchmark: A checkpoint application that generates 250 MB/process every barrier step
•  FP(x,y): dedicate x out of y cores on each node to aux-apps
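The FP(x,y) notation implies a simple split of each node's cores. The sketch below only works through that arithmetic; the fp_config type and both function names are hypothetical and are not part of the FP runtime.

```c
#include <assert.h>

/* Hypothetical holder for an FP(x,y) configuration. */
typedef struct {
    int aux_per_node;     /* x: cores per node dedicated to aux-apps */
    int cores_per_node;   /* y: total cores per node */
} fp_config;

/* Cores left for the application's main computation across all nodes. */
int fp_compute_cores(fp_config fp, int nodes)
{
    assert(nodes >= 0 && fp.aux_per_node >= 0 &&
           fp.aux_per_node <= fp.cores_per_node);
    return nodes * (fp.cores_per_node - fp.aux_per_node);
}

/* Cores dedicated to aux-apps across all nodes. */
int fp_aux_cores(fp_config fp, int nodes)
{
    assert(nodes >= 0 && fp.aux_per_node >= 0 &&
           fp.aux_per_node <= fp.cores_per_node);
    return nodes * fp.aux_per_node;
}
```

By this definition, if all 20 nodes of 8 cores are used, FP(1,8) leaves 140 cores for computation and 20 for aux-apps, FP(2,8) leaves 120 and 40, and FP(0,8) dedicates all 160 cores to computation.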

Impact of SSD-checkpointing using real-world workload

[Figure: bar chart of time (s, 0 to 200) at 80 and 160 total cores for four configurations: Local Disk non-FP(0,8), Local Disk non-FP(0,7), Aggregate Disk FP(1,8), and Aggregate SSD FP(1,8); annotated with 15%, 41%, 27%, and 44%]

FP effectively improves application end-to-end performance

SSD-checkpointing I/O throughput using synthetic workload

[Figure: I/O throughput (MB/s, 0 to 7000) vs. number of compute cores (0 to 160) for In Memory(20) FP(1,8) and SSD(20) FP(1,8)]

Compared with an in-memory approach, SSD checkpointing provides sustained high throughput without the memory overhead

Varying the number of benefactors

[Figure: I/O throughput (MB/s, 0 to 7000) vs. number of benefactors (5 to 40); series: In Memory FP(0-2,8)]

A small number of benefactors can significantly improve performance

Varying the number of compute cores

[Figure: I/O throughput (MB/s, 0 to 5000) vs. number of compute cores (10 to 150) for SSD FP(0-1,8) with N = 1, 2, 5, 10, and 20 benefactors]

Efficiency of De-duplication Aux-app Using FP(2,8)

[Figure: normalized I/O throughput (0 to 2) vs. number of compute cores (10 to 130) for Dedup(0.25), Dedup(0.50), Dedup(0.75), Dedup(0.90), and Non-dedup; annotated with 60%]

For compute-intensive tasks such as deduplication, assigning more cores to the aux-app improves end-to-end performance

Impact on end-to-end performance

[Figure: execution time, checkpointing time, and total time (s, 0 to 1400) vs. number of compute cores (20 to 160) for In Memory FP(1,8)]

FP effectively improves application end-to-end performance

Agenda

•  Motivation
•  Functional Partitioning (FP)
•  Sample core services
•  Evaluation
•  Conclusion

Conclusion

•  Created a novel FP runtime for many-core systems
    •  Transparent to applications
    •  Easy to use, flexible, and supports recyclable aux-apps
•  Implemented several sample FP support services
    •  SSD-based checkpointing
    •  Deduplication
    •  Format transformation
    •  Adaptive checkpoint data draining
•  Showed that FP can improve end-to-end application performance by reducing support activity time with minimal overhead to compute

Future work

•  Explore dynamic functional partitioning
•  Implement more FP-based services
•  Utilize FP for non-I/O-based activities

•  Contact information
    Virginia Tech: Min Li, Ali R. Butt -- [email protected], [email protected]
    ORNL: Sudharshan S. Vazhkudai -- [email protected]
    NCSU: Xiaosong Ma -- [email protected]

Backup slides

Adaptive checkpoint data draining

•  Why: Data cannot be stored in the SSD buffer forever
•  How: Lazily drain the data to PFS every k checkpoints
    •  Periodically update the manager with free space status
    •  The manager uses this info to determine when to drain
    •  Dedicated cores can be used to facilitate the draining and support tasks
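The slide names two inputs to the drain decision: the checkpoint count k and the free-space status that benefactors report to the manager. How the two are combined is not stated. The sketch below assumes one plausible rule: drain when k checkpoints have accumulated, or earlier if any benefactor's free space falls below a threshold. The types, the should_drain function, and the min_free_fraction threshold are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t free_bytes;       /* last free-space report from this benefactor */
    uint64_t capacity_bytes;   /* size of its SSD contribution */
} benefactor_status;

typedef struct {
    unsigned k;                  /* drain every k checkpoints */
    double   min_free_fraction;  /* assumed low-space threshold, e.g. 0.2 */
    unsigned since_last_drain;   /* checkpoints stored since the last drain */
} drain_policy;

/* Called by the (hypothetical) manager after each checkpoint completes.
 * Returns 1 if buffered data should be drained to the PFS now, else 0. */
int should_drain(drain_policy *p, const benefactor_status *b, size_t n)
{
    assert(p != NULL && p->k > 0 && (b != NULL || n == 0));
    p->since_last_drain++;

    int low_space = 0;
    for (size_t i = 0; i < n; i++) {
        if (b[i].capacity_bytes == 0)
            continue;   /* ignore benefactors that contribute no storage */
        double frac = (double)b[i].free_bytes / (double)b[i].capacity_bytes;
        if (frac < p->min_free_fraction)
            low_space = 1;
    }

    if (low_space || p->since_last_drain >= p->k) {
        p->since_last_drain = 0;   /* the drain resets the checkpoint counter */
        return 1;
    }
    return 0;
}
```

The low-space check lets the manager drain ahead of schedule so that a full SSD buffer does not stall the next checkpoint; with ample free space the policy reduces to the lazy every-k rule.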

Adaptive checkpoint data draining

[Figure: adaptive checkpoint data draining. Labeled elements: application launch; compute nodes/benefactors, each running the application over a FUSE mount with a checkpointing core and a draining core; the aggregate SSD store; the manager with its meta-information; and the parallel file system]

Deduplication aux-app

[Figure: deduplication aux-app with draining. Labeled elements: application launch; compute nodes/benefactors, each running the application over a FUSE mount with a combined checkpointing and deduplication core and a draining core; the aggregate SSD store; the manager with its meta-information; and the parallel file system]

Efficiency of de-duplication aux-app

[Figure: I/O throughput (MB/s, 0 to 5000) vs. number of compute cores (10 to 150) for Non-dedup, Dedup(0.90), Dedup(0.75), Dedup(0.50), Dedup(0.25), and Dedup(0.10)]

Using a core to support a deduplication aux-app improves I/O throughput and in turn improves end-to-end application performance