IO Workload Throttling on Supercomputers

11/12/18 1

Si Liu

Analyzing Parallel I/O

Nov 13, 2018

Team Members

Lei [email protected] Advanced Computing Center

Si [email protected] Advanced Computing Center

Issues of Parallel Shared Filesystem

• Achilles' heel of HPC: the filesystem is shared by all users on all nodes (sometimes across multiple clusters), which makes it a weak point of modern HPC.

• Overloading the metadata server degrades filesystem performance globally and can even make the filesystem unresponsive.

• Many practical applications (in computational fluid dynamics, quantum chemistry, machine learning, etc.) issue a huge number of IO requests in a very short time.

• No IO resource provisioning (e.g. of metadata server throughput or bandwidth) is strictly enforced in production at the user level or node level.

Potential Solutions

• System level
  o A strong parallel filesystem that can handle any kind of IO request from all users without losing efficiency, e.g., upgrading the MDS hardware to achieve better IO throughput
    ➢ Impractical, expensive, or of limited benefit
  o Burst buffer
    ➢ Needs extra hardware and software, and possibly changes to user code

• Application level
  o A well-designed workflow with a reasonable IO workload
    ➢ The recommended way
    ➢ Expertise required

• User level
  o Users give up planned IO work to avoid heavy IO requests, or decrease the number of jobs
    ➢ A compromise rather than a solution
  o An optimal system that keeps heavy IO work under control
    ➢ Without rewriting users' code

Lustre Architecture (NICS website)

https://www.nics.tennessee.edu/computing-resources/file-systems/lustre-architecture

Our Proposed User-side Solution

• Intercept IO-related functions (open(), stat(), etc.) within applications and keep a record of
  o IO operation time (response time)
  o IO operation frequency (calculated from the saved timestamps of recent function calls)

• Evaluate filesystem status (busy / moderately used / free)
  o Response time per operation

• Evaluate IO workloads (recent IO request frequency)
  o Node based and user based

• Insert proper delays when necessary
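
The bookkeeping listed above can be sketched in C roughly as follows. This is a minimal illustration under stated assumptions, not the OOOPS implementation: the names `io_history`, `io_record`, `io_frequency`, `io_fs_status`, and `io_delay`, the window size, and the threshold parameters are all hypothetical and do not come from the slides. Times are plain `double` seconds supplied by the caller, so the logic stays independent of any particular clock.

```c
#include <assert.h>
#include <stddef.h>

#define WINDOW 64   /* hypothetical: number of recent calls remembered */

typedef enum { FS_FREE, FS_MODERATE, FS_BUSY } fs_status;

typedef struct {
    double stamp[WINDOW];  /* start time of each remembered call (s) */
    double resp[WINDOW];   /* response time of each remembered call (s) */
    size_t count;          /* total number of calls recorded so far */
} io_history;

/* Record one intercepted call in a ring buffer, overwriting the oldest. */
void io_record(io_history *h, double start, double response)
{
    size_t slot = h->count % WINDOW;
    h->stamp[slot] = start;
    h->resp[slot] = response;
    h->count++;
}

/* Number of valid entries currently held in the ring buffer. */
static size_t io_valid(const io_history *h)
{
    return h->count < WINDOW ? h->count : WINDOW;
}

/* Calls per second over the last `span` seconds before `now`.
 * At most WINDOW calls are remembered, so very high rates are
 * underestimated; a real tool would size the window accordingly. */
double io_frequency(const io_history *h, double now, double span)
{
    assert(span > 0.0);
    size_t n = io_valid(h), recent = 0;
    for (size_t i = 0; i < n; i++)
        if (now - h->stamp[i] <= span)
            recent++;
    return (double)recent / span;
}

/* Classify the filesystem from the mean response time of remembered calls.
 * Both thresholds are assumptions chosen by the caller. */
fs_status io_fs_status(const io_history *h, double moderate_s, double busy_s)
{
    size_t n = io_valid(h);
    if (n == 0)
        return FS_FREE;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += h->resp[i];
    double mean = sum / (double)n;
    if (mean >= busy_s)
        return FS_BUSY;
    return mean >= moderate_s ? FS_MODERATE : FS_FREE;
}

/* Extra delay (s) per call that stretches the interval between calls
 * from 1/freq to 1/limit; zero when the rate is already within the limit. */
double io_delay(double freq, double limit)
{
    assert(limit > 0.0);
    if (freq <= limit)
        return 0.0;
    return 1.0 / limit - 1.0 / freq;
}
```

In this sketch an interposed function would call `io_record` after each real IO call, then sleep for `io_delay(...)` seconds before the next one whenever `io_fs_status` reports a busy filesystem. How OOOPS actually combines these signals is not stated in the slides.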

Optimal Overloaded IO Protection System (OOOPS)

• An innovative IO workload managing system that optimally controls the IO workload from the users' side.

• Automatically detects and throttles excessive IO workload from supercomputer users to protect parallel shared filesystems.

Function Interception

Without OOOPS loaded

User application
    write_data() { FILE *fOut; fOut = fopen(name, mode); … }

glibc version of open(), defined in libc.so
    open(name, mode, …) { … }

With OOOPS loaded (LD_PRELOAD OOOPS library)

User application
    write_data() { FILE *fOut; fOut = fopen(name, mode); … }

OOOPS version of open(), defined in ooops.so
    open(name, mode, …) { … open(name, mode, …); … }

glibc version of open(), defined in libc.so
    open(name, mode, …) { … }
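
The interception pattern on this slide can be illustrated with a generic LD_PRELOAD wrapper. The sketch below is not the OOOPS source. It assumes a glibc/Linux system, forwards to the next definition of `open()` through `dlsym(RTLD_NEXT, ...)`, and uses a hypothetical counter `intercepted_opens` as a stand-in for whatever bookkeeping or delay a real tool would perform.

```c
#define _GNU_SOURCE          /* needed for RTLD_NEXT on glibc */
#include <dlfcn.h>
#include <errno.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stddef.h>
#include <sys/types.h>

typedef int (*open_fn)(const char *, int, ...);

/* Hypothetical bookkeeping: number of open() calls seen by the wrapper. */
unsigned long intercepted_opens = 0;

/* open() takes a third argument only when a file may be created. */
static int needs_mode(int flags)
{
#ifdef O_TMPFILE
    if ((flags & O_TMPFILE) == O_TMPFILE)
        return 1;
#endif
    return (flags & O_CREAT) != 0;
}

/* Same name and signature as the libc function, so the dynamic linker
 * resolves the application's calls here when this library is preloaded. */
int open(const char *path, int flags, ...)
{
    static open_fn real_open = NULL;
    mode_t mode = 0;

    if (needs_mode(flags)) {
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t)va_arg(ap, int);  /* mode_t is promoted to int */
        va_end(ap);
    }

    if (real_open == NULL) {
        /* Look up the next open() in link order, normally the one in libc. */
        real_open = (open_fn)dlsym(RTLD_NEXT, "open");
        if (real_open == NULL) {
            errno = ENOSYS;
            return -1;
        }
    }

    intercepted_opens++;
    /* A throttling tool would time the call and possibly sleep here. */
    return real_open(path, flags, mode);
}
```

Assuming the file is saved as `wrap.c` (a hypothetical name), it could be built with something like `gcc -shared -fPIC -o libwrap.so wrap.c -ldl` and activated with `LD_PRELOAD=./libwrap.so ./app`. Depending on the C library, `fopen()` may reach the kernel through an internal symbol rather than the exported `open()`, so a practical tool would likely need to cover more entry points than this sketch does.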

IO Requests with Different Settings

Example of Running OpenFOAM

Example of Running TensorFlow

Example of Dynamically Throttling IO Requests

OOOPS Highlights

• Convenient for HPC users
  o No source code modification at all on the users' side
  o Little or no workflow update on the users' side
  o Self-driven slowdown of IO work when necessary

• Valuable on supercomputers
  o Protects the filesystem from overloaded IO requests
  o Little overhead: minimal or slight influence on performance, except for some jobs performing excessive IO work
  o Easy to deploy on an arbitrary cluster as long as the filesystem is POSIX compliant
  o Scales up to supercomputers of any size
  o Little work for system administrators
  o Dynamically controls running jobs' IO requests without interruption

Limitations

• The IO resource provisioning policy is too simple.

• OOOPS causes noticeable performance degradation for jobs that sustain very intensive IO for a significant time.

Conclusion

• We developed a new tool (OOOPS) to help
  ✓ users carry out heavy IO work that was originally not allowed
  ✓ administrators protect the cluster from overload

• We enforce a fair-sharing IO resource provisioning policy on the client side in practice (instead of on the server side)
  ✓ Treat IOPS / metadata server throughput as a resource
  ✓ Increase system capacity (for applications with heavy IO load)
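
One way to picture "IOPS as a resource" on the client side is a token bucket whose refill rate is a process's share of a per-node budget. The sketch below is only an illustration of that idea. The slides do not describe the actual OOOPS policy, and the even split in `fair_share`, the `io_bucket` type, and all parameter names are assumptions made here.

```c
#include <assert.h>

typedef struct {
    double rate;      /* IO operations granted per second */
    double capacity;  /* largest burst allowed without waiting */
    double tokens;    /* current balance; negative means debt */
    double last;      /* time of the last refill (s) */
} io_bucket;

/* Assumed policy: split a node-wide IOPS budget evenly among processes. */
double fair_share(double node_iops, int active_procs)
{
    assert(node_iops > 0.0 && active_procs > 0);
    return node_iops / (double)active_procs;
}

/* Start with a full bucket at time `now`. */
void bucket_init(io_bucket *b, double rate, double capacity, double now)
{
    assert(rate > 0.0 && capacity >= 1.0);
    b->rate = rate;
    b->capacity = capacity;
    b->tokens = capacity;
    b->last = now;
}

/* Charge one IO operation at time `now` and return how many seconds
 * the caller should wait before issuing it (0 if within budget). */
double bucket_acquire(io_bucket *b, double now)
{
    double elapsed = now - b->last;
    if (elapsed > 0.0) {
        b->tokens += elapsed * b->rate;       /* refill for elapsed time */
        if (b->tokens > b->capacity)
            b->tokens = b->capacity;
        b->last = now;
    }
    b->tokens -= 1.0;
    if (b->tokens >= 0.0)
        return 0.0;
    return -b->tokens / b->rate;              /* time to repay the debt */
}
```

With a hypothetical node budget of 1000 operations per second shared by 4 processes, each bucket refills at 250 per second. A process that has used up its burst allowance is then told to wait about 4 ms per additional call, while processes that stay under their share are never delayed.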

Acknowledgement

Colleagues at TACC
● Zhao Zhang
● Junseong Heo
● Tommy Minyard
● Robert McLay
● Bill Barth
● John Cazes

Stampede2 early users of OOOPS

Other HPC centers
● Davide Del Vento (NCAR)
● Kevin Manalo (JHU)
