the genesis of clusterlib - an open source library to tame your favourite supercomputer

The genesis of clusterlib An open source library to tame your favourite supercomputer Arnaud Joly May 2015, Phd hour discussions

Upload: arnaud-joly

Post on 12-Aug-2015


The genesis of clusterlib
An open source library to tame your favourite supercomputer

Arnaud Joly

May 2015, PhD hour discussions

Use case for the birth of clusterlib
Solving supervised learning tasks

The goal of supervised learning is to learn a function from input-output pairs in order to predict the output for any new input.

[Figure: A supervised learning task]

[Figure: A learnt function]

Time is running out

A huge set of tasks and a practical upper bound

O(#datasets×#algorithms×#parameters) ≤ Time before deadline

Embarrassingly parallel tasks.
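
The bound above can be made concrete by enumerating the task grid. This is a minimal sketch in which the dataset names, algorithm names and parameter values are made-up placeholders, not taken from the talk. Each combination is an independent task, which is what makes the workload embarrassingly parallel.

```python
from itertools import product

# Hypothetical placeholders, not taken from the talk.
datasets = ["dataset_a", "dataset_b", "dataset_c"]
algorithms = ["random_forest", "linear_model"]
parameters = [0.01, 0.1, 1.0, 10.0]

# Every (dataset, algorithm, parameter) triple is one independent task.
tasks = list(product(datasets, algorithms, parameters))

# The number of tasks is the product of the three sizes.
n_tasks = len(datasets) * len(algorithms) * len(parameters)

print("%d independent tasks to run" % n_tasks)
```

Since no task depends on another, each triple can be submitted to the scheduler as its own job.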

Supercomputer = scheduler + workers
Supercomputer = cluster of computers

CECI is in!

Thanks to the CECI¹ at the University of Liège, we have access to

7 supercomputers (or clusters of computers);
≈ 20 000 cores;
> 60 000 GB RAM;
12 high performance scientific GPUs.

¹ The CECI (http://www.ceci-hpc.be/) is a supercomputer consortium funded by the FNRS.

A glimpse of the SLURM scheduler user interface
How to submit a job?

First, we need to write a bash file job.sh which specifies resource requirements and calls the program.

#!/bin/bash
#SBATCH --job-name=job-name
#SBATCH --time=10:00
#SBATCH --mem=1000
srun hostname

Then you can launch the job in the queue:

$ sbatch job.sh

How to easily launch many jobs?

With clusterlib, how to easily launch many jobs?

Let’s generate job submission commands on the fly!

>>> import os
>>> from clusterlib.scheduler import submit
>>> script = submit(job_command="srun hostname",
...                 job_name="test",
...                 time="10:00",
...                 memory=1000,
...                 backend="slurm")
>>> print(script)
echo '#!/bin/bash
srun hostname' | sbatch --job-name=test --time=10:00 --mem=1000

>>> # Let's launch the job
>>> os.system(script)
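
To illustrate the idea of generating submission commands on the fly, here is a minimal sketch of a command builder. The helper build_sbatch_command is hypothetical and is not clusterlib's implementation; it only mirrors the shape of the command printed above, using the sbatch options --job-name, --time and --mem.

```python
import shlex


def build_sbatch_command(job_command, job_name, time, memory):
    """Return a shell command that pipes a small bash script to sbatch.

    Hypothetical helper for illustration only; not clusterlib's code.
    """
    script = "#!/bin/bash\n%s" % job_command
    options = "--job-name=%s --time=%s --mem=%s" % (
        shlex.quote(job_name), shlex.quote(time), memory)
    # shlex.quote protects the script text from the shell.
    return "echo %s | sbatch %s" % (shlex.quote(script), options)


# One command per job, each with its own unique name.
commands = [
    build_sbatch_command("srun hostname", "job-%04d" % i, "10:00", 1000)
    for i in range(3)
]
for command in commands:
    print(command)
```

Building the command as a string keeps the submission step separate from the launch step, so the commands can be inspected before anything reaches the queue.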

A glimpse of the SLURM scheduler user interface
How to check if a job is running?

To check if a job is running, you can use the squeue command.

$ squeue -u `whoami`
JOBID    PARTITION  NAME      USER     ST  TIME     NODES  NODELIST(REASON)
1225128  defq       job-9999  someone  PD  0:00     1      (Priority)
1225129  defq       job-9998  someone  PD  0:00     1      (Priority)
...
1224607  defq       job-0003  someone  R   7:39:16  1      node025
1224605  defq       job-0002  someone  R   7:43:25  1      node040
1224593  defq       job-0001  someone  R   8:06:33  1      node035
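
The job names can be read back from this listing programmatically. The sketch below parses text in the format shown above. The function job_names_from_squeue is a hypothetical helper, and it assumes that job names contain no whitespace. On a real cluster the text would come from running squeue, for instance through the subprocess module.

```python
def job_names_from_squeue(output):
    """Extract the NAME column from the text printed by squeue.

    Hypothetical helper; assumes job names contain no whitespace.
    """
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    # Locate the NAME column from the header line.
    name_index = lines[0].split().index("NAME")
    names = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) > name_index:
            names.append(fields[name_index])
    return names


sample = """\
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1225128 defq job-9999 someone PD 0:00 1 (Priority)
1224593 defq job-0001 someone R 8:06:33 1 node035
"""
print(job_names_from_squeue(sample))
```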

How to avoid launching running, queued or completed jobs?

With clusterlib, how to avoid re-launching ... ?

... running and queued jobs?
Let’s give each job a unique name, then retrieve the names of running / queued jobs.

>>> from clusterlib.scheduler import queued_or_running_jobs
>>> queued_or_running_jobs()
["job-0001", "job-0002", ..., "job-9999"]

... completed jobs?
A job must indicate through the file system that it has been completed, e.g. by creating a file or by registering completion into a database.
clusterlib provides a small NoSQL database based on sqlite3.
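
To show how completion can be registered in a small key-value store on top of sqlite3, here is a minimal sketch using only the standard library. The functions kv_dump and kv_load and the table name kv are hypothetical; this is not clusterlib's implementation, only an illustration of the approach.

```python
import os
import pickle
import sqlite3
import tempfile


def kv_dump(mapping, path):
    """Store each key/value pair of ``mapping`` in the sqlite3 file at ``path``."""
    connection = sqlite3.connect(path)
    try:
        with connection:  # commits on success, rolls back on error
            connection.execute(
                "CREATE TABLE IF NOT EXISTS kv "
                "(key TEXT PRIMARY KEY, value BLOB)")
            connection.executemany(
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
                [(key, pickle.dumps(value)) for key, value in mapping.items()])
    finally:
        connection.close()


def kv_load(path):
    """Return all stored pairs as a dict, or an empty dict if nothing is stored."""
    if not os.path.exists(path):
        return {}
    connection = sqlite3.connect(path)
    try:
        rows = connection.execute("SELECT key, value FROM kv").fetchall()
    except sqlite3.OperationalError:  # the table has not been created yet
        return {}
    finally:
        connection.close()
    return {key: pickle.loads(value) for key, value in rows}


# A job registers its completion, the launcher reads it back later.
path = os.path.join(tempfile.mkdtemp(), "job.sqlite3")
kv_dump({"main.py --param 0": "JOB DONE"}, path)
done_jobs = kv_load(path)
print("main.py --param 0" in done_jobs)
```

A single file on a shared file system is enough here, which avoids running a database server on the cluster.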

Simplicity beats complexity

With only 4 functions

from clusterlib.scheduler import queued_or_running_jobs
from clusterlib.scheduler import submit
from clusterlib.storage import sqlite3_dumps
from clusterlib.storage import sqlite3_loads

- We easily launch thousands of jobs.
- We are task DRY (Don’t repeat yourself).
- We work smoothly on SLURM and SGE schedulers.
- We depend only on Python.

Are we done?

Let’s go open source!


Why make an open source library?

Give back to the open source community.

Open Source Initiative affiliate communities

Bug reports are great!

Welcome new contributors!

From left to right: Olivier Grisel, Antonio Sutera, Loic Esteve (no photo) and Konstantin Petrov (no photo).

Why go open source in science? Reproducibility!

Be proud of your code

Open source way: Host your code publicly

www.myproject.com

Another awesome host platform

Tip Sit on the shoulders of a giant!

Bonus Use a version control system such as git (clusterlib’s choice), mercurial, ...

Open source way: Choose a license

No license = closed source
In short
You can’t use or even read my code.

GPL-like license = copyleft
In short
You can read / use / share / modify my code, but derivatives must retain those rights.

BSD / MIT-style license = permissive
In short
Do whatever you want with the code, but keep my name with it.

For the sake of open source, pick a popular open source license.
For the sake of wisdom or money, choose carefully.
For the sake of science, go with a BSD / MIT-style license.

Clusterlib is BSD licensed.

Open source way: Start an issue tracker

An issue tracker allows managing and maintaining a list of issues.

Tip Sit on the shoulders of giants! (again)

Open source way: Let users discuss with core contributors

1. Issue tracker (only viable for small projects)
2. Mailing list: sourceforge, google groups, ...
3. Stack Overflow tag for big projects

Tip Sit on the shoulders of giants! (again * 2)

Are we done?

The grand seduction!


Know who you are!

Vision
The goal of clusterlib is to ease the creation, launch and management of embarrassingly parallel jobs on supercomputers with schedulers such as SLURM and SGE.

Core values
Pure Python, simple, user-friendly!

Attractive documentation

Readme

API documentation is nice.

Tip Follow a standard such as "PEP 0257 – Docstring Conventions".
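
As an illustration of the convention, here is a minimal sketch of a docstring laid out as PEP 257 recommends: a one-line summary, a blank line, then a longer description. The function job_name_for is a made-up example and is not part of clusterlib.

```python
def job_name_for(param):
    """Return the unique job name associated with a parameter value.

    The name is used to recognise a job in the scheduler queue, so two
    different parameter values must never map to the same name.
    """
    return "job-param=%s" % param


# Documentation tools read the docstring from the __doc__ attribute.
print(job_name_for.__doc__.splitlines()[0])
```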

Beautiful doc is better.

clusterlib uses Sphinx for building its doc.

And even better with examples.

A narrative documentation is awesome.

What’s new?

Thanks to all clusterlib contributors!

Appealing test suite

How good is your test suite (code coverage)?
All code lines are hit by the tests (100% line coverage), but not all code paths (branches) are tested.

$ make test
nosetests clusterlib doc
Name                   Stmts   Miss  Branch  BrMiss  Cover   Missing
------------------------------------------------------------------
clusterlib                 1      0       0       0   100%
clusterlib.scheduler      82      0      40      16    87%
clusterlib.storage        35      0      14       0   100%
------------------------------------------------------------------
TOTAL                    118      0      54      16    91%
------------------------------------------------------------------
Ran 12 tests in 0.189s

OK (SKIP=3)
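
The Cover column can be recomputed from the other columns. The sketch below assumes that the reported figure counts statements and branches together, that is (Stmts + Branch - Miss - BrMiss) / (Stmts + Branch). This assumption reproduces the percentages of the report above, but the formula is not stated in the talk.

```python
def combined_coverage(stmts, miss, branch, br_miss):
    """Percentage of statements and branches exercised, rounded to an integer."""
    total = stmts + branch
    covered = total - miss - br_miss
    return int(round(100.0 * covered / total))


# (Stmts, Miss, Branch, BrMiss) copied from the report above.
report = {
    "clusterlib": (1, 0, 0, 0),
    "clusterlib.scheduler": (82, 0, 40, 16),
    "clusterlib.storage": (35, 0, 14, 0),
    "TOTAL": (118, 0, 54, 16),
}
for name, counts in report.items():
    print("%-22s %3d%%" % (name, combined_coverage(*counts)))
```

No statement is missed, so the 16 untested branches in clusterlib.scheduler alone account for the gap between 100% and 91%.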

Publicity

Are we done?

Not yet! Let’s make our lives easier!

Automation, automation, ...

Continuous testing
clusterlib uses Travis CI (works for many languages).

Tip Sit on the shoulders of giants! (again * 3)

Continuous integration pays off in the long term!
It ensures the test suite is run often. Awesome during development.

Tip Continuous integration can be enhanced with test coverage and code quality checks.

Continuous doc building

Tip Sit on the shoulders of giants! (again * 4)

Create and join open source projects!

Join a community! Learn the best practices and technologies! Make your code outlive a single project! Give your code to the world!

scikit-learn sprint 2014

A full example of clusterlib usage

# main.py
import os
import sys

from clusterlib.storage import sqlite3_dumps

NOSQL_PATH = os.path.join(os.environ["HOME"], "job.sqlite3")


def main(argv=None):
    # For ease here, function parameters are sys.argv
    if argv is None:
        argv = sys.argv
    # Do heavy computation
    # Save script evaluation on the hard disk


if __name__ == "__main__":
    main()
    # Great, the job is done!
    # Key the record on the full command line, interpreter included, so that
    # it matches the job_command string built by launcher.py.
    job_key = " ".join([sys.executable] + sys.argv)
    sqlite3_dumps({job_key: "JOB DONE"}, NOSQL_PATH)

A full example of clusterlib usage

# launcher.py
import os
import sys

from clusterlib.scheduler import queued_or_running_jobs
from clusterlib.scheduler import submit
from clusterlib.storage import sqlite3_loads

from main import NOSQL_PATH

if __name__ == "__main__":
    # Names of jobs already in the queue, and commands already completed
    scheduled_jobs = set(queued_or_running_jobs())
    done_jobs = sqlite3_loads(NOSQL_PATH)

    for param in range(100):
        job_name = "job-param=%s" % param
        job_command = "%s main.py --param %s" % (sys.executable, param)
        # Submit only jobs that are neither queued, running nor completed
        if job_name not in scheduled_jobs and job_command not in done_jobs:
            script = submit(job_command, job_name=job_name)
            # Launch the job, as shown earlier in the talk
            os.system(script)