Duke Docker Day 2014: Research Applications with Docker

Automate Analyses, Reuse Them, Allow Reproducibility Research Analysis Applications with Docker: Duke Office of Information Technology 9/11/14 | Duke Docker Day

Upload: darin-london

Post on 02-Jul-2015


DESCRIPTION

This talk was presented at the first annual Duke Docker Day, hosted by the Duke Office of Information Technology. It describes a reproducible analysis pipeline using Docker images.

TRANSCRIPT

Page 1: Duke Docker Day 2014: Research Applications with Docker

Research Analysis Applications with Docker: Automate Analyses, Reuse Them, Allow Reproducibility

Duke Office of Information Technology | 9/11/14 | Duke Docker Day

Page 2: Duke Docker Day 2014: Research Applications with Docker

Docker Concepts

▪ Build context: Directory with a Dockerfile, and any files to be added to the image to be built

▪ Image:
  ▫ Like a VM image
  ▫ Run to produce a container
  ▫ Multiple containers can be produced from the same image
  ▫ Images are shared on a hub
  ▫ Foundation of reusability and reproducibility

▪ Container:
  ▫ Like a VM machine instance
  ▫ A running instance of an image

▪ Hub:
  ▫ Network accessible repository of named docker images
  ▫ https://registry.hub.docker.com is the world’s repo of docker images
  ▫ Can be hosted internally for private images
  ▫ docker commandline is aware of a hub (registry.hub.docker.com by default, but configurable)
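
As a rough illustration of how these concepts map onto the command line, the sketch below wraps three basic steps in shell functions: build an image from a build context, run a container from it, and share the image on a hub. The function names, the DOCKER variable, and the image name example/my_analysis are assumptions made for this illustration and do not come from the talk. Setting DOCKER="echo docker" prints the commands instead of running them, and DOCKER="sudo docker" covers hosts where sudo is required.

```shell
#!/bin/sh
# Illustrative helpers only; the function names are hypothetical, not part of Docker.
# DOCKER may be overridden, e.g. DOCKER="sudo docker" or DOCKER="echo docker" (dry run).
DOCKER=${DOCKER:-docker}

# Build an image from a build context (a directory containing a Dockerfile).
build_image() {
    # $1 = image name, $2 = build context directory
    $DOCKER build -t "$1" "$2"
}

# Run the image to produce a container; --rm removes the container when it exits.
run_container() {
    # $1 = image name, remaining arguments are passed to the container
    image=$1
    shift
    $DOCKER run --rm "$image" "$@"
}

# Push the named image to the configured hub so that others can pull and run it.
share_image() {
    # $1 = image name
    $DOCKER push "$1"
}
```

A typical session under these assumptions would be `build_image example/my_analysis .` followed by `run_container example/my_analysis`.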

Page 3: Duke Docker Day 2014: Research Applications with Docker

Docker commandline interface

▪ Requires sudo (unless on a Mac, or specially configured by sysadmins)

▪ https://docs.docker.com/reference/commandline/cli/

▪ https://docs.docker.com/reference/run/

Page 4: Duke Docker Day 2014: Research Applications with Docker

Dockerfile

▪ https://docs.docker.com/reference/builder/

▪ http://devo.ps/blog/docker-dos-and-donts/

▪ Tension between lots of RUN statements vs. a single RUN of a big, all-inclusive installation process (shell, puppet, ansible, etc.)
  ▫ Lots of RUNs can get hard to maintain
  ▫ You lose all the benefits of caching if you just run a single installation process

▪ Look for the golden mean. Maybe run multiple installation processes with the aim of adding related functionality as a group
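
One possible reading of the "golden mean" is sketched below: a shell helper that writes a Dockerfile with two RUN statements, each installing one group of related functionality. The helper name is hypothetical, and the base image and package names (centos:centos6, epel-release, bwa, samtools) are assumptions chosen to match the demo application described later in the talk; the author's actual Dockerfile may differ.

```shell
#!/bin/sh
# Hypothetical helper: writes an example Dockerfile into a build context directory.
write_grouped_dockerfile() {
    # $1 = build context directory (created if missing)
    mkdir -p "$1" || return 1
    cat > "$1/Dockerfile" <<'EOF'
FROM centos:centos6

# Group 1: package repositories. This layer stays cached as long as
# this line and the lines above it do not change.
RUN yum install -y epel-release

# Group 2: the analysis tools, installed together because they are related.
RUN yum install -y bwa samtools && yum clean all
EOF
}
```

With this layout, editing the second RUN line reuses the cached layer from the first, while the file stays short enough to maintain.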

Page 5: Duke Docker Day 2014: Research Applications with Docker

DEMO Plasmodium Alignment

A Research Analysis Pipeline WITH a Reproducible Exemplar!

https://github.com/dmlond/docker_bwa_aligner

Page 6: Duke Docker Day 2014: Research Applications with Docker

What is a Docker Application?

▪ Wraps the logic for exposing a single process interface (many processes may run in the background, but generally only one is exposed to the user)

▪ Can run much like an installed application

Page 7: Duke Docker Day 2014: Research Applications with Docker

Example: dmlond/bwa_aligner

▪ It’s a Perl script
  ▫ In a container built to have its own special *nix environment
    ▪ Starts from centos:centos6
    ▪ Adds its own user ‘bwa_user’ with its own HOMEDIR /home/bwa_user
    ▪ Adds the EPEL repo
    ▪ Adds bwa and samtools from EPEL using yum (one could download source and compile just as easily)
  ▫ Hosted on github so you can view its build context, and build it yourself from scratch: https://github.com/dmlond/bwa_aligner
  ▫ Hosted on dockerhub so you can run it on your own machine: https://registry.hub.docker.com/u/dmlond/bwa_aligner
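
The user setup described above might look roughly like the Dockerfile written by the helper below. This is a sketch under stated assumptions, not the author's file: the helper name, the script name bwa_aligner.pl, its install path, and the "-h" default argument are all hypothetical. The build context on github is the authoritative version.

```shell
#!/bin/sh
# Hypothetical helper: writes a Dockerfile that runs its application as a non-root user.
# The script name bwa_aligner.pl, its path, and the "-h" default are assumptions.
write_app_dockerfile() {
    # $1 = build context directory (created if missing)
    mkdir -p "$1" || return 1
    cat > "$1/Dockerfile" <<'EOF'
FROM centos:centos6

# Create an unprivileged user with its own home directory.
RUN useradd -m -d /home/bwa_user bwa_user

# Install the application script while still running as root.
ADD bwa_aligner.pl /usr/local/bin/bwa_aligner.pl
RUN chmod 755 /usr/local/bin/bwa_aligner.pl

# Everything from here on, including the running container, uses bwa_user.
USER bwa_user
WORKDIR /home/bwa_user
ENTRYPOINT ["/usr/local/bin/bwa_aligner.pl"]
CMD ["-h"]
EOF
}
```

Because ENTRYPOINT names the script and CMD only supplies default arguments, arguments given to docker run after the image name replace the CMD and are passed to the script.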

Page 8: Duke Docker Day 2014: Research Applications with Docker

What is a Volume Container?

• Image contains the logic for exposing one or more distinct directory trees to other Docker containers

• Running the image to produce a container exposes its own version of the specified directory tree

• A volume container can run and immediately exit, but its specific directory tree stays around for use in other containers

• Designed to be run with --name $name

• Other containers access a volume container’s exposed directory trees by passing its $name at run time using the --volumes-from run parameter

• When you rm a volume container, all files in its specific directory tree are destroyed

Page 9: Duke Docker Day 2014: Research Applications with Docker

Example: dmlond/bwa_reference_volume

▪ Dockerfile exposes /home/bwa_user/bwa_indexed

▪ When the container runs (with a name), it exits immediately

▪ A container can add files to the volume container directory (dmlond/bwa_reference)

▪ A container can read files in the volume container directory (dmlond/bwa_aligner)

▪ Each container created from the image has its own distinct existence. Writes to the /home/bwa_user/bwa_indexed directory tree in one container do not affect the directory trees in other dmlond/bwa_reference_volume containers
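
The volume container life cycle can be sketched with three shell functions. The function names and the DOCKER variable are assumptions for illustration; the image names are the ones on this slide, and the arguments each image expects are not shown here because they are defined in the author's repositories. Set DOCKER="echo docker" to print the commands rather than run them.

```shell
#!/bin/sh
# Illustrative helpers only; the function names are hypothetical.
DOCKER=${DOCKER:-docker}

# Create a named volume container; it exits at once but its directory tree persists.
create_reference_volume() {
    # $1 = container name
    $DOCKER run --name "$1" dmlond/bwa_reference_volume
}

# Run another image with the volume container's directory tree mounted.
run_with_reference() {
    # $1 = volume container name, $2 = image, remaining arguments go to the container
    name=$1
    image=$2
    shift 2
    $DOCKER run --rm --volumes-from "$name" "$image" "$@"
}

# Remove the volume container; -v also removes the volumes associated with it.
remove_reference_volume() {
    # $1 = container name
    $DOCKER rm -v "$1"
}
```

Creating two volume containers under different names gives two independent copies of the directory tree, which matches the last point above.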

Page 10: Duke Docker Day 2014: Research Applications with Docker

FROM Here to Eternity and Beyond

▪ You can extend an existing image to have new functionality using your own build context

▪ Use intermediate container names, and tagging
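
A minimal sketch of extending an image, assuming a new build context whose Dockerfile starts FROM the existing image. The function names, the working name my_agent_dev, and the final name example/my_agent:0.1 are hypothetical, and treating "intermediate names" as a working image name that is tagged afterwards is one interpretation of the point above.

```shell
#!/bin/sh
# Illustrative helpers only; the function and image names are hypothetical.
DOCKER=${DOCKER:-docker}

# Write a build context whose Dockerfile starts FROM an existing image.
write_extension_context() {
    # $1 = build context directory, $2 = base image to extend
    mkdir -p "$1" || return 1
    printf 'FROM %s\n' "$2" > "$1/Dockerfile"
    # ADD/RUN/ENTRYPOINT lines for the new functionality would be appended here.
}

# Build the extended image under a working name, then tag it for sharing.
build_and_tag() {
    # $1 = build context, $2 = working image name, $3 = final name[:tag]
    $DOCKER build -t "$2" "$1" && $DOCKER tag "$2" "$3"
}
```

For example, `write_extension_context ./agent dmlond/bwa_aligner` followed by `build_and_tag ./agent my_agent_dev example/my_agent:0.1`.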

Page 11: Duke Docker Day 2014: Research Applications with Docker

DEMO Agents

Reusing and Extending the Applications from the Plasmodium Alignment Exemplar!

https://github.com/dmlond/split_agent [ https://github.com/dmlond/split_raw/blob/master/split_raw.pl ]

https://github.com/dmlond/bwa_aligner_agent [ https://github.com/dmlond/bwa_aligner ]

Page 12: Duke Docker Day 2014: Research Applications with Docker

Old School *nix is Cool Again!!!! (For better or worse)

▪ STDIN, STDOUT, STDERR

▪ $?, the exit status

▪ Wrapper scripts

▪ Usage statements

▪ Building a containerized app feels like compiling a C application, e.g., edit, build, run, repeat
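
These conventions come together in a small wrapper script. The sketch below is a hypothetical example (the names run_app and usage are not from the talk): it prints a usage statement to STDERR when called incorrectly, keeps STDIN open so data can be piped into the container, and passes the container's exit status back to the caller through $?.

```shell
#!/bin/sh
# Hypothetical wrapper; DOCKER may be overridden, e.g. DOCKER="sudo docker".
DOCKER=${DOCKER:-docker}

usage() {
    # Usage statements go to STDERR so they never mix with data on STDOUT.
    echo "usage: run_app IMAGE [ARG...]" >&2
}

run_app() {
    if [ "$#" -lt 1 ]; then
        usage
        return 2
    fi
    app_image=$1
    shift
    # -i keeps STDIN open for piped input; --rm removes the container on exit.
    $DOCKER run --rm -i "$app_image" "$@"
    app_status=$?
    if [ "$app_status" -ne 0 ]; then
        echo "run_app: $app_image exited with status $app_status" >&2
    fi
    # Hand the exit status back so callers can test $? as with any *nix tool.
    return "$app_status"
}
```

A caller can then chain steps in the usual way, for example `run_app some/image < input.txt > output.txt || echo "step failed" >&2`, where some/image stands for any image that reads STDIN and writes STDOUT.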

Page 13: Duke Docker Day 2014: Research Applications with Docker

Security

▪ Unlike a traditional VM, a docker container can access some host resources

▪ DO NOT RUN AS ROOT BY DEFAULT!
  ▫ USER + ENTRYPOINT + CMD + WORKDIR
  ▫ These can be overridden at run time
  ▫ These can be overridden in new images built FROM them to ‘extend’ them

▪ Be very specific with commands; rely on wildcards and shell/exec commands sparingly

▪ Use the same paranoid practices in your container apps that you use in web/cgi applications:
  ▫ use open(["cmd","arg1","arg2"]) instead of open("cmd arg1 arg2")
  ▫ check for tainted input
  ▫ watch for wildcards in filenames, especially if doing chmod, chown, chgrp, rsync, etc. ( http://www.defensecode.com/public/DefenseCode_Unix_WildCards_Gone_Wild.txt )

http://opensource.com/business/14/9/security-for-docker
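
The last two practices have direct shell equivalents, sketched below with hypothetical function names. The first function runs a command from a list of separate words, which is the shell analogue of the list form of open: nothing is re-parsed, so shell metacharacters inside an argument stay inert. The second guards a wildcard expansion against file names that look like options.

```shell
#!/bin/sh
# Run a command with each argument passed as a separate word. Nothing is
# re-parsed by a shell, so "a; rm -rf x" stays one harmless argument.
run_argv() {
    "$@"
}

# Change permissions on files whose names may come from a wildcard.
# "--" ends option parsing, so a file named "-R" is treated as a file.
set_mode() {
    # $1 = mode, remaining arguments are file names
    mode=$1
    shift
    chmod "$mode" -- "$@"
}
```

In a directory that contains a file named -R, `set_mode 600 *` changes that file's permissions instead of switching chmod into recursive mode.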

Page 14: Duke Docker Day 2014: Research Applications with Docker

Acknowledgements:

▪ Duke Office of Research Informatics

▪ ORI Research Application Development Group

▪ Duke Office of Information Technology

▪ Mark Delong (Duke Research Computing)

▪ Chris Collins (OIT)

▪ Erich Huang (Duke School of Medicine)

▪ Greg Crawford (Genomics and Computational Biology)

▪ Rutger Vos (Naturalis)

Page 15: Duke Docker Day 2014: Research Applications with Docker

References

1. Stodden VC. Reproducible research: Addressing the need for data and code sharing in computational science. Computing in Science & Engineering. 2010.

2. Stodden V, Guo P, Ma Z. Toward Reproducible Computational Research: An Empirical Analysis of Data and Code Policy Adoption by Journals. Zaykin D, editor. PLoS ONE. Public Library of Science; 2013;8(6):e67111.

3. Collins FS, Tabak LA. Policy: NIH plans to enhance reproducibility. Nature 505, 612–613 (30 January 2014).

4. Announcement: Reducing our irreproducibility. Nature News. 2013 Apr 25;496(7446):398–8.

5. Ince D. C., Hatton L., Graham-Cumming J. The Case for Open Computer Programs. Nature 482, 485–488 (23 February 2012).

6. Dudley JT, Butte AJ (2009) A Quick Guide for Developing Effective Bioinformatics Programming Skills. PLoS Comput Biol 5(12): e1000589. doi:10.1371/journal.pcbi.1000589.

7. Noble WS (2009) A Quick Guide to Organizing Computational Biology Projects. PLoS Comput Biol 5(7): e1000424. doi:10.1371/journal.pcbi.1000424.

Page 16: Duke Docker Day 2014: Research Applications with Docker

Any Questions?

Page 17: Duke Docker Day 2014: Research Applications with Docker

https://www.docker.com/whatisdocker

▪ Open platform for developers and sysadmins to build, ship, and run distributed applications

▪ Consists of Docker Engine, a portable, lightweight runtime and packaging tool, and Docker Hub, a cloud service for sharing applications and automating workflows

▪ Enables apps to be quickly assembled from components and eliminates the friction between development, QA, and production environments

▪ IT can ship faster and run the same app, unchanged, on laptops, data center VMs, and any cloud

Page 18: Duke Docker Day 2014: Research Applications with Docker

Sys Admins Like Docker

Page 19: Duke Docker Day 2014: Research Applications with Docker

Application Developers Like Docker

Page 20: Duke Docker Day 2014: Research Applications with Docker

Another Reason to Like Docker

Page 21: Duke Docker Day 2014: Research Applications with Docker

Researchers should also like Docker

Computation is becoming more prevalent

▪ “Computation is becoming central to the scientific enterprise, but the prevalence of relaxed attitudes about communicating computational experiments’ details and the validation of results is causing a large and growing credibility gap.” (1)

▪ “To adhere to the scientific method in the face of the transformations arising from changes in technology and the Internet, we must be able to reproduce computational results.” (1)

Granting Agencies and Journals have begun to take note

▪ 2012 saw a one-year increase of 16% in the number of data policies, a 30% increase in code policies, and a 7% increase in the number of supplemental materials policies in journals (2)

▪ NIH has introduced new mandatory training modules, and reviewer checklists (3)

▪ Nature has introduced checklists to enhance reproducibility (4)

http://melissagymrek.com/science/2014/08/29/docker-reproducible-research.html

Page 22: Duke Docker Day 2014: Research Applications with Docker

What about Good Old Excel Spreadsheets?

Benefits

▪ Reusable

▪ Reproducible

▪ Shareable

▪ Code and Data stored in one convenient package

Problems

▪ $$$

▪ Only works on MS Windows and OSX*

▪ Easy to share data not intended for sharing (PHI accidentally left in another worksheet)

▪ Inter-version incompatibilities

▪ Does not scale to big data

▪ Security (macros and viruses)

Page 23: Duke Docker Day 2014: Research Applications with Docker

Free and Open Source Code

Benefits

▪ Free for anyone

▪ Code can easily be shared using online repositories (github, sourceforge, etc.), separately from data, and without cost to publisher or peers

▪ Can scale to big data

Problems

▪ Inter-version incompatibilities

▪ Difficult to fully specify software dependencies (especially when moving between architectures and OSes)

▪ Dependency clashes between libraries required by different applications

▪ Data must be structured rigorously, and code must be written in a special way to facilitate automation and reproducibility (6,7)

▪ Code and Data distribution must be managed independently

▪ Code can get stale without routine maintenance

“we have reached the point that, with some exceptions, anything less than release of actual source code is an indefensible approach for any scientific results that depend on computation, because not releasing such code raises needless, and needlessly confusing, roadblocks to reproducibility.”(5)

Page 24: Duke Docker Day 2014: Research Applications with Docker

Workflow Enactors (Taverna, Galaxy,…)

Benefits

▪ Easy to share workflows with others

▪ Reduces dependency clashes

▪ Can scale to big data with proper parallelization

Problems

▪ Dependence on web accessible data (security, privacy)

▪ Emphasize web services over commandline applications

▪ Still have inter-version incompatibilities

Page 25: Duke Docker Day 2014: Research Applications with Docker

Machine Images (Virtualbox, VMWare)?

Benefits

▪ Eliminates inter-version incompatibility issues

▪ Eliminates dependency clashes

▪ Can be shared between different servers (internal, amazon, google, etc.) running VM hosting technology

▪ Can spin up/tear down as many instances as needed

Problems

▪ Can get quite large

▪ Slow to start

▪ Must become proficient with server provisioning commands (package management, puppet/chef/ansible, etc.)

Page 26: Duke Docker Day 2014: Research Applications with Docker

Docker

Benefits

▪ Similar to VM

▪ Start and Stop much faster than VM

▪ Docker images are smaller than equivalent VM images

Problems

▪ Must become proficient with server provisioning commands (package management, puppet/chef/ansible, etc.; google ‘DevOps’)

▪ Orchestration and enactment

▪ Docker images may provide access to host resources that are not available to VM images
