rusersgroupbioconductortalkupdated

Bioconductor R packages and Docker For Analysis of Genomics Data

Jennifer Davis, Data Scientist, Health and Life Sciences Consultant

July 2015 Talk, ATX R-User’s Group

Outline

What is Genomics and why is it ‘big data’?

Bioconductor: history & development; completely open-source software for genomics; current use cases

Using Bioconductor on ‘Big’ Data Platforms: Docker on AWS; other methods: AMI & Vagrant

Example: Visualizing variants of the gene EHM2, which is overexpressed in cancer

Static example to show the process

R code: http://rpubs.com/JDavis/93047

Data (free) can be obtained from Ensembl: http://www.ensembl.org/index.html

What is Genomics and Why is it ‘Big Data’?

Study of the DNA, RNA & proteins within your cells

Looks at all of the material in a high-throughput manner

Creates lots of data

Data coming off the sequencer for one genome (one individual) with reasonable coverage is ~200 GB

In clinical studies this quickly grows to well over a TB, because the data volume scales with the number of patients

Computationally intensive, complex algorithms are used to align, model and predict gene location, changes in expression, changes in DNA/RNA shape and function
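The storage figures above lend themselves to a quick back-of-the-envelope calculation. The sketch below is illustrative only: it takes the ~200 GB per genome from the slide, while the cohort sizes are hypothetical values chosen to show how quickly the total grows.

```r
# ~200 GB of sequencer output per genome (figure quoted on the slide).
gb_per_genome <- 200

# Hypothetical cohort sizes, for illustration only.
cohort_sizes <- c(1, 10, 100, 1000)

# Raw storage needed per cohort, in GB and in (decimal) TB.
storage <- data.frame(
  patients = cohort_sizes,
  total_gb = gb_per_genome * cohort_sizes,
  total_tb = gb_per_genome * cohort_sizes / 1000
)

print(storage)
```

Under these assumptions, a study passes the 1 TB mark with as few as five or six patients, before any intermediate alignment files are counted.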

Types of Material Studied in Genomics

All the ‘blueprints’ that make you unique

DNA (genes, blueprints for transcripts, introns, exons)

RNA (gene transcripts & gene regulators)

mRNA (blueprint for proteins)

miRNA (regulates gene expression and other functions)

Regulatory elements (3-dimensional DNA + protein)

ChIP, HiC-seq, 5C-seq

Bioconductor: History

Bioconductor is a consortium of open-source R packages created for bioinformatics analysis

Two releases each year; the current release contains 1,024 packages

Bioconductor is available as a series of AMIs (Amazon Machine Images) and Docker images

Thousands of academic and commercial users have contributed packages.

Supported by Fred Hutchinson Cancer Center as well as many contributors nationally and internationally

Bioconductor Docker Containers: Use Cases in Genomics

Pre-processing genomic reads, for example:

Alignment: LowMACA, msa, muscle, 19 others

Annotation: AffyExpress, annaffy, 69 other programs

Preprocessing (83 packages)

Quality Control (102 packages)

More interesting analyses:

Pathway analysis (83 programs)

Report Writing (26 packages)

Visualization (240 packages)

Installing Bioconductor is as easy as….
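The screenshot that went with this slide did not survive, so the following is a hedged sketch of a typical installation rather than the exact code shown in the talk. It uses the BiocManager package, which is the route documented for current Bioconductor releases; at the time of the talk (2015) the documented route was the biocLite() script, shown here only as a comment. The CRAN mirror URL and the choice of msa as the example package are assumptions for illustration.

```r
# Install the BiocManager helper from CRAN if it is not already present.
# The mirror URL is an assumption; any CRAN mirror works.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager", repos = "https://cloud.r-project.org")
}

# Report which Bioconductor version this R installation will use.
BiocManager::version()

# Install an individual Bioconductor package (needs network access);
# msa is one of the alignment packages listed earlier.
BiocManager::install("msa")

# Route documented at the time of the talk (2015), kept for reference:
# source("https://bioconductor.org/biocLite.R")
# biocLite()
```

Installation only has to be done once per R library; afterwards packages are loaded with library() as usual.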

Bioconductor: Under the Hood

Bioconductor uses the R programming language and its ecosystem to perform genomic analyses

Bioconductor could be thought of as a repository with its own ecosystem, rather than a single piece of software

To use the R packages in Bioconductor more easily, it’s a good idea to load the Bioconductor environment in R. This is a one-time task

Under the hood, bioinformatics R packages often run their more computationally intensive math in C/C++ code.

Docker

Open platform for distributed applications

Useful for developers and system administrators as well as data analysts

Helpful to ensure:

Calculations and scientific environment are reproducible as long as OS & hardware are comparable

Containerizes without changing computations

Does not shuffle data, map data, or change data location

Works on cloud computing platforms

Docker Continued…

Uses RAM efficiently: all containers on a single machine share the same operating-system kernel

Schema: similar to Amazon Machine Images, and can run easily on AWS, Hadoop or your favorite supercomputer

Containers are ‘lighter weight’ than VMs and more resource-efficient

Isolates applications from each other and underlying infrastructure, therefore can be optimized for different computing environments

Using Bioconductor Docker Containers

Prerequisites: Docker installed; on Mac or Windows you also need boot2docker installed and running

Running in RStudio:
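The command shown on this slide was lost, so here is a minimal sketch of how a Bioconductor container with RStudio Server might be started, driven from R to stay in one language. The image name bioconductor/bioconductor_docker:devel, the password and the helper docker_run_args() are assumptions for illustration; image names have changed over time, so check the Bioconductor Docker documentation for the image you need. Port 8787 is the RStudio Server default.

```r
# Hypothetical helper: build the argument vector for `docker run`.
# Nothing is executed by this function.
docker_run_args <- function(image = "bioconductor/bioconductor_docker:devel",
                            port = 8787,
                            password = "change-me") {
  c("run", "--rm",
    "-p", sprintf("%d:8787", port),     # host port -> RStudio Server port
    "-e", paste0("PASSWORD=", password), # password for the rstudio user
    image)
}

args <- docker_run_args()

# Show the command that would be run.
cat("docker", args, "\n")

# To actually start the container (requires Docker on the PATH):
# system2("docker", args)
```

Once the container is running, RStudio should be reachable in a browser on the chosen port (localhost on Linux; with boot2docker, the address reported by `boot2docker ip`), logging in as user rstudio with the password set above.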

Other Machine-Image Options: Vagrant

Vagrant works very nicely with Spark and IPython Notebooks; I have not tried it with R…

It is a completely free, lightweight development environment that can use machine images

Uses VirtualBox, VMware or an AWS machine image as the actual computation environment

May have better security than containers because the computations are done in an actual virtual machine

Using Bioconductor: A Very Brief Tutorial

Typical Work-flow

Import Data to R program

Containerize software from Bioconductor in Docker or use Amazon Machine Image

Submit to Amazon Web Services or your favorite super computing cloud platform

You can run containerized R code on Hadoop clusters but it may not give superior results compared to AWS

Use Case Example: Visualizing a Cancer Gene & Variants

Genomics Data is freely available from Ensembl

Today’s visualization uses the Bioconductor package GenomeGraphs

We will be visualizing the gene transcript and variants for EHM2

EHM2 is found in erythrocyte membranes and overexpressed in epithelial-derived cancer cells

Liver, lung, kidney, mammary, other

Cancer: high and low metastatic keratinocytes (skin cells)

Ensembl Information on EHM2

Very brief Tutorial: Static examples of Code

We will walk through static images in RPubs:

How to download and set up Bioconductor

How to download data from Ensembl

How to run your Genome Graph Code

What the output looks like

How to speed up your code

Thank you

Thank you to Sandy and R User’s group for inviting me to give a talk

Black Locus & Home Depot for sponsoring the talk

Sponsors for R User’s Group: Revolution Analytics, Bazaarvoice, RStudio, BlackLocus

Audience members for your time & attention

Cheng Lee, Principal Software Engineer, Lab7 Systems, for his helpful comments on Docker

Docker Files for Bioinformatics R Users

Sequencing Analysis packages: http://bioconductor.org/packages/release/BiocViews.html#___Sequencing

Rocker – Docker files for R users: https://github.com/rocker-org/rocker/tree/master/rstudio

Wiki: https://github.com/rocker-org/rocker/wiki

Containers or ‘machine images’ will make a stand-alone virtual environment in which to run your sequencing code on a ‘big data’ platform such as Amazon Web Services


Some More Useful Links

Docker: https://www.docker.com

Bioconductor: bioconductor.org

Amazon Elastic Compute Cloud Guide: docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html

Repository with Code:

Link 1: http://rpubs.com/JDavis/93047

Link 2: https://github.com/jddavis-100/Statistics-and-Machine-Learning/wiki/Welcome-&-Table-of-Contents


More options for running Bioconductor analyses on Large-Scale Platforms

ParallelR Toolkit (still open source for now): http://projects.revolutionanalytics.com/documents/parallelr/parallerrpkgs/

doParallel is another package that is helpful in conjunction with parallelR

RHadoop (does not work on the newest version of R, so you need to use an older version and containerize it); this tool mainly spares R users from writing Hadoop streaming scripts

DeployR – web services API from RevolutionR




Your Cheat-Sheet of Things to Remember to Avoid Disaster

If something can go wrong, it will go wrong

Solution: use version control (e.g. Git, Bitbucket) and a bug tracker (e.g. Jira) if you are making something for production

Many R Packages change, get updated, and sometimes new versions might give different results.

Solution I: curate your code (even if you think you’ll never need it again), and document which version of R as well as the date you ran your analysis

Solution II: containerize and save all of your code & the container. You can send the “containerized” environment to a colleague and they can run data in the same environment.

Caveat: if you have sensitive data, you may need to use a virtual machine. A container is a lightweight solution, but it does not replicate a virtual machine with all the security settings you can provide on a virtual machine.

Solution III: Consider Revolution R or a ‘Big Data’ platform if you are performing these analyses at the enterprise level

Solution IV: Use packages that have been mentioned in peer-reviewed publications OR if you get weird results, check them against other published results or R packages or your own computations
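Solution I above can be partly automated. The sketch below records the R version, the run date and the loaded package versions in a plain-text file using only base R; the file name and the use of a temporary directory are assumptions for illustration, and in practice the record would be saved next to the analysis code.

```r
# Collect the details worth archiving alongside an analysis.
run_record <- list(
  r_version = R.version.string,
  run_date  = format(Sys.Date(), "%Y-%m-%d"),
  packages  = capture.output(print(sessionInfo()))
)

# Write the record to a plain-text file (a temporary location in this sketch).
record_file <- file.path(tempdir(), "analysis_run_record.txt")
writeLines(
  c(run_record$r_version,
    paste("Run date:", run_record$run_date),
    run_record$packages),
  record_file
)

cat("Run record written to", record_file, "\n")
```

Checking this file into version control together with the scripts makes it possible to tell later which package versions produced a given result.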

Make sure you have enough memory allocated to run your job

Containerizing your script does not reduce its dimensionality, nor the amount of computing time required to run it. It may reduce wall-clock time on a cluster computing environment, but you still need to allocate memory.

Consider making your R code parallelizable (another topic for another time)
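As a small taste of what parallelizable R code can look like, here is a minimal sketch using the base parallel package. The gc_content() function and the toy sequences are hypothetical stand-ins for a real per-sample computation, and the cluster size of two workers is an arbitrary choice.

```r
library(parallel)

# Hypothetical per-item task: fraction of G/C bases in a sequence string.
gc_content <- function(seq) {
  bases <- strsplit(seq, "")[[1]]
  mean(bases %in% c("G", "C"))
}

# Toy input; in practice this would be a list of samples or genomic regions.
seqs <- c("ATGCGC", "GGGCCC", "ATATAT", "ACGTACGT")

# Start two worker processes, spread the work across them, then shut down.
cl <- makeCluster(2)
res <- parLapply(cl, seqs, gc_content)
stopCluster(cl)

print(unlist(res))
```

The pattern works because each item is processed independently of the others; note that every worker needs its own memory, so parallel runs raise rather than lower the total memory requirement.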
