Bioconductor R packages and Docker For Analysis of Genomics Data
Jennifer Davis, Data Scientist, Health and Life Sciences Consultant
July 2015 Talk, ATX R-User’s Group
Outline
What is Genomics and why is it ‘big data’?
Bioconductor: history & development; completely open-source software for genomics; current use cases
Using Bioconductor on ‘big’ data platforms: Docker on AWS; other methods (AMI & Vagrant)
Example: visualizing variants of the gene EHM2, overexpressed in cancer (static example to show the process)
R code: http://rpubs.com/JDavis/93047
Data (free) can be obtained from Ensembl: http://www.ensembl.org/index.html
What is Genomics and Why is it ‘Big Data’?
Study of the DNA, RNA & proteins within your cells
Looks at all of the material in a high-throughput manner
Creates lots of data
Data coming off the sequencer for 1 genome (1 individual’s genome) with reasonable coverage is ~200 GB
In clinical studies, the total quickly grows to well over a TB, because many patients are sequenced
Computationally intensive, complex algorithms are used to align, model and predict gene location, changes in expression, changes in DNA/RNA shape and function
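To make the scale concrete, the arithmetic behind the figures above can be sketched in a few lines of Python. The ~200 GB per genome figure is taken from the slide; the cohort sizes, the function name and the use of decimal units (1 TB = 1,000 GB) are assumptions made only for illustration.

```python
# Rough storage estimate for raw sequencer output.
# GB_PER_GENOME comes from the slide (~200 GB per genome at reasonable coverage);
# the cohort sizes below are hypothetical.

GB_PER_GENOME = 200
GB_PER_TB = 1000  # assumption: decimal units


def cohort_storage_tb(n_patients, gb_per_genome=GB_PER_GENOME):
    """Return the raw-data footprint of a cohort in terabytes."""
    if n_patients < 0:
        raise ValueError("n_patients must be non-negative")
    return n_patients * gb_per_genome / GB_PER_TB


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(f"{n:>4} patients: {cohort_storage_tb(n):.1f} TB")
```

Under these assumptions, a cohort of only five patients already reaches 1 TB of raw reads, before any aligned or derived files are stored.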
Types of Material Studied in Genomics
All the ‘blueprints’ that make you unique
DNA (genes, blueprints for transcripts, introns, exons)
RNA (gene transcripts & gene regulators)
mRNA (blueprint for proteins)
miRNA (regulates gene expression and other functions)
Regulatory elements (3-dimensional DNA + protein)
ChIP
HiC-seq
5C-seq
Bioconductor: History
Bioconductor is a consortium of open-source R packages created for bioinformatics analysis
Two releases each year; the current release contains 1,024 packages
Bioconductor is available as a series of AMIs (Amazon Machine Images) and Docker images
Thousands of academic and commercial users have contributed packages.
Supported by Fred Hutchinson Cancer Center as well as many contributors nationally and internationally
Bioconductor Docker Containers: Use Cases in Genomics
Pre-processing genomic reads, for example:
Alignment: LowMACA, msa, muscle, 19 others
Annotation: AffyExpress, annaffy, 69 other programs
Preprocessing (83 packages)
Quality Control (102 packages)
More interesting analyses:
Pathway analysis (83 programs)
Report Writing (26 packages)
Visualization (240 packages)
Bioconductor: Under the Hood
Bioconductor uses the R programming language and its ecosystem to perform genomic analyses
Bioconductor could be thought of as a repository with its own ecosystem, rather than a single piece of software
To use the R packages in Bioconductor more easily, it’s a good idea to load the Bioconductor environment in R. This is a one-time task
Under the hood, bioinformatics R packages often run their more computationally intensive math in C/C++ code.
Docker
Open platform for distributed applications
Useful for developers and system administrators as well as data analysts
Helpful to ensure:
Calculations and scientific environment are reproducible as long as OS & hardware are comparable
Containerizes without changing computations
Does not shuffle data, map data, or change data location
Works on cloud computing platforms
Docker Continued…
Uses RAM efficiently: all containers on a single machine share the same operating-system kernel
Schema: similar to Amazon Machine Images, and can run easily on AWS, Hadoop or your favorite supercomputer
Containers are ‘lighter weight’ than VMs & more resource-efficient
Isolates applications from each other and underlying infrastructure, therefore can be optimized for different computing environments
Using Bioconductor Docker Containers
Prerequisites: Docker installed; on Mac or Windows you also need boot2docker installed and running
Running in RStudio:
Other Machine-Image Options: Vagrant
Vagrant works very nicely with Spark and IPython Notebooks; I have not tried it with R
A completely free, lightweight development environment that can use machine images
Uses VirtualBox, VMware or AWS machine images as the actual computation environment
May have better security than containers because the computations are done in an actual virtual machine
Typical Work-flow
Import Data to R program
Containerize software from Bioconductor in Docker or use Amazon Machine Image
Submit to Amazon Web Services or your favorite super computing cloud platform
You can run containerized R code on Hadoop clusters but it may not give superior results compared to AWS
Use Case Example: Visualizing a Cancer Gene & Variants
Genomics Data is freely available from Ensembl
Today’s visualization uses the Bioconductor package GenomeGraphs
We will be visualizing the gene transcript and variants for EHM2
EHM2 is found in erythrocyte membranes and overexpressed in epithelial-derived cancer cells
Liver, lung, kidney, mammary, other
Cancer: high and low metastatic keratinocytes (skin cells)
Very brief Tutorial: Static examples of Code
We will walk through static images in RPubs:
How to download and set up Bioconductor
How to download data from Ensembl
How to run your Genome Graph Code
What the output looks like
How to speed up your code
Thank you
Thank you to Sandy and R User’s group for inviting me to give a talk
Black Locus & Home Depot for sponsoring the talk
Sponsors for R User’s Group: Revolution Analytics, Bazaarvoice, RStudio, BlackLocus
Audience members for your time & attention
Cheng Lee, Principal software engineer, Lab7 Systems for his helpful comments on Docker
Docker Files for Bioinformatics R Users
Sequencing Analysis packages: http://bioconductor.org/packages/release/BiocViews.html#___Sequencing
Rocker (Docker files for R users): https://github.com/rocker-org/rocker/tree/master/rstudio
Rocker wiki: https://github.com/rocker-org/rocker/wiki
Containers or ‘machine images’ will make a stand-alone virtual environment in which to run your sequencing code on a ‘big data’ platform such as Amazon Web Services
Some More Useful Links
Docker: https://www.docker.com
Bioconductor: bioconductor.org
Amazon Elastic Compute Cloud Guide: docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html
Repository with Code:
Link 1: http://rpubs.com/JDavis/93047
Link 2: https://github.com/jddavis-100/Statistics-and-Machine-Learning/wiki/Welcome-&-Table-of-Contents
More options for running Bioconductor analyses on Large-Scale Platforms
ParallelR Toolkit (still open source for now): http://projects.revolutionanalytics.com/documents/parallelr/parallerrpkgs/
doParallel is another package that is helpful in conjunction with parallelR
RHadoop (does not work on the newest version of R, so you need to use an older version and containerize it); this tool mainly spares R users from writing Hadoop streaming scripts
DeployR – web services API from RevolutionR
Your Cheat-Sheet of Things to Remember to Avoid Disaster
If something can go wrong, it will go wrong
Solution: use versioning (e.g. Git, Bitbucket) and a bug tracker (e.g. Jira) if you are making something for production
Many R Packages change, get updated, and sometimes new versions might give different results.
Solution I: curate your code (even if you think you’ll never need it again), and document which version of R you used as well as the date you ran your analysis
Solution II: containerize and save all of your code & the container. You can send the “containerized” environment to a colleague and they can run data in the same environment.
Caveat: if you have sensitive data, you may need to use a virtual machine. A container is a lightweight solution, but it does not replicate a virtual machine with all the security settings you can provide on a virtual machine.
Solution III: Consider Revolution R or a ‘Big Data’ platform if you are performing these analyses at the enterprise level
Solution IV: Use packages that have been mentioned in peer-reviewed publications OR if you get weird results, check them against other published results or R packages or your own computations
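The record-keeping suggested in Solution I can be as small as one structured file saved next to the code. The sketch below is in Python purely for illustration; the field names, the helper name and the placeholder version strings are hypothetical and not part of the talk. In an R session, sessionInfo() reports the R version and loaded package versions that would go into such a record.

```python
# Minimal provenance record for one analysis run.
# All field names and example values are hypothetical.
import json
from datetime import date


def make_run_record(r_version, packages, run_date=None):
    """Build a dictionary describing the environment of one analysis run."""
    return {
        "run_date": (run_date or date.today()).isoformat(),
        "r_version": r_version,
        # Sort package names so records from different runs diff cleanly.
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    record = make_run_record(
        r_version="x.y.z",                   # placeholder, not a real version
        packages={"GenomeGraphs": "x.y.z"},  # placeholder version string
        run_date=date(2015, 7, 1),           # hypothetical run date
    )
    print(json.dumps(record, indent=2))
```

Committing this file alongside the analysis script gives a colleague enough information to rebuild a matching environment, or to pick the matching container.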
Make sure you have enough memory allocated to run your job
Containerizing your script does not reduce its dimensionality, nor the amount of computing time required to run it. It may reduce wall-clock time on a cluster computing environment, but you still need to allocate memory.
Consider making your R code parallelizable (another topic for another time)
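The difference between total compute and wall-clock time can be illustrated with a back-of-the-envelope sketch. It assumes a perfectly divisible job with no scheduling or communication overhead, and that every worker holds its own copy of its data; real jobs rarely meet either assumption. The function names and numbers are hypothetical.

```python
# Best-case arithmetic for splitting one job across several workers.
# Assumptions (not from the talk): perfectly even split, zero overhead,
# and each worker needs its own memory allocation.


def ideal_wall_clock_hours(cpu_hours, workers):
    """Best-case elapsed time when cpu_hours of work is split evenly."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return cpu_hours / workers


def total_memory_gb(gb_per_worker, workers):
    """Memory to allocate if every worker holds its own copy of its data."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return gb_per_worker * workers


if __name__ == "__main__":
    cpu_hours, workers, gb_per_worker = 40, 8, 4  # hypothetical job
    print("wall-clock hours:", ideal_wall_clock_hours(cpu_hours, workers))
    print("total CPU hours: ", cpu_hours)  # unchanged by splitting the job
    print("memory to allocate (GB):", total_memory_gb(gb_per_worker, workers))
```

In this idealized case the elapsed time falls as workers are added, while the total compute stays the same and the memory request grows with the number of workers.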