cloud biolinux s.africa
Cloud BioLinux: pre-configured and on-demand computing for genomics without institutional, geographic or economic boundaries
Ntino Krampis, PhD
JCVI-NIAID-UL workshop, S. Africa 2011
Low-cost sequencing technology
A new generation of small form-factor, bench-top sequencers
example: GS Junior by 454
sequencing becoming standard in biology and genetics research
besides whole genomes: RNA-seq, ChIP-seq, and metagenomics
1
downstream bioinformatic analysis is required for scientific discovery
Problem 1: sequence data analysis requires high performance and expensive computing hardware
Problem 2: many commonly used bioinformatics tools are difficult to install, usually available only as source code - need technical expertise
Acquiring the sequence data is only the first step
2
cloud computing: high performance computers and data storage, remotely accessible through the Internet
we are all using the cloud: Gmail, Google Docs, Yahoo! Mail, Facebook; you store and access data on a remote computer
cloud computers rented pay-as-you-go by service providers such as Amazon Elastic Compute Cloud (EC2)
Solving problem 1: computational capacity on the cloud
3
Cloud computing with Amazon EC2
Additional services besides computing and storage: http://aws.amazon.com
a subsidiary company of Amazon.com, pay-as-you-go cloud computing
cloud computers cost $0.085 - $2 per hr (max 64GB memory and 8 processors)
used by companies that need additional computers without investing in hardware
physical locations: US East / West regions, EU, Singapore, Japan
researchers work on the closest location, then distribute results world-wide
democratizes access to computing resources outside of institutional, economic or national boundaries
750 hours free for new users: http://aws.amazon.com/free/
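The pay-as-you-go pricing above can be turned into a rough budget. The following is a minimal Python sketch, assuming the hourly range and free hours quoted on this slide (2011 figures; current prices differ). The function name estimate_cost is made up for illustration, and whether the free hours apply to a given VM size is not stated on the slide, so free_hours defaults to 0.

```python
# Figures quoted on the slide (2011); treat them as illustrative only.
MIN_RATE_USD_PER_HOUR = 0.085
MAX_RATE_USD_PER_HOUR = 2.00
FREE_HOURS_NEW_USER = 750


def estimate_cost(hours, rate_usd_per_hour, num_vms=1, free_hours=0):
    """Estimate the charge in USD for running num_vms VMs for `hours` each.

    free_hours is subtracted from the total VM-hours before pricing.
    """
    if hours < 0 or num_vms < 0 or free_hours < 0:
        raise ValueError("hours, num_vms and free_hours must be non-negative")
    billable_hours = max(0, hours * num_vms - free_hours)
    return round(billable_hours * rate_usd_per_hour, 2)


if __name__ == "__main__":
    # A 40-hour analysis on one VM at each end of the quoted price range.
    print(estimate_cost(40, MIN_RATE_USD_PER_HOUR))
    print(estimate_cost(40, MAX_RATE_USD_PER_HOUR))
```

Even at the top of the quoted range, a week-long analysis on one VM costs far less than buying comparable hardware.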
4
operating system, bioinformatics tools and data, are installed on a Virtual Machine (VM)
a VM is uploaded on the cloud; runs using on-demand computing capacity from the EC2 cloud service
can be accessed world-wide through a desktop / laptop computer with Internet access
removes need for local computing infrastructure at each laboratory
How does cloud computing work?
[Diagram: local desktop computers connect through the Internet to VMs on the remote Amazon EC2 cloud computing service]
5
bioinformatics tools are difficult to install
Cloud BioLinux offers a VM on the cloud with 100+ pre-installed and configured bioinformatics tools
sequence analysis, de novo assembly, annotation, phylogeny, molecular modeling, gene expression
a researcher can initiate a practically unlimited number of VMs for large-scale data analysis
Solving problem 2: Cloud BioLinux
6
sign in to the Amazon EC2 cloud control console: http://aws.amazon.com/console
Username: [email protected]
Password: SAcloud!
7
Starting our tutorial: using the cloud
Launch Cloud BioLinux through the EC2 cloud console
Click the Launch Instance button
8
1. go to the Community AMIs tab, specify the Cloud BioLinux identifier ami-6011e409
Click
2. select computational capacity: Large - 2 CPU cores 7.5 GB memory
Click
Cloud BioLinux launch wizard: steps 1 & 2
9
3. specify a password (workshop) for login to Cloud BioLinux in the User Data box
Click
Cloud BioLinux launch wizard: step 3
10
Cloud BioLinux launch wizard: steps 4 & 5
4. enter a value to uniquely identify your individual Cloud BioLinux VM
Click
5. select Proceed without a Key Pair
Click
11
Cloud BioLinux launch wizard: steps 6 & 7
6. choose default security group
Click
7. Are we all on the final screen?
Click
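For readers who prefer scripting to the console wizard, steps 1 to 6 can be sketched in Python with the boto3 library. This is a hedged sketch and not part of the workshop. It rests on several assumptions: that "Large" corresponds to the m1.large instance type, that the value entered in step 4 is stored as a Name tag, that the default security group can be referenced by name, and that the AMI identifier is still available in your region. The helper names build_launch_request and launch are hypothetical.

```python
CLOUD_BIOLINUX_AMI = "ami-6011e409"  # identifier from step 1; may be retired or region-specific
LARGE_INSTANCE_TYPE = "m1.large"     # assumption: the "Large" size (2 CPU cores, 7.5 GB memory)


def build_launch_request(vm_name, password, count=1):
    """Collect the wizard choices (steps 1-6) as keyword arguments for run_instances."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return {
        "ImageId": CLOUD_BIOLINUX_AMI,        # step 1: Cloud BioLinux identifier
        "InstanceType": LARGE_INSTANCE_TYPE,  # step 2: computational capacity
        "UserData": password,                 # step 3: login password in the User Data box
        "TagSpecifications": [{               # step 4: value that identifies your VM
            "ResourceType": "instance",
            "Tags": [{"Key": "Name", "Value": vm_name}],
        }],
        # step 5: no KeyName entry, i.e. proceed without a key pair
        "SecurityGroups": ["default"],        # step 6: default security group
        "MinCount": count,
        "MaxCount": count,
    }


def launch(vm_name, password, count=1):
    """Send the request to EC2. Requires boto3 and configured AWS credentials."""
    import boto3  # imported here so the request can be built without boto3 installed

    ec2 = boto3.client("ec2")
    response = ec2.run_instances(**build_launch_request(vm_name, password, count))
    return [instance["InstanceId"] for instance in response["Instances"]]
```

Calling launch("your-name", "workshop") would mirror the workshop settings; a larger count starts several identical VMs at once.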
12
Cloud BioLinux launch status
wizard completes and we return to the console
takes a few minutes to launch, will be in pending (yellow) state
13
While waiting for Cloud BioLinux to boot up...
14
public datasets on Amazon EC2: http://aws.amazon.com/publicdatasets
Genbank and Ensembl databases, 1000 human genomes project, influenza
data hosted for free, users pay only for the computing time used
community program: http://aws.amazon.com/datasets/submit
advantage: putting the data where computational capacity is available
Amazon EC2 education-research grants: http://aws.amazon.com/education/
Any questions before we get to the exercises?
15
final step
In the console click Instances
find your unique Cloud BioLinux VM using your name specified in step 4
copy its Public DNS (server address / URL on the cloud)
Connecting remotely to Cloud BioLinux
click the NX client icon on your computer's desktop:
A. paste the DNS in the Host box
B. select Unix, Gnome, remote desktop size
C. Login: ubuntu is the default user; workshop is the password we set
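The console lookup (click Instances, find your VM, copy its Public DNS) can also be scripted. The sketch below assumes the name from step 4 is stored as a Name tag and uses the boto3 describe_instances call. The function names find_public_dns and lookup are hypothetical.

```python
def find_public_dns(describe_response, vm_name):
    """Return the public DNS names of instances whose Name tag equals vm_name.

    describe_response is a dict shaped like the result of describe_instances.
    Instances without a public DNS name yet (e.g. still pending) are skipped.
    """
    names = []
    for reservation in describe_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            dns = instance.get("PublicDnsName")
            if tags.get("Name") == vm_name and dns:
                names.append(dns)
    return names


def lookup(vm_name):
    """Query EC2 for the VM. Requires boto3 and configured AWS credentials."""
    import boto3  # imported here so find_public_dns works without boto3 installed

    ec2 = boto3.client("ec2")
    response = ec2.describe_instances(
        Filters=[{"Name": "tag:Name", "Values": [vm_name]}]
    )
    return find_public_dns(response, vm_name)
```

The returned address is what goes into the Host box of the NX client.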
16
two S. aureus strains and one S. carnosus species
drag & drop the .fna files on the Cloud BioLinux desktop
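Before starting an analysis, it can help to confirm that the copied .fna files are intact. The sketch below assumes the files are plain nucleotide FASTA and reports the length of each sequence. The function name fasta_lengths is hypothetical and this check is not part of the workshop exercises.

```python
def fasta_lengths(lines):
    """Return {sequence_id: length} for FASTA text given as an iterable of lines.

    An open file handle works as `lines`. The sequence id is the first
    whitespace-delimited word after '>'.
    """
    lengths = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


if __name__ == "__main__":
    import sys

    # Usage: python fasta_lengths.py genome1.fna genome2.fna ...
    for path in sys.argv[1:]:
        with open(path) as handle:
            for seq_id, length in fasta_lengths(handle).items():
                print(path, seq_id, length)
```

A file that yields no sequences, or a genome far shorter than expected, suggests an incomplete copy.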
20
save and share the Virtual Machine (VM) containing your analysis results with a collaborator
storage costs: $0.10 / GB / month
31
authorize access to the VM: public or for certain users
other researchers can access the VM with all the software, data, analysis results directly on the cloud
Cloud BioLinux: whole system snapshot exchange
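The snapshot exchange described above can be sketched with boto3 as well. The slides do not say which mechanism the workshop used, so this is only one possible approach: it assumes an EBS-backed VM, saves it as a machine image with create_image, and opens it to the public or to specific AWS account IDs with modify_image_attribute. The helper names are hypothetical, and the storage price is the figure quoted on the slide.

```python
STORAGE_USD_PER_GB_MONTH = 0.10  # figure quoted on the slide


def monthly_storage_cost(size_gb):
    """Estimate the monthly storage charge in USD for a saved VM of size_gb."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    return round(size_gb * STORAGE_USD_PER_GB_MONTH, 2)


def launch_permission(user_ids=(), public=False):
    """Build the LaunchPermission argument: public, or limited to certain users."""
    if public:
        return {"Add": [{"Group": "all"}]}
    if not user_ids:
        raise ValueError("give at least one AWS account ID or set public=True")
    return {"Add": [{"UserId": uid} for uid in user_ids]}


def snapshot_and_share(instance_id, image_name, user_ids=(), public=False):
    """Save the VM as an image and authorize access. Requires boto3 and credentials."""
    import boto3  # imported here so the helpers above work without boto3 installed

    permission = launch_permission(user_ids, public)  # validate before creating anything
    ec2 = boto3.client("ec2")
    image_id = ec2.create_image(InstanceId=instance_id, Name=image_name)["ImageId"]
    ec2.get_waiter("image_available").wait(ImageIds=[image_id])
    ec2.modify_image_attribute(ImageId=image_id, LaunchPermission=permission)
    return image_id
```

A collaborator who has been granted access can then launch the shared image, with the software, data and analysis results in place.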
32
Acknowledgments & Credits
Brad Chapman, Tim Booth, Bela Tiwari, Dawn Field - Cloud BioLinux development
Deepak Singh and AWS - compute credits on EC2 supporting initial development
J. Craig Venter Inst. - sponsorship / time allowed to work on this project
D. Gomez, E. Navarro, J. Shao, I. Singh, D. Edwards, M. Stout - JCVI tech innovation
Members of the Cloud BioLinux community:
Enis Afgan, Michael Heuer, Richard Holland, Mark Jensen, Dave Messina, Steffen Möller, Roman Valls
Thank you !