NGS Bioinformatics Workshop 1.1: Workshop Overview and Practical Informatics Considerations. March 7th, 2012, IRMACS, SFU. Facilitator: Richard Bruskiewich

  • Slide 1

NGS Bioinformatics Workshop 1.1: Workshop Overview and Practical Informatics Considerations
March 7th, 2012, IRMACS, SFU
Facilitator: Richard Bruskiewich, Adjunct Professor, MBB

  • Slide 2

Today's Agenda, Part 1
- Welcome and acknowledgments
- Some administrative details
- Introductions: facilitator, participants
- 10 minute break

  • Slide 3

Advance Acknowledgments
- Jim Mattson: for championing the workshop idea
- Felix Breden: for championing the idea of IRMACS bioinformatics support and endorsing this workshop
- IRMACS team: Pam Borghardt, IRMACS Managing Director (sponsorship); Brian Corrie, Technical Director (workshop infrastructure)
- WestGrid team: Ata Roudgar, Martin Siegert (workshop HPC infrastructure)
- Fiona Brinkman: for her kind permission to adapt a number of her MBB introductory bioinformatics course slides for portions of the workshop

  • Slide 4

Topic | Lecture (12:30-14:30, Wednesdays) | Demo/Lab (9:30-11:30, Thursdays)
Bioinformatics Overview (roughly equivalent to core MBB 441/741 topics):
Workshop Overview and Practical Informatics Considerations | March 7th | March 8th
Sequence Formats, Databases and Visualization Tools | March 14th | March 15th
Sequence Alignment and Searching | March 21st | March 22nd
Principles of Structural Genomics and Overview of Next Generation Sequencing Technologies | March 28th | March 29th
Sequence Assembly Algorithms | April 4th | April 5th
Specific Applications:
Sequence Assembly of Transcriptomes | May 2nd | May 3rd
Sequence Assembly of Whole Genomes | May 9th | May 10th
Annotation of de novo Assembled Sequences | May 16th | May 17th
Identification and Analysis of Sequence Variation | May 23rd | May 24th
Comparative Genomic Analysis and Visualization | May 30th | May 31st
Meta-Analysis of Newly Annotated Sequence Data | June 6th | June 7th

  • Slide 5

Venue
The workshop lectures and demo/labs will generally take place here, in the IRMACS Centre, Room 10900 (top floor, Applied Sciences Building), with the exception of the March 14th and May 9th lectures, plus the May 10th lab/demo, for which there is
a meeting conflict in IRMACS. These particular sessions will instead be convened in BioSci room B9242. The lab/demo sessions on March 8th, 15th and 29th will end earlier, at 11 am, to accommodate the next scheduled event in IRMACS 10900.

  • Slide 6

Workshop Fee
- Sign-up list to Barbara Sherman
- Will contact PI for billing(?)

  • Slide 7

INTRODUCTIONS

  • Slide 8

Facilitator Richard: A Brief Bio
Professional Experience
- 2009-present: Adjunct Professor, MBB, SFU
- 2000-2011: Research Scientist, Computational and Systems Biology, Bioinformatics, International Rice Research Institute (IRRI; irri.org)
- 1999-2000: Postdoc, Human Analysis Team, Sanger Centre, Cambridge, UK
Academic Background
- 1999: PhD (Medical Genetics), UBC
- 1992: B.Sc. (Biochemistry, Molecular Biology & Genetics), UBC
- 1987: B.A. (Minor Computing), SFU
Personal
- Originally from Edmonton; moved to the GVRD in late teens and resided here for over two decades before travelling abroad to work
- Wife is Filipina-Canadian (hence the job in the Philippines); 3 teenage kids (son in his late teens has just started in the SIAT program at SFU Surrey)
- Returned last June to reside in Port Moody, at the foot of Burnaby Mountain

  • Slide 9

Participants: Around the table
- Your name, department, lab, (PI) (optional) your port of origin
- What is your research focus?
- How can bioinformatics (NGS) support that research?
- What NGS data of your own do you have to analyse *now*?
- Expectations for the workshop

  • Slide 10

10 minute break

  • Slide 11

Today's Agenda, Part 2
- What is bioinformatics and why is it needed?
- What is next generation sequencing?
- Coping with the NGS bioinformatics challenge
- The workshop road map
- Looking ahead

  • Slide 12

WHAT IS BIOINFORMATICS?

  • Slide 13

Bioinformatics is:
- The development of computational methods for studying the structure, function, and evolution of genes, proteins, and whole genomes;
- The development of methods for the management and analysis of biological information arising from genomics and high-throughput biological experiments.

  • Slide 14

Why is there Bioinformatics?
- Lots of new sequences being added
  - Automated sequencers
  - Genome projects
  - Metagenomics
  - RNA sequencing, microarray studies, proteomics, ...
- Patterns in datasets that can be analyzed using computers
- Huge datasets

  • Slide 15

Need for informatics in biology: origins
- Gramicidin S (Consden et al., 1947), partial insulin sequence (Sanger and Tuppy, 1951)
- 1961: tRNA fragments
- Francis Crick, Sydney Brenner, and colleagues propose the existence of transfer RNA that uses a three-base code and mediates in the synthesis of proteins (Crick et al., 1961. General nature of the genetic code for proteins. Nature 192: 1227-1232). In Microbiology: A Centenary Perspective, edited by Wolfgang K. Joklik, ASM Press, 1999, p. 384
- First codon assignment UUU/phe (Nirenberg and Matthaei, 1961)

  • Slide 16

- The key to the whole field of nucleic acid-based identification of microorganisms: the introduction of molecular systematics using proteins and nucleic acids by the American Nobel laureate Linus Pauling. Zuckerkandl, E., and L. Pauling. "Molecules as Documents of Evolutionary History." 1965.
Journal of Theoretical Biology 8:357-366
- Another landmark: nucleic acid sequencing (Sanger and Coulson, 1975)
Need for informatics in biology: origins

  • Slide 17

Need for informatics in biology: origins
- First genomes sequenced: 3.5 kb RNA bacteriophage MS2 (Fiers et al., 1976); 5.4 kb bacteriophage φX174 (Sanger et al., 1977)
- First complete genome sequence of a free-living organism: 1.83 Mb, Haemophilus influenzae KW20 (Fleischmann et al., 1995)
- First multicellular organism to be sequenced: C. elegans (C. elegans Sequencing Consortium, 1998)
- Early databases: Dayhoff, 1972; Erdmann, 1978
- Early programs: restriction enzyme sites, promoters, etc., circa 1978
- 1978-1993: Nucleic Acids Research published supplemental information

  • Slide 18

GenBank and associated resources double faster than Moore's Law (in less than 18 months)! (from the National Center for Biotechnology Information)
http://en.wikipedia.org/wiki/Moores_law

  • Slide 19

Today: So many genomes
As of mid-August 2010, according to the GOLD Genomes Online database:
- Eukaryotic genome projects in progress (genome and ESTs): 1548 (517 five years earlier)
- Prokaryote genome projects in progress: 5006 (740 five years earlier)
- Metagenome projects in progress: 133 (zero five years earlier)
- TOTAL: 6687 projects (as of Sept 2011: >10,000)

  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24

  • Slide 25

The Human Genome
- The genome sequence is complete, almost!
- Approximately 3.2 billion base pairs.

  • Slide 26

Work is ongoing to locate all genes and regulatory regions and describe their functions; bioinformatics plays a critical role.

  • Slide 27

Identifying single nucleotide polymorphisms (SNPs) and other changes between individuals

  • Slide 28

Bioinformatics helps with: Sequence Similarity Searching/Comparison
- What is similar to my sequence?
- Searching gets harder as the databases get bigger, and quality changes
- Tools: BLAST and FASTA = early time-saving heuristics (approximate methods)
- Need better methods for SNP analysis!
- Statistics + informed judgment of the biologist

  • Slide 29

Bioinformatics helps with: Structure-Function Relationships
- Can we predict the function of protein molecules from their sequence?
  sequence -> structure -> function
- Prediction of some simple 3-D structures possible (α-helix, β-sheet, membrane spanning, etc.)

  • Slide 30

Bioinformatics helps with: Phylogenetics
- Can we define evolutionary relationships between organisms by comparing DNA sequences?
  - Lots of methods and software; what is the best analysis approach?

  • Slide 31

WHAT IS NEXT GENERATION SEQUENCING (NGS)?

  • Slide 32

Sanger (dideoxy or chain termination) Sequencing
- Single-stranded DNA from sample* is extended by polymerase from a primer, then randomly terminated by a dideoxy nucleotide (ddNTP)
- Variable-length DNA fragments; radiolabelled or fluorescently detected ddNTP
* Sample derived from amplified cDNA, genomic clones or whole genome shotgun

  • Slide 33

Sanger Pros & Cons
Advantages
- Relatively accurate
- Relatively long reads (500-1500 bp)
Disadvantage
- Relatively costly in terms of reagents and relatively low throughput

  • Slide 34

Next Generation Sequencing (NGS)
Sequence Assembly on HPC
Roche 454; Life Tech.
Ion Torrent; Illumina HiSeq; Life Tech SOLiD; Oxford Nanopore GridION; Polonator; HeliScope; Pacific Biosciences SMRT Cell

  • Slide 35

(General) NGS Pros & Cons
Advantages
- Very high throughput
- Very cheap data production
Disadvantages
- Relatively short reads
- Relatively higher error rates
- Bioinformatics of assembly is much more challenging

  • Slide 36

General NGS Workflow
1. Template preparation
2. Sequencing & imaging
3. Genome alignment/assembly

  • Slide 37

COPING WITH THE NGS BIOINFORMATICS CHALLENGE

  • Slide 38

Challenge
- Assembling next generation sequence (NGS) data requires a great deal of computing power and gigabytes of memory
- Software can often execute in parallel on all available central processing unit (CPU) cores
- Many functional annotation processes (e.g. database searching, gene expression statistical analyses) also demand a lot of computing power

  • Slide 39

High Performance Computing and Cloud Computing
Computer nodes; network; storage; your local workstation/laptop

  • Slide 40

What is Cloud Computing?
- Pooled resources: shared with many users (remotely accessed)
- Virtualization: high utilization of hardware resources (no idling)
- Elasticity: dynamic scaling without capital expenditure and time delay
- Automation: build, deploy, configure, provision, and move without manual intervention
- Metered billing: pay-as-you-go, only for what you use

  • Slide 41

Cloud Bioinformatics Module
Raw Data/Results/Snapshots; Task-Specialized Server; Input Job Message Queue; Output Job Message Queue; Job Status Notification; Customized Machine Image; Start-up (w/parameters)

  • Slide 42

A More Complete Picture
Raw Data + Results; Web Portal; Project Relational Database; Loader

  • Slide 43

Case Study in Bioinformatics on the Cloud
- Used Amazon Web Services (http://aws.amazon.com)
- Assembled ~99 raw NGS transcriptome sequence datasets from 83 species on 16 Amazon EC2 instances with 8 CPU cores and 68 GB of RAM; ~200 hours of computer time, with the total run finishing in less than one working day.
- Each single machine of the required size would likely have cost at least ~$10,000 (and time) to purchase, and would incur significant operating cost overhead (machine room space, power supplies, networking, air conditioning, staff salaries, etc.)
- The above run could be started up in a few minutes and cost ~$500 to complete. Once done, no machines are left idling and unused.

  • Slide 44

Software for (NGS) Bioinformatics
- Bundled with sequencing machines: e.g. Newbler assembler with Roche 454
- 3rd party commercial: DNA Star (www.dnastar.com); Geneious (http://www.geneious.com/); GeneWiz (http://www.genewiz.com); and others
- Open source: lots (selected examples to be covered in this workshop)

  • Slide 45

What do I need to run bioinformatics software locally?
- Some common bioinformatics software is platform independent, hence will run equally under Windows and UNIX (Linux, OS X)
- Most other software targets Unix systems. If you are running Microsoft Windows and want to run such software locally, the easiest way to do this(?)
is to install some version of Linux (Ubuntu suggested) as a dual boot or (less intrusively) as a guest operating system in a virtual machine, e.g. http://www.vmware.com/products/player/

  • Slide 46

But, what are *we* going to use here?

  • Slide 47

WestGrid @ SFU / IRMACS
- WestGrid is a consortium member of Compute Canada (https://computecanada.org/)
- bugaboo cluster: 4328 cores total
  - 1280 cores: 8 cores/node, 16 GB/node, x86_64, IB
  - Plus 3048 cores: 12 cores/node, 24 GB/node, x86_64, IB
  - Capability cluster, 40 core years
- Access to other WestGrid resources through LAN and WAN
- More details from Brian Corrie tomorrow

  • Slide 48

Galaxy Genomics Workbench
http://galaxy.psu.edu/ (also http://main.g2.bx.psu.edu/)

  • Slide 49

THE WORKSHOP ROADMAP

  • Slide 50

What is Bioinformatics? Road Map
Annotation Sequences (Formats) Visualization of Sequence & Annotation Search & Alignments NGS Sequence Databases Sequence Assembly

  • Slide 51

Specific Applications
- Sequence Assembly of Transcriptomes
- Sequence Assembly of Whole Genomes
- Annotation of de novo Assembled Sequences
- Identification and Analysis of Sequence Variation
- Comparative Genomic Analysis and Visualization
- Meta-Analysis of Annotated Sequence Data

  • Slide 52

Survey: Workshop Expectations I
- How to find significance in the huge amount of data that Next Gen sequencing, but also microarrays etc., generate.
- A basic understanding of how to analyse next generation sequencing data.
- Learn some hands-on computer experience learning to use software for analysing sequence data; what can be done and how to do it.
- Genome assembly + meta-analysis

  • Slide 53

Survey: Workshop Expectations II
- The basics of alignment and SNP calling with next-gen sequencing, and what kind of programs are out there to do these tasks and then analyze the large datasets (I've been trying to figure this out on my own through reading the literature and it's quite time consuming, so any info provided through the workshop would be very helpful - thanks)
- The main workflow for processing sequence data from the beginning to the more specific paths of analyses. Also the concepts and significance of the adjustable parameters behind the various algorithms used in the workflow.

  • Slide 54

Survey: Workshop Expectations III
- I expect to learn the basic bioinformatics tools.
- Learn different sequence alignment software/technologies (i.e. BWA, ABySS, etc.).
- Learn more about the complexities of NGS sequencing
- Next generation sequencing, data analysis etc.
- Parameters regulating assembly of contigs.
- How to take raw data to an assembly, control the main parameters for assembly, mass analyze data for annotation and SNPs
- How to compare expression profiles using RNA transcriptomes.
- Want to learn new things

  • Slide 55

Survey: Operating System Being Used
- Microsoft Windows on Intel/AMD: 14 (86.7%)
  - Most running Windows 7 (some XP & Vista)
  - One uses Linux through WestGrid and the IRMACS cluster
  - Some of you also thinking of running Linux
- Apple OS X: 2 (13.3%)
  - Snow Leopard release
  - Apple Lion, running Windows 7 using Parallels
- Linux on Intel: 2 (13.3%)

  • Slide 56

Looking Ahead
- What will you need for this workshop?
  - Mainly, just a laptop running a web browser
  - (Optional) access to Linux/Unix locally (VM Player)
- Reading list: will give review citations for future lectures
- For next week, suggest that you surf to http://www.ncbi.nlm.nih.gov/
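
The cloud case study on Slide 43 quotes a handful of round figures (16 instances, ~200 hours of computer time, ~$500 for the run, at least ~$10,000 per purchased machine). A rough back-of-envelope check of what those figures imply can be sketched in Python as below. This is only an illustration: the function names are hypothetical, and it assumes that "~200 hours" means total instance-hours summed over all 16 machines, which the slide does not state explicitly.

```python
# Back-of-envelope check of the figures quoted on Slide 43.
# All inputs are the approximate values from the slide; treating
# "~200 hours" as total instance-hours is an assumption.

INSTANCES = 16                           # Amazon EC2 instances used
INSTANCE_HOURS = 200.0                   # assumed: summed over all instances
CLOUD_COST_USD = 500.0                   # quoted cost of the whole run
PURCHASE_COST_PER_MACHINE_USD = 10_000.0 # quoted lower bound per machine


def wall_clock_hours(instance_hours: float, instances: int) -> float:
    """Elapsed time if the work is spread evenly over all instances."""
    return instance_hours / instances


def cost_per_instance_hour(total_cost: float, instance_hours: float) -> float:
    """Implied average price of one instance for one hour."""
    return total_cost / instance_hours


def purchase_cost(machines: int, cost_per_machine: float) -> float:
    """Up-front hardware cost of buying the same number of machines."""
    return machines * cost_per_machine


if __name__ == "__main__":
    hours = wall_clock_hours(INSTANCE_HOURS, INSTANCES)
    rate = cost_per_instance_hour(CLOUD_COST_USD, INSTANCE_HOURS)
    capex = purchase_cost(INSTANCES, PURCHASE_COST_PER_MACHINE_USD)
    print(f"Implied wall-clock time: {hours:.1f} h")
    print(f"Implied rate: ${rate:.2f} per instance-hour")
    print(f"Hardware purchase alone: ${capex:,.0f} vs. ${CLOUD_COST_USD:,.0f} for the cloud run")
```

Under that assumption the run takes about half a day of elapsed time, and the purchase price of equivalent hardware alone is a few hundred times the cost of the single cloud run, before counting the operating overheads listed on the slide.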