Sequencing and Bioinformatics, PGRP Summer 2015
Surya Saha
Sol Genomics Network (SGN)
Boyce Thompson Institute, Ithaca, [email protected] // @SahaSurya
BTI PGRP Internship Program 2015
http://www.acgt.me/blog/2015/3/7/next-generation-sequencing-must-die
Hello Experiment!
• Experimental design for survey
– Sample size
– Locations
– Phenotypes
6/11/2015 BTI PGRP Summer Internship Program 2015 2
Early Blight infected tomato plants
http://www.longislandhort.cornell.edu/vegpath/photos/early_blight.htm
• Experimental design to identify genetic differences
– PCR-based
  • Simple Sequence Repeats
  • Other markers
– Sequencing-based
  • Genes of interest
  • Single Nucleotide Polymorphisms
  • Gene expression
  • Genotyping by Sequencing
Why Sequencing?
• Targeted interrogation of genome
• Economical
• Technological developments
• High-throughput assays
• But requires subsequent validation
Sequencing: Then and Now
[Timeline figure; entries listed in the order recovered from the slide]
1953: DNA structure discovery
1977: Sanger DNA sequencing by chain-terminating inhibitors
1984: Epstein-Barr virus (170 Kb)
1987: ABI 370 sequencer
1995: Haemophilus influenzae (1.83 Mb)
2001: Homo sapiens (3.0 Gb)
2005-2007: 454, Solexa, SOLiD
2011: Ion Torrent, PacBio
2014: Pinus taeda (24 Gb), Nanopore MinION
Also on the timeline, near the 2012 and 2013 marks (exact placement not recoverable): Illumina, Illumina HiSeq X, 454
Slide design credit: Aureliano Bombarely
First generation sequencing
Sanger. Annu Rev Biochem. 1988;57:1-28.
Thanks to Nick Loman for the mention
Maxam-Gilbert method (1973)
http://en.wikipedia.org/wiki/File:Maxam-Gilbert_sequencing_en.svg
https://www.nationaldiagnostics.com/electrophoresis/article/maxam-gilbert-sequencing
Sanger method (1977)
Frederick Sanger (13 Aug 1918 – 19 Nov 2013)
Won the Nobel Prize for Chemistry in 1958 and 1980. Published the dideoxy chain termination method or “Sanger method” in 1977
http://dailym.ai/1f1XeTB
http://en.wikipedia.org/wiki/File:Sanger-sequencing.svg
http://en.wikipedia.org/wiki/File:Radioactive_Fluorescent_Seq.jpg
First generation sequencing
• Very high quality sequences (99.999% or Q50)
• Very low throughput
Platform | Run Time | Read Length | Reads / Run | Total nucleotides sequenced | Cost / MB
Capillary Sequencing (ABI3730xl) | 20m-3h | 400-900 bp | 96 or 384 | 1.9-84 Kb | $2400
http://www.hindawi.com/journals/bmri/2012/251364/tab1/
https://twitter.com/kbradnam/status/443153578429923328
• Second generation
• Third generation
• Fourth generation
• Next-next-generation
• Next-next-next generation
http://www.acgt.me/blog/2015/3/10/next-generation-sequencing-must-diepart-2
Mention the specific technology used to generate the data
– Illumina HiSeq/MiSeq/NextSeq
– Pacific Biosciences RS/RS II
– Ion Torrent Proton/PGM
– SOLiD
– Oxford Nanopore MinION
454 Pyrosequencing
One purified DNA fragment, to one bead, to one read.
http://www.genengnews.com/
GS FLX Titanium
https://mariamuir.com/wp-content/uploads/2013/04/rip.gif
Illumina
Instrument | MiSeq | NextSeq 500 | HiSeq 2500/3000/4000 | HiSeq X Ten
Output | 0.3-15 Gb | 20-120 Gb | 10-1500 Gb | 900-1800 Gb
Number of Reads / Flow cell | 25 million | 130-400 million | 300 million to 2.5 billion | 3 billion
Read Length | 2x300 bp | 2x150 bp | 2x250 - 2x125 bp | 2x150 bp
Cost | $99K | $250K | $740K | $10M (10 units)
Source: Illumina
$1000 human genome??
Illumina
Mardis 2008. Annu. Rev. Genomics Hum. Genet. 2008. 9:387–402
Illumina: TruSeq Long Read
Voskoboynik eLife 2013;2:e00569
Pacific Biosciences SMRT sequencing
Single Molecule Real Time sequencing
http://smrt.med.cornell.edu/images/pacbio_library_prep-1.gif
Pacific Biosciences SMRT sequencing: Error correction methods
Hierarchical genome-assembly process (HGAP)
PBJelly (English et al., PLOS One. 2012)
PBcR pipeline
Pacific Biosciences SMRT sequencing: Read Lengths
http://www.igs.umaryland.edu/labs/grc/
Mean Read Length: 8391 bp
Maximum Subread Length: 24585 bp
Oxford Nanopore
https://www.nanoporetech.com/
http://erlichya.tumblr.com/post/66376172948/hands-on-experience-with-oxford-nanopore-minion
http://halegrafx.com/vector-art/free-vector-despicable-me-minions/
https://theconversation.com/how-a-small-backpack-for-fast-genomic-sequencing-is-helping-combat-ebola-41863
Sequencing Trends
https://www.google.com/trends/
[Chart: Number of Publications, 2008-2014. Series: Illumina, Pacific Biosciences, Roche 454, Ion Torrent]
[Chart: Increase in Number of Publications, 2009-2014. Series: Illumina, Pacific Biosciences, Roche 454, Ion Torrent]
[Chart: % Increase in Number of Publications, 2009-2014. Series: Pacific Biosciences, Roche 454, Ion Torrent]
Others
• Ion Torrent Proton/PGM
• SOLiD
• Helicos
• Supporting technologies
– BioNano
– Nabsys
– OpGen
– 10X Genomics
– Fluidigm
Next generation sequencing
Platform | Run Time | Read Length | Quality | Total nucleotides sequenced | Cost / MB
454 Pyrosequencing | 24h | 700 bp | Q20-Q30 | 1 GB | $10
Illumina MiSeq | 27h | 2x300 bp | >Q30 | 15 GB | $0.15
Illumina HiSeq 2500 | 1-10 days | 2x250 bp | >Q30 | 3000 GB | $0.05
Ion Torrent | 2h | 400 bp | >Q20 | 50 MB-1 GB | $1
Pacific Biosciences | 30m-4h | 10 kb to >40 kb | >Q50 consensus, >Q10 single | 500-1000 MB / SMRT cell | $0.13-$0.60
http://www.hindawi.com/journals/bmri/2012/251364/
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431227
http://omicsmaps.com/
Next Generation Genomics: World Map of High-throughput Sequencers
https://flxlexblog.wordpress.com/2014/06/11/developments-in-next-generation-sequencing-june-2014-edition/
Real cost of Sequencing!!
Sboner, Genome Biology, 2011
Library Types
Single end
Paired end (PE, 150-800 bp, Fwd:/1, Rev:/2)
Mate pair (MP, 2 Kb to 20 Kb)
[Figure: read orientation diagrams for each library type (F = forward read, R = reverse read), labeled 454/Roche and Illumina]
Slide credit: Aureliano Bombarely
Implications of Choice of Library
Slide credit: Aureliano Bombarely
[Figure: assembly hierarchy]
Reads -> Consensus sequence (Contig)
Contigs + Pair Read information -> Scaffold (or Supercontig), with gaps written as NNNNN
Scaffolds + Genetic information (markers) or Optical maps -> Pseudomolecule (or ultracontig)
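To make the NNNNN gaps in a scaffold concrete, here is a minimal sketch. It assumes the contig order and gap sizes are already known (in practice they are estimated from paired reads); the function name and the example sequences are hypothetical.

```python
def build_scaffold(contigs, gap_sizes):
    """Join ordered contigs into one scaffold, filling each gap with N."""
    if not contigs:
        return ""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for contig, gap in zip(contigs[1:], gap_sizes):
        parts.append("N" * gap)  # unknown bases between neighbouring contigs
        parts.append(contig)
    return "".join(parts)


# Two made-up contigs separated by an estimated gap of 5 bases.
scaffold = build_scaffold(["ACGT", "GGCC"], [5])
print(scaffold)
```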
Multiplexing Libraries
Use of different tags (4-6 nucleotides) to identify different samples in the same lane/sector.
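Sorting pooled reads back to their samples can be sketched as below. This is a simplified illustration: it assumes the tag sits at the very start of each read and must match exactly. The sample names and reads are made up; only the two tag sequences come from the slide.

```python
from collections import defaultdict

# Hypothetical sample sheet: tag sequence -> sample name.
TAGS = {"AGTCGT": "sample_A", "TGAGCA": "sample_B"}


def demultiplex(reads, tags=TAGS):
    """Group reads by the tag found at their start and strip the tag."""
    bins = defaultdict(list)
    for read in reads:
        for tag, sample in tags.items():
            if read.startswith(tag):
                bins[sample].append(read[len(tag):])
                break
        else:
            # No known tag at the start of the read.
            bins["unassigned"].append(read)
    return dict(bins)


reads = ["AGTCGTACGTACGT", "TGAGCATTTTGGGG", "AGTCGTGGGCCC", "CCCCCCAAAA"]
result = demultiplex(reads)
print(result)
```

Real demultiplexing tools usually tolerate a mismatch or two in the tag; this sketch does not.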
Slide credit: Aureliano Bombarely
[Figure: fragments from two samples, tagged AGTCGT and TGAGCA, pooled and sequenced together]
File Formats
Fasta files:
It is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes.
-Wikipedia
Slide credit: Aureliano Bombarely
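A minimal FASTA reader, as a sketch. It relies on the usual convention that each record starts with a ">" header line followed by one or more sequence lines. The example records are invented.

```python
def parse_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up example: one nucleotide record and one peptide record.
example = """>seq1 hypothetical tomato gene
ATGGCGT
ACGTTAG
>seq2 hypothetical peptide
MKVLAA
""".splitlines()

records = list(parse_fasta(example))
print(records)
```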
File Formats
Fastq files:
FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.
-Wikipedia
• Single line ID with at symbol (“@”) in the first column.
• The sequence can span multiple lines after the ID line.
• Single line with plus symbol (“+”) in the first column to introduce the quality values.
• The “+” line may optionally repeat the ID.
• Quality values can span multiple lines after the “+” line, but their total length is identical to that of the sequence.
Slide credit: Aureliano Bombarely
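The rules above can be turned into a small reader. This is a sketch, not a replacement for an established parser: it accepts multi-line records and uses the sequence length to decide where the quality values end, so a quality line that starts with "@" is not mistaken for a new record. The example reads are invented.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from an iterable of FASTQ lines."""
    it = (line.rstrip("\n") for line in lines)
    for line in it:
        if not line:
            continue
        if not line.startswith("@"):
            raise ValueError(f"expected '@' ID line, got: {line!r}")
        read_id = line[1:]
        # Sequence lines run until the '+' separator line.
        seq_parts = []
        for line in it:
            if line.startswith("+"):
                break
            seq_parts.append(line)
        sequence = "".join(seq_parts)
        # Quality lines run until their total length reaches the sequence length.
        qual_parts, qual_len = [], 0
        while qual_len < len(sequence):
            line = next(it, None)
            if line is None:
                raise ValueError(f"truncated quality values for {read_id}")
            qual_parts.append(line)
            qual_len += len(line)
        quality = "".join(qual_parts)
        if len(quality) != len(sequence):
            raise ValueError(f"quality/sequence length mismatch in {read_id}")
        yield read_id, sequence, quality


example = [
    "@read1/1", "ACGTAC", "GTAC", "+read1/1", "IIIIII", "@III",
    "@read2/1", "TTGCA", "+", "!!#II",
]
records = list(parse_fastq(example))
print(records)
```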
Quality control: Encoding
Fastq files:
!"#$%&'()*+,-./0123456789 Offset by 33 (Phred+33)
KLMNOPQRSTUVWXYZ[\]^_`abcdefgh Offset by 64 (Phred+64)
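Decoding is just a subtraction of the offset from each character code. A minimal sketch follows; the function name is hypothetical and the offset has to be known in advance, since the quality string alone does not always reveal it.

```python
def decode_quality(quality, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores."""
    scores = [ord(ch) - offset for ch in quality]
    if min(scores, default=0) < 0:
        raise ValueError("character below the offset; wrong encoding?")
    return scores


phred33 = decode_quality("!5I", offset=33)  # '!' = 33, '5' = 53, 'I' = 73
phred64 = decode_quality("@Th", offset=64)  # '@' = 64, 'T' = 84, 'h' = 104
print(phred33, phred64)
```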
http://en.wikipedia.org/wiki/Phred_quality_score
Phred score of a base is:
Qphred = -10 log10(e)
where e is the estimated probability of a base being wrong
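The formula can be checked in both directions with a few lines of Python. The function names are only illustrative.

```python
import math


def phred_score(e):
    """Phred score for an estimated error probability e (0 < e <= 1)."""
    return -10 * math.log10(e)


def error_probability(q):
    """Estimated error probability for a Phred score q."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 50):
    e = error_probability(q)
    print(f"Q{q}: error probability {e:g}, accuracy {100 * (1 - e):.3f}%")
```

Q50 corresponds to an error probability of 1 in 100,000, which is the 99.999% figure quoted for Sanger reads.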
Pre-processing: Tools
Quality checks and trimming
• FastQC
• FASTX toolkit
• Trimmomatic
• Scythe
Joining paired-end reads
• fastq-join
• FLASH
• PANDAseq
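To show what quality trimming means, here is a deliberately naive sketch that cuts low-quality bases off the 3' end of a read. It is not the algorithm used by any of the tools listed above; the threshold, the offset and the example read are assumptions.

```python
def trim_3prime(sequence, quality, min_q=20, offset=33):
    """Remove bases from the 3' end while their Phred score is below min_q."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


# 'I' encodes Q40, '#' encodes Q2 and '!' encodes Q0 with Phred+33.
trimmed = trim_3prime("ACGTACGT", "IIIII#!!")
print(trimmed)
```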
Sequencing done! Now What??
• One HiSeq run can produce up to 1500 GB (1.5 TB) of data
• How much is 250GB of data?
– 250,000,000,000 characters
– 3000 characters per sheet
– 100 sheets / cm
– Stack of ~8000m
Mount Everest - 8848m
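The paper-stack estimate can be reproduced directly, assuming one character per byte of data:

```python
chars = 250_000_000_000   # 250 GB at one character per byte
chars_per_sheet = 3000
sheets_per_cm = 100

sheets = chars / chars_per_sheet
height_m = sheets / sheets_per_cm / 100  # cm -> m

print(f"{sheets:,.0f} sheets, stack of about {height_m:,.0f} m")
```

The stack comes to roughly 8,300 m, a little short of Mount Everest.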
Increase in Sequencing Data
L. Stein, Genome Biology, 2010
Slide credit: Lukas Mueller
High Performance Computing
Powerful servers with large amounts of memory, compute cores, and disk
What is bioinformatics?
Bioinformatics /baɪ.oʊˌɪnfərˈmætɪks/ is the application of computer science and information technology to the field of biology and medicine.
Slide credit: Lukas Mueller
Bioinformatics deals with
Algorithms, databases and information systems, web technologies, artificial intelligence and soft computing, information and computation theory, software engineering, data mining, image processing, modeling and simulation, signal processing, discrete mathematics, control and system theory, circuit theory, and statistics.
Generation of new knowledge in biology and medicine, and improving & discovering new models of computation (e.g. DNA computing, neural computing, evolutionary computing, immuno-computing, swarm-computing, cellular-computing).
Slide credit: Lukas Mueller
Bioinformatics can...
• Identify similar sequences
• Provide a putative function for a sequence
• Assemble sequences (genomes, transcriptomes)
• Annotate genomes
• Identify differentially expressed genes
• Build networks of genes or metabolites
• Determine phylogenetic relationships
• Mine literature for biological information
• Uncover differences between two genomes
• Calculate how a protein folds
Slide credit: Lukas Mueller
What can bioinformatics do for me?
Majority of projects involve large datasets
Speed up your research
Enable you to ask new questions
Basic knowledge of bioinformatics needed
Extract information
Transform information
Run analyses
Build hypotheses, etc.
Slide credit: Lukas Mueller
Linux
UNIX-like, free and open source operating system
Very stable, easy to use
Created by Linus Torvalds in the early 1990s, as a student
Adopted for most bioinformatics work
Also installed on cell phones, laptops, desktops, clusters, supercomputers
Can run on your computer! Virtualized or native
http://www.linux-netbook.com/linux/distributions/
Slide credit: Lukas Mueller
Further Reading
Plant Bioinformatics Course
• Virtual machine setup instructions
• Slides for Linux, Sequencing, RNAseq, NGS Read Mapping and R graphics
• http://btiplantbioinfocourse.wordpress.com
Slide credit: Lukas Mueller
Scripting
Scripts: Small programs written by the end-user that control the execution of other programs or perform a simple algorithm
Extremely flexible
Written in Shell, Perl, Python
You can write them yourself!!!
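As an example of a script that performs a simple algorithm, here is a short Python sketch that computes GC content and the reverse complement of a DNA sequence. The function names and the test sequence are made up.

```python
def gc_content(sequence):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(sequence):
    """Reverse complement of a DNA sequence made of A, C, G and T."""
    complement = str.maketrans("ACGT", "TGCA")
    return sequence.upper().translate(complement)[::-1]


dna = "ATGCC"
print(gc_content(dna), reverse_complement(dna))
```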
Slide credit: Lukas Mueller
Perl
Developed since the late 1980s by Larry Wall
Useful for bioinformatics and web development
Support for objects
Excellent integration of regular expressions (text handling language)
Vast open source code library (http://cpan.org/)
BioPerl (http://bioperl.org/)
Easy to learn
http://www.perl.org/
Slide credit: Lukas Mueller
Python
Created by Guido van Rossum in 1989
Very elegant language
BioPython libraries
The “new” popular language
Many frameworks (Django for web etc.)
Slide credit: Lukas Mueller
R
Language designed for statistics
Support for matrix calculations, graphics
Expression analysis, Next-Gen sequence analysis, Graphics, genome annotation statistics, phylogeny
Interactive
Slide credit: Lukas Mueller
Databases
Need to store and query data
Biological data is highly structured
Relational database systems
Non-relational systems
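A relational system stores data in linked tables and answers questions with queries. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The table names, columns and values are all hypothetical, chosen to echo the survey design from the start of the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two related tables: each read set belongs to one sample.
conn.execute(
    "CREATE TABLE sample (id INTEGER PRIMARY KEY, location TEXT, phenotype TEXT)"
)
conn.execute(
    "CREATE TABLE read_set (id INTEGER PRIMARY KEY, "
    "sample_id INTEGER REFERENCES sample(id), platform TEXT, n_reads INTEGER)"
)
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?)",
    [(1, "field_A", "infected"), (2, "field_B", "healthy")],
)
conn.executemany(
    "INSERT INTO read_set VALUES (?, ?, ?, ?)",
    [(1, 1, "Illumina", 1000), (2, 1, "PacBio", 50), (3, 2, "Illumina", 1200)],
)

# Query: total number of reads per sampling location.
rows = conn.execute(
    "SELECT s.location, SUM(r.n_reads) "
    "FROM sample s JOIN read_set r ON r.sample_id = s.id "
    "GROUP BY s.id, s.location ORDER BY s.id"
).fetchall()
print(rows)
```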
Slide credit: Lukas Mueller