2013 pag-equine-workshop
Post on 10-May-2015
C. Titus Brown
Asst Prof, CSE and Microbiology; BEACON NSF STC
Michigan State University
ctb@msu.edu
Next-Gen Sequencing: 4 years in the trenches
These slides are available online.
“titus brown slideshare”
You can also e-mail me: ctb@msu.edu
Also note that these are my opinions and observations, culled from personal experience,
online material, and reading. I’m happy to cite/explain further upon request, but:
Your Mileage May Vary
Things I won’t talk about
Don’t work on/with/have anything useful to say about:
Exome sequencing
Ancient DNA
ChIP-seq (protein-DNA interactions)
Work on but you’re probably not interested in:
Metagenomics (sequencing uncultured microbial communities)
Bioinformatics data structures and algorithms
Overview
Shotgun sequencing basics
Things everyone wants to know: how much $$...
Various current problems & challenges
Technology, now and future
Some papers and projects worth looking at; & our own experiences
Two specific concepts:
First, sequencing everything at random is very much easier than sequencing a specific gene region. (For example, it will soon be easier and cheaper to shotgun-sequence all of E. coli than it is to get a single good plasmid sequence.)
Second, if you are sequencing on a 2-D substrate (wells, or surfaces, or whatnot) then any increase in density (smaller wells, or better imaging) leads to a squared increase in the number of sequences.
These two concepts underlie the recent stunning increases
in sequencing capacity.
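The second concept is plain geometry, and a rough sketch may make it concrete. The function names and the 1 mm^2 surface below are illustrative assumptions, not figures from any instrument; the only point is that halving the spacing between features quadruples the number of sequences.

```python
def features_per_area(area_um2: float, pitch_um: float) -> float:
    """Features (wells or spots) on a square-packed surface.

    Assumes one feature per pitch x pitch square, which is an idealisation.
    """
    return area_um2 / (pitch_um ** 2)


def density_gain(linear_factor: float) -> float:
    """Fold increase in sequences when linear spacing shrinks by linear_factor."""
    return linear_factor ** 2


area = 1_000_000.0  # hypothetical 1 mm^2 surface, expressed in um^2
for pitch in (2.0, 1.0, 0.5):
    print(f"pitch {pitch} um: {features_per_area(area, pitch):,.0f} features")
```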
What are current costs for Illumina?
Approximate costs from MSU sequencing center, a few months ago, including labor:
RNAseq: $200 prep / sample
Single-ended 1x50: $1100/lane, 100-150 million reads
Paired-end 2x100: $2500/lane, 200-300 million reads (/ 2 for read pairs)
Barcoding samples, etc, gets complicated.
Discuss biology, etc with a sequencing geek before going forward!
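A back-of-the-envelope sketch of what these prices mean per million reads and per barcoded sample. It assumes perfectly even barcoding, takes the low end of each yield range, and counts the paired-end yield as individual reads; the function names are made up for illustration.

```python
def cost_per_million_reads(lane_cost: float, lane_reads_millions: float) -> float:
    """Sequencing cost per million reads for one lane."""
    return lane_cost / lane_reads_millions


def cost_per_sample(lane_cost: float,
                    lane_reads_millions: float,
                    reads_per_sample_millions: float,
                    prep_cost: float = 200.0) -> float:
    """Library prep plus an equal share of one lane.

    Assumes samples are barcoded and split perfectly evenly, which real
    runs only approximate.
    """
    samples_per_lane = int(lane_reads_millions // reads_per_sample_millions)
    if samples_per_lane < 1:
        raise ValueError("one lane does not yield enough reads for a single sample")
    return prep_cost + lane_cost / samples_per_lane


# Low-end yields from the slide: 100 M reads (1x50) and 200 M reads (2x100).
print(cost_per_million_reads(1100, 100))   # 1x50 lane
print(cost_per_million_reads(2500, 200))   # 2x100 lane
print(cost_per_sample(1100, 100, 50))      # two 50 M-read samples on a 1x50 lane
```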
What does this data really give you?
With RNAseq, you can do de novo (genome- and gene-annotation-independent) gene & isoform discovery and quantification; 50-100 million reads/sample is probably “enough” (see http://blog.fejes.ca/?p=607 for a good discussion).
With genome resequencing, you can do variant analysis/discovery; I recommend 20x depth.
De novo assembly of complex vertebrate genomes is not casual:
Cheap short-read sequencing does not yet deliver good long-range contiguity; repeats, heterozygosity get in the way.
Assembly & scaffolding process itself is still evolving.
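To turn a depth recommendation such as 20x into a read count, the usual relation is depth = reads x read length / genome size. The sketch below assumes a round 2.7 Gbp genome and 100 bp reads purely for illustration; substitute your own numbers.

```python
import math


def reads_needed(genome_size_bp: int, depth: float, read_length_bp: int) -> int:
    """Reads required for a target average depth (depth = N * L / G)."""
    return math.ceil(depth * genome_size_bp / read_length_bp)


genome = 2_700_000_000  # assumed genome size in bp, for illustration only
n = reads_needed(genome, depth=20, read_length_bp=100)
print(f"{n:,} reads of 100 bp for 20x average depth")
```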
Why so much data?
Why do we need 10-20x coverage (resequencing) or 50-100 million reads (mRNAseq) with Illumina?
Two (linked) reasons:
Shotgun sequencing is random
Counting/sampling variation
1. Useful minimum coverage depends on high average coverage
2. mRNAseq quantitation – must overcome sampling variation
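Point 1 can be illustrated with the idealised model in which per-base coverage is Poisson-distributed around the average depth. This is a sketch under that assumption only; real libraries have extra bias, so actual low-coverage fractions are usually worse.

```python
import math


def poisson_cdf(k: int, mean: float) -> float:
    """P(X <= k) for X ~ Poisson(mean)."""
    return sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k + 1))


def fraction_below(min_depth: int, mean_depth: float) -> float:
    """Expected fraction of bases covered by fewer than min_depth reads."""
    if min_depth <= 0:
        return 0.0
    return poisson_cdf(min_depth - 1, mean_depth)


# How much of the genome falls below 5x at different average depths?
for mean_depth in (5, 10, 20):
    frac = fraction_below(5, mean_depth)
    print(f"average {mean_depth}x: {frac:.4%} of bases below 5x")
```

Under this model, a useful minimum coverage over nearly all of the genome only appears once the average is several times higher than that minimum.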
Coverage conclusions
More coverage rarely hurts (you can always discard data, but it is harder/more $$ to get more data from an old sample)
Your desired coverage numbers should be driven by sensitivity considerations.
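One way to make "driven by sensitivity" concrete for mRNAseq is to ask how many total reads are needed before a transcript at a given abundance is counted with acceptable sampling noise. The sketch assumes pure Poisson counting, so the coefficient of variation is 1/sqrt(expected count); biological and technical variation come on top of this. The abundance and target values are illustrative.

```python
import math


def expected_count(total_reads: float, fraction: float) -> float:
    """Expected reads from a transcript making up `fraction` of the library."""
    return total_reads * fraction


def counting_cv(total_reads: float, fraction: float) -> float:
    """Coefficient of variation from Poisson sampling alone."""
    return 1.0 / math.sqrt(expected_count(total_reads, fraction))


def reads_for_cv(fraction: float, target_cv: float) -> int:
    """Total reads needed so that sampling CV is at most target_cv."""
    return math.ceil(1.0 / (fraction * target_cv ** 2))


# A transcript contributing one read per million, measured to about 10% CV:
print(f"{reads_for_cv(1e-6, 0.10):,} reads")
```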
Problems and challenges
Systematic bias in sequencing and software.
Genome assembly: scaffolding and sensitivity
Gene references
mRNAseq isoform construction
Resequencing: bias and error
Calling SNPs by mapping -- U. Colorado
http://genomics-course.jasondk.org/?p=395
Both sequencing and bioinformatics yield many low-frequency artifacts!
“Obvious” things like misalignments to paralogous/repeat sequences.
Indels are handled badly by current tools (up to 60% false positive rate?!)
Oxidation of DNA during library prep step (acoustic shearing) generated 8-oxoguanine “lesions” responsible for artifacts involving C>A/G>T triplets.
=> With any data set, especially big ones, there will be both random and systematic error and bias.
http://pathogenomics.bham.ac.uk/blog/2013/01/sequencing-data-i-want-the-truth-you-cant-handle-the-truth/
Suggestion: Cortex variant caller
Iqbal et al., Nat Genet. 2012, pmid 22231483
Genome assembly: scaffolding & sensitivity
Everyone wants two things from a genome assembly:
Long/correct scaffolds
See http://www.slideshare.net/flxlex/a-different-kettle-of-fish-entirely-bioinformatic-challenges-and-solutions-for-whole-de-novo-genome-assembly-of-atlantic-cod-and-atlantic-salmon
Complete genome content
Sequence data
Reads
[Figure: original DNA, fragments, sequenced ends]
http://www.cbcb.umd.edu/research/assembly_primer.shtml
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
Contigs
Building contigs
Scaffolds
Ordered, oriented contigs
[Figure: contigs linked by mate pairs, with a gap size estimate; a scaffold is a series of contigs separated by gaps]
http://dx.doi.org/10.6084/m9.figshare.100940
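A simplified sketch of how a gap size estimate can fall out of mate pairs that link two contigs: subtract the parts of the insert that lie on each contig from the library's insert size, then take the median over all linking pairs. The `MatePair` structure and coordinates are hypothetical, and the sketch ignores read orientation conventions and insert-size variance, which real scaffolders have to model.

```python
from statistics import median
from typing import NamedTuple, Sequence


class MatePair(NamedTuple):
    left_start: int  # 0-based start of one read on the left contig
    right_end: int   # 0-based exclusive end of its mate on the right contig


def estimate_gap(pairs: Sequence[MatePair],
                 left_contig_len: int,
                 insert_size: int) -> float:
    """Median gap estimate between two contigs joined by mate pairs.

    For each pair: gap = insert size - bases spanned on the left contig
    - bases spanned on the right contig.
    """
    estimates = [
        insert_size - (left_contig_len - p.left_start) - p.right_end
        for p in pairs
    ]
    return median(estimates)


pairs = [MatePair(9000, 1500), MatePair(8800, 1250), MatePair(9500, 2050)]
print(estimate_gap(pairs, left_contig_len=10_000, insert_size=3000))
```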
Longer reads!
Long reads can span repeats and heterozygous regions
[Figure: repeat copy 1 and repeat copy 2; contig 1, polymorphic contig 2, polymorphic contig 3, contig 4]
Cod: PacBio results
Mapping to the published genome
[Figure: 11.4 kbp, 10.6 kbp, and 10.9 kbp subreads]
Sensitivity – does your genome include everything?
Generally not!
For example, the chick genome is missing a substantial number of genes from microchromosomes:
723 genes from HSA19q missing from chicken galGal4.
ESTs and RNAseq transcripts exist for many or most of them.
Approach - Digital normalization
(a computational version of library normalization)
Digital normalization “smooths out” coverage from different loci, and can “recover” low coverage regions for assembly.
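A minimal sketch of the idea behind digital normalization: stream through the reads, estimate each read's coverage as the median count of its k-mers seen so far, and keep the read only if that estimate is still below a cutoff. This toy version uses an exact Python dictionary and does not collapse reverse complements; it is an illustration under those simplifying assumptions, not the production implementation, which uses a memory-efficient probabilistic counter.

```python
from statistics import median
from typing import Dict, Iterable, Iterator, List


def kmers(seq: str, k: int) -> List[str]:
    """All overlapping k-mers of a read."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads: Iterable[str],
                          k: int = 20,
                          cutoff: int = 20) -> Iterator[str]:
    """Yield reads whose estimated coverage is still below `cutoff`."""
    counts: Dict[str, int] = {}
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: no coverage estimate possible
        estimated_coverage = median(counts.get(km, 0) for km in read_kmers)
        if estimated_coverage < cutoff:
            # Only kept reads contribute to the k-mer counts.
            for km in read_kmers:
                counts[km] = counts.get(km, 0) + 1
            yield read


# Ten copies of a high-coverage read plus one read from a rare locus.
reads = ["ACGTTGCAAG"] * 10 + ["TTTTGGGGCC"]
kept = list(digital_normalization(reads, k=4, cutoff=3))
print(len(reads), "reads in,", len(kept), "reads kept")
```

Redundant reads from high-coverage loci are discarded once the cutoff is reached, while reads from low-coverage loci are all retained, which is what evens out coverage before assembly.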
Applying diginorm to increase sensitivity
Reassembled chick genome from 70x Illumina -> normalized reads in ~24 hours.
Contig assembly contained partial or complete matches to 70% of previously unmappable transcripts assembled from chick mRNAseq.
Together with Wes Warren (WUSTL), Hans Cheng (USDA ADOL), Jerry Dodgson (MSU) proposing to apply PacBio and normalization to improve chick genome; should be generalizable approach.
Mapping => mRNAseq quantitation
Reference transcriptome required.
Existing chick gene models lack exons, isoforms
*This gene contains at least 4 isoforms.
[Figure: our data compared with existing gene models]
Likit Preeyanon
(Exon detection is pretty good.)
Gene Modeler Pipeline (“gimme”?)
Merge transcripts together based on transcript mapping to genome; can include existing gene predictions, iterate.
Construct gene models
Remove redundant sequences
Predict strands and ORFs
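The first step, merging transcripts based on where they map, can be sketched as grouping alignments whose genomic spans overlap. This is only a simplified illustration of the idea: the `Alignment` type and coordinates are hypothetical, and it works on whole-transcript spans and ignores exon structure and strand, so it should not be read as the pipeline's actual implementation.

```python
from typing import Iterable, List, NamedTuple


class Alignment(NamedTuple):
    transcript_id: str
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def cluster_into_loci(alignments: Iterable[Alignment]) -> List[List[Alignment]]:
    """Group transcript alignments with overlapping genomic spans."""
    loci: List[List[Alignment]] = []
    current: List[Alignment] = []
    current_end = -1
    for aln in sorted(alignments, key=lambda a: (a.chrom, a.start, a.end)):
        same_locus = (
            bool(current)
            and aln.chrom == current[0].chrom
            and aln.start < current_end
        )
        if same_locus:
            current.append(aln)
            current_end = max(current_end, aln.end)
        else:
            if current:
                loci.append(current)
            current = [aln]
            current_end = aln.end
    if current:
        loci.append(current)
    return loci


alignments = [
    Alignment("t1", "chr1", 100, 500),
    Alignment("t2", "chr1", 400, 900),
    Alignment("t3", "chr1", 2000, 2500),
    Alignment("t4", "chr2", 100, 300),
]
for locus in cluster_into_loci(alignments):
    print([a.transcript_id for a in locus])
```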
Some thoughts on bioinfo
Software is evolving very fast. Don’t worry about using the latest, but keep an eye on possible artifacts/problems with what you do use.
In NGS, online information (seqanswers, biostar, Twitter) is generally far more current than publications.
Technology – where next?
Most slides taken from Lex Nederbragt:
http://www.slideshare.net/flxlex/updated-new-high-throughput-sequencing-technologies-at-the-norwegian-sequencing-centre-and-beyond
High-throughput sequencing
Phase 1: more is better
Year  Instrument  Reads        Read length  Output
2005  GS20        200 000      100 bp       0.02 Gb/run
2011  GS FLX+     1.2 million  750 bp       0.7 Gb/run
2006  GA          28 million   25 bp        0.7 Gb/run
2011  HiSeq 2000  3 billion    2x100 bp     600 Gb/run
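The output column is roughly reads x read length. The sketch below recomputes it as a sanity check; it assumes the HiSeq figure counts 3 billion read pairs (200 bases each), since that is what reproduces 600 Gb/run. The nominal product for the GS FLX+ comes out higher than the quoted 0.7 Gb/run, presumably because 750 bp is not the average read length.

```python
# (instrument, reads per run, bases per read or read pair)
PLATFORMS = [
    ("GS20 (2005)", 200_000, 100),
    ("GS FLX+ (2011)", 1_200_000, 750),
    ("GA (2006)", 28_000_000, 25),
    ("HiSeq 2000 (2011)", 3_000_000_000, 200),  # assumption: 3 billion 2x100 pairs
]


def nominal_gb(reads: int, bases_per_read: int) -> float:
    """Nominal output in gigabases: reads x bases per read / 1e9."""
    return reads * bases_per_read / 1e9


for name, reads, bases in PLATFORMS:
    print(f"{name}: {nominal_gb(reads, bases):g} Gb/run nominal")
```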
High-throughput sequencing
Phase 2: smaller is better
GS Junior from Roche/454: 0.04 Gb/run, 400 bp reads (compared with 0.7 Gb/run, 700 bp reads for the full-size instrument)
MiSeq from Illumina: 4.5 Gb/run, 2x150 bp reads (compared with 600 Gb/run, 2x100 bp reads for the full-size instrument)
PGM from Ion Torrent/Life Technologies: 0.01, 0.1 or 1 Gb/run, 100 or 200 bp reads
High-throughput sequencing
Why benchtop sequencing instruments?
Affordable price per instrument
Small projects
Diagnostics
Fast turn around time
Which instrument to choose?
High-throughput sequencing
Phase 3: single-molecule
C2 (current) chemistry:
Average read length 2500 bp
36 000 reads
90 Mb per ‘run’
High-throughput sequencing
Technology
Need to combine Illumina + PacBio still.
[Figure: P_errorCorrection pipeline; 2.7x + 23x; 24 cpus, 4.5 days, 100 Gb RAM; 93% of reads recovered. Panel: alignments of at least 1 kb to the published cod assembly, raw reads vs. error-corrected reads]
My perspective on tech:
Illumina HiSeq + benchtop sequencers (MiSeq) currently most reliable for data generation: data in hand, decent quality.
PacBio data is an excellent add-on for situations where long reads are needed (to bridge repeats or het regions).
Two final pieces of advice
Should you work with genome centers? Maybe.
Genome centers are good at large, well funded projects.
Their default pipelines are reliable but not always cutting edge.
“Weird” problems (high heterozygosity, or complex repeats) may require more attention than they can give.
They also have their own schedules and incentives.
Where should you go for contract sequencing?
I get asked this a lot!
My best recommendation is UC Davis.
“Cheaper” is not always “better”; data quality can vary immensely.
Advertisement: next-gen sequence course
June 10-June 20, Kellogg Biological Station; < $500
Hands on exposure to data, analysis tools.
http://bioinformatics.msu.edu/ngs-summer-course-2013
Acknowledgements
I showed work from Likit Preeyanon and Alexis Black Pyrkosz, in my lab
Hans Cheng is primary collaborator on chick work
USDA funded our technology development.
Lex Nederbragt for his slides :)