Bioinformatics 2 - Lecture 5
Gabriele Schweikert
University of Edinburgh
March 7, 2013
Gabriele Schweikert Bioinformatics 2 - Lecture 5 1
RNA-Seq
Credits: Darryl Leja (NHGRI), Ian Dunham (EBI)
RNA vs DNA
Pedro Romero et al., PNAS 2006
RNA-Seq: Transcript identification
Which parts of the DNA sequence are transcribed into RNA?
1 protein-coding genes are transcribed into mRNAs.
in humans around 20,000 protein-coding genes
account for around 1.5% of the genome sequence
heterogeneous in length (100s-10000s bp)
coding regions are flanked by untranslated regions (UTRs)
coding exons, usually interspersed by non-coding introns
Alternative isoforms increase functional diversity
many genes are associated with several different isoforms:
alternative transcription start sites (TSS)
alternative splicing
alternative cleavage sites
Pedro Romero et al., PNAS 2006
RNA-Seq: Transcript identification
Which parts of the DNA sequence are transcribed into RNA?
1 protein-coding genes
2 RNA genes
transcribed into non-coding RNAs (ncRNA)
functional RNA molecules that are not translated into proteins
include highly abundant and functionally important RNAs:
1 ribosomal RNA (rRNA): major structural components of the ribosome (translation machinery)
2 transfer RNA (tRNA): act as adaptors between amino acids and mRNA during protein synthesis
other important types include:
1 miRNAs: regulate gene expression
2 small nuclear RNAs (snRNAs): functions in RNA splicing, regulating TFs or RNA pol II
3 long noncoding RNAs (lncRNAs)
4 many others
RNA-Seq: Transcript identification
Which parts of the DNA sequence are transcribed into RNA?
1 protein-coding genes
2 RNA genes
3 pseudogenes
dysfunctional copies of genes
no longer expressed in the cell
accumulate multiple mutations
RNA: summary
The typical eukaryotic cell contains ≈ 10^-5 µg of RNA
80-85% rRNA
15-20% low MW species including tRNAs, snRNAs, miRNAs
1-5% of total cellular RNA is messenger RNA (mRNA)
→ filter for RNAs that you are interested in!
→ remove RNAs that you are not interested in!
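As a rough worked example of these proportions, the sketch below converts the per-cell total into picograms and reports the mRNA share. The function name and the picogram conversion are illustrative and not part of the lecture.

```python
# Total RNA per typical eukaryotic cell in micrograms (figure from the slide).
TOTAL_RNA_UG = 1e-5

def rna_mass_pg(fraction, total_ug=TOTAL_RNA_UG):
    """Mass in picograms of an RNA class making up `fraction` of total RNA."""
    return fraction * total_ug * 1e6  # 1 microgram = 1e6 picograms

low, high = rna_mass_pg(0.01), rna_mass_pg(0.05)
print(f"mRNA per cell: {low:.2f}-{high:.2f} pg of {rna_mass_pg(1.0):.0f} pg total")
```

With mRNA at only a tenth to half a picogram per cell, sequencing total RNA without enrichment or depletion would spend most reads on rRNA.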
RNA-Seq: Beyond transcript identification
Transcript Quantification:
RNA-Seq allows quantitative measurement of the cellular RNA complement
Jonas Behr: Transcript Inference and Quantification Using MIP. NIPS-MLCB, December, 2011
Library Preparation
Martin and Wang, Nature Reviews Genetics, 2011
Library Preparation and Sequencing Considerations
Poly(A) selection vs depletion of rRNAs?
Eliminate PCR amplification? (to remove biases against GC content)
strand-specific RNA-Seq?
read length?
single-end vs paired-end?
sequencing depth, multiplexing?
Data Analysis
Martin and Wang, Nature Reviews Genetics, 2011
Computational Challenges
Sequencing depth of transcripts varies over several orders of magnitude
Transcription is strand specific
Transcript variants from the same gene can share exons; difficult to resolve unambiguously
Pre-processing
Removing artefacts
sequencing adaptors
low complexity sequences
duplicates which are derived from PCR amplification
rRNA sequence contamination
Sequencing errors (low quality scores)
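The artefact filters listed above can be sketched as a small read-cleaning pipeline. This is a minimal illustration, not a replacement for a dedicated trimming tool: the function names, the thresholds (`min_len`, `min_qual`, `max_fraction`) and the Phred+33 encoding are assumptions, and rRNA removal (which needs alignment against rRNA sequences) is left out.

```python
from collections import Counter

def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]

def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def is_low_complexity(seq, max_fraction=0.8):
    """Crude complexity check: flag reads dominated by a single base."""
    most_common_count = Counter(seq).most_common(1)[0][1]
    return most_common_count / len(seq) > max_fraction

def preprocess(reads, adapter, min_len=20, min_qual=20):
    """Yield cleaned (seq, qual) pairs; all thresholds are illustrative."""
    seen = set()
    for seq, qual in reads:
        seq, qual = trim_adapter(seq, qual, adapter)
        if len(seq) < min_len:
            continue                      # too short after adapter removal
        if mean_phred(qual) < min_qual:
            continue                      # likely sequencing errors
        if is_low_complexity(seq):
            continue
        if seq in seen:
            continue                      # exact duplicate, e.g. from PCR
        seen.add(seq)
        yield seq, qual
```

Exact-duplicate removal is a deliberate simplification here: in RNA-Seq, identical reads can also come from genuinely highly expressed transcripts.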
Assembly Strategies
Reference-based
Reference-free
Hybrids
Reference-based assembly strategies
Martin and Wang, Nature Reviews Genetics, 2011
In the early days: known exons or splice junctions needed
now: ab-initio; mapping without annotation
Assembly in 3 steps:
1 spliced alignment
tools named on the slide: Cufflinks, Scripture, PALMapper, Oases, Trinity (of these, PALMapper is a spliced aligner; Cufflinks and Scripture are reference-based assemblers; Oases and Trinity are reference-free assemblers)
based on: seed-and-extend strategy: find an exact match of a substring, then extend using Smith-Waterman alignment
Burrows-Wheeler Transform (BWT)
2 building graph representation
3 traversing the graph to resolve individual isoforms
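To make the Burrows-Wheeler Transform mentioned under step 1 concrete, here is a minimal sketch that builds the BWT from sorted rotations and counts exact pattern matches by backward search. It is a teaching version only: real aligners build the index via suffix arrays and precomputed rank tables, and the function names here are illustrative.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (quadratic, for illustration)."""
    text += "$"  # sentinel: assumed absent from text and smaller than every letter
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def count_occurrences(bwt_string, pattern):
    """Count exact matches of pattern in the original text by backward search."""
    # first[c] = number of characters in the text that sort before c
    first, total = {}, 0
    for c in sorted(set(bwt_string)):
        first[c] = total
        total += bwt_string.count(c)

    def occ(c, i):
        # occurrences of c in bwt_string[:i] (a real index precomputes this)
        return bwt_string[:i].count(c)

    lo, hi = 0, len(bwt_string)  # current range of sorted rotations
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ(c, lo)
        hi = first[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

For example, `bwt("banana")` gives `"annb$aa"`, and `count_occurrences("annb$aa", "ana")` returns 2. The search touches each pattern character once, which is why BWT-indexed genomes suit short-read mapping.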
Spliced-alignment: TopHat
1 map all reads using Bowtie (fast short read mapping program, uses a BWT-indexed genome)
allow some (2) mismatches in the 5' end of the reads
up to 10 alignments are allowed per read
→ low complexity reads are discarded
→ initially unmapped reads (IUMs) put aside
2 assemble mapped reads into covered regions: islands
extend islands on both sides (45bp)
remove gaps: merge nearby islands of low expressed genes (introns typically larger than 70bp)
3 find all possible introns
extract sequence for islands using the reference genome
find all canonical donor and acceptor splice sites: GT, AG
consider all possible pairings to form introns: GT-AG (max length: 20,000bp)
include 'single-island' junctions in high-coverage islands
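Steps 2 and 3 can be sketched as follows. This is a simplified illustration under stated assumptions: coordinates are 0-based and end-exclusive, `merge_gap` is a hypothetical parameter chosen only to be well below the 70bp intron size quoted above, and introns are enumerated within one sequence rather than between neighbouring islands as a real spliced aligner would do.

```python
def build_islands(intervals, extend=45, merge_gap=6):
    """Extend covered (start, end) intervals and merge close ones into islands."""
    extended = sorted((max(0, s - extend), e + extend) for s, e in intervals)
    islands = []
    for s, e in extended:
        if islands and s - islands[-1][1] <= merge_gap:
            # overlaps or nearly touches the previous island: merge
            islands[-1][1] = max(islands[-1][1], e)
        else:
            islands.append([s, e])
    return [tuple(i) for i in islands]

def candidate_introns(seq, max_intron=20000):
    """Pair every GT donor with every downstream AG acceptor.

    Returns (donor, acceptor_end) pairs such that seq[donor:acceptor_end]
    starts with GT, ends with AG, and is at most max_intron long.
    """
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptor_ends = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return [(d, a) for d in donors for a in acceptor_ends
            if 4 <= a - d <= max_intron]  # 4 = GT and AG must not overlap
```

Because GT and AG dinucleotides are very common, this enumeration produces many false candidates; the later read-based checks are what decide which junctions are kept.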
TopHat continued
4 check IUMs for reads spanning introns with seed-and-extend strategy
create 2k-mer seeds from all junctions, with k = 5bp exonic sequence on either side of junction
index IUM reads: table is keyed by 2k-mers
each 2k-mer is associated with reads that contain this 2k-mer
IUM read index queried with seeds
reads containing the seed are checked for complete alignment (allowing some mismatches)
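The seed-and-extend check of step 4 can be sketched like this. It is a minimal illustration, not TopHat's implementation: the function names, the ungapped mismatch-counting extension and the `max_mismatches` default are assumptions. Junctions are given as `(donor, acceptor_end)` with the intron occupying `genome[donor:acceptor_end]`.

```python
from collections import defaultdict

K = 5  # exonic bases taken on each side of a junction (value from the slide)

def junction_seed(genome, donor, acceptor_end, k=K):
    """2k-mer: the k bases before the intron joined to the k bases after it."""
    return genome[donor - k:donor] + genome[acceptor_end:acceptor_end + k]

def index_reads(reads, k=K):
    """Map every 2k-mer to the set of (read_id, offset) where it occurs."""
    index = defaultdict(set)
    for rid, read in enumerate(reads):
        for off in range(len(read) - 2 * k + 1):
            index[read[off:off + 2 * k]].add((rid, off))
    return index

def spliced_hits(genome, junctions, reads, k=K, max_mismatches=2):
    """Return (read_id, donor, acceptor_end) for reads aligning across a junction."""
    index = index_reads(reads, k)
    hits = []
    for donor, acc in junctions:
        if donor < k or acc + k > len(genome):
            continue  # not enough flanking exon sequence for a seed
        seed = junction_seed(genome, donor, acc, k)
        for rid, off in sorted(index.get(seed, ())):
            read = reads[rid]
            left_len = off + k                # read bases upstream of the junction
            right_len = len(read) - left_len  # read bases downstream of it
            if donor - left_len < 0 or acc + right_len > len(genome):
                continue
            # extend: compare the whole read with the spliced reference
            target = genome[donor - left_len:donor] + genome[acc:acc + right_len]
            mismatches = sum(a != b for a, b in zip(read, target))
            if mismatches <= max_mismatches:
                hits.append((rid, donor, acc))
    return hits
```

Indexing the reads rather than the junctions means each candidate junction costs a single table lookup, however many initially unmapped reads there are.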
5 heuristically filter by minimum minor isoform frequency
if frequency of splice junction < 15% of the depth of coverage of exons
→ discard junction
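The filter in step 5 amounts to a single ratio test. In the sketch below the function name and the way exon coverage is supplied (one number per junction, provided by the caller) are assumptions; only the 15% threshold comes from the slide.

```python
def keep_junction(junction_reads, exon_coverage, min_fraction=0.15):
    """Keep a junction only if its support reaches min_fraction of exon coverage."""
    if exon_coverage <= 0:
        return False
    return junction_reads / exon_coverage >= min_fraction

# Hypothetical junctions: name -> (spanning reads, depth of flanking exons)
junctions = {"j1": (30, 100.0), "j2": (5, 100.0)}
kept = [name for name, (n, cov) in junctions.items() if keep_junction(n, cov)]
print(kept)
```

Here `j1` (30% of exon depth) is kept and `j2` (5%) is discarded. The price of this heuristic is that genuinely rare minor isoforms are removed along with alignment noise.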
Alternative strategies
1 QPALMA: predicts splice sites from DNA sequence (trains on known splice sites)
2 STAR
3 GSNAP
Single vs paired end reads
Paired end
75-150bp sequenced from both ends
Paired reads from long inserts (500-1000bp)
long-range exon connectivity
combining libraries with different insert sizes
Cufflinks
Trapnell et al., Nature Biotech, 2010
Transcriptome Assembly: Cufflinks
1 assemble overlapping bundles independently
2 for each bundle, build a weighted bipartite graph that represents compatibilities among fragments
identify pairs of incompatible fragments originating from different spliced isoforms
fragments connected when compatible and overlapping
each fragment has one node
directed edges placed from left to right between each pair of compatible fragments
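A simplified sketch of the compatibility test and graph construction described in step 2 follows. It assumes each fragment is a dict with `start`, `end` (0-based, end-exclusive) and a set of `introns` implied by its spliced alignment; this representation and the function names are illustrative, and the rule used (identical introns within the shared region) ignores paired-end details handled by the real tool.

```python
def overlapping_introns(frag, lo, hi):
    """Introns of a fragment that intersect the half-open region [lo, hi)."""
    return {(s, e) for s, e in frag["introns"] if s < hi and e > lo}

def compatible(a, b):
    """True if a and b overlap and imply the same introns within their overlap."""
    lo, hi = max(a["start"], b["start"]), min(a["end"], b["end"])
    if lo >= hi:
        return False  # no overlap at all
    return overlapping_introns(a, lo, hi) == overlapping_introns(b, lo, hi)

def overlap_graph(fragments):
    """Directed edges (i, j), left to right, between compatible fragments."""
    order = sorted(range(len(fragments)),
                   key=lambda i: (fragments[i]["start"], fragments[i]["end"]))
    edges = []
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            if compatible(fragments[i], fragments[j]):
                edges.append((i, j))
    return edges
```

For instance, a fragment that reads straight through a region which another fragment splices out is incompatible with it, so the two must come from different isoforms.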
Transcriptome Assembly: Cufflinks
1 assemble overlapping bundles independently
2 build weighted bipartite graph
3 isoforms assembled from the overlap graph
path through the graph corresponds to mutually compatible fragments
find minimum path cover of the graph
minimum set of transcripts that 'explain' the intron junctions within reads
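The minimum path cover of a directed acyclic graph can be found through bipartite matching: split every node into a left and a right copy, match along edges, and the number of paths is the number of nodes minus the size of a maximum matching. The sketch below shows the unweighted, vertex-disjoint version with a simple augmenting-path matcher; it assumes the edge list is transitively closed where shared fragments matter, and it is an illustration of the idea rather than Cufflinks' weighted implementation.

```python
def max_bipartite_matching(n, edges):
    """Maximum matching between left and right copies of n nodes (augmenting paths)."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
    match_right = [-1] * n  # match_right[v] = u if edge u->v is in the matching

    def try_augment(u, seen):
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            if match_right[v] == -1 or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    size = sum(try_augment(u, set()) for u in range(n))
    return size, match_right

def minimum_path_cover(n, edges):
    """Cover all nodes of a DAG with n - (maximum matching) vertex-disjoint paths."""
    _, match_right = max_bipartite_matching(n, edges)
    successor = {u: v for v, u in enumerate(match_right) if u != -1}
    starts = [v for v in range(n) if match_right[v] == -1]
    paths = []
    for s in starts:
        path = [s]
        while path[-1] in successor:
            path.append(successor[path[-1]])
        paths.append(path)
    return paths
```

With four fragments and edges `[(0, 1), (0, 2), (1, 3), (2, 3)]`, fragments 1 and 2 are mutually incompatible alternatives, so two paths are needed: the smallest number of transcripts that explains all fragments.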
Cufflinks: Transcript Quantification
1 assign fragments to transcripts from which they could have originated (potentially more than 1)
2 incorporate fragment lengths
3 statistical model: probability of observing each fragment is linear function of the abundance of the transcript
4 maximize function that assigns a likelihood to all possible sets of relative abundances
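One common way to maximise a likelihood of this kind is expectation-maximisation. The sketch below is a generic EM under simplifying assumptions (fragments are sampled uniformly along a transcript, and each transcript has a single effective length); the names `compat`, `eff_lengths` and `alpha` are illustrative, and this is not presented as the procedure Cufflinks itself uses.

```python
def estimate_abundances(compat, eff_lengths, n_iter=200):
    """EM estimate of relative transcript abundances.

    compat[f]      : list of transcript indices fragment f could originate from
    eff_lengths[t] : effective length of transcript t
    Returns relative abundances (summing to 1), corrected for length.
    """
    n_t = len(eff_lengths)
    alpha = [1.0 / n_t] * n_t  # fraction of fragments generated by each transcript
    for _ in range(n_iter):
        counts = [0.0] * n_t
        # E-step: split each fragment among its compatible transcripts;
        # P(fragment | t) is 1 / eff_lengths[t], so P(fragment) is linear in alpha
        for ts in compat:
            weights = [alpha[t] / eff_lengths[t] for t in ts]
            total = sum(weights)
            for t, w in zip(ts, weights):
                counts[t] += w / total
        # M-step: new fragment fractions are the expected counts
        alpha = [c / len(compat) for c in counts]
    rho = [a / l for a, l in zip(alpha, eff_lengths)]  # per-base abundance
    norm = sum(rho)
    return [r / norm for r in rho]
```

As a usage example, with two equally long transcripts, 30 fragments unique to the first, 10 unique to the second and 20 compatible with both, the estimate converges to abundances of 0.75 and 0.25: the ambiguous fragments are shared in proportion to the unambiguous evidence.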
Reference-based assembly strategies
Advantages:
1 many independent alignment problems: easy to parallelize
2 robust against contamination and sequencing artefacts
3 very sensitive, even for transcripts with low abundance
4 allows discoveries of novel transcripts
5 requires relatively moderate coverage: > 10X
Disadvantages:
1 depends on high quality reference sequence
however: strawberry reference genome was used for assembly of raspberry transcriptome
2 large introns can be missed
3 what to do with reads that map to multiple places in the genome?
4 trans-spliced genes can't be assembled (important in genetic pathway of some types of cancer)
Reference-free assembly
1 Generate all substrings of length k from the reads
2 Generate the de Bruijn graph (directed graph representing overlaps between sequences of symbols)
3 Collapse the de Bruijn graph
4 Traverse the graph and assemble isoforms
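Steps 1 to 3 can be sketched in a few lines. This is a minimal illustration with hypothetical function names: nodes are (k-1)-mers, every distinct k-mer contributes one edge, and collapsing merges unbranched chains into contigs. Error correction, coverage weights, reverse complements, isolated cycles and the isoform traversal of step 4 are all left out.

```python
from collections import defaultdict

def kmers(read, k):
    """All substrings of length k of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
            graph.setdefault(kmer[1:], set())  # make sure sink nodes exist
    return graph

def collapse(graph):
    """Merge unbranched chains of nodes into contigs."""
    indeg = defaultdict(int)
    for succs in graph.values():
        for v in succs:
            indeg[v] += 1

    def is_internal(node):  # exactly one way in and one way out
        return indeg[node] == 1 and len(graph[node]) == 1

    contigs = []
    for u in sorted(graph):
        if is_internal(u):
            continue  # contigs start only at branch points, sources or sinks
        for v in sorted(graph[u]):
            contig = u + v[-1]
            while is_internal(v):
                (v,) = graph[v]
                contig += v[-1]
            contigs.append(contig)
    return sorted(contigs)
```

Two overlapping reads such as `"ACGTTG"` and `"GTTGCA"` with `k=4` collapse into the single contig `"ACGTTGCA"`, whereas reads that diverge after a shared prefix produce a branch point, which is how alternative isoforms show up in the graph.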
Reference-free assembly strategies
Advantages:
1 independent of a high quality reference sequence
2 can assemble transcripts that are absent from genome assembly
3 independent of correct alignment
4 no problems with long introns
5 trans-spliced transcripts are detectable
Disadvantages:
1 computationally demanding
2 requires high sequencing depth > 30X
3 sensitive to sequencing errors
4 contamination by chimeric reads