Bioinformatics 2 - Lecture 5

Gabriele Schweikert

University of Edinburgh

March 7, 2013

RNA-Seq

Credits: Darryl Leja (NHGRI), Ian Dunham (EBI)

RNA vs DNA

Pedro Romero et al., PNAS 2006

RNA-Seq: Transcript identification

Which parts of the DNA sequence are transcribed into RNA?

1 protein-coding genes are transcribed into mRNAs.

  in humans around 20,000 protein-coding genes
  account for around 1.5% of the genome sequence
  heterogeneous in length (100s-10000s bp)
  coding regions are flanked by untranslated regions (UTRs)
  coding exons, usually interspersed by non-coding introns

Alternative isoforms increase functional diversity

many genes are associated with several different isoforms:

alternative transcription start sites (TSS)

alternative splicing

alternative cleavage sites

Pedro Romero et al., PNAS 2006

RNA-Seq: Transcript identification

Which parts of the DNA sequence are transcribed into RNA?

1 protein-coding genes

2 RNA genes

  transcribed into non-coding RNAs (ncRNAs)
  functional RNA molecules that are not translated into proteins
  include highly abundant and functionally important RNAs:
    1 ribosomal RNA (rRNA): major structural component of the ribosome (translation machinery)
    2 transfer RNA (tRNA): acts as an adaptor between amino acids and mRNA during protein synthesis
  other important types include:
    1 miRNAs: regulate gene expression
    2 small nuclear RNAs (snRNAs): function in RNA splicing, regulating TFs or RNA pol II
    3 long noncoding RNAs (lncRNAs)
    4 many others

RNA-Seq: Transcript identification

Which parts of the DNA sequence are transcribed into RNA?

1 protein-coding genes

2 RNA genes

3 pseudogenes

  dysfunctional copies of genes
  usually no longer expressed in the cell
  accumulate multiple mutations

RNA: summary

The typical eukaryotic cell contains ≈ 10^-5 µg of RNA

80-85% rRNA

15-20% low MW species including tRNAs, snRNAs, miRNAs

1-5% of total cellular RNA is messenger RNA (mRNA)

→ filter for RNAs that you are interested in!

→ remove RNAs that you are not interested in!

RNA-Seq: Beyond transcript identification

Transcript Quantification: RNA-Seq allows quantitative measurement of the cellular RNA complement

Jonas Behr: Transcript Inference and Quantification Using MIP. NIPS-MLCB, December, 2011

Library Preparation

Martin and Wang, Nature Reviews Genetics, 2011

Library Preparation and Sequencing Considerations

Poly(A) selection vs depletion of rRNAs ?

Eliminate PCR amplification? (to remove biases against GC content)

strand-specific RNA-Seq

read length?

single-end vs paired-end?

sequencing depth, multiplexing?

Data Analysis

Martin and Wang, Nature Reviews Genetics, 2011

Computational Challenges

Sequencing depth of transcripts varies over several orders of magnitude

Transcription is strand specific

Transcript variants from the same gene can share exons; difficult to resolve unambiguously

Gabriele Schweikert Bioinformatics 2 - Lecture 5 13

Page 23: Bioinformatics 2 - Lecture 5 · Gabriele Schweikert Bioinformatics 2 - Lecture 5 6. RNA-Seq: Transcript identi cation Which parts of the DNA sequence are transcribed into RNA? 1 protein-coding

Pre-processing

Removing artefacts

sequencing adaptors

low complexity sequences

duplicates which are derived from PCR amplification

rRNA sequence contamination

Sequencing errors (low quality scores)
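
The artefact filters listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the adaptor string, the thresholds and the function names are all hypothetical, quality strings are assumed to be Phred+33 encoded, and rRNA contamination (which would need a mapping step against rRNA sequences) is not handled.

```python
import math
from collections import Counter

ADAPTER = "AGATCGGAAGAGC"  # assumed adaptor sequence, for illustration only


def trim_adapter(seq, qual, adapter=ADAPTER):
    """Cut the read at the first exact occurrence of the adaptor."""
    pos = seq.find(adapter)
    return (seq, qual) if pos == -1 else (seq[:pos], qual[:pos])


def mean_quality(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    return sum(ord(c) - offset for c in qual) / len(qual) if qual else 0.0


def entropy(seq):
    """Shannon entropy (bits) of the base composition; low values flag low complexity."""
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())


def preprocess(reads, min_len=20, min_qual=20, min_entropy=1.0):
    """Apply adaptor trimming, then length, quality, complexity and duplicate filters."""
    seen = set()
    kept = []
    for seq, qual in reads:
        seq, qual = trim_adapter(seq, qual)
        if len(seq) < min_len:
            continue  # too short after trimming
        if mean_quality(qual) < min_qual:
            continue  # likely sequencing errors
        if entropy(seq) < min_entropy:
            continue  # low complexity sequence
        if seq in seen:
            continue  # exact duplicate, possibly from PCR amplification
        seen.add(seq)
        kept.append((seq, qual))
    return kept
```

Treating every exact duplicate as a PCR artefact is a simplification: highly expressed transcripts also produce identical reads, so real pipelines are usually more careful here.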

Assembly Strategies

Reference-based

Reference-free

Hybrids

Reference-based assembly strategies

Martin and Wang, Nature Reviews Genetics, 2011

Reference-based assembly strategies

In the early days: known exons or splice junctions needed

now: ab initio; mapping without annotation

Assembly in 3 steps:

1 spliced alignment, e.g. Cufflinks, Scripture, PALMapper, Oases, Trinity

  based on:
    seed-and-extend strategy: find an exact match of a substring, then extend using Smith-Waterman alignment
    Burrows-Wheeler Transform (BWT)

2 building graph representation

3 traversing the graph to resolve individual isoforms
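
A minimal sketch of the seed-and-extend idea, in Python. To keep it short, the extension step here is an ungapped comparison that only counts mismatches; this is a stand-in for the Smith-Waterman extension named on the slide, not an implementation of it. The function names and the mismatch threshold are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(ref, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index


def seed_and_extend(read, ref, index, k, max_mismatches=2):
    """Use the first k bases of the read as an exact seed, then extend without gaps."""
    hits = []
    for start in index.get(read[:k], []):
        window = ref[start:start + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits
```

The exact seed lookup is cheap, so the more expensive comparison is only run at the few positions where the seed matches.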

Spliced-alignment: Tophat

1 map all reads using Bowtie (fast short-read mapping program, uses a BWT-indexed genome)

  allow some (2) mismatches in the 5’ end of the reads
  up to 10 alignments are allowed per read
  → low complexity reads are discarded
  → initially unmapped reads (IUMs) put aside

2 assemble mapped reads into covered regions: islands

  extend islands on both sides (45bp)
  remove gaps: merge nearby islands of lowly expressed genes (introns typically larger than 70bp)

3 find all possible introns

  extract sequence for islands using the reference genome
  find all canonical donor and acceptor splice sites: GT, AG
  consider all possible pairings to form introns: GT-AG (max length: 20,000bp)
  include ’single-island’ junctions in high-coverage islands
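
Steps 2 and 3 can be sketched as follows. This is an illustrative Python sketch, not TopHat's code: coordinates are assumed to be 0-based and half-open, `max_gap` is a hypothetical merge threshold, strand is ignored, and single-island junctions are not treated specially.

```python
def build_islands(alignments, flank=45, max_gap=6):
    """Extend each aligned interval by `flank` bp and merge intervals separated by <= max_gap bp."""
    intervals = sorted((max(0, s - flank), e + flank) for s, e in alignments)
    islands = []
    for s, e in intervals:
        if islands and s - islands[-1][1] <= max_gap:
            islands[-1][1] = max(islands[-1][1], e)  # close the gap
        else:
            islands.append([s, e])
    return [tuple(i) for i in islands]


def candidate_introns(genome, islands, max_intron=20000):
    """Pair every GT donor with every downstream AG acceptor found inside the islands."""
    donors, acceptors = [], []
    for s, e in islands:
        for i in range(s, min(e, len(genome)) - 1):
            dinucleotide = genome[i:i + 2]
            if dinucleotide == "GT":
                donors.append(i)
            elif dinucleotide == "AG":
                acceptors.append(i)
    # An intron is reported as [donor, acceptor + 2): it starts with GT and ends with AG.
    return [(d, a + 2) for d in donors for a in acceptors
            if a >= d + 2 and a + 2 - d <= max_intron]
```

The number of candidate pairings grows quickly with island size, which is why the candidates are only hypotheses that still have to be confirmed by reads spanning the junction.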

Tophat continued

4 check IUMs for reads spanning introns with a seed-and-extend strategy

  create 2k-mer seeds from all junctions, with k = 5bp exonic sequence on either side of the junction
  index IUM reads: table is keyed by 2k-mers
  each 2k-mer is associated with the reads that contain it
  IUM read index queried with seeds
  reads containing the seed are checked for complete alignment (allowing some mismatches)
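
The seed table of step 4 might look like the sketch below. It is a simplified illustration: introns are given as half-open `(start, end)` coordinates on a single strand, only the first occurrence of the seed in a read is checked, and all function names are hypothetical.

```python
from collections import defaultdict


def junction_seed(genome, intron, k=5):
    """Join the k exonic bases left of the intron with the k exonic bases right of it."""
    start, end = intron
    return genome[start - k:start] + genome[end:end + k]


def index_reads(reads, k=5):
    """Table keyed by 2k-mer; each entry holds the ids of the reads containing that 2k-mer."""
    table = defaultdict(set)
    for rid, read in enumerate(reads):
        for i in range(len(read) - 2 * k + 1):
            table[read[i:i + 2 * k]].add(rid)
    return table


def reads_spanning(genome, intron, reads, table, k=5, max_mismatches=2):
    """Query the table with the junction seed and check each hit for a complete spliced alignment."""
    start, end = intron
    seed = junction_seed(genome, intron, k)
    hits = []
    for rid in sorted(table.get(seed, ())):
        read = reads[rid]
        left = read.find(seed) + k       # read bases upstream of the junction
        right = len(read) - left         # read bases downstream of the junction
        if start - left < 0 or end + right > len(genome):
            continue
        spliced = genome[start - left:start] + genome[end:end + right]
        if sum(a != b for a, b in zip(read, spliced)) <= max_mismatches:
            hits.append(rid)
    return hits
```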

Tophat continued

5 heuristically filter by minimum minor isoform frequency

  if frequency of splice junction < 15% of the depth of coverage of exons → discard junction
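
As a worked example of this filter, the sketch below keeps a junction only if its read support reaches 15% of the coverage of the flanking exons. How the exon coverage is summarised per junction is an assumption here (a single average depth per junction), and the names are hypothetical.

```python
def filter_junctions(junction_counts, exon_coverage, min_fraction=0.15):
    """Keep junctions whose supporting read count is at least min_fraction of the exon depth.

    junction_counts: junction id -> number of reads spanning the junction
    exon_coverage:   junction id -> average depth of coverage of the flanking exons
    """
    return {
        junction: count
        for junction, count in junction_counts.items()
        if count >= min_fraction * exon_coverage[junction]
    }


# A junction seen 30 times next to exons covered 100x is kept (30%);
# one seen twice is discarded (2%).
kept = filter_junctions({"j1": 30, "j2": 2}, {"j1": 100.0, "j2": 100.0})
```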

Alternative strategies

1 QPALMA: predicts splice sites from DNA sequence (trains on known splice sites)

2 STAR

3 GSNAP

Single vs paired end reads

Paired end

75-150bp sequenced from both ends

Paired reads from long inserts (500-1000bp)

long-range exon connectivity

combining libraries with different insert sizes

Cufflinks

Trapnell et al., Nature Biotech, 2010

Transcriptome Assembly: Cufflinks

1 assemble overlapping bundles independently

2 for each bundle, build a weighted bipartite graph that represents compatibilities among fragments

  identify pairs of incompatible fragments originating from different spliced isoforms
  fragments connected when compatible and overlapping
  each fragment has one node
  directed edges placed from left to right between each pair of compatible fragments

Transcriptome Assembly: Cufflinks

1 assemble overlapping bundles independently

2 build weighted bipartite graph

3 isoforms assembled from the overlap graph

  path through the graph corresponds to mutually compatible fragments
  find minimum path cover of the graph
  minimum set of transcripts that ’explain’ the intron junctions within reads
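
The minimum path cover idea can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the Cufflinks implementation: a fragment is a hypothetical `(start, end, introns)` tuple, two fragments count as compatible when they overlap and imply the same introns inside the overlap, edges are unweighted, and the cover is found by plain bipartite matching on the overlap graph (number of paths = number of fragments minus size of the maximum matching).

```python
def introns_in(fragment, lo, hi):
    """Introns of a fragment that intersect the region [lo, hi)."""
    return {i for i in fragment[2] if i[0] < hi and i[1] > lo}


def compatible(a, b):
    """Overlapping fragments are compatible if they agree on the introns in their overlap."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    if lo >= hi:
        return False
    return introns_in(a, lo, hi) == introns_in(b, lo, hi)


def min_path_cover(fragments):
    """Cover all fragments with as few left-to-right paths of compatible fragments as possible."""
    frags = sorted(fragments)
    n = len(frags)
    # Directed edge i -> j (left to right) between compatible fragments.
    adj = [[j for j in range(i + 1, n) if compatible(frags[i], frags[j])]
           for i in range(n)]
    pred = [-1] * n  # pred[j] = i means the cover uses edge i -> j

    def augment(i, seen):
        for j in adj[i]:
            if j not in seen:
                seen.add(j)
                if pred[j] == -1 or augment(pred[j], seen):
                    pred[j] = i
                    return True
        return False

    for i in range(n):
        augment(i, set())

    succ = {i: j for j, i in enumerate(pred) if i != -1}
    paths = []
    for start in (i for i in range(n) if pred[i] == -1):
        path = [start]
        while path[-1] in succ:
            path.append(succ[path[-1]])
        paths.append([frags[i] for i in path])
    return paths
```

With fragments from an exon-skipping gene, the fragment supporting the skip ends up on its own path, so two paths (two isoforms) are needed to explain all fragments.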

Cufflinks: Transcript Quantification

1 assign fragments to transcripts from which they could have originated (potentially more than 1)

2 incorporate fragment lengths

3 statistical model: probability of observing each fragment is a linear function of the abundance of the transcript

4 maximize a function that assigns a likelihood to all possible sets of relative abundances
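
One common way to maximize such a likelihood is expectation-maximization (EM). The sketch below is a generic EM for fragment assignment, given as an illustration rather than as the Cufflinks procedure: it assumes a fragment is equally likely to come from any position of a compatible transcript, ignores the fragment length distribution of step 2, and uses hypothetical names throughout.

```python
def em_abundances(compat, lengths, n_iter=200):
    """Estimate relative transcript abundances from ambiguous fragment assignments.

    compat:  one set of compatible transcript ids per fragment
    lengths: transcript id -> (effective) transcript length
    """
    transcripts = sorted(lengths)
    # alpha[t]: fraction of fragments originating from transcript t
    alpha = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        counts = {t: 0.0 for t in transcripts}
        # E-step: split each fragment among its compatible transcripts
        for ts in compat:
            weights = {t: alpha[t] / lengths[t] for t in ts}
            total = sum(weights.values())
            for t, w in weights.items():
                counts[t] += w / total
        # M-step: update the fragment fractions
        alpha = {t: counts[t] / len(compat) for t in transcripts}
    # Convert fragment fractions into length-normalised relative abundances
    rho = {t: alpha[t] / lengths[t] for t in transcripts}
    norm = sum(rho.values())
    return {t: r / norm for t, r in rho.items()}
```

For two transcripts of equal length with 60 fragments unique to A, 20 unique to B and 20 compatible with both, the ambiguous fragments are shared in proportion to the current estimates and the result converges to about 0.75 for A and 0.25 for B.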

Reference-based assembly strategies

Advantages:

1 many independent alignment problems: easy to parallelize
2 robust against contamination and sequencing artefacts
3 very sensitive, even for transcripts with low abundance
4 allows discoveries of novel transcripts
5 requires relatively moderate coverage: > 10X

Disadvantages:

1 depends on high quality reference sequence
  however: strawberry reference genome was used for assembly of raspberry transcriptome
2 large introns can be missed
3 what to do with reads that map to multiple places in the genome?
4 trans-spliced genes can’t be assembled (important in genetic pathway of some types of cancer)

Reference-free assembly

1 Generate all substrings of length k (k-mers) from the reads

2 Generate the de Bruijn graph (directed graph representing overlaps between sequences of symbols)

3 Collapse the de Bruijn graph

4 Traverse the graph and assemble isoforms
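Steps 1 to 3 can be sketched as follows. This is a minimal illustration under simplifying assumptions (error-free reads, a single strand, no coverage information), not the method of any particular assembler. The function names `kmers`, `build_de_bruijn` and `collapse` are made up for this example.

```python
from collections import defaultdict


def kmers(read, k):
    """Step 1: all substrings of length k of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Step 2: nodes are (k-1)-mers; every k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def collapse(graph):
    """Step 3: merge unbranched chains of nodes into single sequences.

    Isolated cycles (every node has one predecessor and one successor)
    are ignored in this sketch.
    """
    indeg = defaultdict(int)
    for succs in graph.values():
        for v in succs:
            indeg[v] += 1
    nodes = set(graph) | set(indeg)

    def is_internal(n):
        # exactly one incoming and one outgoing edge
        return indeg[n] == 1 and len(graph.get(n, ())) == 1

    contigs = []
    for u in sorted(nodes):
        if is_internal(u):
            continue  # chains start only at branch points or ends
        for v in sorted(graph.get(u, ())):
            path = u + v[-1]
            while is_internal(v):
                v = next(iter(graph[v]))
                path += v[-1]
            contigs.append(path)
    return contigs


# Two hypothetical overlapping reads from the same transcript.
reads = ["ATGGCGT", "GCGTGCA"]
print(collapse(build_de_bruijn(reads, 4)))
```

Step 4 is not covered here: with alternative isoforms the collapsed graph contains branch points, and each isoform corresponds to a different path through them.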

Reference-free assembly strategies

Advantages:

1 independent of a high quality reference sequence

2 can assemble transcripts that are absent from genome assembly

3 independent of correct alignment

4 no problems with long introns

5 trans-spliced transcripts are detectable

Disadvantages:

1 computationally demanding

2 requires high sequencing depth > 30X

3 sensitive to sequencing errors

4 contamination of chimeric reads
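One way to see why k-mer based assembly is sensitive to sequencing errors: a single miscalled base changes every k-mer that overlaps it, so up to k erroneous k-mers (and the corresponding false nodes and edges) enter the graph. The sequences and the helper names below are hypothetical and chosen only to illustrate this.

```python
def kmer_set(seq, k):
    """All distinct substrings of length k of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def spurious_kmers(true_seq, read, k):
    """k-mers that occur in the read but not in the true sequence."""
    return kmer_set(read, k) - kmer_set(true_seq, k)


true_seq = "ATGGCGTGCAATCG"
bad_read = "ATGGCGTACAATCG"  # one substitution (G -> A) in the middle

# Every k-mer overlapping the miscalled base is new to the graph.
print(sorted(spurious_kmers(true_seq, bad_read, 5)))
```

At high sequencing depth such erroneous k-mers are typically much rarer than the correct ones, which is one reason why deeper coverage helps reference-free assembly.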
