
Bioinformatics 2 - Lecture 5

Gabriele Schweikert

University of Edinburgh

March 7, 2013


RNA-Seq

Credits: Darryl Leja (NHGRI), Ian Dunham (EBI)


RNA vs DNA

Pedro Romero et al., PNAS 2006


RNA-Seq: Transcript identification

Which parts of the DNA sequence are transcribed into RNA?

1 protein-coding genes are transcribed into mRNAs.

in humans, around 20,000 protein-coding genes

account for around 1.5% of the genome sequence

heterogeneous in length (100s-10,000s bp)

coding regions are flanked by untranslated regions (UTRs)

coding exons, usually interspersed with non-coding introns


Alternative isoforms increase functional diversity

many genes are associated with several different isoforms:

alternative transcription start sites (TSS)

alternative splicing

alternative cleavage sites

Pedro Romero et al., PNAS 2006



Which parts of the DNA sequence are transcribed into RNA?

1 protein-coding genes

2 RNA genes

transcribed into non-coding RNAs (ncRNA)

functional RNA molecules that are not translated into proteins

include highly abundant and functionally important RNAs:

1 ribosomal RNA (rRNA): major structural components of the ribosome (translation machinery)

2 transfer RNA (tRNA): act as adaptors between amino acids and mRNA during protein synthesis

other important types include:

1 miRNAs: regulate gene expression

2 small nuclear RNAs (snRNAs): function in RNA splicing, regulating TFs or RNA pol II

3 long noncoding RNAs (lncRNAs)

4 many others



Which parts of the DNA sequence are transcribed into RNA?

1 protein-coding genes

2 RNA genes

3 pseudogenes

dysfunctional copies of genes

no longer expressed in the cell

accumulate multiple mutations


RNA: summary

The typical eukaryotic cell contains ≈ 10^-5 µg of RNA

80-85% rRNA

15-20% low MW species including tRNAs, snRNAs, miRNAs

1-5% of total cellular RNA is messenger RNA (mRNA)

→ filter for RNAs that you are interested in!

→ remove RNAs that you are not interested in!
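The percentages above can be turned into approximate masses per cell, which shows why enrichment matters. This is a minimal arithmetic sketch in Python; the names `TOTAL_RNA_UG`, `FRACTIONS` and `mass_range_pg` are made up for illustration, and the numbers are simply the ranges quoted on the slide.

```python
# Approximate RNA content per cell, taken from the summary above.
TOTAL_RNA_UG = 1e-5  # micrograms of total RNA per cell

# (lower, upper) fraction of total RNA for each class, as quoted above.
FRACTIONS = {
    "rRNA": (0.80, 0.85),
    "low MW species": (0.15, 0.20),
    "mRNA": (0.01, 0.05),
}


def mass_range_pg(total_ug, fraction_range):
    """Convert a fraction range of the total RNA mass into picograms."""
    low, high = fraction_range
    total_pg = total_ug * 1e6  # 1 microgram = 1e6 picograms
    return total_pg * low, total_pg * high
```

For example, `mass_range_pg(TOTAL_RNA_UG, FRACTIONS["mRNA"])` gives roughly 0.1 to 0.5 pg of mRNA out of about 10 pg of total RNA, so without selection or depletion most reads would come from rRNA.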


RNA-Seq: Beyond transcript identification

Transcript Quantification:

RNA-Seq makes it possible to quantitatively measure the cellular RNA complement

Jonas Behr: Transcript Inference and Quantification Using MIP. NIPS-MLCB, December, 2011


Library Preparation

Martin and Wang, Nature Reviews Genetics, 2011


Library Preparation and Sequencing Considerations

Poly(A) selection vs depletion of rRNAs?

Eliminate PCR amplification? (to remove biases against GC content)

strand-specific RNA-Seq

read length?

single-end vs paired-end?

sequencing depth, multiplexing?


Data Analysis



Computational Challenges

Sequencing depth of transcripts varies over several orders of magnitude

Transcription is strand specific

Transcript variants from the same gene can share exons; difficult to resolve unambiguously


Pre-processing

Removing artefacts

sequencing adaptors

low complexity sequences

duplicates which are derived from PCR amplification

rRNA sequence contamination

Sequencing errors (low quality scores)
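Several of the artefact filters listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: reads are `(sequence, quality)` pairs with Phred+33 quality strings, the adaptor string and all thresholds (`min_len`, `min_quality`, `max_fraction`) are example values, and duplicates are detected by exact sequence identity. Removing rRNA contamination is not shown, since it would need an alignment against rRNA reference sequences.

```python
from collections import Counter

ADAPTOR = "AGATCGGAAGAGC"  # example adaptor sequence (assumption)


def trim_adaptor(seq, qual, adaptor=ADAPTOR):
    """Cut the read at the first exact occurrence of the adaptor."""
    pos = seq.find(adaptor)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


def mean_phred(qual, offset=33):
    """Mean Phred quality of a read, assuming Phred+33 encoding."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def is_low_complexity(seq, max_fraction=0.8):
    """Flag reads dominated by a single base (a crude complexity measure)."""
    if not seq:
        return True
    most_common_count = Counter(seq).most_common(1)[0][1]
    return most_common_count / len(seq) > max_fraction


def preprocess(reads, min_len=20, min_quality=20):
    """Apply adaptor trimming, quality, complexity and duplicate filters."""
    seen = set()
    kept = []
    for seq, qual in reads:
        seq, qual = trim_adaptor(seq, qual)
        if len(seq) < min_len:
            continue
        if mean_phred(qual) < min_quality:
            continue
        if is_low_complexity(seq):
            continue
        if seq in seen:  # exact duplicate, e.g. from PCR amplification
            continue
        seen.add(seq)
        kept.append((seq, qual))
    return kept
```

Exact-sequence duplicate removal is a deliberate simplification; with highly expressed transcripts, identical reads can also arise without PCR, so real pipelines treat this step with care.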


Assembly Strategies

Reference-based

Reference-free

Hybrids


Reference-based assembly strategies




In the early days: known exons or splice junctions needed

now: ab-initio; mapping without annotation

Assembly in 3 steps:

1 spliced alignment, e.g. Cufflinks, Scripture, PALMapper, Oases, Trinity

based on:

seed-and-extend strategy: find an exact match of a substring, then extend using Smith-Waterman alignment

Burrows-Wheeler Transform (BWT)

2 building a graph representation

3 traversing the graph to resolve individual isoforms
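The seed-and-extend idea from step 1 can be illustrated with a small Python sketch. Two simplifications are assumed here: the reference is indexed with a plain hash table of k-mers rather than a BWT, and the extension step counts mismatches in an ungapped comparison instead of running Smith-Waterman. The function names and the `max_mismatches` parameter are illustrative only.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_and_extend(read, reference, index, k, max_mismatches=2):
    """Return (start, mismatches) of the best ungapped hit, or None."""
    best = None
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        # Exact seed hits propose candidate start positions for the read.
        for ref_pos in index.get(seed, []):
            start = ref_pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            # Extension: compare the whole read against the reference window.
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                if best is None or mismatches < best[1]:
                    best = (start, mismatches)
    return best
```

A read only needs one error-free k-mer to be found, which is why seeding tolerates mismatches elsewhere in the read.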


Spliced-alignment: Tophat



1 map all reads using Bowtie (fast short-read mapping program, uses a BWT-indexed genome)

allow some (2) mismatches in the 5’ end of the reads

up to 10 alignments are allowed per read

→ low complexity reads are discarded

→ initially unmapped reads (IUMs) are put aside

2 assemble mapped reads into covered regions: islands

extend islands on both sides (45bp)

remove gaps: merge nearby islands of lowly expressed genes (introns are typically larger than 70bp)

3 find all possible introns

extract sequence for islands using the reference genome

find all canonical donor and acceptor splice sites: GT, AG

consider all possible pairings to form introns: GT-AG (max length: 20,000bp)

include ’single-island’ junctions in highly covered islands
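Steps 2 and 3 can be sketched as follows. This is a simplified illustration, not TopHat's code: `find_islands` works on a per-base coverage list, the `max_gap` merging threshold and the order of extending before merging are assumptions, and `candidate_introns` pairs GT and AG sites within a single sequence rather than across separate island sequences. The `min_len` default of 70 is taken from the slide's remark about typical intron sizes and is an assumed use of that number.

```python
def find_islands(coverage, flank=45, max_gap=6):
    """Return covered regions as (start, end) half-open intervals."""
    islands = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth > 0 and start is None:
            start = pos
        elif depth == 0 and start is not None:
            islands.append((start, pos))
            start = None
    if start is not None:
        islands.append((start, len(coverage)))

    merged = []
    for s, e in islands:
        # Extend each island on both sides, then merge close neighbours.
        s = max(0, s - flank)
        e = min(len(coverage), e + flank)
        if merged and s - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return [tuple(island) for island in merged]


def candidate_introns(sequence, min_len=70, max_len=20000):
    """Pair every GT donor with every downstream AG acceptor in range."""
    donors = [i for i in range(len(sequence) - 1) if sequence[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(sequence) - 1) if sequence[i:i + 2] == "AG"]
    introns = []
    for d in donors:
        for a in acceptors:
            length = a + 2 - d  # the intron spans sequence[d:a + 2]
            if min_len <= length <= max_len:
                introns.append((d, a + 2))
    return introns
```

The number of candidate pairings grows quickly with island length, which is why the maximum intron length acts as an important bound on the search.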


Tophat continued

4 check IUMs for reads spanning introns with a seed-and-extend strategy

create 2k-mer seeds from all junctions, with k = 5bp of exonic sequence on either side of the junction

index IUM reads: table is keyed by 2k-mers

each 2k-mer is associated with the reads that contain it

IUM read index is queried with the seeds

reads containing the seed are checked for complete alignment (allowing some mismatches)
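The seed table described above can be sketched in Python as follows. The coordinate convention is an assumption made for this example: a junction is a pair `(intron_start, intron_end)` of half-open genome coordinates, so the exonic bases lie just before `intron_start` and from `intron_end` onwards. The final full-alignment check with mismatches is left out; the sketch only finds reads that contain a junction seed.

```python
from collections import defaultdict


def junction_seed(genome, intron_start, intron_end, k=5):
    """Join the last k exonic bases before the intron with the first k after it."""
    return genome[intron_start - k:intron_start] + genome[intron_end:intron_end + k]


def index_reads(reads, k=5):
    """Build a table keyed by 2k-mer, listing the reads that contain it."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for i in range(len(read) - 2 * k + 1):
            index[read[i:i + 2 * k]].add(read_id)
    return index


def reads_spanning(junctions, genome, reads, k=5):
    """Query the read index with one seed per candidate junction."""
    index = index_reads(reads, k)
    hits = {}
    for intron_start, intron_end in junctions:
        seed = junction_seed(genome, intron_start, intron_end, k)
        hits[(intron_start, intron_end)] = sorted(index.get(seed, ()))
    return hits
```

Indexing the reads rather than the junctions keeps the table size tied to the number of initially unmapped reads, while each junction costs a single lookup.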


Tophat continued

5 heuristically filter by minimum minor isoform frequency

if frequency of splice junction < 15% of the depth of coverage of exons

→ discard junction
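The filter in step 5 amounts to a ratio test. In this minimal Python sketch, `junction_counts` maps each junction to the number of reads spanning it and `exon_coverage` maps it to the depth of coverage of the neighbouring exons; how that depth is computed is not specified on the slide, so both inputs and the function name are assumptions.

```python
def filter_junctions(junction_counts, exon_coverage, min_fraction=0.15):
    """Keep junctions whose read support is at least min_fraction of exon depth."""
    kept = {}
    for junction, count in junction_counts.items():
        depth = exon_coverage[junction]
        if depth > 0 and count / depth >= min_fraction:
            kept[junction] = count
    return kept
```

With an exon depth of 40, a junction supported by 30 reads is kept (75%), while one supported by 2 reads is discarded (5%).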


Alternative strategies

1 QPALMA: predicts splice sites from DNA sequence (trains on known splice sites)

2 STAR

3 GSNAP


Single vs paired end reads


Paired end

75-150bp sequenced from both ends

Paired reads from long inserts (500-1000bp)

long-range exon connectivity

combining libraries with different insert sizes


Cufflinks

Trapnell et al., Nature Biotech, 2010



Transcriptome Assembly: Cufflinks

1 assemble overlapping bundles independently

2 for each bundle, build a weighted bipartite graph that represents compatibilities among fragments

identify pairs of incompatible fragments originating from different spliced isoforms

fragments connected when compatible and overlapping

each fragment has one node

directed edges placed from left to right between each pair of compatible fragments

3 isoforms assembled from the overlap graph

a path through the graph corresponds to mutually compatible fragments

find a minimum path cover of the graph

minimum set of transcripts that ’explain’ the intron junctions within reads
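The minimum path cover can be illustrated with a standard graph result: in a directed acyclic graph, the smallest number of vertex-disjoint paths covering all nodes equals the number of nodes minus the size of a maximum matching in the bipartite graph built from its edges. The Python sketch below is not Cufflinks's implementation. It ignores edge weights, assumes fragments are numbered 0..n-1 from left to right, and assumes `edges` already lists every compatible pair, so that compatibility is transitively closed.

```python
def minimum_path_cover(n, edges):
    """Minimum vertex-disjoint path cover of a DAG with nodes 0..n-1."""
    successors = [[] for _ in range(n)]
    for u, v in edges:
        successors[u].append(v)

    # match_right[v] = u means the edge u -> v is part of the matching.
    match_right = [None] * n

    def try_augment(u, visited):
        """Search for an augmenting path starting at left node u."""
        for v in successors[u]:
            if v in visited:
                continue
            visited.add(v)
            if match_right[v] is None or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    for u in range(n):
        try_augment(u, set())

    # Matched edges chain nodes into paths; unmatched right nodes start paths.
    next_node = {u: v for v, u in enumerate(match_right) if u is not None}
    paths = []
    for start in range(n):
        if match_right[start] is not None:
            continue
        path = [start]
        while path[-1] in next_node:
            path.append(next_node[path[-1]])
        paths.append(path)
    return paths
```

For four fragments where fragments 1 and 2 are mutually incompatible (edges 0→1, 0→2, 1→3, 2→3, 0→3), the cover has two paths, matching the two isoforms needed to explain the data. A short chain such as `[2]` would still have to be extended with compatible fragments to give a full transcript, which this sketch does not do.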


Cufflinks: Transcript Quantification

1 assign fragments to transcripts from which they could have originated (potentially more than 1)

2 incorporate fragment lengths

3 statistical model: probability of observing each fragment is a linear function of the abundance of the transcript

4 maximize a function that assigns a likelihood to all possible sets of relative abundances
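A common way to maximise a likelihood of this kind is expectation-maximisation (EM); the Python sketch below uses EM as an illustration and should not be read as the exact procedure in Cufflinks. The simplified model is an assumption: each fragment is described only by the list of transcripts it is compatible with, a fragment from transcript t is assumed equally likely to start anywhere along its effective length, and the fragment length distribution from step 2 is ignored.

```python
def estimate_fragment_fractions(compat, eff_lengths, n_iter=200):
    """EM estimate of the fraction of fragments that come from each transcript.

    compat[f] lists the transcripts fragment f could have originated from.
    eff_lengths[t] is the effective length of transcript t.
    """
    n_transcripts = len(eff_lengths)
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        for transcripts in compat:
            # E-step: share the fragment among its compatible transcripts.
            weights = [theta[t] / eff_lengths[t] for t in transcripts]
            total = sum(weights)
            for t, w in zip(transcripts, weights):
                expected[t] += w / total
        # M-step: new fractions are the normalised expected fragment counts.
        theta = [count / len(compat) for count in expected]
    return theta


def relative_abundances(theta, eff_lengths):
    """Convert fragment fractions into length-normalised relative abundances."""
    per_base = [th / length for th, length in zip(theta, eff_lengths)]
    total = sum(per_base)
    return [value / total for value in per_base]
```

With 20 fragments unique to transcript 0, 60 unique to transcript 1 and 20 shared, and equal effective lengths, the estimate converges to about 0.25 and 0.75: the shared fragments are split in proportion to the evidence from the unique ones.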



Reference-based assembly strategies

Advantages:

1 many independent alignment problems: easy to parallelize

2 robust against contamination and sequencing artefacts

3 very sensitive, even for transcripts with low abundance

4 allows discovery of novel transcripts

5 requires relatively moderate coverage: > 10X

Disadvantages:

1 depends on a high quality reference sequence

however: the strawberry reference genome was used for assembly of the raspberry transcriptome

2 large introns can be missed

3 what to do with reads that map to multiple places in the genome?

4 trans-spliced genes can’t be assembled (important in the genetic pathways of some types of cancer)

Reference-free assembly

1 Generate all substrings of length k from the reads

2 Generate the de Bruijn graph (directed graph representing overlaps between sequences of symbols)

3 Collapse the de Bruijn graph

4 Traverse the graph and assemble isoforms
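The k-mer and de Bruijn graph steps can be sketched in Python as follows. The construction used here (nodes are (k-1)-mers, each k-mer contributes one edge from its prefix to its suffix) is one common convention and is an assumption of this example. `collapse_linear_path` only follows a single unbranched chain, so it illustrates collapsing but not the traversal of alternative isoforms.

```python
from collections import defaultdict


def kmers(read, k):
    """All substrings of length k, in order of occurrence."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def collapse_linear_path(graph, start):
    """Follow unambiguous edges from start and spell out the contig."""
    in_degree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_degree[node] += 1

    contig, node = start, start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        # Stop at branch points and at cycles.
        if in_degree[nxt] != 1 or nxt in visited:
            break
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig
```

Two overlapping reads such as `ACGTTGCA` and `GTTGCATC` with k = 4 share k-mers, so their edges merge into one chain that spells the underlying sequence; a sequencing error would instead create a short side branch, which is one reason this approach is sensitive to errors.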



Reference-free assembly strategies

Advantages:

1 independent of a high quality reference sequence

2 can assemble transcripts that are absent from genome assembly

3 independent of correct alignment

4 no problems with long introns

5 trans-spliced transcripts are detectable

Disadvantages:

1 computationally demanding

2 requires high sequencing depth > 30X

3 sensitive to sequencing errors

4 contamination by chimeric reads

