
Inference of Allele Specific Isoform Expression (ASIE) Levels from RNA-Seq Data

Sahar Al Seesi and Ion Măndoiu
Computer Science and Engineering

CANGS 2012

Outline

• Problem definition
• Challenges and limitations of current approaches
• ASIE pipeline
  – SNVQ
  – ReFHap
  – Diploid IsoEM
• Results

Gene/Isoform Expression Estimation

[Figure: RNA-Seq workflow. Make cDNA & shatter into fragments; sequence fragment ends; map reads to a reference labeled A B C D E. Outputs: Gene Expression (GE) and Isoform Expression (IE); label combinations shown: A B C, A C, D E.]

Allele Specific Gene/Isoform Expression Estimation

[Figure: the same workflow (make cDNA & shatter into fragments; sequence fragment ends; map reads to A B C D E), with reads and expression levels assigned to the two haplotypes H0 and H1. Outputs: Allele Specific Gene Expression (GE) and Allele Specific Isoform Expression (IE).]

Challenges and limitations of current approaches

• Need for diploid transcriptome
• Existing studies rely on simple allele coverage analysis for heterozygous SNP sites
  – Not isoform specific
  – Read mapping bias towards the reference allele
  – Use less information, hence less robust estimates
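As a point of reference, the coverage-based baseline described above can be sketched as follows. This is only an illustration, assuming that per-site counts of reads carrying the reference and the alternative allele are already available; the function names are hypothetical and do not come from any cited tool.

```python
def allelic_ratio(ref_count, alt_count):
    """Fraction of reads supporting the reference allele at one heterozygous SNP."""
    total = ref_count + alt_count
    if total == 0:
        return None  # site not covered, no estimate possible
    return ref_count / total


def gene_allelic_ratio(site_counts):
    """Pool (ref, alt) read counts over all heterozygous SNP sites of a gene."""
    ref = sum(r for r, _ in site_counts)
    alt = sum(a for _, a in site_counts)
    return allelic_ratio(ref, alt)


# Toy counts (invented for illustration): two heterozygous sites in one gene.
print(gene_allelic_ratio([(30, 10), (20, 20)]))
```

The estimate uses only reads that overlap a heterozygous site and says nothing about which isoform a read came from, which is the limitation the bullets point out.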


Pipeline for ASIE from RNA-Seq Reads


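Hybrid Approach Based on Merging Alignments

[Figure: mRNA reads are mapped both to the transcript library (transcript mapping) and to the genome (genome mapping); the transcript mapped reads and genome mapped reads are combined by read merging into the final set of mapped reads.]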

Merging Rules for Short Reads

Genome        Transcripts   Agree?   Hard Merge
Unique        Unique        Yes      Keep
Unique        Unique        No       Throw
Unique        Multiple      No       Throw
Unique        Not mapped    No       Keep
Multiple      Unique        No       Throw
Multiple      Multiple      No       Throw
Multiple      Not mapped    No       Throw
Not mapped    Unique        No       Keep
Not mapped    Multiple      No       Throw
Not mapped    Not mapped    Yes      Throw
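The merging rules in the table above can be expressed as a small decision function. This is a sketch, assuming each read has already been classified as uniquely mapped, multiply mapped, or unmapped against the genome and against the transcript library; the constant and function names are illustrative.

```python
UNIQUE, MULTIPLE, NOT_MAPPED = "unique", "multiple", "not_mapped"


def hard_merge_keep(genome, transcripts, agree):
    """Return True if the read is kept under the hard-merge rules.

    genome, transcripts: mapping status of the read in each alignment set.
    agree: whether the two alignments place the read at the same location.
    """
    if genome == UNIQUE and transcripts == UNIQUE:
        return agree  # both unique: keep only when they agree
    # Otherwise keep only a read that is unique in one set and unmapped in the other.
    return {genome, transcripts} == {UNIQUE, NOT_MAPPED}


print(hard_merge_keep(UNIQUE, NOT_MAPPED, False))
```

Any read with multiple alignments in either set is thrown away, which is what makes the merge "hard".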

Merging Local Alignments of ION Reads: HardMerge at Base-Level

• Input: SAM files with alignments from genome and transcriptome mapping
• The following alignments are filtered out
  – Any local alignments of length <= 15 bases
  – All alignments of a read that has alignments on different chromosomes or different strands
• Key idea: a read base mapped to multiple locations is discarded
• Output alignments are generated from contiguous stretches of non-ambiguously mapped bases, based on the unique genomic location of these bases
  – Subject to the above filtering criteria
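The base-level steps above can be sketched as follows. This is a simplified illustration rather than the HardMerge implementation: it assumes each local alignment has already been parsed from SAM into an ungapped block `(chrom, strand, read_start, genome_start, length)` in genome coordinates, and it breaks a stretch whenever read or genome positions stop being consecutive. The block format and names are assumptions.

```python
from collections import defaultdict

MIN_LEN = 16  # local alignments of length <= 15 bases are filtered out


def hard_merge(blocks):
    """Merge the local alignment blocks of one read at base level."""
    blocks = [b for b in blocks if b[4] >= MIN_LEN]
    # Drop the read if nothing is left or if it hits several chromosomes/strands.
    if len({(chrom, strand) for chrom, strand, *_ in blocks}) != 1:
        return []
    chrom, strand = blocks[0][:2]

    # For every read base, collect all genomic positions it was aligned to.
    locations = defaultdict(set)
    for _, _, read_start, genome_start, length in blocks:
        for k in range(length):
            locations[read_start + k].add(genome_start + k)

    # Keep only bases with a single genomic location.
    unique = sorted((r, next(iter(g))) for r, g in locations.items() if len(g) == 1)

    merged, run = [], []
    for r, g in unique + [(None, None)]:  # sentinel flushes the last run
        if run and (r is None or r != run[-1][0] + 1 or g != run[-1][1] + 1):
            if len(run) >= MIN_LEN:  # outputs obey the same length filter
                merged.append((chrom, strand, run[0][0], run[0][1], len(run)))
            run = []
        if r is not None:
            run.append((r, g))
    return merged


# Genome alignment covers read bases 0-49; the transcriptome alignment agrees on
# bases 0-29 but places bases 30-49 elsewhere, so those bases are ambiguous.
blocks = [("chr1", "+", 0, 1000, 50), ("chr1", "+", 0, 1000, 30), ("chr1", "+", 30, 5000, 20)]
print(hard_merge(blocks))
```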

HardMerge Example

[Figure: input alignments in genome coordinates → filter multiple local alignments/sub-alignments → output alignment.]

SNV Detection and Genotyping

• A reliable hybrid mapping strategy
• Bayesian model for SNV detection based on quality scores

J. Duitama, P.K. Srivastava, and I.I. Măndoiu, Towards Accurate Detection and Genotyping of Expressed Variants from Whole Transcriptome Sequencing Data, BMC Genomics 13(Suppl 2):S6, 2012

SNVQ Model

• Calculate conditional probabilities by multiplying contributions of individual reads
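The idea of multiplying per-read contributions can be illustrated with a generic Bayesian genotyping sketch. It is not claimed to be the exact SNVQ model: it assumes independent reads, a uniform prior over the ten unordered diploid genotypes unless one is supplied, and the standard Phred error probability 10^(-Q/10) split evenly over the three wrong bases.

```python
import math
from itertools import combinations_with_replacement


def genotype_posteriors(bases, quals, prior=None):
    """Posterior over diploid genotypes at one locus.

    bases: read bases observed at the locus, e.g. "AAGG".
    quals: Phred quality score of each base.
    prior: optional dict genotype -> prior probability (assumed uniform if None).
    """
    genotypes = list(combinations_with_replacement("ACGT", 2))
    prior = prior or {g: 1.0 / len(genotypes) for g in genotypes}
    log_post = {}
    for g in genotypes:
        lp = math.log(prior[g])
        for base, q in zip(bases, quals):
            eps = 10 ** (-q / 10)  # probability that the base call is wrong
            # The read comes from either allele with probability 1/2.
            p = sum((1 - eps) if base == allele else eps / 3 for allele in g) / 2
            lp += math.log(p)  # product of read contributions, in log space
        log_post[g] = lp
    top = max(log_post.values())
    norm = sum(math.exp(v - top) for v in log_post.values())
    return {g: math.exp(v - top) / norm for g, v in log_post.items()}


post = genotype_posteriors("AAAAGGGG", [30] * 8)
print(max(post, key=post.get))
```

Working in log space avoids underflow when many reads cover the same locus.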


Accuracy per Coverage Bins


ReFHap

• Problem Formulation
  – Alleles for each locus are encoded with 0 and 1
  – Fragment: aligned read showing co-occurrence of two or more alleles in the same chromosome copy

Locus  1  2  3  4  5  6  7  8  9  ...
f      -  0  1  1  -  1  -  0  0  ...

J. Duitama, T. Huebsch, G. McEwen, E. Suk, and M.R. Hoehe, ReFHap: A Reliable and Fast Algorithm for Single Individual Haplotyping, Proc. 1st ACM Intl. Conf. on Bioinformatics and Computational Biology, pp. 160-169, 2010

Problem Formulation

• Input: Matrix M of m fragments covering n loci

Locus  1  2  3  4  5  ...  n
f1     1  1  0  -  1       -
f2     -  0  1  0  0       1
f3     -  0  0  0  1       -
...
fm     -  -  -  -  1       0
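To make the fragment matrix concrete, the sketch below encodes the four rows shown above as strings (treating them as a toy instance with six loci, which is an assumption) and scores a candidate haplotype with the minimum error correction (MEC) count, a common objective in single individual haplotyping. MEC is used here only for illustration; it is not a description of the ReFHap algorithm itself.

```python
def mismatches(fragment, haplotype):
    """Count loci where the fragment calls an allele that differs from the haplotype."""
    return sum(1 for f, h in zip(fragment, haplotype) if f != "-" and f != h)


def complement(haplotype):
    """The other chromosome copy: every allele flipped."""
    return "".join("1" if allele == "0" else "0" for allele in haplotype)


def mec(fragments, haplotype):
    """Minimum error correction score: each fragment is assigned to the closer copy."""
    other = complement(haplotype)
    return sum(min(mismatches(f, haplotype), mismatches(f, other)) for f in fragments)


# Rows f1, f2, f3, fm from the matrix above; "-" marks an uncovered locus.
fragments = ["110-1-", "-01001", "-0001-", "----10"]
print(mec(fragments, "110010"))
```

A haplotype with MEC 0 explains every fragment without any allele call errors; larger values mean more calls must be corrected.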


ReFHap vs HapCUT


IsoEM: Isoform Expression Level Estimation

• Expectation-Maximization algorithm
• Unified probabilistic model incorporating
  – Single and/or paired reads
  – Fragment length distribution
  – Strand information
  – Base quality scores
  – Repeat and hexamer bias correction
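The expectation-maximization idea can be sketched in a few lines. This is a bare-bones illustration, not the IsoEM implementation: it assumes the read-isoform compatibility weights are already computed, estimates only the fraction of reads originating from each isoform, and leaves out isoform length normalization and the bias corrections listed above.

```python
def em_isoform_fractions(read_weights, n_isoforms, n_iter=100):
    """Estimate the fraction of reads coming from each isoform.

    read_weights: one dict per read, mapping isoform index -> compatibility weight.
    """
    fractions = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: split each read among compatible isoforms.
        for weights in read_weights:
            total = sum(fractions[i] * w for i, w in weights.items())
            if total == 0:
                continue  # read compatible with no isoform
            for i, w in weights.items():
                counts[i] += fractions[i] * w / total
        # M-step: re-estimate fractions from the expected read counts.
        assigned = sum(counts)
        fractions = [c / assigned for c in counts]
    return fractions


# Toy data (invented): 6 reads fit only isoform 0, 2 fit only isoform 1,
# and 2 are equally compatible with both.
reads = [{0: 1.0}] * 6 + [{1: 1.0}] * 2 + [{0: 1.0, 1: 1.0}] * 2
print(em_isoform_fractions(reads, 2))
```

The ambiguous reads end up shared in proportion to the current estimates, so the result is driven by all reads rather than only the uniquely assignable ones.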

Read-isoform compatibility

$w_{r,i} = \sum_{a} O_a Q_a F_a$

Fragment length distribution

• Paired reads

[Figure: a read pair aligned to isoforms A B C and A C; the implied fragment lengths i and j have probabilities F_a(i) and F_a(j) under the fragment length distribution.]
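The role of the fragment length distribution for paired reads can be illustrated numerically. The sketch assumes a normal fragment length distribution and invented exon lengths; both are assumptions for illustration, as are the function names.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution, used here as the assumed F."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def fragment_length(isoform, exon_lengths, first, last):
    """Length of the fragment implied by a read pair on a given isoform.

    isoform: exon names in transcript order, e.g. "ABC".
    first, last: (exon name, 0-based offset in exon) of the fragment's end bases.
    """
    offsets, pos = {}, 0
    for exon in isoform:
        offsets[exon] = pos  # transcript coordinate where the exon starts
        pos += exon_lengths[exon]
    start = offsets[first[0]] + first[1]
    end = offsets[last[0]] + last[1]
    return end - start + 1


exon_lengths = {"A": 100, "B": 200, "C": 100}  # toy values
i = fragment_length("ABC", exon_lengths, ("A", 50), ("C", 49))
j = fragment_length("AC", exon_lengths, ("A", 50), ("C", 49))
print(i, j, normal_pdf(i, 300, 30), normal_pdf(j, 300, 30))
```

The same read pair implies a longer fragment on the isoform that includes exon B, so the two alignments receive different fragment length probabilities even though the read sequences match both isoforms.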

IsoEM vs. Cufflinks 1.0.3 on ION reads

[Figure: bar chart of R² for IsoEM/Cufflinks estimates vs. qPCR; bars: IsoEM HBR, Cufflinks HBR, IsoEM UHR, Cufflinks UHR; y-axis from 0 to 0.8.]

Simplified Pipeline for ASIE in F1 Hybrids

[Figure: pipeline diagram. Inputs: Reference Transcriptome, Parental Genome Sequences, Short Reads. Steps: Generate Isoform Sequences → Diploid Transcriptome → Align to Diploid Transcriptome (Allele Specific Read Mapping) → IsoEM → Allele Specific Expression Levels. The illustration uses a chrX sequence with isoforms ABC and AC, each shown twice.]
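The "Generate Isoform Sequences" step for an F1 hybrid can be sketched as substituting each parent's variants into the reference isoform sequences. This is an illustration under simplifying assumptions: variants are SNVs only and are given in 0-based transcript coordinates per isoform; the function names and the `_H0`/`_H1` naming are hypothetical.

```python
def apply_snvs(reference, snvs):
    """Return a copy of the sequence with SNVs applied.

    snvs: dict mapping 0-based position -> (reference base, alternative base).
    """
    seq = list(reference)
    for pos, (ref, alt) in snvs.items():
        if seq[pos] != ref:
            raise ValueError(f"reference mismatch at position {pos}")
        seq[pos] = alt
    return "".join(seq)


def diploid_transcriptome(isoforms, h0_snvs, h1_snvs):
    """Build two haplotype-specific copies of every isoform sequence."""
    diploid = {}
    for name, seq in isoforms.items():
        diploid[name + "_H0"] = apply_snvs(seq, h0_snvs.get(name, {}))
        diploid[name + "_H1"] = apply_snvs(seq, h1_snvs.get(name, {}))
    return diploid


# Toy input (invented): one isoform, one SNV carried by haplotype H0 only.
isoforms = {"iso1": "ACGTACGT"}
print(diploid_transcriptome(isoforms, {"iso1": {2: ("G", "T")}}, {}))
```

Reads are then aligned to both copies of each isoform, so neither allele is favored by the reference sequence.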

Whole Brain RNA-Seq Data - Sanger Institute Mouse Genomes Project

Strain/Hybrid   Number of read pairs   Number of mapped read pairs   Percentage of mapped pairs
C57BL           57,187,342             21,756,070                    38.044
BALBc           62,465,347             28,358,653                    45.399
A/J             46,993,887             22,449,227                    47.771
CAST            54,569,423             22,307,194                    40.879
SPRET           57,411,555             19,016,949                    33.124
C57BLxBALBc     114,374,684            47,682,108                    41.689
C57BLxAJ        93,987,774             35,353,398                    37.615
C57BLxCAST      109,138,846            43,134,951                    39.523
C57BLxSPRET     114,374,684            40,780,806                    35.655

Strain   SNPs         Private SNPs
C57BL    9,844        1,488
BALBc    3,920,925    29,973
A/J      4,198,324    44,837
CAST     17,673,726   5,368,019
SPRET    35,441,735   23,455,525

Correlation between FPKM values, for each strain, inferred from the separate strain RNA-Seq reads vs. the pooled reads of the two strains (synthetic hybrid)

Pearson correlation:

Hybrid (C57BLxStrain)   C57BL IE   Strain IE   C57BL GE   Strain GE
C57BLxSPRET             0.952      0.726       0.951      0.725
C57BLxBALBc             0.705      0.675       0.706      0.675
C57BLxAJ                0.855      0.902       0.856      0.903
C57BLxCAST              0.872      0.824       0.924      0.882
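For reference, the Pearson correlation reported in this comparison can be computed as below. The FPKM vectors are invented toy values, not data from the study, and the function name is illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy FPKM values for the same isoforms, estimated from the separate strain
# reads and from the pooled (synthetic hybrid) reads.
separate = [12.0, 0.5, 33.1, 7.4, 101.2]
pooled = [10.8, 0.9, 30.5, 8.1, 96.7]
r = pearson(separate, pooled)
print(r, r ** 2)
```

Squaring the coefficient gives the R² value of the corresponding linear fit.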


Allele Specific Isoform Expression for Synthetic Hybrid C57BLxAJ

Correlation between FPKM values, for each strain, inferred from the separate strain RNA-Seq reads vs. the pooled reads of the two strains (synthetic hybrid)

R² = 0.73    R² = 0.81


Allele Specific Isoform Expression for Synthetic Hybrid C57BLxCAST

Correlation between FPKM values, for each strain, inferred from the separate strain RNA-Seq reads vs. the pooled reads of the two strains (synthetic hybrid)

R² = 0.76    R² = 0.68

Allele Specific Expression on Drosophila RNA-Seq data from [McManus et al. 2010]

[Figure: two log-scale scatter plots. D.Mel. vs. D.Mel. in Parental Pool, R² = 0.892; D.Sec. vs. D.Sec. in Parental Pool, R² = 0.933.]


Allele Specific Expression for Mouse RNA-Seq Data from [Gregg et al. 2010]

Conclusion

• Proposed novel RNA-Seq analysis pipeline
  – Reconstructs diploid transcriptome
  – Not affected by mapping bias towards reference allele
  – Estimation of allele specific expression levels of isoforms
  – Robust estimation based on all reads

What’s Next?

• Test whole pipeline
• Use read coverage information at SNVs along with max cut sizes in ReFHap to phase isolated SNPs
• Incorporate flowgram data, when available, in SNV detection
• Deploy on Galaxy
• Develop ASIE plugin for ION Torrent

Acknowledgments

• Ion Măndoiu (UConn)
• Jorge Duitama (KU Leuven)
• Marius Nicolae (UConn)
• Alex Zelikovsky (GSU)
• Serghei Mangul (GSU)
• Adrian Caciula (GSU)
• Dumitru Brinza (Life Tech)
• Pramod Srivastava (UCHC)