Inference of Allele Specific Isoform Expression (ASIE) Levels from RNA-Seq Data
Sahar Al Seesi and Ion Măndoiu
Computer Science and Engineering
CANGS 2012
Outline
• Problem definition
• Challenges and limitations of current approaches
• ASIE pipeline
  – SNVQ
  – RefHap
  – Diploid IsoEM
• Results
Gene/Isoform Expression Estimation
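[Figure: workflow. Make cDNA & shatter into fragments, sequence fragment ends, then map reads to the segments labeled A B C D E. Gene Expression (GE) is estimated per gene; Isoform Expression (IE) is estimated per isoform (A B C, A C, D E).]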
Allele Specific Gene/Isoform Expression Estimation
[Figure: the same workflow (make cDNA & shatter into fragments, sequence fragment ends, map reads to segments A B C D E), with reads assigned to haplotypes H0 and H1. Outputs: Allele Specific Gene Expression (GE) and Allele Specific Isoform Expression (IE).]
Challenges and limitations of current approaches
• Need for diploid transcriptome
• Existing studies rely on simple allele coverage analysis at heterozygous SNP sites
  – Not isoform specific
  – Read mapping is biased towards the reference allele
  – Uses less information, giving less robust estimates
Pipeline for ASIE from RNA-Seq Reads
Hybrid Approach Based on Merging Alignments
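[Figure: mRNA reads are mapped both to the transcript library (transcript mapping) and to the genome (genome mapping). The transcript mapped reads and the genome mapped reads then go through read merging, which outputs the final mapped reads.]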
Merging Rules for Short Reads

Genome       Transcripts   Agree?   Hard Merge
Unique       Unique        Yes      Keep
Unique       Unique        No       Throw
Unique       Multiple      No       Throw
Unique       Not mapped    No       Keep
Multiple     Unique        No       Throw
Multiple     Multiple      No       Throw
Multiple     Not mapped    No       Throw
Not mapped   Unique        No       Keep
Not mapped   Multiple      No       Throw
Not mapped   Not mapped    Yes      Throw
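The merging rules in the table can be condensed into a small decision function. The following is a minimal sketch, not code from the actual pipeline; the names `Hits` and `hard_merge_keep` are hypothetical and introduced only for illustration.

```python
from enum import Enum


class Hits(Enum):
    """Mapping status of a read against one reference (genome or transcripts)."""
    UNIQUE = "unique"
    MULTIPLE = "multiple"
    NOT_MAPPED = "not mapped"


def hard_merge_keep(genome: Hits, transcripts: Hits, agree: bool) -> bool:
    """Return True if the read is kept under the Hard Merge column of the table."""
    # Unique on both sides: keep only when the two alignments agree.
    if genome is Hits.UNIQUE and transcripts is Hits.UNIQUE:
        return agree
    # Unique on one side and unmapped on the other: keep.
    if {genome, transcripts} == {Hits.UNIQUE, Hits.NOT_MAPPED}:
        return True
    # Every remaining combination involves a multi-mapped or fully unmapped read.
    return False
```

Encoding the table as two positive rules plus a default "throw" keeps the function short and makes the conservative nature of the hard merge explicit: a read survives only when exactly one unambiguous location is supported.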
Merging Local Alignments of ION Reads: HardMerge at Base-Level
• Input: SAM files with alignments from genome and transcriptome mapping
• The following alignments are filtered out
  – Any local alignment of length <= 15 bases
  – All alignments of a read that has alignments on different chromosomes or different strands
• Key idea: a read base mapped to multiple locations is discarded
• Output alignments are generated from contiguous stretches of non-ambiguously mapped bases, based on the unique genomic location of these bases
  – Subject to the above filtering criteria
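The base-level idea can be illustrated with a simplified sketch. It assumes each local alignment of a read is an ungapped block `(read_start, genome_start, length)` already converted to genome coordinates, and that the chromosome/strand check has been done beforehand. The function name `hard_merge_bases` and the block representation are assumptions made for this example, not the tool's actual data structures.

```python
from collections import defaultdict

MIN_LEN = 16  # local alignments of length <= 15 bases are filtered out


def hard_merge_bases(blocks, min_len=MIN_LEN):
    """Merge ungapped local alignments of one read at base level.

    blocks: list of (read_start, genome_start, length), all on the same
    chromosome and strand (assumed to be checked by the caller).
    Returns a list of merged (read_start, genome_start, length) blocks.
    """
    # Drop short local alignments.
    blocks = [b for b in blocks if b[2] >= min_len]

    # Record every genomic position each read base is mapped to.
    positions = defaultdict(set)
    for read_start, genome_start, length in blocks:
        for k in range(length):
            positions[read_start + k].add(genome_start + k)

    # Discard read bases mapped to more than one genomic location.
    unique = sorted((r, next(iter(g))) for r, g in positions.items() if len(g) == 1)

    # Group the remaining bases into stretches contiguous in read and genome.
    merged = []
    for r, g in unique:
        last = merged[-1] if merged else None
        if last and last[0] + last[2] == r and last[1] + last[2] == g:
            last[2] += 1
        else:
            merged.append([r, g, 1])

    # Output stretches are subject to the same length filter.
    return [tuple(m) for m in merged if m[2] >= min_len]
```

For example, `hard_merge_bases([(0, 1000, 40), (20, 1020, 30)])` merges two consistent, overlapping alignments into one 50-base block, while `hard_merge_bases([(0, 1000, 40), (20, 5000, 30)])` discards the 20 conflicting bases and keeps only the first 20.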
HardMerge Example
[Figure: three panels. Input alignments in genome coordinates; filtering of multiple local alignments/sub-alignments; output alignment.]
SNV Detection and Genotyping
• A reliable hybrid mapping strategy
• Bayesian model for SNV detection based on quality scores
J. Duitama, P.K. Srivastava, and I.I. Mandoiu, Towards Accurate Detection and Genotyping of Expressed Variants from Whole Transcriptome Sequencing Data, BMC Genomics 13(Suppl 2):S6, 2012
SNVQ Model
• Calculate conditional probabilities by multiplying contributions of individual reads
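To make "multiplying contributions of individual reads" concrete, here is a generic sketch of quality-aware Bayesian genotyping at a single site. It is not the exact SNVQ model: the uniform genotype prior, the equal split of the error probability over the three other bases, and the function name `genotype_posteriors` are all assumptions made for illustration.

```python
import math
from itertools import combinations_with_replacement


def genotype_posteriors(observations, priors=None):
    """Posterior over diploid genotypes at one site.

    observations: list of (base, phred_quality) pairs, one per covering read;
    qualities are assumed to be > 0.
    priors: optional dict genotype -> prior (uniform if omitted; an assumption).
    """
    genotypes = ["".join(g) for g in combinations_with_replacement("ACGT", 2)]
    if priors is None:
        priors = {g: 1.0 / len(genotypes) for g in genotypes}

    log_post = {}
    for g in genotypes:
        lp = math.log(priors[g])
        for base, q in observations:
            err = 10 ** (-q / 10)  # Phred quality -> error probability
            # Probability of observing `base` if the read came from each allele.
            per_allele = [(1 - err) if base == a else err / 3 for a in g]
            # A read samples either allele with probability 1/2; each read
            # contributes one factor (added here in log space).
            lp += math.log(0.5 * per_allele[0] + 0.5 * per_allele[1])
        log_post[g] = lp

    # Normalize in a numerically stable way.
    m = max(log_post.values())
    total = sum(math.exp(v - m) for v in log_post.values())
    return {g: math.exp(v - m) / total for g, v in log_post.items()}
```

With five high-quality A reads and five high-quality G reads, the heterozygous genotype `AG` receives almost all of the posterior mass; with ten A reads, `AA` does.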
Accuracy per Coverage Bins
Pipeline for ASIE from RNA-Seq Reads
ReFHap
• Problem Formulation
  – Alleles for each locus are encoded with 0 and 1
  – Fragment: aligned read showing co-occurrence of two or more alleles in the same chromosome copy
Locus 1 2 3 4 5 6 7 8 9 ...
f - 0 1 1 - 1 - 0 0 ...
J. Duitama, T. Huebsch, G. McEwen, E. Suk, and M.R. Hoehe, ReFHap: A Reliable and Fast Algorithm for Single Individual Haplotyping, Proc. 1st ACM Intl. Conf. on Bioinformatics and Computational Biology, pp. 160-169, 2010
Problem Formulation
• Input: Matrix M of m fragments covering n loci
Locus 1 2 3 4 5 ... n
f1 1 1 0 - 1 -
f2 - 0 1 0 0 1
f3 - 0 0 0 1 -
...
fm - - - - 1 0
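As an illustration of how a fragment matrix can be used, the sketch below compares two fragments over the loci they both cover. Fragments that mostly agree plausibly come from the same chromosome copy, and fragments that mostly disagree from opposite copies. This only illustrates the input encoding and is not the ReFHap algorithm itself; the function name `compare_fragments` and the toy rows (the first five loci of f1, f2, f3) are chosen for this example.

```python
def compare_fragments(f, g):
    """Return (agreements, disagreements) over loci covered by both fragments.

    Fragments are strings over '0', '1' and '-', where '-' marks a locus
    that the fragment does not cover.
    """
    agree = disagree = 0
    for a, b in zip(f, g):
        if a == "-" or b == "-":
            continue  # locus not covered by both fragments
        if a == b:
            agree += 1
        else:
            disagree += 1
    return agree, disagree


# Toy rows: first five loci of f1, f2 and f3.
f1, f2, f3 = "110-1", "-0100", "-0001"
pair_scores = {
    ("f1", "f2"): compare_fragments(f1, f2),
    ("f1", "f3"): compare_fragments(f1, f3),
    ("f2", "f3"): compare_fragments(f2, f3),
}
```

Here f1 and f2 disagree at all three shared loci, which is the pattern expected of fragments sampled from different haplotypes.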
ReFHap vs HapCUT
Pipeline for ASIE from RNA-Seq Reads
IsoEM: Isoform Expression Level Estimation
• Expectation-Maximization algorithm
• Unified probabilistic model incorporating
  – Single and/or paired reads
  – Fragment length distribution
  – Strand information
  – Base quality scores
  – Repeat and hexamer bias correction
Read-isoform compatibility $w_{r,i}$, summed over the alignments $a$ of read $r$ to isoform $i$:

$$w_{r,i} = \sum_{a} O_a \, Q_a \, F_a$$
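A minimal sketch of how an Expectation-Maximization loop can turn read-isoform compatibility weights into isoform frequencies is given below. It is a generic EM for a read-assignment mixture model, not the IsoEM implementation: it omits the bias corrections listed above and any normalization a full tool would apply. The names `em_isoform_frequencies` and `weights` are hypothetical.

```python
def em_isoform_frequencies(weights, n_isoforms, n_iter=200):
    """Estimate the fraction of reads originating from each isoform.

    weights: list with one dict per read, mapping isoform index -> compatibility
    weight of that read with that isoform (isoforms absent from the dict have
    weight 0). Assumes at least one read has a positive weight.
    """
    freqs = [1.0 / n_isoforms] * n_isoforms  # uniform starting point
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: fractionally assign each read to its compatible isoforms.
        for w in weights:
            total = sum(freqs[i] * w_ri for i, w_ri in w.items())
            if total == 0:
                continue  # read compatible with no isoform under current estimate
            for i, w_ri in w.items():
                counts[i] += freqs[i] * w_ri / total
        # M-step: new frequencies are the normalized expected read counts.
        s = sum(counts)
        freqs = [c / s for c in counts]
    return freqs


# Toy data: 6 reads only fit isoform 0, 2 only fit isoform 1, 2 fit both equally.
toy_weights = [{0: 1.0}] * 6 + [{1: 1.0}] * 2 + [{0: 1.0, 1: 1.0}] * 2
toy_freqs = em_isoform_frequencies(toy_weights, n_isoforms=2)
```

On the toy data the ambiguous reads are split in proportion to the current estimates, and the loop converges to frequencies of 0.75 and 0.25.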
Fragment length distribution
• Paired reads
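[Figure: a paired read aligned to isoforms A B C and A C implies different fragment lengths i and j, with probabilities Fa(i) and Fa(j) read off the fragment length distribution.]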
IsoEM vs. Cufflinks 1.0.3 on ION reads
[Figure: bar chart of R² for IsoEM/Cufflinks estimates vs qPCR. Bars: IsoEM HBR, Cufflinks HBR, IsoEM UHR, Cufflinks UHR. Y-axis ticks from 0 to 0.8.]
Simplified Pipeline for ASIE in F1 Hybrids
[Figure: pipeline flow. Parental Genome Sequences and the Reference Transcriptome feed Generate Isoform Sequences, which yields the Diploid Transcriptome (two copies each of isoforms ABC and AC). Short Reads go through Align to Diploid Transcriptome (Allele Specific Read Mapping), then IsoEM, which outputs Allele Specific Expression Levels. The slide illustrates each step with an example chrX sequence and FASTA-style short reads.]
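The step that turns a reference isoform into two allele-specific copies can be sketched as follows. This is an illustrative simplification that handles only phased single-nucleotide variants given in isoform coordinates; the function name `diploid_isoforms`, the `_H0`/`_H1` naming scheme, and the input format are assumptions, not the pipeline's actual interface.

```python
def diploid_isoforms(name, sequence, phased_snvs):
    """Build the two haplotype-specific copies of one isoform sequence.

    phased_snvs: list of (offset, allele_h0, allele_h1), where offset is a
    0-based position within the isoform sequence (assumed already converted
    from genome coordinates).
    Returns a dict mapping record name -> sequence.
    """
    h0, h1 = list(sequence), list(sequence)
    for pos, allele_h0, allele_h1 in phased_snvs:
        h0[pos] = allele_h0
        h1[pos] = allele_h1
    return {f"{name}_H0": "".join(h0), f"{name}_H1": "".join(h1)}


# Toy example: one heterozygous site at offset 2 of a six-base isoform.
records = diploid_isoforms("ABC", "GAATTC", [(2, "A", "G")])
```

Because reads are then aligned against both copies, a read carrying the alternative allele matches its own haplotype exactly instead of being penalized against the reference.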
Strain/Hybrid   Number of read pairs   Number of mapped read pairs   Percentage of mapped pairs
C57BL           57,187,342             21,756,070                    38.044
BALBc           62,465,347             28,358,653                    45.399
A/J             46,993,887             22,449,227                    47.771
CAST            54,569,423             22,307,194                    40.879
SPRET           57,411,555             19,016,949                    33.124
C57BLxBALBc     114,374,684            47,682,108                    41.689
C57BLxAJ        93,987,774             35,353,398                    37.615
C57BLxCAST      109,138,846            43,134,951                    39.523
C57BLxSPRET     114,374,684            40,780,806                    35.655
Whole Brain RNA-Seq Data - Sanger Institute Mouse Genomes Project
Strain   SNPs         Private SNPs
C57BL    9,844        1,488
BALBc    3,920,925    29,973
A/J      4,198,324    44,837
CAST     17,673,726   5,368,019
SPRET    35,441,735   23,455,525
Correlation between FPKM values, for each strain, inferred from the separate strain RNA-Seq reads vs. the pooled reads of the two strains (synthetic hybrid)
Hybrid (C57BLxStrain)   C57BL IE (Pearson)   Strain IE (Pearson)   C57BL GE (Pearson)   Strain GE (Pearson)
C57BLxBALBc             0.705                0.675                 0.706                0.675
C57BLxAJ                0.855                0.902                 0.856                0.903
C57BLxCAST              0.872                0.824                 0.924                0.882
C57BLxSPRET             0.952                0.726                 0.951                0.725
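For reference, the Pearson correlation reported in the table can be computed from two vectors of FPKM values as in the sketch below. The function name `pearson` is chosen for this example; whether the original analysis used raw or log-transformed FPKM values is not stated on the slide, so no transformation is applied here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumes both sequences have nonzero variance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

In this setting `xs` would hold the per-isoform (or per-gene) FPKM values estimated from one strain's own reads, and `ys` the values estimated for the same strain from the pooled synthetic-hybrid reads.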
Allele Specific Isoform Expression for Synthetic Hybrid C57BLxAJ
Correlation between FPKM values, for each strain, inferred from the separate strain RNA-Seq reads vs. the pooled reads of the two strains (synthetic hybrid)
R² = 0.73    R² = 0.81
Allele Specific Isoform Expression for Synthetic Hybrid C57BLxCAST
Correlation between FPKM values, for each strain, inferred from the separate strain RNA-Seq reads vs. the pooled reads of the two strains (synthetic hybrid)
R² = 0.76    R² = 0.68
Allele Specific Expression on Drosophila RNA-Seq data from [McManus et al. 10]
[Figure: two log-scale scatter plots. Left: D.Mel. (x-axis) vs. D.Mel. in Parental Pool (y-axis), R² = 0.892. Right: D.Sec. (x-axis) vs. D.Sec. in Parental Pool (y-axis), R² = 0.933.]
Allele Specific Expression for Mouse RNA-Seq Data from [Gregg et al. 2010]
Conclusion
• Proposed novel RNA-Seq analysis pipeline
  – Reconstructs diploid transcriptome
  – Not affected by mapping bias towards reference allele
  – Estimation of allele specific expression levels of isoforms
  – Robust estimation based on all reads
What’s Next?
• Test whole pipeline
• Use read coverage information for SNVs, along with max cut sizes in RefHap, to phase isolated SNPs
• Incorporate flowgram data, when available, in SNV detection
• Deploy on Galaxy
• Develop ASIE plugin for ION Torrent
Acknowledgments
• Ion Mandoiu (UConn)
• Jorge Duitama (KU Leuven)
• Marius Nicolae (UConn)
• Alex Zelikovsky (GSU)
• Serghei Mangul (GSU)
• Adrian Caciula (GSU)
• Dumitru Brinza (Life Tech)
• Pramod Srivastava (UCHC)