Prediction of > 3000 novel human microRNAs … Martin Reczko ICS/IMBB Bioinformatics Program Biomedical Informatics Lab Institute for Computer Science – FORTH


Post on 17-Dec-2015


Page 1

Prediction of > 3000 novel human microRNAs …

Martin Reczko
ICS/IMBB Bioinformatics Program
Biomedical Informatics Lab
Institute for Computer Science – FORTH

Page 2

ID    #miRNAs  name
-------------------------------------------
aga     42     A. gambiae (MOZ2)
ame     26     A. mellifera (AMEL2.0)
ath    117     A. thaliana (RefSeq entries)
cbr     82     C. briggsae (cb25.agp8)
cel    115     C. elegans (WormBase WS140)
cfa      6     C. familiaris (BROADD1)
dme     78     D. melanogaster (BDGP4)
dps     73     D. pseudoobscura (DPSE2.0)
dre    293     D. rerio (WTSI Zv5)
fru    130     F. rubripes (FUGU2.0)
gga    122     G. gallus (WASHUC1)
hsa    325     H. sapiens (NCBI35)
mmu    255     M. musculus (NCBIM34)
osa    123     O. sativa (TIGR 3.0)
ptr     67     P. troglodytes (CHIMP1)
rno    189     R. norvegicus (RGSC3.4)
tni    131     T. nigroviridis (TETRAODON7)
zma     95     Z. mays (TIGR AZM4)
ebv      5     Epstein Barr virus (EMBL:V01555.1)
hcmv     8     Human cytomegalovirus (Refseq:NC_001347.2)
kshv    11     Kaposi sarcoma associated herpesvirus (EMBL:U75698.1)
mghv     9     Mouse gammaherpesvirus 68 (EMBL:U97553.1)

microrna.sanger.ac.uk

Rfam/miRBase 7.1 (October 2005)

used 227 from miRBase 6.0

Page 3

Negative examples: 3’ UTRs

~9 Mbases, from http://www.ensembl.org/BioMart/

Page 4

Conservation: MultiZ alignments

11111111111111111111111111111111111111110111111111111111111101111111111111111110111111111111111111111111  0
11111011111111111111111111111111111111010111111111111111111111111111110111110110111111111111111111111111  1
11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111  2
11111101111111111111111111111111111111111101101011111111111111111111111111111110111111101111111111111111  3
11101001101101111111111111111111111110011111011011111111011011111111001111011100111111101111111111111111  4
11100001101101111111111111111111111110010100001011111111011001111111000111010000111111101111111111111111  5

Conservation rules: number of 1’s above >= 120, and at least one stretch of 12 consecutive 1’s
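
The two conservation rules can be read as a simple test on the 0/1 matrix. The sketch below is one possible reading, not the author's code: it assumes that "# 1’s above" means the total number of 1's over all alignment rows, and that the stretch of 12 consecutive 1's may occur in any single row. The function name and its parameters are hypothetical.

```python
def passes_conservation(rows, min_ones=120, min_run=12):
    """Check the two conservation rules on a list of 0/1 strings.

    Assumptions (not stated on the slide): the count of 1's is taken
    over all rows together, and the run of consecutive 1's may sit in
    any one row.
    """
    total_ones = sum(row.count("1") for row in rows)
    has_run = any("1" * min_run in row for row in rows)
    return total_ones >= min_ones and has_run


# Two toy rows of window length 104: 124 ones in total, with a long run.
print(passes_conservation(["1" * 104, "1" * 20 + "0" * 84]))
```

A matrix with many isolated 1's but no run of 12 would fail the second rule even if it passes the first.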

Page 5

Genome wide prediction pipeline

Process windows of 104 nt along genome:

1. Fast filtering using composition and palindromes

2. Comparative analysis with other genomes (BLASTZ)

3. Approximate secondary structure prediction (stem-loop) using a novel dynamic programming algorithm.

4. Feature extraction and classification (SVMs)

5. Filter conserved secondary structures
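
Cutting the genome into 104-nt windows, the step that feeds all five stages above, might look like the following minimal sketch. The step size of 1 nt and the letter 'N' for unknown bases are assumptions; the slides do not say how far consecutive windows are shifted.

```python
def windows(sequence, size=104, step=1):
    """Yield (start, window) pairs of fixed size along a sequence.

    Windows containing an unknown base ('N', an assumed encoding) are
    skipped. step=1 is an assumption, not a value from the slides.
    """
    for start in range(0, len(sequence) - size + 1, step):
        window = sequence[start:start + size]
        if "N" in window:
            continue
        yield start, window


# Toy "genome": only the first and last window avoid the single N.
genome = "A" * 104 + "N" + "C" * 104
print([start for start, _ in windows(genome)])
```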

Page 6

’Fast’ rules:

- No window containing an unknown base
- No windows with complete repeat regions
  gain: 40% reduction in analyzed size, 100% -> 98.4% sensitivity
  (lost: hsa-mir-151, hsa-mir-370, hsa-mir-422a, hsa-mir-513-1, hsa-mir-513-2)

- Single nt composition, both strands:
    max A 43%     min 9%
    max C 38%     min 10.6%
    max G 45%     min 11%
    max T 40%     min 9.3%

- Single nt composition, single strands:
    max A 37.5%   min 9%
    max C 38%     min 10.6%
    max G 43.8%   min 12.5%
    max T 40%     min 12.7%
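
As an illustration of how cheap such a composition rule is, here is a minimal sketch that applies the single-strand bounds for A, C, G and T to one window. The function and constant names are hypothetical, and the "both strands" bounds are left out because the slide does not say how the two strands are combined.

```python
# Single-strand bounds in percent, copied from the slide: base -> (min, max).
SINGLE_STRAND_BOUNDS = {
    "A": (9.0, 37.5),
    "C": (10.6, 38.0),
    "G": (12.5, 43.8),
    "T": (12.7, 40.0),
}


def passes_composition(window, bounds=SINGLE_STRAND_BOUNDS):
    """Return True if every base frequency lies inside its bounds."""
    length = len(window)
    for base, (low, high) in bounds.items():
        percent = 100.0 * window.count(base) / length
        if percent < low or percent > high:
            return False
    return True


print(passes_composition("ACGT" * 26))   # balanced 104-nt window
print(passes_composition("A" * 104))     # homopolymer window
```

Each check is a single pass over the window, which is why rules of this kind can run before the more expensive structure prediction.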

Page 7

More ’fast’ rules:

- Double nt composition, single strands:
    max AA 15.4%   min 0%
    max AC 10.7%   min 0%
    max AG 14.2%   min 1%
    max AT 16.1%   min 0%
    max CA 14.7%   min 0%
    max CC 18.3%   min 0%
    max CG 15.8%   min 0%
    max CT 16.4%   min 1.3%
    max GA 11.9%   min 0%
    max GC 17.6%   min 0%
    max GG 19.3%   min 1%
    max GT 13.4%   min 1.4%
    max TA 15.7%   min 0%
    max TC 15.6%   min 1.1%
    max TG 18.8%   min 2.9%
    max TT 25.8%   min 0%

Page 8

>= 4nt palindrome rule:

Hash table with 4^4 = 256 entries:

Hash-key     occurred at position   rev.comp
---------------------------------------
000  AAAA     3                     255
001  AAAC     0                     254
002  AAAG     0                     253
003  AAAU     4                     252
004  AACA     0                     251
005  ...
...
254  UUUG     0                     001
255  UUUU    60                     000
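
The table suggests a base-4 encoding of each 4-mer (A=0, C=1, G=2, U=3), in which the partner entry is simply 255 minus the key. The sketch below is a hedged reconstruction of such a lookup, not the original implementation: the function names, the use of -1 for "not seen", and the minimum loop length are assumptions. It reports whether a window contains a 4-nt inverted repeat, that is, a 4-mer followed further downstream by its reverse complement.

```python
CODE = {"A": 0, "C": 1, "G": 2, "U": 3, "T": 3}


def kmer_key(kmer):
    """Encode a k-mer as a base-4 number, e.g. AAAC -> 1, UUUU -> 255."""
    key = 0
    for base in kmer:
        key = key * 4 + CODE[base]
    return key


def complement_key(key, k=4):
    """Key of the base-wise complement; for k=4 this is 255 - key."""
    return 4 ** k - 1 - key


def has_inverted_repeat(window, k=4, min_loop=3):
    """True if some k-mer is followed downstream by its reverse complement.

    Assumes an upper-case window without unknown bases; min_loop (the
    gap required between the two arms) is an assumed parameter.
    """
    first_pos = [-1] * 4 ** k            # first 0-based position per key
    for i in range(len(window) - k + 1):
        key = kmer_key(window[i:i + k])
        if first_pos[key] < 0:
            first_pos[key] = i
    for j in range(len(window) - k + 1):
        # Key of the reverse complement of the k-mer starting at j.
        partner = complement_key(kmer_key(window[j:j + k][::-1]), k)
        p = first_pos[partner]
        if p >= 0 and p + k + min_loop <= j:
            return True
    return False


print(has_inverted_repeat("GGGGAAACCCC"))   # GGGG ... CCCC can pair
```

Because the table has a fixed size of 256 entries, building and querying it costs time linear in the window length.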

Page 9

microRNA computational prediction pipeline

2 851 352 871 bases

Inverted repeats, composition

RNA secondary structure prediction

Energy + structural features

Cross-species conservation

SVM

SS-conservation

Novel microRNAs: Microarray verification

Page 10

Prediction features

predicted secondary structure

comparative analysis

1. Stem_Length
2. GC_Content
3. Stem_BPs
4. maxLinHelix
5. MatureCons
6. MatureOppositeCons
7. ArmCons
8. SS_Energy
9. MatureBPs
10. MatureEnergyProfile

=> 10 features for SVM classification

Page 11

Histogram for feature: stem length

Page 12

Histogram for feature: GC content

Page 13

Histogram for feature: #base pairs in stem

Page 14

Feature: longest ‘linear’ helix

maxlinhelix = 18 nt

maxlinhelix = 26 nt

Page 15

Histogram for feature: longest ‘linear’ helix

Page 16

Features related to mature region

window of 23 nt

Sliding 0 to 15 nt from loop

Calculate ‘mature’ feature at all positions and keep prediction with highest score
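
The sliding-window search described on this slide can be sketched as follows. This is an illustration under stated assumptions, not the original code: the arm is given as a string ordered from the loop outwards, the default score counts conserved positions marked '1' (as in the conservation matrix), and ties keep the offset closest to the loop.

```python
def best_mature_window(arm, score=lambda seg: seg.count("1"),
                       window=23, max_offset=15):
    """Slide a 23-nt window 0..15 nt away from the loop; keep the best.

    arm   : per-position annotation of one stem arm, loop end first
            (an assumed layout).
    score : function mapping a window to a number; hypothetical default
            counts conserved positions.
    Returns (offset, score) of the highest-scoring window, or None if
    the arm is shorter than one window.
    """
    best = None
    for offset in range(max_offset + 1):
        segment = arm[offset:offset + window]
        if len(segment) < window:
            break
        value = score(segment)
        if best is None or value > best[1]:
            best = (offset, value)
    return best


# Toy arm: a fully conserved 23-nt block starting 5 nt from the loop.
print(best_mature_window("0" * 5 + "1" * 23 + "0" * 20))
```

The same loop can serve any of the mature-region features by passing a different score function.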

Page 17

Histogram for feature: #conserved bases in mature region

Page 18

Histogram for feature: #conserved bases in mature region (on opposite strand)

Page 19

Histogram for feature: #conserved bases in both arms of the stem

Page 20

Histogram for feature: secondary structure minimal free energy

Page 21

Histogram for feature: #paired bases in mature region

Page 22

Mature region: average stacking energy

Page 23

Histogram for feature: correlation with average mature energy profile in mature region

Page 24

Learning with Support Vector Machines

Training data Test data

‘Soft-margin’ hyperplanes, cost parameter C

Page 25

Training with libsvm-2.6 package by C.-C. Chang & C.-J. Lin

http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Modification: optimize Matthews correlation, not % correct

Page 26

Importance of features with ‘knockout’ retraining:

All features:
Cross Validation Accuracy = 87.2728%

Feature ‘knockout’:
Cross Validation Accuracy = 75.4618%   ss-energy ***
Cross Validation Accuracy = 84.6784%   stem-start
Cross Validation Accuracy = 84.409%    stem-end
Cross Validation Accuracy = 85.2758%   loop-length
Cross Validation Accuracy = 82.3163%   loop-start
Cross Validation Accuracy = 82.3909%   # base-pairs
Cross Validation Accuracy = 76.4124%   GC-content **
Cross Validation Accuracy = 86.3902%   higher arm conservation
Cross Validation Accuracy = 84.97%     lower arm conservation
Cross Validation Accuracy = 85.0393%   loop conservation
Cross Validation Accuracy = 84.0942%   # GU pairs
Cross Validation Accuracy = 85.4047%   length of longest bulge
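
One way to read the knockout numbers is as the loss in cross-validation accuracy when the named feature is left out and the SVM is retrained; that reading is an assumption, since the slide only lists the accuracies. The short sketch below ranks the features by that loss, using the values from the slide. The variable names are hypothetical.

```python
BASELINE = 87.2728  # cross-validation accuracy with all features (%)

# Accuracy (%) after retraining, keyed by the feature named on the slide.
KNOCKOUT = {
    "ss-energy": 75.4618,
    "stem-start": 84.6784,
    "stem-end": 84.409,
    "loop-length": 85.2758,
    "loop-start": 82.3163,
    "# base-pairs": 82.3909,
    "GC-content": 76.4124,
    "higher arm conservation": 86.3902,
    "lower arm conservation": 84.97,
    "loop conservation": 85.0393,
    "# GU pairs": 84.0942,
    "length of longest bulge": 85.4047,
}

# Larger drop = feature mattered more under this reading.
drops = {name: BASELINE - acc for name, acc in KNOCKOUT.items()}
ranking = sorted(drops, key=drops.get, reverse=True)

for name in ranking:
    print(f"{name:26s} -{drops[name]:.2f} points")
```

The two largest drops belong to ss-energy and GC-content, which matches the entries starred on the slide.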

Page 27

Test-set results for various SVM thresholds

Q      SENS   SPEC   CORR     cp  cn     fp   fn  threshold
---------------------------------------------------------------------
99.60  96.74  28.16  +0.5208  89  56497  227   3  0.010000
99.76  95.65  39.82  +0.6163  88  56591  133   4  0.020000
99.83  95.65  48.09  +0.6776  88  56629   95   4  0.030000
99.86  95.65  54.32  +0.7203  88  56650   74   4  0.040000
99.87  95.65  55.00  +0.7248  88  56652   72   4  0.050000
99.92  95.65  67.18  +0.8012  88  56681   43   4  0.100000
99.94  95.65  75.21  +0.8479  88  56695   29   4  0.150000
99.95  95.65  78.57  +0.8667  88  56700   24   4  0.200000
99.96  95.65  82.24  +0.8868  88  56705   19   4  0.250000
99.96  95.65  83.02  +0.8909  88  56706   18   4  0.300000 ***
99.96  94.57  85.29  +0.8979  87  56709   15   5  0.350000
99.97  94.57  86.14  +0.9024  87  56710   14   5  0.400000
99.97  92.39  87.63  +0.8996  85  56712   12   7  0.450000
99.97  91.30  90.32  +0.9080  84  56715    9   8  0.500000
99.97  88.04  91.01  +0.8950  81  56716    8  11  0.550000
99.96  85.87  90.80  +0.8828  79  56716    8  13  0.600000
99.96  85.87  91.86  +0.8880  79  56717    7  13  0.650000
99.97  85.87  94.05  +0.8985  79  56719    5  13  0.700000
99.96  82.61  93.83  +0.8802  76  56719    5  16  0.750000
99.96  80.43  96.10  +0.8790  74  56721    3  18  0.800000
99.96  80.43  96.10  +0.8790  74  56721    3  18  0.849999
99.96  77.17  97.26  +0.8662  71  56722    2  21  0.899999

Page 28

< 3 weeks on ~40 AMD-242-Opterons (ICS-FORTH)

Page 29

Hg17-scan results for various SVM thresholds

precursor      #candidates
sensitivity    (incl. known miRNAs)    hit-rate
----------------------------------------------
95.1%          96699                   16 ppm
90.3%          45231                   7.6 ppm
85.9%          23025                   3.9 ppm
80.6%          14429                   2.4 ppm
75.7%           9732                   1.6 ppm
70.9%           6912                   1.2 ppm
----------------------------------------------
Total nt processed: 5976557831
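
The hit-rate column is consistent with the number of candidates per million nucleotides processed. A small check, using only numbers from the slide (the names are hypothetical):

```python
TOTAL_NT = 5976557831  # total nt processed, from the slide

CANDIDATES = [96699, 45231, 23025, 14429, 9732, 6912]


def hit_rate_ppm(n_candidates, total_nt=TOTAL_NT):
    """Candidates per million processed nucleotides."""
    return 1e6 * n_candidates / total_nt


# Two significant digits, as on the slide.
rates = [f"{hit_rate_ppm(n):.2g}" for n in CANDIDATES]
print(rates)
```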

Page 30

Secondary structure conservation:

Structure – structure comparison, from RNAfold-library:

  Null     H     B     I     M     S     E
-------------------------------------------
{    0,    2,    2,    2,    2,    1,    1}   Null
{    2,    0,    2,    2,    2,  INF,  INF}   H
{    2,    2,    0,    1,    2,  INF,  INF}   B
{    2,    2,    1,    0,    2,  INF,  INF}   I
{    2,    2,    2,    2,    0,  INF,  INF}   M
{    1,  INF,  INF,  INF,  INF,    0,  INF}   S
{    1,  INF,  INF,  INF,  INF,  INF,    0}   E

'H' hairpin loop
'I' interior loop
'B' bulge
'M' multi-loop
'S' stack
'E' external elements
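
To make the cost table concrete, the sketch below stores it as a lookup and adds a deliberately simplified position-by-position comparison of two element strings of equal length. The real comparison in the RNAfold library is a tree edit distance with insertions and deletions (the Null row and column); the helper aligned_cost is a hypothetical simplification for illustration only.

```python
INF = float("inf")
LABELS = ["Null", "H", "B", "I", "M", "S", "E"]

# Edit costs as listed on the slide; rows and columns follow LABELS.
COST = [
    [0,   2,   2,   2,   2,   1,   1],
    [2,   0,   2,   2,   2, INF, INF],
    [2,   2,   0,   1,   2, INF, INF],
    [2,   2,   1,   0,   2, INF, INF],
    [2,   2,   2,   2,   0, INF, INF],
    [1, INF, INF, INF, INF,   0, INF],
    [1, INF, INF, INF, INF, INF,   0],
]


def edit_cost(a, b):
    """Cost of replacing element a by b ('Null' = insert or delete)."""
    return COST[LABELS.index(a)][LABELS.index(b)]


def aligned_cost(xs, ys):
    """Sum of replacement costs for two equally long element lists.

    Simplification: no insertions or deletions, unlike a tree edit
    distance.
    """
    if len(xs) != len(ys):
        raise ValueError("element lists must have equal length")
    return sum(edit_cost(a, b) for a, b in zip(xs, ys))


print(edit_cost("B", "I"))                            # bulge vs interior loop
print(aligned_cost(["S", "B", "H"], ["S", "I", "H"]))
```

The INF entries forbid turning a stack or external element into any loop type, so two structures can only be matched cheaply when their stems line up.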

Page 31

Secondary structure conservation vs. SVM scores

Page 32

Estimate for the number of true miRNAs:

Q: 99.96  SENS: 85.87  SPEC: 91.86  CORR: +0.8880  cp 79  cn 56717  fp 7  fn 13  th 0.67

spec = cp/(cp+fp) = cp/nhits => (expected cp) = spec*nhits = 0.9186*7664 = 7040

Probe design for experimental verification (RNA-RNA chip):

- 2 probes with 60 nt for each candidate
- end of 5' probes reach 75% into the hairpin loop
- 3' probes start after 50% of the hairpin loop
- sensitivity detecting mature miRNA: 86%
- Chip in preparation at UoToronto

All predictions are available!
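
The probe layout on this slide (two 60-nt probes per candidate, the 5' probe ending 75% of the way into the hairpin loop and the 3' probe starting halfway through it) can be turned into coordinates as in the sketch below. The 0-based half-open coordinates, the integer rounding and the function name are assumptions; the slide gives only the percentages and the probe length.

```python
def design_probes(seq_len, loop_start, loop_end, probe_len=60):
    """Return ((5' start, 5' end), (3' start, 3' end)) for one candidate.

    Coordinates are 0-based and half-open (assumed convention). The loop
    spans [loop_start, loop_end). Probes are clipped at the sequence
    ends, so they can come out shorter than probe_len near an edge.
    """
    loop_len = loop_end - loop_start
    five_end = loop_start + (3 * loop_len) // 4    # 75% into the loop
    three_start = loop_start + loop_len // 2       # after 50% of the loop
    five = (max(0, five_end - probe_len), five_end)
    three = (three_start, min(seq_len, three_start + probe_len))
    return five, three


# Candidate of 150 nt with an 8-nt loop at positions 70..77.
print(design_probes(seq_len=150, loop_start=70, loop_end=78))
```

With this layout the two probes overlap inside the loop, so each one covers a full arm of the stem plus part of the loop.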

Page 33

Just the tip of an iceberg

- tiling window expression analysis of mouse: 30% of the genome is transcribed!

- mRNA genes are 3% of the truth…

Page 34

Acknowledgments:

Artemis Hatzigeorgiou, Praveen Sethupathy, Molly Megraw, Karol Szafranski
Center for Bioinformatics, School of Medicine, University of Pennsylvania

Yannis Tollis
Panayiota Poïrazi
Anastasis Oulas
Alkiviadis Simeonidis

Angelos Bilas, Michalis Flouris
Advanced Computing Systems, Computer Architecture and VLSI Systems Lab, ICS-FORTH