Justin Choo, Tin Wee Tan, Shoba Ranganathan: A comprehensive assessment of N-terminal signal peptide prediction methods

Upload: florence-horton

Post on 18-Jan-2018

1 Justin Choo, Tin Wee Tan, Shoba Ranganathan
A comprehensive assessment of N-terminal signal peptide prediction methods

2 Targeting of secretory proteins
Secretory proteins reportedly represent 30% of an organism's proteome (Skach, 2007) and include functionally diverse classes of molecules such as cytokines, chemokines, hormones, digestive enzymes, antibodies, extracellular proteinases, morphogens and toxins.
Short peptides called signal peptides (SPs) direct the majority of these proteins to the secretory pathway (Gierasch, 1989; Rapoport, 1992).
SPs function as address labels / postal codes.

3 Features of SP: tripartite regions
Length of 13 to 36 aa (Molhoj and Degan, 2004)
(figure labels: P3, P1)

4 Varying length distribution (Choo and Ranganathan, 2008)

5 More than just targeting
In vitro evidence that free SPs inhibit protein translocation (Chen et al., 1987; Simon et al., 1992)
Prevent premature folding or misfolding of secretory preproteins (Weiss and Bassford, 1990; Li et al., 1996)
Affect translocation efficiency (Thornton et al., 2006) and modulate secretion
Serve as ligand for opening the translocation channel (Rutkowski et al., 2001)
Influence the regulation of proteins to their destination (Kurys et al., 2000)

6 More than just targeting (2)
Associated with risk for autoimmune diseases due to inefficient processing (Anjos et al., 2002)
Post-targeting functions: immune surveillance of healthy cells (Lemberg et al., 2001)
Signaling function: fragments found bound to MHC complexes on the cell surface (O'Callaghan et al., 1998) and to cytosolic calmodulin (Martoglio et al., 1997)
Mutations or minor alterations are implicated in a host of diseases and complications, e.g.
neurohypophyseal diabetes insipidus (Rittig et al., 2002) and classic Ehlers-Danlos syndrome (a connective tissue disease) (Symoens et al., 2008)

7 Challenge
SPs are cleaved off by type I signal peptidase (SPase I)

8 (Choo and Ranganathan, 2008)
(figure labels: P1, P3)

9 Existing methods
Philius (Reynolds et al., 2008): Bayesian networks
Phobius (Käll et al., 2004): HMM
PrediSi (Hiller et al., 2004): position weight matrix (PWM)
RPSP (Plewczynski et al., 2008): ANN
SigCleave (Rice et al., 2000): PWM
SigHMM (Zhang and Wood, 2003): profile HMM
SignalP: HMM (Nielsen and Krogh, 1998); ANN (Nielsen et al., 1997; Bendtsen et al., 2004)

10 Existing methods (2)
Signal-BLAST (Frank and Sippl, 2008): pairwise alignment using BLAST
Signal-CF (Chou and Shen, 2007), Signal-3L (Shen and Chou, 2007): KNN + subsite coupling
SIG-Pred (Bradford, 2001): PWM
SOSUIsignal (Gomi et al., 2004): indices
SPEPlip (Fariselli et al., 2003): ANN + PROSITE pattern
SPOCTOPUS (Viklund et al., 2008): ANN + HMM

11 Objective
Benchmark the 13 most popular, recent methods to provide comparable results
Test with 3 datasets involving thousands of sequences

12 Omission
Popular datasets (Menne et al., 2000; Nielsen et al., 1997) excluded -> derived from earlier Swiss-Prot (Rel & Rel. 38.0) -> ours include them
Neural network-based approaches (Jagla and Schuchhardt, 2000; Reczko et al., 2002)
SVM-based approaches (Mukherjee and Mukherjee, 2002; Vert, 2002; Cai et al., 2003; Sun and Wang, 2008)
Profile HMM-based method CJ-SPHMM (Chen et al., 2003)
Matrix-based + information theory (Liu et al., 2005)
A BLOMAP-encoding scheme (Maetschke et al., 2005)
Hybrid: bio-basis function NNs + decision trees (Sidhu and Yang, 2006)
Global alignment tool (Liu et al., 2007)
Predictors of subcellular localization (e.g. iPSORT, ProteinProwler) and of N-terminus targeting signals (e.g. Predotar), which predict the presence of SPs but don't indicate cleavage sites
Specialized tools, e.g. SecretomeP, which predicts non-classical SPs, i.e.
signal sequences that remain uncleaved, and TargetP, since it uses SignalP for SP prediction
SPEPlip -> unavailable

13 General filtering criteria
a) Annotation hinting at uncertainty or lacking experimental verification (e.g. probable, missing, by similarity, inferred, potential, putative and conflict)
b) Lipoprotein cleaved by SPase II (PROKAR_LIPOPROTEIN under the DR field)
c) Fragment sequence
d) Organellar protein (under OG field)
e) Mollicutes, a division of bacteria that lack a cell wall (under OC field)
f) Bacteria without any classification (e.g. [Swiss-Prot: SAT_RIFPS])
g) Sequences with ambiguous characters or non-standard amino acid codes (e.g. X, Z, U etc.) (e.g. [Swiss-Prot: KV3A6_MOUSE])
h) Duplicates, redundancy reduction

14 Benchmark datasets
(table columns: +ve, -ve)

15 Evaluation

16 Aggregated results from all experiments
1st: SignalP, most accurate; ANN slightly > HMM
2nd: Rapid Prediction of Signal Peptides (RPSP)

17 Experiment #
Euk (chart labels: %, >80% acc)

18 Experiment #2
Euk, GN, GP; 4,704 sequences

19 Experiment #3
Euk, GN, GP; 456 sequences

20 Discussion
Non-linear features may be involved in the recognition of the cleavage site (Ladunga, 2000) -> better accuracy by ML techniques?
Alignment-based approaches (e.g.
Signal-BLAST and SigHMM) are highly dependent on the balance between sensitivity and specificity, and are not suitable for detecting sequences sharing weak homology
For most tools, results on Euk datasets >> Bac datasets
Larger set -> better model

21 Discussion (2)
Most tools easily distinguish secretory from non-secretory proteins; studies (Nielsen et al., 1998) involving discrimination between signal anchors and SPs lead to similar conclusions
More than 1/3 of the putatively assigned cleavage sites were reported to be inaccurate (Zhang and Henzel, 2004)
SignalP leads for all 3 organism groups across the 3 experiments:
consistency for both ANN and HMM versions
more complex models and robustness of its method
various specific scoring schemes to tackle different aspects (including SP-likeness, the probability of a segment containing the cleavage site and so on)
relatively wider sequence window (Euk: [-11,+2]; Gneg: [-21,+2]; Gpos: [-15,+2])

22 Discussion (3)
The majority of the tools clearly require active learning or regular updates to their underlying models to reflect the latest data distribution
Canonical Ala-X-Ala motif (von Heijne, 1986): the basis for the postulation of the (-3,-1) rule
The rule states: P1 must be a small residue (Ala, Ser, Gly, Cys, Thr or Gln), while aromatic (Phe, His, Tyr, Trp), charged (Asp, Glu, Lys, Arg) and large polar (Asn, Gln) residues are prohibited at P3.
Further, Pro must be absent from P3 to P1.
Gram+ (61.9%) and Gram- (77.5%) observed in our data
P3 and P1 are known to be critical recognition sites for SPase I (Karla et al., 2005)

23 Conclusion
Alternative approach: two-step prediction vs. one
Larger window frame
More data needed to evaluate further

24 Benchmark datasets
Dataset #1:
+ve: 270 secreted recombinant human proteins taken from http://share.gene.com/cleavagesite/index.html
-ve: original study omits specificity test; 270 human non-secretory proteins from (Zhang & Henzel, 2004) -> SigHMM
Dataset #2:
+ve: 2349 SPdb5.1 (Choo et al., 2005), filtered from Swiss-Prot 55.0 and used in the construction of the majority of prediction methods
-ve: a mix of cytoplasmic and nuclear (Euk only) proteins
Dataset #3:
+ve: Swiss-Prot 57.0, excluding entries present in #1 and #2
50% of Euk instances and > 90% of Bac sequences are putative SPs with high probability

25 Detailed results
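
Illustration of the (-3,-1) rule from Discussion (3). The following is a minimal Python sketch, not any benchmarked tool's method. It encodes the residue classes exactly as quoted on the slide and reuses the 13 to 36 aa length range from slide 3 as the scan range. The function names and the convention that `cleavage_index` is the 0-based position of the first mature-protein residue are assumptions made for this example, and the test sequence is invented.

```python
# Residue classes as quoted on the slide (one-letter amino acid codes).
SMALL_P1 = set("ASGCTQ")                 # Ala, Ser, Gly, Cys, Thr, Gln: allowed at P1
FORBIDDEN_P3 = set("FHYW") | set("DEKR") | set("NQ")  # aromatic, charged, large polar


def satisfies_rule(sequence: str, cleavage_index: int) -> bool:
    """Check the (-3,-1) rule for a cleavage between cleavage_index-1 and cleavage_index.

    cleavage_index is the 0-based index of the first residue after the
    signal peptide (an assumption of this sketch, not of the source).
    """
    sequence = sequence.upper()
    if cleavage_index < 3 or cleavage_index > len(sequence):
        return False
    p3, p2, p1 = sequence[cleavage_index - 3:cleavage_index]
    if "P" in (p3, p2, p1):              # Pro must be absent from P3 to P1
        return False
    return p1 in SMALL_P1 and p3 not in FORBIDDEN_P3


def candidate_sites(sequence: str, min_len: int = 13, max_len: int = 36) -> list:
    """Return every SP length in [min_len, max_len] whose P3/P1 pass the rule."""
    upper = min(max_len, len(sequence))
    return [i for i in range(min_len, upper + 1) if satisfies_rule(sequence, i)]


if __name__ == "__main__":
    toy = "MKTLLLALLAVSAQADEV"           # invented sequence, for illustration only
    print(candidate_sites(toy))
```

On the toy sequence several neighbouring positions pass, which suggests the rule works better as a filter on candidate sites than as a stand-alone predictor.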
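
Illustration of the general filtering criteria (slide 13). The sketch below covers only criteria a), c), g) and h), and only in simplified form. The entry layout (a dict with `annotation`, `sequence` and `is_fragment` keys) is hypothetical and does not correspond to any Swiss-Prot parser; criterion h) is reduced here to exact-duplicate removal, which is weaker than a real redundancy reduction step.

```python
# Uncertainty terms listed under criterion a) on the slide.
UNCERTAIN_TERMS = ("probable", "missing", "by similarity", "inferred",
                   "potential", "putative", "conflict")
# The 20 standard amino acids; anything else (X, Z, U, ...) fails criterion g).
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def keep_entry(entry: dict) -> bool:
    """Apply simplified versions of criteria a), c) and g) to one entry."""
    annotation = entry.get("annotation", "").lower()
    if any(term in annotation for term in UNCERTAIN_TERMS):   # a) uncertain annotation
        return False
    if entry.get("is_fragment", False):                       # c) fragment sequence
        return False
    if not set(entry["sequence"].upper()) <= STANDARD_RESIDUES:  # g) non-standard codes
        return False
    return True


def filter_entries(entries: list) -> list:
    """Keep entries passing keep_entry, dropping exact duplicate sequences (h)."""
    seen = set()
    kept = []
    for entry in entries:
        seq = entry["sequence"].upper()
        if keep_entry(entry) and seq not in seen:
            seen.add(seq)
            kept.append(entry)
    return kept
```

Criteria b), d), e) and f) depend on the DR, OG and OC fields of the database record and are left out, since the slide does not say how they were evaluated.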