PRINTS
A protein family database with a difference
Terri Attwood
Faculty of Life Sciences & School of Computer Science
University of Manchester, Oxford Road
Manchester M13 9PT, UK
http://www.bioinf.manchester.ac.uk/dbbrowser/

Understanding the difference
Preface
04/21/23 2
• Pattern-recognition tools come in different shapes & sizes
  – the databases they underpin consequently differ
[GSTALIVMFYWC]-[GSTANCPDE]-{EDPKRH}-x-{PQ}-[LIVMNQGA]-{RK}-{RK}-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R-[FYWCSH]-{PE}-x-[LIVM]
• The challenge is to understand the consequences of those differences
• But it isn’t just the underlying methods that differ
  – the database search tools differ
  – & the results of using those tools differ
  – the annotation philosophies of the databases also differ
• How, then, do we know if the family we see in different databases is the same or different?
  – how do we know if the differences are meaningful?
  – the smallest things can be highly significant
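
To make the regular-expression pattern on this slide concrete, here is a minimal sketch of how a PROSITE-style pattern can be translated into an ordinary Python regular expression. The function name `prosite_to_regex` and the two test sequences are illustrative assumptions, not part of any PROSITE tool, and the sketch ignores anchors (`<`, `>`).

```python
import re

# Pattern as shown on the slide (stray space removed).
SLIDE_PATTERN = (
    "[GSTALIVMFYWC]-[GSTANCPDE]-{EDPKRH}-x-{PQ}-[LIVMNQGA]-{RK}-{RK}-[LIVMFT]-"
    "[GSTANC]-[LIVMFYWSTAC]-[DENH]-R-[FYWCSH]-{PE}-x-[LIVM]"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pieces = []
    for element in pattern.replace(" ", "").rstrip(".").split("-"):
        # Split off an optional repeat count such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                # x = any residue
            piece = "."
        elif core.startswith("{"):     # {ABC} = any residue except A, B or C
            piece = "[^" + core[1:-1] + "]"
        else:                          # [ABC] or a single residue: unchanged
            piece = core
        if low:
            piece += "{" + low + ("," + high if high else "") + "}"
        pieces.append(piece)
    return "".join(pieces)


regex = re.compile(prosite_to_regex(SLIDE_PATTERN))

# Hypothetical 17-residue test sequences, invented for this example.
matching = "GSAAALAALGLDRYAAL"
broken = "GSAAALAALGLDKYAAL"   # the invariant R replaced by K
print(bool(regex.search(matching)), bool(regex.search(broken)))
```

A single substitution at the invariant R is enough to lose the match, which is one way the smallest things can be highly significant.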

Overview
• Setting the scene
  – health warnings
• Methods of family analysis
  – where fingerprints fit in
• Some examples
  – & cautionary tales
• Conclusions
• Epilogue

Health warning 1
Remember the biology
• Proteins exhibit rich evolutionary relationships & complex molecular interactions
  – so they present significant challenges for in silico analysis
• Problems arise when we lose sight of the underlying biology
We are using biology-unaware search tools to analyse such complex systems…
In trying to understand molecular function, we must be realistic about what we can achieve using such naïve approaches…

Health warning 2
Remember the limitations of the search methods
• Pairwise search methods (BLAST, FastA...) & catch-all family-based methods (profiles, HMMs…) ‘see’ generic similarity
• These methods do not see the often-subtle differences that constitute the functional determinants between closely-related families
• But identifying similarity between sequences is not the same as identifying their functions
  – in trying to derive functional insights, it is therefore imperative to recognise the limitations of the methods used
• What you see depends on how you look!

Aims of family analysis
Identifying patterns
• Given a set of sequences, we usually want to know
  – what are these proteins; to what family do they belong?
  – what is the function; how can we explain this in structural terms?
• We try to answer these questions by seeking patterns that will allow us to infer relationships with previously-characterised sequences
• We do this in 3 main ways…

Methods of family analysis
• Single motif methods
  – Regular expressions (PROSITE)
• Full domain alignment methods
  – Profiles (Profile Library)
  – HMMs (Pfam)
• Multiple motif methods
  – Identity matrices (PRINTS)

Challenge of family analysis
patterns of conservation change
• Highly divergent family with single function?
• Superfamily with many diverse functional families?
  – must distinguish if we are to diagnose function reliably
  – but this is not always straightforward

Where fingerprints fit in
• Fingerprints are sets of motifs that characterise families
  – taken together, the motifs create diagnostic signatures
• Offer improved diagnostic reliability over single motifs by virtue of the biological context provided by motif neighbours
[Figure labels: order; interval]
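
The role of motif order and interval can be sketched in a few lines. This is only an illustration under stated assumptions: PRINTS fingerprints are scored with identity matrices, whereas this sketch swaps in plain regular expressions as stand-in motifs. The function name `match_fingerprint`, the `max_gap` parameter and the test sequences are hypothetical; the two motifs are borrowed from labels that appear later in the slides (WPD, HCX5R).

```python
import re


def match_fingerprint(sequence, motifs, max_gap):
    """Return the start of each motif if all occur in order, each beginning
    no more than max_gap residues after the previous motif ends; else None."""

    def extend(index, earliest, latest):
        if index == len(motifs):
            return []
        pattern = re.compile(motifs[index])
        pos = earliest
        while True:
            hit = pattern.search(sequence, pos)
            if hit is None or (latest is not None and hit.start() > latest):
                return None
            rest = extend(index + 1, hit.end(), hit.end() + max_gap)
            if rest is not None:
                return [hit.start()] + rest
            pos = hit.start() + 1   # backtrack: try a later occurrence

    return extend(0, 0, None)


MOTIFS = ["WPD", "HC.{5}R"]              # regex stand-ins for two motifs
in_order = "AAAWPDAAAAAHCAAAAARAA"       # invented sequence, motifs in order
reversed_order = "HCAAAAARAAAWPD"        # invented sequence, motifs swapped

print(match_fingerprint(in_order, MOTIFS, max_gap=10))        # [3, 11]
print(match_fingerprint(in_order, MOTIFS, max_gap=2))         # None: interval too long
print(match_fingerprint(reversed_order, MOTIFS, max_gap=10))  # None: wrong order
```

Both motifs are present in all three cases; only the first satisfies the order and interval constraints, which is the extra context a single motif cannot supply.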

Single-motif search
Need convincing? …it’s actually common sense!

Two-motif search

Three-motif search
From 406 hits with 1 motif, we converged to 1 hit with 3 motifs – so, adding motifs improves diagnostic reliability
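
The same convergence can be reproduced on a toy scale. Everything in this sketch is invented for illustration: the five sequences, the third motif ("GKS") and the function name `hits_with_first_k`. It does not reproduce the 406-hit search on the slide, only the principle that each added motif can only shrink the hit list.

```python
import re

# Invented toy database; only seq1 carries all three motifs in order.
TOY_DB = {
    "seq1": "MKWPDLLHCAAAAARTTGKSV",
    "seq2": "MKWPDLLHCAAAAARTT",
    "seq3": "AAWPDAA",
    "seq4": "GGWPDGGGKS",
    "seq5": "MKLLV",
}
MOTIFS = ["WPD", "HC.{5}R", "GKS"]   # regex stand-ins; "GKS" is hypothetical


def hits_with_first_k(database, motifs, k):
    """Names of sequences containing the first k motifs, in order."""
    pattern = re.compile(".*".join(motifs[:k]))
    return [name for name, seq in database.items() if pattern.search(seq)]


counts = [len(hits_with_first_k(TOY_DB, MOTIFS, k)) for k in (1, 2, 3)]
print(counts)   # [4, 2, 1]
```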

Visualising fingerprints
the significance of motif context
[Figure labels: N; C]

Creating fingerprints
[Figure labels: signature; database; annotation!; UniProt; PRINTS]

Families are hierarchical
& hence so are fingerprints
[Figure labels: loop region; TM domain; TM domain]
Fingerprints allow us to focus on differences as well as similarities

Differences yield functional insights
[Figure labels: K1–K8; PTP1–PTP6; WPD; HCX5R; A, B, C, D]

Perspectives from InterPro
highlighting similarities & differences
Similarities are informative
They give insights into shared high-level functions
Differences are informative
They give insights into unique functional specificities
The more differences, the more you learn about the tool’s functional niche
Protein families are just the same!

Examples & cautionary tales
• Similarity searches have been the mainstay of functional annotation efforts
  – because they allow us to recognise similarities with things we’ve seen before
    • & allow us to transfer characteristics of known to unknown proteins
• Results of in silico searches need to be considered carefully
  – let’s take a closer look…
-opioid receptor -opioid receptor -opioid receptor true
Q23293_CAEEL Putative uncharacterized protein

When is a GPCR not an SSR?
Query length: 389 AA
Date run: 2002-10-18 09:08:29 UTC+0100 on sib-blast.unil.ch
Taxon: Homo sapiens
Database: XXswissprot
120,412 sequences; 45,523,583 total letters
SWISS-PROT Release 40.29 of 10-Oct-2002
Db AC     Description                                               Score E-value
sp Q9UKP6 Q9UKP6 Orphan receptor [Homo sapiens...                     782 0.0
sp P31391 SSR4_HUMAN Somatostatin receptor type 4 (SS4R) [SSTR4]...   167 3e-41
sp O43603 GALS_HUMAN Galanin receptor type 2 (GAL2-R) (GALR2) [G...   147 4e-35
sp P30872 SSR1_HUMAN Somatostatin receptor type 1 (SS1R) (SRIF-2...   144 3e-34
sp P32745 SSR3_HUMAN Somatostatin receptor type 3 (SS3R) (SSR-28...   140 3e-33
sp P35346 SSR5_HUMAN Somatostatin receptor type 5 (SS5R) (SSTR5)...   140 6e-33
sp P30874 SPLICE ISOFORM B of P30874 [SSTR2] [Homo sapiens...         134 3e-31
sp P30874 SSR2_HUMAN Somatostatin receptor type 2 (SS2R) (SRIF-1...   134 3e-31
sp P48145 GPR7_HUMAN Neuropeptides B/W receptor type 1 (G protei...   133 7e-31
sp O60755 GALT_HUMAN Galanin receptor type 3 (GAL3-R) (GALR3) [G...   132 2e-30
sp P41143 OPRD_HUMAN Delta-type opioid receptor (DOR-1) [OPRD1] ...   128 2e-29
sp P35372 SPLICE ISOFORM 1A of P35372 [OPRM1] [Homo sapien...         125 1e-28
sp P35372 OPRM_HUMAN Mu-type opioid receptor (MOR-1) [OPRM1] [Ho...   125 1e-28

When is a GPCR not an SSR?
…when it’s a UR2R
Query length: 389 AA
Date run: 2002-10-18 09:08:29 UTC+0100 on sib-blast.unil.ch
Taxon: Homo sapiens
Database: XXswissprot
120,412 sequences; 45,523,583 total letters
SWISS-PROT Release 40.29 of 10-Oct-2002
Db AC     Description                                               Score E-value
sp Q9UKP6 UR2R_HUMAN Urotensin II receptor (UR-II-R) [GPR14] [Ho...   782 0.0
sp P31391 SSR4_HUMAN Somatostatin receptor type 4 (SS4R) [SSTR4]...   167 3e-41
sp O43603 GALS_HUMAN Galanin receptor type 2 (GAL2-R) (GALR2) [G...   147 4e-35
sp P30872 SSR1_HUMAN Somatostatin receptor type 1 (SS1R) (SRIF-2...   144 3e-34
sp P32745 SSR3_HUMAN Somatostatin receptor type 3 (SS3R) (SSR-28...   140 3e-33
sp P35346 SSR5_HUMAN Somatostatin receptor type 5 (SS5R) (SSTR5)...   140 6e-33
sp P30874 SPLICE ISOFORM B of P30874 [SSTR2] [Homo sapiens...         134 3e-31
sp P30874 SSR2_HUMAN Somatostatin receptor type 2 (SS2R) (SRIF-1...   134 3e-31
sp P48145 GPR7_HUMAN Neuropeptides B/W receptor type 1 (G protei...   133 7e-31
sp O60755 GALT_HUMAN Galanin receptor type 3 (GAL3-R) (GALR3) [G...   132 2e-30
sp P41143 OPRD_HUMAN Delta-type opioid receptor (DOR-1) [OPRD1] ...   128 2e-29
sp P35372 SPLICE ISOFORM 1A of P35372 [OPRM1] [Homo sapien...         125 1e-28
sp P35372 OPRM_HUMAN Mu-type opioid receptor (MOR-1) [OPRM1] [Ho...   125 1e-28
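
A short sketch of why the hit list above is a trap for naive annotation transfer. The entry names, scores and E-values are transcribed from the slide (the self-hit and the two splice-isoform rows are left out); grouping hits by the first three letters of the entry name is a crude heuristic assumed purely for this illustration, as are the function name and the E-value cut-off.

```python
from collections import Counter

# (entry name, score, E-value) for the non-self hits listed on the slide.
HITS = [
    ("SSR4_HUMAN", 167, 3e-41),
    ("GALS_HUMAN", 147, 4e-35),
    ("SSR1_HUMAN", 144, 3e-34),
    ("SSR3_HUMAN", 140, 3e-33),
    ("SSR5_HUMAN", 140, 6e-33),
    ("SSR2_HUMAN", 134, 3e-31),
    ("GPR7_HUMAN", 133, 7e-31),
    ("GALT_HUMAN", 132, 2e-30),
    ("OPRD_HUMAN", 128, 2e-29),
    ("OPRM_HUMAN", 125, 1e-28),
]


def majority_family(hits, max_evalue=1e-20):
    """Most common entry-name prefix among hits at or below the cut-off."""
    significant = [name for name, _score, evalue in hits if evalue <= max_evalue]
    counts = Counter(name[:3] for name in significant)
    return counts.most_common(1)[0]


print(majority_family(HITS))   # ('SSR', 5)
```

A majority vote over these highly significant hits says "somatostatin receptor", yet the query is the urotensin II receptor: every hit is a rhodopsin-like GPCR, but none of them shares the query's ligand specificity.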

UR2R_HUMAN vs GPCRRHODOPSN
Perspectives from other resources

ID   Q6NV75 PRELIMINARY; PRT; 609 AA.
AC   Q6NV75;
DT   05-JUL-2004 (TrEMBLrel. 27, Created)
DT   05-JUL-2004 (TrEMBLrel. 27, Last sequence update)
DT   05-JUL-2004 (TrEMBLrel. 27, Last annotation update)
DE   G protein-coupled receptor 153.
GN   Name=GPR153;
OS   Homo sapiens (Human).
OX   NCBI_TaxID=9606
RN   [1]
RP   SEQUENCE FROM N.A.
RC   TISSUE=Brain;
RA   Strausberg R.L., Feingold E.A., Grouse L.H., Derge J.G.,
RA   Jones S.J., Marra M.A.;
RT   "Generation and initial analysis of more than 15,000 full-length
RT   human and mouse cDNA sequences.";
RL   Proc. Natl. Acad. Sci. U.S.A. 99:16899-16903(2002).
RP   SEQUENCE FROM N.A.
RC   TISSUE=Brain;
RA   Strausberg R.;
RL   Submitted (MAR-2004) to the EMBL/GenBank/DDBJ databases.
DR   EMBL; BC068275; AAH68275.1; -.
DR   GO; GO:0004872
DR   InterPro; IPR000276; GPCR_Rhodpsn.
DR   Pfam; PF00001; 7tm_1; 1.
DR   PROSITE; PS50262; G_PROTEIN_RECEP_F1_2; 1.
KW   Receptor
SQ   SEQUENCE 609 AA; 65341 MW; E525CC7F60D0891C CRC64;
     MSDERRLPGS AVGWLVCGGL SLLANAWGIL SVGAKQKKWK PLEFLLCTLA ATHMLNVAVP
     IATYSVVQLR RQRPDFEWNE GLCKVFVSTF YTLTLATCFS VTSLSYHRMW MVCWPVNYRL
     SNAKKQAVHT VMGIWMVSFI LSALPAVGWH DTSERFYTHG CRFIVAEIGL GFGVCFLLLV
     GGSVAMGVIC TAIALFQTLA VQVGRQADHR AFTVPTIVVE DAQGKRRSSI DGSEPAKTSL
     QTTGLVTTIV FIYDCLMGFP VLVVSFSSLR ADASAPWMAL CVLWCSVAQA LLLPVFLWAC
     DRYRADLKAV REKCMALMAN DEESDDETSL EGGISPDLVL ERSLDYGYGG DFVALDRMAK
     YEISALEGGL PQLYPLRPLQ EDKMQYLQVP PTRRFSHDDA DVWAAVPLPA FLPRWGSGED
     LAALAHLVLP AGPERRRASL LAFAEDAPPS RARRRSAESL LSLRPSALDS GPRGARDSPP
     GSPRRRPGPG PRSASASLLP DAFALTAFEC EPQALRRPPG PFPAAPAAPD GADPGEAPTP
     PSSAQRSPGP RPSAHSHAGS LRPGLSASWG EPGGLRAAGG GGSTSSFLSS PSESSGYATL
     HSDSLGSAS
//
Pfam match Q6NV75/24-297
GPCR?
PROSITE (profile) no match
PROSITE (regex) no match
PRINTS no match
ClustalW – sequences too divergent to be aligned
false negative
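
For readers who want to check which signature databases an entry like this one cross-references, a minimal sketch follows. It reads only the DR lines quoted on the slide; the function name `cross_references` and the simple split-on-semicolon parsing are assumptions for this example rather than an official UniProt parser.

```python
# DR lines as they appear in the Q6NV75 entry shown on the slide.
ENTRY_DR_LINES = """\
DR   EMBL; BC068275; AAH68275.1; -.
DR   GO; GO:0004872
DR   InterPro; IPR000276; GPCR_Rhodpsn.
DR   Pfam; PF00001; 7tm_1; 1.
DR   PROSITE; PS50262; G_PROTEIN_RECEP_F1_2; 1.
"""


def cross_references(flat_text):
    """Map each cross-referenced database to its primary identifiers."""
    refs = {}
    for line in flat_text.splitlines():
        if not line.startswith("DR"):
            continue
        fields = [f.strip().rstrip(".") for f in line[2:].split(";")]
        fields = [f for f in fields if f]
        database, identifiers = fields[0], fields[1:]
        refs.setdefault(database, []).append(identifiers[0] if identifiers else None)
    return refs


refs = cross_references(ENTRY_DR_LINES)
for database in ("Pfam", "PROSITE", "PRINTS"):
    print(database, refs.get(database, "no cross-reference"))
```

Run on these lines, the entry carries Pfam and PROSITE cross-references but none to PRINTS.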

Rhodopsin-like superfamily
GPCRs in InterPro 2005
IPR000276 GPCR_Rhodopsn 7,752 proteins
PS50262 G_PROTEIN_RECEP_F1_2 7,702 proteins
PF00001 7tm_1 7,064 proteins
PS00237 G_PROTEIN_RECEP_F1_1 6,527 proteins
PR00237 GPCRRHODOPSN 5,821 proteins (don’t include partials)

Rhodopsin-like superfamily
GPCRs in the source databases
Database           FP   FN    U    TP                  InterPro 2005
Pfam               ?    ?     ?    ? (8,776 matches)   7,064
PROSITE (profile)  3    3     12   1,837 matches       7,702
PROSITE (regex)    92   261   0    1,530 matches       6,527
PRINTS             0    ?     0    1,154 matches       5,821
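
The FP, FN and TP figures quoted for the two PROSITE signatures can be turned into the usual precision and recall values. This is a worked illustration only: it assumes the "TP" figure on the slide is the count of true positives and it ignores the "U" (unknown) hits. Pfam and PRINTS are left out because the slide marks some of their counts with "?".

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)


# Counts as quoted on the slide.
SLIDE_COUNTS = {
    "PROSITE (profile)": {"tp": 1837, "fp": 3, "fn": 3},
    "PROSITE (regex)": {"tp": 1530, "fp": 92, "fn": 261},
}

for name, counts in SLIDE_COUNTS.items():
    precision, recall = precision_recall(**counts)
    print(f"{name}: precision {precision:.3f}, recall {recall:.3f}")
# PROSITE (profile): precision 0.998, recall 0.998
# PROSITE (regex): precision 0.943, recall 0.854
```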

Rhodopsin-like superfamily
GPCRs in InterPro 2006
IPR000276 GPCR_Rhodopsn 14,206 proteins
PS50262 G_PROTEIN_RECEP_F1_2 14,108 proteins
PF00001 7tm_1 13,148 proteins
PR00237 GPCRRHODOPSN 11,357 proteins
PS00237 G_PROTEIN_RECEP_F1_1 11,109 proteins
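
One way to read the 2005 and 2006 figures side by side is to express each member signature as a fraction of the InterPro entry total. The numbers are those quoted on the two slides; using the IPR000276 count as the denominator, and the function names, are assumptions made for this sketch.

```python
# Protein counts quoted on the InterPro 2005 and 2006 slides.
INTERPRO_COUNTS = {
    2005: {"IPR000276": 7752, "PS50262": 7702, "PF00001": 7064,
           "PS00237": 6527, "PR00237": 5821},
    2006: {"IPR000276": 14206, "PS50262": 14108, "PF00001": 13148,
           "PR00237": 11357, "PS00237": 11109},
}


def coverage(counts, entry="IPR000276"):
    """Fraction of the entry's proteins matched by each member signature."""
    total = counts[entry]
    return {sig: n / total for sig, n in counts.items() if sig != entry}


def ranking(counts):
    """Member signatures ordered from highest to lowest coverage."""
    cov = coverage(counts)
    return sorted(cov, key=cov.get, reverse=True)


for year, counts in INTERPRO_COUNTS.items():
    print(year, [(sig, round(coverage(counts)[sig], 3)) for sig in ranking(counts)])
```

On these figures the PRINTS fingerprint (PR00237) moves ahead of the PROSITE regular expression (PS00237) between 2005 and 2006, although the totals alone do not say why.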

Rhodopsin-like superfamily
GPCRs in InterPro 2009
IPR000276 7TM_GPCR_Rhodopsn 24,039 proteins
PF00001 7tm_1 23,702 proteins 16,975
PR00237 GPCRRHODOPSN 20,158 proteins 6,660 (incl.partials)
PS00237 G_PROTEIN_RECEP_F1_1 15,939 proteins 1,950
PS50262 G_PROTEIN_RECEP_F1_2 ? proteins 2,390
What does it all mean? How are users supposed to know?
No human curator has time to validate all these matches…
25,248 GPCR rhodopsin-like superfamily

The annotation paradox
• Without annotation, data are meaningless
• But, there’s too much data for manual annotation to be practicable
  – it took ~600 person years, over 23 years, to annotate ~500K Swiss-Prot entries
  – but...9 million entries in TrEMBL, 163 million in EMBL?!
• So manual annotation is clearly impossible, but is nevertheless a necessary evil
• Like PROSITE, therefore, fingerprints are manually annotated prior to inclusion in PRINTS
  – & hence, like PROSITE, the database has remained small
  – let’s briefly take a closer look…
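
The scale of the paradox is easy to work through. This back-of-envelope sketch uses the approximate figures from the slide and assumes, purely for illustration, that the effort per entry stays constant; the function name is hypothetical.

```python
PERSON_YEARS = 600            # ~600 person years (slide figure)
ENTRIES_ANNOTATED = 500_000   # ~500K Swiss-Prot entries (slide figure)


def person_years_needed(entries):
    """Effort to annotate `entries` at the Swiss-Prot rate quoted above."""
    return entries * PERSON_YEARS / ENTRIES_ANNOTATED


print(person_years_needed(9_000_000))     # TrEMBL: 10800.0 person years
print(person_years_needed(163_000_000))   # EMBL: 195600.0 person years
```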

Protein family annotation
a PRINTS view
Where do we get this information?
UniProt: Swiss-Prot
PROSITE
InterPro
PubMed/literature
Auto-annotation tools
PRECIS, METIS, BioIE…
MINOTAUR

Protein family annotation
a PROSITE view

Protein family annotation
a Pfam view

Protein family annotation
an InterPro view
Where does this information come from?

In an ideal world…
Automatic nonsense!

Conclusions
• Similarity searches have been the mainstay of functional annotation efforts
  – because they reduce a complex problem to a more tractable one
    • i.e., identifying & quantifying relationships between sequences
• But identifying similarity between sequences is not the same as identifying their functions
• Failure to appreciate this fundamental point has generated numerous annotation errors in our databases
  – & in the literature!

Conclusions
• In characterising unknown sequences, it is wise to run pairwise & family-based searches
  – top hits aren’t always the most biologically significant
  – BLAST/FastA/profiles/HMMs offer broad brush strokes
  – motif-based methods add fine detail
    • no method alone is best (they all have limitations)
    • different methods give different perspectives
• The differences revealed by these perspectives are often more important than the similarities they uncover
  – differences may shed light on unique functional determinants
• Never lose sight of the underlying biology!
Rhodopsin - rod cell, achromatic receptor
Opsin - green-sensitive cone photoreceptor
Argininosuccinate lyase - amino acid biosynthesis
Delta crystallin - non-enzymatic, structural eye-lens protein

Hands On
• Review the UniProt entry Q6NV75: http://www.uniprot.org/uniprot/Q6NV75
• Submit this to ScanProsite: http://www.expasy.ch/tools/scanprosite
• Submit this to FingerPRINTScan: http://www.bioinf.manchester.ac.uk/cgi-bin/dbbrowser/fingerPRINTScan/muppet/FPScan_fam.cgi
• Submit this to GraphScan: http://www.bioinf.manchester.ac.uk/cgi-bin/dbbrowser/fingerPRINTScan/muppet/GRAPHScan.cgi
• Access Utopia: http://utopia.cs.manchester.ac.uk/