PIR: a comprehensive resource for functional analysis of protein sequences and families
Anastasia Nikolskaya, Lai-Su Yeh
Protein Information Resource, Georgetown University Medical Center, Washington, DC
2
PIR Web Site
NEW web site, soon to become public
http://pir.georgetown.edu (currently an old version)
PIR and UniProt web sites interlinked and cross-navigable
PIR-specific features
Text Search, Sequence Search, Classification Database Search
3
iProClass Protein Knowledgebase
• Integration of protein family, function, structure
• Rich links (executive summary + hypertext links) to > 90 databases
• Value-added reports for 1.96 million UniProtKB protein entries
iProClass: Integrated Protein Knowledgebase
Linked databases, by category:
• Disease/Variation: OMIM, HapMap, …
• Ontology: GO
• Protein Sequence: UniProt, UniRef, UniParc, RefSeq, GenPept, …
• Gene/Genome: GenBank/EMBL/DDBJ, LocusLink, UniGene, MGI, TIGR, …
• Gene Expression: GEO, GXD, ArrayExpress, CleanEx, SOURCE, …
• Structure: PDB, SCOP, CATH, PDBSum, MMDB, …
• Family: PIRSF, InterPro, Pfam, Prosite, COG, …
• Interaction: DIP, BIND, …
• Taxonomy: NCBI Taxon, NEWT
• Protein Expression: Swiss-2DPAGE, PMG, …
• Literature: PubMed
• Function/Pathway: EC-IUBMB, KEGG, BioCarta, EcoCyc, WIT, …
• Modification: RESID, PhosphoBase, …
http://pir.georgetown.edu/iproclass
4
Example
Want to find information on chorismate mutases. Specifically:
Start with Bacillus subtilis P19080 = CHMU_BACSU
Relatedness to other chorismate mutases:
- Homology
- Domain architecture
- Is it related to E. coli P07022, a well-studied bifunctional enzyme (P-protein, chorismate mutase/prephenate dehydratase)?
5
iProClass Sequence Report
6
Protein Analysis: I. Text Search iProClass
What can we find about “chorismate mutase”?
7
Text Search Results (I)
UniProt ID
8
Text Search Results (II)
Display options: add or remove columns
9
Text Search Results (III)
Find chorismate mutase(s) from B. subtilis
10
Determining Protein Homology
Is B. subtilis CM P19080 homologous to E. coli P-protein P07022? To B. subtilis AroA(G) P39912?
Which domains, if any, of multidomain chorismate mutases does it correspond to?
What kinds of domain architecture exist in chorismate mutases?
11
Retrieve Proteins by UID in Batch Mode
ID mapping option: can use various non-UniProt IDs
Batch Retrieval
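Before a batch is submitted, it can help to separate UniProt accessions from identifiers that need the ID mapping option. The following is a minimal sketch, not part of the PIR site: the function name `split_batch` is hypothetical, and the pattern follows the accession format published in the UniProt help pages.

```python
import re

# UniProt accession format (6 or 10 characters), per the UniProt help pages.
ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

def split_batch(text):
    """Split a pasted ID list into UniProt accessions and other IDs.

    IDs may be separated by whitespace, commas or semicolons.
    Duplicates are dropped while the input order is preserved.
    """
    accessions, other_ids = [], []
    for token in re.split(r"[\s,;]+", text.strip()):
        if not token:
            continue
        target = accessions if ACCESSION.fullmatch(token) else other_ids
        if token not in target:
            target.append(token)
    return accessions, other_ids

# IDs taken from this presentation; CHMU_BACSU is an entry name, not an accession.
accessions, needs_mapping = split_batch("P19080, P07022\nP39912 CHMU_BACSU P19080")
print(accessions)
print(needs_mapping)
```

Anything that ends up in `needs_mapping` would go through ID mapping rather than direct retrieval.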
12
Determining Protein Homology: Sequence Search
BLAST FASTA SSearch
13
BLAST Search Results
BLAST query with UniProt sequence P19080 hits PIRSF005965 family members as best hits
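Ranking hits like this can also be done offline. The sketch below assumes the hits are available in the 12-column tabular format written by NCBI BLAST+ with `-outfmt 6`, which is not necessarily what the PIR site returns. The hit identifiers, scores and the `best_hits` helper are hypothetical illustration values, not real search results.

```python
import csv
import io

# Default column order of NCBI BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# Hypothetical hits for query P19080 (made-up identifiers and scores).
SAMPLE = (
    "P19080\tHIT_A\t62.5\t120\t45\t0\t1\t120\t1\t120\t1e-40\t150\n"
    "P19080\tHIT_C\t28.0\t50\t36\t0\t10\t59\t5\t54\t0.8\t31.0\n"
    "P19080\tHIT_B\t48.3\t118\t61\t0\t2\t119\t3\t120\t2e-22\t98.2\n"
)

def best_hits(tabular_text, max_evalue=1e-5):
    """Return (subject ID, bit score) pairs below the E-value cutoff, best first."""
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=COLUMNS, delimiter="\t")
    kept = [row for row in reader if float(row["evalue"]) <= max_evalue]
    kept.sort(key=lambda row: float(row["bitscore"]), reverse=True)
    return [(row["sseqid"], float(row["bitscore"])) for row in kept]

for subject, score in best_hits(SAMPLE):
    print(subject, score)
```

The E-value cutoff of `1e-5` is an arbitrary default chosen for the example; a curator would pick it per family.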
14
Pre-compiled Related Sequences: saves time
15
BLAST/SSEARCH Results
SSEARCH Alignment
BLAST Alignment
16
Determining Protein Homology: Peptide Search
17
Peptide Search Results
18
Family Classification System: One-Stop Platform for Protein Analysis
• Protein families reflect evolutionary relationships
• Function often follows along the family lines
• Therefore, matching a protein sequence to a protein family provides information about a protein (need a highly curated and annotated family)
• Faster and often more accurate than searching against a protein database
• Protein classification facilitates sequence and functional analysis of proteins and is used for accurate automatic annotation (PIRSF is used for UniProt annotation)
19
PIRSF Classification System
PIRSF: reflects evolutionary relationships of full-length proteins
Definitions:
• Basic unit = Homeomorphic Family
• Homologous: inferred by sequence similarity
• Homeomorphic: full-length sequence similarity and common domain architecture
• Hierarchy: flexible number of levels with varying degrees of sequence conservation
• Network structure: multiple domain parents
Advantages:
• Annotation of both generic biochemical and specific biological functions
• Accurate propagation of annotation and development of standardized protein nomenclature and ontology
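The "homeomorphic" definition combines two conditions, which can be illustrated with a toy check. This is only a sketch under stated assumptions: the `is_homeomorphic` function, the 0.8 coverage threshold and the protein lengths are hypothetical and are not PIRSF's actual membership criteria.

```python
def is_homeomorphic(aligned_len, len_a, len_b, arch_a, arch_b, min_coverage=0.8):
    """Toy test of the two 'homeomorphic' conditions.

    1. Full-length similarity: the alignment covers most of the longer protein.
    2. Common domain architecture: same domains in the same order.
    The 0.8 threshold is an assumption made for this example.
    """
    coverage = aligned_len / max(len_a, len_b)
    return coverage >= min_coverage and list(arch_a) == list(arch_b)

# Two single-domain chorismate mutases (hypothetical lengths).
print(is_homeomorphic(95, 100, 110, ["PF01817"], ["PF01817"]))

# A single-domain CM against a bifunctional CM/PDH protein: the proteins
# share the CM domain but not the full-length architecture.
print(is_homeomorphic(90, 100, 370, ["PF01817"], ["PF01817", "PF02153"]))
```

The second pair is homologous over one domain yet fails both conditions, which is why such proteins belong to different homeomorphic families that share a domain parent.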
20
Classification levels:
Domain Superfamily
• One common Pfam domain
PIRSF Superfamily
• 0 or more levels
• One or more common domains
PIRSF Homeomorphic Family
• Exactly one level
• Full-length sequence similarity and common domain architecture
PIRSF Homeomorphic Subfamily
• 0 or more levels
• Functional specialization

Examples:
PF02153: Prephenate dehydrogenase (PDH)
  PIRSF001499: Bifunctional CM/PDH (T-protein)
  PIRSF006786: PDH, feedback inhibition-insensitive
  PIRSF005547: PDH, feedback inhibition-sensitive
PF01817: Chorismate mutase (CM)
  PIRSF017318: CM of AroQ class, eukaryotic type
  PIRSF001501: CM of AroQ class, prokaryotic type
  PIRSF026640: Periplasmic CM
  PIRSF001500: Bifunctional CM/PDT (P-protein)
  PIRSF001499: Bifunctional CM/PDH (T-protein)
PF02735: Ku70/Ku80 beta-barrel domain
  PIRSF800001: Ku70/80 autoantigen
    PIRSF003033: Ku70 autoantigen
    PIRSF016570: Ku80 autoantigen
  PIRSF006493: Ku, prokaryotic type
PF00219: Insulin-like growth factor binding protein (IGFBP)
  PIRSF001969: IGFBP
    PIRSF500001: IGFBP-1
    …
    PIRSF500006: IGFBP-6
  PIRSF018239: IGFBP-related protein, MAC25 type
PIRSF Classification System
A protein may be assigned to only one homeomorphic family, which may have zero or more child nodes and zero or more parent nodes. Each homeomorphic family may have as many domain superfamily parents as its members have domains.
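These parent/child rules describe a directed acyclic graph rather than a tree. The following is a minimal sketch of such a structure, not PIR's implementation: the `Node` class and the `link` and `ancestors` helpers are hypothetical names, while the identifiers are the ones listed in the hierarchy earlier in this presentation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One node of the classification network (hypothetical structure)."""
    node_id: str
    name: str
    level: str  # "domain", "superfamily", "family" or "subfamily"
    parents: list = field(default_factory=list)

def link(child, parent):
    """Record that `parent` is one of the parents of `child`."""
    child.parents.append(parent)

def ancestors(node):
    """Return the IDs of every node reachable through parent links."""
    seen = set()
    stack = list(node.parents)
    while stack:
        current = stack.pop()
        if current.node_id not in seen:
            seen.add(current.node_id)
            stack.extend(current.parents)
    return seen

pdh = Node("PF02153", "Prephenate dehydrogenase (PDH)", "domain")
cm = Node("PF01817", "Chorismate mutase (CM)", "domain")
t_protein = Node("PIRSF001499", "Bifunctional CM/PDH (T-protein)", "family")
p_protein = Node("PIRSF001500", "Bifunctional CM/PDT (P-protein)", "family")

# The T-protein family has two domain superfamily parents, one per domain.
link(t_protein, pdh)
link(t_protein, cm)
link(p_protein, cm)

print(sorted(ancestors(t_protein)))
print(sorted(ancestors(p_protein)))
```

Storing a list of parents on each node, instead of a single parent, is what lets a bifunctional family appear under both of its domain superfamilies.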
21
PIRSF curation workflow (diagram)
Stages: Unclassified UniProtKB proteins; Uncurated Homeomorphic Clusters; Orphans; Preliminary Homeomorphic Families; Final Families, Subfamilies, Superfamilies
Operations: Automatic Clustering; Map Domains on Clusters; Merge/Split Clusters; Add/Remove Members; Name, Refs, Abstract, Domain Arch.; Hierarchies (Superfamilies/Subfamilies); Protein Name Rules/Site Rules; Build and Test HMMs; Automatic Placement of New Proteins; Unassigned Proteins
Legend: Automatic Procedure; Computer-assisted Manual Curation
Steps in the diagram are numbered 1 to 8.
23
Tool: Curator’s Decision Maker
24
Classification Tool: BlastClust
Curator-guided clustering
• Single-linkage clustering using BLAST
• Retrieve all proteins sharing a common domain
• Iterative BlastClust (fixed length coverage)
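Single-linkage clustering places two proteins in the same cluster whenever a chain of pairwise links connects them. The sketch below shows the idea with a union-find structure. It is not the BlastClust program itself: the `single_linkage` function and the protein IDs are hypothetical, and the input pairs stand for BLAST hits that already passed the curator's similarity and length-coverage thresholds.

```python
def single_linkage(ids, linked_pairs):
    """Cluster IDs so that any chain of links joins proteins into one cluster.

    `linked_pairs` must only contain IDs that are also present in `ids`.
    Returns a sorted list of sorted clusters.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent pointers to the cluster representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical proteins and qualifying BLAST hit pairs.
proteins = ["A", "B", "C", "D", "E", "F"]
hits = [("A", "B"), ("B", "C"), ("D", "E")]
print(single_linkage(proteins, hits))
```

Proteins A and C end up together although they are not directly linked, which is the characteristic behaviour of single linkage and one reason the resulting clusters need curator review. A protein with no links, like F here, stays in a cluster of its own.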
25
Family Analysis of Homologous Proteins
1. Fully Curated Protein Family:
Especially important when the protein of interest is underannotated or misannotated (happens often!)
Evidence types: Characterized (validated), Predicted (by computational methods) or Uncharacterized
2. Preliminary or Uncurated Family:
Have to do some analysis OR contact PIR and ask to prioritize this family
3. No Family Classification:
Have to do some analysis OR contact PIR and ask to prioritize this family
iProClass search: PIRSF - blank
26
Underannotated Proteins
Search iProClass with PIRSF005965
Providing more information
27
PIRSF SCAN (sequence search)
UniProt sequence Q8Y5X7 is automatically classified as chorismate mutase of the AroH class (PIRSF005965)
Returns only matches to fully curated PIRSFs
28
PIRSF Family Report: Curated Protein Family Information
Taxonomic distribution of a PIRSF can be used to infer the evolutionary history of the proteins in the PIRSF
Phylogenetic tree and alignment view allows further sequence analysis
29
PIRSF Family Report (II)
Integrated value added information from other databases
Mapping to other protein classification databases
30
Chorismate Mutase Results from iProClass Analysis
CM from B. subtilis (P19080) does not retrieve B. subtilis AroA(G) or E. coli P-protein (or related proteins) in a BLAST search:
• Contains a different Pfam domain
• Identical conserved motifs are not found
• NOT homologous
Use the PIRSF family database for the same analysis:
• PIRSF reports: abstracts contain most of this info
• PIRSF domain architecture (curated or uncurated): Pfam and newly defined domains
• Structure information (PDB links)
• Hierarchy in DAG (under development)
31
PIRSF Text Search
New domain
AroA(G)
32
Chorismate Mutase Convergent Evolution – EC 5.4.99.5 (Non-Orthologous Gene Displacement)
Two Distinct Sequence/Structure Types
• AroQ Class: SCOP (all α), core: 6 helices, bundle
• AroH Class: SCOP (α+β), core: beta-alpha-beta-alpha-beta(2)
Two Pfam Domains: PF01817, PF07736 (new Pfam domain)
Structures shown: AroQ, AroH
33
Developing DAG Viewer
Before: all chorismate mutase proteins and families hit PF01817, including PIRSF005965 (not homologous to the rest)
Subfamily
Network structure (in DAG) for the PIRSF family classification system reflects the PIRSF family hierarchy, which is based on evolutionary relationships
34
DAG Viewer (II)
After: Pfam created a new domain, PF07736, which is found in PIRSF005965 members
“Orphans”: no family classification
35
PIR Team
Dr. Cathy Wu, Director
Protein Classification team: Dr. Winona Barker, Dr. Lai-Su Yeh, Dr. Anastasia Nikolskaya, Dr. Darren Natale, Dr. Zhang-Zhi Hu, Dr. Raja Mazumder, Dr. CR Vinayaka, Dr. Sona Vasudevan, Dr. Cecilia Arighi
Informatics team: Dr. Hongzhan Huang, Dr. Peter McGarvey, Baris Suzek, M.S., Sehee Chung, M.S., Dr. Leslie Arminski, Dr. Hsing-Kuo Hua, Yongxing Chen, M.S., Jian Zhang, M.S., Dr. Xin Yuan
Students: Christina Fang, Vincent Hermoso, Natalia Petrova
UniProt is supported by the National Institutes of Health, grant # 1 U01 HG02712-01