Post on 11-Jan-2016
Unveiling new biological knowledge from multiresolution structural proteomics data:
A Data Base and Pattern Recognition Approach
José María Carazo
BioComputing Unit, Centro Nacional de Biotecnología, Madrid, Spain
(Who am I?) Research Areas
• Image Processing
• Helicase Struc/Func. Analysis
• Structural Databases
Hypothesis: Medium resolution EM data represent a rich biological information resource. Therefore:
• Step 1) Keep them organized (institutionally) in a new structural data base (do not lose them; keep them organized and accessible)
• Step 2) Extract the newly apparent macro-architecture features (identify the general organizational principles of large assemblies)
• Step 3) Make the “link” to structural proteomics at the amino acid level (go from “density blobs” to defined protein structures; “connect” atomic resolution information with “medium resolution” information)
• Step 4) Integrate this new structural information with other information sources
Step 1: Motivated by the impetus in cryo-EM, “Construct the EM Data Base (EMDB)”
• The work started in 1997 with the “BioImage” project of the EU, as a pilot study among research groups
• The work continued through 2000-2003 in the IIMS project, creating the EM Data Base as part of the core facilities of the EBI (European Bioinformatics Institute)
EMDB
• IIMS: to integrate the results of three-dimensional electron microscopy (3D-EM) with models from X-ray and NMR methods.
• Part of the MSD (Macromolecular Structure Database)
The project is funded by the European Commission as IIMS, contract no. QLRI-CT-2000-31237, under the RTD programme "Quality of Life and Management of Living Resources"
EMDB
• Relational Data Model
– Fully integrated in the MSD, together with PDB data
• XML-based Data Model
• EMDep, the Electron Microscopy Deposition Tool
– Dictionary driven
We note that the European Bioinformatics Institute (EBI) through the Macromolecular Structure Database (MSD) now provides a permanent resource for the deposition of three-dimensional maps derived by electron microscopy (see www.ebi.ac.uk/msdsrv/emdep). In addition, coordinate data derived from these maps are deposited in the PDB archive for macromolecular structural data. We intend to use these facilities for the routine deposition of maps and coordinate data produced by our work. These databases are open to the international community and will become part of a family of linked databases in biomedical research. We encourage our colleagues to follow our example by submitting maps, at the stage of publication, to these archival databases.
IIMS Workshop November 15-16, 2002
Sending data to EMD
… more than a hundred EM structures are now being published in the journals in a typical year. Without EMDB, these data would not be archived for future general use. So the size and usefulness of the database are likely to increase dramatically. Nature Structural Biology is strongly supportive of the general principle that scientific data should be professionally maintained and freely accessible, and so its editors will from now on encourage scientists to deposit their work in EMDB when papers describing EM structures are published in the journal.
Step 2: Discover biological knowledge: “Extract information on general organizational principles”
• GOAL: Since EM provides information on (potentially) quite large specimens, devise ways to automatically extract topological and geometrical information about the assemblies
• Driving principle: In order to close gaps between different techniques of structure determination, such as X-rays and cryo-EM, develop techniques able to work transparently across multiple resolution levels
(HERE COME “ALTERNATIVE REPRESENTATIONS”)
FEMME Database
Purpose: to store, in a universal data model, the topological and geometric features of 3D-reconstructed macromolecules regardless of the resolution achieved.
Final aim: Automatic detection of general organizational principles. Query by content in structural databases.
Methodology: Vector quantization and alpha-shape representation theory
J. Struct. Biol., 2004
Methodology
Original dataset: set of multimeric proteins coming from
• PDB/PQS databases (high resolution): macromolecular topology given by the atomic coordinates (Liang et al 1998)
• 3D-EM (medium resolution): macromolecular topology given by the selection of a set of pseudo-atoms (De-Alarcón et al 2002)
[Figure: pseudo-atoms; ALPHA COMPLEX; IDENTIFICATION, EXTRACTION AND CHARACTERISATION OF CHANNELS/CAVITIES/(PROTRUSIONS)]
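A minimal sketch of how a set of pseudo-atoms could be selected from a medium resolution map by vector quantization. Plain density-weighted k-means (Lloyd's algorithm) is used here as a stand-in; the slides do not say which vector quantizer was actually used, and the function name `pseudo_atoms` and the voxel format are assumptions made for illustration.

```python
import random

def pseudo_atoms(voxels, k, iterations=20, seed=0):
    """voxels: list of ((x, y, z), density) pairs with density > 0.
    Returns k density-weighted cluster centres (the hypothetical pseudo-atoms)."""
    rng = random.Random(seed)
    # Start from k voxels picked at random (assumption: any reasonable init would do).
    centres = [list(point) for point, _ in rng.sample(voxels, k)]
    for _ in range(iterations):
        sums = [[0.0, 0.0, 0.0] for _ in range(k)]
        weights = [0.0] * k
        for point, density in voxels:
            # Assign the voxel to its nearest centre (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda c: sum((point[i] - centres[c][i]) ** 2 for i in range(3)),
            )
            weights[nearest] += density
            for i in range(3):
                sums[nearest][i] += density * point[i]
        # Move each non-empty centre to the density-weighted mean of its voxels.
        for c in range(k):
            if weights[c] > 0:
                centres[c] = [s / weights[c] for s in sums[c]]
    return [tuple(c) for c in centres]
```

The resulting points would then play the role that atomic coordinates play at high resolution, i.e. the input to the alpha-complex construction.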
FEMME contents
Around 140 entries corresponding to alpha-shape representations of macromolecules and macromolecular structural features from data at any resolution level
One of the possible applications: detection of shape similarities among complexes
Detailed description about the number and kind of structural
features contained in the macromolecule
Shape, Size, Protrusions, Channels, Cavities ...
Structurally characterised macromolecule
Several descriptors of the macromolecule structure
CCT
ACTIN
TRICORN PROTEASE
RIBOSOME
FEMME DATABASE STORAGE
Query by content
Final aim
Step 3: Discover biological knowledge: “Make the “link” at the amino acid level” (Quantitative “visualization” of fine features)
• Goal: Bridging from atomic resolution to medium resolution
• Motivation: At some point the link from “density blobs” to defined amino acids has to be made, in order to “attach” biochemical and functional information to the medium resolution structures.
• Note: There are many substeps here; we will concentrate on “superfamily recognition” (in cooperation with other groups in the field, such as Chiu’s group)
Superfamily recognition
• Is surface information enough to detect a fold?
• Can we detect the fold present in a 3DEM map just by docking other known fold maps in it?
• Can some form of flexible docking using SSE be of help?
A working definition of superfamily recognition (in order of increasing difficulty):
• Identification of the SSE elements of a protein
• Their spatial distribution and connectivity (topology)
• Assignment of a structural family to the protein
• Assignment of a sequence family to the protein
• Assignment of a function
Information that can be used:
• Protein sequence/atomic resolution information: a range of methods (neural networks, threading, etc.)
• Medium resolution views of the protein = 3DEM maps
What are we doing?
• Is surface information enough to help assign a superfamily?
– Application of the spin-image-representation method by De Alarcon, P.A. and Pascual-Montano, A.
• Can we assign a superfamily in a 3DEM map just by docking other known fold maps in it?
– Application of the COAN docking method by Volkmann, N. within a new Bayesian scheme
• Can we assign a superfamily by some form of flexible docking, possibly using SSE elements?
– Work in progress
Superfamily assignment using surface information
• Surface information can give information about similarity between different folds.
• Surface comparison can be performed using techniques derived from the field of computer vision.
• Our studies reveal that similar folds according to the classification given by CATH (belonging to the same superfamily) also have similar surfaces at different resolutions ranging from 8 to 12 Å.
• Similarities in the surface are related to similarities in the amino acid sequence of the fold.
• The surface information can be used to detect folds or entire proteins in large assemblies.
Spin image representation (s.i.r.) of 3D-EM Maps
Spin-image representation of a 3D object:
A) s.i.r. principle: every point x of the surface is projected with respect to the plane defined by a point p and its normal n.
B) A 3D object with a point and its normal.
C) Points of a surface projected into a plane.
D) Spin image obtained from the binning of the projected surface points.
[Figure: panels A, B, C and D]
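The construction in panels A to D can be sketched as follows, using the usual spin-image coordinates from computer vision: for an oriented point (p, n), each surface point x maps to beta, its signed height along n, and alpha, its distance from the line through p along n. The function name, bin layout and parameters below are assumptions for illustration, not the implementation used in the work described.

```python
import math

def spin_image(points, p, n, bin_size, n_bins):
    """Accumulate an n_bins x n_bins spin image for the oriented point (p, n).

    Columns index alpha (radial distance from the normal axis, starting at 0).
    Rows index beta (signed height along n), from +half at row 0 down to -half.
    Points falling outside the image are ignored.
    """
    norm = math.sqrt(sum(c * c for c in n))
    n = [c / norm for c in n]  # make sure the normal has unit length
    half = n_bins * bin_size / 2.0
    image = [[0] * n_bins for _ in range(n_bins)]
    for x in points:
        d = [x[i] - p[i] for i in range(3)]
        beta = sum(d[i] * n[i] for i in range(3))
        # max(..., 0.0) guards against tiny negative values from rounding.
        alpha = math.sqrt(max(sum(c * c for c in d) - beta * beta, 0.0))
        col = int(alpha // bin_size)
        row = int((half - beta) // bin_size)
        if 0 <= row < n_bins and 0 <= col < n_bins:
            image[row][col] += 1
    return image
```

Because alpha ignores the angle around n, two surface points related by a rotation about the normal fall into the same bin, which is what makes the representation independent of the pose of the object.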
Applications: Partial Matching.
Local patches of the query object can be highlighted according to local similarity with objects in the database.
[Figure: query plane; 1st, 2nd and 3rd matches; coloured patches]
Proteins instead of airplanes….(dealing with multiple domains)
• Possibility of docking isolated domains into entire maps
• Take into account the surface info
• Speed
• Modularity
Fold recognition using fitting information
• Docking information can be used to detect the CATH superfamily of a single fold present in an electron microscopy map.
• Repeated cross-correlation experiments and a Bayesian probability framework have been used.
• The results show that the use of multiple dockings can overcome the uncertainty when the fold present in the 3D-EM map is unknown.
Fold recognition using docking info and Bayesian probability
Bayesian probability:
$P(\mathrm{fold}_i \mid \mathrm{DensityMap}) \propto P(\mathrm{DensityMap} \mid \mathrm{fold}_i)\, P(\mathrm{fold}_i)$
• $P(\mathrm{fold}_i \mid \mathrm{DensityMap})$: the probability of having a fold given a density map.
• $P(\mathrm{fold}_i)$: background probability of having an individual fold i, computed as the frequency of realizations of that fold in the total data set of structures to dock.
• $P(\mathrm{DensityMap} \mid \mathrm{fold}_i)$: probability of having a density map given a fold i, computed as follows:
1. A set of elements of the CATH superfamily that represents the fold is docked to the density map.
2. The probability that the density map belongs to that fold is computed as the probability that the sample values of cross-correlation came from the same population as the sample of cross-correlations from the elements of the CATH superfamily.
3. This test of homogeneity is done by a chi-squared test.
The fold with the highest value of $P(\mathrm{fold}_i \mid \mathrm{DensityMap})$ is assigned to the map.
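As a minimal sketch of the assignment rule, assuming the prior is the frequency of each fold in the docking set and that the posterior is normalized over the candidate folds (the normalization is an assumption; the fold labels and numbers in the usage note are hypothetical):

```python
def fold_posteriors(n_structures, likelihoods):
    """n_structures: fold -> number of structures of that fold in the docking set.
    likelihoods:  fold -> P(DensityMap | fold), e.g. from the homogeneity test.
    Returns fold -> P(fold | DensityMap) by Bayes' rule."""
    total = sum(n_structures.values())
    # Joint probability = likelihood x prior, with the prior taken as a frequency.
    joint = {f: likelihoods[f] * n_structures[f] / total for f in likelihoods}
    evidence = sum(joint.values())
    return {f: j / evidence for f, j in joint.items()}

def assign_fold(n_structures, likelihoods):
    """Return the fold with the highest posterior probability."""
    posteriors = fold_posteriors(n_structures, likelihoods)
    return max(posteriors, key=posteriors.get)
```

For example, `assign_fold({"fold_A": 10, "fold_B": 10}, {"fold_A": 0.75, "fold_B": 0.25})` picks `"fold_A"`, since with equal priors the posterior follows the likelihood.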
Fold recognition using docking info. Results:
At 12 Å resolution the information content (IC) is a very discriminant measure: 8 of 9 experiments detect the corresponding family with the best value. Example:
Map belonging to family 1.10.220.10 (SF = Superfamily, M = Map)
SF            Docked elements  Mean CC  Std. Dev.  P(M|SF)  P(M|Non-related)  IC      P(fold|M)
3.40.30.10    10               0.782    0.058      0.518    0.521             -0.003   0.00%
2.60.120.60   10               0.636    0.053      0.080    0.478             -0.143   0.00%
1.10.238.10   10               0.832    0.110      0.855    0.734              0.131   6.94%
1.20.90.10    10               0.766    0.012      0.003    0.116             -0.011   0.00%
1.10.220.10   10               0.952    0.036      0.730    0.086              1.563  82.88%
3.40.50.300   10               0.650    0.098      0.649    0.674             -0.025   0.00%
3.10.100.10   10               0.799    0.044      0.519    0.383              0.158   8.40%
2.40.10.10    10               0.787    0.029      0.306    0.274              0.034   1.79%
3.20.20.80    10               0.461    0.048      0.000    0.138              0.000   0.00%
2.60.120.20   10               0.585    0.056      0.203    0.344             -0.107   0.00%
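The slide does not define IC. As a hedged observation, the printed values are numerically consistent with IC = P(M|SF) · ln(P(M|SF) / P(M|Non-related)), and the P(fold|M) column with each positive IC divided by the sum of the positive ICs. This is inferred from the numbers in the table, not a formula stated in the talk; the sketch below only checks that reading against the printed rows.

```python
import math

# (superfamily, P(M|SF), P(M|Non-related), IC as printed, P(fold|M) as printed)
rows = [
    ("3.40.30.10", 0.518, 0.521, -0.003, 0.0000),
    ("2.60.120.60", 0.080, 0.478, -0.143, 0.0000),
    ("1.10.238.10", 0.855, 0.734, 0.131, 0.0694),
    ("1.20.90.10", 0.003, 0.116, -0.011, 0.0000),
    ("1.10.220.10", 0.730, 0.086, 1.563, 0.8288),
    ("3.40.50.300", 0.649, 0.674, -0.025, 0.0000),
    ("3.10.100.10", 0.519, 0.383, 0.158, 0.0840),
    ("2.40.10.10", 0.306, 0.274, 0.034, 0.0179),
    ("3.20.20.80", 0.000, 0.138, 0.000, 0.0000),
    ("2.60.120.20", 0.203, 0.344, -0.107, 0.0000),
]

def information_content(p_sf, p_non):
    """Assumed reading of IC: p * ln(p / q), taken as 0 when p is 0."""
    return p_sf * math.log(p_sf / p_non) if p_sf > 0 else 0.0

ic = {sf: information_content(p, q) for sf, p, q, _, _ in rows}
positive_total = sum(v for v in ic.values() if v > 0)
# Assumed reading of P(fold|M): positive IC values normalized to sum to one.
share = {sf: max(v, 0.0) / positive_total for sf, v in ic.items()}
best = max(share, key=share.get)  # superfamily assigned to the map
```

Under this reading, superfamilies whose docking scores look no better than those of unrelated folds get a non-positive IC and drop out of the normalization.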
Success rate by resolution (superfamilies correctly discriminated):
Resol.  Success rate
12 Å    8 / 9
14 Å    6 / 10
16 Å    4 / 10
20 Å    4 / 10
Fold recognition: Extension of the work to multidomain maps
Can a single fold be detected in the entire electron microscopy map?
The cross-correlation approach fails in many cases.
[Figure: correct position vs. position found by cross correlation]
Fold recognition: Flexible docking
• By flexible docking we mean deforming certain points in the fold to better resemble what we have in the medium resolution density.
• The points chosen to deform are those located at the ends of the secondary structure elements of the fold.
• To allow for deformations we need to consider different alternatives for each point and choose those which best respect the architecture of the fold superfamily; it does not need to be exactly the same.
Step 4: Discover biological knowledge: “Integrate information”
• Goal: Integrate structural information at all levels of resolution with other sources of information
• Means: Semantic mediation over heterogeneous data sources
• This is a necessary step towards new powerful data mining approaches, and in data mining the “user” should be in the analysis loop via some graphical interface
Motivating example: DNA binding macromolecules
• PQS database: multimeric structure
• CATH/SCOP databases: DNA clamp fold
• FEMME database: central channel
Result: multimeric structures containing the DNA clamp fold and with a central channel
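Reduced to its simplest form, the motivating query is a join across three sources. The sketch below assumes each source has already answered its own sub-query with a set of entry identifiers that are directly comparable; the identifiers are hypothetical placeholders, and a real mediator would resolve the correspondence between sources through its domain map rather than by string equality.

```python
def entries_matching_all(*id_sets):
    """Return the identifiers present in every source's answer set, sorted."""
    common = set(id_sets[0])
    for ids in id_sets[1:]:
        common &= set(ids)
    return sorted(common)

# Hypothetical answers from each source (placeholder identifiers, not real entries).
pqs_multimers = {"entry_A", "entry_B", "entry_C"}          # multimeric structure
cath_dna_clamp = {"entry_B", "entry_C", "entry_D"}         # DNA clamp fold
femme_central_channel = {"entry_B", "entry_C", "entry_E"}  # central channel

hits = entries_matching_all(pqs_multimers, cath_dna_clamp, femme_central_channel)
```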
Ultimate means: Semantic Data Mediation
• Programmable integrator
– Interleaves information access and algorithm execution
• Semantic mediator
– Encodes and executes domain-specific expert rules for data joining
[Figure: mediator architecture]
USER/Client
CM (Integrated View)
Mediator Engine: FL rule proc., LP rule proc., Graph proc. (XSB Engine); Domain Map (DM); Integrated View Definition (IVD)
Logic API (capabilities)
CM Queries & Results (exchanged in XML)
CM Plug-In: GCM with CM S1, CM S2, CM S3
Per source: CM-Wrapper and XML-Wrapper
Sources S1, S2, S3: Relational Databases; Web sources (HTML, XML); Service applications
Extended Domain Map in a Structural Biology Context
[Figure: domain map. Recoverable node and edge labels:]
• Concepts: My_protein (name), My_Polypeptide chain, My_function, Medium-Resolution 3D Image, Alpha-shape, SSE, Triangulated Surface, Fold Instance, Cavity/Channels, Protrusion, 3D Point (X, Y, Z), Connectivity, Curvature, Normal, Properties (area, …)
• Data sources: Swissprot, PDB, PQS, CATH Superfamily, Enzyme Database, InterPro
• Tools: Helix hunter, Beta hunter, Fold hunter, Superfamily detector
• Relations: Has, Has +, Has *, Found_in+, Derive, Derive+
Red-framed boxes require visualization tools!!
Current state: PLAN – a Language for a Programmable Integrator
• XML-based language
• XQuery
PLAN Example
Retrieve those folds in CATH corresponding to proteins which contain a given InterPro motif (IPR001198)
InterPro
SwissProt matches
PDB chains
CATH codes
http://www.ebi.ac.uk/interpro
BLASTp search
CATH Domain Description File
W.S.J. Valdar, J.M. Thornton, Protein–Protein Interfaces: Analysis of Amino Acid Conservation in Homodimers. PROTEINS: Structure, Function, and Genetics 42:108–124 (2001)
• the protomer to be studied must form a stable, symmetric complex with one other protomer to which it is identical (or nearly identical), such that the oligomer is homodimeric and the conservation of only one chain need be considered;
• the full wild-type complex must be available in PDB or PQS;
• of all the structures available for the complex, the structure chosen must have the best combination of the following properties:
– high resolution, inclusion of any bound cofactors that occur naturally, the inclusion of a ligand similar in size and shape to that of the natural substrate.
• to enable the robust identification of a diverse set of homologues, the protomer should be represented in CATH;
• the protomer sequence must have non-fragment homologues in SwissProt that are numerous (>10) and diverse (<70% mean pairwise sequence identity) and that, by their annotation, share its function and multimeric state
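The last criterion is the only one with explicit thresholds, so it can be written down directly. A minimal sketch, assuming the homologues are already aligned to equal length with "-" marking gaps and that identity is counted only over columns where neither sequence has a gap (both are assumptions; the function names are hypothetical):

```python
from itertools import combinations

def percent_identity(a, b):
    """Percent of identical residues over the columns where neither sequence has a gap."""
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    return 100.0 * sum(x == y for x, y in compared) / len(compared)

def numerous_and_diverse(aligned, min_count=10, max_mean_identity=70.0):
    """True if there are more than min_count homologues and their mean
    pairwise sequence identity is below max_mean_identity percent."""
    if len(aligned) <= min_count:
        return False
    pairs = list(combinations(aligned, 2))
    mean_identity = sum(percent_identity(a, b) for a, b in pairs) / len(pairs)
    return mean_identity < max_mean_identity
```

The annotation checks (shared function and multimeric state) are left out because they depend on how the SwissProt records are parsed, which the slides do not describe.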
1. The oligomer is homodimeric
2. Available in CATH
3. Group by protein
3a. Numerous distant homologues
3b. Wild-type protein
4. Share multimeric state
5. Final selection
PQS
CATH
BLAST
SwissProt
PDB, ENZYME
Data sources Operation Criteria
Filtering
Collection
<QUERY>
  <result>
    LET $x := set("","ipr","IPR001198"),
        $x := set($x,"display","n"),
        $x := set($x,"dmax","20000"),
        $y := constructURL("GET","http://www.ebi.ac.uk/interpro/ISpy",$x)
    RETURN $y
  </result>
</QUERY>
<TRAVERSE>POP</TRAVERSE>
<QUERY>
  <result>
    <DATA NAME="InterProMatches" TYPE="Add"> RETURN stream() </DATA>
  </result>
</QUERY>
URL constructor
Wrapper call
Internal data buffer (allows XML filtering)
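For readers unfamiliar with PLAN, the URL-constructor step of the first query corresponds roughly to the following Python. This is only an analogy: it assumes that `set` accumulates name/value pairs and that `constructURL` appends them as a GET query string, which the slides do not spell out. The sketch builds the URL and does not fetch it, since the endpoint shown on the slide may no longer exist.

```python
from urllib.parse import urlencode

def construct_url(base, params):
    """Append the parameters to the base address as a GET query string."""
    return base + "?" + urlencode(params)

# Same parameters, in the same order, as in the PLAN query.
params = {"ipr": "IPR001198", "display": "n", "dmax": "20000"}
url = construct_url("http://www.ebi.ac.uk/interpro/ISpy", params)
```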
PLAN Example (I)
<WHILE>
  <CONDITION>
    <STACK> <CONDITION>NONEMPTY</CONDITION> </STACK>
  </CONDITION>
  <DO>
    <TRAVERSE>POP</TRAVERSE>
    <QUERY>
      <result>
        <DATA NAME="spToPdb" TYPE="Add"> RETURN stream() </DATA>
      </result>
    </QUERY>
  </DO>
</WHILE>
<CONSTRUCT> <DATA NAME="r1" /> </CONSTRUCT>
<DELETE FILE="./resultFiles/q1_IPR001198.xml" />
<PRINTOUT FILE="./resultFiles/q1_IPR001198.xml" />
<XMLBUFFER NAME="InterproMatches" />
Working register is…
PLAN Example (II)
Save result data in a file
Nesting requests
Final Remark: “Infrastructures”
• All our software is public domain, with a sustained tradition of making it really accessible (XMIPP, BPR…)
Acknowledgements
• The CNB Biocomputing Unit: L.E. Donate, Mikel Valle, Carmen San Martin, María Gómez, Yolanda Robledo, Rafael Núñez, Yacob, Monica Chagoyen, Roberto Marabini, Alberto Pascual, Carlos-Oscar Sanchez, Natalia Jiménez-Lozano, Javier A. Velázquez-Muriel, Pedro Carmona, David Elguero, Jesus Cuenca
• Extramural: The EBI Team, Herbert Edelsbrunner, Wah Chiu’s Lab, SDSC (Gupta’s Lab), Ioannis Kakadiaris’s Lab, Niels Volkmann, Gruss and Cheng Lab, Mark Ellisman Lab
• (and MANY other interactions)