unreveiling new biological knowledge from multiresolution structural proteomics data: a data base...

Unreveiling new biological knowledge from multiresolution structural proteomics data: A Data Base and Pattern Recognition Approach

José María Carazo

BioComputing Unit, Centro Nacional de Biotecnología, Madrid, Spain

(Who am I?) Research Areas

Image Processing

Helicase Struc./Func. Analysis

Structural Databases

Hypothesis: Medium-resolution EM data represent a rich biological information resource. Therefore:

• Step 1) Keep them organized (institutionally) in a new structural data base (do not lose them; keep them organized and accessible)

• Step 2) Extract the macro-architecture features that now appear (recognize the general organizational principles of large assemblies)

• Step 3) Make the “link” to structural proteomics at the amino acid level (go from “density blobs” to defined protein structures; “connect” atomic-resolution information with medium-resolution information)

• Step 4) Integrate this new structural information with other information sources

Step 1: Motivated by impetus in cryo EM: “Construct the EM Data Base (EMDB)”

• The work started in 1997 with the “BioImage” project of the EU as a pilot study among research groups

• The work continued through 2000-2003 in the IIMS project, creating the EM Data Base as part of the core facilities of the EBI (European Bioinformatics Institute)

• IIMS: to integrate the results of three-dimensional electron microscopy (3D-EM) with models from X-ray and NMR methods

• Part of the MSD (Macromolecular Structure Database)

The project is funded by the European Commission as the IIMS, contract no. QLRI-CT-2000-31237, under the RTD programme "Quality of Life and Management of Living Resources"

• Relational Data Model
– Fully integrated in the MSD, together with PDB data

• XML-based Data Model

• EMDep, the Electron Microscopy Deposition Tool
– Dictionary driven

We note that the European Bioinformatics Institute (EBI) through the Macromolecular Structure Database (MSD) now provides a permanent resource for the deposition of three-dimensional maps derived by electron microscopy (see www.ebi.ac.uk/msdsrv/emdep). In addition, coordinate data derived from these maps are deposited in the PDB archive for macromolecular structural data. We intend to use these facilities for the routine deposition of maps and coordinate data produced by our work. These databases are open to the international community and will become part of a family of linked databases in biomedical research. We encourage our colleagues to follow our example by submitting maps, at the stage of publication, to these archival databases.

IIMS Workshop November 15-16, 2002

Sending data to EMD

… more than a hundred EM structures are now being published in the journals in a typical year. Without EMDB, these data would not be archived for future general use. So the size and usefulness of the database are likely to increase dramatically. Nature Structural Biology is strongly supportive of the general principle that scientific data should be professionally maintained and freely accessible, and so its editors will from now on encourage scientists to deposit their work in EMDB when papers describing EM structures are published in the journal.

Step 2: Discover biological knowledge: “Extract information on general organizational principles”

• GOAL: Since EM provides information on (potentially) quite large specimens, devise ways to automatically extract topological and geometrical information about the assemblies

• Driving principle: To close the gaps between different techniques of structure determination, such as X-ray methods and cryo-EM, develop techniques able to work transparently across multiple resolution levels

(HERE COME “ALTERNATIVE REPRESENTATIONS”)

FEMME Database

Purpose: to store, in a universal data model, the topological and geometric features of 3D-reconstructed macromolecules regardless of the resolution achieved.

Final aim: Automatic detection of general organizational principles. Query by content in structural databases.

Methodology: Vector quantization and alpha-shape representation theory

J. Struct. Biol., 2004

Methodology

Original dataset: set of multimeric proteins coming from

• PDB/PQS databases (high resolution): macromolecular topology given by the atomic coordinates (Liang et al 1998)

• 3D-EM (medium resolution): macromolecular topology given by the selection of a set of pseudo-atoms (De-Alarcón et al 2002)

[Diagram: pseudo-atoms → ALPHA COMPLEX → IDENTIFICATION, EXTRACTION AND CHARACTERISATION OF CHANNELS/CAVITIES/(PROTRUSIONS)]

Around 140 entries corresponding to alpha-shape representations of macromolecules and macromolecular structural features from data at any resolution level
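The pseudo-atom selection at medium resolution is a vector-quantization step: a small set of code vectors is placed so that it follows the density. The sketch below uses plain density-weighted Lloyd iterations (k-means) as a stand-in. It is not the specific algorithm of De-Alarcón et al., and the function name `pseudo_atoms` and its parameters are hypothetical.

```python
import random

def pseudo_atoms(voxels, k, iterations=20, seed=0):
    """Place k pseudo-atoms in a density map by density-weighted Lloyd iterations.

    voxels: list of ((x, y, z), density) pairs with density > 0.
    Returns a list of k (x, y, z) code vectors.
    """
    rng = random.Random(seed)
    # Initialise the code vectors on k distinct voxel positions.
    centres = [pos for pos, _ in rng.sample(voxels, k)]
    for _ in range(iterations):
        # Per code vector: weighted sums of x, y, z and the total weight.
        sums = [[0.0, 0.0, 0.0, 0.0] for _ in range(k)]
        for pos, w in voxels:
            # Assign each voxel to its nearest code vector.
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(pos, centres[c])))
            for d in range(3):
                sums[j][d] += w * pos[d]
            sums[j][3] += w
        # Move each code vector to the density-weighted centroid of its voxels.
        centres = [
            tuple(s[d] / s[3] for d in range(3)) if s[3] > 0 else centres[j]
            for j, s in enumerate(sums)
        ]
    return centres
```

The resulting points would then play the role that atomic coordinates play at high resolution, as input to the alpha-complex construction.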

FEMME contents

One of the possible applications: detection of shape similarities among complexes

Detailed description about the number and kind of structural

features contained in the macromolecule

Shape, Size, Protrusions, Channels, Cavities ...

Structurally characterised macromolecule

Several descriptors of the macromolecule structure

TRICORN PROTEASE

RIBOSOME

FEMME DATABASE STORAGE

Query by content

Final aim
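A query by content can be pictured as a nearest-neighbour search over stored structural descriptors. The sketch below is only an illustration: the descriptor layout (counts of channels, cavities and protrusions), the entry names and the values are all hypothetical and are not taken from FEMME.

```python
import math

def rank_by_similarity(query, database):
    """Return database entry names sorted by Euclidean distance to the query descriptor."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sorted(database, key=lambda name: distance(query, database[name]))

# Hypothetical descriptors: (number of channels, number of cavities, number of protrusions)
database = {
    "entry_A": (1, 0, 6),
    "entry_B": (0, 3, 2),
    "entry_C": (1, 1, 6),
}
ranking = rank_by_similarity((1, 0, 5), database)
print(ranking)
```

A real system would need descriptors normalised to comparable scales; unweighted Euclidean distance is used here only to keep the example short.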

Step 3: Discover biological knowledge: “Make the “link” at the

aminoacid level” (Quantitative “visualization” of fine features)• Goal: Bridging from atomic resolution to medium

resolution

• Motivation: At some moment the link from “density blobs” to define aminoacids has to done. This is so in order to “attach” biochemical and functional information to the medium resolution structures.

• Note: There are many substeps here, we will concentrate on “superfamily recognition” (and in cooperation with other groups in the field, like Chiu’s group)

Superfamily recognition

• Is surface information enough to detect a fold ?

• Can we detect the fold present in a 3DEM map just by docking other known fold maps into it?

• Can some form of flexible docking using SSE be of help?

A working definition of superfamily recognition (in order of increasing difficulty):

• Identification of the SSE elements of a protein

• Their spatial distribution and connectivity (topology)

• Assignment of a structural family to the protein

• Assignment of a sequence family to the protein

• Assignment of a function

Information that can be used:

• Protein sequence/atomic resolution information: a range of methods (neural networks, threading, etc.)

• Medium resolution views of the protein = 3DEM maps

What are we doing?

• Is surface information enough to help assign a superfamily?

– Application of the spin-image-representation method by De Alarcon, P.A. and Pascual-Montano, A.

• Can we assign a superfamily in a 3DEM map just by docking other known fold maps into it?

– Application of the COAN docking method by Volkmann, N. within a new Bayesian scheme

• Can we assign a superfamily by some form of flexible docking, possibly using SSE elements?

– Work in progress

Superfamily assignment using surface information

• Surface information can give information about similarity between different folds.

• Surface comparison can be performed using techniques derived from the field of computer vision.

• Our studies reveal that similar folds according to the CATH classification (belonging to the same superfamily) also have similar surfaces at resolutions ranging from 8 to 12 Å.

• Similarities in the surface are related to similarities in the fold's amino acid sequence.

• The surface information can be used to detect folds or entire proteins in large assemblies.

Spin image representation (s.i.r.) of 3D-EM Maps

Spin-image representation of a 3D object:

A) s.i.r. principle: every point x of the surface is projected with respect to the plane defined by a point p and its normal n.

B) A 3D object with a point and its normal. C) Points of a surface projected onto a plane. D) Spin image obtained by binning the projected surface points.
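The spin-image principle can be sketched in a few lines. For an oriented point (p, n), each surface point x is reduced to two coordinates: beta, the signed height of x above the tangent plane through p, and alpha, the radial distance of x from the line through p along n. The pairs are then binned into a 2D histogram. The function name, the bin layout and the parameters below are assumptions made for illustration, not the authors' implementation.

```python
import math

def spin_image(p, n, points, bin_size, n_alpha, n_beta):
    """Accumulate a 2D histogram of (alpha, beta) around the oriented point (p, n).

    beta  = signed distance of x from the tangent plane through p (along n)
    alpha = radial distance of x from the line through p along n
    Rows index beta (row n_beta // 2 starts at beta = 0), columns index alpha.
    """
    norm = math.sqrt(sum(c * c for c in n))
    n = [c / norm for c in n]  # make sure the normal has unit length
    image = [[0] * n_alpha for _ in range(n_beta)]
    for x in points:
        d = [xi - pi for xi, pi in zip(x, p)]
        beta = sum(di * ni for di, ni in zip(d, n))
        alpha = math.sqrt(max(sum(di * di for di in d) - beta * beta, 0.0))
        col = int(alpha // bin_size)
        row = int(math.floor(beta / bin_size)) + n_beta // 2
        # Points falling outside the image support are ignored.
        if 0 <= col < n_alpha and 0 <= row < n_beta:
            image[row][col] += 1
    return image
```

Because alpha depends only on the distance from the axis through p, the image is unchanged by rotations about the normal, which is what makes the representation usable for comparing surfaces in arbitrary orientations.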

Applications: Partial Matching.

Local patches of the query object can be highlighted according to local similarity with objects in the database.

[Figure: query plane; 1st, 2nd and 3rd matches; coloured patches]

Proteins instead of airplanes….(dealing with multiple domains)

• Possibility of docking isolated domains into entire maps

• Take into account the surface info

• Speed

• Modularity

Fold recognition using fitting information

• Docking information can be used to detect the CATH superfamily of a single fold present in an electron microscopy map.

• Repeated cross-correlation experiments and a Bayesian probability framework have been used.

• The results show that the use of multiple dockings can overcome the uncertainty when the fold present in the 3D-EM map is unknown.
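Each docking experiment yields a cross-correlation score between the experimental map and a map generated from the docked fold. As a minimal sketch, assuming both maps are sampled on the same grid and flattened to lists of voxel values, a normalised (Pearson-type) coefficient can be computed as follows. The exact scoring function used by the docking program is not given in the slides, so this form is an assumption.

```python
import math

def cross_correlation(map_a, map_b):
    """Normalised cross-correlation coefficient between two equally sized density arrays."""
    if len(map_a) != len(map_b):
        raise ValueError("maps must have the same number of voxels")
    mean_a = sum(map_a) / len(map_a)
    mean_b = sum(map_b) / len(map_b)
    num = sum((a - mean_a) * (b - mean_b) for a, b in zip(map_a, map_b))
    den = math.sqrt(sum((a - mean_a) ** 2 for a in map_a)
                    * sum((b - mean_b) ** 2 for b in map_b))
    # A constant map has zero variance; report no correlation in that case.
    return num / den if den > 0 else 0.0
```

Docking several members of a superfamily into the same map gives a sample of such coefficients rather than a single value, which is the input the Bayesian framework works with.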

Fold recognition using docking info and Bayesian probability

Bayesian probability:

$$P(\mathrm{fold}_i \mid \mathrm{DensityMap}) = \frac{P(\mathrm{DensityMap} \mid \mathrm{fold}_i)\, P(\mathrm{fold}_i)}{\sum_j P(\mathrm{DensityMap} \mid \mathrm{fold}_j)\, P(\mathrm{fold}_j)}$$

• $P(\mathrm{fold}_i \mid \mathrm{DensityMap})$: the probability of having a fold given a density map.

• $P(\mathrm{fold}_i)$: background probability of having an individual fold i, computed as the frequency of realizations of that fold in the total data set of structures to dock.

• $P(\mathrm{DensityMap} \mid \mathrm{fold}_i)$: probability of having a density map given a fold i, computed as follows:

1. A set of elements of the CATH superfamily that represents the fold is docked to the density map.

2. The probability that the density map belongs to that fold is computed as the probability that the sample values of cross-correlation came from the same population as the sample of cross-correlations from the elements of the CATH superfamily.

3. This test of homogeneity is done by a chi-squared test.

The fold with the highest value of $P(\mathrm{fold}_i \mid \mathrm{DensityMap})$ is assigned to the map.
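As a small worked illustration of the assignment rule, the sketch below applies Bayes' rule to a set of candidate folds and picks the one with the highest posterior. The fold names, priors and likelihoods are hypothetical numbers chosen for the example. In the approach described here the likelihoods would come from the chi-squared homogeneity test, which is not reproduced.

```python
def fold_posteriors(priors, likelihoods):
    """Bayes' rule: P(fold_i | M) = P(M | fold_i) P(fold_i) / sum_j P(M | fold_j) P(fold_j)."""
    joint = {fold: likelihoods[fold] * priors[fold] for fold in priors}
    evidence = sum(joint.values())
    if evidence == 0:
        raise ValueError("no fold has non-zero probability for this map")
    return {fold: value / evidence for fold, value in joint.items()}

# Hypothetical numbers for three folds, for illustration only.
priors = {"fold_A": 0.5, "fold_B": 0.3, "fold_C": 0.2}
likelihoods = {"fold_A": 0.10, "fold_B": 0.60, "fold_C": 0.05}
posteriors = fold_posteriors(priors, likelihoods)
best = max(posteriors, key=posteriors.get)
print(best, round(posteriors[best], 3))
```

Here fold_B wins despite a lower prior than fold_A, because its likelihood is six times larger.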

Fold recognition using docking info. Results:

At 12 Å resolution the information content is a highly discriminating measure: 8 of 9 experiments detect the corresponding family with the best value. Example:

Map belonging to family 1.10.220.10 (SF = Superfamily, M = Map)

SF            Docked elements  Mean CC  Std. Dev.  P(M|SF)  P(M|Non-related)  IC      P(fold|M)
3.40.30.10    10               0.782    0.058      0.518    0.521             -0.003   0.00%
2.60.120.60   10               0.636    0.053      0.080    0.478             -0.143   0.00%
1.10.238.10   10               0.832    0.110      0.855    0.734              0.131   6.94%
1.20.90.10    10               0.766    0.012      0.003    0.116             -0.011   0.00%
1.10.220.10   10               0.952    0.036      0.730    0.086              1.563  82.88%
3.40.50.300   10               0.650    0.098      0.649    0.674             -0.025   0.00%
3.10.100.10   10               0.799    0.044      0.519    0.383              0.158   8.40%
2.40.10.10    10               0.787    0.029      0.306    0.274              0.034   1.79%
3.20.20.80    10               0.461    0.048      0.000    0.138              0.000   0.00%
2.60.120.20   10               0.585    0.056      0.203    0.344             -0.107   0.00%

Success rate by resolution:

Resolution   Superfamilies correctly discriminated
12 Å         8 / 9
14 Å         6 / 10
16 Å         4 / 10
20 Å         4 / 10
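The slides do not say how the IC and P(fold|M) columns of the example results table are computed. A numerical check suggests that the tabulated values are consistent, to rounding, with IC = P(M|SF) · ln(P(M|SF) / P(M|Non-related)) and with a final column equal to the positive IC values normalised to 100%. This is an inference from the numbers, not a statement of the authors' method. The sketch below reproduces the check.

```python
import math

# (superfamily, P(M|SF), P(M|Non-related), IC, P(fold|M) in %) as given in the results table.
rows = [
    ("3.40.30.10", 0.518, 0.521, -0.003, 0.00),
    ("2.60.120.60", 0.080, 0.478, -0.143, 0.00),
    ("1.10.238.10", 0.855, 0.734, 0.131, 6.94),
    ("1.20.90.10", 0.003, 0.116, -0.011, 0.00),
    ("1.10.220.10", 0.730, 0.086, 1.563, 82.88),
    ("3.40.50.300", 0.649, 0.674, -0.025, 0.00),
    ("3.10.100.10", 0.519, 0.383, 0.158, 8.40),
    ("2.40.10.10", 0.306, 0.274, 0.034, 1.79),
    ("3.20.20.80", 0.000, 0.138, 0.000, 0.00),
    ("2.60.120.20", 0.203, 0.344, -0.107, 0.00),
]

def information_content(p_sf, p_nr):
    """Assumed form: P(M|SF) * ln(P(M|SF) / P(M|Non-related)); zero when P(M|SF) is zero."""
    return p_sf * math.log(p_sf / p_nr) if p_sf > 0 else 0.0

ic = {name: information_content(p_sf, p_nr) for name, p_sf, p_nr, _, _ in rows}
positive_total = sum(v for v in ic.values() if v > 0)
# Assumed normalisation: negative IC values contribute nothing to the final column.
share = {name: 100 * max(v, 0.0) / positive_total for name, v in ic.items()}

for name, _, _, ic_table, pct_table in rows:
    print(f"{name:12s} IC {ic[name]:+.3f} (table {ic_table:+.3f})  "
          f"share {share[name]:5.2f}% (table {pct_table:.2f}%)")
```

Small differences in the last digit are expected, since the inputs are themselves rounded to three decimals.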

Fold recognition: Extension of the work to multidomain maps

Can a single fold be detected in the entire electron microscopy map?

The cross-correlation approach fails in many cases.

[Figure: correct position vs. position found by cross correlation]

Fold recognition: Flexible docking

• By flexible docking we mean deforming certain points in the fold to better resemble what we have in the medium-resolution density.

• The important points chosen to deform are those located at the ends of the secondary structure elements of the fold.

• To allow for deformations we need to consider different alternatives for each point and choose those that best respect the architecture of the fold superfamily, although it does not need to be exactly the same.

Step 4: Discover biological knowledge: “Integrate information”

• Goal: Integrate structural information at all levels of resolution with other sources of information

• Means: Semantic mediation over heterogeneous data sources

• This is a necessary step towards powerful new data mining approaches, and in data mining the “user” should be in the analysis loop via some graphical interface

Motivating example: DNA binding macromolecules

• PQS database: multimeric structure

• CATH/SCOP databases: DNA clamp fold

• FEMME database: central channel

Result: multimeric structures containing the DNA clamp fold and with a central channel
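Once each source has answered its own part of the question, the motivating query reduces to a join on a shared identifier. The sketch below shows that final step as a set intersection. The identifiers are hypothetical placeholders, and the hard part that the mediator addresses, namely mapping each source's identifiers onto a common key, is assumed to be already solved.

```python
# Hypothetical identifiers; real answers would come from PQS, CATH/SCOP and FEMME.
multimeric_in_pqs = {"entry_1", "entry_2", "entry_3", "entry_4"}
dna_clamp_fold_in_cath = {"entry_2", "entry_3", "entry_5"}
central_channel_in_femme = {"entry_2", "entry_3", "entry_4", "entry_5"}

# Keep only the entries that satisfy all three criteria.
hits = multimeric_in_pqs & dna_clamp_fold_in_cath & central_channel_in_femme
print(sorted(hits))
```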

Ultimate means: Semantic Data Mediation

• Programmable integrator
– Interleaves information access and algorithm execution

• Semantic mediator
– Encodes and executes domain-specific expert rules for data joining

[Figure: mediator architecture. Components shown: USER/Client; CM (Integrated View); Mediator Engine with FL rule processor, LP rule processor, graph processor and XSB Engine; Domain Map (DM); Integrated View Definition (IVD); Logic API (capabilities); CM Plug-In; CM queries and results (exchanged in XML); sources S1, S2 and S3 (relational databases, web sources in HTML/XML, service applications), each with an XML-Wrapper and a CM-Wrapper.]

Extended Domain Map in a Structural Biology Context

[Figure: domain map. Nodes: My_Polypeptide chain, My_protein, My_function, Swissprot, PDB, PQS, CATH Superfamil, Enzyme Database, InterPro, Medium-Resolution 3D Image, Alpha-shape, Triangulated Surface, Helix hunter, Beta hunter, Fold Instance, Superfamily detector, Fold hunter, Cavity/Channels, Protrusion, Connectivity, 3D Point, Properties (area, …), X,Y,Z, Curvature, Normal. Edge labels: Has, Has *, Has +, Found_in+, Derive, Derive+.]

Red-framed boxes require visualization tools!!

Current state: PLAN – a Language for a Programmable Integrator

• XML-based language

• XQuery

PLAN Example

Retrieve those folds in CATH corresponding to proteins which contain a given InterPro motif (IPR001198)

[Figure: InterPro (http://www.ebi.ac.uk/interpro) → SwissProt matches → PDB chains (BLASTp search) → CATH codes (CATH Domain Description File)]

W.S.J. Valdar, J.M. Thornton, Protein–Protein Interfaces: Analysis of Amino Acid Conservation in Homodimers PROTEINS: Structure, Function, and Genetics 42:108–124 (2001)

• the protomer to be studied must form a stable, symmetric complex with one other protomer to which it is identical (or nearly identical), such that the oligomer is homodimeric and the conservation of only one chain need be considered;

• the full wild-type complex must be available in PDB or PQS;

• of all the structures available for the complex, the structure chosen must have the best combination of the following properties:
– high resolution, inclusion of any bound cofactors that occur naturally, the inclusion of a ligand similar in size and shape to that of the natural substrate.

• to enable the robust identification of a diverse set of homologues, the protomer should be represented in CATH

• the protomer sequence must have non-fragment homologues in SwissProt that are numerous (>10) and diverse (<70% mean pairwise sequence identity) and that, by their annotation, share its function and multimeric state
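The last criterion is the most mechanical one and makes a convenient filtering example. The sketch below checks "numerous (>10) and diverse (<70% mean pairwise sequence identity)" for a list of homologues. The function name is hypothetical, and `identity` stands for whatever pairwise identity measure is available (for example, one derived from an alignment); it is supplied by the caller.

```python
from itertools import combinations

def has_numerous_diverse_homologues(homologues, identity,
                                    min_count=10, max_mean_identity=70.0):
    """Check the '>10 homologues, <70% mean pairwise identity' criterion.

    homologues: list of sequence identifiers.
    identity: function (id_a, id_b) -> percent sequence identity (caller-supplied).
    """
    if len(homologues) <= min_count:
        return False
    pairs = list(combinations(homologues, 2))
    mean_identity = sum(identity(a, b) for a, b in pairs) / len(pairs)
    return mean_identity < max_mean_identity
```

The annotation-based parts of the criterion (shared function and multimeric state) are not covered here, since they depend on how the SwissProt annotations are parsed.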

1. The oligomer is homodimeric

2. Available in CATH

3. Group by protein

3a. Numerous distant homologues

3b. Wild-type protein

4. Share multimeric state

5. Final selection

[Figure: workflow table with columns Data sources (SwissProt; PDB, ENZYME), Operation (Filtering, Collection) and Criteria]

PLAN Example (I)

URL constructor:

    <QUERY>
      <result>
        LET $x := set("","ipr","IPR001198"),
            $x := set($x,"display","n"),
            $x := set($x,"dmax","20000"),
            $y := constructURL("GET","http://www.ebi.ac.uk/interpro/ISpy",$x)
        RETURN $y
      </result>
    </QUERY>

Wrapper call:

    <TRAVERSE>POP</TRAVERSE>

Internal data buffer (allows XML filtering):

    <QUERY>
      <result>
        <DATA NAME="InterProMatches" TYPE="Add">
          RETURN stream()
        </DATA>
      </result>
    </QUERY>
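For readers unfamiliar with PLAN, the URL-construction step of the first query can be mirrored in Python. This is a sketch of what `set(...)` and `constructURL("GET", ...)` appear to do, assuming they build a parameter list and append it to the base address as a query string; the helper name `construct_url` is hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

def construct_url(base, params):
    """Append URL-encoded query parameters to a base address for a GET request."""
    return f"{base}?{urlencode(params)}"

# Same parameters as in the PLAN query, in the same order.
params = {"ipr": "IPR001198", "display": "n", "dmax": "20000"}
url = construct_url("http://www.ebi.ac.uk/interpro/ISpy", params)
print(url)
```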

PLAN Example (II)

    <WHILE>
      <CONDITION>
        <STACK>
          <CONDITION>NONEMPTY</CONDITION>
        </STACK>
      </CONDITION>
      <DO>
        <TRAVERSE>POP</TRAVERSE>
        <QUERY>
          <result>
            <DATA NAME="spToPdb" TYPE="Add">
              RETURN stream()
            </DATA>
          </result>
        </QUERY>
      </DO>
    </WHILE>

Working register is…

Save result data in a file

Nesting requests
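The control flow of the second example is a stack-driven loop: while the stack of pending requests is non-empty, pop one, execute it, and add the response to a named buffer. The Python sketch below mirrors that structure under the assumption that `TRAVERSE POP` pops and executes a request and that `TYPE="Add"` appends to the buffer. The `fetch` callable is a hypothetical stand-in for a wrapper call and makes no network access.

```python
def drain_stack(stack, fetch):
    """Pop requests until the stack is empty, collecting each response in a named buffer."""
    buffers = {"spToPdb": []}
    while stack:                                   # <STACK><CONDITION>NONEMPTY</CONDITION></STACK>
        request = stack.pop()                      # <TRAVERSE>POP</TRAVERSE>
        buffers["spToPdb"].append(fetch(request))  # <DATA NAME="spToPdb" TYPE="Add">
    return buffers

# Hypothetical stand-in for a wrapper call.
def fake_fetch(request):
    return f"response for {request}"

result = drain_stack(["req_1", "req_2"], fake_fetch)
print(result["spToPdb"])
```

Because a stack is last-in, first-out, the most recently pushed request is served first.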

Final Remark: “Infrastructures”

• All our software is in the public domain, with a sustained tradition of making it truly accessible (XMIPP, BPR…)

Acknowledgements

• The CNB Biocomputing
• L.E. Donate
• Mikel Valle
• Carmen San Martin
• María Gómez
• Yolanda Robledo
• Rafael Núñez
• Yacob
• Monica Chagoyen
• Roberto Marabini
• Alberto Pascual
• Carlos-Oscar Sanchez
• Natalia Jiménez-Lozano
• Javier A. Velázquez-Muriel
• Pedro Carmona
• David Elguero
• Jesus Cuenca

• Extra mural:
• The EBI Team
• Herbert Edelsbrunner
• Wah Chiu’s Lab
• SDSC (Gupta’s Lab)
• Ioannis Kakadiaris’s Lab
• Niels Volkmann
• Gruss and Cheng Lab
• Mark Ellisman Lab

• (and MANY other interactions)
