EGEE-II INFSO-RI-031688

Enabling Grids for E-sciencE

www.eu-egee.org

EGEE and gLite are registered trademarks

“High throughput” protein structure prediction application in EUChinaGRID

G. Minervini, G. La Rocca, P.L. Luisi and F. Polticelli

Dept. of Biology, Univ. Roma Tre, and INFN Catania, Italy

EGEE User Forum – Manchester, 10 May 2007

EUChinaGRID Project

Overview

The EUChinaGRID project promotes integration and interoperability between the European (EGEE) and Chinese (CNGrid) Grid infrastructures.

The goals of EUChinaGRID are:

• Promoting the porting of new applications to the Grid infrastructures

• Training of new user communities

• Supporting the adoption of grid tools for scientific applications

• Validating the intercontinental Grid infrastructure

Biological Applications

The protein folding “problem” and the structural genomics challenge

– The specific sequence in which the 20 natural amino acids are combined in a protein dictates the three-dimensional structure of the entire protein

– Protein function is linked to the specific three-dimensional arrangement of amino acid functional groups

– With the advancement of molecular biology techniques, a huge amount of information on protein sequences has become available, but far less information is available on the structure and function of these proteins

– The “ab initio” prediction of protein structure is a key instrument for better understanding the principles of protein folding and for successfully exploiting the information provided by the “genomic revolution”

THEORETICAL CONTEXT

There may be an entire universe of “Never Born Proteins” (NBP), whose properties have never been sampled by Nature

Contingency theory:

Extant proteins are the result of the simultaneous interplay of several concomitant causes (Gould, 1994).

Determinist theory:

The constituents of life are the result of evolutionary fine-tuning; what we see is the best possible solution for biological needs (de Duve, 1995).

[Figure labels: Natural proteins; Possible-protein space]

SOME BASIC CALCULATIONS WITH A 70 AMINO ACID PEPTIDE

1 × 10^70 POSSIBLE AMINO ACID COMBINATIONS

SYNTHESIS OF ONE MOLECULE OF EACH COMPOUND:

– 1.1 × 10^42 kg OF MATERIAL

– 1.8 × 10^17 TIMES THE WEIGHT OF THE EARTH

With 20 different comonomers, a protein chain of just 70 amino acids can theoretically exist in 20^70 chemically and structurally unique combinations
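The size of the sequence space follows directly from the chain length and can be checked with a few lines of Python. This is an illustrative sketch only: the average residue mass of 110 Da and the Earth mass of 5.972 × 10^24 kg are assumptions not stated on the slide, and the function names are hypothetical. Under these assumptions 20^70 evaluates to roughly 1.2 × 10^91, while the mass figures quoted on the slide are of the order obtained for a 50-residue chain (20^50 ≈ 1.1 × 10^65).

```python
import math

AVOGADRO = 6.02214076e23      # molecules per mole
AVG_RESIDUE_MASS_DA = 110.0   # ASSUMPTION: average mass of one amino acid residue, in Da (g/mol)
EARTH_MASS_KG = 5.972e24      # ASSUMPTION: mass of the Earth, in kg


def sequence_count(length, alphabet=20):
    """Number of distinct sequences of the given length (exact integer)."""
    return alphabet ** length


def library_mass_kg(length):
    """Mass of one molecule of every possible sequence of the given length."""
    molar_mass_kg = length * AVG_RESIDUE_MASS_DA / 1000.0  # kg per mole of peptide
    return sequence_count(length) / AVOGADRO * molar_mass_kg


if __name__ == "__main__":
    for n in (50, 70):
        count = sequence_count(n)
        mass = library_mass_kg(n)
        print(f"n = {n}: about 10^{math.log10(count):.1f} sequences, "
              f"{mass:.2e} kg, {mass / EARTH_MASS_KG:.2e} Earth masses")
```

Whichever chain length is used, the conclusion of the slide is unchanged: synthesising one molecule of every sequence is physically out of reach, so only a small sample of the space can ever be explored.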

The “Never Born Proteins”

Rationale

It seems unlikely that nature has tried all possible combinations; in other words, there exists a large number of NBPs that have never been exploited by biological systems.

The NBPs pose a series of interesting questions for biology and for basic science in general:

– By which criteria have the existing proteins been selected?

– Do natural proteins have peculiar properties (e.g. thermal stability, solubility in water or amino acid composition)?

– Or do they represent just a subset of the possible protein sequences, generated only by the simultaneous action of contingency and physico-chemical forces?

The approach

The problem is tackled by a “high throughput” approach made feasible by the use of the Grid infrastructure.

A library of 10^7 to 10^9 random amino acid sequences of fixed length (n = 70) is generated.

“Ab initio” protein structure prediction software is used to predict the structure of each sequence.

The structural characteristics of the resulting proteins are analysed in terms of:

– Frequency of compact folds and characteristics of the corresponding amino acid sequences

– Occurrence of novel, as yet unknown folds

– Hydrophobicity/hydrophilicity characteristics

– Presence of putative catalytic sites

– Experimental validation of “interesting” cases
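The library generation step can be sketched as follows. This is a minimal illustration, not the authors' generator: it assumes each position is drawn uniformly from the 20 natural amino acids (the slides say only that the sequences are random), and the function names and the `NBP` identifier prefix are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 natural amino acids


def random_sequence(rng, length=70):
    """Draw one sequence, each residue chosen uniformly (ASSUMPTION)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def random_library(size, length=70, seed=None):
    """Generate `size` random sequences; a fixed seed makes the library reproducible."""
    rng = random.Random(seed)
    return [random_sequence(rng, length) for _ in range(size)]


def to_fasta(sequences, prefix="NBP"):
    """Format the library as FASTA text with hypothetical identifiers NBP_000001, ..."""
    records = [f">{prefix}_{i:06d}\n{seq}" for i, seq in enumerate(sequences, start=1)]
    return "\n".join(records) + "\n"


if __name__ == "__main__":
    library = random_library(size=3, seed=42)
    print(to_fasta(library))
```

A full library of 10^7 sequences would be written out in batches rather than held in one list, but the sampling logic stays the same.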

Never born proteins

Software: Rosetta

Developed by David Baker, University of Washington

Based on a “fragment assembly” strategy

Uses a semi-empirical force field to evaluate the thermodynamics of the predicted structure

Particularly successful in the prediction of novel folds in the CASP (Critical Assessment of Structure Prediction) competitions

(http://depts.washington.edu/bakerpg/)

Rosetta Software “ab-initio”

– Based on the assumption that local interactions bias the conformation of sequence fragments, while global interactions select the three-dimensional structure with minimal energy that is compatible with the local biases.

– To define the local sequence-structure relationships, the software uses the Protein Data Bank (www.rcsb.org) to extract the most likely distribution of conformations adopted by short protein segments in experimental structures, taking them as an approximation of the distribution adopted by sequence segments during the folding process.

Method details

Module I - Input generation

– The query sequence is divided into fragments of 3 and 9 amino acids

– Based on the specific sequence of each fragment, the software extracts from the database of protein structures the distribution of three-dimensional structures adopted by these fragments

– For each query sequence, a fragment database is derived that contains all the possible local structures adopted by each fragment of the entire sequence.
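The first step of Module I can be illustrated with a short sketch. It assumes that "divided into fragments" means overlapping windows starting at every position of the query, which the slide does not state explicitly. The function name is hypothetical and the lookup of candidate structures in the structure database is not shown.

```python
def sequence_windows(sequence, size):
    """Return (start_index, fragment) pairs for every window of `size` residues.

    ASSUMPTION: windows overlap and start at each position of the sequence.
    """
    return [(start, sequence[start:start + size])
            for start in range(len(sequence) - size + 1)]


if __name__ == "__main__":
    query = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary example sequence
    for size in (3, 9):
        windows = sequence_windows(query, size)
        print(f"{len(windows)} windows of {size} residues, first: {windows[0][1]}")
```

With overlapping windows, a sequence of n residues yields n - 2 three-residue windows and n - 8 nine-residue windows.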

Module II - Ab initio protein structure prediction

– The sets of fragments are assembled in a large number of different combinations by a Monte Carlo procedure.

– The resulting structures are subjected to an energy minimization procedure

– The principal non-local interactions considered are hydrophobic interactions, electrostatic interactions, main chain hydrogen bonds and excluded volume.

– The structures compatible with both local biases and non-local interactions are ranked according to their total energy resulting from the minimization procedure.
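The idea behind Module II can be sketched with a toy Metropolis Monte Carlo search. This is not Rosetta's algorithm or energy function. The sketch assumes a model is just a list holding one candidate-fragment index per position, uses an invented `toy_energy` in place of the real non-local terms, and all names and parameters are hypothetical. It shows only the pattern described above: random fragment substitutions, energy-based acceptance, and ranking of the resulting models by total energy.

```python
import math
import random


def toy_energy(state):
    """Invented stand-in for the real energy: penalise mismatched neighbours."""
    return sum(abs(a - b) for a, b in zip(state, state[1:]))


def metropolis_assembly(n_positions, n_candidates, energy, steps=2000, kT=1.0, seed=0):
    """Search over fragment choices with the Metropolis acceptance criterion."""
    rng = random.Random(seed)
    state = [rng.randrange(n_candidates) for _ in range(n_positions)]
    current = energy(state)
    best_state, best = list(state), current
    for _ in range(steps):
        pos = rng.randrange(n_positions)
        old = state[pos]
        state[pos] = rng.randrange(n_candidates)  # trial move: swap in another fragment
        trial = energy(state)
        # Accept downhill moves always, uphill moves with Boltzmann probability
        if trial <= current or rng.random() < math.exp(-(trial - current) / kT):
            current = trial
            if current < best:
                best_state, best = list(state), current
        else:
            state[pos] = old  # reject: restore the previous fragment
    return best_state, best


if __name__ == "__main__":
    # Independent runs give different models, which are then ranked by energy
    models = [metropolis_assembly(20, 5, toy_energy, seed=s) for s in range(5)]
    for state, e in sorted(models, key=lambda m: m[1]):
        print(e, state)
```

Running several independent trajectories and sorting the results mirrors the ranking by total energy described in the last bullet.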

Module I

– The procedure for input generation is rather complex but computationally inexpensive (10 min of CPU time on a 3.2 GHz Pentium IV)

– Because module I has many dependencies (BLAST and PSIPRED), input generation is carried out locally with a script that automates the procedure for a large dataset of sequences

– Approximately 500 input datasets are currently being generated weekly

Module II

– Input:

• fragment files generated by module I

• secondary structure prediction using PSIPRED

As output, the user obtains a number of structural models of the query sequence ranked by total energy

A single run with just the lowest energy structure as output takes approx. 10-40 min of CPU time, depending on the degree of refinement of the structure
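A back-of-the-envelope estimate shows why a Grid is needed. The sketch below takes the 10-40 min per run quoted above and the lower bound of 10^7 sequences for the library. It assumes one run per sequence and ignores queueing, data transfer and failed jobs. The number of concurrently available CPUs is a hypothetical figure chosen for illustration. Under these assumptions the total is roughly 190 to 760 CPU-years.

```python
MINUTES_PER_DAY = 60 * 24
MINUTES_PER_YEAR = MINUTES_PER_DAY * 365


def cpu_years(n_sequences, minutes_per_run):
    """Total CPU time in years, assuming one run per sequence."""
    return n_sequences * minutes_per_run / MINUTES_PER_YEAR


def wall_clock_days(n_sequences, minutes_per_run, n_cpus):
    """Elapsed time in days if `n_cpus` runs execute concurrently (ideal scaling)."""
    return n_sequences * minutes_per_run / n_cpus / MINUTES_PER_DAY


if __name__ == "__main__":
    n = 10 ** 7  # lower bound of the library size
    for minutes in (10, 40):
        print(f"{minutes} min/run: {cpu_years(n, minutes):.0f} CPU-years, "
              f"{wall_clock_days(n, minutes, n_cpus=1000):.0f} days on 1000 CPUs")
```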

Integration on the GILDA facility

Single job execution on GILDA

– A single Rosetta abinitio run consists of two phases. In the first phase, an initial model of the protein structure is generated using the fragment libraries. The initial model is then used as input for the second phase, in which the model is refined.

– A shell script has been prepared which:

• registers the program executable and the required input files (fragment libraries and secondary structure prediction file) on the LFC catalog

• calls the Rosetta executable and proceeds with workflow execution.

– A JDL file was created to run the application on the GILDA worker nodes, which use the gLite middleware

Integration on the GENIUS web portal

To make Rosetta abinitio easy to use in the Grid environment for the computational biology community, the application was integrated within the GENIUS portal (https://glite-tutor.ct.infn.it).

After a simple MyProxy server initialization procedure, uploading of the input files and executable, JDL file preparation, application running, run status monitoring and download of the output file are all carried out from the GENIUS web portal.

The Rosetta abinitio output (universal .pdb format) can easily be downloaded to the local machine just by typing download on the GENIUS web interface. For an instant on-line check, a high resolution graphical output of the predicted structure is also provided in .png and .wrl formats.

Step 1. After MyProxy initialization the user connects to the GENIUS portal to set up the parametric JDL, specifying the number of runs (equivalent to the number of amino acid sequences to be simulated) to be carried out.

Step 2. The user specifies the working directory and the name of the shell script.

Step 3. Input files (fragment libraries) are loaded as a single .tar.gz archive per amino acid sequence.

Step 4. Output files (initial and refined model coordinates) are specified in parametric form.

Step 5. The parametric JDL file is generated and displayed for inspection by the user.

Step 6. The parametric job is submitted and its status as well as the status of individual runs of the same job can be checked.
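The kind of parametric JDL assembled in Steps 1 to 5 can be sketched by generating its text from Python. This is an illustration, not the file produced by GENIUS. It uses the gLite parametric-job attributes (`JobType`, `Parameters`, `ParameterStart`, `ParameterStep` and the `_PARAM_` placeholder), but the script name, the file naming patterns and the choice of passing the inputs through the input sandbox are all assumptions.

```python
def parametric_jdl(n_runs, script="rosetta.sh"):
    """Build the text of a parametric JDL for `n_runs` sequences.

    ASSUMPTIONS: `rosetta.sh` and the input/initial/refined file names are
    hypothetical; `_PARAM_` is replaced by the middleware with the run index.
    """
    lines = [
        'JobType = "Parametric";',
        f"Parameters = {n_runs};",   # with start 0 and step 1: indices 0 .. n_runs - 1
        "ParameterStart = 0;",
        "ParameterStep = 1;",
        f'Executable = "{script}";',
        'Arguments = "_PARAM_";',
        'StdOutput = "std__PARAM_.out";',
        'StdError = "std__PARAM_.err";',
        # One .tar.gz of fragment libraries per amino acid sequence (Step 3)
        f'InputSandbox = {{"{script}", "input__PARAM_.tar.gz"}};',
        # Initial and refined model coordinates in parametric form (Step 4)
        'OutputSandbox = {"std__PARAM_.out", "std__PARAM_.err", '
        '"initial__PARAM_.pdb", "refined__PARAM_.pdb"};',
    ]
    return "[\n" + "\n".join("  " + line for line in lines) + "\n]\n"


if __name__ == "__main__":
    print(parametric_jdl(n_runs=10))
```

A single parametric submission of this kind lets the user monitor each run separately, as described in Step 6.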

Graphical representation of the predicted structure, generated in .png format by Molscript and Raster3D.

CONCLUSIONS

We are currently accumulating data on NBP structures

Collecting tools for analysis (structure and function analysis)

Studying the portability of other applications (e.g. function recognition software developed “in house”) to the Grid

Envisioning application of ported tools for structural genomics initiatives on biomedically relevant targets

– Example: prediction of the structure/function of the entire set of proteins of selected viral and microbial pathogens for target selection and in silico drug discovery

Thank you!