STAR: Recombination site prediction

Predicting structural disruption caused by crossover: a machine learning approach Denis C. Bauer Talk CIBCB 2005

Upload: denis-bauer

Post on 01-Jul-2015


DESCRIPTION

The presentation was given at CIBCB 2005 in San Diego and describes our approach to predicting recombination sites in protein sequences. Recombination is the method of choice for designing new proteins with desired new or enhanced properties. The publication is: Bauer, D.C., Bodén, M., Thier, R. and Gillam, E. M. “STAR: Predicting recombination sites from amino acid sequence.” BMC Bioinformatics, 2006 Oct 8; 7:437. PMID: 17026775

TRANSCRIPT

Page 1: STAR: Recombination site prediction

Predicting structural disruption caused by crossover: a machine learning approach

Denis C. Bauer

Talk CIBCB 2005

Page 2: STAR: Recombination site prediction

Outline

• Introduction in Protein Design

• Theory of SCHEMA

• Our Approach

• Results

• Summary

Page 3: STAR: Recombination site prediction

Protein

• Biological Functions

– Proteins are fundamental components of all living cells

• Messenger function (e.g. hormones)

• Catalytic function (e.g. enzymes)

• Regulatory function (e.g. antibodies)

• Protein Design for Industry and Medicine

– Better adjusted

– New function

Introduction

Page 4: STAR: Recombination site prediction

Protein Structure

• Primary Structure

• Secondary Structure

• Tertiary Structure

• Quaternary Structure

Pictures from: Principles of BIOCHEMISTRY, Horton, Moran, Ochs, Rawn, Scrimgeours

Page 5: STAR: Recombination site prediction

Protein Design

• Creating new amino acid sequences

– Huge sequence space: 20^100 possible amino acid sequences (20 amino acids at each of 100 positions)

– Not every possible sequence is stable

Solution: using sequences which already exist
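The size of the sequence space quoted on this slide can be checked with a few lines of Python. This is only an arithmetic illustration; the chain length of 100 is read from the exponent on the slide, and the function name is made up for this sketch.

```python
# Number of standard amino acids that can occupy each position.
ALPHABET_SIZE = 20


def sequence_space_size(length: int) -> int:
    """Return how many distinct amino acid sequences of this length exist."""
    return ALPHABET_SIZE ** length


# Chain length of 100 is an assumption taken from the exponent on the slide.
size = sequence_space_size(100)
print(f"20^100 has {len(str(size))} decimal digits")
```

Exhaustive search over a space of this size is not feasible, which is why the slide turns to sequences that already exist.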

Page 6: STAR: Recombination site prediction

Benefit of Recombination

[Figure: sequence fragments from parent proteins labelled "Better resistant to heat" and "Higher performance" are recombined into hybrids that carry both labels. Organism labels on the slide: "Mayfly" and "Lives where it's hot".]

Problem: how to identify recombination sites?
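The recombination idea on this slide can be sketched as a single crossover between two aligned parent sequences. This is a minimal illustration, not the laboratory procedure: the function name, the toy sequences and the crossover position are all hypothetical.

```python
from typing import Tuple


def crossover(parent_a: str, parent_b: str, site: int) -> Tuple[str, str]:
    """Swap the tails of two aligned, equal-length sequences at `site`."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must be aligned to the same length")
    if not 0 < site < len(parent_a):
        raise ValueError("site must fall strictly inside the sequence")
    hybrid_ab = parent_a[:site] + parent_b[site:]
    hybrid_ba = parent_b[:site] + parent_a[site:]
    return hybrid_ab, hybrid_ba


# Hypothetical toy parents; real parents would be full-length proteins.
heat_stable = "MKIPDELGLI"
fast = "MKIADELGEI"
hybrid_1, hybrid_2 = crossover(heat_stable, fast, 5)
print(hybrid_1, hybrid_2)
```

Every choice of `site` gives a different pair of hybrids, which is why a way of ranking candidate sites is needed.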

Page 7: STAR: Recombination site prediction

SCHEMA

• Research group of Prof. Frances Arnold

• Idea: recombine at positions where the fewest interactions are disrupted

[Figure: SCHEMA profile]
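The idea of counting disrupted interactions can be sketched as follows. This is a simplified illustration under stated assumptions, not the SCHEMA implementation: contacts are assumed to be given as pairs of residue indices taken from a 3D structure, and a contact counts as disrupted when the hybrid's residue pair at those positions occurs in none of the parents. All names and toy data are hypothetical.

```python
def disrupted_contacts(parents, hybrid, contacts):
    """Count contacts whose residue pair in the hybrid occurs in no parent."""
    broken = 0
    for i, j in contacts:
        # Residue pairs the parents already show at this contact.
        parental_pairs = {(p[i], p[j]) for p in parents}
        if (hybrid[i], hybrid[j]) not in parental_pairs:
            broken += 1
    return broken


# Hypothetical toy data: two aligned parents and three structural contacts.
parent_a = "ACDEFG"
parent_b = "AKDQFW"
contacts = [(0, 5), (1, 3), (2, 4)]

# Hybrid made by a crossover after position 3.
hybrid = parent_a[:3] + parent_b[3:]
print(disrupted_contacts([parent_a, parent_b], hybrid, contacts))
```

In this toy case only the contact between positions 1 and 3 joins residues that never occur together in a parent, so a crossover elsewhere could score better.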

Page 8: STAR: Recombination site prediction

Limitations

• 3D structure necessary

– Problem: hard to derive for some proteins

• time consuming

• expensive

Solution: Disengaging from 3D structure

Page 9: STAR: Recombination site prediction

Our approach

Page 10: STAR: Recombination site prediction

Alternative to SCHEMA

SCHEMA: 3D Structure Information → Schema Alg → Schema Score

Alternative: Sequence → Predicting → Schema Score

[Figure: SCHEMA score (y-axis, 0 to 0.7) against residues (x-axis, 0 to 500) for protein 1A31A, shown twice]

Benefit: All proteins can be processed

Page 11: STAR: Recombination site prediction

Predicting Schema-Profile

Sequence → Predictive Model → Predicted Schema Score

Model

• Feed Forward Neural Network

• Bidirectional Recurrent Network

• Support Vector Regression

*

[Figure: SCHEMA score (y-axis, 0 to 0.7) against residues (x-axis, 0 to 500) for protein 1A31A, shown twice]

* Bodén, M., Yuan, Z. and Bailey, T. L. Prediction of protein continuum secondary structure with probabilistic models. submitted
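Models such as a feed-forward network or support vector regression need a fixed-length numeric input for each residue. One common way to get that from a sequence is a sliding window of one-hot encoded residues, sketched below. The slides do not state the encoding or the window size, so both (and every name here) are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def one_hot(residue):
    """Encode one residue as a 20-element indicator vector."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def window_features(sequence, half_width=3):
    """Build one feature vector per residue from its sequence neighbourhood.

    Positions that fall outside the sequence are encoded as all zeros.
    half_width=3 (a window of 7 residues) is an arbitrary choice.
    """
    padding = [0.0] * len(AMINO_ACIDS)
    features = []
    for centre in range(len(sequence)):
        row = []
        for pos in range(centre - half_width, centre + half_width + 1):
            if 0 <= pos < len(sequence):
                row.extend(one_hot(sequence[pos]))
            else:
                row.extend(padding)
        features.append(row)
    return features


X = window_features("MKIPDELGLIFKFEAPGRVTR")
print(len(X), len(X[0]))
```

Each row of `X` would then be paired with the SCHEMA score of the central residue as the regression target.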

Page 12: STAR: Recombination site prediction

Results

Method     r      devA
FFNN       0.86   0.57
BRNN       0.88   0.52
SVR eps    0.82   0.63
SVR nu     0.83   0.62

Table 1: Results for all approaches. r = correlation coefficient (ideally 1), devA = Root Mean Square Error (RMSE) normalized by the standard deviation (ideally 0).
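The two measures in Table 1 can be computed as in the sketch below. It assumes that r is the Pearson correlation and that devA divides the RMSE by the population standard deviation of the observed scores; the slides do not say which standard deviation is used. The profile values are made up.

```python
import math
from statistics import mean, pstdev


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between two equal-length profiles."""
    mo, mp = mean(observed), mean(predicted)
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    so = math.sqrt(sum((o - mo) ** 2 for o in observed))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    return cov / (so * sp)


def dev_a(observed, predicted):
    """RMSE divided by the standard deviation of the observed values.

    Using the population standard deviation of `observed` is an assumption.
    """
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    rmse = math.sqrt(mean(squared_errors))
    return rmse / pstdev(observed)


# Hypothetical SCHEMA profile and prediction for five residues.
observed = [0.10, 0.30, 0.50, 0.20, 0.40]
predicted = [0.15, 0.25, 0.45, 0.25, 0.35]
print(pearson_r(observed, predicted), dev_a(observed, predicted))
```

Under this definition, a model that always predicts the mean observed score gets devA = 1, so values below 1 indicate that the model explains part of the variation.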

Page 13: STAR: Recombination site prediction

Results

[Figure: three panels (1A4U-A, 1BKP-B, 1AJZ_) plotting score (y-axis, 0 to 0.8) against sequence position (x-axis, 0 to 300)]

Page 14: STAR: Recombination site prediction

Results


Page 15: STAR: Recombination site prediction

Refinements

Input features → Predicting Model (ML model) → Predicted Schema Score

• Contact Numbers

• Solvent Accessibility Score

• Ensemble

CC: 0.88, 0.88, 0.6, 0.88 (slide layout not preserved)

[Figure: SCHEMA score (y-axis, 0 to 0.7) against residues (x-axis, 0 to 500) for protein 1A31A]
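The slide lists an ensemble of ML models as one refinement but does not say how the models are combined. A minimal sketch, assuming the simplest option of averaging the predicted profiles position by position (function name and values are hypothetical):

```python
def ensemble_average(profiles):
    """Average several predicted profiles position by position."""
    if not profiles:
        raise ValueError("at least one profile is required")
    length = len(profiles[0])
    if any(len(profile) != length for profile in profiles):
        raise ValueError("all profiles must have the same length")
    return [sum(values) / len(profiles) for values in zip(*profiles)]


# Hypothetical predictions from three models for a four-residue stretch.
model_outputs = [
    [0.20, 0.40, 0.60, 0.10],
    [0.30, 0.50, 0.50, 0.20],
    [0.10, 0.30, 0.70, 0.30],
]
combined = ensemble_average(model_outputs)
print(combined)
```

Averaging is only one way to combine models; weighted schemes or a second-level model would fit the same diagram.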

Page 16: STAR: Recombination site prediction

However…

• Only a limited number of connections are considered

• Broken connections are reconnected after recombination

Page 17: STAR: Recombination site prediction

Summary

• Design proteins with recombination rather than from scratch

– Identify recombination site

– Idea: finding the sites where the fewest interactions are disrupted (SCHEMA)

• Predicting SCHEMA score to overcome the limitation (3D structure required)

• SCHEMA too limited to be the only means for recombination site prediction

• Future work

– All interactions

– Actual recombination process

Page 18: STAR: Recombination site prediction

Acknowledgments

• Supervisors Dr. Mikael Bodén and Dr. Ricarda Thier

• Dr. Zheng Yuan

• Prof. Frances Arnold’s research group

Page 19: STAR: Recombination site prediction

Thank you

Ref:

C. A. Voigt, C. Martinez, Z.-G. Wang, S. L. Mayo, and F. H. Arnold. Protein building blocks preserved by recombination. Nat Struct Biol, vol. 9, no. 7, pp. 553-558, Jul 2002.

M. M. Meyer, J. J. Silberg, C. A. Voigt, J. B. Endelman, S. L. Mayo, Z.-G. Wang, and F. H. Arnold. Library analysis of SCHEMA-guided protein recombination. Protein Sci, vol. 12, no. 8, pp. 1686-1693, Aug 2003.

M. Bodén, Z. Yuan, and T. L. Bailey. Prediction of protein continuum secondary structure with probabilistic models. Submitted.

Page 20: STAR: Recombination site prediction

PDB 1zg4

Page 21: STAR: Recombination site prediction

Recombination Site Identification

• Recombination vs mutagenesis or design from scratch

– Higher fraction of functional proteins

– Higher diversity, so a higher chance to find a better hybrid

• Requirement

– Identify recombination site

– Identify which segments are useful

– Identify beneficial segment combinations

• Existing methods

– SCHEMA (hybrid evaluation: avoid breaking connections)

– FamClash (hybrid evaluation: avoid changing properties of residue pairs)

– STAR (site suggestion according to structural compactness)

• Known methods too limited to be a good means for recombination site prediction

http://www.che.caltech.edu/groups/fha/

Page 22: STAR: Recombination site prediction

Possible approaches

• Identify a new measure for evaluating hybrids (derived from datasets of biologically produced hybrids)

• Include more information in the decision process

– Sequence/Structure (SCHEMA)

– Chemical features (FamClash)

– Predicting important residues for structure and/or function

– Predicting enzyme function from protein sequence

– Substitution tolerance

– Hydrophobic patterning

– Surface clefts or binding sites

– Solvent accessibility

– Domains/motifs of parents