proteins and their 3 d structure - bioinformatics laboratory · sms and protein dossier –drug...

Goran Neshich, http://www.cbi.cnptia.embrapa.br

Proteins and their 3D Structure

Goran Neshich

Embrapa Informática Agropecuária

Cidade Universitária - UNICAMP

Campinas, SP

Structural BioInformatics Laboratory: SBI

•Gene Annotation

•Gene Comparison

•Structure Descriptors

•Function Descriptors

•Gene Expression Networks

•Proteomics

Sequence

Lexical

Structure

Syntactic

Function

Semantic

Microarray

Analysing

Pragmatic

http://www.cbi.cnptia.embrapa.br

Bringing Genome Into Three Dimensions

Old protein map

Parallels that help us to see the problem better

Structure/function descriptors in JPD

Data/information deluge

flavors of Bioinformatics

Data library – 2003 (February)

23,950,735 nucleotide sequences, 37,486,732,136 bp

112 published complete genomes

590 genomes in progress

830,525 protein sequences

20,417 protein structures

5,300 Plasmodium falciparum genes, 23,000,000 bp

35,000 genes in Homo sapiens, 3,164,000,000 bp

27,936 genes in Xylella fastidiosa, 2,519,802 bases, 2,775 proteins

10,000,000 publications in PubMed/MEDLINE

Data library – 2003 (October)

29,189,427 nucleotide sequences (~40 × 10^9 bp)

Published complete genomes:

Viruses: 1421; Archaea: 16; Bacteria: 135;

Eukaryotes: 9 + 4 vertebrates + 7 plants

590 genomes in progress

1,139,154 protein sequences

22,700 protein structures (PDB)

480 genes in Mycoplasma genitalium: 580,000 bp

35,000 genes in Homo sapiens (3.164 × 10^9 bp)

27,936 genes in Xylella fastidiosa, 2,519,802 bases

>10,000,000 publications in PubMed/MEDLINE

The “high throughputs...”

Structural Bioinformatics

Ancient Chinese

Babylonian

Egyptian

Modern Arabic

MCMLVI

Where do we work?

Structure descriptors

Annotation

Genome sequencing

Structural genomics

Protein-ligand interaction (matching DB)

Mutational and dynamic studies

Docking

Structural DB

Structure-function

Book of life

Search for new effectors

Drug discovery

SMS and Protein Dossier – Drug Target DB

Final goal: complement Genome Track

Small molecules

Database

Fingerprint

Local PDB files

Fingerprint

Complete Genome

Sequence

Homology Modeling

Protein/Ligand interaction

(matching DB)

Mutational and dynamic studies

Protein-binding site 2-D information (for search)

2D Contour map surface matching

Ligand-binding site 2-D information (for search)

1. Sequence similarity search

2. Sequence alignments

3. Structure alignment

4. Secondary structure prediction

5. Structure modeling (homology modeling)

6. Structure prediction (threading)

7. Characterization of structure

8. Relationship: sequence-structure-function

9. Function modifiers

10. Compiling the list of pairs: structure and its function modifier

Sequence similarity search

Sequence alignment

AKWHGGAFWPPH

WAAGAHWPHAQD


How well can function be inherited from similar sequences?

Functional Genomics Milestone:

From sequence to function: desires and problems

Data/information deluge

flavors of Bioinformatics

1. Genomic sequencing

2. Protein crystallization

3. Synchrotron crystallography

4. NMR

5. Mass spectrometry

6. Mutagenesis experiments

7. Screening

8. Chemical synthesis

High-throughput methods help increase picture resolution:

1. What do we get?

2. A big puzzle with a great many pieces!

High-throughput methods help increase picture resolution:

• Transcriptomics involves large-scale analysis of messenger RNAs (molecules that are transcribed from active genes) to follow when, where, and under what conditions genes are expressed.

• Proteomics, the study of protein expression and function, can bring researchers closer than gene expression studies to what’s actually happening in the cell.

• Structural genomics initiatives are being launched worldwide to generate the 3-D structures of one or more proteins from each protein family, thus offering clues to function and biological targets for drug design.

• Knockout studies are one experimental method for understanding the function of DNA sequences and the proteins they encode. Researchers inactivate genes in living organisms and monitor any changes that could reveal the function of specific genes.

• Comparative genomics, the side-by-side analysis of DNA sequence patterns of humans and well-studied model organisms, has become one of the most powerful strategies for identifying human genes and interpreting their function.

Next Step in Genomics

From gene to functional protein

Sequence alignment

Scoring Matrices

     T  G  A  C
T    1  0  0  0
G    0  1  0  0
A    0  0  1  0
C    0  0  0  1

For DNA/RNA: match = 1, mismatch = 0

Instead of marking points at match/mismatch, we may use a “scoring matrix”.

The “dotplot” is now converted into a grid of numbers, and the best alignment corresponds to the diagonal with the greatest numerical value.
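The idea of a numeric dotplot can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the two short test sequences are hypothetical, and the scoring is the match = 1, mismatch = 0 scheme given for DNA/RNA. Each diagonal of the grid corresponds to one ungapped shift of one sequence against the other, so summing along diagonals finds the best shift.

```python
def dna_score(x, y):
    """Identity scoring for DNA/RNA: match = 1, mismatch = 0."""
    return 1 if x == y else 0


def score_grid(a, b):
    """Dotplot as numbers: rows follow sequence a, columns follow sequence b."""
    return [[dna_score(x, y) for y in b] for x in a]


def best_diagonal(a, b):
    """Return (offset, total) for the diagonal with the greatest summed score.

    offset = column index - row index, i.e. how far a is shifted along b.
    """
    totals = {}
    for i, row in enumerate(score_grid(a, b)):
        for j, s in enumerate(row):
            totals[j - i] = totals.get(j - i, 0) + s
    offset = max(totals, key=totals.get)
    return offset, totals[offset]


# Hypothetical toy sequences: "ACGT" sits two positions into "GGACGT".
print(best_diagonal("ACGT", "GGACGT"))
```

Summing whole diagonals only covers alignments without gaps; it is the simplest reading of "the diagonal with the greatest numerical value".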

A R N D C Q E G H I L K M F P S T W Y V

A 5 -2 -1 -2 -1 -1 -1 0 -2 -1 -2 -1 -1 -3 -1 1 0 -3 -2 0

R -2 7 -1 -2 -4 1 0 -3 0 -4 -3 3 -2 -3 -3 -1 -1 -3 -1 -3

N -1 -1 7 2 -2 0 0 0 1 -3 -4 0 -2 -4 -2 1 0 -4 -2 -3

D -2 -2 2 8 -4 0 2 -1 -1 -4 -4 -1 -4 -5 -1 0 -1 -5 -3 -4

C -1 -4 -2 -4 13 -3 -3 -3 -3 -2 -2 -3 -2 -2 -4 -1 -1 -5 -3 -4

Q -1 1 0 0 -3 7 2 -2 1 -3 -2 2 0 -4 -1 0 -1 -1 -1 -3

E -1 0 0 2 -3 2 6 -3 0 -4 -3 1 -2 -3 -1 -1 -1 -3 -2 -3

G 0 -3 0 -1 -3 -2 -3 8 -2 -4 -4 -2 -3 -4 -2 0 -2 -3 -3 -4

H -2 0 1 -1 -3 1 0 -2 10 -4 -3 0 -1 -1 -2 -1 -2 -3 2 -4

I -1 -4 -3 -4 -2 -3 -4 -4 -4 5 2 -3 2 0 -3 -3 -1 -3 -1 4

L -2 -3 -4 -4 -2 -2 -3 -4 -3 2 5 -3 3 1 -4 -3 -1 -2 -1 1

K -1 3 0 -1 -3 2 1 -2 0 -3 -3 6 -2 -4 -1 0 -1 -3 -2 -3

M -1 -2 -2 -4 -2 0 -2 -3 -1 2 3 -2 7 0 -3 -2 -1 -1 0 1

F -3 -3 -4 -5 -2 -4 -3 -4 -1 0 1 -4 0 8 -4 -3 -2 1 4 -1

P -1 -3 -2 -1 -4 -1 -1 -2 -2 -3 -4 -1 -3 -4 10 -1 -1 -4 -3 -3

S 1 -1 1 0 -1 0 -1 0 -1 -3 -3 0 -2 -3 -1 5 2 -4 -2 -2

T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 2 5 -3 -2 0

W -3 -3 -4 -5 -5 -1 -3 -3 -3 -3 -2 -3 -1 1 -4 -4 -3 15 2 -3

Y -2 -1 -2 -3 -3 -1 -2 -3 2 -1 -1 -2 0 4 -3 -2 -2 2 8 -1

V 0 -3 -3 -4 -1 -3 -3 -4 -4 4 1 -3 1 -1 -3 -2 0 -3 -1 5


Dotplot with scores

Two aligned proteins produce a “score dotplot” from which one can calculate the optimal alignment

     H   E   A   G   A   W   G   H   E   E
P   -2  -1  -1  -2  -1  -4  -2  -2  -1  -1
A   -2  -1   5   0   5  -3   0  -2  -1  -1
W   -3  -3  -3  -3  -3  15  -3  -3  -3  -3
H   10   0  -2  -2  -2  -3  -2  10   0   0
E    0   6  -1  -3  -1  -3  -3   0   6   6
A   -2  -1   5   0   5  -3   0  -2  -1  -1
E    0   6  -1  -3  -1  -3  -3   0   6   6
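The score dotplot can be rebuilt from the substitution matrix by looking up one score per residue pair. The sketch below is a minimal illustration: the names `PAIR_SCORES`, `pair_score` and `score_dotplot` are hypothetical, and the dictionary holds only the entries of the 20 × 20 matrix printed earlier in this section that are needed for the two example sequences (residues A, E, G, H, P, W).

```python
# Pair scores copied from the substitution matrix in this section,
# restricted to the residues that occur in the two example sequences.
# Keys are alphabetically sorted pairs because the matrix is symmetric.
PAIR_SCORES = {
    ("A", "A"): 5, ("A", "E"): -1, ("A", "G"): 0, ("A", "H"): -2,
    ("A", "P"): -1, ("A", "W"): -3,
    ("E", "E"): 6, ("E", "G"): -3, ("E", "H"): 0, ("E", "P"): -1,
    ("E", "W"): -3,
    ("G", "G"): 8, ("G", "H"): -2, ("G", "P"): -2, ("G", "W"): -3,
    ("H", "H"): 10, ("H", "P"): -2, ("H", "W"): -3,
    ("P", "P"): 10, ("P", "W"): -4,
    ("W", "W"): 15,
}


def pair_score(x, y):
    """Look up the score of a residue pair regardless of order."""
    return PAIR_SCORES[tuple(sorted((x, y)))]


def score_dotplot(rows, cols):
    """One row per residue of `rows`, one column per residue of `cols`."""
    return [[pair_score(r, c) for c in cols] for r in rows]


grid = score_dotplot("PAWHEAE", "HEAGAWGHEE")
for residue, row in zip("PAWHEAE", grid):
    print(residue, " ".join(f"{s:3d}" for s in row))
```

A full implementation would load the complete matrix rather than a hand-copied subset; the subset keeps the example short and easy to check against the table.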

Simple alignment

Graphical presentation of alignment

Sliding the shorter sequence along the longer one gives only scattered matches at most offsets; at the correct offset all 29 positions match:

CGCTTCGGACGAAATCGCATCAGCATACGATCGCATGCCGGGCGGGATAAC
         |||||||||||||||||||||||||||||
         CGAAATCGCATCAGCATACGATCGCATGC

Alignment with “gaps”

Simple alignment does not always work. When the shorter sequence carries an extra base, no single offset matches it along its whole length; only the part before or the part after the insertion lines up:

CGAAATCGCATCACGCATACGATCGCATGC

In many cases where two sequences do not “coincide/align” perfectly, it is necessary to introduce “gaps”.

CGCTTCGGACGAAATCGCATCA-GCATACGATCGCATGCCGGGCGGGATAA
         ||||||||||||| ||||||||||||||||
         CGAAATCGCATCACGCATACGATCGCATGC
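One common way to place gaps automatically is dynamic programming over the score grid (Needleman-Wunsch-style global alignment). The slides do not name a specific algorithm, so treat this as a sketch of one standard technique, not the method used by the author. The function name `global_align` and the gap penalty of -1 are assumptions; match = 1 and mismatch = 0 follow the DNA scoring given earlier. The test strings are fragments taken from the two sequences above, around the inserted base.

```python
def global_align(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment by dynamic programming; returns (score, a_aln, b_aln)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Fragments around the inserted base in the example above.
total, top, bottom = global_align("CGCATCAGCATACG", "CGCATCACGCATACG")
print(total)
print(top)
print(bottom)
```

Global alignment penalises the unmatched ends of a longer sequence, which is why the example uses equal-context fragments; aligning a short sequence inside a much longer one would normally use a local or end-gap-free variant.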

Structure elements

STING Millennium Suite: Analysing structure of proteins and their complexes

What do we know about structure and its relationship with function?

What are the building blocks of microfactories, better known as PROTEINS?

What is the structural hierarchy in proteins?

Secondary structure elements:

Peptide bond and other types of “intimate” amino acid contacts

“Proper” structural parameters: dihedral angles and Ramachandran plot

Types of “intimate” amino acid contacts: Hydrogen Bonds

Diamond STING Suite

Alpha Helix

Analysing structure of proteins and their complexes - SMS way

Ramachandran Plot
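A Ramachandran plot places each residue at its pair of backbone dihedral angles (phi, psi). Each angle is a torsion defined by four consecutive backbone atoms: C(i-1), N, CA, C for phi and N, CA, C, N(i+1) for psi. The sketch below shows one standard way to compute a torsion angle from four 3-D points; the function name and the test coordinates are hypothetical and are not taken from any real structure.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees about the p1-p2 axis, in the range -180..180."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)          # unit vector along the central bond
    # Components of b0 and b2 perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Made-up points: the outer atoms lie on the same side of the central bond (cis).
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))
```

Applied to every residue of a structure, the resulting (phi, psi) pairs are what a Ramachandran plot displays.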

Collagen Helix

1. antiparallel

C. Pleated sheet structure

Extended sheet - antiparallel

Extended sheet - parallel

Beta Turn

Type II turn

Protein types

1. myoglobin

2. flavodoxin

3. immunoglobulin IgG: domain CH2

Structural Proteins

1. 3-D presentation

2. Front view

C. Silk fiber

α-helix

superhelix

1. protofilament

A. α-Keratin

1.5 nm

1. Triple Helix

2. Typical Sequence

3. Triple Helix (view from above)

Collagen

1. monomer: cartoon

2. monomer: van der Waals presentation

C. Tertiary structure

1. dimer

2. complex Zn2+ hexamer

D. Quaternary Structure

Globular Proteins

Membrane protein secondary structure prediction

Integral membrane proteins

Cytoplasmic side

External protein

phospholipid

glycoprotein

glycolipid

Extracellular cell

Table 2. Hydrophobicity scales by Kyte & Doolittle (1982) (K-D) and by Goldman, Engelman & Steitz (Engelman et al., 1986) (GES).

Residue    K-D    GES
Ile        4.5    3.1
Val        4.2    2.6
Leu        3.8    2.8
Phe        2.8    3.7
Cys        2.5    2.0
Met        1.9    3.4
Ala        1.8    1.6
Tyr       -1.3   -0.7
Gly       -0.4    1.0
Thr       -0.7    1.2
Ser       -0.8    0.6
Trp       -0.9    1.9
Pro       -1.6   -0.2
His       -3.2   -3.0
Asp       -3.5   -9.2
Glu       -3.5   -8.2
Asn       -3.5   -4.8
Gln       -3.5   -4.1
Lys       -3.9   -8.8
Arg       -4.5  -12.3
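A hydrophobicity scale is typically used by averaging the values over a sliding window along the sequence; long stretches with a high average are candidate membrane-spanning helices. The sketch below uses the K-D column of Table 2. The function names are hypothetical, and the window of 19 residues and threshold of 1.6 are commonly quoted defaults for the Kyte-Doolittle scale, used here as assumptions and not as values given in the slides. The test sequence is artificial.

```python
# Kyte-Doolittle values from Table 2, keyed by one-letter residue code.
# Tyr is -1.3 in the published Kyte-Doolittle scale.
KD = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "Y": -1.3, "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "P": -1.6,
    "H": -3.2, "D": -3.5, "E": -3.5, "N": -3.5, "Q": -3.5, "K": -3.9,
    "R": -4.5,
}


def hydropathy_profile(seq, window=19):
    """Mean K-D hydropathy of every full window along the sequence."""
    return [sum(KD[r] for r in seq[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


def hydrophobic_windows(seq, window=19, threshold=1.6):
    """Start positions of windows whose mean hydropathy exceeds the threshold."""
    return [i for i, v in enumerate(hydropathy_profile(seq, window))
            if v > threshold]


# Artificial sequence: a 19-residue leucine stretch flanked by charged residues.
toy = "D" * 10 + "L" * 19 + "K" * 10
print(hydrophobic_windows(toy))
```

The same profile can be computed with the GES column by swapping the dictionary; the two scales rank residues differently, so the flagged windows may differ.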

Structure modelling

Sequence-based fold recognition

Probably non-globular protein

As yet unobserved folds

Full threading methods

Figure 5. Hypothetical applicability of different categories of fold-recognition methods to the open reading frames of small bacterial genomes. At present, sequence-based fold recognition (e.g. GenTHREADER) is successful for around 50% of the ORFs. Structures of a further 15% of ORFs can probably be assigned by full threading methods such as THREADER, and the remaining 35% cannot currently be recognized, either because the fold has not yet been observed or because the ORF encodes a non-globular protein (e.g. a transmembrane protein).

Unannotated regions

PDB match region

Transmembrane or low complexity region

Pie chart of structural assignments to the proteome of the bacterium Mycoplasma genitalium. Almost half of the amino acids (49%) in the Mycoplasma genitalium proteins have a structural annotation. In this case, the structural annotation was taken from the SUPERFAMILY database (version 1.59, September 2002), described in Section 11.3.2. Roughly one fifth of the proteome is predicted to be a transmembrane helix or low complexity region by the relevant computer programs. The remaining 30% of the proteome is unassigned.

Structure alignment

Function modifiers: drugs

Molecular Geometry and 3D Matching

•PDB format

•Molecular surface definitions

•Pockets and cavities

•Fingerprints

•Matching

•Docking


Hundreds of targets

millions of compounds

Now…..

Leaving Surface: below the hood...

Intermediary sequence - problem solved!

AKWHGGAFWPPH

WAAGAHWPHAQD

ARWHGGWPHAQE
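The three sequences above illustrate why an intermediary helps: the first two share almost nothing directly, but each resembles the third. A minimal check, assuming a plain position-by-position comparison without gaps (a simplification; real searches use gapped alignments and substitution scores), and with the hypothetical helper name `identities`:

```python
def identities(a, b):
    """Count positions where two equal-length sequences carry the same residue."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length for an ungapped comparison")
    return sum(x == y for x, y in zip(a, b))


first = "AKWHGGAFWPPH"
second = "WAAGAHWPHAQD"
intermediary = "ARWHGGWPHAQE"

print("first vs second:       ", identities(first, second))
print("first vs intermediary: ", identities(first, intermediary))
print("second vs intermediary:", identities(second, intermediary))
```

The intermediary matches the start of the first sequence and the end of the second, which is how it links two sequences that show no direct ungapped identity.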
