bioinformatics t7-protein structure-v2013_wim_vancriekinge
FBW, 19-11-2013
Wim Van Criekinge
The reason for “bioinformatics” to exist ?
• empirical finding: if two biological sequences are sufficiently similar, almost invariably they have similar biological functions and will be descended from a common ancestor.
• (i) function is encoded into sequence, this means: the sequence provides the syntax and
• (ii) there is a redundancy in the encoding, many positions in the sequence may be changed without perceptible changes in the function, thus the semantics of the encoding is robust.
Protein Structure
• Introduction
• Why?
• How do proteins fold?
• Levels of protein structure: 0, 1, 2, 3, 4
• X-ray / NMR
• The Protein Database (PDB)
• Protein Modeling
• Bioinformatics & Proteomics Weblems
• Proteins perform a variety of cellular tasks in the living cells
• Each protein adopts a particular folding that determines its function
• The 3D structure of a protein can bring into close proximity residues that are far apart in the amino acid sequence
• Catalytic site: Business End of the molecule
Why protein structure ?
Rationale for understanding protein structure and function
Protein sequence
– large numbers of sequences, including whole genomes
Protein structure
– three dimensional
– complicated
– mediates function
Protein function
– rational drug design and treatment of disease
– protein and genetic engineering
– build networks to model cellular pathways
– study organismal function and evolution
Routes linking sequence, structure and function (diagram labels): structure determination, structure prediction, homology, rational mutagenesis, biochemical analysis, model studies
About the use of protein models (Peitsch)
• Structure is preserved under evolution when sequence is not
– Interpreting the impact of mutations/SNPs and conserved residues on protein function. Potential link to disease
• Function?
– Biochemical: the chemical interactions occurring in a protein
– Biological: role within the cell
– Phenotypic: the role in the organism
• Gene Ontology functional classification!
– Prioritisation of residues to mutate to determine protein function
– Providing hints for protein function: catalytic mechanisms of enzymes often require key residues to be close together in 3D space
– (protein-ligand complexes, rational drug design, putative interaction interfaces)
MIS-SENSE MUTATION, e.g. Sickle Cell Anaemia
Cause: defective haemoglobin due to mutation in β-globin gene
Symptoms: severe anaemia and death in homozygote
Normal β-globin (146 amino acids):
val - his - leu - thr - pro - glu - glu - ...
 1     2     3     4     5     6     7
Codon for amino acid 6:
           Normal gene   Mutant gene
DNA        CTC           CAC
mRNA       GAG           GUG
Product    Glu           Val
Mutant β-globin:
val - his - leu - thr - pro - val - glu - ...
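The codon table above can be traced in a few lines of Python. This is only a minimal sketch: the two-entry codon dictionary and the function names are illustrative assumptions, and the DNA codon is treated as the template strand, complemented base by base exactly as the slide writes it.

```python
# Complement rules for turning a DNA template codon into an mRNA codon.
DNA_TO_MRNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

# Only the two codons shown on the slide; the full genetic code has 64 entries.
CODON_TO_AA = {"GAG": "Glu", "GUG": "Val"}

def transcribe(template_codon):
    """Complement each base of the template codon (orientation as on the slide)."""
    return "".join(DNA_TO_MRNA[base] for base in template_codon)

def translate(mrna_codon):
    """Look up the amino acid for one mRNA codon."""
    return CODON_TO_AA[mrna_codon]

normal = translate(transcribe("CTC"))   # normal gene, amino acid 6
mutant = translate(transcribe("CAC"))   # sickle cell mutation
print(normal, "->", mutant)
```

A single base change in the DNA (T to A) is enough to swap a charged glutamate for a hydrophobic valine.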
Protein Conformation
• Christian Anfinsen: studies on reversible denaturation, “Sequence specifies conformation”
• Chaperones and disulfide interchange enzymes: involved but not controlling the final state; they provide an environment to refold if misfolded
• Structure implies function: the amino acid sequence encodes the protein’s structural information by itself:
– Anfinsen had developed what he called his "thermodynamic hypothesis" of protein folding to explain the native conformation of amino acid structures. He theorized that the native or natural conformation occurs because this particular shape is thermodynamically the most stable in the intracellular environment. That is, it takes this shape as a result of the constraints of the peptide bonds as modified by the other chemical and physical properties of the amino acids.
– To test this hypothesis, Anfinsen unfolded the RNase enzyme under extreme chemical conditions and observed that the enzyme's amino acid structure refolded spontaneously back into its original form when he returned the chemical environment to natural cellular conditions.
– "The native conformation is determined by the totality of interatomic interactions and hence by the amino acid sequence, in a given environment."
How does a protein fold ?
• Proteins are linear heteropolymers: one or more polypeptide chains
• Below about 40 residues the term peptide is frequently used.
• A certain number of residues is necessary to perform a particular biochemical function, and around 40-50 residues appears to be the lower limit for a functional domain size.
• Protein sizes range from this lower limit to several hundred residues in multi-functional proteins.
• Three-dimensional shapes (folds) adopted vary enormously
• Experimental methods:
– X-ray crystallography
– NMR (nuclear magnetic resonance)
– Electron microscopy
– Ab initio calculations …
The Basics
• Zeroth: amino acid composition (proteomics, %cysteine, %glycine)
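The zeroth level can be computed directly from a sequence. The sketch below is illustrative only: the function name and the example peptide are hypothetical, and the sequence is assumed to use one-letter amino acid codes.

```python
from collections import Counter

def composition(sequence):
    """Return {residue: percentage} for a one-letter protein sequence."""
    seq = sequence.upper()
    if not seq:
        return {}
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}

example = "MKCGGACGLV"  # hypothetical 10-residue peptide
comp = composition(example)
print(f"%Cys = {comp.get('C', 0.0):.1f}, %Gly = {comp.get('G', 0.0):.1f}")
```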
Levels of protein structure
The basic structure of an α-amino acid is quite simple. R denotes any one of the 20 possible side chains (see table below). We notice that the Cα-atom has 4 different ligands (the H is omitted in the drawing) and is thus chiral. An easy trick to remember the correct L-form is the CORN-rule: when the Cα-atom is viewed with the H in front, the groups read "CO-R-N" in a clockwise direction.
Amino Acid Residues
• Primary: this is simply the order of covalent linkages along the polypeptide chain, i.e. the sequence itself
Levels of protein structure
Backbone Torsion Angles
• Secondary
– Local organization of the protein backbone: alpha-helix, beta-strand (which assemble into beta-sheets), turn and interconnecting loop.
Levels of protein structure
Ramachandran / Phi-Psi Plot
The alpha-helix
• Residues with hydrophobic properties conserved at i, i+2, i+4, separated by unconserved or hydrophilic residues, suggest a surface beta-strand.
• A short run of hydrophobic amino acids (4 residues) suggests a buried beta-strand.
• Pairs of conserved hydrophobic amino acids separated by pairs of unconserved or hydrophilic residues suggest an alpha-helix with one face packing in the protein core. Likewise an i, i+3, i+4, i+7 pattern of conserved hydrophobic residues.
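These rules of thumb can be sketched as a simple pattern scan. The sketch simplifies in two ways: it works on a single sequence, whereas the rules above refer to conservation across an alignment, and the hydrophobic residue set is an assumption. The pattern strings, in which H is hydrophobic, P is not hydrophobic and x is anything, are an illustrative encoding and do not come from the slides.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumption: one common choice of hydrophobic residues

PATTERNS = {
    "surface beta-strand": "HPHPH",          # i, i+2, i+4 hydrophobic, polar in between
    "buried beta-strand": "HHHH",            # short hydrophobic run
    "amphipathic alpha-helix": "HxxHHxxH",   # i, i+3, i+4, i+7 hydrophobic
}

def matches_at(seq, start, pattern):
    """True if the pattern fits the sequence starting at index start."""
    for offset, symbol in enumerate(pattern):
        is_hydrophobic = seq[start + offset] in HYDROPHOBIC
        if symbol == "H" and not is_hydrophobic:
            return False
        if symbol == "P" and is_hydrophobic:
            return False
    return True

def find_pattern(seq, pattern):
    """Return all start indices where the pattern matches."""
    last_start = len(seq) - len(pattern)
    return [i for i in range(last_start + 1) if matches_at(seq, i, pattern)]

for name, pattern in PATTERNS.items():
    print(name, find_pattern("LKELLKKLGSLSVTI", pattern))
```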
A Practical Approach: Interpretation
Beta-sheets
Topologies of Beta-sheets
Secondary structure prediction ?
• Chou, P.Y. and Fasman, G.D. (1974). Conformational parameters for amino acids in helical, β-sheet, and random coil regions calculated from proteins. Biochemistry 13, 211-221.
• Chou, P.Y. and Fasman, G.D. (1974). Prediction of protein conformation. Biochemistry 13, 222-245.
Secondary structure prediction: CHOU-FASMAN
• Method
– Assigning a set of prediction values to each residue, based on statistical analysis of 15 proteins
– Applying a simple algorithm to those numbers
Calculation of preference parameters
For each of the 20 residues and each secondary structure (α-helix, β-sheet and β-turn):
• P = log(observed counts / expected counts) + 1.0
• Preference parameter > 1.0: the residue has a preference for the specific secondary structure.
• Preference parameter = 1.0: the residue neither prefers nor dislikes the specific secondary structure.
• Preference parameter < 1.0: the residue dislikes the specific secondary structure.
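A minimal sketch of the preference parameter as the slide writes it. The slide does not state the base of the logarithm, so base 10 is an assumption here, and the counts in the example are hypothetical. Whatever the base, equal observed and expected counts give exactly 1.0.

```python
import math

def preference(observed, expected, base=10.0):
    """P = log(observed / expected) + 1.0, following the slide's formula.

    The log base is an assumption; the slide does not specify it.
    """
    return math.log(observed / expected, base) + 1.0

# Hypothetical counts for one residue type in one secondary structure.
print(preference(observed=30, expected=20))   # above 1.0: preference
print(preference(observed=20, expected=20))   # exactly 1.0: neutral
print(preference(observed=10, expected=20))   # below 1.0: dislikes
```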
Preference parameters
Residue  P(a)   P(b)   P(t)   f(i)    f(i+1)  f(i+2)  f(i+3)
Ala      1.45   0.97   0.57   0.049   0.049   0.034   0.029
Arg      0.79   0.90   1.00   0.051   0.127   0.025   0.101
Asn      0.73   0.65   1.68   0.101   0.086   0.216   0.065
Asp      0.98   0.80   1.26   0.137   0.088   0.069   0.059
Cys      0.77   1.30   1.17   0.089   0.022   0.111   0.089
Gln      1.17   1.23   0.56   0.050   0.089   0.030   0.089
Glu      1.53   0.26   0.44   0.011   0.032   0.053   0.021
Gly      0.53   0.81   1.68   0.104   0.090   0.158   0.113
His      1.24   0.71   0.69   0.083   0.050   0.033   0.033
Ile      1.00   1.60   0.58   0.068   0.034   0.017   0.051
Leu      1.34   1.22   0.53   0.038   0.019   0.032   0.051
Lys      1.07   0.74   1.01   0.060   0.080   0.067   0.073
Met      1.20   1.67   0.67   0.070   0.070   0.036   0.070
Phe      1.12   1.28   0.71   0.031   0.047   0.063   0.063
Pro      0.59   0.62   1.54   0.074   0.272   0.012   0.062
Ser      0.79   0.72   1.56   0.100   0.095   0.095   0.104
Thr      0.82   1.20   1.00   0.062   0.093   0.056   0.068
Trp      1.14   1.19   1.11   0.045   0.000   0.045   0.205
Tyr      0.61   1.29   1.25   0.136   0.025   0.110   0.102
Val      1.14   1.65   0.30   0.023   0.029   0.011   0.029
Applying the algorithm
1. Assign parameters to each residue.
2. Identify regions where 4 out of 6 residues have P(a) > 1.00: α-helix. Extend the helix in both directions until four contiguous residues have an average P(a) < 1.00: end of α-helix. If the segment is longer than 5 residues and P(a) > P(b): α-helix.
3. Repeat this procedure to locate all of the helical regions.
4. Identify regions where 3 out of 5 residues have P(b) > 1.00: β-sheet. Extend the sheet in both directions until four contiguous residues have an average P(b) < 1.00: end of β-sheet. If P(b) > 1.05 and P(b) > P(a): β-sheet.
5. Rest: P(a) > P(b): α-helix. P(b) > P(a): β-sheet.
6. To identify a bend at residue number i, calculate the following value:
p(t) = f(i) × f(i+1) × f(i+2) × f(i+3)
where each factor is taken from the residue at that position of the tetrapeptide i to i+3.
If (1) p(t) > 0.000075; (2) the average P(t) > 1.00 in the tetrapeptide; and (3) the averages for the tetrapeptide obey P(a) < P(t) > P(b): β-turn.
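Two pieces of the procedure, the helix nucleation test of step 2 and the bend test of step 6, can be sketched as follows. This is a partial illustration and does not implement the whole method: helix extension, sheet detection and overlap resolution are left out. Thresholds are expressed on the same scale as the parameter table (1.00 = no preference). The one-letter keys, the function names and the test sequences are assumptions, and the turn dictionary holds only a four-residue subset of the table.

```python
# P(a) values from the parameter table, keyed by one-letter residue code.
P_A = {
    "A": 1.45, "R": 0.79, "N": 0.73, "D": 0.98, "C": 0.77,
    "Q": 1.17, "E": 1.53, "G": 0.53, "H": 1.24, "I": 1.00,
    "L": 1.34, "K": 1.07, "M": 1.20, "F": 1.12, "P": 0.59,
    "S": 0.79, "T": 0.82, "W": 1.14, "Y": 0.61, "V": 1.14,
}

# Subset of the table for the bend test:
# residue -> (P(a), P(b), P(t), f(i), f(i+1), f(i+2), f(i+3))
TURN = {
    "N": (0.73, 0.65, 1.68, 0.101, 0.086, 0.216, 0.065),
    "P": (0.59, 0.62, 1.54, 0.074, 0.272, 0.012, 0.062),
    "G": (0.53, 0.81, 1.68, 0.104, 0.090, 0.158, 0.113),
    "W": (1.14, 1.19, 1.11, 0.045, 0.000, 0.045, 0.205),
}

def helix_nuclei(seq, window=6, needed=4, threshold=1.00):
    """Start indices of windows where at least `needed` residues have P(a) > threshold."""
    starts = []
    for i in range(len(seq) - window + 1):
        formers = sum(1 for aa in seq[i:i + window] if P_A[aa] > threshold)
        if formers >= needed:
            starts.append(i)
    return starts

def is_turn(tetrapeptide):
    """Apply the three bend conditions of step 6 to four residues."""
    rows = [TURN[aa] for aa in tetrapeptide]
    p_t = 1.0
    for position, row in enumerate(rows):
        p_t *= row[3 + position]          # f(i), f(i+1), f(i+2), f(i+3)
    avg_a = sum(row[0] for row in rows) / 4
    avg_b = sum(row[1] for row in rows) / 4
    avg_t = sum(row[2] for row in rows) / 4
    return p_t > 0.000075 and avg_t > 1.00 and avg_a < avg_t > avg_b

print(helix_nuclei("GPGAEELMKGPG"))
print(is_turn("NPGW"), is_turn("WWWW"))
```

Every reported start index marks a six-residue window that could seed a helix. A fuller implementation would merge overlapping windows and then extend them as step 2 describes.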
Successful method? 19 proteins evaluated:
• Successful in locating 88% of helical and 95% of β regions
• Correctly predicting 80% of helical and 86% of β-sheet residues
• Accuracy of predicting the three conformational states for all residues (helix, β and coil) is 77%
Chou & Fasman: successful method
After 1974: improvement of preference parameters
Sander-Schneider: Evolution of overall structure
• Naturally occurring sequences with more than 20% sequence identity over 80 or more residues always adopt the same basic structure (Sander and Schneider 1991)
Sander-Schneider
• HSSP: Homology-derived Secondary Structure of Proteins
• SCOP: Structural Classification of Proteins
• FSSP: Family of Structurally Similar Proteins
• CATH: Class, Architecture, Topology, Homology
Structural Family Databases
Levels of protein structure
• Tertiary
– Packing of secondary structure elements into a compact spatial unit
– Fold or domain: this is the level to which structure prediction is currently possible
Domains
Protein Architecture
• Protein dissection into domains
• Conserved Domain Architecture Retrieval Tool (CDART) uses information in Pfam and SMART to assign domains along a sequence
• (automatic when blasting)
Domains
• From the analysis of alignment of protein families
• Conserved sequence features, usually associated with a specific function
• PROSITE database of protein “signatures” (large numbers of false positives and false negatives)
• From alignment of homologous sequences (PRINTS/PRODOM)
• From Hidden Markov Models (PFAM)
• Meta approach: INTERPRO
Levels of protein structure: Topology
Hydrophobicity Plot
P53_HUMAN (P04637), human cellular tumor antigen p53
Kyte-Doolittle hydrophilicity, window = 19
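A plot of this kind is a sliding-window average over a per-residue scale. The sketch below uses the published Kyte-Doolittle hydropathy values and the window of 19 named above. The function name and the test sequence are hypothetical, and plotting is left out.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KD[aa] for aa in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]

# Hypothetical sequence: a hydrophobic stretch followed by a charged stretch.
profile = hydropathy_profile("L" * 19 + "K" * 19)
print(len(profile), round(profile[0], 2), round(profile[-1], 2))
```

Each value belongs to the window starting at that index. Plotting tools usually assign it to the central residue of the window instead.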
The ‘positive inside’ rule (EMBO J. 5:3021; EJB 174:671, 205:1207; FEBS Lett. 282:41)
Membrane              In        Out
Bacterial IM          16% KR    4% KR
Eukaryotic PM         17% KR    7% KR
Thylakoid membrane    13% KR    5% KR
Mitochondrial IM      10% KR    3% KR
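The rule can be turned into a simple orientation guess: compare the Lys + Arg fraction of the loops on either side of the membrane. This is a rough sketch under stated assumptions. The loops are taken to be already extracted and to alternate between the two sides in N-to-C order. The loop sequences and function names are hypothetical.

```python
def kr_fraction(segment):
    """Fraction of Lys (K) and Arg (R) residues in a segment."""
    if not segment:
        return 0.0
    return (segment.count("K") + segment.count("R")) / len(segment)

def inside_side(loops):
    """Return 0 if the side of the first loop is richer in K/R, else 1.

    loops: loop sequences in N-to-C order, alternating between the two sides.
    """
    side0 = "".join(loops[0::2])
    side1 = "".join(loops[1::2])
    return 0 if kr_fraction(side0) > kr_fraction(side1) else 1

# Hypothetical loops of a four-helix membrane protein.
loops = ["MKRKL", "SDGTE", "RRKAK", "TNDS"]
print(inside_side(loops))
```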
• Membrane-bound receptors
• A very large number of different domains both to bind their ligand and to activate G proteins.
• 6 different families
• Transducing messages such as photons, organic odorants, nucleotides, nucleosides, peptides, lipids and proteins.
GPCR Topology
• Pharmaceutically the most important class
• Challenge: methods to find novel GPCRs in the human genome …
• Seven transmembrane regions
GPCR Structure
• Conserved residues and motifs (e.g. NPXXY)
• Hydrophobic/ hydrophilic domains
E.g. plot conserved residues (or multiple alignment: MSA to SSA)
Levels of protein structure
• Difficult to predict
• Functional units: apoptosome, proteasome
• X-ray crystallography is an experimental technique that exploits the fact that X-rays are diffracted by crystals.
• X-rays have the proper wavelength (in the Ångström range, ~10⁻⁸ cm) to be scattered by the electron cloud of an atom of comparable size.
• Based on the diffraction pattern obtained from X-ray scattering off the periodic assembly of molecules or atoms in the crystal, the electron density can be reconstructed.
• A model is then progressively built into the experimental electron density, refined against the data and the result is a quite accurate molecular structure.
What is X-ray Crystallography
• NMR uses protein in solution
– Can look at the dynamic properties of the protein structure
– Can look at the interactions between the protein and ligands, substrates or other proteins
– Can look at protein folding
– Sample is not damaged in any way
– The maximum size of a protein for NMR structure determination is ~30 kDa. This eliminates ~50% of all proteins
– High solubility is a requirement
• X-ray crystallography uses protein crystals
– No size limit: as long as you can crystallise it
– Solubility requirement is less stringent
– Simple definition of resolution
– Direct calculation from data to electron density and back again
– Crystallisation is the process bottleneck: binary (all or nothing)
– Phase problem: relies on heavy atom soaks or SeMet incorporation
• Both techniques require large amounts of pure protein and require expensive equipment!
NMR or Crystallography ?
PDB
Visualizing Structures
Cn3D version 4.0 (NCBI)
Ball: Van der Waals radius
Stick: length joins centers
N: blue / O: red / S: yellow / C: gray (green)
From N to C
• Demonstration of Protein Explorer
• PDB, install Chime
• Search helicase (select structure where DNA is present)
• Stop spinning, hide water molecules
• Show basic residues, which interact with the negatively charged backbone
• RASMOL / Cn3D
Modeling
Protein Structure
Molecular Modeling: building a 3D protein structure from its sequence
• Finding a structural homologue
• BLAST
– versus PDB database, or PSI-BLAST (E < 0.005)
– Domain coverage at least 60%
• Avoid gaps
– Prefer few gaps and reasonable similarity scores over lots of gaps and high similarity scores
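The selection criteria above can be sketched as a filter over search hits. Everything about the data here is hypothetical: the `Hit` record, its field names and the example values are illustrative assumptions and do not reflect the output format of any BLAST program. Ranking by fewest gaps first and then by score is one possible reading of the "avoid gaps" advice.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    template: str    # hypothetical template identifier
    evalue: float    # E-value of the hit
    coverage: float  # fraction of the query domain covered (0 to 1)
    gaps: int        # number of gaps in the alignment
    score: float     # similarity score

def select_templates(hits, max_evalue=0.005, min_coverage=0.60):
    """Keep hits passing both thresholds, fewest gaps first, then highest score."""
    kept = [h for h in hits if h.evalue < max_evalue and h.coverage >= min_coverage]
    return sorted(kept, key=lambda h: (h.gaps, -h.score))

hits = [
    Hit("tmplA", 1e-30, 0.90, 12, 300.0),
    Hit("tmplB", 1e-20, 0.85, 2, 220.0),
    Hit("tmplC", 0.5, 0.95, 0, 40.0),    # rejected: E-value too high
    Hit("tmplD", 1e-10, 0.40, 1, 150.0), # rejected: coverage too low
]
print([h.template for h in select_templates(hits)])
```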
• Extract “template” sequences and align with query
• Watch out for missing data (PDB file) and complement with additional templates
• Try to get as much information as possible, X-ray/NMR
• A sequence alignment from structure comparison of templates (SSA) can be different from a simple sequence alignment
• >40% identity: any alignment method is OK
• <40% identity: checks are essential
– Residue conservation checks in functional regions (patterns/motifs)
– Indels: combine gaps separated by few residues
– Manual editing: move gaps from secondary structure elements to loops
– Within loops, move gaps to loop ends, i.e. the turnaround point of the backbone
• Align templates structurally, extract the corresponding SSA or QTA (query/template alignment)
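The 40% rule of thumb needs a percent identity for the query/template alignment. Several definitions are in use. The sketch below assumes one of them, identical positions divided by the columns where neither sequence has a gap. The function names and example alignments are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over columns without a gap ('-') in either row."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs)

def needs_manual_checks(aligned_query, aligned_template, threshold=40.0):
    """True when identity falls below the threshold and the checks above apply."""
    return percent_identity(aligned_query, aligned_template) < threshold

print(percent_identity("ACDEFGHIKL", "AWNYFSQTKV"))
print(needs_manual_checks("ACDEFGHIKL", "AWNYFSQTKV"))
```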
Input for model building
• Query sequence (the one you want the 3D model for)
• Template sequences and structures
• Query/template(s) (structure) sequence alignment
• Methods (for details on these, see paper):
– WHATIF
– SWISS-MODEL
– MODELLER
– ICM
– 3D-JIGSAW
– CPH-models
– SDC1
• Model evaluation (how good is the prediction, how much can the algorithm rely on and extract from the provided templates)
– PROCHECK
– WHATIF
– ERRAT
• CASP (Critical Assessment of Structure Prediction)
– Best method is manual alignment editing!
CASP4: overall model accuracy ranging from 1 Å to 6 Å for 50-10% sequence identity
• T112/dhso: 4.9 Å (348 residues; 24%)
• T92/yeco: 5.6 Å (104 residues; 12%)
• T128/sodm: 1.0 Å (198 residues; 50%)
• T125/sp18: 4.4 Å (137 residues; 24%)
• T111/eno: 1.7 Å (430 residues; 51%)
• T122/trpa: 2.9 Å (241 residues; 33%)
Comparative modelling at CASP
                CASP1     CASP2     CASP3     CASP4     BC
alignment       poor      fair      fair      fair      excellent
side chain      ~50%      ~75%      ~75%      ~75%      ~80%
short loops     ~3.0 Å    ~1.0 Å    ~1.0 Å    ~1.0 Å    1.0 Å
longer loops    >5.0 Å    ~3.0 Å    ~2.5 Å    ~2.0 Å    2.0 Å
Protein Engineering / Protein Design