Introduction to the Protein Data Bank
Master Chimie Info - 2009
Roland Stote
The purpose of the Protein Data Bank is to collect and organize 3D structures of proteins, nucleic acids, protein-nucleic acid complexes, and complexes with drug molecules and inhibitors.
The PDB is the recognized worldwide repository for 3D structures.
After validation of deposited structures, the coordinates are made available to all researchers worldwide, free of charge.
http://www.pdb.org
Founded in 1971 by Brookhaven National Laboratory, New York.
The PDB is an important resource for research in the academic, pharmaceutical, and biotechnology sectors.
Transferred to the Research Collaboratory for Structural Bioinformatics (RCSB) in 1998.
Currently it holds more than 50,000 structures.
▪ X-ray
▪ NMR
▪ Electron Microscopy
▪ Homology Models

▪ Protein Only
▪ DNA Only
▪ RNA
▪ Protein Nucleic Acid Complexes
The PDB can be accessed at http://www.rcsb.org/pdb/
A variety of information is available:
1. sequence details …MADPMAGLELLSDQGYRVDGRRAGELRKIQARMGVFAQAD…
2. atomic coordinates
3. crystallization conditions
4. 3-D structural classifications
5. geometric data
6. structure factors
7. 3-D images
8. links to other resources
PDB Data
PDB file format was used to contain the coordinates and related information.
In the late 1990s, the macromolecular Crystallographic Information File (mmCIF) format evolved.
It conforms to well-documented standards and facilitates automated data management.
The mmCIF dictionary contains 2,500 definitions for terms used to describe the crystallographic experiment.
The dictionary definition language (DDL) is structured so that data files conforming to its syntax can be readily loaded into a database.
Software Tools
Software tools to minimize the amount of manual labor.
To help scientists deposit and validate their results more quickly.
To interrogate the data bank for information mining.
Search Fields                 Example
PDB ID                        4HHB, 2MHR
Deposition/Release Date       September 1 1996
Contains Chain Type           Protein: Ignore, Enzyme: Yes, DNA
Citation Author               S.S. Taylor
Compound Information          Myoglobin, Lysozyme
Number of Chains              1-5
Secondary Structure Content   Percent of alpha, min 80%
Data summary
Data Browsing
A variety of information is available:
– sequence details
– atomic coordinates
– crystallization conditions
– 3-D structural neighbors
– geometric data
– structure factors
– 3-D images
– links to other resources
Information storage in the Protein Data Bank
The PDB Format
I. Information Fields
HEADER ISOMERASE(INTRAMOLECULAR OXIDOREDUCTASE)12-OCT-94 1HTI 1HTI 2
COMPND TRIOSEPHOSPHATE ISOMERASE (TIM) (E.C.5.3.1.1) COMPLEXED WITH 1HTI 3
COMPND 2 2-PHOSPHOGLYCOLIC ACID 1HTI 4
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT FORM EXPRESSED IN 1HTI 5
SOURCE 2 (ESCHERICHIA COLI) 1HTI 6
AUTHOR S.C.MANDE,W.G.J.HOL 1HTI 7
REVDAT 1 26-JAN-95 1HTI 0 1HTI 8
JRNL AUTH S.C.MANDE,V.MAINFROID,K.H.KALK,K.GORAJ, 1HTI 9
JRNL AUTH 2 J.A.MARTIAL,W.G.J.HOL 1HTI 10
JRNL TITL CRYSTAL STRUCTURE OF RECOMBINANT HUMAN 1HTI 11
JRNL TITL 2 TRIOSEPHOSPHATE ISOMERASE AT 2.8 ANGSTROMS 1HTI 12
JRNL TITL 3 RESOLUTION. TRIOSEPHOSPHATE ISOMERASE RELATED HUMAN 1HTI 13
JRNL TITL 4 GENETIC DISORDERS AND COMPARISON WITH THE 1HTI 14
JRNL TITL 5 TRYPANOSOMAL ENZYME 1HTI 15
JRNL REF PROTEIN SCI. V. 3 810 1994 1HTI 16
JRNL REFN ASTM PRCIEI US ISSN 0961-8368 0795 1HTI 17
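The information fields in the 1HTI excerpt follow a simple pattern: each line starts with a record name (HEADER, COMPND, SOURCE, AUTHOR, ...), and long entries continue on further lines that carry a continuation number. The sketch below illustrates one way to gather such records in Python. It assumes whitespace-separated lines as printed in this transcript, with the trailing "1HTI n" sequence columns removed; real PDB files are laid out in fixed columns. The helper name group_records is hypothetical and not part of any PDB software.

```python
from collections import defaultdict

# Records copied from the 1HTI excerpt, without the trailing
# "1HTI <n>" sequence columns (an assumption made for readability).
EXCERPT = """\
COMPND TRIOSEPHOSPHATE ISOMERASE (TIM) (E.C.5.3.1.1) COMPLEXED WITH
COMPND 2 2-PHOSPHOGLYCOLIC ACID
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT FORM EXPRESSED IN
SOURCE 2 (ESCHERICHIA COLI)
AUTHOR S.C.MANDE,W.G.J.HOL
"""


def group_records(text):
    """Join the text of each record type, dropping continuation numbers."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, rest = line.partition(" ")
        words = rest.split()
        # On the second and later lines of a record type, a leading
        # integer is treated as the continuation number and skipped.
        if grouped[name] and words and words[0].isdigit():
            words = words[1:]
        grouped[name].append(" ".join(words))
    return {name: " ".join(parts) for name, parts in grouped.items()}


records = group_records(EXCERPT)
print(records["COMPND"])
print(records["AUTHOR"].split(","))
```

Splitting on whitespace is enough for this excerpt; a parser for complete files would more safely slice each line by column position.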
II. The coordinates
ATOM      2  CA  ALA A   1       3.600  34.943 -23.158  1.00 59.14      1HTI 117
ATOM      3  C   ALA A   1       3.987  33.499 -23.450  1.00 57.98      1HTI 118
ATOM      4  O   ALA A   1       4.933  33.337 -24.232  1.00 58.72      1HTI 119
ATOM      5  CB  ALA A   1       2.557  35.326 -24.214  1.00 58.72      1HTI 120
ATOM      6  N   PRO A   2       3.399  32.472 -22.822  1.00 56.87      1HTI 121
ATOM      7  CA  PRO A   2       3.812  31.074 -22.996  1.00 56.05      1HTI 122
ATOM      8  C   PRO A   2       3.090  30.120 -23.966  1.00 55.12      1HTI 123
ATOM      9  O   PRO A   2       2.000  30.337 -24.519  1.00 55.23      1HTI 124
ATOM     10  CB  PRO A   2       3.802  30.553 -21.572  1.00 54.86      1HTI 125
ATOM     11  CG  PRO A   2       2.559  31.211 -21.008  1.00 56.53      1HTI 126
ATOM     12  CD  PRO A   2       2.618  32.624 -21.585  1.00 55.74      1HTI 127
ATOM     13  N   SER A   3       3.779  28.991 -24.074  1.00 53.34      1HTI 128
ATOM     14  CA  SER A   3       3.288  27.819 -24.764  1.00 51.00      1HTI 129
ATOM     15  C   SER A   3       3.235  26.793 -23.610  1.00 47.52      1HTI 130
ATOM     16  O   SER A   3       3.794  25.685 -23.636  1.00 48.54      1HTI 131
ATOM     17  CB  SER A   3       4.296  27.418 -25.889  1.00 53.44      1HTI 132
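Each ATOM record in the 1HTI coordinate excerpt gives, in order, the atom serial number, atom name, residue name, chain identifier, residue number, the x, y, z coordinates in Ångströms, the occupancy, and the temperature factor. As a rough illustration, the Python sketch below reads three of the CA records and computes the distance between consecutive alpha carbons. It assumes whitespace-separated fields, which holds for this excerpt; in complete PDB files adjacent fields can touch, so slicing by column is the more robust approach. The function name parse_atom is hypothetical.

```python
import math

# Three CA records copied from the 1HTI excerpt (sequence columns dropped).
ATOM_LINES = """\
ATOM 2 CA ALA A 1 3.600 34.943 -23.158 1.00 59.14
ATOM 7 CA PRO A 2 3.812 31.074 -22.996 1.00 56.05
ATOM 14 CA SER A 3 3.288 27.819 -24.764 1.00 51.00
"""


def parse_atom(line):
    """Split one whitespace-separated ATOM line into named fields."""
    f = line.split()
    return {
        "serial": int(f[1]),
        "name": f[2],
        "residue": f[3],
        "chain": f[4],
        "resnum": int(f[5]),
        "xyz": (float(f[6]), float(f[7]), float(f[8])),
        "occupancy": float(f[9]),
        "bfactor": float(f[10]),
    }


atoms = [parse_atom(l) for l in ATOM_LINES.splitlines() if l.startswith("ATOM")]

# Distance in Angstroms between the CA atoms of neighbouring residues.
ca_distances = [math.dist(a["xyz"], b["xyz"]) for a, b in zip(atoms, atoms[1:])]

for a, b, d in zip(atoms, atoms[1:], ca_distances):
    print(f'{a["residue"]}{a["resnum"]} - {b["residue"]}{b["resnum"]}: {d:.2f}')
```

Both distances come out close to 3.8 Å, the spacing usually seen between alpha carbons of adjacent residues, which is a quick sanity check that the fields were read in the right order.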
Criteria for structure quality
REMARK 2 RESOLUTION. 2.8 ANGSTROMS. 1HTI 20
REMARK 3 1HTI 21
REMARK 3 REFINEMENT. 1HTI 22
REMARK 3 PROGRAM X-PLOR 1HTI 23
REMARK 3 AUTHORS BRUNGER 1HTI 24
REMARK 3 R VALUE 0.167 1HTI 25
REMARK 3 RMSD BOND DISTANCES 0.019 ANGSTROMS 1HTI 26
REMARK 3 RMSD BOND ANGLES 3.8 DEGREES 1HTI 27
REMARK 3 1HTI 28
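The quality criteria in these REMARK records (resolution, R value, and the RMS deviations of bond lengths and angles from ideal values) can be pulled out with a few regular expressions. The Python sketch below is tied to the wording of this 1HTI excerpt; the patterns and the function name quality_indicators are illustrative assumptions, and other entries may phrase their REMARK 3 lines differently.

```python
import re

# REMARK lines copied from the 1HTI excerpt (sequence columns dropped).
REMARKS = """\
REMARK 2 RESOLUTION. 2.8 ANGSTROMS.
REMARK 3 REFINEMENT.
REMARK 3 PROGRAM X-PLOR
REMARK 3 R VALUE 0.167
REMARK 3 RMSD BOND DISTANCES 0.019 ANGSTROMS
REMARK 3 RMSD BOND ANGLES 3.8 DEGREES
"""

# One pattern per indicator; each captures the number that follows the label.
PATTERNS = {
    "resolution": r"RESOLUTION\.\s+([\d.]+)\s+ANGSTROMS",
    "r_value": r"R VALUE\s+([\d.]+)",
    "rmsd_bonds": r"RMSD BOND DISTANCES\s+([\d.]+)",
    "rmsd_angles": r"RMSD BOND ANGLES\s+([\d.]+)",
}


def quality_indicators(text):
    """Return the indicators found in the REMARK text as floats."""
    found = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            found[key] = float(match.group(1))
    return found


quality = quality_indicators(REMARKS)
for key, value in quality.items():
    print(key, value)
```

An R value of 0.167 corresponds to 16.7% when the criterion is expressed as a percentage.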
Chimie Informatique
Introduction to the Protein Data Bank
Roland Stote

This tutorial is designed to introduce you to the Protein Data Bank maintained at the Rutgers University Center for Structural Biology in the United States. The Protein Data Bank (PDB) is a central repository for all known three-dimensional (3D) structures of proteins and nucleic acids in the public domain. The data files are freely accessible. The databank contains more than 17,000 protein and nucleic acid structures. This tutorial is based on documentation available from the RCSB. The PDB can be accessed at http://www.pdb.org

PDB identifiers
Each structure in the PDB is represented by a 4-character alphanumeric identifier, assigned when the structure is deposited. This identifier begins with a digit and is followed by three letters or digits, for example 1CG1. Many of the PDB Web site pages, including the PDB home page, allow you to enter a PDB ID and retrieve information for the corresponding structure. If you don't know a structure's PDB ID, you can search by keyword from the home page, or with SearchLite, as described below.

General Information
• The PDB archive contains macromolecular structure data on proteins, nucleic acids, protein-nucleic acid complexes, and viruses. Files in its holdings are deposited by the international user community and maintained by the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB) staff. Approximately 50-100 new structures are deposited each week. They are annotated by the RCSB and released according to the depositor's specifications. PDB data is freely available worldwide.
• Information on structures can be retrieved from the main PDB Web site at http://www.pdb.org/, or one of its mirror sites, using several search methods described below.
• A variety of information associated with each structure is available, including sequence details, atomic coordinates, crystallization conditions, 3-D structure neighbors computed using various methods, derived geometric data, structure factors, 3-D images, and a variety of links to other resources.
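The identifier rule given under "PDB identifiers" (four characters, the first one a digit) is easy to check in a script before sending a query. The Python sketch below is one possible check; the regular expression encodes only what this tutorial states, and the function name normalize_pdb_id is hypothetical. Identifiers are upper-cased because the tutorial's examples are written both ways (4hhb, 4HHB).

```python
import re

# Assumed pattern, taken from the tutorial text: one digit followed by
# three letters or digits.
PDB_ID = re.compile(r"[0-9][A-Za-z0-9]{3}")


def normalize_pdb_id(text):
    """Return the identifier in upper case, or raise ValueError if malformed."""
    candidate = text.strip()
    if PDB_ID.fullmatch(candidate) is None:
        raise ValueError(f"not a 4-character PDB ID: {text!r}")
    return candidate.upper()


for raw in ["4hhb", " 1CG1 ", "9ins"]:
    print(normalize_pdb_id(raw))
```

A keyword such as "hemoglobin" fails the check, which signals that a text search is needed instead of a lookup by ID.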
What information can you get from a PDB data file? (Open file 1RHS.PDB)

Title Section. The title section contains records used to describe the experiment and the biological macromolecules present in the entry (HEADER, OBSLTE, TITLE, CAVEAT, COMPND, SOURCE, KEYWDS, EXPDTA, AUTHOR, REVDAT, SPRSDE, JRNL, and REMARK records).
Primary Structure Section. The primary structure section of a PDB file contains the sequence of residues in each chain of the macromolecule.

Heterogen Section. The heterogen section of a PDB file contains the complete description of non-standard residues in the entry.

Secondary Structure Section. The secondary structure section of a PDB file describes helices, sheets, and turns found in protein and polypeptide structures (HELIX, SHEET, TURN).

Connectivity Annotation Section. The connectivity annotation section allows the depositors to specify the existence and location of disulfide bonds and other linkages.

Miscellaneous Features Section. The miscellaneous features section describes features in the molecule such as the active site.

Crystallographic and Coordinate Transformation Section. The crystallographic section describes the geometry of the crystallographic experiment and the coordinate system transformations.

Coordinate Section. The coordinate section contains the collection of atomic coordinates as well as the MODEL and ENDMDL records.

Connectivity Section. This section provides information on chemical connectivity.

Bookkeeping Section. The bookkeeping section provides some final information about the file itself.

Searching the PDB
Go to the PDB home page.
• A search requires that at least one search field is filled. Case is ignored. The search is then executed by pressing the search button.
• A search can return a single structure or multiple structures.
• Iterative searches can be performed, using the output from one search as input for the next.

The search tools can be accessed from the PDB home page. The types of possible searches are:
1. By providing a PDB identification code (PDB ID).
2. By searching the text found in PDB files (SearchLite).
3. By searching against specific fields of information, for example deposition date or author (SearchFields).
4. By searching on the status of an entry, on hold or released (Status).
5. By iterating on a previous search.

Searching by PDB ID (from the top bar of the home page)
Enter the identification code, for example 4hhb or 9ins for hemoglobin and insulin, respectively. Many of the PDB Web site pages, including the PDB home page, allow you to enter a PDB ID and
retrieve information for the corresponding structure. If you don't know a structure's PDB ID, you can search by keyword from the home page, or by a more advanced search, as described below.

Using Keywords or Authors
Via the top bar, one can search the text of each PDB file as follows:
• Queries locate literal text phrases. A search for protein kinase will locate all the occurrences of the words protein and kinase in the PDB.
• A search for author: brown will locate all structures that contain brown in their PDB AUTHOR records.
Searching Using the Advanced Search
The advanced search supports queries on specific attributes of a structure, such as its author, sequence, or deposition date. Additional search fields can be added or removed from the default form by selecting new fields from the choices provided. If multiple fields are used for a search, a list of structures meeting all of the specified field requirements is returned. Detailed descriptions of these fields are available on the PDB Web site.

Status Search
Use the status search to check on the status of an unreleased entry and obtain summary information about it. Queries can be performed based on PDB ID, author, title, release date, or deposition date. You may also search based on the holding status of the unpublished entries. Status categories are:
• release on publication - entry will be released when the associated journal article is published (HPUB)
• release on certain date - entry will be released on a date specified by the authors at the time of deposition (WAIT)
• await author input - entry is being processed but requires further interaction between the processor and the depositor (DEPOSITOR)
• currently being processed - entry is still being processed (PROC/PROCESSING)
The format of the results can be customized using the options at the bottom of the search interface page.

Iterative Search
From a list of structures returned from an initial search, the user can select all structures by choosing that option from the pull-down menu, or select a subset of structures by checking the boxes next to them. Additional searches can be performed over the entire or partial result list.
Select the Refine Your Query option, which will return you to the search interface that was used for your initial query.

Interpreting Results

For a Single Structure
If your query returns only a single structure, a synopsis of that structure is provided on the Structure Explorer page. The page primarily contains a summary of the most important features of the structure. At the top as well as on the left side is a list of additional options. This list is dynamic: what appears depends on what additional information is available for that structure. An image of the molecule is presented on the right-hand side. If you have a Java-equipped machine, other ways to view the structure are possible. The possible options are:

Download/Display File. Download the PDB or mmCIF file to your local computer as plain text or in one of 3 common compression formats (Unix compressed, GNU zip, or ZIP), plus other formats.
FASTA Sequence. Download the sequence of the protein in the FASTA (single-letter) format.
Structural Reports. Provides a detailed summary of the molecule.
Display Molecule. Provides a number of options for visualizing the molecule. A Java run-time environment is needed on the machine.
Structure Analysis: Geometry. A tabular listing of bond lengths, bond angles, and dihedral angles (phi, psi, omega, and chi) can be displayed, color coded to highlight significant deviations from ideality; a fold deviation score (FDS) provides a snapshot of the overall geometry of the selected structure.
Structure Analysis: Summaries and Analysis. Provides additional, secondary analysis of the structure.

For Multiple Structures
• A search returning multiple structures will present a list of those structures with a brief synopsis of each in the Query Result Browser page. A single member of that list can be selected by clicking on the link next to it. A subset of entries can be selected by checking the boxes next to them.
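The search behaviour described in this handout (at least one field filled, case ignored, all specified fields must match, and a result list that can be searched again) can be mimicked on a small in-memory list. The Python sketch below is only an illustration of that logic, not of how the PDB server works; the list ENTRIES is a hypothetical stand-in for the database that uses only identifiers and names mentioned in this document, and the function name search is an assumption.

```python
# Hypothetical mini "database" built from entries named in this document.
# A source organism is listed only where the document states it.
ENTRIES = [
    {"id": "1HTI", "compound": "triosephosphate isomerase", "source": "human"},
    {"id": "4HHB", "compound": "hemoglobin"},
    {"id": "9INS", "compound": "insulin"},
    {"id": "8RSA", "compound": "ribonuclease A"},
]


def search(entries, **fields):
    """Return entries whose named fields all contain the given text."""
    if not fields:
        raise ValueError("at least one search field must be filled")
    hits = []
    for entry in entries:
        # Every requested field must match; comparison ignores case.
        if all(text.lower() in entry.get(name, "").lower()
               for name, text in fields.items()):
            hits.append(entry)
    return hits


# Initial search, then an iterative search over the previous result list.
first = search(ENTRIES, compound="ASE")
refined = search(first, source="human")

print([e["id"] for e in first])
print([e["id"] for e in refined])
```

Passing both fields in one call, search(ENTRIES, compound="ase", source="human"), gives the same final list, which mirrors the equivalence between a multi-field search and a refined query.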
Exercises for the Protein Data Bank
Roland Stote
• How do I locate a structure if I don't know its PDB ID?
• Knowing the PDB ID, how do I take a quick look at the structure?
• How do I obtain a list of all human DNA structures in the PDB that are not in complex
with another molecule?
• Which proteins contain a cyclic peptide of 12 residues or less and were refined to an R
value of better than 20%?
• How similar are the structures of ribonuclease A in different crystal forms?
Example 1 - How do I locate a structure if I don't know its PDB ID?
1. Using the SearchLite keyword search interface, you can search for any structure. For
example, you may want to find a ribosome. From the PDB home page select
SearchLite. Enter ribosome in the search box, then hit the Search button.
2. A list of ribosomes and related structures whose files contain the word
"ribosome" is returned on the Query Result Browser page.
3. Scroll through the list to find the structure you want, then click on the EXPLORE link.
The resulting Structure Explorer page will provide many options, such as viewing the
structure or its sequence details.
4. Rather than scrolling through the Query Result Browser list, you can also select Refine
Your Query from the option scroll bar at the top of the page, and hit the Go button. This
will take you back to the SearchLite screen. You can enter another text string to search on,
such as "complex" which will query only the previous result list of ribosomes for that
keyword.
Example 2 - Knowing the PDB ID, how do I take a quick look at the structure?
1. Enter the PDB code (e.g. 1DI0) in the Enter a PDB ID field on the PDB
home page and hit the Explore button. The Structure Explorer page for
this entry, containing a synopsis of the 1DI0 structure, is returned.
2. From the Structure Explorer page select the View Structure link. A page
containing molecular graphics options is returned.
3. Display options are available using VRML, RasMol, Chime, and QuickPDB.
The additional help document provides instructions on how to configure
these options.
4. You can learn a significant amount about this structure by exploring the
options (links) on the left side of the Structure Explorer page. For
example, the topology, other structures with this common fold, and the
functional role of the molecule can be determined from these resources.
Example 3 - How do I obtain a list of all human DNA structures in the PDB
that are not in complex with another molecule?
1. Select SearchFields from the PDB home page.
2. In the lower (customizable) part of the SearchFields page, select the
check box for Source. Click on the New Form button to reset the page. A
new page will be returned with a field for Source included. Enter human
in this new field.
3. In the Contains Chain Type section, select Yes for DNA, and select No
for all other options. Click on the Search button.
4. From the scroll bar at the top of the returned Query Result Browser
page, select Create a Tabular Report and click on the Go button. The
resulting page will give you options for creating reports based on
several different parameters, such as citation information, in HTML or
text format.
Example 4 - Which proteins contain a cyclic peptide of 12 residues or less
and were refined to an R value of better than 20%?
1. Select SearchFields from the PDB home page.
2. In the lower (customizable) part of the SearchFields page, select the
check boxes for Number of Chains/Chain Length and Refinement
Parameters. Hit the New Form button to reset the page with the custom
parameters.
3. Enter 12 as the maximum number of residues in the max box of the
Chain Length fields. Enter cyclic peptide in the Text Search field.
Select < from the Observed R Value button and enter 20 in the field.
Now hit the Search button. The search will return the Query Result
Browser page listing all the structures containing the cyclic peptide
key words that are a maximum of 12 residues in length and were refined
to an R value of less than 20%.
4. Review the list of structures that are returned by clicking on the
EXPLORE button for each and then using both the Sequence Details
link to check the sequence and the Download/Display File link to
visually review the structure description.
5. Select those structures that actually contain a cyclic peptide by
clicking on the check box next to the PDB ID for each relevant
structure. Having determined this correct set of structures you could
Create a Tabular Report or Download Structures or Sequences by
selecting these options from the scroll bar at the top of the Query
Result Browser page and clicking Go.
Example 5 - How similar are the structures of ribonuclease A in different
crystal forms?
1. Select SearchFields from the PDB home page. The page containing the
SearchFields interface is returned.
2. Enter ribonuclease A in the Compound Information field. If desired, you
can customize the display of results using the Result Display Options
fields on this page to show 10, 50 or all structures at once. Hit
Search. The Query Result Browser page, containing a list of all entries
meeting these search criteria, is returned.
3. Filter this list to include only entries that contain complete native
(non-mutant) ribonuclease A structures by using the checkbox beside the
PDB ID for each structure.
4. Choose Create a Tabular Report from the Pull down to select option menu
button and hit Go.
5. Choose the HTML option for Cell Dimensions. This search reveals C2,
P21, P21 21 21 and P32 2 1 as dominant crystal forms for this set of
selected structures.
6. Select the entry with PDB ID of 8RSA. The Structure Explorer page for
this entry will be returned.
7. Select the Structural Neighbors link. From the resulting page select
CE.
8. Select Search Database using the default search options. This search
results in a list of structures that, when ordered by z-score, groups
the previously selected structures very closely according to crystal
form. The alignment of the structures remains good with an RMSD (Root
Mean Square Deviation) of less than 1.0 Å. Crystal packing may have some
influence on the overall structure, leading to distinct clusters of
highly similar structures which can be distinguished with a suitable
structure neighboring algorithm.
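The RMSD quoted in step 8 is the square root of the mean squared distance between matched atoms of two superimposed structures, RMSD = sqrt((1/N) Σ d_i²). The Python sketch below shows only this final calculation. It assumes the two coordinate lists are already aligned and matched atom for atom, which is the part that CE performs and that is not reproduced here. The reference coordinates are the first three CA atoms of the 1HTI entry; the second set is synthetic, made by shifting them 1 Å along x.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Sum of squared distances between matched atoms.
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


# CA positions of residues 1-3 of 1HTI.
reference = [
    (3.600, 34.943, -23.158),
    (3.812, 31.074, -22.996),
    (3.288, 27.819, -24.764),
]
# Synthetic comparison set: the same atoms moved 1 Angstrom along x.
shifted = [(x + 1.0, y, z) for x, y, z in reference]

print(round(rmsd(reference, reference), 3))
print(round(rmsd(reference, shifted), 3))
```

Identical structures give an RMSD of 0, and a uniform 1 Å displacement gives 1 Å, so values below 1.0 Å indicate structures that agree to within less than that on average.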