Changing methods of data sharing in crystallography

DESCRIPTION

Changing methods of data sharing in crystallography. Editor-in-Chief Acta Crystallographica & Chair of the IUCr Journals Commission 1996-2005; IUCr Delegate to ICSTI 2005-. Professor John R Helliwell, Imperial College, June 28th, 2006.

TRANSCRIPT

  • Changing methods of data sharing in crystallography Professor John R Helliwell

    Imperial College, June 28th, 2006

    The University of Manchester. Editor-in-Chief, Acta Crystallographica & Chair of the IUCr Journals Commission 1996-2005; IUCr Delegate to ICSTI 2005-

  • Content of presentation

    Data description standards
    Quality control in publication
    Responsibility for quality control
    Data quality standards
    Data publication at source

  • Crystal structures published

    Curated databases
        Cambridge Structural Database (small organic/metal-organic): 335,280 : 29,000/yr
        Protein Data Bank (biological macromolecules): 34,506 : 5,500/yr
        Inorganic Crystal Structure Database (82,676), CrystMet (99,893), Powder Diffraction File (240,050)

    IUCr journals
        Acta Crystallographica Sections C, E (small-molecule, inorganic): 2357 articles/year
        Acta Crystallographica Sections D, F (biological macromolecules): ~120+ structural articles/year

  • Standard description of data

    Crystallographic Information Framework
        International Tables for Crystallography (2005). Vol. G, Definition and exchange of crystallographic data, edited by S. R. Hall & B. McMahon, 1st ed. Berlin: Springer.

    CIF file structure
        Hall, S.R., Allen, F.H. & Brown, I.D. (1991). The Crystallographic Information File (CIF): a new standard archive file for crystallography. Acta Cryst. A47, 655-685.

    Dictionary definition language
        Hall, S.R. & Cook, A.P.F. (1995). STAR dictionary definition language: initial specification. J. Chem. Inf. Comput. Sci. 35, 819-825.

    Data dictionaries
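To make the tag-and-value idea behind the CIF file structure concrete, here is a minimal Python sketch. It is not a conforming CIF parser: it assumes one tag and one value per line, and it ignores `loop_` constructs, multi-line text fields and the finer points of CIF quoting. The function name `parse_simple_cif` and the sample values are made up for illustration; only the tag names come from the CIF core dictionary.

```python
# Illustrative sample only: the values are invented, not taken from a real structure.
SAMPLE_CIF = """\
data_example
# a comment line
_cell_length_a            10.234(2)
_refine_ls_R_Fsqd_factor  0.045
_chemical_name_common     'example compound'
"""


def parse_simple_cif(text):
    """Return {block_name: {tag: value}} for single-line tag/value pairs only."""
    blocks = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.lower().startswith("data_"):
            # A data block header opens a new namespace of items.
            current = blocks.setdefault(line[5:], {})
        elif line.startswith("_") and current is not None:
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue  # value is on a following line; out of scope for this sketch
            tag, value = parts[0], parts[1].strip()
            # Remove one layer of matching quotes around the value.
            if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
                value = value[1:-1]
            current[tag] = value
    return blocks


if __name__ == "__main__":
    for tag, value in parse_simple_cif(SAMPLE_CIF)["example"].items():
        print(tag, "=", value)
```

Values are kept as strings on purpose: a number such as `10.234(2)` carries a standard uncertainty in parentheses, and deciding how to interpret it is the job of the data dictionary rather than of the file reader.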

  • Data dictionary definition: _refine_ls_R_Fsqd_factor

    Name: '_refine_ls_R_Fsqd_factor'

    Definition: Residual factor R(Fsqd), calculated on the squared amplitudes of the observed and calculated structure factors, for significantly intense reflections (satisfying _reflns_threshold_expression) and included in the refinement. The reflections also satisfy the resolution limits established by _refine_ls_d_res_high and _refine_ls_d_res_low.

    \[ R(F_{\mathrm{sqd}}) = \frac{\sum \left| F_{\mathrm{obs}}^{2} - F_{\mathrm{calc}}^{2} \right|}{\sum F_{\mathrm{obs}}^{2}} \]

    where $F_{\mathrm{obs}}^{2}$ are the squares of the observed structure-factor amplitudes, $F_{\mathrm{calc}}^{2}$ are the squares of the calculated structure-factor amplitudes, and the sum is taken over the specified reflections.

    The permitted range is 0.0 to infinity

    Type: numb

    Category: refine
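The definition above can be sketched in Python in two parts: computing R(Fsqd) from lists of squared amplitudes, and checking a reported value against the type and permitted range stated in the dictionary entry. This is an illustration under stated assumptions, not the IUCr validation software: the in-memory `DICTIONARY` layout and the names `r_fsqd` and `check_item` are hypothetical, and the reflection values in the usage example are invented.

```python
def r_fsqd(f_obs_sq, f_calc_sq):
    """R(Fsqd) = sum|Fobs^2 - Fcalc^2| / sum(Fobs^2) over the supplied reflections."""
    if len(f_obs_sq) != len(f_calc_sq):
        raise ValueError("observed and calculated lists must have equal length")
    denominator = sum(f_obs_sq)
    if denominator == 0:
        raise ValueError("sum of observed squared amplitudes is zero")
    numerator = sum(abs(obs - calc) for obs, calc in zip(f_obs_sq, f_calc_sq))
    return numerator / denominator


# Hypothetical in-memory form of the dictionary entry shown on the slide.
# maximum=None stands for "no upper limit" (the range 0.0 to infinity).
DICTIONARY = {
    "_refine_ls_R_Fsqd_factor": {
        "type": "numb",
        "minimum": 0.0,
        "maximum": None,
        "category": "refine",
    },
}


def check_item(tag, raw_value, dictionary=DICTIONARY):
    """Return a list of problems found for one data item (empty list = no problems)."""
    entry = dictionary.get(tag)
    if entry is None:
        return [f"{tag}: not defined in the dictionary"]
    problems = []
    if entry["type"] == "numb":
        # Drop a trailing standard uncertainty such as "(3)" before converting.
        number_text = raw_value.split("(", 1)[0]
        try:
            value = float(number_text)
        except ValueError:
            return [f"{tag}: '{raw_value}' is not a number"]
        if entry["minimum"] is not None and value < entry["minimum"]:
            problems.append(f"{tag}: {value} is below the permitted minimum {entry['minimum']}")
        if entry["maximum"] is not None and value > entry["maximum"]:
            problems.append(f"{tag}: {value} is above the permitted maximum {entry['maximum']}")
    return problems


if __name__ == "__main__":
    # Invented squared amplitudes for three reflections.
    print(r_fsqd([100.0, 50.0, 25.0], [90.0, 55.0, 25.0]))
    print(check_item("_refine_ls_R_Fsqd_factor", "0.045"))
    print(check_item("_refine_ls_R_Fsqd_factor", "-0.1"))
```

Because the constraints live in data rather than in code, the same `check_item` routine would work unchanged for any other numeric item added to the dictionary, which is the point of keeping the dictionaries machine-readable.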

  • Reader assessment

  • Quality control at source

    checkCIF: http://checkcif.iucr.org
    Described at http://journals.iucr.org/services/cif/datavalidation.html
    Free public service
    Sponsored by publishers and databases
    Over 340 separate tests

  • Validation of macromolecule structures

  • Data publication increasingly at source

    Small-molecule crystallography often high throughput; thus only a subset of results get into the literature (?5 to 10%?)
    There is a rise of local/national laboratory data repositories
    Examples: eBank (Southampton, UK + 5 other sites); Reciprocal Net (Indiana, USA + 18 other sites)

  • eBank

    ePrints repository
    OAI-PMH
    Standard metadata
    All data
    Links to publication
    Rights
    Quality

  • Online Dictionary Project

    Use wiki approach (à la Wikipedia) to realise community-agreed dictionary terms
    Pilot stage started September 2005
    Led by Emeritus Professor André Authier, Chair of the IUCr Nomenclature Commission

  • Summary

    Quality of scientific argument depends on:
        Quality of data
        Critical appraisal
        Accessibility of relevant data
        Precision of definitions
        Rigorous analysis

    IUCr publications strive to provide the highest quality in all these areas so as to inform the Editorial process including the peer review

  • Acknowledgements

    Peter Strickland, Managing Editor at IUCr, Chester.
    Brian McMahon, R&D Technical Development Officer at IUCr, Chester.

    The determination of molecular structures of biological and chemical interest is a major role of applied crystallography worldwide today. Half a million crystal structures of 'small' molecules (200 atoms or fewer) and over 35000 protein and nucleic acid crystal structures are stored in several curated databases. Many more structures have been determined but never published; increasingly these are being disseminated over the web. The crystallographic community is energetic in its efforts to uphold quality standards in a diverse scientific environment. The development of a crystallographic information file (CIF) and associated data dictionaries has allowed the seamless transfer of information for deposition and publication. It also allows the definition of formal publication data quality standards, and the deployment of mechanisms for checking compliance with such standards, such as the IUCr checkCIF service. The role of IUCr Journals in maintaining quality, and the possibilities provided by the CIF dictionaries for semantic web applications, will be discussed.

    This slide provides an overview of the structure of the presentation. There will be some general remarks on the importance of basing scientific arguments on high-quality data and analysis. A description will be given of the general features of crystallographic research which make crystallography an especially suitable field for developing quality control protocols. A prerequisite for this is a precise and extensive description of the data types and attributes required within the science; the Crystallographic Information Framework (CIF) provides this. Armed with rigorous definitions of data items and their relationships, journals can specify mandatory content and draw up quality guidelines for acceptability of reported results. The formal framework allows development of quality-assessment software, pushing the burden of quality control back towards the scientist (and, incidentally, acting as a useful check on the authors' understanding of the subject, and as a teaching aid to remedy deficiencies). Widespread access to standard quality-control software allows the adoption of community-wide standards, and encourages the publication of data by individuals outside the formal publication channels, but with the same quality associated with peer-reviewed research reports. Finally, there are some comments on how experience of maintaining data quality can inform efforts to improve the quality of the literature in general.

    The quantities of crystal structure determinations released into the public domain are not overwhelming, but are nonetheless substantial. A number of organizations collect and curate data from the scientific literature; many also accept private depositions (e.g. from industrial or pharmaceutical companies) that are not normally published in the open literature. Traditionally the databases have critically evaluated the data they collected, often detecting errors or inconsistencies in the original publication.
The annotation of biological structures by the Protein Data Bank has been especially useful in relating structures to biological function, sequence databases and other applications. Many structures are reported briefly in the literature, but the IUCr publishes several journals dedicated to complete reports; these are all reviewed closely and the journals have a reputation for high quality. IUCr journals publish about a tenth of all small-molecule structures reported annually. Macromolecular structural science is a much younger discipline, but more structures are being reported in IUCr journals as the field grows. It is expected that proteins expressed during structural genomics research efforts will account for rapidly rising numbers of publications in the next few years.

    The homogeneity of crystallographic experiments suggested to the IUCr as far back as the 1980s the need for a standard data exchange and archiving format, and some early versions were produced that later evolved into the Crystallographic Information File (CIF) standard in 1991. CIF is now used as an acronym for the Crystallographic Information Framework, a series of exchange formats and data definitions that cover broad areas of crystallography, and that are under active development and extension. While the original CIF format is still used by the majority of established crystallographic software, equivalent XML representations are also available for modern data-exchange requirements.
However, the most important component of CIF is the set of data dictionaries, which provide rigorous definitions of individual data items.

    A typical dictionary definition (modified for display purposes) demonstrates: the name of a tag that is used within a data file to indicate the specific item of data; the definition of that item (with mathematical description and hyperlinks to related items where appropriate); the data type; its permitted numerical range; and a statement of its role in the classification scheme of the dictionaries. Other attributes are available for other definitions. The most important thing about these dictionaries is that they are machine-readable, and appropriate software can apply the constraints specified in the dictionaries to values of the corresponding data items found in data files. Developments are in hand to extend this mechanism to allow generic dictionary-reading software to apply complex algorithms specified in the dictionaries to the input data. This functionality goes far beyond the types of validation available to standard XML tools.

    Although the editors and referees invest much effort in the peer review process, the availability of a full set of accompanying data provides added value for the reader of the article.
For example, for any article published in Acta Crystallographica Section E: Structure Reports Online, the reader may assess fully the scientific argument by: (1) reading the text of the article; (2) accessing the full CIF (which will include unpublished data such as the three-dimensional atomic coordinates, and a complete listing of bond lengths and angles); (3) reviewing key indicators and the validation report (in this example the authors' response to a significant problem identified in the validation review is presented); (4) retrieving the primary experimental data to allow an independent redetermination of the structure; or (5) visualizing and manipulating the data in a crystallographic application of choice.

    The software used to create the IUCr journals' review reports also powers a public web service, checkCIF, that is freely accessible to anyone (not just authors submitting to IUCr journals). checkCIF is sponsored by other publishers and by curators of crystallographic databases. It creates reports on the consistency of structures similar to the review reports, and these are sometimes requested by referees for other publishers' journals. The complete set of checks is publicly documented on the web, and provides a de facto community standard for assessing the quality of a reported crystal structure. Prospective authors of IUCr journal articles are encouraged to pre-check their structures as often as they wish using the checkCIF service before submission. The availability of such a service places responsibility for quality control directly in the hands of the originating scientists.

    This slide demonstrates extracts from a validation report prepared during pilot testing of new submission procedures for Acta Cryst. F.
The reviewer will also be provided with graphics and VRML files displaying three-dimensional representations of the protein structure.

    Although the volume of published structure reports is rising more rapidly than ever before, crystal structure determinations are increasingly less likely to be written up in the scientific literature. Often a structure determination is performed as part of a chemistry research project, and not as an end in itself. While the results of the chemistry experiments may be written up as a research paper, the individual structures may be left aside for future publication, or forgotten altogether. Often, too, structures are determined to a level of quality that is sufficient to support or justify a chemical argument, but are not refined to the stringent levels required by leading journals. To address the problem of important data failing to enter the public domain, some service crystallography facilities, which undertake numerous structure determinations on behalf of chemistry departments and other clients, are now using open-access repository software to make crystallographic data available over the web.

    Among the most active of the laboratory repositories is eBank, the University of Southampton/National Crystallography Service server. Here, subject to appropriate permissions, details of structures determined within the laboratory are made available under an open-access architecture. The platform supports the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), allowing automated harvesting of the metadata describing individual entries. Among the applications envisaged to use this facility are: ingest of entries to the Cambridge Structural Database and other curated structural databases; federated searching and archiving services; linking to published articles.
To facilitate these objectives, a programme is under way to define suitable metadata to allow automated processing by chemical or other scientifically-aware information services. The eBank implementation provides links to all the supporting data collected during the experiment (including links to the primary data images, archived in a national large-scale data facility). The IUCr journals are investigating methods of providing active links between subsequently published articles and the corresponding eBank records. The adoption of open protocols and data exchange standards means that there are few technical barriers to the development of this as a successful new publication medium. However, the rights to ownership and dissemination of the data, collected and analysed as they are in a client/provider relationship, need further careful consideration and handling. In a framework where the published information has not (necessarily) undergone peer review, the eBank developers have addressed quality concerns by analysing each structure with the IUCr checkCIF software and providing a link to the relevant report.
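
The metadata harvesting described above can be sketched with the Python standard library. The `verb=ListRecords` and `metadataPrefix=oai_dc` request parameters and the XML namespaces are those defined by OAI-PMH and Dublin Core; everything else is an assumption made for illustration. The base URL is hypothetical (it is not the eBank address), the sample response is invented, and the use of `dc:relation` to carry a link to a validation report is a guess at one possible metadata convention, not a description of the eBank schema. No network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "https://repository.example.org/oai"  # hypothetical repository endpoint

NAMESPACES = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented response, shaped like an OAI-PMH ListRecords reply carrying Dublin Core.
SAMPLE_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:1</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example crystal structure</dc:title>
          <dc:relation>https://repository.example.org/checkcif/1</dc:relation>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the URL a harvester would request to list all records."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def summarise_records(xml_text):
    """Return one dict per record with its identifier, title and related links."""
    root = ET.fromstring(xml_text)
    summaries = []
    for record in root.iterfind(".//oai:record", NAMESPACES):
        summaries.append({
            "identifier": record.findtext("oai:header/oai:identifier", namespaces=NAMESPACES),
            "title": record.findtext(".//dc:title", namespaces=NAMESPACES),
            "relations": [el.text for el in record.iterfind(".//dc:relation", NAMESPACES)],
        })
    return summaries


if __name__ == "__main__":
    print(list_records_url(BASE_URL))
    for summary in summarise_records(SAMPLE_RESPONSE):
        print(summary)
```

A curated database or a journal could run such a harvester periodically and follow the related links to retrieve the structure data and its validation report, which is the kind of automated linking the paragraph above envisages.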