Use of Data Standards and Metadata in Information Exchange
Rachel Uphill
From Big Data to Chemical Information Meeting
Use of Data Standards and Metadata in Information Exchange
• Understanding how the plethora of new data and improved analytical techniques can enhance future innovation and feed the drug-development pipeline
• Implementing ontologies, standards, strategies and collaborations to enhance products and provide wider value
• Discovering what your peers are prioritizing as part of their own Big Data strategies
Big Data in Drug Discovery
25 Million Citations
Over 16,000 Organisms
20,000² Interactions
>1 million gene expression profiles
HTS: 1 million reactions/day
Over 40000 metabolites
186,000 trials
Complex and Disparate
Increasing Data Size + Increasing Data Dimensionality = Increasing Complexity
• Data points are growing rapidly and endpoints are unclear
• Disparate datasets
• Data integration through increased use of Contract Research Organisations (CROs)
• Complex analytics required on ever-increasing data
Master Data Management (MDM)
Vision Strategy Plan Execute Summary Endpoint
MDM delivers one version of the truth across each of the integrated solutions; this is made easier through the use of standards and ontologies/metadata
Metadata: Project, Experiment, Protocol, Substance, Results
Integrated systems and re-use of information
Planning and designing our projects and study protocols will be simpler leading to more time to focus on scientific aspects
Entering information once and making it more accessible across the organisation
Facilitating enterprise working & decision-making
Bringing the regions “closer” and collaborating further with our scientists
Starting out on the journey …
• To maximise the value of Pharma R&D data there are some key disciplines that IT has been investing in:
– Blue Printing: understand what you’ve got and what you need
• If you were building a house you would first start with the plans
– Stewardship & Governance: who it matters to in the business if it’s wrong
• You’d consult with the planning authorities to make sure everything was acceptable; if it wasn’t, you’d get it corrected
– Quality: how you know it’s wrong
• You wouldn’t use an industrial boiler to heat your house
– Master & Reference data management: a single source of the truth
• You’d check with the land registry to make sure the boundaries of the property were accurate before building an extension
– Standards: understand what you have been asked for
• You’d consult the building regulations and make sure all documents that are created contain the required data
– Search and Analytics: you’d only trust your search results and analytical answers once you knew all these were correct
• You’d build the chimney once you’d completed the foundations and structure of the building
Information Blueprinting
• What is an Information Blueprint?
– Documentation of the information landscape illustrating the flow and structure of information at a level of detail appropriate for the audience.
• Modelling the key business processes, understanding the inputs and outputs of those processes and identifying the upstream suppliers of data and the downstream consumers.
• Modelling the high-level information concepts for a broad audience and the detailed data structures for a technical development and data governance audience.
• Identifying transactional use of data as well as secondary reporting and analytical uses of data.
– Information Blueprints are collections of models created using a number of modelling techniques that document the business processes, data structures and systems landscape.
• Major benefits of an Information Blueprint
– Provides a business-process-based view of data usage and information requirements. Highlights gaps in information provision and enables impact analysis of changes to the information supply chain.
– Provides the framework for assessing and influencing business data quality.
– Facilitates communication within the business and between the business and IT.
– Encourages the development and use of common terminology by clarifying definitions & synonyms.
– Provides the common business understanding for use and re-use across IT systems
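The impact analysis that a blueprint enables can be illustrated with a minimal Python sketch. It assumes the supplier-to-consumer relationships from the blueprint have been captured as a simple mapping; the system names (`ELN`, `LIMS`, `Data Warehouse` and so on) and the function `downstream_consumers` are hypothetical and are not taken from any actual blueprint.

```python
from collections import deque

# Hypothetical information supply chain: each system maps to the systems
# that consume its data directly. The names are illustrative only.
SUPPLY_CHAIN = {
    "ELN": ["LIMS", "Data Warehouse"],
    "LIMS": ["Data Warehouse", "Regulatory Submission"],
    "Data Warehouse": ["Search", "Analytics"],
    "Regulatory Submission": [],
    "Search": [],
    "Analytics": [],
}


def downstream_consumers(source, supply_chain):
    """Return every system affected, directly or indirectly, by a change to `source`."""
    affected = []
    seen = {source}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for consumer in supply_chain.get(current, []):
            if consumer not in seen:
                seen.add(consumer)
                affected.append(consumer)
                queue.append(consumer)
    return affected


if __name__ == "__main__":
    # Which systems would need review if the LIMS data structure changed?
    print(downstream_consumers("LIMS", SUPPLY_CHAIN))
```

A breadth-first walk is used so that direct consumers are listed before indirect ones, which is usually the order in which a change would be reviewed.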
Information Blueprinting as a Service
What Do We Mean By Stewardship And Governance?
• Stewardship … “formalizing accountability for the management of data resources for a subset of enterprise data.” (1)
• Data Governance … “the execution and enforcement of authority over the management of data assets and the performance of data functions.” (1)
• Master Data Management … the people, processes and technology that ensure a single managed view of critical enterprise information.
(1) “Non-Invasive Data Governance”, Copyright © 2008 Robert S. Seiner – KIK Consulting & Educational Services/TDAN.com
Information has value and purpose beyond a single use, system or report. We need to support this with elements of
good Stewardship and Governance.
Data Governance and Stewardship
Good Data Stewardship and Governance encompasses:
• Capability to effectively find, understand and use data and documents
• Data standards – the nature of our standards, how we wish to apply them in GSK
• Business rules and processes to support the creation and management of data and documents
• Data disposition – business rules for versions retained, storage, labelling, tracking of data and documents and compliance to these rules
• Business model to support data stewardship and increase the capability for data stewardship within the business lines – responsibilities, resource, utilities
Data Quality
Master Data Management
Reference Data Management
• An ontology is a set of related vocabularies/taxonomies specific to an area; it enables Knowledge Management by allowing us to create a hierarchy of related terms
• When implementing ontologies we need to understand the model, the governance/best practice and the quality of entries
• Ontologies allow us to link disparate data sets together through related terminologies
• They are easily extensible and can be embedded within standards and frameworks
– Allows for organisational mappings
• Relevant Pharma Ontology Usages
– Reactome, available in an RDF format, allowing organisations to link pathway information utilised in Metabolomics and relate disease, target and chemical information
– Allotrope, an affiliation of ontologies associated with the analytical chemistry space, for example equipment
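How a hierarchy of related terms can link two datasets that use different labels can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the terms, the synonym table and the function names (`preferred_term`, `ancestors`, `related`) are hypothetical, and a real ontology such as Reactome or Allotrope would be loaded from its published files rather than typed in as a dictionary.

```python
# Hypothetical mini-ontology: each term points to its parent (None marks a root).
PARENT = {
    "disease": None,
    "respiratory disease": "disease",
    "asthma": "respiratory disease",
    "COPD": "respiratory disease",
}

# Hypothetical synonyms mapping local labels onto the preferred term.
SYNONYMS = {
    "chronic obstructive pulmonary disease": "COPD",
    "bronchial asthma": "asthma",
}


def preferred_term(label):
    """Resolve a label to its preferred ontology term, or None if unknown."""
    term = SYNONYMS.get(label, label)
    return term if term in PARENT else None


def ancestors(term):
    """List the broader terms of `term`, nearest first."""
    chain = []
    parent = PARENT[term]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def related(label_a, label_b):
    """Return the nearest term shared by both labels' lineages, or None."""
    a, b = preferred_term(label_a), preferred_term(label_b)
    if a is None or b is None:
        return None
    lineage_b = set([b] + ancestors(b))
    for candidate in [a] + ancestors(a):
        if candidate in lineage_b:
            return candidate
    return None


if __name__ == "__main__":
    # Two datasets using different labels are linked through a shared broader term.
    print(related("bronchial asthma", "chronic obstructive pulmonary disease"))
```

The synonym table plays the role of the organisational mappings: each organisation keeps its own labels while still resolving to the shared terms.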
Vocabularies, Taxonomies and Ontologies
Standards and Frameworks
• For organisations to be able to exchange data we need to agree not only on the structure but also on what is meant by the structure and on the business rules associated with the data
• To support this we can use the concepts we have already discussed: Information Blueprints, Stewardship & Governance and Data Quality
• The additional, and probably most important, aspect is our Master and Reference data, where our Metadata will be held (the drop-down lists in our everyday applications)
• To make data exchange easier we need to agree on terminology and, if we can’t, then support mappings/synonyms embedded within the use of the standard, so that we can integrate datasets at a later point
• In the growing world of information exchange and data types there is an ever increasing number of standards and frameworks to implement these
– ISA-88 and ISA-95, associated .xsd formats BatchML and B2MML
– Pistoia, HELM, Standard Metadata, Ontology, Standard Data Warehouse projects
– Allotrope, Analytical Chemistry Framework
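The idea of agreeing on terminology, and falling back to mappings/synonyms where agreement is not possible, can be sketched as follows. This is a minimal illustration, not part of any of the standards named above: the agreed unit list, the synonym table and the function `standardise` are all hypothetical.

```python
# Hypothetical agreed vocabulary for one field of an exchange standard.
AGREED_UNITS = {"mg/mL", "mcg/mL", "percent"}

# Hypothetical organisation-specific synonyms mapped onto the agreed terms.
UNIT_SYNONYMS = {
    "mg/ml": "mg/mL",
    "ug/mL": "mcg/mL",
    "%": "percent",
}


def standardise(records, field, agreed, synonyms):
    """Map `field` in each record onto the agreed vocabulary.

    Returns (clean, rejected): records whose value is, or maps to, an agreed
    term, and records whose value could not be mapped.
    """
    clean, rejected = [], []
    for record in records:
        value = record.get(field)
        mapped = value if value in agreed else synonyms.get(value)
        if mapped is None:
            rejected.append(record)
        else:
            clean.append({**record, field: mapped})
    return clean, rejected


if __name__ == "__main__":
    incoming = [
        {"id": 1, "unit": "mg/ml"},
        {"id": 2, "unit": "percent"},
        {"id": 3, "unit": "ppm"},
    ]
    clean, rejected = standardise(incoming, "unit", AGREED_UNITS, UNIT_SYNONYMS)
    print(clean)
    print(rejected)
```

Rejected records are returned rather than dropped so that a data steward can decide whether a new synonym or a new agreed term is needed.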
Allotrope : Advanced Data Design for Chemical R&D
© 2014 Allotrope Foundation
Effect: Document Preparation; Data Errors; Data Exchange; Data Management; Regulatory Compliance; Innovation Constrained; Data Silos
Root Cause (Current Software Environment): Incomplete, Incompatible Software; No Standard File Formats; Inconsistent Metadata (gaps, complexity, incompatible software)
Fix the Root Cause (Allotrope Foundation Framework): Reusable Software Components; Open Document Standard; Open Metadata Repository (efficient, innovative, powerful software)
Outcomes: Automated Documents; Eliminate Data Errors; Fast Data Exchange; Better Data Management; Innovative Ecosystem; Regulatory Compliance; Eliminate Data Silos
Scientists focus on science
Drug Development
Doctor/Patient
Clinical Data Forms
Clinical Outcome
HL7
HL7: eStability
eCTD
CDISC
Analytical CMC
CDER Data Standard
Allotrope Framework
Class Libraries
Metadata Repository
Workflow Automation
Information Access
Data Standards
Archiving
Service Standards
Allotrope Framework addresses the gap in standards for CMC analytical data
Data prepared for submission in standard format
Data Review
Data Entry
Analytical Testing
Organisations shown: UN/CEFACT, W3C, ISO, OMG, Allotrope Foundation, ASTM, HL7, MESA, IETF, UPU, Adobe, OASIS, LC, NISO, UKOLN, OAI, UNECE, DDI Alliance, UNSC, Dublin Core Metadata Initiative, JISC, DNB, IMS Global, FOAF Project, ANSI, JPEG, NIST, IHE, SAA, CDISC
The landscape of standards potentially useful for the Framework
"The nice thing about standards is that there are so many to choose from."
Andrew S. Tanenbaum
DISCLAIMER
This is work in progress.
It is not a complete list of standards but a tool for researching the standards.
Allotrope is investigating numerous standards, but this graphic is not intended to represent the standards Allotrope is committing to include in the framework.
Standards shown in the graphic (version and year where given):
UN/CEFACT Core Components Technical Specification 3.0
BatchML
OWL 2.0
ISO 11179 (Metadata Registry), 1999
ISO 19763 (Metamodel Interoperability), 2013
RDF 1.0
SKOS, 2012
Common Warehouse Metamodel 1.1, 2003
Common Terminology Services 2, version 1.1, 2013
ISO 25694 (Thesauri)
Unified Modeling Language 2.4.1, 2012
AnIML 2.0
ISO 12000 (MARTIF)
ISO 19773 (Metadata Registry Modules)
RFC 2421 (Voice Profile), version 2, 1998
ISO 1087 (Terminology Vocabulary), 2000
ISO 11404 (General Purpose Datatypes), 2007
ISO 20944 (MDRIB), 2013
UPU S42-1 (Postal address components), 2003
ISO 2832 (IT Vocabulary), 1996-2000
ISO 9899 (Programming Languages C), 1999
ISO 9945 (Filenames)
RFC 3986 (URI), 2005
ISO 10646 (Unicode)
ISO 646 (IA5 character code)
ISO 19107 (Geographic Information)
ISO 16684-1 (XMP), 2012
ISO 639 (Language Codes)
ISO 3166 (Country Codes)
RFC 2046 (MIME Types)
RFC 3066 (Language Codes)
ebXML Registry Information Model 2, 3.0, 2005
ebXML Registry Services Specification 2.0, 2001
genericode 1.0, 2007
RFC 2119 (Requirement Keywords), 1997
CMIS 1.1, 2012
RFC 2616 (HTTP) 1.1, 1999
RFC 3023 (XML Media Types), 2001
RFC 2045 (MIME Format)
RFC 4287 (Atom Syndication)
RFC 5023 (Atom Publishing)
RFC 4918 (WebDAV)
XML Schema Datatypes, 2004
OData 4.0
ebXML RegRep 4.0, 2012
ISO 15000-3 (ebRIM), 2004
XPath 2.0, 2007
XMLDSig, 2001
XLink 1.1, 1999
SOAP 1.2, 2003
ISO 19915 (Geographic Information Metadata)
ISO 19119 (Geographic Information Services), 2005
MARC 21 XML Schema 1.2, 2009
MIX 2.0, 2006
PREMIS 2.2, 2012
Metadata Object Description Standard 3.5, 2013
Metadata Authority Description Standard 2.0, 2012
ISO 25577 (Information and Documentation - MarcXchange)
ISO 20775 (Information and Documentation - Schema for Holdings Information)
searchRetrieve 1.0, 2013
Search/Retrieval via URL 2.0
Contextual Query Language 1.2
Dublin Core Metadata Element Set 1.1
Encoded Archival Description 2002, 2002
Text Encoding Initiative
DDI Codebook 2.5
OAI Protocol for Metadata Harvesting 2.0, 2002
OAI Object Reuse and Exchange 1.0, 2008
SPARQL 1.1, 2013
ISO 704 (Terminology - Principles and methods), 2000
ISO 19504 (Common Warehouse Metamodel)
Statistical Data and Metadata Exchange 2.1, 2011
Common Metadata Framework
DDI Lifecycle 3.1
EDIFACT
Meta Object Facility 1.4.1, 2005
Ontology Definition Metamodel 1.0, 2009
Information Management Metamodel
UML Profile & Metamodel for Services 1.0.1, 2012
Semantics of Business Vocabulary and Business Rules 1.2, 2013
ISO 6093 (Number Namespace)
Metadata Encoding & Transmission Standard 1.10, 2013
ISO 15000-4 (ebRS), 2004
ISO 15489 (Records Management), 2001
ISO 23081 (Metadata for records), 2006
ISO 16363 (Audit and Certification of Trustworthy Digital Repositories), 2011
ISO 14721 (OAIS), 2012
ISO 15836 (DCMES)
SWORD 2.0, 2008
BagIt
ARK Identifiers
ISO 26324 (Digital Object Identifier), 2012
RFC 3652 (Handle System Protocol) 2.1, 2003
RFC 3650 (Handle System Overview), 2003
RFC 3651 (Handle System Namespace and Service Definition), 2003
ISO 13120 (ClaML), 2013
ISO 27951 (CTS1), 2009
ISO 27527 (Provider Identification), 2010
ISO 27932 (HL7 Clinical Document Architecture), 2009
ISO 27931 (HL7), 2009
ISO 17115 (Vocabulary for terminological systems), 2007
LMER 1.2
RFC 2141 (URN Syntax), 1997
RFC 1737 (URN Requirements), 1994
RFC 4122 (UUID URN Namespace), 2005
ISO 20652 (PAIMAS), 2006
IMS Content Packaging 1.2
Z39.50 (Information Retrieval), version 4, 2003
ISO 2709 (Format for information exchange), 2008
MARC 21
EAD 2002
FOAF Vocabulary 0.99, 2014
RDF Best Practices
CoolURIs
RDF Vocabulary Description Language 1.0, 2004
Extensible Resource Identifier 2.0, 2005
RFC 2234 (ABNF), 1997
RFC 3987 (IRI), 2005
RFC 3305 (URI, URL, URN Clarifications), 2002
RFC 2396 (URI), 1998
XRI Data Interchange 2.0, 2005
ISO 14533-2 (XAdES), 2012
Canonical XML 1.0, 2001
Universal Business Language 2.1, 2013
ISO 14662 (Open-edi), 2010
ISO 15000-5 (CCTS), 2005
Z39.88 (OpenURL), version 1, 2004
Z39.85 (DCMES), version 1, 2001
ISO 8601 (Dates and Times), 2000
ISO 62264 (B2MML), 2003-2008
ISA 95, 2001-2005
ISA 88
ISO 21000-2 (MPEG-21 DID), 2005
ISO 21000-6 (MPEG-21 RDD), 2004
ISO 21000-7 (MPEG-21 DIA), 2007
ISO 21000-9 (MPEG-21 Fileformat), 2005
ISO 21000-18 (MPEG-21 Streaming), 2007
ISO 14496-12 (base media file format), 2012
RFC 6481 (Codecs), 2011
ISO 21000-3 (MPEG-21 DII), 2003
TIFF 6.0, 1992
ISO 15444-1 (JPEG2000), 2004
UnitsML 1.0, 2011
hData 1.0, 2013
RLUS 1.0.1, 2011
LECIS 1.0, 2003
ISO 21090 (Health informatics data types)
XDS
SVS
XUA
SAML 2.0, 2008
XACML 3.0, 2013
ASTM E1986 (Access Privileges to Health Info), 2013
ASTM E1869 (Confidentiality, Privacy, Access and Data Security), 2010
ISO 19005-1b (PDF/A)
CDA 2, 2008
ISO 19510 (BPMN 2.0), 2013
BPMN 2.0.1, 2011
BRIDG 3.2
Define-XML 2.0, 2013
ADaM 2.1
SDM-XML 1.0
CDISC-ODM 1.3.2
SEND 3.0
LAB 1.0.1
ISO 28500 (WARC), 2009
RFC 3629 (UTF-8), 2003
ISO 17025 (Competence of laboratories), 2005
Analytics Goal
– Implement a strategy that addresses R&D’s need to use Big Data through Informatics to enhance decision making, simplify the Operating Model and ensure efficiencies, in order to Deliver More Products of Value
• Goal: Maximize insight from the minimum necessary data
– Standards, MDM, Data Quality, Data Governance & Stewardship and Blue Printing are some of the foundational capabilities to do this
Big Data Informatics
• IT & Informatics Technology: Search. Analytics. Visualization.
• GSK Data: Clinical Data; Discovery Data; Non R&D Data (e.g. competitive intelligence)
• External Data: Electronic Health Records; Academic Institutes (EBI, Broad, etc.); Public / Private Partnerships (IMI etc.); Publications / Public Standards
• R&D Culture: Knowledge centered organization. Increased Re-Use of Knowledge. Knowledge as our competitive asset.
• High Value Questions
SOCRATES SEARCH
Scientific searching for R&D
Text and chemical searches
Internal and External data sources linked through ontologies
More access to R&D content
Phase II – Summarise GSK Experience of Compounds, Targets
• Provide overview of compounds using traditional drug Discovery-Development chevrons, allow drilldown into the raw data.
• Integrated into Socrates Search, whenever a compound or medicine is ‘detected’ in search results
• Targetpedia : existing system integrated and extended into Socrates
Source systems: Early systems; Preclinical data systems; Clinical systems; External Datasets
Socrates Search: federated data aggregation (supported with ontologies)
Milestones: Target Selection; Candidate Selection; Commit to Medicine Dev
Stages: Target ID; Lead Discovery; Lead Optimisation; Preclinical Development; POC; Full Development
Socrates Targetpedia and Compoundpedia
Targetpedia
Compoundpedia
Exporting to Excel
GSK Search
Improved GSK Search
Socrates Scientific Search
Socrates Target and Compoundpedias
Structured public data
Advanced “Expert Systems”
Integrating Predictive Systems
Integrating Knowledge
Delivery
Value
Data Integration
Advanced Analytics
Simple Analytics
Text Analytics
Data Quality
Ontologies
2013
What’s next? …
NLP
2012
2014
2015