semantics for integrated analytical laboratory processes – the allotrope perspective
TRANSCRIPT
SEMANTiCS, Industry, Vienna 2015, 16-17 September
Semantics for Integrated
Analytical Laboratory Processes
The Allotrope Perspective
Heiner Oberkampf
slide 2
Agenda
Introduction
Approach and IT-Solution
Allotrope Data Format
Domain Taxonomies
Data Cube Ontology
Integration Projects
slide 4
High Variability of Result Data
chromatography pH thermogravimetry
HPLC-MS-MS
…
mass spectroscopy HPLC-MS
cell counter NMR
slide 5
Laboratory Analytical Processes
application 1
application 2 application 3
result data and
process meta-data
slide 6
Common Problems
It’s hard to find data
based on intuitive starting
points [e.g. study, project,
analyst, technique]
It’s hard to integrate
data from different
labs, instruments, or
online/offline sources
because the file formats
differ
It’s hard to mine a collection of
data because the details and the
context of the experiment are
stored somewhere else
Can’t interpret data later because the context is
incomplete, inconsistent, often free text
Instrument & software
interoperability is
limited…at best
slide 7
Landscape of Existing Standards
"The nice thing about standards is that
there are so many to choose from."
Andrew S. Tanenbaum
DISCLAIMER
This is work in progress.
It is not a complete list of standards but a tool for researching the standards.
Allotrope is investigating numerous standards, but this graphic is not intended to represent standards Allotrope is committing to include in the framework.
UN/CEFACT Core Components Technical
Specification
3.0
Batch ML
W3C
OWL
2.0
ISO
ISO 11179 (Metadata Registry)
1999
ISO 19763 (Metamodel
Interoperability)
2013
RDF
1.0
SKOS
2012
OMG
Allotrope
Foundation
Common Warehouse Metamodel
1.1
2003
Common Terminology Services 2
1.1
2013
ISO 25964 (Thesauri)
Unified Modeling Language
2.4.1
2012
ASTM
AnIML
2.0
HL7
HL7
ISO 12200 (MARTIF)
MESA
ISO 19773 (Metadata Registry
Modules)
IETF
RFC 2421 (Voice Profile)
2
1998
ISO 1087 (Terminology
Vocabulary)
2000
ISO 11404 (General Purpose
Datatypes)
2007
ISO 20944 (MDRIB)
2013
UPU S42-1 (Postal address
components)
2003
ISO 2382 (IT Vocabulary)
1996-2000
UPU
ISO 9899 (Programming
Languages C)
1999
ISO 9945 (Filenames)
RFC 3986 (URI)
2005
ISO 10646 (Unicode)
ISO 646 (IA5 character code)
ISO 19107 (Geographic
Information)
ISO 16684-1 (XMP)
2012
Adobe
ISO 639 (Language Codes)
ISO 3166 (Country Codes)
RFC 2046 (MIME Types)
RFC 3066 (Language Codes)
OASIS
ebXML Registry Information
Model 2
3.0
2005
ebXML Registry Services
Specification
2.0
2001
genericode
1.0
2007
RFC 2119 (Requirement
Keywords)
1997
CMIS
1.1
2012
RFC 2616 (HTTP)
1.1
1999
RFC 3023 (XML Media Types)
2001
RFC 2045 (MIME Format)
RFC 4287 (Atom Syndication)
RFC 5023 (Atom Publishing)
RFC 4918 (WebDAV)
XML Schema Datatypes
2004
OData
4.0
ebXML RegRep
4.0
2012
ISO 15000-3 (ebRIM)
2004
XPath 2.0
2.0
2007
XMLDSig
2001
XLink 1.1
1.1
1999
SOAP 1.2
1.2
2003
ISO 19115 (Geographic
Information Metadata)
ISO 19119 (Geographic
Information Services)
2005
LC
MARC 21 XML Schema
1.2
2009
MIX
2.0
2006
PREMIS
2.2
2012
NISO
Metadata Object Description
Standard
3.5
2013
Metadata Authority Description
Standard
2.0
2012
ISO 25577 (Information and
Documentation - MarcXchange)
ISO 20775 (Information and
Documentation - Schema for
Holdings Information)
searchRetrieve
1.0
2013
Search/Retrieval via URL
2.0
Contextual Query Language
1.2
Dublin Core Metadata Element
Set
1.1
UKOLN
Encoded Archival Description
2002
2002
Text Encoding Initiative
DDI Codebook
2.5
OAI Protocol for Metadata
Harvesting
2.0
2002
OAI
OAI Object Reuse and Exchange
1.0
2008
SPARQL
1.1
2013
ISO 704 (Terminology - Principles
and methods)
2000
UNECE
ISO 19504 (Common Warehouse
Metamodel)
Statistical Data and Metadata
Exchange
2.1
2011
Common Metadata Framework
DDI Alliance
DDI Lifecycle
3.1
UNSC
EDIFACT
Meta Object Facility
1.4.1
2005
Ontology Definition Metamodel
1.0
2009
Information Management
Metamodel
UML Profile & Metamodel for
Services
1.0.1
2012
Semantics of Business Vocabulary
and Business Rules
1.2
2013
ISO 6093 (Number Namespace)
Metadata Encoding &
Transmission Standard
1.10
2013
ISO 15000-4 (ebRS)
2004
ISO 15489 (Records
Management)
2001
ISO 23081 (Metadata for records)
2006
ISO 16363 (Audit and Certification
of Trustworthy Digital Repositories)
2011
ISO 14721 (OAIS)
2012
Dublin Core
Metadata
Initiative
ISO 15836 (DCMES)
SWORD
2.0
2008
JISC
BagIt
ARK Identifiers
ISO 26324 (Digital Object
Identifier)
2012
RFC 3652 (Handle System
Protocol)
2.1
2003
RFC 3650 (Handle System
Overview)
2003
RFC 3651 (Handle System
Namespace and Service
Definition)
2003
ISO 13120 (ClaML)
2013
ISO 27951 (CTS1)
2009
ISO 27527 (Provider
Identification)
2010
ISO 27932 (HL7 Clinical
Document Architecture)
2009
ISO 27931 (HL7)
2009
ISO 17115 (Vocabulary for
terminological systems)
2007
LMER
1.2
DNB
RFC 2141 (URN Syntax)
1997
RFC 1737 (URN Requirements)
1994
RFC 4122 (UUID URN
Namespace)
2005
ISO 20652 (PAIMAS)
2006
IMS Content Packaging
1.2
IMS Global
Z39.50 (Information Retrieval)
4
2003
ISO 2709 (Format for information
exchange)
2008
MARC 21
EAD
2002
FOAF Vocabulary
0.99
2014
FOAF Project
RDF Best Practices
CoolURIs
RDF Vocabulary Description
Language
1.0
2004
Extensible Resource Identifier
2.0
2005
RFC 2234 (ABNF)
1997
RFC 3987 (IRI)
2005
RFC 3305 (URI,URL,URN
Clarifications)
2002
RFC 2396 (URI)
1998
XRI Data Interchange
2.0
2005
ISO 14533-2 (XAdES)
2012
Canonical XML
1.0
2001
Universal Business Language
2.1
2013
ISO 14662 (Open-edi)
2010
ISO 15000-5 (CCTS)
2005
Z39.88 (OpenURL)
1
2004
Z39.85 (DCMES)
1
2001
ISO 8601 (Dates and Times)
2000
ISO 62264 (B2MML)
2003-2008
ISA 95
2001-2005
ISA 88
ANSI
ISO 21000-2 (MPEG-21 DID)
2005
ISO 21000-6 (MPEG-21 RDD)
2004
ISO 21000-7 (MPEG-21 DIA)
2007
ISO 21000-9 (MPEG-21 Fileformat)
2005
ISO 21000-18 (MPEG-21
Streaming)
2007
ISO 14496-12 (base media file
format)
2012
RFC 6381 (Codecs)
2011
ISO 21000-3 (MPEG-21 DII)
2003
TIFF
6.0
1992
ISO 15444-1 (JPEG2000)
2004
JPEG
UnitsML
1.0
2011
NIST
hData
1.0
2013
RLUS
1.0.1
2011
LECIS
1.0
2003
ISO 21090 (Health informatics
data types)
IHE
XDS
SVS
XUA
SAML
2.0
2008
XACML
3.0
2013
ASTM E1986 (Access Privileges to
Health Info)
2013
ASTM E1869 (Confidentiality,
Privacy, Access and Data Security)
2010
ISO 19005-1b (PDF/A)
CDA
2
2008
ISO 19510 (BPMN 2.0)
2013
BPMN
2.0.1
2011
SAA
CDISC
BRIDG
3.2
Define-XML
2.0
2013
ADaM
2.1
SDM-XML
1.0
CDISC-ODM
1.3.2
SEND
3.0
LAB
1.0.1
ISO 28500 (WARC)
2009
RFC 3629 (UTF-8)
2003
ISO 17025 (Competence of
laboratories)
2005
ISO
W3C
IETF
OASIS
OMG
LC
CDISC
NISO
OAI
slide 9
Allotrope Foundation
•Subject Matter Experts
•Project Funding
Member
Companies
•Project Management
•Legal & Logistical Support
Secretariat
•Framework Development
•Technical Leadership
Professional
Software Firm
•Requirements & Specifications
•Contributions, PoC Applications
Partner Network
AbbVie
Amgen
Baxter
Bayer
Biogen
Boehringer Ingelheim
Bristol-Myers Squibb
Eli Lilly
Genentech/Roche
GlaxoSmithKline
Merck & Co
Pfizer
ACD/Labs
Agilent
BIOVIA
BSSN
Erasmus MC
IDBS
Mestrelab Research
Mettler Toledo
Persistent
Riffyn
Sartorius
Shimadzu
Thermo Scientific
Univ. Southampton
Waters
slide 10
Allotrope Data Format (ADF)
Data Description
RDF Model
Data Cubes
Universal data container
Data Package
Virtual file system *
Contains:
• Method, instrument, sample,
process, result, etc.
• Data cube metadata
• Binary file metadata
• …
Analytical data represented by
one- or multidimensional arrays.
HDF5
Platform Independent File Format
Allotrope Data Format
* Use is optional
Analytical data represented by
arbitrary formats, incl. native
instrument formats, images, pdf,
video, etc.
Specifically designed to store and
organize large amounts of numerical
data.
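The three-part structure on this slide can be pictured with a small Python sketch. It is not the Allotrope Framework API: the class name `AdfSketch`, its field names, and all `ex:` identifiers are hypothetical, and plain Python containers stand in for the RDF model and for HDF5.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for an ADF file; NOT the Allotrope Framework API.
@dataclass
class AdfSketch:
    # Data Description: metadata as (subject, predicate, object) triples.
    data_description: list[tuple[str, str, str]] = field(default_factory=list)
    # Data Cubes: named one- or multidimensional numeric arrays.
    data_cubes: dict[str, list[list[float]]] = field(default_factory=dict)
    # Data Package: arbitrary files (native formats, images, PDF) by path.
    data_package: dict[str, bytes] = field(default_factory=dict)

adf = AdfSketch()
adf.data_description.append(("ex:run1", "ex:usesInstrument", "ex:hplc7"))
adf.data_cubes["ex:chromatogram1"] = [[0.0, 0.1], [0.5, 3.2]]
adf.data_package["raw/run1.bin"] = b"\x00\x01"

# List the parts of the container that actually hold content.
parts_in_use = [name for name, part in vars(adf).items() if part]
```

Keeping the three parts separate mirrors the slide: descriptive metadata, numeric arrays, and opaque files each get the representation that suits them.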
slide 11
API Stack
Allotrope Framework provides APIs to read and write data
contained in ADF
Developers do not have to concern themselves with RDF,
SPARQL, semantics or complex graph patterns
Platform independent file format
(HDF5)
Data Package API Data Cube API
Data Description API
(Apache Jena)
Analytical Data API
Taxonomies
Triple Store API
Taxonomies
slide 13
Scope and Current Status
Implemented analytical
techniques:
Small molecules
gas chromatography
Karl Fischer
liquid chromatography
mass spectrometry
nuclear magnetic resonance
spectroscopy
thermogravimetric analysis
ultraviolet spectrometry
Large molecules
capillary electrophoresis
cell counter
cell culture analyzer
blood gas analysis
Both
balance
pH
Number of classes (chart values; category labels not recoverable): 562, 168, 2272, 283
slide 14
Reused Vocabularies and Ontologies
Used:
RDFS, OWL, SKOS
Shapes Constraint Language (SHACL)
Directly imported:
Quantities, Units, Dimensions and Data Types Ontologies (QUDT)
The W3C RDF Data Cube Vocabulary (QB)
Partly reused definitions:
Chemical Methods Ontology (CHMO)
Proteomics Standards Initiative – Mass Spectrometry (PSI-MS)
International Union of Pure and Applied Chemistry (IUPAC)
…
slide 19
Example: Mass Spectrum
Data set of rank 2.
Additional dimensions:
• sample
• retention time
• device
• …
Metadata is expressed in RDF.
Numeric data is natively
represented in HDF5.
mass
intensity
af-m:AFM_0000350
af-r:AFR_0000495
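A minimal Python sketch of the split described on this slide, assuming mass acts as the dimension and intensity as the measure. The numbers are invented, the `ex:` terms are hypothetical placeholders (not Allotrope taxonomy terms), and nested Python lists stand in for HDF5 storage.

```python
# Numeric part: one spectrum as (mass, intensity) pairs; values are made up.
spectrum = [(100.1, 1200.0), (150.2, 560.0), (200.3, 9800.0)]

# An additional dimension (here retention time) indexes one spectrum per time point.
spectra_by_retention_time = {
    1.5: spectrum,
    3.0: [(100.1, 300.0), (150.2, 7100.0), (200.3, 450.0)],
}

# Metadata part: triples with hypothetical ex: terms, kept apart from the numbers.
metadata = [
    ("ex:spectrum1", "ex:hasDimension", "ex:mass"),
    ("ex:spectrum1", "ex:hasMeasure", "ex:intensity"),
]

def base_peak(points):
    """Return the (mass, intensity) pair with the highest intensity."""
    return max(points, key=lambda p: p[1])

base_peaks = {rt: base_peak(pts) for rt, pts in spectra_by_retention_time.items()}
```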
slide 20
ADF Data Cube Ontology
ADF Data Cube API
HDF5
ADF Data Cube Ontology
RDF Data Cube
Vocabulary
HDF5 Ontology
ADF-HDF5 Mapping
Create and access data cubes.
Extends the RDF Data Cube
Vocabulary by scales, slabs, order
functions and complex data types.
Mapping between RDF metadata
descriptions and description of
physical storage in HDF5.
Vocabulary of HDF5 entities and
data types.
Platform independent file format.
slide 21
ADF Data Cube Ontology
W3C: RDF Data
Cube Vocabulary
HDF5 Ontology
W3C: RDF, OWL, SHACL
ADF Data Cube Ontology ADF-HDF5 Mapping
slide 23
ADF Data Cube Ontology
Nominal Scale: sample, run …
Ordinal Scale: sample index, quality (++,+,o,-,--) ...
Interval Scale: temperature, date time …
Ratio Scale: mass, duration …
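The four scales follow the usual levels-of-measurement classification, where each level permits every operation of the level below plus one more. The sketch below illustrates that idea in Python; the names `Scale`, `MIN_SCALE` and `allowed` are hypothetical, and the slide does not say how the ADF Data Cube Ontology encodes these rules.

```python
from enum import IntEnum

class Scale(IntEnum):
    NOMINAL = 1   # categories only: sample, run
    ORDINAL = 2   # ordered categories: sample index, quality
    INTERVAL = 3  # meaningful differences: temperature, date time
    RATIO = 4     # true zero, meaningful ratios: mass, duration

# Each operation becomes meaningful from a given scale upward.
MIN_SCALE = {
    "equals": Scale.NOMINAL,
    "less_than": Scale.ORDINAL,
    "difference": Scale.INTERVAL,
    "ratio": Scale.RATIO,
}

def allowed(scale: Scale, operation: str) -> bool:
    """True if the operation is meaningful for values on the given scale."""
    return scale >= MIN_SCALE[operation]

quality_can_be_sorted = allowed(Scale.ORDINAL, "less_than")
temperature_ratio_ok = allowed(Scale.INTERVAL, "ratio")
```

Under these assumptions a quality rating can be sorted, while a ratio of two interval-scale temperatures is rejected.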
slide 25
ADF Data Cube Ontology
HDF Mapping:
Required to map the
data structure from
functional to physical
perspective.
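One way to picture the functional-to-physical mapping is a lookup table from cube components to storage locations. The sketch below is an assumption-laden illustration: a dict stands in for an HDF5 file, and the dataset paths, `ex:` names and `read_component` helper are all hypothetical rather than taken from the ADF-HDF5 mapping.

```python
# Physical perspective: a dict stands in for datasets in an HDF5 file.
storage = {
    "/cubes/c1/scales": [[100.1, 150.2, 200.3]],
    "/cubes/c1/measures": [[1200.0, 560.0, 9800.0]],
}

# Mapping: functional component -> (dataset path, row index). All names hypothetical.
mapping = {
    "ex:mass": ("/cubes/c1/scales", 0),
    "ex:intensity": ("/cubes/c1/measures", 0),
}

def read_component(component: str) -> list[float]:
    """Resolve a component through the mapping and return its stored values."""
    path, row = mapping[component]
    return storage[path][row]

# Functional perspective: callers ask for components, not for storage locations.
points = list(zip(read_component("ex:mass"), read_component("ex:intensity")))
```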
slide 27
Complex Data Types
weight (mg)
1020
655
weight
1.020 g
655 mg
weight (mg)
1020 +/- 15
655 +/- 12
weight
tare: 25.3332 +/- 0.2 g
net: 20.219 +/- 0.2 g
Complex Data types are expressed using the Shapes Constraint
Language (SHACL).
https://w3c.github.io/data-shapes/shacl/
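The four weight variants on this slide (plain number with a column unit, value with its own unit, value with uncertainty, and a composite of tare and net) can be illustrated with a short Python sketch. Dataclasses stand in for the SHACL shapes here, which are not reproduced; `Quantity`, `Weighing`, `TO_MG` and `in_mg` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

# Conversion factors to milligrams; only the units that appear on the slide.
TO_MG = {"mg": 1.0, "g": 1000.0}

@dataclass(frozen=True)
class Quantity:
    value: float
    unit: str
    uncertainty: Optional[float] = None  # expressed in the same unit as value

    def in_mg(self) -> "Quantity":
        """Return the same quantity converted to milligrams."""
        factor = TO_MG[self.unit]
        unc = None if self.uncertainty is None else self.uncertainty * factor
        return Quantity(self.value * factor, "mg", unc)

@dataclass(frozen=True)
class Weighing:
    tare: Quantity
    net: Quantity

# Values with individual units, normalized to one unit for comparison.
mixed_units = [Quantity(1.020, "g"), Quantity(655, "mg")]
weights_mg = [round(q.in_mg().value, 6) for q in mixed_units]

# Composite value: tare and net, each with its own uncertainty.
weighing = Weighing(tare=Quantity(25.3332, "g", 0.2), net=Quantity(20.219, "g", 0.2))
gross_g = round(weighing.tare.value + weighing.net.value, 4)
```

Carrying the unit and uncertainty inside each value is what lets the second and third table variants be compared with the first.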
slide 29
Company 1
Reference Data Project
Data Lake Project
Lab
Execution
System
Instruments
(multiple)
Data Lake
(Hadoop)
ADF
(multiple)
AF
Taxonomies
slide 30
Company 2
Analytical Chemistry in Discovery
Sample
Queue
Analytical
Data Review ADF HPLC-MS
ADF Methods
MS
HPLC
slide 31
Company 3
Stability and Release Testing
Manufacturing Domain
ADF HPLC-UV
HPLC-UV
Balance
Electronic Lab Notebook
ADF Methods
slide 32
Conclusion
Why Semantics?
Good framework for standardized but extendable data
descriptions which are needed to realize the potential of the
available data.
Linked Data makes it possible to relate information stored in ADF
with additional context: e.g. materials, devices, chemicals,
processes, locations etc.
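The linking idea can be sketched in a few lines of Python: because a device is named by the same identifier inside and outside the file, a lookup can follow it from one data set into the other. All identifiers and the `objects` helper are hypothetical, and plain tuples stand in for RDF triples.

```python
# Triples from one ADF file and from an external source; all IRIs hypothetical.
adf_triples = [
    ("ex:run1", "ex:usedDevice", "ex:balance42"),
    ("ex:run1", "ex:measuredMaterial", "ex:batch7"),
]
context_triples = [
    ("ex:balance42", "ex:locatedIn", "ex:labVienna"),
    ("ex:batch7", "ex:hasChemical", "ex:caffeine"),
]

def objects(triples, subject, predicate):
    """Return all objects of triples matching the subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Follow the shared device identifier from the ADF description into the context data.
device = objects(adf_triples, "ex:run1", "ex:usedDevice")[0]
device_location = objects(context_triples, device, "ex:locatedIn")[0]
```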
Initially:
Experiments for
drug approval.
Today:
Experiments generate data
that can be used in many
different contexts.
slide 33
Questions?
Heiner Oberkampf
www.osthus.com
Allotrope Foundation:
www.allotrope.org