Upload: kerstin-forsberg

Post on 12-May-2015


CDISC EU Interchange 2012


Semantic Models for CDISC Based Standards and Metadata Management

Introduction

We may have reached a critical turning point in the way clinical data can be managed, used

and reused within and across organizations. The coverage and maturity of existing CDISC

standards, the establishment of these standards within the industry at large, the use of these

standards as a foundation for metadata driven systems, and the upcoming role of semantic

standards are all converging to create new and unique opportunities. In this presentation we look

at the implications and challenges of integrating CDISC standards, metadata, and information

models into a single framework. We also show how semantic standards can provide a solid

foundation in building such a framework.

CDISC Standards

The role of data standards for the management of clinical data has shifted significantly over the

past few years, largely due to the establishment of CDISC standards across the pharmaceutical

industry. Not so long ago, sponsors had to consider if and when they should use SDTM standards

for FDA submissions. Today, those questions have changed. Not if and when, but how to best

adopt CDISC based data standards is becoming the leading question. This change in mindset is

in itself a major step forward, but it also leads to formidable challenges: for CDISC as the

owner of the standards, for sponsors integrating these standards into their own organizations, for

vendors providing products and services, and for regulatory organizations reviewing submitted

data.

A key challenge for any set of standards is to be consistent and complete. Looking at the CDISC

standards, we see a variety of standards at different levels of maturity. The SDTM standards,

domains and terminology seem to have the highest level of adoption to date, but as more

sponsors submit data according to those standards, their shortcomings become magnified. SDTM is

an informal model and in many instances open for interpretation. This leads to inconsistencies in

how collected data is mapped to SDTM, potentially across studies from a single sponsor, but

definitely across studies from different sponsors. As sponsors get comfortable adopting the

SDTM standards, they naturally venture into the CDASH and ADaM standards. These standards

have had a shorter lifetime and have not yet reached the maturity level of SDTM while suffering

from similar problems. In addition, issues about consistency at the content and representational

levels across the CDISC standards come into focus as well. This is highlighted by the disconnect

between the standards just mentioned and the BRIDG model, a comprehensive domain analysis

model for protocol-driven biomedical and clinical research, captured as a UML model.

Sponsors adopting CDISC have to deal with these issues. They also face the challenge of managing

and integrating CDISC based data standards within their respective organizations at the

information architecture, process, and systems application level. In the following sections we

outline some fundamental principles that can help meet these challenges.

Information Architecture

We already indicated the importance for a set of standards to be complete and consistent. Formal

models make these notions precise. Another observation is that the content of the CDISC

standards depends on the meaning of what is studied in the biological and clinical reality (often

referred to as concepts), and how these concepts are represented by data elements from protocol

to submission, i.e. we are dealing with semantic and metadata information about biomedical and

clinical research knowledge and data. The conclusion is immediate and striking. An information

architecture taking this into account needs to be based on a formal ontological metadata model.

Well placed to get the job done are semantic models based on the W3C semantic web standards

(RDF, OWL, SKOS). These standards provide the means to define a formal representation of a

body of knowledge. In short, the Resource Description Framework (RDF) specifies a general

model of how any piece of knowledge can be represented by statements of the form Subject-

Predicate-Object or Subject-Predicate-Value, called triples. Each part of a triple (except Value)

has a Uniform Resource Identifier (URI), and triples can be aggregated into graphs with subjects

and objects as nodes, and predicates as arcs. The Web Ontology Language (OWL) adds a typing

mechanism to classify subjects and objects into a hierarchy of classes and defines modeling

constructs to express knowledge about predicates. This gives a rich modeling vocabulary to build

schemas and the capability to derive new triples from existing triples (inference). Finally, the

Simple Knowledge Organization System (SKOS) is a thin RDF based vocabulary that can be

used to build terminologies. See [2] for more information on RDF based standards.
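
The triple and inference ideas described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the "ex:" names are hypothetical and are not CDISC identifiers, prefixed strings stand in for full URIs, and the single inference rule shown is the RDFS-style subclass rule that OWL builds on, not a full OWL reasoner.

```python
# Prefixed names stand in for full URIs; all "ex:" names are hypothetical.
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

# A graph is a set of (subject, predicate, object) triples.
graph = {
    ("ex:VitalSign", SUBCLASS_OF, "ex:Observation"),
    ("ex:BloodPressure", SUBCLASS_OF, "ex:VitalSign"),
    ("ex:bp001", RDF_TYPE, "ex:BloodPressure"),
    ("ex:bp001", "ex:hasValue", 120),  # Subject-Predicate-Value: a literal, no URI
}


def infer_types(triples):
    """Derive new rdf:type triples by following rdfs:subClassOf links.

    Rule: if X is of type C and C is a subclass of D, then X is of type D.
    The rule is applied repeatedly until no new triple appears.
    """
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(inferred):
            if p != RDF_TYPE:
                continue
            for c, q, parent in list(inferred):
                if q == SUBCLASS_OF and c == o:
                    new_triple = (s, RDF_TYPE, parent)
                    if new_triple not in inferred:
                        inferred.add(new_triple)
                        changed = True
    return inferred


closure = infer_types(graph)
```

After the call, `closure` contains the original triples plus the derived statements that `ex:bp001` is also an `ex:VitalSign` and an `ex:Observation`.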

A knowledge base written in RDF can easily be shared between systems by serializing it into

formats such as RDF/XML. RDF knowledge bases are also easy to federate and cross-reference

as witnessed by the development of the Linked Open Data (LOD) cloud, a large amount of open

and cross-linked RDF data sets available on the web today. In this context it should be noted that

an OWL version of the NCI Thesaurus (the source for CDISC’s controlled terminologies) is

freely available today in an RDF/XML format. Also, an effort is well on its way to port the

BRIDG UML model to an OWL based ontology.
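
As a rough illustration of sharing and federating RDF knowledge bases, the sketch below merges two small sets of triples and serializes the result. It makes several assumptions: all example.org URIs are hypothetical placeholders (they are not NCI Thesaurus identifiers), the output is a simplified N-Triples-like text rather than the RDF/XML format mentioned above, and literal escaping is ignored for brevity.

```python
# SKOS property URIs are from the W3C SKOS vocabulary; example.org URIs are hypothetical.
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
SKOS_EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"

sponsor_kb = {
    ("http://example.org/sponsor/SYSBP", SKOS_PREF_LABEL, "Systolic Blood Pressure"),
    # Cross-reference from the sponsor term into an external terminology.
    ("http://example.org/sponsor/SYSBP", SKOS_EXACT_MATCH,
     "http://example.org/thesaurus/T0001"),
}

thesaurus_kb = {
    ("http://example.org/thesaurus/T0001", SKOS_PREF_LABEL, "Systolic Blood Pressure"),
}


def to_ntriples(triples):
    """Serialize triples as simplified N-Triples-like lines (no literal escaping)."""
    lines = []
    for s, p, o in sorted(triples, key=str):
        # Objects that look like URIs go in angle brackets, everything else is a literal.
        if isinstance(o, str) and o.startswith("http"):
            obj = f"<{o}>"
        else:
            obj = f'"{o}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


# Federation is a set union: both sources use the same triple structure.
federated = sponsor_kb | thesaurus_kb
text = to_ntriples(federated)
```

Because every source reduces to the same subject-predicate-object shape, merging needs no schema alignment step beyond agreeing on URIs.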

Looking across the CDISC standards, we notice that the content is itself metadata, hence the

RDF schema we have in mind corresponds to a level 3 meta-model. A good starting point here is

the ISO 11179 standard for metadata registries (MDR). This standard is a bit elaborate and not

that widely adopted, but it does provide a good starting point to develop a small and generic

OWL vocabulary for metadata models, including most notably the capability of item level

versioning for anything that goes into a metadata registry. Using an ISO 11179 based OWL

vocabulary, it is fairly straightforward to create a knowledge base for the CDASH, SDTM, and

ADaM standards.
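
Item level versioning can be pictured with a small sketch. The class and method names below (`AdministeredItem`, `Registry`, `register`, `latest`, `get`) are hypothetical and only loosely inspired by ISO 11179 terminology; this is not the OWL vocabulary discussed here, just a minimal illustration of each registered item carrying its own version history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdministeredItem:
    """One immutable version of a registered item (hypothetical structure)."""
    identifier: str
    version: int
    definition: str


class Registry:
    """Keeps the full version history for every item identifier."""

    def __init__(self):
        self._items = {}  # identifier -> list of versions, oldest first

    def register(self, identifier, definition):
        """Add a new version; earlier versions stay retrievable."""
        history = self._items.setdefault(identifier, [])
        item = AdministeredItem(identifier, len(history) + 1, definition)
        history.append(item)
        return item

    def latest(self, identifier):
        return self._items[identifier][-1]

    def get(self, identifier, version):
        return self._items[identifier][version - 1]


registry = Registry()
registry.register("ex:VSORRES", "Result as collected")
registry.register("ex:VSORRES", "Result as originally collected, in original units")
registry.register("ex:VSTESTCD", "Short code for the vital signs test")
```

Each item is versioned independently: updating `ex:VSORRES` does not change the version of `ex:VSTESTCD`.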

Finally, there is a need to eliminate any possible interpretation and to guarantee consistency

between the different CDISC standards. A biomedical concept model, representing the meaning

of what is studied in the biological and clinical reality, can provide the glue to hold everything

together. It provides common and precise semantic content for any CDASH, SDTM, and ADaM

data element, and restricts these standards to have only representational capabilities. On the other

side of the coin, an RDF based biomedical concept model can link directly into other RDF

sources with semantic content such as the NCI Thesaurus and BRIDG once its OWL

representation is available.
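
The role of the biomedical concept model as "glue" can be sketched as follows. Everything here is an assumption made for illustration: the predicate `ex:represents`, the concept names, the data element names, and the placeholder terminology reference are all hypothetical and do not come from CDISC or the NCI Thesaurus.

```python
# Each data element in a standard points to the one concept it represents.
triples = {
    ("cdash:SYSBP_item", "ex:represents", "bc:SystolicBloodPressure"),
    ("sdtm:VS.SYSBP", "ex:represents", "bc:SystolicBloodPressure"),
    ("adam:SYSBP_param", "ex:represents", "bc:SystolicBloodPressure"),
    ("sdtm:VS.PULSE", "ex:represents", "bc:PulseRate"),
    # The concept itself links out to an external terminology (placeholder code).
    ("bc:SystolicBloodPressure", "skos:exactMatch", "ncit:PLACEHOLDER"),
}


def representations(concept):
    """Return every data element, across all standards, that represents the concept."""
    return sorted(s for s, p, o in triples if p == "ex:represents" and o == concept)


def concept_of(data_element):
    """Return the single concept a data element represents, or None."""
    matches = [o for s, p, o in triples if s == data_element and p == "ex:represents"]
    return matches[0] if matches else None


sysbp_elements = representations("bc:SystolicBloodPressure")
```

In this arrangement the meaning lives only in the concept, while the collection, tabulation, and analysis elements are reduced to representations of it, so two elements are consistent exactly when they resolve to the same concept.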

Our considerations on an information architecture for CDISC standards based on semantic web

standards lead to the following RDF based information stack.

[Figure 1: RDF based information stack. Layer labels as extracted from the figure: RDF, OWL, SKOS; ISO 11179 MDR Schema (subset); BRIDG and ISO 21090; Biomedical Concept Model; CDASH, SDTM, ADaM; Sponsor Extensions; NCI Thesaurus.]

Notice that the top layer offers sponsors the opportunity to extend content based on existing RDF

schemas, e.g. sponsors may add additional SDTM data elements as supplemental qualifiers, or

introduce additional RDF schemas to cover new types of content.

CDISC Considerations

The CDISC standards have come a long way, both in terms of maturity and adoption, but also

face considerable challenges as more sponsors use the standards, and even more so as substantial

content is expected to be added for therapeutic areas. A layered information architecture based

on semantic standards can provide a solid foundation to systematically address these challenges.

The CDISC SHARE project may be the best place to get such an effort on its way, but will

require substantial commitment from CDISC as a whole to be successful. Just recently we have

provided a first draft OWL model to give a home to the ideas that the SHARE team has been

working on over the past few years. The future roadmap however seems to be unclear at best

with no firm commitment to implementation goals and time lines. At the same time the SHARE

team is already producing much valuable content that fits extremely well in the biomedical

concept model.

Sponsor and Vendor Considerations

Right now we seem to have reached a turning point, driven by a widespread adoption of CDISC

standards and an emerging need for sponsors to establish a standards management function

within their respective organizations. Large organizations have increasing difficulty just dealing

with the resulting work load of managing and applying clinical data standards. This naturally

leads to the need for a metadata repository (MDR).

The same arguments for the information architecture given earlier apply even more here.

RDF/XML represents an RDF interface format for MDR content. As indicated before, it can

easily be shared and federated, but also loaded into a triple store database. Since an RDF

knowledge base can carry its own schema and everything is represented by triples, the triple

store load is immediate and the RDF knowledge base directly represents the MDR content.
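
The point that schema and content are both just triples can be illustrated with a toy in-memory store. This is a sketch only, not a real triple store product or API: the `TripleStore` class, the "mdr:" and "ex:" names, and the definition text are hypothetical.

```python
class TripleStore:
    """A toy in-memory triple store; None acts as a wildcard in match()."""

    def __init__(self):
        self._triples = set()

    def load(self, triples):
        """Loading needs no table design: schema and content are both triples."""
        self._triples.update(triples)

    def match(self, s=None, p=None, o=None):
        """Return all triples matching the given pattern, in sorted order."""
        return sorted(
            t for t in self._triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


mdr_content = [
    # Schema statements travel with the knowledge base ...
    ("mdr:DataElement", "rdf:type", "owl:Class"),
    ("mdr:hasDefinition", "rdf:type", "owl:DatatypeProperty"),
    # ... alongside the content they describe.
    ("ex:VSORRES", "rdf:type", "mdr:DataElement"),
    ("ex:VSORRES", "mdr:hasDefinition", "example definition text"),
]

store = TripleStore()
store.load(mdr_content)

classes = [s for s, _, _ in store.match(p="rdf:type", o="owl:Class")]
elements = [s for s, _, _ in store.match(p="rdf:type", o="mdr:DataElement")]
```

The same `match` call answers questions about the schema (which classes exist) and about the content (which data elements exist), which is why no separate load step for the schema is needed.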

Two examples of how sponsors have started to implement semantic standards and apply linked

data principles: At Roche this is done by implementing an internally built MDR, see more details

below. At AstraZeneca the requirements on a commercial MDR product will include an interface

to MDR content based on semantic standards and linked data principles. This is part of a larger

effort called integrative informatics (i2) establishing the components to let a Linked Data cloud

grow across AstraZeneca R&D.

MDR Based Standards Implementation at Roche

In a first phase, Roche has successfully defined a set of clinical trial data standards based on the

CDISC, ISO 11179 MDR, and the W3C semantic standards following the architecture shown

earlier in Figure 1. In this implementation, the biomedical concept model has deliberately been

designed as a thin layer in anticipation that CDISC SHARE will provide this part of the stack

later on. BRIDG can be added as soon as its OWL representation becomes available. The data

collection and data tabulation standards cover all of safety and the Roche therapeutic areas, but are

only partially based on CDASH. Data analysis standards are still in their infancy.

In a second phase, Roche built an MDR and an application infrastructure in 2011. This

includes a controlled mechanism to publish the RDF stack to a triple store database, a web

browser application to deliver the content to end-users, and a set of web services to provide

access to other applications. The MDR includes item level versioning following ISO 11179 and

is deployed in a high availability IT production environment. The next release is scheduled to

include semantic search and linking from the biomedical concept model into the NCI Thesaurus.

The good news for sponsors is that semantic technology has proven to work at all levels, from

W3C standards to semantic toolsets such as modeling workbenches, triple store databases, and

application programming interfaces (API).

Roche is now entering a third phase to establish MDR driven workflow automation from

protocol to submission. The idea is to implement a semantic representation of the protocol and

data analysis plan, and from there use the MDR content to support study build, provide data

transformation services to derive SDTM mappings, and finally support the production of data

analysis and submission deliverables.
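
To make the idea of metadata driven transformation more concrete, here is a heavily simplified sketch. It is not the Roche implementation: the mapping structure, the source field names, and the `transform` function are hypothetical, and a plain Python list stands in for mapping metadata that would be read from the MDR. The output uses SDTM Vital Signs variable names purely as an example target.

```python
# Hypothetical mapping metadata; in the approach described above this would
# come from the MDR rather than being written into the program.
MAPPINGS = [
    {"source_field": "sys_bp", "testcd": "SYSBP", "unit": "mmHg"},
    {"source_field": "dia_bp", "testcd": "DIABP", "unit": "mmHg"},
]


def transform(record, mappings):
    """Turn one collected record into tabulation rows, one row per mapped field.

    The function holds no study-specific logic: adding a mapping entry is
    enough to cover a new collected field.
    """
    rows = []
    for mapping in mappings:
        field = mapping["source_field"]
        if field not in record:
            continue  # field not collected for this record
        rows.append({
            "USUBJID": record["subject"],
            "VSTESTCD": mapping["testcd"],
            "VSORRES": record[field],
            "VSORRESU": mapping["unit"],
        })
    return rows


collected = {"subject": "S-001", "sys_bp": 120, "dia_bp": 80}
rows = transform(collected, MAPPINGS)
```

The design choice illustrated is that the transformation code stays generic while the mapping metadata carries the standard-specific knowledge.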

References

1. To read more on knowledge systems and semantic modeling, the following is recommended.

Dean Allemang and Jim Hendler. Semantic Web for the Working Ontologist. Second

Edition. Morgan Kaufmann, 2011. This is an excellent book, well-written, specifically on

the modeling aspects of RDF and OWL.

Christopher Walton. Agency and the Semantic Web. Oxford University Press, 2007. This

book gives a broad outlook on knowledge systems and the semantic web, including more

academic background on the computational aspects of the subject.

Dragan Gasevic, Dragan Djuric, and Vladan Devedzic. Model Driven Engineering and

Ontology Management. Second Edition. Springer, 2009. This book provides valuable

insight on knowledge engineering and the relationship between the different modeling

spaces.

2. Here is a good entry page to locate the W3C standards for the semantic web, in particular the

RDF, RDFS, OWL, and SKOS standards:

http://www.w3.org/2001/sw/wiki/Main_Page

3. To see what the National Cancer Institute (NCI) is doing in the area of controlled

terminologies and ontology modeling, have a look here:

https://cabig.nci.nih.gov/concepts/EVS/

4. The National Center for Biomedical Ontology (NCBO) is a great resource for biomedical

ontologies and related technologies. It can be accessed here:

http://www.bioontology.org/