Semantic Representation of Events in the Pharmaceutical Industry
Martin Romacker, Samuel Läubli & Marc Bux
NIBR-IT / Text Mining Services
24-Feb-2011, CSHALS
Introduction
Generation of a comprehensive terminology for companies (data capture, unique reference across NIBR).
Important events related to companies, drugs, indications and geographical locations.
Data from content providers as XML feeds (Prous Integrity, TPP, Adis R&D, TR PDI)
Lack of semantics at all levels:
• No definitions for concepts
• Unspecific relations like parentCompany or relatedCompany
• Important events locked in Natural Language statements
Usage of Semantic Web Approach
2 | Events in Pharmaceutical Industry | Martin Romacker, Samuel Läubli & Marc Bux | CSHALS Feb 2011, Boston, Ma
Generation of Terminology for Companies
We try to automate the production of our terminologies and corresponding pointers as much as possible
• Thorough analysis of the input sources and relations between them
Company terminology is also intellectually curated
[Figure: curation versus automation; axes: time spent on task, probability of errors]
Find optimum with high automation and few errors
Canonical Representation for Terminology
Preferred Term / Concept Unique Identifier: mandatory label to be used to name an object (Controlled Vocabulary); the unique identifier represents the concept.
Synonyms: a set of semantically equivalent terms
Pointer / Cross-Reference: a referential link between a Preferred Term and a data repository, using the appropriate value for access
• Example: Novartis AG (Preferred Term), Prous Integrity (data repository), 16964 (value for access)
Creation of a referential MetaData Layer which is shared by all scientific NIBR applications
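The canonical representation above can be pictured as a small record type. This is only a sketch: the class and field names, the identifier "C0001" and the synonym "Novartis" are assumptions for illustration, not the actual Metastore schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Pointer:
    """Referential link from a Preferred Term into one data repository."""
    repository: str    # e.g. "Prous Integrity"
    access_value: str  # key used to look the record up in that repository


@dataclass
class Concept:
    """One terminology entry (field names are hypothetical)."""
    concept_id: str      # unique identifier representing the concept
    preferred_term: str  # mandatory controlled-vocabulary label
    synonyms: set[str] = field(default_factory=set)
    pointers: list[Pointer] = field(default_factory=list)

    def labels(self) -> set[str]:
        """All surface forms that should resolve to this concept."""
        return {self.preferred_term} | self.synonyms


novartis = Concept(
    concept_id="C0001",             # made-up identifier
    preferred_term="Novartis AG",
    synonyms={"Novartis"},          # illustrative synonym
    pointers=[Pointer("Prous Integrity", "16964")],
)
```

Keeping pointers separate from synonyms mirrors the slide: synonyms name the concept, pointers say where to find it.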
Design Strategy
Multiple Usage / Reusability
• Example: Company terminology used for
  - Text Mining
  - FAST Enterprise Search
  - Suggest / Autocompletion
  - Ultralink Application
Semantic Interoperability
Compatibility with public domain knowledge repositories
Focus on coverage of terms relevant to Novartis (Top 50 competitors, strategic alliances etc.)
Company Terminology – Content (Example)
Company Terminology – Input Sources
A set of commercial feeds from Competitive Intelligence and scientific content providers:
• Prous Integrity (now known as Thomson Reuters Integrity)
• TPP – Thomson Pharma Partnering (formerly known as IdDB)
• Adis R&D Insight
• TR PDI – Thomson Reuters Pipeline Data Integration
Raw input feeds are preprocessed:
• extract any valuable information and filter noise
• convert to a well-defined, distinct format
The Algorithm is implemented in Pipeline Pilot
Company Terminology – Input Sources, Workflow
Company Terminology – Merger
Data from the input feeds is merged in 4 steps
Step                        Example
(Input)                     Sandoz → Novartis AG; Sandoz Technology Ltd → Sandoz AG
1. Normalization            Sandoz → Novartis; Sandoz Technology → Sandoz
2. Transitivity Resolution  Sandoz Technology → Novartis
3. Denormalization          Sandoz Technology Ltd → Novartis AG
4. Synonym Expansion        Sandoz Technology Limited → Novartis AG
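The four merger steps can be sketched in a few lines of Python. This is not the Pipeline Pilot implementation; the suffix table, the function names and the way original spellings are remembered are assumptions made to reproduce the Sandoz example.

```python
# Hypothetical suffix table: legal-form suffix -> its expanded spellings.
SUFFIXES = {"AG": ["AG"], "Ltd": ["Ltd", "Limited"]}


def normalize(name: str) -> str:
    """Step 1: strip a trailing legal-form suffix ("Sandoz AG" -> "Sandoz")."""
    head, _, tail = name.rpartition(" ")
    return head if head and tail in SUFFIXES else name


def resolve(facts: dict[str, str]) -> dict[str, str]:
    """Step 2: follow child -> parent links up to the topmost parent."""
    resolved = {}
    for child in facts:
        parent, seen = facts[child], {child}
        while parent in facts and parent not in seen:  # guard against cycles
            seen.add(parent)
            parent = facts[parent]
        resolved[child] = parent
    return resolved


def merge(raw_facts: list[tuple[str, str]]) -> dict[str, str]:
    """Map every expanded synonym to the full name of its topmost parent."""
    full_name: dict[str, str] = {}  # normalized name -> first spelling seen
    facts: dict[str, str] = {}
    for child, parent in raw_facts:
        for name in (child, parent):
            full_name.setdefault(normalize(name), name)
        facts[normalize(child)] = normalize(parent)
    result = {}
    for child, parent in resolve(facts).items():
        original = full_name[child]                 # step 3: denormalization
        suffix = original[len(child):].strip()
        for variant in SUFFIXES.get(suffix, [suffix]):  # step 4: expansion
            result[f"{child} {variant}".strip()] = full_name[parent]
    return result


merged = merge([("Sandoz", "Novartis AG"), ("Sandoz Technology Ltd", "Sandoz AG")])
```

For the two input facts, `merged` maps "Sandoz", "Sandoz Technology Ltd" and "Sandoz Technology Limited" to "Novartis AG".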
Company Terminology – Merger, Workflow
Company Terminology – Filter / Clean
An English natural language dictionary is used to remove false positives from company synonyms
• The natural language dictionary contains lemmatized versions of common American and British English words
• It contains neither abbreviations nor names
Example of filtered false positives due to normalization:
• University → University of Calgary
• Phase → pHase Pharmaceuticals LLC
After the filtering, synonyms and pointers are dumped to text files and ready to be uploaded into the Metastore
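A minimal sketch of the dictionary filter, assuming a synonym is rejected when every one of its words is a common dictionary word. The tiny word set and the matching rule are stand-ins; the slides do not say how the real dictionary is applied.

```python
# Tiny stand-in for the natural language dictionary (lemmatized common words).
DICTIONARY = {"university", "phase", "health", "consumer"}


def is_false_positive(synonym: str) -> bool:
    """A synonym made up only of common dictionary words is too generic."""
    words = synonym.lower().split()
    return bool(words) and all(word in DICTIONARY for word in words)


def filter_synonyms(synonyms: dict[str, str]) -> dict[str, str]:
    """Keep only synonym -> company entries that survive the dictionary check."""
    return {s: c for s, c in synonyms.items() if not is_false_positive(s)}


candidates = {
    "University": "University of Calgary",
    "Phase": "pHase Pharmaceuticals LLC",
    "Sandoz": "Novartis AG",
}
kept = filter_synonyms(candidates)
```

Because the dictionary holds neither abbreviations nor names, a synonym such as "Sandoz" passes while "University" and "Phase" are dropped.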
Company Terminology – Filter / Clean, Workflow
Company Terminology – Challenges
Differing state and quality of the input sources lead to
• Contradictions
  - Merck → Merck & Co Inc.
  - Merck → Merck KGaA
• Misleading facts
  - Roche Consumer Health AG → Bayer AG (Acquisition)
  - Roche → Roche Consumer Health AG (Abbreviation)
• Cycles
• Integration of outdated facts
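Contradictions and cycles of this kind can be detected mechanically before a curator looks at them. The sketch below assumes facts are (source, target) pairs; the function names and the "A"/"B" cycle are invented for illustration.

```python
from collections import defaultdict


def find_contradictions(facts):
    """Return every name that points to more than one target."""
    targets = defaultdict(set)
    for source, target in facts:
        targets[source].add(target)
    return {s: sorted(t) for s, t in targets.items() if len(t) > 1}


def find_cycle_members(facts):
    """Return every name that can reach itself by following the facts."""
    graph = defaultdict(set)
    for source, target in facts:
        graph[source].add(target)
    members = set()
    for start in list(graph):
        stack, seen = list(graph[start]), set()
        while stack:  # depth-first walk from the start node
            node = stack.pop()
            if node == start:
                members.add(start)
                break
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, ()))
    return members


facts = [
    ("Merck", "Merck & Co Inc."),
    ("Merck", "Merck KGaA"),
    ("A", "B"),  # invented pair forming a cycle
    ("B", "A"),
]
contradictions = find_contradictions(facts)
cycle_members = find_cycle_members(facts)
```

Misleading and outdated facts cannot be found this way; they still need intellectual curation.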
Company Terminology – Modfile
Various options to alter the data flow:
• Add intellectually maintained facts about companies
• Prevent normalization or remove facts from input in order to break up cycles, resolve contradictions and enforce synonym relations
• Manually remove noise from output
• Redesignate Preferred Terms of companies
• Prevent natural language filtering of selected synonyms
• Specify suffixes and special characters for term normalization and expansion
Manual assertions always override facts in the input feed
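The override rule can be sketched as a merge in which manual entries are applied last. The function name, the dictionary layout and the curator decisions shown are assumptions; the actual Modfile format is not described in the slides.

```python
def apply_modfile(feed_facts, added, removed):
    """Merge feed facts with manual assertions; manual entries take priority.

    feed_facts, added: dict mapping a company name to its target name.
    removed: set of names whose feed facts should be dropped.
    """
    facts = {k: v for k, v in feed_facts.items() if k not in removed}
    facts.update(added)  # manual assertions override the input feed
    return facts


feed = {
    "Merck": "Merck KGaA",
    "Roche Consumer Health AG": "Bayer AG",
    "Roche": "Roche Consumer Health AG",
}
curated = apply_modfile(
    feed,
    added={"Merck": "Merck & Co Inc."},  # illustrative curator decision
    removed={"Roche"},                    # illustrative removal of a misleading fact
)
```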
Company Terminology – Modfile, Workflow
Company Terminology – Diff-Report
Company Terminology – Reporting, Workflow
Company Terminology – Workflow
Application of Company Terminology on Adis R&D
Adis R&D Insight is a drug pipeline database that tracks and evaluates drugs worldwide through the entire development process, from discovery, through pre-clinical and clinical studies to launch.
Updates on drug development processes are distributed as an XML feed basically consisting of
• Short news lines (text strings; «shouts»)
• Structuring information
• Basic meta-information
Semantic Representation of Events – Why?
Adis R&D feeds contain a huge number of important statements on events in very short, stereotypical sentences.
Bevacizumab has been licensed to Chugai in Japan
This knowledge is locked in natural language ...
Introduction | Semantics – Why?
Chugai has acquired licensing rights to Bevacizumab in all countries except USA
Bevacizumab has been licensed to Chugai in Japan
Computers cannot «automatically» map «has been licensed» and «acquired licensing rights» to a common semantic concept
Resulting problem: how to formulate queries on pure natural language text?
• e.g. «list all countries where Chugai holds a license for Bevacizumab»
Semantic Model for Events
OWL Ontology including semantic roles
Semantic Model for Licensing Event
Frame: LicensingEvent
Bevacizumab has been licensed to Chugai in Japan
Token        Type         Role
Bevacizumab  PRODUCTS     LicensingSubject
Chugai       COMPANIES    Licensee
Japan        TERRITORIES  ValidTerritory
Semantic annotation and normalization of news entries makes information explicit and thus machine-readable
Queries can now be formulated on frames, types and roles rather than textual surface
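Once a news line is held as a frame, the question «list all countries where Chugai holds a license for Bevacizumab» becomes a filter on roles. A minimal sketch, assuming a hypothetical `LicensingEvent` record; the second and third events use invented placeholder values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicensingEvent:
    licensing_subject: str  # role LicensingSubject, type PRODUCTS
    licensee: str           # role Licensee, type COMPANIES
    valid_territory: str    # role ValidTerritory, type TERRITORIES


def territories(events, licensee, product):
    """Territories in which the licensee holds a license for the product."""
    return sorted({e.valid_territory for e in events
                   if e.licensee == licensee and e.licensing_subject == product})


events = [
    LicensingEvent("bevacizumab", "chugai", "japan"),
    LicensingEvent("bevacizumab", "company_x", "territory_y"),  # invented
    LicensingEvent("product_z", "chugai", "territory_y"),       # invented
]
result = territories(events, "chugai", "bevacizumab")
```

The query never touches the wording of the original sentence, so «has been licensed» and «acquired licensing rights» would be answered alike once both are mapped to the same frame.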
Data Processing | Overall Pipeline
Pre-Processing / Annotation
1. Tokenization
2. Part-of-Speech-Tagging
3. Chunking
4. Named Entity Recognition, using Novartis Metastore (SOAP-WebService)
Evaluation
5. Rule Evaluation
Output
6. Result Representation (Triples)
Parallel: Ontology Development & Refinement
Data Processing | Pre-Processing / Annotation
Tokenization, PoS-Tagging, Chunking, Named Entity Recognition
Bevacizumab has been licensed to Chugai in Japan.
'Bevacizumab','has','been','licensed','to','Chugai','in','Japan','.'
NNP VBZ VBN VBN TO NNP IN NNP
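Two of the annotation steps, tokenization and named entity recognition, can be approximated with the standard library alone. The regular expression and the in-memory gazetteer are assumptions; in the pipeline described here the entity lookup goes to the Novartis Metastore web service, and PoS-tagging and chunking are left out.

```python
import re

# Hypothetical gazetteer standing in for the Metastore lookup.
GAZETTEER = {
    "bevacizumab": "PRODUCTS",
    "chugai": "COMPANIES",
    "japan": "TERRITORIES",
}


def tokenize(text):
    """Split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def recognize_entities(tokens):
    """Pair each token with its entity type, or None if it is not an entity."""
    return [(token, GAZETTEER.get(token.lower())) for token in tokens]


tokens = tokenize("Bevacizumab has been licensed to Chugai in Japan.")
entities = recognize_entities(tokens)
```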
Example of a Semantic Rule
Result of Pre-Processing / Annotation:
(S
  (NP bevacizumab/ER:PRODUCTS)
  has/VBZ
  been/VBN
  licensed/VBN
  (PP to/TO (NP Roche/ER:COMPANIES))
  (PP in/IN (NP Japan/ER:TERRITORIES))
  ./.)
Result of Rule Evaluation:
EventType: LicensingStatusEvent
LicensingSubject: bevacizumab
Licensee: roche
ValidTerritory: japan
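A rule of this kind can be approximated by a pattern over the annotated token sequence. The sketch flattens the bracketed annotation to "token/TAG" items and uses one regular expression as the rule; both choices are assumptions, since the slides do not show the rule language itself.

```python
import re

# Annotated sentence flattened to "token/TAG" items (chunk brackets dropped).
ANNOTATED = ("bevacizumab/ER:PRODUCTS has/VBZ been/VBN licensed/VBN "
             "to/TO Roche/ER:COMPANIES in/IN Japan/ER:TERRITORIES ./.")

# Hypothetical rule: PRODUCT "has been licensed to" COMPANY "in" TERRITORY.
LICENSING_RULE = re.compile(
    r"(?P<LicensingSubject>\S+)/ER:PRODUCTS has/VBZ been/VBN licensed/VBN "
    r"to/TO (?P<Licensee>\S+)/ER:COMPANIES "
    r"in/IN (?P<ValidTerritory>\S+)/ER:TERRITORIES"
)


def evaluate(annotated):
    """Return the filled frame, or None when the rule does not match."""
    match = LICENSING_RULE.search(annotated)
    if match is None:
        return None
    frame = {"EventType": "LicensingStatusEvent"}
    # Role names are the group names; values are normalized to lower case.
    frame.update({role: value.lower() for role, value in match.groupdict().items()})
    return frame


frame = evaluate(ANNOTATED)
```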
Semantic Representation of Events as Graphs
Subject: [http://usecases.novartis.intra/ci.owl#bevacizumab]
Predicate: [http://usecases.novartis.intra/ci.owl#hasLicensee]
Object: [http://usecases.novartis.intra/ci.owl#chug
[Figure: graph with the nodes bevacizumab, chugai, roche, company and 216974-75-3, linked by the edges hasLicensee, hasAcquired, hasCASNumber and hasType]
Each triple corresponds to one statement consisting of a subject, a predicate and an object.
Entities are identified by a URI (Uniform Resource Identifier)
New resources can easily be «attached» in order to create big networks of relations (hence «linked data»)
Integration of other Concept Types from the MetaStore
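Turning an event frame into triples is mostly a matter of prefixing names with a namespace. The sketch assumes the `ci.owl#` namespace from the slide, the standard `rdf:type` predicate for the event type, and a hypothetical event identifier "event1"; property names are derived as "has" plus the role name.

```python
CI = "http://usecases.novartis.intra/ci.owl#"  # namespace shown on the slide
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def frame_to_triples(event_id, frame):
    """Turn one event frame into (subject, predicate, object) URI triples."""
    subject = CI + event_id
    triples = [(subject, RDF_TYPE, CI + frame["EventType"])]
    for role, value in frame.items():
        if role != "EventType":
            triples.append((subject, CI + "has" + role, CI + value))
    return triples


frame = {
    "EventType": "LicensingStatusEvent",
    "LicensingSubject": "bevacizumab",
    "Licensee": "roche",
    "ValidTerritory": "japan",
}
triples = frame_to_triples("event1", frame)  # "event1" is a made-up identifier
```

Because every node is a URI, further triples about the same resources can be appended without changing the existing ones.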
Querying and Exploring the Triple Store
Example query for an event using SPARQL
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ci: <http://usecases.novartis.intra/ci.owl#>
SELECT ?country
WHERE {
  ?event rdf:type ci:LicensingStatusEvent .
  ?event ci:hasLicensee ci:roche .
  ?event ci:hasLicensingSubject ci:bevacizumab .
  ?event ci:hasValidTerritory ?country .
}
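What the query pattern does can be imitated over an in-memory list of triples. This is a plain-Python illustration of the matching logic, not a SPARQL engine; the prefixed strings and the second event are assumptions.

```python
def countries(triples, licensee, product):
    """Mirror the query: matching event type, licensee and licensing subject."""
    by_event = {}
    for s, p, o in triples:
        by_event.setdefault(s, {}).setdefault(p, set()).add(o)
    result = set()
    for props in by_event.values():
        if ("ci:LicensingStatusEvent" in props.get("rdf:type", ())
                and licensee in props.get("ci:hasLicensee", ())
                and product in props.get("ci:hasLicensingSubject", ())):
            result |= props.get("ci:hasValidTerritory", set())
    return sorted(result)


triples = [
    ("ci:event1", "rdf:type", "ci:LicensingStatusEvent"),
    ("ci:event1", "ci:hasLicensee", "ci:roche"),
    ("ci:event1", "ci:hasLicensingSubject", "ci:bevacizumab"),
    ("ci:event1", "ci:hasValidTerritory", "ci:japan"),
    ("ci:event2", "rdf:type", "ci:LicensingStatusEvent"),   # invented event
    ("ci:event2", "ci:hasLicensee", "ci:company_x"),
    ("ci:event2", "ci:hasLicensingSubject", "ci:bevacizumab"),
    ("ci:event2", "ci:hasValidTerritory", "ci:territory_y"),
]
found = countries(triples, "ci:roche", "ci:bevacizumab")
```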
Storage & Representation | Querying & Exploring
Meaning Expansion: Ontologies
New triples can be inferred using ontology definitions:
[Figure: graph in which AcquisitionEvent1 and chugai are linked via isAcquisitionObject; hasType edges lead to AcquisitionEvent, Company and AcquiredCompany]
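The inference can be sketched as one hand-written rule: whatever is the acquisition object of an AcquisitionEvent also has the type AcquiredCompany. The rule wording and the edge direction (event to company) are assumptions read off the figure labels; an OWL reasoner would derive this from the ontology definitions instead.

```python
def infer_acquired_companies(triples):
    """Add (company, hasType, AcquiredCompany) for every acquisition object."""
    events = {s for s, p, o in triples
              if p == "hasType" and o == "AcquisitionEvent"}
    inferred = {(o, "hasType", "AcquiredCompany")
                for s, p, o in triples
                if p == "isAcquisitionObject" and s in events}
    return set(triples) | inferred


asserted = [
    ("AcquisitionEvent1", "hasType", "AcquisitionEvent"),
    ("AcquisitionEvent1", "isAcquisitionObject", "chugai"),  # assumed direction
    ("chugai", "hasType", "Company"),
]
expanded = infer_acquired_companies(asserted)
```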
Conclusions
Data feeds from commercial content providers lack semantics (XML = syntax, statements in Natural Language)
Data feeds from commercial content providers contain inconsistencies and outdated facts (need for consolidation)
Transformation of content (entities and events) into a Semantic Web representation eases knowledge integration and exploration of data (graph navigation)
Content providers should shift towards a meaningful, well-defined and explorable Semantic Web representation of their data.