open health natural language processing consortium


Open Health Natural Language Processing Consortium

• www.ohnlp.org (part of caBIG Vocabulary Knowledge Center web presence)

• Goal: foster an open-source collaborative community around clinical NLP that can deliver best-of-breed annotators, leverage the dynamic features of UIMA flow control, and establish the infrastructure for clinical NLP.

• Two open-source releases as part of OHNLP:
  • Mayo’s pipeline for processing clinical notes (cTAKES)
  • IBM’s pipeline for processing medical notes (MedKAT) and pathology reports (MedKAT/P)

http://www.ohnlp.org/


cTAKES Technical Details

• Open source release: March 15, 2009
  • www.ohnlp.org
  • Downloads: Documentation and Downloads
  • Technical details: Publications
• Framework
  • IBM’s Unstructured Information Management Architecture (UIMA) open source framework
• Methods
  • Natural Language Processing (NLP) methods
• Application
  • High-throughput phenotype extraction system (80M+ notes; 80B+ tokens)


cTAKES Components

• Core components
  • Sentence boundary detection (OpenNLP)
  • Tokenization (rule-based)
  • Morphologic normalization (NLM’s “norm”)
  • POS tagging (OpenNLP)
  • Shallow parsing (OpenNLP)
  • Named Entity Recognition
    • Diseases/disorders, signs/symptoms, procedures, anatomical sites, medications
    • Dictionary mapping (lookup algorithm)
    • Machine learning (MAWUI)
  • Negation and status identification (NegEx)
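
To make the flow of the component list concrete, here is a minimal toy sketch in Python of three of its stages: rule-based tokenization, dictionary-lookup named entity recognition, and a NegEx-style negation check. It is not cTAKES code (cTAKES is built on UIMA and uses OpenNLP models for several stages). The dictionary entries, the trigger list, the five-token negation window, and all function names are assumptions made only for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical toy dictionary: token tuple -> semantic type.
DICTIONARY = {
    ("chest", "pain"): "sign/symptom",
    ("pneumonia",): "disease/disorder",
    ("aspirin",): "medication",
}
# Hypothetical negation triggers, loosely in the spirit of NegEx.
NEGATION_TRIGGERS = {"no", "denies", "without"}
NEGATION_WINDOW = 5  # assumed scope: tokens following a trigger


@dataclass
class Entity:
    text: str
    semantic_type: str
    negated: bool


def split_sentences(text):
    # Naive stand-in for sentence boundary detection:
    # split on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    # Rule-based tokenization: alphanumeric runs, or single punctuation marks.
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", sentence)


def find_entities(tokens):
    # Longest-match dictionary lookup with a simple forward negation window.
    lowered = [t.lower() for t in tokens]
    max_len = max(len(key) for key in DICTIONARY)
    entities = []
    negated_until = -1
    i = 0
    while i < len(lowered):
        if lowered[i] in NEGATION_TRIGGERS:
            negated_until = i + NEGATION_WINDOW
            i += 1
            continue
        for n in range(max_len, 0, -1):
            key = tuple(lowered[i:i + n])
            if key in DICTIONARY:
                entities.append(
                    Entity(" ".join(tokens[i:i + len(key)]),
                           DICTIONARY[key],
                           i <= negated_until)
                )
                i += len(key)
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return entities


def annotate(text):
    # Run the stages sentence by sentence; negation scope never crosses sentences.
    results = []
    for sentence in split_sentences(text):
        results.extend(find_entities(tokenize(sentence)))
    return results


if __name__ == "__main__":
    for entity in annotate("Patient denies chest pain. Started aspirin for pneumonia."):
        print(entity)
```

In this sketch "chest pain" is marked negated because it follows "denies" within the same sentence, while "aspirin" and "pneumonia" in the next sentence are not. A real system would add normalization, POS tagging, and shallow parsing between these steps.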


cTAKES Type System


cTAKES example


Current Efforts - I

• Anaphoric relations and coreference, as part of the Ontology Development and Information Extraction project, University of Pittsburgh (2008-2011)
  • In collaboration with Chapman and Crowley
• Semantic processing of clinical text, in collaboration with Palmer, Martin and Ward, University of Colorado (2009-2011)
  • Treebanking (deep parses)
  • Predicate-argument structure and semantic labeling (PropBanking)
  • UMLS relations (except temporal relations)


Current Efforts - II

• Temporal relation discovery (2010-2014)
  • In collaboration with Palmer, Martin and Ward, University of Colorado
• Lexical resources for the clinical domain (2010-2015)
  • In collaboration with Chapman, University of Colorado and Elhadad, Columbia University
  • A la Treebank and clinical named entities with attributes and modifiers
