Using the UMLS MetaMap as a Cause of Death Analyzer

Michael Hogarth, MD
Michael Resendez, MS
Univ. of California, Davis

NAPHSIS 2007

Overview

• Causes of Death: A Historical Perspective
• Overview of the California EDRS
• Cause of Death Analysis tool (BECA)
• NLM MetaMap and the UMLS
• BECA-MetaMap experiment
• Discussion

Historical Perspectives on Causes of Death

• Bills of Mortality (1532)
  • Arose from the need to better understand death rates in medieval England -- plague epidemics (1361, 1368, 1375, 1390, 1406, …)
• John Graunt (1620-74)
  • Used the Bills of Mortality and found an infant death rate of 36% in England -- not previously known or understood
• London Bills of Mortality classification
  • Used by Dr. John Snow to characterize a cholera outbreak traced to a water source in London
  • Evolved to become the Intl. Classification of Diseases (1850’s)
• International Classification of Diseases (ICD) -- used for the last 150 years

CA-EDRS Causes of Death


Causes of Death

• Importance
  • key epidemiological information is contained in the cause of death
• Issues and Challenges
  • absolutely correct versus ‘close to correct’
  • absolute correctness requires significant time and manual effort
  • is ‘close to correct’ in an automated fashion still useful?
• Typical process in California
  • COD --> SuperMICAR --> Stat Master File
  • turnaround for the entire process can be lengthy (~2 years)
  • a trend in causes of death could go unnoticed by local jurisdictions for 2 years
• Today in California
  • a significant number of jurisdictions don’t wait for the final statistical files from the State office to look at trends -- they *manually* ‘code’ (if they have the staff), which takes time and funding

Preliminary COD classification

• Possible uses of a preliminary COD classification using automated methods that are ‘close to correct’
  • early identification of trends in a local jurisdiction
  • disease vs. injury/poisoning -- coroner referral cross-checking
  • identify specific infectious causes (encephalitis, cholera, etc.)
• What it is not
  • not for ‘absolutely correct’ cause of death classification
  • will not replace the nosologist’s expertise in understanding the sequence of events leading to death nor their understanding of ICD-10, with its includes/excludes

How to analyze causes of death?

• Challenges
  • text is verbatim and thus ‘arbitrary’ (free text)
  • need to go beyond simple keyword matching
  • biomedical knowledge and content is vast -- and constantly changing!
• A possible approach - text mining and computational linguistic techniques

BECA

• We built BECA, a generic concept analyzer framework that can incorporate any ‘concept identifier’ engine such as NLM MetaMap and other text processing tools
  • BECA = BECA Enables Concept Analysis
  • Supports a ‘plug-in’ design for the concept matcher and other components (e.g., spell checker)
• Designed to support multiple transformations of the text in step-by-step fashion
  • transformations -- strip special characters, lower case, run it through the concept matcher engine (MetaMap or other), run it through an available spell checker (jazzy spell, etc.)
• example transformations
  • convert to lowercase, remove all punctuation, map string using concept mapper, etc.
• First version of BECA uses the NLM MetaMap as a concept mapper
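
The step-by-step transformation idea can be illustrated with a small sketch. This is not BECA's code: the function names, the `TOY_CONCEPTS` table, and the `toy_concept_mapper` plug-in are all hypothetical stand-ins, and the concept mapper here is a dictionary lookup rather than MetaMap.

```python
import string
from typing import Callable, Dict, List

# A transformation takes a string and returns a (possibly changed) string.
Transform = Callable[[str], str]


def to_lowercase(text: str) -> str:
    return text.lower()


def strip_punctuation(text: str) -> str:
    # Replace each punctuation character with a space, then collapse whitespace.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return " ".join(text.translate(table).split())


# Hypothetical stand-in for a concept matcher plug-in (not the MetaMap API).
TOY_CONCEPTS: Dict[str, str] = {"septic shock": "Disease or Syndrome"}


def toy_concept_mapper(text: str) -> str:
    return f"{text} [{TOY_CONCEPTS.get(text, 'no match')}]"


def run_pipeline(text: str, steps: List[Transform]) -> List[str]:
    """Apply each step in order and keep every intermediate result."""
    history = [text]
    for step in steps:
        history.append(step(history[-1]))
    return history


steps = [to_lowercase, strip_punctuation, toy_concept_mapper]
history = run_pipeline("SEPTIC SHOCK.", steps)
for stage in history:
    print(stage)
```

Keeping every intermediate string makes it easy to see which transformation changed a literal, and a spell checker could be added as one more function in the `steps` list.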


BECA system design


Example transformations


What is NLM MetaMap?

• The National Library of Medicine’s MetaMap
  • a free, open source software component built by the NLM Lister Hill Laboratory
  • uses computational linguistic techniques to map biomedical text to a large corpus of biomedical content (the NLM Unified Medical Language System)
• Provides a number of text processing functions
  • Includes a ‘concept mapper’ that attempts to match phrases with concepts in the UMLS Metathesaurus
  • Includes a UMLS concept-to-code mapping for multiple coding systems (ICD, SNOMED, etc.)

How does MetaMap work?

• Takes text as input and attempts to identify ‘concepts’ in the text and match them to concepts in a large corpus of phrases and concepts in biomedicine (UMLS Metathesaurus)
• The retrieved “candidate” matches include a score that reflects how confident MetaMap is that the match is correct
• The candidates retrieved include their semantic type
  • “Disease or Syndrome”, “Injury or Poisoning”, etc.
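
A candidate match as described here can be pictured as a small record holding a concept name, a score, and a semantic type. The sketch below is only an illustration: the `Candidate` class and `best_candidate` helper are hypothetical names, not part of MetaMap, and the candidate values are made up.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Candidate:
    concept_name: str   # name of the matched Metathesaurus concept
    score: int          # match score; higher means a stronger match
    semantic_type: str  # e.g. "Disease or Syndrome"


def best_candidate(
    candidates: List[Candidate],
    allowed_types: Optional[Set[str]] = None,
) -> Optional[Candidate]:
    """Return the highest-scoring candidate, optionally restricted by semantic type."""
    pool = [
        c for c in candidates
        if allowed_types is None or c.semantic_type in allowed_types
    ]
    return max(pool, key=lambda c: c.score, default=None)


# Made-up candidates for illustration only; not real MetaMap output.
candidates = [
    Candidate("Heart failure", 1000, "Disease or Syndrome"),
    Candidate("Heart", 861, "Body Part, Organ, or Organ Component"),
    Candidate("Failure", 861, "Functional Concept"),
]

print(best_candidate(candidates))
print(best_candidate(candidates, {"Functional Concept"}))
```

Restricting the pool by semantic type is one way a caller could trade fewer matches for more relevant ones.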


The UMLS

• Developed by the National Library of Medicine
• Derived from over 100 sources (ICD, SNOMED, etc.)
• The Unified Medical Language System
  • A system built to support information retrieval in biomedicine
  • Used in PubMed, ClinicalTrials.gov, etc.
• Consists of:
  • (1) UMLS Metathesaurus
  • (2) UMLS Semantic Network
  • (3) UMLS SPECIALIST Lexicon

UMLS in detail

• UMLS Metathesaurus -- the world’s largest repository of biomedical phrases
  • 1.3 million concepts, 6.4 million unique phrases (concept names)
  • over 100 source vocabularies (ICD, SNOMED, CPT, etc.)
• UMLS SPECIALIST Lexicon
  • a file that provides individual words found in the UMLS Metathesaurus and their linguistic information, including grammatical ‘type’ (noun, verb, adjective, adverb, etc.)
• UMLS Semantic Network
  • a set of files that classify each Metathesaurus ‘concept’ into a particular type
  • Examples -- “Disease”, “Injury/Poisoning”, “Neoplasm”, ...

MetaMap Algorithm

• MetaMap’s algorithm consists of four steps
• (1) Parsing
  • using a part-of-speech tagger, text is decomposed into one or more noun phrases
    • “ocular complications of myasthenia gravis” ==> “ocular complications” and “myasthenia gravis”
  • noun phrases are processed independently by decomposing them into their grammatical components
    • “ocular complications” ==> modifier “ocular” and head of the phrase “complications”
• (2) Variant Generation -- ‘variants’ for each phrase are generated using SPECIALIST
  • variants -- all synonyms of the term, acronyms containing the term, abbreviations, plural/singular variants
  • each variant has a ‘distance’ score obtained from SPECIALIST
  • “ocular” - “eye”, “eyes”, “optic”, “ophthalmic”, “ophthalmia”, “oculus”, “oculi”
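
The first two steps can be sketched in miniature. Everything below is an assumption made for illustration: the real parser uses a part-of-speech tagger rather than "last word is the head", and the variants and distance values come from the SPECIALIST Lexicon, not from the made-up `TOY_VARIANTS` table.

```python
from typing import Dict, List, Tuple

# Toy variant table with made-up 'distance' scores (0 = the word itself).
# Real variants and distances would come from the SPECIALIST Lexicon.
TOY_VARIANTS: Dict[str, List[Tuple[str, int]]] = {
    "ocular": [("eye", 2), ("eyes", 3), ("optic", 2), ("ophthalmic", 2)],
    "complications": [("complication", 1)],
}


def split_head(noun_phrase: str) -> Tuple[List[str], str]:
    """Naive split: treat the last word as the head, the rest as modifiers."""
    words = noun_phrase.lower().split()
    return words[:-1], words[-1]


def generate_variants(noun_phrase: str) -> Dict[str, List[Tuple[str, int]]]:
    """Map each word of the phrase to itself (distance 0) plus its known variants."""
    modifiers, head = split_head(noun_phrase)
    return {
        word: [(word, 0)] + TOY_VARIANTS.get(word, [])
        for word in modifiers + [head]
    }


modifiers, head = split_head("ocular complications")
print(modifiers, head)
print(generate_variants("ocular complications"))
```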


MetaMap Algorithm

• MetaMap Algorithm continued
• (3) Candidate Retrieval from Metathesaurus
  • all Metathesaurus strings that contain at least one of the variants are retrieved
  • can exclude those where the variant is present in a large number of strings (i.e., a very common string)
• (4) Candidate evaluation -- the MMTX score
  • each Metathesaurus candidate is evaluated by calculating the ‘strength’ of the similarity between the original input phrase and the candidate phrase from the Metathesaurus
  • the calculation involves a weighted average of four metrics: distance scores for variants from the input noun phrase (‘variation’), whether the phrase is part of the ‘head’ (‘centrality’), ‘coverage’ and ‘cohesiveness’
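
Steps (3) and (4) can be sketched as follows. The function names are hypothetical, the retrieval is a plain whole-word lookup over a toy list, and the weights (coverage and cohesiveness counted twice as heavily as centrality and variation) together with the rounding to a 0-1000 scale are assumptions made for illustration; the slide states only that a weighted average of four metrics is used.

```python
from typing import Dict, Iterable, List, Optional

# Assumed weights, for illustration only.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "centrality": 1.0,
    "variation": 1.0,
    "coverage": 2.0,
    "cohesiveness": 2.0,
}


def retrieve_candidates(
    variants: Iterable[str],
    strings: List[str],
    max_hits: Optional[int] = None,
) -> List[str]:
    """Return strings containing at least one variant as a whole word.

    Variants found in more than max_hits strings are skipped as too common.
    """
    hits = set()
    for variant in variants:
        matching = [s for s in strings if variant in s.lower().split()]
        if max_hits is not None and len(matching) > max_hits:
            continue
        hits.update(matching)
    return sorted(hits)


def match_score(
    metrics: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> int:
    """Weighted average of four metrics in [0, 1], scaled to 0-1000."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    for name in weights:
        if not 0.0 <= metrics[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    weighted = sum(weights[name] * metrics[name] for name in weights)
    return round(1000 * weighted / sum(weights.values()))


toy_strings = ["Eye", "Eye pain", "Ocular hypertension", "Heart failure"]
print(retrieve_candidates(["ocular", "eye"], toy_strings))
print(match_score({"centrality": 1.0, "variation": 1.0,
                   "coverage": 0.5, "cohesiveness": 0.5}))
```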


Example

• BECA MetaMap output
• Input phrase: “ocular complications”

The question

• Can BECA using the NLM MetaMap be useful in:
  1. Identifying biomedical concepts in a cause of death literal, which is narrative text
  2. “Auto-coding” literals into ICD-10 codes

Cause of Death Literals in CA-EDRS

• CA-EDRS data is a combination of records initiated in EDRS (EDRS counties) and those submitted on paper (non-EDRS counties)
• Causes of death are verbatim from the certifier and typically entered into EDRS or typed on a paper certificate by funeral home staff or hospital staff
• Overall COD statistics for CA-EDRS
  • 462,564 registered death certificates
  • 985,330 unique literals (phrases) in all COD fields
  • 88,719 unique literals (phrases) in the Immediate Cause of Death field

Experiment

• We randomly selected 1,000 literals from the 88,719 unique literals in the Immediate Cause of Death field

• We submitted these “as is” to BECA (MetaMap, no spell checking component)

• BECA returned an average of 7.8 candidate matches per literal (7,791 candidates for 1,000 strings)

• Candidate scores ranged from 517 to 1000

• Match score distribution for the 7,791 candidates

[Figure: Match Score Distribution. x-axis: match score (ticks from 517 to 981); y-axis: number at that score (0 to 700)]
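
The sampling step and the per-literal average can be reproduced in outline as below. This is a sketch under assumptions: the function names are hypothetical, the seed is arbitrary, and the literals are placeholders rather than CA-EDRS data.

```python
import random
from typing import Dict, List, Sequence


def sample_literals(unique_literals: Sequence[str], k: int, seed: int = 0) -> List[str]:
    """Draw k distinct literals without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(list(unique_literals), k)


def mean_candidates(candidates_per_literal: Dict[str, int]) -> float:
    """Average number of candidate matches returned per literal."""
    return sum(candidates_per_literal.values()) / len(candidates_per_literal)


# Placeholder literals standing in for the unique Immediate Cause of Death strings.
toy_literals = [f"literal {i}" for i in range(50)]
sample = sample_literals(toy_literals, 10, seed=1)
print(sample)
print(mean_candidates({"literal 1": 8, "literal 2": 7}))
```

Fixing the seed makes the sample repeatable, which helps when the same literals need to be re-run after a change to the pipeline.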


Example Output


Literals with high score matches >=800


High Score Candidate Matches

• 3,017 (38.7%) of the 7,791 candidates had a score >=800
• 95.3% of the original literals (953/1000) had at least one candidate with a match score >=800
• 54.5% of the original literals (545/1000) had at least one candidate with a match score >=900
• 30.7% of the original literals (307/1000) had at least one candidate with a match score =1000
• Note: only 7.5% were the exact string as found in the UMLS Metathesaurus

• Match score distribution for the 3,017 candidates

[Figure: Match Score Distribution (>800). x-axis: match score (ticks from 799 to 988); y-axis: number at score (0 to 700)]
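
The "at least one candidate at or above a threshold" figures are per-literal percentages, which can be computed as in this sketch. The function name is hypothetical and the scores are made up; they are not data from the experiment.

```python
from typing import Dict, List


def share_with_score_at_least(
    scores_by_literal: Dict[str, List[int]], threshold: int
) -> float:
    """Percentage of literals with at least one candidate score >= threshold."""
    hits = sum(
        1 for scores in scores_by_literal.values()
        if any(score >= threshold for score in scores)
    )
    return 100.0 * hits / len(scores_by_literal)


# Made-up candidate scores for four literals.
toy_scores = {
    "heart failure": [1000, 861],
    "septic shock": [913],
    "lunf carcinoma": [790, 645],
    "pending tox & micro": [],
}

for threshold in (800, 900, 1000):
    print(threshold, share_with_score_at_least(toy_scores, threshold))
```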


Semantic Type correct matches

• BECA with MetaMap correctly categorized 720 (72%) of the literals by semantic type

• Of these, “Neoplastic Process” had the highest reliability


Wrong matches

• Semantic types most frequently in error


ICD-10 Coding

• 252 of the 1,000 (25.2%) literals had an ICD-10 code matched by BECA-MetaMap

• Categories
  • 1 = good match
  • 2 = approximate match (within ICD category)
  • 0 = incorrect code

• Results - 97% were good or approximate
  • 82.5% “good match”
  • 14.3% “approximate match”
  • 3.2% “incorrect match”
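
The grading summary can be tallied as in this sketch. The function name is hypothetical. The counts used in the example (208, 36, 8) are not reported on the slide: they are back-calculated from the stated percentages under the assumption that all 252 coded literals were graded, and are shown only because they reproduce those percentages.

```python
from collections import Counter
from typing import Dict, Iterable

# Grade codes as listed on the slide.
LABELS = {1: "good match", 2: "approximate match", 0: "incorrect code"}


def grade_percentages(grades: Iterable[int]) -> Dict[str, float]:
    """Percentage of graded literals in each category, rounded to one decimal."""
    counts = Counter(grades)
    total = sum(counts.values())
    return {LABELS[g]: round(100.0 * counts[g] / total, 1) for g in LABELS}


# Assumed counts, back-calculated from the percentages (208 + 36 + 8 = 252).
grades = [1] * 208 + [2] * 36 + [0] * 8
result = grade_percentages(grades)
print(result)
```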


ICD-10 Autocoding data


Some interesting challenges

• “CSTFIOTRDPIRATORY FAILURE”
• “CHRONIC ALCOHOLISHM”
• “ESOPHAGELA VARICES”
• “END STAGE RENAL DOSEASE”
• “HEAR FAILURE”
• “OVARION CANCER WITH METASTASES”
• “LUNF CARCINOMA, METASTATIC”
• “PENDING TOX & MICRO”
• “SEP[TIC SHOCK”
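
Several of these literals are simple misspellings that a spell-checking step could repair before concept matching. The sketch below uses Python's standard `difflib` rather than the Jazzy spell checker mentioned earlier; the vocabulary, the 0.7 cutoff, and the function names are assumptions made for illustration.

```python
import difflib
import re
from typing import List

# Hypothetical domain vocabulary; a real one might be built from UMLS strings.
VOCABULARY: List[str] = [
    "heart", "failure", "end", "stage", "renal", "disease", "lung",
    "carcinoma", "metastatic", "septic", "shock", "chronic", "alcoholism",
]


def correct_token(token: str, cutoff: float = 0.7) -> str:
    """Return the closest vocabulary word, or the token itself if none is close."""
    if token in VOCABULARY:
        return token
    matches = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_literal(literal: str) -> str:
    """Lowercase, keep alphabetic tokens only, and correct each token."""
    tokens = re.findall(r"[a-z]+", literal.lower())
    return " ".join(correct_token(t) for t in tokens)


for literal in ("HEAR FAILURE", "END STAGE RENAL DOSEASE", "LUNF CARCINOMA, METASTATIC"):
    print(literal, "->", correct_literal(literal))
```

Note that “HEAR” is itself a valid English word, so a general-purpose dictionary would not flag it; matching against a cause-of-death vocabulary is what makes the correction possible. A string such as “CSTFIOTRDPIRATORY” is too far from any word for this kind of matching to help.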


Discussion

• MetaMap may be useful for preliminary categorization of causes of death by semantic type

• Excluding certain semantic types would improve match precision (at the cost of lower # of matches)

• BECA-MetaMap only assigned an ICD-10 code 25.2% of the time

• When BECA-MetaMap assigned an ICD-10 code, it was correct in about 83% of cases, and correct or near correct in 97% of cases

• We found that MetaMap was “confused” if:
  • there are multiple concepts (noun phrases) in a single string
  • the phrase has a compound statement (“metastasis to brain and bone” or “gunshot wounds of the head and right arm”)
  • the phrases begin with certain words (e.g., complications)
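
One conceivable way to pre-process compound statements is to expand them into separate phrases before concept matching. The sketch below is a naive, regex-based illustration and not something BECA implements: the preposition list and the function name are assumptions, and real literals would need more careful linguistic handling.

```python
import re
from typing import List

# Prepositions after which a coordinated list may follow (assumed list).
PREPOSITIONS = ("to", "of", "in")

# Greedy match up to the LAST whole-word preposition, then the remaining tail.
_PATTERN = re.compile(
    r"^(.*\b(?:%s))\s+(.+)$" % "|".join(PREPOSITIONS), re.IGNORECASE
)


def expand_compound(phrase: str) -> List[str]:
    """Split 'X <prep> A and B' into ['X <prep> A', 'X <prep> B']."""
    match = _PATTERN.match(phrase)
    if not match:
        return [phrase]
    stem, tail = match.groups()
    parts = [
        p.strip()
        for p in re.split(r"\s+and\s+", tail, flags=re.IGNORECASE)
        if p.strip()
    ]
    return [f"{stem} {part}" for part in parts]


print(expand_compound("metastasis to brain and bone"))
print(expand_compound("gunshot wounds of the head and right arm"))
```

Each expanded phrase could then be sent to the concept mapper on its own, so that a single noun phrase is matched at a time.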


Future Directions for BECA

• Build a new “concept mapper” to replace MetaMap, and specifically design it to analyze cause of death phrases
  • include a spell checker
  • disambiguation for phrases that have compound statements
  • match SNOMED first, then match to ICD-10 (increases the hit rate for ICD-10 autocoding)
  • improve performance
  • implement ICD-10 includes/excludes using an open source rules engine (JBoss Rules Engine)

Credits

• National Library of Medicine, Lister Hill Lab
• University of California
  • Michael Resendez, MS
  • Cecil Lynch, MD, MS
• California Department of Health (California Department of Public Health)
  • Terry Trinidad
  • David Fisher
  • Debbie McDowell

California EDRS

• Developed by the University of California and California DHS (2004-2005)
• Implementation (2005 - 2008)
  • all death certificates entered into EDRS since Jan 1, 2005
  • full EDRS (implemented counties) -- DC originates in EDRS and is electronically completed locally
  • KDE EDRS (non-EDRS counties) -- DC completed in standard ‘paper’ fashion, eventually entered by State office into EDRS
• June 2007 - where are we?
  • today --> 510,000 certificates (2005 - present)
  • Originate locally (EDRS records) or are entered later into EDRS (non-EDRS records)
  • Today, June 2007, ~65% originate locally as EDRS electronic
  • By Nov 2007 over 90% of all CA records will originate in EDRS

Cause of Death Workflow with CA-EDRS

• CA-EDRS does not provide electronic support for gathering of the COD today


certifier and funeral home exchange (fax) worksheet

Once COD is finalized by certifier, funeral home staff create EDRS record and enter them