intelligent information directory system for clinical documents

Intelligent Information Directory System for Clinical Documents

Qinghua Zou

6/3/2005

Dr. Wesley W. Chu (Advisor)

When searching clinical reports

Keyword Search Problems

Hard to compose good keywords

Lack an outlook of the content

Interchangeable words

Intelligent Directory System

1. Overview 2. Extracting Key Concepts 3. Mining Topics 4. Building Directories 5. Searching 6. Conclusion

1. System Overview

2. Concept Extraction

2.1 Introduction 2.2 Our approach: IndexFinder

Index Phase (Offline) Search Phase (Real Time)

2.3 Experiments 2.4 Summary

2.1 Motivation Clinical texts are

valuable in medical practice Search relevant reports Search similar patients

What is key information? UMLS provides

key medical concepts Our Goal

Extract UMLS concepts from clinical texts

Clinical Texts

•Extract key info.•Standard terms

2.1 Previous Approaches

Free text

ip

dp i1

i0 vplambs

will v0

eat

dp

oats

NLP Parser

UMLS

Mapping

UMLS Concepts

Noun phrases

•lambs•oats

2.1 Problems of Previous Approaches

Concepts cannot be discovered if they are not in a single noun phrase. E.g. In “second, third, and fourth ribs”,

“Second rib” can not be discovered.

Difficult to scale to large text computing. Natural language processing requires

significant computing resources

2.2 Our Approach: IndexFinder

Free text

NLP Parser

Noun phrases

UMLS

Mapping Concepts

We would discard all words in the text except “lung” and “cancer”.

Our approach: UMLSfree text Previous: free textUMLS

Suppose UMLS contains only“Lung cancer”

Indexing

Index Data ~80MB

UMLS 2GB Index phase(offline)

conceptsFilteringExtracting

Free text Search phase(real time)

2.2 Our Approach: What’s New?

Knowledge-based approach Using the compact index data

without using any database system

Permuting words in a sentence to generate UMLS concept candidates.

Using filters to eliminate irrelevant concepts.

2.2 Concept Candidates GenerationAssumptions Knowledge base provides a

phrase table. Each phrase (concept) is a

set of words. An input text T is

represented as a set of words.

Goal Combining words in T to

generate concept candidates

Example T={D,E,F}

Answer: 5

2.2 Search Phase: FilteringUse filters to eliminate irrelevant

concepts Syntactic filter:

Word combination is limited within a sentence.

Semantic filter: Filter out irrelevant concepts using

semantic types (e.g. body part, disease, treatment, diagnose).

Filter out general concepts using the ISA relationship and keep the more specific ones.

2.3 Experiment Comparison with MetaMap [3]

Input: A small mass was found in the left hilum of the lung.

MetaMap

IndexFinder

2.4 Summary An efficient method that maps from UMLS

to free text for extracting concepts without using any database system.

Syntactic and semantic filters are used to eliminate irrelevant candidates.

IndexFinder is able to find more specific concepts than NLP approaches.

IndexFinder is scalable and can be operated in real time.

3. Mining Topics: SmartMiner

3.1 Introduction 3.2 Search Space 3.3 SmartMiner 3.4 Experiment 3.5 Summary

3.1 Introduction

A Topic (assumption) a set of concepts a frequent pattern

Finding topics by data mining Frequent patterns, or Maximal frequent patterns

Require efficient data mining

3.1 Data Mining Problem

1: a b c d e2: a b c d3: b c d4: b e5: c d e

id: item setDataset

MinSup=2

MFI abcd, be, cde

What itemsets are frequent itemsets (FI)?

a, b, c, d, e, ab, ac, ad, bc, bd, be, cd, ce, de, abc, abd, acd, bcd, cde,

abcd

Maximal frequent itemset(MFI): No superset is frequent.

3.1 Why MFI not FI? Mining FI is infeasible when there exists long FI. E.g, Suppose we have a 20-item frequent set a1 a2 … a20. All of its subset are frequent, i.e., 220=1,048,576

Mining MFI is fast and we can generate all the FI.

3.1 Previous work

Superset checking. A study shows that CPU spends 40% time for superset checking.

Search tree is too large A large number of support counting

Need more efficient method

3.2 Search spaceGiven 5 items: a, b, c, d, e. What is the search space?

Ø, a, b, c, d, e, ab, ac, ad, ae, bc, …, abcde

We use “head:tail” to denote the space as:

:abcdesimplify

Ø:abcde

What is the space of ? ab:cd

ab, abc, abd, abcd

3.2 Space decomposition

For a space :abcde, if abcg is frequent,

Then, the known space any subset of abc is frequent known space is :abc

The unknown space are: Any itemsets contain d or e. d:abce and e:abc

:abcde = d:abce + e:abc + :abc

3.3 The basic idea

(b) SmartMiner Strategy

SmartMiner takes advantages of the information from previous steps.

(a) Previous approach

B2

…

A1

B1 …

Creating B2 before exploring B1

Bn B’

…

A1

B1 …

Creating B’ after exploring B1

Using information from B to prune the space at B’

3.3 The tail information

For the space :abcde, if we know abcf, abcg and abfg are frequent, then we project them to the space. abcf abc. abcg abc. abfg ab.

Thus Tinf(abcf,abcg, abfg|:abcde)={abc}

3.4 Running time on Mushroom

0

1

10

100

1000

10 1 0.1 0.01 Minimum Support (%)

Total Time(sec)SmartMinerGenMaxMafia

3.5 Summary

SmartMiner uses tail information to guide the mining, efficient since A smaller search tree. No superset checking. Reduces the number of support counting.

4. Building Directories

4.1 Introduction 4.2 Knowledge Hierarchies 4.3 User Specification 4.4 Directory Generation 4.5 Integration various

directories 4.6 Summary

4.1 Introduction

Three Inputs Topics

Key Content Knowledge

trees Meaningful

User specs Customized

4.2 Knowledge Hierarchies UMLS concept hierarchies

PA: parent-child relationship RA: rather-than relationship

Problems A concept: several parents, different granularity

[lung cancer] [Neoplasms, Respiratory Tract] [lung cancer] [Neoplasms, Respiratory

System] A concept: hundreds of paths to roots

[lung cancer]: 233 different paths in UMLS by PA

4.2 Select Proper Hierarchies Set source preference order, e.g

[disease]: ICD9>SNOMED>MeSH [body part]: SNOMED>ICD9

Select proper granularity C: a set of concepts; n: a path node Score function for selecting the

node n S(n)=|{ci| cin, ci in C}|

Expert review

4.3 User Specifications A good directory ~ usage pattern User spec usage pattern User may have different specs A spec: a series of knowledge

names [disease] + [body part], or [body part] + [disease]

Build a directory for a spec by the ordering

4.4 Directory GenerationAn example

User spec 1: d + p [disease] + [body part]

User spec 2: p + d [body part] + [disease]

4.4 ~ An example d + p

p + d

1

1 11

1 1 11

4.4 ~ Algorithm

4.5 Integration various directories

For each Di, get all dir paths to Di

A Di is tree: XML Key words can

associate with tree nodes

Query: xpath Exist redundant

information

4.5 simplified model Keep only the

first level knowledge trees

For //d6//p6, we use XPath query

//doc[//d6 and //p6]

Size smaller, require some computation

4.6 Summary

Build directory by Topics Knowledge hierarchies User specifications

Mapping directories to XML By collecting directory paths for

each document Leverage on existing XML

technologies

intelligent information directory system for clinical documents

Documents

mining fi

frequent itemsets fi

set of words

umls concept candidates

specific concepts

phrase concept

general concepts

irrelevant candidates