sk ghi (wip) 22052014

Evaluating Methods for the Identification of Cancer in Free-Text Pathology Reports Using Alternative Machine Learning and Data Preprocessing Approaches

Suranga Nath Kasthurirathne

Upload: suranga-nath-kasthurirathne

Post on 28-Nov-2014

DESCRIPTION

Presentation on Evaluating Methods for the Identification of Cancer in Free-Text Pathology Reports Using Alternative Machine Learning and Data Preprocessing Approaches

TRANSCRIPT

Page 1: Sk ghi (wip) 22052014

Evaluating Methods for the Identification of Cancer in Free-Text Pathology Reports Using Alternative Machine Learning and Data Preprocessing Approaches

Suranga Nath Kasthurirathne

Page 2

What does that even mean?

Page 3

Our problem

• Cancer case reporting to public health registries is often:
– Delayed
– Incomplete

Page 4

Our emphasis

• Use pathology reports
• Automate it (it actually works!)

Our solution

• Speed
• Accuracy
• Applicability to other surveillance activities
• Computationally efficient

Page 5

Issues

• Lots of data
• Lots of FREE-TEXT data
• Not enough time
• Not enough resources

Page 6

Clarifications

When I say “We”:

• “We” in terms of decision making and consultation usually means Dr. Grannis

• “We” in terms of implementation and code mongering usually means Suranga

Page 7

Our basic approach

Page 8

Solution/s

What improvements are we trying out?

• Alternative data input formats
• Candidate decision models
• Decision model combinations
• HOW to look vs. WHAT to look for

Page 9

Manual review

• Functions as our source of truth
– What?
– Why?

Manually reviewed 1495 reports
Identified 371 (24.8%) positive cancer cases

Page 10

Machine learning process

• Identification of keywords
– What ARE keywords?
– Metastasis, tumor, malignant, neoplasm, stage, carcinoma and ca
• Identification of negation context
• Use of alternative data input formats
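The first two steps can be sketched as a per-report keyword count that separates positive from negated mentions. The keyword list is the one on the slide; the negation cues, the four-token look-back window and the function name `count_keywords` are assumptions made for illustration. This is a much-simplified stand-in for a rule-based negation detector, not the tool actually used.

```python
import re

# Keywords listed on the slide.
KEYWORDS = ["metastasis", "tumor", "malignant", "neoplasm", "stage", "carcinoma", "ca"]

# Hypothetical negation cues and look-back window (not from the slides).
NEGATION_CUES = {"no", "not", "without", "negative", "denies"}
WINDOW = 4  # number of tokens inspected before each keyword


def count_keywords(report):
    """Return {keyword: {"positive": n, "negated": m}} for one report."""
    counts = {kw: {"positive": 0, "negated": 0} for kw in KEYWORDS}
    # Handle each sentence separately so a cue cannot negate the next sentence.
    for sentence in re.split(r"[.;\n]", report.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        for i, token in enumerate(tokens):
            if token in counts:
                preceding = tokens[max(0, i - WINDOW):i]
                negated = bool(NEGATION_CUES.intersection(preceding))
                counts[token]["negated" if negated else "positive"] += 1
    return counts


report = "No evidence of carcinoma. Malignant tumor identified in the left lobe."
result = count_keywords(report)
print(result["carcinoma"])  # {'positive': 0, 'negated': 1}
print(result["tumor"])      # {'positive': 1, 'negated': 0}
```

A fixed look-back window is crude: it misses cues that follow the keyword and double negatives, which is the kind of limitation that motivates a dedicated negation tool.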

Page 11

What were the different data input formats used?

• Raw data input
• Four state data input

What and why?

Page 12

• Raw

• Four state
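The transcript does not preserve how the two formats were defined, so the sketch below is only one plausible reading, stated as an assumption: "raw" keeps the positive and negated mention counts for each keyword, while "four state" collapses each keyword to one of four categories (absent, positive only, negated only, both). The function names and state labels are hypothetical.

```python
def raw_features(counts):
    """Assumed raw input: positive and negated counts for each keyword."""
    features = []
    for keyword in sorted(counts):
        positive, negated = counts[keyword]
        features.extend([positive, negated])
    return features


def four_state_features(counts):
    """Assumed four-state input: one categorical value per keyword."""
    states = []
    for keyword in sorted(counts):
        positive, negated = counts[keyword]
        if positive and negated:
            states.append("both")
        elif positive:
            states.append("positive_only")
        elif negated:
            states.append("negated_only")
        else:
            states.append("absent")
    return states


# keyword -> (positive mentions, negated mentions) for one made-up report
example = {"carcinoma": (0, 2), "tumor": (3, 1), "stage": (0, 0), "malignant": (1, 0)}

print(raw_features(example))         # [0, 2, 1, 0, 0, 0, 3, 1]
print(four_state_features(example))  # ['negated_only', 'positive_only', 'absent', 'both']
```

Under this reading the four-state format discards how often a keyword occurs and keeps only the context in which it occurs.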

Page 13

So basically

Page 14

Training / Testing

• What?
• Why cross validation?
• Alternative decision models
– So many options!
– Classification vs. clustering analysis
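Cross validation matters here because only 1495 labeled reports exist: every report is used for testing exactly once and for training in all other rounds. A minimal sketch of a stratified split follows. The fold count of 10 and the function names are assumptions (the slides do not state the settings used), and the labels merely reproduce the 371 positive and 1124 negative reports from the manual review.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split indices into k folds with similar class ratios in each fold."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def cross_validation_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


# 371 positive and 1124 negative reports, as in the manual review.
labels = [1] * 371 + [0] * 1124
for train, test in cross_validation_splits(labels):
    positives = sum(labels[i] for i in test)
    print(len(train), len(test), positives)
```

Stratifying keeps roughly 37 or 38 positive reports in every test fold, so no fold is evaluated on an unrepresentative mix of cases.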

Page 15

To preserve my sanity, and because we’re not stupid…

• We used Weka (Waikato Environment for Knowledge Analysis)
– is a collection of machine learning algorithms for data mining tasks
– is open source!

Page 16

Decision models used

• Logistic regression
• Naïve Bayes
• Support vector machine
• K-nearest neighbor
• Random forest
• J48 decision tree

(Thanks Jamie!!!)

Page 17

Page 18

Results

• How do we measure our results?
– Precision: What % of positive predictions were correct?
– Recall: What % of positive cases were caught?
– Accuracy: What % of predictions were correct?

Precision vs. recall: the fine balance
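The three measures follow directly from the confusion counts. The sketch below uses made-up labels (1 = cancer, 0 = no cancer) and a hypothetical function name; it illustrates the definitions only and does not reproduce the study's results.

```python
def precision_recall_accuracy(actual, predicted):
    """Compute precision, recall and accuracy for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # cancer, flagged
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # not cancer, flagged
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # cancer, missed
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # not cancer, not flagged
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
print(precision_recall_accuracy(actual, predicted))
# precision 2/3, recall 2/3, accuracy 0.75

# Flagging every report catches all cases but wrecks precision.
print(precision_recall_accuracy(actual, [1] * 8))
# (0.375, 1.0, 0.375)
```

The second call shows the balance in question: recall can always be pushed to 100% by flagging everything, at the cost of precision and accuracy.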

Page 19

Results, contd.

• RF and NB showed statistically significantly lower values for precision

• SVM exhibited statistically significantly lower results for recall

• SVM and NB produced statistically significantly lower results for accuracy

Page 20

Overall performance by preprocessed input type

• Raw count is significantly better than four state

Page 21

Overall performance by decision model

• Ensemble approach is significantly better than individual algorithms
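The slides do not say how the individual models were combined. One common way to build an ensemble, assumed here purely for illustration, is a simple majority vote over each model's prediction for a report. The function name and the sample predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label list by majority vote."""
    combined = []
    # zip(*...) groups the votes that all models cast for the same report.
    for votes in zip(*predictions_per_model):
        label, _ = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined


# Made-up predictions from three models for four reports (1 = cancer).
model_a = [1, 0, 1, 0]
model_b = [1, 1, 0, 0]
model_c = [0, 1, 1, 0]
print(majority_vote([model_a, model_b, model_c]))  # [1, 1, 1, 0]
```

An odd number of voters avoids ties for binary labels; with an even number, a tie-breaking rule would have to be chosen explicitly.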

Page 22

Improvements

Page 23

Keywords? Sure, I have a list…

Better identification of keywords

Shaun

Page 24

Problems with NegEx…

Page 25

Results

• The funder is happy… we think
• We wrote an abstract!
• Feature selection approaches for keyword identification as an independent study rotation

Page 26

Our thanks to…

• Dr. Shaun Grannis (RI)
• Dr. Brian Dixon (RI)
• Dr. Judy Wawira (IUPUI)
• Eric Durbin (UKC)

Page 27

Questions?