
ECAI 2016 - International Conference – 8th Edition

Electronics, Computers and Artificial Intelligence

30 June -02 July, 2016, Ploiesti, ROMÂNIA

Development of Thai Text-Mining Model for Classifying ICD-10 TM

Pornrat Jatunarapit

Department of Computer Engineering

Faculty of Engineering, Chulalongkorn University

Bangkok, Thailand

[email protected]

Krerk Piromsopa

Department of Computer Engineering

Faculty of Engineering, Chulalongkorn University

Bangkok, Thailand

[email protected]

Chris Charoeanlap

Department of Orthopaedics

Faculty of Medicine, Chulalongkorn University

Bangkok, Thailand

[email protected]

Abstract – This paper presents a model for classifying ICD-10 TM codes using machine learning and information retrieval. A systematic approach for translating diagnoses from medical records into ICD-10 TM codes is proposed. First, information retrieval is used to find similar words in Thai and English diagnoses. Then, a machine learning approach is applied to classify ICD-10 TM codes by training models with the Naïve Bayes algorithm. The results show that the proposed approach classifies ICD-10 TM codes from Thai-English diagnoses with an accuracy of 81.41%.

Keywords-ICD-10 TM; Text mining; IR; machine learning; Thai diagnosis

I. INTRODUCTION

ICD-10 is an international classification of diseases, signs, abnormal findings, symptoms, complaints, circumstances, and external causes of injury maintained by the WHO (World Health Organization) [1]; ICD-10 TM is its Thai Modification. Every medical note must be associated with an ICD-10 code, and in several countries hospitals are required to submit these codes to the government.

To code ICD-10 TM, most hospitals employ medical coders [2], ICD-10 TM specialists who manually map a doctor's diagnosis to ICD-10 TM codes by consulting a codebook. This manual process introduces (1) a delay in ICD-10 TM recording, (2) a requirement for highly educated staff, and (3) a complex process, as shown in Figure 1.

The short-term goal of this research is to provide an initial machine learning model that classifies ICD-10 TM codes using text mining, in order to assist medical coders. The long-term goal is to extract ICD-10 TM codes automatically and directly from medical records. This research presents a new method for classifying ICD-10 TM codes from Thai free text, which is useful for medical record keeping.

The rest of this paper is organized as follows. Section II explains background knowledge related to ICD-10 TM, text mining, and information retrieval. Section III reviews related works. Section IV describes our design. Finally, Sections V and VI present the evaluation and the conclusion, respectively.

Figure 1. The process of classifying ICD-10 TM in the outpatient department (OPD) of King Chulalongkorn Memorial Hospital.

II. BACKGROUND

This section provides the background necessary for understanding our work. We first discuss ICD-10 TM and its complexity. Later, we discuss text mining, Thai language tools, and machine learning models.

A. ICD-10 TM

The International Classification of Diseases (ICD) was initially developed by the WHO for classifying diseases by cause of death. It is currently in its 10th edition. Because each country has different diseases and health problems, countries localize their own ICD. For example, the United States has ICD-9 CM (CM: Clinical Modification). Likewise, Thailand created ICD-10 TM (TM: Thai Modification) in 2001, and it has been approved by the WHO. In this version, experts in each medical specialty from the Ministry of Public Health, colleges, and the associations of medical specialties added more diseases to ICD-10 TM.


A code in ICD-10 TM generally begins with a capital letter A-Z followed by 2-4 Arabic digits. A "." (dot) separates the third and fourth characters of the code, as in M17.1. Fig. 2 is an example of an ICD-10 TM code.

Figure 2. ICD-10 TM Code.
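The code layout described above can be checked mechanically. The following is a minimal sketch in Python; the regular expression and the function name `is_well_formed_code` are assumptions derived only from the textual description (one capital letter, two digits, an optional dot and one or two further digits), not an official ICD-10 TM validation rule.

```python
import re

# Assumed pattern, inferred from the description in the text:
# one capital letter, two digits, then optionally a dot and 1-2 more digits.
ICD10_TM_PATTERN = re.compile(r"[A-Z]\d{2}(?:\.\d{1,2})?")


def is_well_formed_code(code: str) -> bool:
    """Return True if `code` follows the letter + 2-4 digit layout."""
    return ICD10_TM_PATTERN.fullmatch(code) is not None
```

Such a check only tests the shape of a code. It does not confirm that the code exists in the ICD-10 TM codebook.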

ICD-10 TM codes are grouped into principal diagnosis, comorbidity, complication, other diagnosis, external cause, and non-OR procedure [3]. To ease understanding, we first explain the terminology.

1) Principal diagnosis is the disease present before treatment, or the main disease that causes the patient to receive treatment.

2) Comorbidity is a disorder that is present before treatment and has less effect than the principal disease.

3) Complication is a disease that occurs while the patient is receiving treatment.

4) External cause is the cause that led to treatment, such as an injury from an accident.

5) Other diagnosis is any lesser disease that may affect the treatment result.

6) Non-OR procedure is a treatment that applies medical instruments to the patient.

B. Text Mining [4][5]

Text mining is a complex information-analysis procedure that draws on word frequency, word classification, word meaning, natural language processing, machine learning, and information retrieval. There are 3 methods in text mining: 1) clustering, 2) question answering, and 3) concept linking.

C. Information Retrieval [4][6]

Information retrieval searches for similarity by building an index of the subject data in the document files and an index of the query, then comparing the two to find the closest meaning. To perform information retrieval across languages, the search process is divided into 3 subprocesses:

1) Indexing phase: In this phase, the system creates document representatives to reduce the time needed to search the data. Inverted indexing must gather all required data and parse them before language processing, so that unneeded words are deleted before indexing. The process mostly involves sorting alphabetically and counting repetitions and frequencies.

2) Translation phase [7]: This phase can be divided into 2 types:

2.1) Direct translation: dictionary-based translation.

2.2) Indirect translation: latent semantic analysis and explicit semantic analysis. Latent Semantic Analysis (LSA) searches without translation; it uses an index, a matrix, and cosine correlation to find the similarity between documents or queries. This process is highly efficient and is popular for measuring the similarity between two data items.

3) Matching phase: This phase compares or matches the index and the keywords using conventional methods.
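The indexing and matching phases can be illustrated with a small sketch. The Python code below is an assumption-laden toy, not the system described in this paper: it builds an inverted index over already-tokenized documents and ranks them against a query by cosine similarity of raw term-frequency vectors. It does not perform the matrix decomposition that latent semantic analysis requires, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Indexing phase: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, freq in Counter(tokens).items():
            index[term][doc_id] = freq
    return index


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_tokens, docs):
    """Matching phase: score only documents sharing a term with the query."""
    index = build_inverted_index(docs)
    query_vec = Counter(query_tokens)
    candidates = set()
    for term in query_vec:
        candidates.update(index.get(term, {}))
    scored = [(doc_id, cosine_similarity(query_vec, Counter(docs[doc_id])))
              for doc_id in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The inverted index keeps the matching phase from scoring documents that share no term with the query, which is the main reason for building it before searching.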

Lucene [8] is open-source information retrieval software written in the Java programming language. It can extract data, store data, and create keyword indexes that refer back to documents.

D. Machine Learning [9]

Machine learning is a branch of artificial intelligence. There are at least 3 types of machine learning.

1) Supervised learning learns from training data labeled with the correct results.

2) Unsupervised learning learns directly from the input data without any hints.

3) Reinforcement learning learns from interaction with an environment. The Tic-Tac-Toe game is a good example of reinforcement learning.

E. Word Segmentation

Word segmentation is the process of dividing statements and sentences into words. Nowadays, there are plenty of open-source tools for word segmentation, such as the standard Java library and LexTo [10].
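Because Thai is written without spaces between words, dictionary-based segmentation is a common starting point. The sketch below shows greedy longest matching in Python under stated assumptions: the function name, the dictionary, and the single-character fallback are illustrative choices and do not reproduce the internals of LexTo or any other tool.

```python
def longest_matching_segment(text, dictionary, max_word_len=None):
    """Greedy left-to-right segmentation.

    At each position, take the longest dictionary word that starts there;
    if no dictionary word matches, emit a single character and move on.
    """
    if max_word_len is None:
        max_word_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    pos = 0
    while pos < len(text):
        match = None
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(min(max_word_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            match = text[pos]  # unknown character: fall back to length 1
        tokens.append(match)
        pos += len(match)
    return tokens
```

The same function works on Thai input when the dictionary holds Thai words. Greedy longest matching can pick a wrong boundary when a long word overlaps the start of the next one, which is one reason segmentation quality depends heavily on the dictionary.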

III. RELATED WORKS

A. Cross Language Information Retrieval (CLIR)

Several works address cross-language information retrieval. T. Z. a. and Y.-J. Zhang [11] describe information retrieval for English-Chinese languages using a bilingual dictionary with synonyms. P. Akewaranukulsiri [12] uses a bilingual dictionary with a vector model and cosine similarity in Thai to find meanings for Thai herbs. That work makes 3 suggestions. 1) The number of related entries found by semantic keyword analysis between Thai herbs and modern medicine is related to the size of the matrix. 2) Information retrieval is more effective when specific words are used. 3) Information retrieval is less efficient when the vector model is built from query expansion and keyword index creation. Following these suggestions makes cross-language information retrieval more accurate.

B. Machine Learning

In 2009, K. Phosai [13] demonstrated latent semantic analysis and machine learning for a Thai question answering system. Semantic analysis is used for answering in Thai because the Thai language is more complex than English (e.g., there are no spaces between words). This work used CRFs as a learning model for grouping queries and choosing text. Finally, Naïve Bayes gave the highest average accuracy.

In 2014, N. Chirawichitchai [14] presented emotion classification of Thai text using Boolean weighting with a support vector machine. The support vector machine algorithm achieved an accuracy of 77.86%.

C. Medical Record System

In 2008, Farkas and Szarvas [17] presented coding support systems (CSSs) for the automatic assignment of ICD-9-CM codes (with a limited number of possible ICD-9-CM codes) using a support vector machine. In 2014, Chiaravalloti et al. [16] implemented a CSS based on Java tools and a CSS framework for natural language. This work shows an accuracy of 92% on short medical texts (3,000 samples). The paper describes 3 steps for assigning ICD-9-CM codes:

1) Text preprocessing: transform the text into a standardized form.

2) Query generation: expand synonyms to increase the probability of retrieving the correct ICD-9-CM code.

3) Code selection: run the query that identifies all candidate codes in the knowledge base.

In 2010, Chen et al. [15] used semantic analysis of English free text to assign ICD-9-CM codes to 978 patient records. This work reports an average precision of 67.0% with semantic features and 75.6% with matching features. Their semantic analysis method (a semantic graph) includes dependency parsing of clinical records and calculation of a semantic matching score to classify ICD codes. The experimental results cover three domains:

1) With semantic features: digestive 67%, neural 70%, and respiratory 63.3%.

2) With matching features: digestive 78.8%, neural 71.0%, and respiratory 75.9%.

Recent research classifies ICD-10 codes from medical records by using WEKA [19] to compare the accuracy of C4.5 and Naïve Bayes, and then applies the Apriori algorithm [18]. The results show that the system can classify 115 diagnosis samples into 3 types of diseases, each with 7 diseases, at an accuracy of 86%.

This paper develops a model for classifying ICD-10 TM codes from free text using information retrieval and a text classification algorithm. The classifier is selected by validating the accuracy of Naïve Bayes, support vector machine, and decision tree against 3,000 diagnosis notes.

IV. PROPOSED APPROACH

Our aim is to create a model for classifying ICD-10 TM codes using information retrieval technology and machine learning techniques. Our design methodology has 3 steps: 1) data preparation, 2) modeling, and 3) tool creation. The steps are shown in Fig. 3.

Figure 3. Overview of approach.

A. Data Preparation

Diagnosis data were collected from King Chulalongkorn Memorial Hospital, with confidential patient information blinded. The data include the ICD-10 TM code and its description in both Thai and English. The data are separated into 2 data sets (a test set and a training set).

B. Model

The overview of our model is shown in Fig. 4. There are 7 steps in our model. They are the basis of our tool.



[Figure 4 content: Thai and English diagnosis records with the fields Disease, Symptoms, Disease area, and ICD-10 TM, for example Disease: มะเร็งกระดูก / Bone cancer, Malignant bone tumor, CA, linked through the control word list.]

Figure 4. Overview of the model for classifying ICD-10 TM.

1) Input: take the diagnosis data and segment it into words using LexTo. (LexTo is word segmentation software developed by the HLT Lab (NECTEC); it uses an embedded longest matching feature.) In this study, the diagnosis data are assumed to contain no misspellings.

2) Cleaning & keybase: clean the words and extract keywords from the diagnosis.

3) Control word: contains a vocabulary list of disease terms in both English and Thai. The meanings and the acronyms [20] used by doctors are shown in Fig. 5 and Table I.

Figure 5. Example control word

TABLE I. ACRONYMS FROM THE “อักษรย่อที่หมอใช้” (ABBREVIATIONS THAT DOCTORS USE) BOOK.

Acronyms | Full Name
PTA | Prior To Admission
THA | Total hip arthroplasty
TKA | Total Knee Arthroplasty
TLIF | Transforaminal lumbar interbody fusion
TMT | Tarsometatarsal
yr | Year

4) Indexing: create an index from the diagnoses and ICD-10 TM codes using the Lucene library, to assign weights before the semantic search.

5) Latent Semantic: search for similar words to increase the accuracy of the model.

6) Classifying: use a classification algorithm to assign the ICD-10 TM code from the output of 5).

7) Machine Learning: If the tool shows a wrong ICD-10 TM code, the user can correct it. The corrected code is then used for retraining the system, as shown in Fig. 6.

Figure 6. Machine learning process
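The classification and retraining steps can be sketched with a small multinomial Naïve Bayes classifier. The Python code below is a hedged illustration only: the class name `NaiveBayesCoder`, the Laplace smoothing, and the rule of ignoring unseen tokens are assumptions, and it is not the WEKA or Java implementation used in this work. A correction from a medical coder is modeled simply as one more call to `train`.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesCoder:
    """Multinomial Naive Bayes over keyword tokens with Laplace smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # notes seen per code
        self.term_counts = defaultdict(Counter)  # term frequency per code
        self.vocabulary = set()

    def train(self, tokens, code):
        """Add one labelled note; also used for corrections by a coder."""
        self.class_counts[code] += 1
        self.term_counts[code].update(tokens)
        self.vocabulary.update(tokens)

    def predict(self, tokens):
        """Return the code with the highest log posterior, or None."""
        total_notes = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_code, best_score = None, -math.inf
        for code, n_notes in self.class_counts.items():
            score = math.log(n_notes / total_notes)
            total_terms = sum(self.term_counts[code].values())
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # assumption: tokens never seen are ignored
                count = self.term_counts[code][token]
                score += math.log((count + 1) / (total_terms + vocab_size))
            if score > best_score:
                best_code, best_score = code, score
        return best_code
```

For example, after `model.train(["hip", "dislocation"], "S73.09")` and `model.train(["toe", "amputation"], "S98.1")`, calling `model.predict(["hip", "pain"])` favors S73.09 because only that code has seen the token "hip". The tokens here are illustrative and are not taken from the study data.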

C. Tool Creation

The ICD-10 TM classification model was developed in the Java programming language, with a MongoDB [21] (NoSQL) database for data storage.

D. Sample Cases

Table II shows the percentage of ICD-10 TM codes, per chapter, in the 3,000 samples from the orthopedic department. In addition, Table III presents an example input to our information retrieval model.

TABLE II. PERCENTAGE OF ICD-10 TM CODE USE IN THE ANALYZED SAMPLE PER SECTION OF DISEASES.

Chapter | ICD-10 TM | Percentage of code usage
II | Neoplasms | 5.4 %
VI | Diseases of the nervous system | 1.2 %
XIII | Diseases of the musculoskeletal system and connective tissue | 67.1 %
XVII | Congenital malformations, deformations and chromosomal abnormalities | 3.1 %
XIX | Injury, poisoning and certain other consequences of external causes | 23.2 %


TABLE III. EXAMPLE CASE.

Input: 5 yr PTA ปวดร้าวลงขาสองข้าง ซ้าย>ขวา ไม่ปวดหลัง ปวดเวลาเดินไกล (5 years prior to admission, pain radiating down both legs, left more than right, no back pain, pain when walking far)

Word segment (LexTo): 5 | yr | PTA | ปวดร้าว | ลงขา | สองข้าง | ซ้าย | > | ขวา | ไม่ปวดหลัง | ปวด | เวลา | เดินไกล

Semantic & Control word: 5 | year | prior to admission | ปวดร้าว | ลงขา | สองข้าง | ซ้าย | ไป | ขวา | ไม่ปวดหลัง | ปวด | เวลา | เดินไกล

Output: ปวดร้าว | ลงขา | สองข้าง |
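The control word substitution in the example above can be sketched as a dictionary lookup over the segmented tokens. In the Python code below, the acronym entries come from Table I and the ">" to "ไป" rule mirrors the Table III example; the dictionary layout and the function name `expand_control_words` are assumptions made for illustration.

```python
# Entries copied from Table I plus the ">" rule seen in Table III.
# The flat-dictionary layout is an assumption, not the tool's data model.
CONTROL_WORDS = {
    "PTA": "prior to admission",
    "THA": "total hip arthroplasty",
    "TKA": "total knee arthroplasty",
    "TLIF": "transforaminal lumbar interbody fusion",
    "yr": "year",
    ">": "ไป",
}


def expand_control_words(tokens, control_words=CONTROL_WORDS):
    """Replace each token that has a control-word entry; keep the rest."""
    return [control_words.get(token, token) for token in tokens]
```

Tokens without an entry pass through unchanged, so Thai keywords produced by the segmenter are preserved for the later indexing step.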

V. EVALUATION

We validate our model using 3,000 samples. The assessment compares the model's results with those of a medical coder to evaluate accuracy, precision, and recall. The definitions and equations are shown in Table IV and Equations 1-3.

TABLE IV. RESULT OF EVALUATION.

System output | Predicted condition positive | Predicted condition negative
The system displays ICD-10 TM | TP (True Positive): correct result | FP (False Positive): unexpected result
The system does not display ICD-10 TM | FN (False Negative): missing result | TN (True Negative): correct absence of result

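$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{1}$$

Recall is the fraction of relevant instances that are retrieved.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$

Precision is the fraction of retrieved instances that are relevant.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$

Accuracy is the agreement between the values produced by the tool and the actual values.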
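The three measures follow directly from the counts defined in Table IV. The Python sketch below uses the standard definitions of recall, precision, and accuracy; the function names and the choice to return 0.0 for an empty denominator are assumptions.

```python
def recall(tp, fn):
    """Share of relevant instances that were retrieved: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Share of retrieved instances that are relevant: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def accuracy(tp, tn, fp, fn):
    """Share of all decisions that were correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0
```

With hypothetical counts of 8 true positives, 5 true negatives, 2 false positives, and 5 false negatives, `accuracy(8, 5, 2, 5)` is 13/20.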

TABLE V. RESULT FROM WEKA (10-FOLD CROSS VALIDATION)

Algorithm | Predicted condition positive: Correct | Predicted condition positive: Incorrect | Precision | Recall
Support vector machines | 80.21 % | 19.79 % | 0.802 | 0.839
Naïve Bayes | 81.41 % | 18.59 % | 0.814 | 0.835
Decision Tree (C4.5) | 73.63 % | 26.37 % | 0.727 | 0.800

The preliminary results in Table V show that the decision tree gives the worst result and Naïve Bayes yields the best precision at 81.41%. SVM, however, gives the best recall. Overall, Naïve Bayes gives the best result. Therefore, we chose Naïve Bayes for developing our tool.
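For readers unfamiliar with the 10-fold cross validation named in the caption of Table V, the following Python sketch shows how the sample indices can be partitioned. It is an illustration under assumptions: the function name `k_fold_indices` is hypothetical, the folds are contiguous, and no shuffling or stratification is applied, so it should not be read as a description of WEKA's internal procedure.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size
```

With 3,000 samples and k = 10, each fold holds out 300 samples for testing and trains on the remaining 2,700, and every sample is tested exactly once.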

Example results from our tool are shown in Table VI.

TABLE VI. EXAMPLE RESULT FROM TOOL.

Diagnosis | ICD-10 TM (medical coder) | ICD-10 TM (model)
ลื่นล้มสะโพกซ้ายกระแทกพื้น ปวดสะโพกซ้าย ยืนลงน้ำหนักไม่ได้ | M17.1 | M17.1
ญ 58 yr กระดูกสันหลังคด ไม่มีอาการอื่น ไม่ปวดหลัง | M41.1(5) | M41.1
นิ้วเท้าที่ 2 ด้านซ้ายขาด | S98.1 | S98.1
Chronic Lt Hip dislocation Operation : Open reduction and Hip spica Lt. | S73.09 | S73.09
Myelogram , Post myelogram no complication | Z09.8 | Z09.8
มีรถตัดหน้า มอร์เตอร์ไซด์ล้ม ไหล่ขวากระแทกพื้น Prominent Rt distal clavicle ขยับลำบาก ไม่ชา | S49.8 | S48.9

VI. CONCLUSIONS AND FUTURE WORK

A machine learning model for classifying ICD-10 TM codes using text mining has been developed. The results show that the proposed approach yields an accuracy of up to 81.41% on the initial data. The advantage of this research is that the model can assist medical coders and reduce their working time. However, these results are preliminary, and the accuracy can be improved further. The work can also be applied to other medical departments. We aim to develop fully automated ICD-10 TM extraction from follow-up notes.

ACKNOWLEDGMENT

We would like to thank King Chulalongkorn Memorial Hospital for its support.

REFERENCES

[1] World Health Organization. [Online]. Available: http://www.who.int/about/en/. [Accessed 5 October 2015].

[2] Advancing the business of Healthcare [Online]. Available: http://www.aapc.com/medical-coding/medical-coding.aspx. [Accessed 3 February 2016].

[3] K. Sangkhawasi, Introduction to ICD-10 [Online]. Available: http://www.slideserve.com/kerry/icd-10. [Accessed 10 December 2016].

[4] W. Wongwilaisakun, “Data Warehouse and Data Mining for Management,” Panyapiwat Journal, vol. 2, no. 2, pp. 157-165, special issue, May.

[5] R. Feldman and J. Sanger, The Text Mining Handbook.

[6] K. Kesorn, “Cross language (Thai-English) Information Retrieval: Concepts and Challenges,” KKU Sci. J. 41(1), pp.121-133, 2013.



[7] K. Kesorn, “Semantic Search: The New Idea of Search Engine and The Way for Future Development,” Valaya Alongkorn Review, vol.2.

[8] D.Cutting [Online]. Available: http://lucene.apache.org/. [Accessed 3 February 2016].

[9] D. Zhang and J. J. P. Tsai, “Machine Learning and Software Engineering,” Kluwer Academic, 2003.

[10] National Electronics and Computer Technology Center [Online]. Available: http://www.sansarn.com/lexto/. [Accessed 20 December 2016].

[11] T. Z. a. Y.-J. Zhang, “Research on Chinese-English Cross-Language Information Retrieval,” Machine Learning and Cybernetics, 2008 International Conference, vol.5, pp. 2591-2596, 2008.

[12] P.Akewaranukulsiri, “Semantic and Cross-Language Information Retrieval for Thai Herbal and Medicine Using Latent Semantic Analysis,” International Conference on Information Science and Applications, 2013.

[13] K.Phosai, “Latent semantic analysis and machine learning for Thai question answering system,” 2009.

[14] N. Chirawichitchai, “Emotion Classification of Thai Text based Using Term weighting and Machine Learning Techniques,” 2014 11th International Joint Conference on Computer Science and Software Engineering (JCSSE), 2014.

[15] P.Chen, A.Barrera, C.Rhodes, “Semantic analysis of free text and its application on automatically assigning ICD-9-CM code to patient records,” Cognitive Informatics (ICCI),2010 9th IEEE International Conference, pp.68-74, 2010.

[16] M. Teresa Chiaravalloti, R. Guarasci, V.Lagani, E.Pasceri, R.Trunfio, “A Coding Support System for the ICD-9-CM standard,” International Conference on Healthcare Informatics, pp.71-78, 2014.

[17] R. Farkas and G. Szarvas, “Automatic construction of rule-based ICD-9-CM coding systems,” BMC Bioinformatics, vol.9, no. 3:S10, pp.1-9, 2008.

[18] S.Monthasuwan, P.Tantasanawong, N.Ruangrit, “Development system for searching code ICD-10 for medical record,” Science and Technology Silpakorn University, pp.74-88, 2015.

[19] Weka version 3.6.13 1999-2008, The university of Waikato [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/. [Accessed 16 August 2015].

[20] K. Sangkhawasi, “อักษรย่อที่หมอใช้”.

[21] MongoDB Inc., [Online]. Available: https://www.mongodb.org/. [Accessed 20 March 2016].
