Post on 19-Jan-2018
1
The Automatic SNOMED Coding System
• Data Overview
• System Design
• Experiments
• Future Work
By Weihang ZHANG
Supervisor: Prof. Jon PATRICK
2
Data Overview – The Pathology Text
• 400K pathology texts from the SWAPS Anatomical Pathology Database
• Every pathology text is indexed by “RequestID”
• A set of diagnoses for each report (pathology text), presented as SNOMED RT codes
3
Data Overview – The Pathology Text
Insight into a text (RequestID=“1”)
<Title>CLINICAL HISTORY</Title>
Biopsy of discoid erythematosus like lesion from right cheek ? DLE.
<Title>MACROSCOPIC</Title>
LABELLED `RIGHT CHEEK LESION'. An ellipse 12 x 3mm with subcutis to 3mm. A poorly defined pale nodular lesion 3 x 3mm. It appears to abut the surgical margin. Representative sections embeded, A tips face on, B lesion and surgical margin. (MR 17/4)<DOT>TA</DOT>
<Title>MICROSCOPIC</Title>
Section shows hyperkeratosis with occasional follicular plugging, epidermal atrophy and severe sundamage to dermal collagen. A dense chronic inflammatory cell infiltrate, both superficial and deep is present, mainly in a perivascular and periadnexal distribution. No liquefaction degeneration of the basal layer, no dermal oedema and no interface dermatitis are seen. PAS stain reveals no thickening of the epidermal basement membrane and only an occasional fungal spore on the skin surface.
Immunofluorescence for immunoglobulins and complement fractions are negative.
The differential diagnosis rests between chronic discoid erythematosus, lymphocytic infiltration of skin of Jessner and the plaque type of polymorphous light eruption. The presence of marked solar damage to collagen, the absence of basal liquefaction degeneration and the negative immunofluorescence favours polymorphous light eruption. A reaction to drugs or an insect bite is also a possibility. No evidence of malignancy.
Reported 24/4/98
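The report above is divided into sections by `<Title>` tags. As a minimal sketch (not the system's actual parser, which the slides do not show), the sections could be separated like this; the function name `split_sections` is hypothetical.

```python
import re

# Matches a section header such as <Title>MICROSCOPIC</Title>.
TITLE_RE = re.compile(r"<Title>(.*?)</Title>")

def split_sections(report):
    """Map each section title to the text that follows it."""
    # re.split with one capturing group yields:
    # [text_before_first_title, title1, body1, title2, body2, ...]
    parts = TITLE_RE.split(report)
    sections = {}
    for title, body in zip(parts[1::2], parts[2::2]):
        sections[title.strip()] = body.strip()
    return sections

# Shortened excerpt of the sample report shown above.
sample = (
    "<Title>CLINICAL HISTORY</Title>\n"
    "Biopsy of discoid erythematosus like lesion from right cheek ? DLE.\n"
    "<Title>MACROSCOPIC</Title>\n"
    "LABELLED `RIGHT CHEEK LESION'. An ellipse 12 x 3mm with subcutis to 3mm.\n"
)
sections = split_sections(sample)
```

Keeping the sections apart in this way would also support the section hiding / focusing idea listed under Future Work.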
4
Data Overview – The SNOMED Code
The SNOMED Codes Assigned
The SNOMED Codes Explained
5
Data Overview – Sequence Number and Multi-label
• One pathology text vs. many SNOMED codes
Example:
RequestID ’1’: 4 sequences, 3 codes
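A small sketch of how one report ends up with several labels. It assumes that each database row holds a RequestID, a sequence number, and one code, and that a code may repeat across sequences (which would explain 4 sequences but 3 codes). The rows and code strings below are placeholders, not real SNOMED RT codes.

```python
from collections import defaultdict

# Hypothetical rows: (RequestID, sequence number, SNOMED code).
# The code strings are placeholders, not real SNOMED RT codes.
rows = [
    ("1", 1, "CODE_A"),
    ("1", 2, "CODE_B"),
    ("1", 3, "CODE_A"),  # assumed repeat: same code under another sequence
    ("1", 4, "CODE_C"),
]

def labels_per_report(rows):
    """Collect the set of distinct codes assigned to each report."""
    labels = defaultdict(set)
    for request_id, _sequence, code in rows:
        labels[request_id].add(code)
    return dict(labels)

labels = labels_per_report(rows)
```

Each report is therefore a multi-label instance: its target is a set of codes rather than a single class.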
6
Data Overview – Diagnosis Codes Distribution
• The first 9 codes are selected for experiments (excluding ‘None’)
• All remaining codes are grouped as “others”
Uniform random selection: 10K pathology texts from the 400K-text database
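The selection step can be sketched as follows. This assumes that "the first 9 codes" means the 9 most frequent codes and that ‘None’ is left untouched rather than folded into “others”; the slides state neither, and the function names are hypothetical.

```python
import random
from collections import Counter

def sample_reports(request_ids, k, seed=0):
    """Uniformly sample k distinct report IDs without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(request_ids), k)

def collapse_codes(codes_per_report, keep=9, ignore=("None",)):
    """Keep the `keep` most frequent codes; map every other code to 'others'.

    Codes listed in `ignore` are not counted and are passed through as-is
    (an assumption about how 'None' is handled).
    """
    counts = Counter(
        code
        for codes in codes_per_report.values()
        for code in codes
        if code not in ignore
    )
    kept = {code for code, _ in counts.most_common(keep)}
    return {
        rid: {c if (c in kept or c in ignore) else "others" for c in codes}
        for rid, codes in codes_per_report.items()
    }

# Toy example with keep=2 instead of 9.
toy = {"r1": {"A", "B"}, "r2": {"A", "C"}, "r3": {"A", "B", "None"}}
collapsed = collapse_codes(toy, keep=2)
picked = sample_reports(range(400), 10)
```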
7
Data Overview – Attributes inspection on the texts
Attributes inspection: Self-Separated Distribution (SVM…?)
8
System Design – Text-to-Vector Extracting
• Configuration (xml): properly de-coupled classes; flexibly interchangeable methods; freely switchable output file type; prepared for unsupervised experiments
• Uniformly Random Selection: 10K texts are selected from the 400K dataset
• Stratified Resample: 10-fold Cross Validation
• Output (ARFF, DAT): ARFF for Weka; DAT for SVM-light and MaxEnt
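To illustrate the two output types, here is a rough sketch of writers for a sparse feature vector. The slides do not show the exact layout of the DAT files, so the first function assumes SVM-light's standard input format (`<target> <index>:<value> ...`); the second writes a dense ARFF document for Weka. Function, attribute, and class-label names are hypothetical.

```python
def to_svmlight_line(label, vector):
    """Format one sparse vector as an SVM-light input line.

    `label` is +1 or -1; `vector` maps a 1-based feature index to its value.
    SVM-light expects feature indices in increasing order.
    """
    feats = " ".join(f"{i}:{v:g}" for i, v in sorted(vector.items()) if v != 0)
    return f"{label:+d} {feats}"

def to_arff(relation, n_features, instances):
    """Build a dense ARFF document with a nominal class attribute."""
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute f{i} numeric" for i in range(1, n_features + 1)]
    lines += ["@attribute class {pos,neg}", "", "@data"]
    for label, vector in instances:
        values = [f"{vector.get(i, 0):g}" for i in range(1, n_features + 1)]
        values.append("pos" if label > 0 else "neg")
        lines.append(",".join(values))
    return "\n".join(lines) + "\n"

vector = {3: 2.0, 1: 0.5}
dat_line = to_svmlight_line(1, vector)
arff_text = to_arff("CODE_1", 3, [(1, vector)])
```

Writing both formats from the same in-memory vectors is one way to keep the output file type freely switchable, as the configuration bullet describes.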
• Text to Vector
[Class diagram: Console, XmlConfigurator, DOM, Preprocessor, DimReducer, Indexer, MysqlAgent, FileGenerator, InstanceFactory, Histograms, with an operation +config(). Associations labeled Read and Generate link the classes to config.xml, MysqlDatabase, and the generated {FeatureMatrixFile} outputs: CODE_1 vector_data, CODE_2 vector_data, CODE_3 vector_data, … CODE_K vector_data (10-fold).]
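The per-code vector data is split for stratified 10-fold cross validation. A minimal sketch of one way to build such folds, assuming binary +1 / -1 labels for a single code; this is not the system's actual resampling code, and `stratified_folds` is a hypothetical name.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each instance index to a fold, keeping class ratios similar.

    `labels` is a list of class labels (e.g. +1 / -1 for one code).
    Returns a list of `n_folds` lists of instance indices.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds

# Toy label list: 20 positives and 80 negatives for one code.
labels = [1] * 20 + [-1] * 80
folds = stratified_folds(labels)
```

In each round of cross validation, one fold serves as the test set and the other nine as the training set.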
10
System Design – Classifiers Generation
• Machine Learner Selection: SVM-Light, MaxEnt, J48 (Weka), …
• Classifiers Generation: depends on the selected features and machine learners (MLs)
[Diagram: ML Selection and Classifiers Generation. The CODE_1, CODE_2, CODE_3, … CODE_K vector_data sets (10-fold) are passed to the selected MachineLearner (from MachineLearners), which produces Classifier_1, Classifier_2, Classifier_3, … Classifier_K.]
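The one-classifier-per-code idea can be sketched as below. A simple perceptron is used only as a stand-in learner so the example stays self-contained; the system itself uses SVM-Light, MaxEnt, or J48. The toy data and all function names are hypothetical.

```python
def train_perceptron(instances, epochs=10):
    """Train a binary perceptron on (label, sparse_vector) pairs.

    Stand-in learner for illustration only; labels are +1 / -1 and
    vectors map feature index -> value.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for label, vector in instances:
            score = bias + sum(weights.get(i, 0.0) * v for i, v in vector.items())
            if label * score <= 0:  # misclassified: move towards the label
                for i, v in vector.items():
                    weights[i] = weights.get(i, 0.0) + label * v
                bias += label
    return weights, bias

def predict(model, vector):
    """Return +1 or -1 for one sparse vector."""
    weights, bias = model
    score = bias + sum(weights.get(i, 0.0) * v for i, v in vector.items())
    return 1 if score > 0 else -1

def train_per_code(datasets):
    """Train one binary classifier per code: {code: instances} -> {code: model}."""
    return {code: train_perceptron(inst) for code, inst in datasets.items()}

# Hypothetical toy data: feature 1 signals CODE_1, feature 2 signals CODE_2.
datasets = {
    "CODE_1": [(1, {1: 1.0}), (-1, {2: 1.0}), (1, {1: 1.0, 3: 1.0}), (-1, {3: 1.0})],
    "CODE_2": [(-1, {1: 1.0}), (1, {2: 1.0}), (-1, {1: 1.0, 3: 1.0}), (-1, {3: 1.0})],
}
classifiers = train_per_code(datasets)
```

Swapping `train_perceptron` for a call into another learner would leave `train_per_code` unchanged, which mirrors the separation of ML selection from classifier generation.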
11
System Design – Code Assembling
• Assembly line: Text-to-Vector conversion, then vector delivery
• Classifiers, the workers: all classifiers work together, each assigning its classification result to the vector
• Code Assembling
[Diagram: NewText is passed (classify) to Classifier_1, Classifier_2, Object_3, … Object_K, which return CODE_1: Positive, CODE_2: Negative, CODE_3: Negative, … CODE_K: Positive.]
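A minimal sketch of the assembling step, assuming each per-code classifier can be called on a vector and returns a positive or negative decision. The lambda classifiers and feature indices below are hypothetical stand-ins chosen to reproduce the pattern in the diagram.

```python
def assemble_codes(vector, classifiers):
    """Run every per-code classifier on one vector and keep the positives.

    `classifiers` maps a code name to a callable returning True (positive)
    or False (negative) for the given vector.
    """
    results = {code: bool(clf(vector)) for code, clf in classifiers.items()}
    assigned = [code for code, positive in results.items() if positive]
    return results, assigned

# Hypothetical stand-in classifiers: each fires on one feature index.
classifiers = {
    "CODE_1": lambda v: v.get(1, 0) > 0,
    "CODE_2": lambda v: v.get(2, 0) > 0,
    "CODE_3": lambda v: v.get(3, 0) > 0,
    "CODE_K": lambda v: v.get(4, 0) > 0,
}
new_text_vector = {1: 1.0, 4: 2.0}
results, assigned = assemble_codes(new_text_vector, classifiers)
```

The set of positive codes is the multi-label output for the new text.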
12
Experiments – 3 sets of comparisons (winners)
• Preprocessing: Stem-All-Words vs. Stem-None
• Indexing: Weight-Entropy vs. Word-Frequency vs. Boolean-Weight
• Learning (machine learner performance): SVM vs. Maximum-Entropy (MaxEnt) vs. Weka J48
Baseline 1: Stem-None + Weight-Entropy + SVM
The trade-off between TIME and ACCURACY
Baseline 1~: Stem-None + Boolean-Weight + MaxEnt
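The three indexing schemes can be compared on a toy corpus. The slides do not define Weight-Entropy, so this sketch assumes the common log-entropy scheme: local weight log(1 + tf) multiplied by a global weight 1 + Σ_j p_ij·log(p_ij) / log(N), where p_ij is the share of term i's occurrences that fall in document j and N is the number of documents. All function names are hypothetical.

```python
import math
from collections import Counter

def boolean_weights(doc_tokens):
    """Boolean-Weight: 1 for every term present in the document."""
    return {term: 1 for term in set(doc_tokens)}

def frequency_weights(doc_tokens):
    """Word-Frequency: raw term counts."""
    return dict(Counter(doc_tokens))

def entropy_global_weights(corpus):
    """Assumed log-entropy global weight: 1 + sum_j p_ij*log(p_ij) / log(N)."""
    n_docs = len(corpus)
    per_doc = [Counter(doc) for doc in corpus]
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)
    weights = {}
    for term, total in totals.items():
        entropy_sum = 0.0
        for counts in per_doc:
            tf = counts.get(term, 0)
            if tf:
                p = tf / total
                entropy_sum += p * math.log(p)
        weights[term] = (1.0 + entropy_sum / math.log(n_docs)) if n_docs > 1 else 1.0
    return weights

def entropy_weights(doc_tokens, global_weights):
    """Weight-Entropy (assumed): log(1 + tf) times the global weight."""
    counts = Counter(doc_tokens)
    return {t: math.log(1 + tf) * global_weights[t] for t, tf in counts.items()}

corpus = [["skin", "lesion"], ["skin", "biopsy"]]
global_w = entropy_global_weights(corpus)
doc_w = entropy_weights(corpus[0], global_w)
```

Under this assumed scheme, a term spread evenly over all documents gets a global weight near 0, while a term confined to one document gets 1. Boolean weights are the cheapest to compute, which is relevant to the time versus accuracy trade-off noted above.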
13
Future Work
• SNOMED Concepts participation: add SNOMED Concept IDs as extra features; replace words with Concept IDs; use SNOMED Concepts instead of text
• Negation (together with SNOMED Concept extraction)
• Try N-grams (bigram, trigram)
• Insight into the sections of text, section hiding / focusing: <MICROSCOPIC>, <MACROSCOPIC>, <CLINICAL HISTORY>, <SPECIMEN>…
• Fully structured texts???
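As a small illustration of the N-gram item, bigram and trigram features could be extracted as sketched below. The tokenization (lowercasing and whitespace splitting) is an assumption; the tokens come from the phrase "No evidence of malignancy" in the sample report.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "no evidence of malignancy".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

A bigram such as ("no", "evidence") keeps a negation cue attached to the word it modifies, which a single-word feature set would lose.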