The Automatic SNOMED Coding System: Data Overview, System Design, Experiments, Future Work. By Weihang ZHANG. Supervisor: Prof. Jon PATRICK



The Automatic SNOMED Coding System

Data Overview System Design Experiments Future Work

By Weihang ZHANG

Supervisor: Prof. Jon PATRICK


Data Overview – The Pathology Text

• 400K pathology texts from the SWAPS Anatomical Pathology Database

• Every pathology text is indexed by “RequestID”

• A set of diagnoses for each report (pathology text), presented as SNOMED RT codes


Data Overview – The Pathology Text

Insight into a text (RequestID=“1”)

<Title>CLINICAL HISTORY</Title>

Biopsy of discoid erythematosus like lesion from right cheek ? DLE.

<Title>MACROSCOPIC</Title>

LABELLED `RIGHT CHEEK LESION'. An ellipse 12 x 3mm with subcutis to 3mm. A poorly defined pale nodular lesion 3 x 3mm. It appears to abut the surgical margin. Representative sections embeded, A tips face on, B lesion and surgical margin. (MR 17/4)<DOT>TA</DOT>

<Title>MICROSCOPIC</Title>

Section shows hyperkeratosis with occasional follicular plugging, epidermal atrophy and severe sundamage to dermal collagen. A dense chronic inflammatory cell infiltrate, both superficial and deep is present, mainly in a perivascular and periadnexal distribution. No liquefaction degeneration of the basal layer, no dermal oedema and no interface dermatitis are seen. PAS stain reveals no thickening of the epidermal basement membrane and only an occasional fungal spore on the skin surface.

Immunofluorescence for immunoglobulins and complement fractions are negative.

The differential diagnosis rests between chronic discoid erythematosus, lymphocytic infiltration of skin of Jessner and the plaque type of polymorphous light eruption. The presence of marked solar damage to collagen, the absence of basal liquefaction degeneration and the negative immunofluorescence favours polymorphous light eruption. A reaction to drugs or an insect bite is also a possibility. No evidence of malignancy.

Reported 24/4/98
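The sample report above marks its section headings with `<Title>` tags. A minimal sketch of splitting such a report into named sections follows, assuming every heading appears as `<Title>…</Title>` and that `<DOT>` tags can be replaced by whitespace. The function name `split_sections` and the shortened sample string are illustrative, not part of the original system.

```python
import re

TITLE_RE = re.compile(r"<Title>(.*?)</Title>")
DOT_RE = re.compile(r"</?DOT>")

def split_sections(report):
    """Map each <Title> heading to the text that follows it."""
    parts = TITLE_RE.split(report)
    # parts = [text before first title, title1, body1, title2, body2, ...]
    sections = {}
    for title, body in zip(parts[1::2], parts[2::2]):
        cleaned = DOT_RE.sub(" ", body)          # assumption: DOT tags carry no meaning here
        sections[title.strip()] = " ".join(cleaned.split())
    return sections

# Shortened excerpt of the report with RequestID "1".
sample = (
    "<Title>CLINICAL HISTORY</Title>\n"
    "Biopsy of discoid erythematosus like lesion from right cheek ? DLE.\n"
    "<Title>MACROSCOPIC</Title>\n"
    "LABELLED `RIGHT CHEEK LESION'. An ellipse 12 x 3mm with subcutis to 3mm."
    " (MR 17/4)<DOT>TA</DOT>\n"
)
sections = split_sections(sample)
```

Keeping the sections separate makes it possible to hide or focus on individual parts of a report later.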


Data Overview – The SNOMED Code

The SNOMED Codes Assigned

The SNOMED Codes Explained


Data Overview - Sequence Number and Multi-label

• One pathology text vs. many SNOMED codes

Example: RequestID ‘1’ has 4 sequences and 3 codes
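The one-to-many relation can be sketched as a grouping step: several sequence rows per report collapse into one set of distinct codes. The row layout `(request_id, sequence_number, code)` and the placeholder codes `CODE_A` to `CODE_C` are assumptions for illustration; they are not real SNOMED RT codes or the actual database schema.

```python
from collections import defaultdict

# Hypothetical rows: (request_id, sequence_number, snomed_code).
rows = [
    ("1", 1, "CODE_A"),
    ("1", 2, "CODE_B"),
    ("1", 3, "CODE_A"),   # same code repeated under another sequence number
    ("1", 4, "CODE_C"),
    ("2", 1, "CODE_B"),
]

def labels_by_request(rows):
    """Collapse per-sequence rows into one set of distinct codes per report."""
    labels = defaultdict(set)
    for request_id, _sequence, code in rows:
        labels[request_id].add(code)
    return dict(labels)

labels = labels_by_request(rows)
```

With these rows, report "1" has 4 sequences but only 3 distinct codes, which is what makes the task multi-label.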


Data Overview - Diagnosis Codes Distribution

• The first 9 codes are selected for experiments (excluding ‘None’)

• All remaining codes are treated as “others”

Uniformly random selection: 10K pathology texts from the 400K-text database
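A minimal sketch of this selection step follows. It assumes that "first 9" means the 9 most frequent codes in the distribution and that ‘None’ is left as its own label; neither detail is stated on the slide. The function names and toy codes are hypothetical.

```python
import random
from collections import Counter

def top_codes(code_lists, k=9, skip=("None",)):
    """Return the k most frequent codes, ignoring the labels in `skip`."""
    counts = Counter(c for codes in code_lists for c in codes if c not in skip)
    return [code for code, _ in counts.most_common(k)]

def relabel(codes, keep):
    """Keep selected codes (and 'None'); collapse everything else into 'others'."""
    return sorted({c if c in keep or c == "None" else "others" for c in codes})

def sample_ids(all_ids, n, seed=0):
    """Uniformly random sample of n report IDs without replacement."""
    return random.Random(seed).sample(list(all_ids), n)

# Toy usage with placeholder codes.
code_lists = [["A", "B"], ["A"], ["None"], ["C", "A"], ["B"]]
kept = top_codes(code_lists, k=2)
relabelled = relabel(["A", "C"], set(kept))
chosen = sample_ids(range(100), 10)
```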


Data Overview – Attributes inspection on the texts

Attributes inspection: Self-Separated Distribution (SVM…?)


System Design – Text-to-Vector Extracting

• Configuration (XML)
  • Properly decoupled classes
  • Methods can be flexibly interchanged
  • Output file type can be freely switched
  • Prepared for unsupervised experiments

• Uniformly random selection: 10K texts are selected from the 400K dataset

• Stratified resampling: 10-fold cross-validation

• Output (ARFF, DAT)
  • ARFF: for Weka
  • DAT: for SVM-light and MaxEnt
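The last two steps can be sketched as follows: assign instances to 10 folds so that each fold keeps roughly the same positive/negative ratio, then write each instance in the SVM-light line format (`<target> <index>:<value> ...` with ascending feature indices). This is an illustration under assumptions, not the original implementation; the function names are hypothetical and the fold assignment is a plain round-robin within each class.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Assign each instance index a fold 0..k-1, keeping class ratios similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    fold_of = {}
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            fold_of[idx] = pos % k      # round-robin inside each class
    return fold_of

def svmlight_line(is_positive, features):
    """Format one instance as '<+1|-1> index:value ...' with ascending indices."""
    target = "+1" if is_positive else "-1"
    pairs = " ".join(f"{i}:{v:g}" for i, v in sorted(features.items()) if v != 0)
    return f"{target} {pairs}".rstrip()

# Toy usage: 20 positive and 80 negative instances for one code.
labels = [True] * 20 + [False] * 80
folds = stratified_folds(labels)
line = svmlight_line(True, {3: 1.0, 1: 0.5})
```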


Text to Vector (class diagram)

Classes: Console, XmlConfigurator, DOM, Preprocessor, DimReducer, Indexer, MysqlAgent, FileGenerator, InstanceFactory, Histograms (operation shown: +config())

Read: MysqlDatabase, config.xml

Generate ({FeatureMatrixFile}): CODE_1 vector data (10-fold), CODE_2 vector data (10-fold), CODE_3 vector data (10-fold), …, CODE_K vector data (10-fold)
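The idea of a configuration file that selects interchangeable, decoupled components can be sketched as below. The XML layout, the component names, and `build_pipeline` are all hypothetical; the slides do not show the real `config.xml` schema or the class interfaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical config layout; the real config.xml schema is not shown in the slides.
CONFIG_XML = """
<config>
  <preprocessor name="lowercase"/>
  <indexer name="boolean"/>
</config>
"""

# Interchangeable preprocessing methods: text -> list of tokens.
PREPROCESSORS = {
    "none": lambda text: text.split(),
    "lowercase": lambda text: text.lower().split(),
}

# Interchangeable indexing methods: tokens -> sparse vector (dict).
INDEXERS = {
    "boolean": lambda tokens: {t: 1 for t in tokens},
    "frequency": lambda tokens: {t: tokens.count(t) for t in set(tokens)},
}

def build_pipeline(xml_text):
    """Pick one preprocessor and one indexer by name and chain them."""
    root = ET.fromstring(xml_text.strip())
    pre = PREPROCESSORS[root.find("preprocessor").get("name")]
    idx = INDEXERS[root.find("indexer").get("name")]
    return lambda text: idx(pre(text))

to_vector = build_pipeline(CONFIG_XML)
vector = to_vector("No evidence of malignancy")
```

Swapping a method then only requires changing a name in the configuration, not the code that runs the pipeline.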


System Design – Classifiers Generation

• Machine learner selection: SVM-Light, MaxEnt, J48 (Weka), …

• Classifier generation: depends on the selected features and machine learners

Diagram (ML Selection, Classifiers Generation): CODE_1 … CODE_K vector data (10-fold) are passed to the selected machine learner, producing Classifier_1, Classifier_2, Classifier_3, …, Classifier_K
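Generating one classifier per code can be sketched as a one-vs-rest setup: each code gets its own binary view of the data and its own trained model. The tiny perceptron below is only a stand-in for SVM-Light, MaxEnt, or J48, which are external tools; the feature names, codes, and function names are hypothetical.

```python
def binary_targets(label_sets, code):
    """+1 if the report carries `code`, else -1 (one-vs-rest view)."""
    return [1 if code in labels else -1 for labels in label_sets]

def score(vec, model):
    """Linear score of a sparse dict vector under (weights, bias)."""
    weights, bias = model
    return bias + sum(weights.get(f, 0.0) * v for f, v in vec.items())

def train_perceptron(vectors, targets, epochs=20):
    """Tiny perceptron over sparse dict vectors; a stand-in learner only."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, targets):
            if y * score(vec, (weights, bias)) <= 0:   # misclassified: update
                for f, v in vec.items():
                    weights[f] = weights.get(f, 0.0) + y * v
                bias += y
    return weights, bias

def train_all(vectors, label_sets, codes):
    """One binary classifier per code, all trained on the same vectors."""
    return {c: train_perceptron(vectors, binary_targets(label_sets, c)) for c in codes}

# Toy data with placeholder features and codes.
vectors = [{"m": 1}, {"n": 1}, {"m": 1, "n": 1}]
label_sets = [{"CODE_1"}, {"CODE_2"}, {"CODE_1", "CODE_2"}]
classifiers = train_all(vectors, label_sets, ["CODE_1", "CODE_2"])
```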


System Design – Code Assembling

• Assembly line: Text-to-Vector conversion, then vector delivery

• Classifiers – the workers: all classifiers work together, each assigning its classification result to the vector

Diagram (Code Assembling): a new text is classified by Classifier_1 … Classifier_K; example results: CODE_1 Positive, CODE_2 Negative, CODE_3 Negative, CODE_K Positive
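The assembling step can be sketched as running every per-code classifier on the same vector and collecting the codes whose classifier answers positive. The classifiers below are toy callables with made-up rules, used only to show the control flow; `assemble_codes` is a hypothetical name.

```python
def assemble_codes(vector, classifiers):
    """Run every per-code classifier and keep the codes that answer positive."""
    return sorted(code for code, clf in classifiers.items() if clf(vector))

# Hypothetical stand-in classifiers: each is a callable vector -> bool.
classifiers = {
    "CODE_1": lambda v: v.get("hyperkeratosis", 0) > 0,
    "CODE_2": lambda v: v.get("malignancy", 0) > 0,
    "CODE_3": lambda v: v.get("melanoma", 0) > 0,
}

new_text_vector = {"hyperkeratosis": 1, "atrophy": 1}
assigned = assemble_codes(new_text_vector, classifiers)
```

Because each classifier decides independently, a text can receive zero, one, or several codes.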


Experiments

3 sets of comparisons (winners)

• Preprocessing: Stem-All-Words vs. Stem-None

• Indexing: Weight-Entropy vs. Word-Frequency vs. Boolean-Weight

• Learning (machine learner performance): SVM vs. Maximum-Entropy (MaxEnt) vs. Weka J48

Baseline 1: Stem-None + Weight-Entropy + SVM? The trade-off between TIME and ACCURACY

Baseline 1~: Stem-None + Boolean-Weight + MaxEnt
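The three indexing schemes can be sketched side by side. Boolean weight and word frequency are straightforward; for Weight-Entropy the slides give no formula, so the sketch assumes the common log-entropy weighting, where the local weight log(1 + tf) is scaled by a global weight 1 + Σ_j p_ij log p_ij / log N with p_ij = tf_ij / gf_i. The system may have used a different variant.

```python
import math
from collections import Counter

def boolean_weight(docs):
    """1.0 for every term that occurs in the document."""
    return [{t: 1.0 for t in set(doc)} for doc in docs]

def word_frequency(docs):
    """Raw count of each term in the document."""
    return [{t: float(c) for t, c in Counter(doc).items()} for doc in docs]

def log_entropy(docs):
    """Assumed log-entropy weighting: log(1 + tf) times a global entropy weight."""
    n_docs = len(docs)
    tfs = [Counter(doc) for doc in docs]
    global_freq = Counter()
    for tf in tfs:
        global_freq.update(tf)
    glob = {}
    for term, gf in global_freq.items():
        ent = 0.0
        for tf in tfs:
            if term in tf:
                p = tf[term] / gf
                ent += p * math.log(p)
        # Terms spread evenly over all documents get a global weight near 0.
        glob[term] = 1.0 + ent / math.log(n_docs) if n_docs > 1 else 1.0
    return [{t: math.log(1 + c) * glob[t] for t, c in tf.items()} for tf in tfs]

docs = [["no", "malignancy"], ["no", "atrophy"]]
weights = log_entropy(docs)
```

In this toy corpus "no" occurs equally in both documents and is weighted down to zero, while "malignancy" keeps its full local weight.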


Future Work

• SNOMED Concepts participation
  • Add SNOMED Concept IDs as extra features
  • Replace words with Concept IDs
  • Use SNOMED Concepts instead of text

• Negation (in company with SNOMED-Concept extraction)

• Try N-grams (bigram, trigram)

• Insight into the sections of text: section hiding / focusing on <MICROSCOPIC>, <MACROSCOPIC>, <CLINICAL HISTORY>, <SPECIMEN>…

• Fully structured texts???
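The N-gram item can be sketched in a few lines. This is an illustration of bigram and trigram feature extraction only; joining tokens with `_` is an arbitrary choice made here so that each N-gram can be used as a single feature name.

```python
def ngrams(tokens, n):
    """Return contiguous n-grams joined with '_' so they can serve as features."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "no evidence of malignancy".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

N-grams such as `no_evidence` keep a negation word attached to its neighbour, which single-word features lose.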