USING A HIERARCHY IN PLACE OF FLAT DATA FOR SEQUENTIAL
PATTERN MINING
Matt Ramsey
100063967
Supervisor: Jan Stanek
Minor Thesis Presentation
PRESENTATION STRUCTURE
Introduction
Concepts
Research Question
Methodology
Results
Conclusion
Discussion
INTRODUCTION
What the project is: performing sequential pattern mining on patient prescription data using the ATC drug hierarchy
Some important concepts:
  Sequential Pattern Mining
  ATC drug hierarchy
CONCEPT 1 - SEQUENTIAL PATTERN MINING
Sequential pattern mining is trying to find the relationships between occurrences of sequential events, to find if there exists any specific order of the occurrences (Zhao, Q & Bhowmick, S 2003).
CONCEPT 3 - ATC CODES
Anatomical Therapeutic Chemical (ATC): a drug classification that creates distinct groups of drugs
E.g. Propicillin (J01CE03)
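As an illustration of how an ATC code encodes the hierarchy, the sketch below splits a full code into its five levels. The prefix lengths (1, 3, 4, 5 and 7 characters) follow the standard ATC code layout; the function names are hypothetical and are not taken from the project.

```python
# Number of leading characters that identify each ATC level.
# Level 5 is the full 7-character code (the individual substance).
ATC_PREFIX_LENGTH = {1: 1, 2: 3, 3: 4, 4: 5, 5: 7}


def atc_at_level(code, level):
    """Return the group that a full ATC code belongs to at the given level."""
    if len(code) != 7:
        raise ValueError(f"expected a 7-character ATC code, got {code!r}")
    return code[:ATC_PREFIX_LENGTH[level]]


def atc_ancestors(code):
    """Map every level (1 to 5) to the corresponding prefix of the code."""
    return {level: atc_at_level(code, level) for level in range(1, 6)}


print(atc_ancestors("J01CE03"))
# {1: 'J', 2: 'J01', 3: 'J01C', 4: 'J01CE', 5: 'J01CE03'}
```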
RESEARCH QUESTION
How will the use of a hierarchy affect the success of the sequential pattern mining?
THEORY
Theory: using the hierarchy will yield more patterns, but those patterns may suffer from “contamination” and “dilution”.
Contamination
Occurs when a level of the hierarchy that is too high is used: each group contains too many unrelated subgroups.
E.g. if we have a pattern at level 2 and move up to pattern mine at level 1, we still have that pattern, but its meaning might be lost.
The pattern is contaminated.
Dilution
Occurs when a level of the hierarchy that is too low is used, so the information becomes too specific.
E.g. at level 5 of the ATC hierarchy there may be too many possible states (prescriptions) for strong patterns to emerge.
The pattern is diluted by the level of detail.
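One rough way to see both effects is to count how many distinct states remain at each level. The sketch below is only an illustration: the code list is a hypothetical sample built from ATC codes that appear elsewhere in this presentation, and the helper name is an assumption.

```python
# Prefix length of an ATC code at each level (level 5 = full code).
PREFIX_LENGTH = {5: 7, 4: 5, 3: 4, 2: 3, 1: 1}


def distinct_states_per_level(codes):
    """Count the distinct ATC groups present at each level, from level 5 up to level 1."""
    return {level: len({code[:n] for code in codes})
            for level, n in PREFIX_LENGTH.items()}


# Hypothetical sample of prescribed drugs (full level 5 codes).
sample = ["A10BA02", "A10BB09", "N05CD07", "N02BE01", "J01CE03", "A01AA01"]

print(distinct_states_per_level(sample))
# {5: 6, 4: 6, 3: 5, 2: 5, 1: 3}
```

At level 5 every drug is its own state, so repeated patterns are harder to find (dilution). At level 1 only three states remain, and unrelated drugs such as A01AA01 and A10BA02 fall into the same group "A" (contamination).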
METHODOLOGY – PRE-PROCESSING
Remove prescriptions with no drug recorded
Calculate Duration = Packsize * (Repeats + 1) / Dose
Convert drug names to ATC codes
Raw prescription record:
Patient Key | Provider Key | Date | Script number | Drug Name | Dose | Packsize | Repeats
A2rJFF8mDe | Amrr6MjVUS | 22/04/2008 | A006507 | Propicillin | 1 daily | 20 | 2

Pre-processed record:
Patient Key | Date | Drug Code | Duration (days)
A2rJFF8mDe | 22/04/2008 | J01CE03 | 60
^Prescription(1)="A2rJFF8mDe,Amrr6MjVUS,22/07/2004,A006507, Propicillin,1 daily,20,2,Tablets,....."
^Prescription(1,"ATC Code")="J01CE03"
^Prescription(1,"Patient ID")="A2rJFF8mDe"
^Prescription(1,"Prescription Duration")=60
^Prescription(1,"Script Date")="22/04/2008"
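The three pre-processing steps can be sketched as follows. This is a minimal illustration, not the project's code: the record type, the lookup table and the decision to skip drugs that have no ATC mapping are all assumptions, and the dose is assumed to be already parsed into units per day (e.g. "1 daily" becomes 1).

```python
from dataclasses import dataclass

# Hypothetical name-to-code lookup; a real table would cover every drug in the data.
DRUG_NAME_TO_ATC = {"Propicillin": "J01CE03"}


@dataclass
class Prescription:
    patient_key: str
    date: str          # dd/mm/yyyy, as in the source data
    drug_name: str
    daily_dose: float  # assumed: units taken per day
    pack_size: int
    repeats: int


def duration_days(pack_size, repeats, daily_dose):
    """Duration = Packsize * (Repeats + 1) / Dose."""
    return pack_size * (repeats + 1) / daily_dose


def preprocess(prescriptions):
    """Return (patient key, date, ATC code, duration in days) for each usable record."""
    records = []
    for p in prescriptions:
        if not p.drug_name:  # step 1: drop prescriptions with no drug recorded
            continue
        code = DRUG_NAME_TO_ATC.get(p.drug_name)  # step 3: drug name -> ATC code
        if code is None:  # assumption: unmapped drugs are skipped
            continue
        days = duration_days(p.pack_size, p.repeats, p.daily_dose)  # step 2
        records.append((p.patient_key, p.date, code, days))
    return records


raw = [
    Prescription("A2rJFF8mDe", "22/04/2008", "Propicillin", 1, 20, 2),
    Prescription("A2rJFF8mDe", "23/04/2008", "", 1, 10, 0),  # no drug recorded
]
print(preprocess(raw))
# [('A2rJFF8mDe', '22/04/2008', 'J01CE03', 60.0)]
```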
METHODOLOGY – SEQUENTIAL PATTERN MINING
There are many sequential pattern mining algorithms, but the majority are overly complex or specific to multi-dimensional data, time-sensitive data, etc.
We want a simple method for a proof of concept.
There are some problems with existing “simple” algorithms such as AprioriAll (Agrawal & Srikant 1995) and PrefixSpan (Pei et al. 2004).
METHODOLOGY – PROBLEM 1: PATTERNS WITHIN SEQUENCES
Patterns: A -> B
SPM with Min Support: 3
METHODOLOGY – PROBLEM 1: PATTERNS WITHIN SEQUENCES
Patterns: ---
SPM with Min Support: 3
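The diagrams for these two slides are not part of the transcript, so the following is only a plausible reconstruction of the problem on hypothetical toy data. It assumes the classical definition of support used by algorithms such as AprioriAll, where each sequence counts at most once however often the pattern repeats inside it.

```python
def contains(sequence, pattern):
    """True if the items of pattern occur in sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator up to the first match,
    # so each pattern item must be found after the previous one.
    return all(item in remaining for item in pattern)


def classical_support(sequences, pattern):
    """Number of sequences that contain the pattern at least once."""
    return sum(contains(seq, pattern) for seq in sequences)


# A -> B occurs once in each of three sequences.
across = [["A", "B"], ["A", "B"], ["A", "B"]]
# A -> B occurs three times, but all inside one sequence.
within = [["A", "B", "A", "B", "A", "B"]]

print(classical_support(across, ("A", "B")))  # 3, meets a minimum support of 3
print(classical_support(within, ("A", "B")))  # 1, missed at a minimum support of 3
```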
METHODOLOGY – PROBLEM 2: TRANSITIVE PATTERNS
SPM with Min Support: 3
Patterns: A -> C
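The transitive effect can be illustrated in the same spirit. The toy sequences below are hypothetical, and the sketch assumes the usual subsequence matching, in which gaps between pattern items are allowed: A -> C is then reported although C never directly follows A.

```python
def contains_with_gaps(sequence, pattern):
    """Classical subsequence test: pattern items in order, gaps allowed."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def contains_adjacent(sequence, first, second):
    """True only if `second` immediately follows `first` somewhere in the sequence."""
    return any(a == first and b == second
               for a, b in zip(sequence, sequence[1:]))


sequences = [["A", "B", "C"], ["A", "B", "C"], ["A", "B", "C"]]

gapped = sum(contains_with_gaps(s, ("A", "C")) for s in sequences)
adjacent = sum(contains_adjacent(s, "A", "C") for s in sequences)

print(gapped)    # 3: A -> C reaches a minimum support of 3 through B
print(adjacent)  # 0: no direct A -> C transition exists
```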
METHODOLOGY – ADDRESSING PROBLEMS
1. Link data with the same Patient Key.
2. Break each patient’s sequence up into individual 2-drug sequences.
No transitive effect, because there are no possible gaps.
No missing patterns within patients, as all prescriptions are treated as their own sequence.
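The two steps above can be sketched as follows. The sketch assumes that each 2-drug sequence is a pair of consecutive prescriptions of one patient, ordered by script date; the function name and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def two_drug_sequences(prescriptions):
    """Turn (patient key, 'dd/mm/yyyy' date, drug code) rows into 2-drug sequences.

    Returns a list of (patient key, first code, second code) tuples.
    """
    # Step 1: link data with the same patient key.
    by_patient = defaultdict(list)
    for patient, date, code in prescriptions:
        by_patient[patient].append((datetime.strptime(date, "%d/%m/%Y"), code))

    # Step 2: break each patient's ordered sequence into consecutive pairs.
    pairs = []
    for patient, items in by_patient.items():
        items.sort(key=lambda item: item[0])  # order by script date
        codes = [code for _, code in items]
        pairs.extend((patient, a, b) for a, b in zip(codes, codes[1:]))
    return pairs


rows = [
    ("P1", "01/02/2008", "A10BB09"),
    ("P1", "01/01/2008", "A10BA02"),
    ("P1", "01/03/2008", "N02BE01"),
    ("P2", "05/01/2008", "N05CD07"),
    ("P2", "10/01/2008", "N02BE01"),
]
for pair in two_drug_sequences(rows):
    print(pair)
# ('P1', 'A10BA02', 'A10BB09')
# ('P1', 'A10BB09', 'N02BE01')
# ('P2', 'N05CD07', 'N02BE01')
```

Because every pair is adjacent, A10BA02 -> N02BE01 is never produced for P1, which is the "no possible gaps" property.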
METHODOLOGY – PATTERN MINING
Use 2 support thresholds:
  Minimum Support
  Minimum Patients
Pattern mine iteratively:
  Mine for patterns in level 5 (flat data)
  Modify the ATC codes (e.g. A01AA01 -> A01AA)
  Mine for patterns in level 4
  Modify the ATC codes (e.g. A01AA -> A01A)
  Etc.
Reflect on the strength and relevance of the patterns gained:
  How strongly supported the patterns are
  How meaningful they are
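A minimal sketch of the iterative mining for 2-item patterns is given below. It is not the project's implementation. In particular, it assumes that a pattern is kept when it meets either threshold, which is what the example patterns for MINSUP = 6, MINPAT = 3 suggest (one occurs 8 times in a single patient, another 5 times across 3 patients). The function names and sample data are hypothetical.

```python
from collections import defaultdict

# Prefix length of an ATC code at each level (level 5 = flat data).
PREFIX_LENGTH = {5: 7, 4: 5, 3: 4, 2: 3, 1: 1}


def mine_level(pairs, level, min_support, min_patients):
    """Mine 2-item patterns from (patient, first code, second code) tuples at one level.

    Returns {(first, second): (occurrences, distinct patients)}.
    """
    n = PREFIX_LENGTH[level]
    occurrences = defaultdict(int)
    patients = defaultdict(set)
    for patient, first, second in pairs:
        pattern = (first[:n], second[:n])  # move both codes up to this level
        occurrences[pattern] += 1
        patients[pattern].add(patient)
    # Assumption: meeting EITHER threshold is enough to keep a pattern.
    return {p: (occurrences[p], len(patients[p]))
            for p in occurrences
            if occurrences[p] >= min_support or len(patients[p]) >= min_patients}


def mine_all_levels(pairs, min_support, min_patients):
    """Mine level 5 first, then each higher level in turn."""
    return {level: mine_level(pairs, level, min_support, min_patients)
            for level in (5, 4, 3, 2, 1)}


pairs = (
    [("P3", "N05CD07", "N02BE01")] * 8
    + [("P1", "A10BA02", "A10BB09")]
    + [("P4", "A10BA02", "A10BB09")] * 2
    + [("P13", "A10BA02", "A10BB09")] * 2
    + [("P2", "J01CE03", "N02BE01")]
)
results = mine_all_levels(pairs, min_support=6, min_patients=3)
print(results[5])
# {('N05CD07', 'N02BE01'): (8, 1), ('A10BA02', 'A10BB09'): (5, 3)}
print(results[4])
# {('N05CD', 'N02BE'): (8, 1), ('A10BA', 'A10BB'): (5, 3)}
```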
EXAMPLE PATTERNS WITH MINSUP = 6, MINPAT = 3
LEVEL 5 : 2 ITEM PATTERNS
PATTERN: Temazepam (N05CD07) -> Paracetamol (N02BE01)
PATTERN OCCURS 8 TIMES IN TOTAL OF 1 PATIENTS
OCCURENCES: (3, 49) (3, 52) (3, 56) (3, 58) (3, 65) (3, 67) (3, 80) (3, 100)
PATTERN: Metformin (A10BA02) -> Gliclazide (A10BB09)
PATTERN OCCURS 5 TIMES IN TOTAL OF 3 PATIENTS
OCCURENCES: (1, 64) (4, 1) (4, 3) (13, 45) (13, 49)
....
LEVEL 4 : 2 ITEM PATTERNS
PATTERN: Biguanides (A10BA) -> Sulfonamides, urea derivatives (A10BB)
PATTERN OCCURS 5 TIMES IN TOTAL OF 3 PATIENTS
OCCURENCES: (1, 64) (4, 1) (4, 3) (13, 45) (13, 49)
....
RESULTS FOR MINSUP = 6, MINPAT = 3
Level | Pattern Length | Unique Patterns | Total Patterns
5 | 2 | 8 | 55
4 | 2 | 11 | 73
3 | 2 | 15 | 109
3 | 3 | 1 | 6
2 | 2 | 20 | 152
2 | 3 | 1 | 6
1 | 2 | 31 | 430
1 | 3 | 25 | 211
1 | 4 | 11 | 73
1 | 5 | 2 | 17
1 | 6 | 1 | 7
ATC code at each level of the hierarchy:
Level 5: A01AA01
Level 4: A01AA
Level 3: A01A
Level 2: A01
Level 1: A
Example code
INTERESTING PATTERNS DISCOVERED
Only emerges at level 4, and is one of the strongest patterns
Gets diluted at level 5
PATTERN: ACE inhibitors, plain (C09AA) -> HMG CoA reductase inhibitors (C10AA)
PATTERN OCCURS 9 TIMES IN TOTAL OF 2 PATIENTS
PATTERN: OTHER ANALGESICS AND ANTIPYRETICS (N02B) -> HYPNOTICS AND SEDATIVES (N05C)
PATTERN OCCURS 6 TIMES IN TOTAL OF 1 PATIENTS
Identified as unusual by supervisor Jan Stanek
PATTERN: Metformin (A10BA02) -> Gliclazide (A10BB09)
PATTERN OCCURS 5 TIMES IN TOTAL OF 3 PATIENTS
PATTERN: Biguanides (A10BA) -> Sulfonamides, urea derivatives (A10BB)
PATTERN OCCURS 5 TIMES IN TOTAL OF 3 PATIENTS
PATTERN: ORAL BLOOD GLUCOSE LOWERING DRUGS (A10B) -> ORAL BLOOD GLUCOSE LOWERING DRUGS (A10B)
PATTERN OCCURS 19 TIMES IN TOTAL OF 3 PATIENTS
Very vague pattern; it is present at lower levels
By the higher level it loses its meaning and becomes contaminated
ACHIEVEMENTS
Created a modified algorithm
  Finds patterns within sequences as well as across sequences
  Uses 2 support thresholds
Discovered more rules than when performing pattern mining on flat data
Assessed the impact of using a hierarchy for sequential pattern mining
  YES, using a hierarchy DOES further pattern mining
  BUT it is up to an expert in the field to assess whether the extra patterns are useful or not.
CONCLUSION
Unique research
  Using a hierarchy to enhance pattern mining
  Finding patterns within sequences as well as across all sequences
  Small data set required
This research shows the potential importance of using a hierarchy to enhance data mining
Forms the basis for further research
REFERENCES
Agrawal, R & Srikant, R 1995, 'Mining Sequential Patterns', paper presented at the Eleventh International Conference on Data Engineering.
Pei, J, Han, J, Dayal, U, Mortazavi-Asl, B, Wang, J, Pinto, H, Chen, Q & Hsu, M 2004, 'Mining sequential patterns by pattern-growth: The prefixspan approach', IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11.
Zhao, Q & Bhowmick, S 2003, 'Sequential pattern mining: A survey', Technical Report, CAIS, Nanyang Technological University, Singapore, pp. 1–26.
DISCUSSION