Post on 19-Dec-2015
A Tree-Based Scan Statistic for Database Disease Surveillance
Martin Kulldorff
University of Connecticut
Joint work with: Zixing Fang, Stephen Walsh
Database Disease Surveillance
• In what occupations is there an excess risk of dying from a particular disease?
• Are there pharmaceutical drugs that cause certain adverse effects?
Nested Variables
inhalation therapists ⊂ therapists ⊂ health occupations ⊂ professional occupations
Ecotrin ⊂ aspirin ⊂ nonsteroidal anti-inflammatory drugs ⊂ analgesic drugs
Occupational Multiple Cause of Death Database
• National Center for Health Statistics
• Based on Death Certificates
• Occupational Classification System
• Selected States
Occupational Multiple Cause of Death Database
• Time period: 1985-1992
• Age groups: 25 years
• Total deaths: 2,114,832
• Silicosis deaths: 405
Occupational Classification System
A hierarchical structure of occupations created by the United States Bureau of the Census.
Number of occupational groups at each level:
| Level | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Groups | 6 | 13 | 86 | 345 | 476 | 502 | 503 |
[Figure: A Small Three-Level Tree Variable. Leaves: Farmers, Cowboys, Hunters, Teachers, Clerks. Labeled parts: Root, Node, Branches, Leaf.]
Occupational Classification System
• Managerial and Professional Specialty Occupations
  • Professional Specialty Occupations
    • Mathematical and Computer Scientists
      • Computer Systems Analysts and Scientists (064)
      • Operations and Systems Researchers and Analysts (065)
      • Actuaries (066)
      • Statisticians (067)
      • Mathematical Scientists, n.e.c. (068)
    • Natural Scientists
      • Medical Scientists (083), etc.
    • Health Diagnosing Occupations
      • Physicians (084), etc.
    • Health Assessment and Treatment Occupations
      • Therapists (098-105), etc.
Silicosis
• A rare disease of the lung
• Chronic shortness of breath
• Caused by dust containing crystalline silica (quartz) particles
• No known cure
Silicosis
Described by Agricola in 1556:
‘In the Carpathian mines, women are found who have married seven husbands, all of whom this terrible consumption has carried away’
Agricola G. (1556). De Re Metallica. Basel: Froben and Episcopius.
Proportional Mortality (PM)
N = Total number of deaths (2,114,832)
C = Total number of silicosis deaths (405)
n = Number of farmers (266,715)
c = Farmers dying from silicosis (12)
All: C/N = 405/2,114,832 = 0.000192
Farmers: c/n = 12/266,715 = 0.000045
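The two proportions above can be checked directly from the counts on the slide. A minimal Python sketch; the variable names `pm_all` and `pm_farmers` are illustrative and not part of the original talk:

```python
# Counts as given on the slide.
N = 2_114_832   # total number of deaths
C = 405         # total number of silicosis deaths
n = 266_715     # number of farmers in the database
c = 12          # farmers dying from silicosis

pm_all = C / N       # proportional mortality, all occupations
pm_farmers = c / n   # proportional mortality, farmers

print(f"All:     {pm_all:.6f}")
print(f"Farmers: {pm_farmers:.6f}")
```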
Proportional Mortality Ratio (PMR)
N = Total number of deaths (2,114,832)
C = Total number of silicosis deaths (405)
n = Number of farmers (266,715)
c = Farmers dying from silicosis (12)
Farmers: PMR = [c/n] / [(C-c)/(N-n)] = 0.23
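As a rough check of the ratio, the sketch below evaluates the formula as written on the slide and, for comparison, the farmers' proportion divided by the overall proportion C/N. With the counts given, the written formula comes out near 0.21, while the comparison against C/N comes out near 0.23, the value shown. The slide does not say which denominator was actually used, so both function names here are hypothetical labels:

```python
N, C = 2_114_832, 405   # all deaths, all silicosis deaths
n, c = 266_715, 12      # farmers, farmers dying from silicosis

def pmr_vs_rest(c, n, C, N):
    """Formula as written on the slide: [c/n] / [(C-c)/(N-n)]."""
    return (c / n) / ((C - c) / (N - n))

def pmr_vs_all(c, n, C, N):
    """Assumed alternative: group proportion divided by overall proportion C/N."""
    return (c / n) / (C / N)

print(round(pmr_vs_rest(c, n, C, N), 2))
print(round(pmr_vs_all(c, n, C, N), 2))
```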
Standardized Proportional Mortality Ratio (SPMR)
The same as the proportional mortality ratio, but adjusted for covariates. Adjusted for age and gender, for silicosis among farmers we have:
SPMR = 0.29
Analysis Options
• Evaluate each of the 503 occupational groups, using a Bonferroni-type adjustment for multiple testing.
• Use a higher group level, such as level 3 with 86 occupational groups.
Substantive Problem: We do not know whether the disease relationships affect a smaller or larger group.
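To give a sense of scale for the Bonferroni-type adjustment, the sketch below computes the per-test significance threshold for the two group levels mentioned. The overall level of 0.05 is an assumption for illustration; the slides do not state one.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level under a plain Bonferroni correction."""
    return alpha / num_tests

ALPHA = 0.05  # assumed overall significance level, not given in the slides

# (level, number of occupational groups) as listed for the classification system
for level, groups in [(7, 503), (3, 86)]:
    print(f"level {level}: {groups} tests, "
          f"per-test threshold {bonferroni_threshold(ALPHA, groups):.6f}")
```

Either choice fixes the group size in advance, which is the substantive problem noted above.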
Analysis Options
• Take the 503 occupations as a base, and evaluate all 2^503 - 2 ≈ 2.6 × 10^151 combinations.
Problems: Computational, Statistical, Substantive
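The size of this search space can be confirmed with exact integer arithmetic. A short sketch, where subtracting 2 excludes the empty set and the set of all occupations:

```python
leaves = 503

# Every non-empty, proper subset of the 503 occupations.
subsets = 2**leaves - 2

print(f"{subsets:.1e}")              # order of magnitude
print(len(str(subsets)), "digits")   # number of decimal digits
```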
Ideal Analytical Solution
• Use the Hierarchical Tree
• Evaluate Cuts on that Tree
[Figure: A Small Three-Level Tree Variable, with leaves Farmers, Cowboys, Hunters, Teachers, Clerks and one cut marked.]
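One way to picture a cut is as the set of leaves below a single branch. The sketch below enumerates every such cut for a small three-level tree. The grouping of the five occupations under `Node1` and `Node2` is assumed for illustration, since the transcript does not preserve the figure's layout, and the function names are hypothetical.

```python
# Hypothetical three-level tree: which leaves hang under which internal
# node is an assumption, not taken from the slide.
tree = {
    "Root": ["Node1", "Node2"],
    "Node1": ["Farmers", "Cowboys", "Hunters"],
    "Node2": ["Teachers", "Clerks"],
}

def leaves_below(node, tree):
    """Return all leaves in the subtree rooted at node."""
    children = tree.get(node, [])
    if not children:
        return [node]          # a node without children is a leaf
    found = []
    for child in children:
        found.extend(leaves_below(child, tree))
    return found

def simple_cuts(tree, root="Root"):
    """Map each branch (named by the node below it) to the leaves it cuts off."""
    cuts = {}
    stack = list(tree[root])
    while stack:
        node = stack.pop()
        cuts[node] = leaves_below(node, tree)
        stack.extend(tree.get(node, []))
    return cuts

for name, leaves in sorted(simple_cuts(tree).items()):
    print(name, "->", leaves)
```

For this tree there are seven cuts: one per leaf and one per internal node, so the number of candidates grows with the number of branches rather than with the number of subsets of leaves.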
Problem
How do we deal with the multiple testing?
Proposed Solution
Tree-Based Scan Statistic
One-Dimensional Scan Statistic
Studied by Naus (JASA, 1965)
Other Scan Statistics
• Spatial scan statistics using circles or squares.
• Space-time scan statistics using cylinders.
• Variable size window, using maximum likelihood rather than counts.
• Applied for geographical and temporal disease surveillance, and in many other fields.
Tree-Based Scan Statistic
H0: The probability of dying from silicosis is the same for all occupations.
HA: There is at least one group of occupations (cut) for which the probability is higher.
Tree-Based Scan Statistic
1. Scan the tree by considering all possible cuts on any branch.
2. For each cut, calculate the likelihood.
3. Denote the cut with the maximum likelihood as the most likely cut (cluster).
4. Generate 9999 Monte Carlo replications under H0.
5. Compare the most likely cut from the real data set with the most likely cuts from the random data sets.
6. If the rank of the most likely cut from the real data set is R, then the p-value for that cut is R/(9999+1).
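The six steps can be sketched in Python as follows. Several choices here are assumptions, not taken from the slides: the Poisson log-likelihood ratio in `llr` stands in for "the likelihood", replications under H0 are drawn by assigning the observed total number of cases to leaves in proportion to their expected counts, and the toy counts and cut names are invented for illustration.

```python
import math
import random

def llr(c, e, C):
    """Assumed Poisson log-likelihood ratio for a cut with c observed and
    e expected cases out of C in total; zero unless there is an excess."""
    if c <= e:
        return 0.0
    inside = c * math.log(c / e)
    outside = (C - c) * math.log((C - c) / (C - e)) if c < C else 0.0
    return inside + outside

def most_likely_cut(counts, expected, cuts):
    """Steps 1-3: scan all cuts and return the one with the largest LLR."""
    C = sum(counts.values())
    best_name, best_value = None, 0.0
    for name, leaves in cuts.items():
        c = sum(counts[leaf] for leaf in leaves)
        e = sum(expected[leaf] for leaf in leaves)
        value = llr(c, e, C)
        if value > best_value:
            best_name, best_value = name, value
    return best_name, best_value

def tree_scan(counts, expected, cuts, replications=9999, seed=0):
    """Steps 4-6: Monte Carlo p-value for the most likely cut."""
    rng = random.Random(seed)
    leaves = list(counts)
    weights = [expected[leaf] for leaf in leaves]
    C = sum(counts.values())
    best_name, observed = most_likely_cut(counts, expected, cuts)
    rank = 1  # the real data set counts as one of the ranked data sets
    for _ in range(replications):
        sim = dict.fromkeys(leaves, 0)
        for leaf in rng.choices(leaves, weights=weights, k=C):
            sim[leaf] += 1
        if most_likely_cut(sim, expected, cuts)[1] >= observed:
            rank += 1
    return best_name, observed, rank / (replications + 1)

# Toy data (invented): 30 cases, equal expected counts in five leaves.
counts = {"Farmers": 2, "Cowboys": 3, "Hunters": 15, "Teachers": 5, "Clerks": 5}
expected = {leaf: 6.0 for leaf in counts}
cuts = {leaf: [leaf] for leaf in counts}
cuts["Node1"] = ["Farmers", "Cowboys", "Hunters"]   # hypothetical grouping
cuts["Node2"] = ["Teachers", "Clerks"]              # hypothetical grouping

name, value, p = tree_scan(counts, expected, cuts)
print(name, round(value, 2), p)
```

Because each replication takes the maximum over the same set of cuts as the real data, the resulting p-value is already adjusted for the multiple testing across cuts.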
Result: Most Likely Cut
Occupations: Mining machine operators
Observed: 56, Expected: 5.5
SPMR = 11.8, p=0.0001
Result: Second Most Likely Cut
Occupations: Molding and casting machine operators, Metal plating machine operators, Heat treating equipment operators, Misc. metal and plastic machine operators
Observed: 22, Expected: 1.2
SPMR = 20.5, p=0.0001
Result: Ninth Most Likely Cut
Occupation: Heavy equipment mechanics
Observed: 5, Expected: 1.0
SPMR = 4.8, p=0.72
Extension to Complex Cuts
Consider a node with 4 branches: A, B, C, D.
Simple cuts: [A], [B], [C], [D]
Combinatorial cuts: [A], [B], [C], [D], [AB], [AC], [AD], [BC], [BD], [CD], [ABC], [ABD], [ACD], [BCD]
Ordinal cuts: [A], [B], [C], [D], [AB], [BC], [CD], [ABC], [BCD]
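The three cut types for a node with branches A, B, C, D can be enumerated mechanically. A minimal sketch, with hypothetical function names: combinatorial cuts are taken to be all non-empty proper subsets of the branches, and ordinal cuts all contiguous runs shorter than the full list.

```python
from itertools import combinations

def simple_cuts(branches):
    """One cut per branch."""
    return [(b,) for b in branches]

def combinatorial_cuts(branches):
    """Every non-empty proper subset of the branches."""
    return [combo
            for size in range(1, len(branches))
            for combo in combinations(branches, size)]

def ordinal_cuts(branches):
    """Every contiguous run of branches, excluding the full list."""
    k = len(branches)
    return [tuple(branches[start:start + size])
            for size in range(1, k)
            for start in range(k - size + 1)]

branches = ["A", "B", "C", "D"]
for label, cuts in [("Simple", simple_cuts(branches)),
                    ("Combinatorial", combinatorial_cuts(branches)),
                    ("Ordinal", ordinal_cuts(branches))]:
    print(f"{label} ({len(cuts)}):", ", ".join("".join(cut) for cut in cuts))
```

For four branches this gives 4 simple, 14 combinatorial, and 9 ordinal cuts, matching the lists above.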
Result: Most Likely Cut
Occupations: Mining machine operators, Mining occupations n.e.c.
Observed: 59, Expected: 6.0
SPMR = 11.5, p=0.0001
Extension to Multiple Trees
There may not be one unique suitable tree.
It is trivial to extend the method to multiple trees, by simply scanning over all trees.
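Scanning over several trees can be read as pooling the candidate cuts from each tree into one set before the maximum is taken. The sketch below illustrates that reading with two hypothetical trees over the same four leaves; the helper `merge_cuts` and all group names are assumptions for illustration only.

```python
def merge_cuts(*cut_sets):
    """Pool cuts from several trees, keeping one entry per distinct leaf set."""
    merged = {}
    for cuts in cut_sets:
        for name, leaves in cuts.items():
            # Two trees may define the same group of leaves; keep the first name.
            merged.setdefault(frozenset(leaves), name)
    return merged

# Two hypothetical trees over the same leaves A, B, C, D.
leaf_cuts = {leaf: [leaf] for leaf in "ABCD"}
tree1_cuts = {**leaf_cuts, "T1:AB": ["A", "B"], "T1:CD": ["C", "D"]}
tree2_cuts = {**leaf_cuts, "T2:AC": ["A", "C"], "T2:BD": ["B", "D"]}

pooled = merge_cuts(tree1_cuts, tree2_cuts)
for leaves, name in pooled.items():
    print(name, sorted(leaves))
```

The same pooled set would have to be scanned in every Monte Carlo replication, so that the p-value reflects all cuts considered across both trees.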
Result: Most Likely Cut
Occupations: Mining machine operators, Mining engineers, Mining occupations n.e.c.
Observed: 60, Expected: 6.0
SPMR = 11.6, p=0.0001
Evaluated Combinations
Simple cuts: ~1,000
Mixed cuts: ~1,000,000
Two trees: ~1,000,000
Comparison with Classification and Regression Trees (CART)
Similarity:
The letters ‘T’, ‘R’, ‘E’ and ‘E’.
Both are Data Mining Methods
Difference
CART: There are multiple continuous or categorical variables, and a regression tree is constructed by making a hierarchical set of splits in the multidimensional space of the independent variables.
Tree-Based Scan Statistic: There may be only one independent variable (e.g. occupation). Rather than being used as a continuous or categorical variable, it is defined as a tree-structured variable. That is, we are not trying to estimate the tree, but use the tree as a new and different type of variable.
Conclusions
• The tree-based scan statistic is a useful data mining tool when we want to know whether a detected ‘cluster’ is due to chance, adjusting for the multiple testing of all possible cluster locations considered.
• It requires a variable that can be suitably expressed as a tree structure, although the method may be extended to other structures as well.
Conclusions
• There are many other potential application areas, such as pharmacovigilance where one is interested in detecting unsuspected adverse drug effects.
• Extensions can be made to tree-structured dependent variables, and to multiple tree-structured independent variables.