Post on 13-Jan-2016
Stefan Arnborg, KTH, SICS
Ingrid Agartz, Håkan Hall, Erik Jönsson, Anna Sillén, Göran Sedvall, Karolinska Institutet
’Principles of Data Mining and Knowledge Discovery’, Helsinki, Aug 2002
http://www.nada.kth.se/~stefan
Data Mining in Schizophrenia Research - preliminary
HUBIN - a project to accelerate research in human brain diseases
• Carefully selected patients, relatives and controls
• Each participant characterized over many domains
• DNA stored in bio-bank
• Each research team collects high-quality information, analyzes it, and stores it in an archive for inter-domain analyses.
Hubin organization
Ethical group: Göran Sedvall, Chairman
Hubin AB: Stig Larsson, Chairman; Håkan Hall, CEO
Project staff
Data domain responsibles
Management group: Håkan Hall, Assoc. Prof. (project manager); Stig Larsson, T.D. hc; Göran Sedvall, Prof.; Stefan Arnborg, Prof.; Tom McNeil, Prof.; Lars Therenius, Prof.
Scientific advisory board: Göran Sedvall, Chairman; Nancy Andreasen, Univ of Iowa; Paul Greengard, Rockefeller Univ; Tomas Hökfelt, Karolinska Inst.
Leading causes of disability in the world, WHO (1990)
Cause of disability                           Total (millions)   % of world total
1. Unipolar major depression                  50.8               10.7
2. Iron deficiency anemia                     22.0               4.7
3. Falls                                      22.0               4.6
4. Alcohol use                                15.8               3.3
5. Chronic obstructive pulmonary disease      14.7               3.1
6. Bipolar disorder                           14.1               3.0
7. Congenital anomalies                       13.5               2.9
8. Osteoarthritis                             13.3               2.8
9. Schizophrenia                              12.1               2.6
10. Obsessive compulsive disorder             10.2               2.2
Schizophrenia - Questions and Clues
• Cause(s) of schizophrenia not known.
• Medication effective against some symptoms - discovered by chance 100-2000 years ago.
• Does not appear in animals - no experimental clues.
• Explanation models vary over time.
• Disturbed neuronal circuitry in schizophrenia? (currently hottest hypothesis)
• Influenced by genotype and/or environment? (clustering in families)
Schizophrenia - Questions and Clues
• Which processes result in disease?
• Traces of disturbed development visible in MRI (anatomy) and blood tests?
• Genetic risk factors?
• Causal pathways?
• MAIN PROBLEM: Connect psychiatry to physiology
Preliminary analysis
Test case: 144 subjects: 61 affected, 83 controls
Variables:
• Diagnosis (DSM-IV)
• Demography (age, gender, ..)
• Blood tests (liver, heart, …)
• Genetics (20 SNPs, receptor, growth factors, …)
• Anatomy (MRI)
• Neuropsychology (working memory, reactions)
• Clinical test batteries (type of delusions, history, medication)
Types of images used in HUBIN
In vivo imaging:
• Magnetic resonance images (MRI)
• Functional magnetic resonance images (fMRI)
• Positron emission tomography (PET)
• Single photon emission tomography (SPECT)
In vitro imaging (whole hemispheres):
• Autoradiography
• In situ hybridization
[Figure: example images labelled MRI, PET, ISHH, LAR]
Brain boxes
Picture from BRAINS II manual, Magnotta et al, University of Iowa
Manually drawn vermis regions
ROIs drawn by Gaku Okugawa
Single Nucleotide Polymorphism
RNA:          AUG  UUC  CAU  UAU  UGU
Protein A:         Phe  His  Tyr  Cys
RNA:          AUG  UUU  CAU  UAU  UAU
Protein A':        Phe  His  Tyr  Tyr
UUC → UUU: non-coding SNP (Phe in both)
UGU → UAU: coding SNP (Cys becomes Tyr)
Protein A can be slightly different from A'
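The difference between the two kinds of SNP can be checked mechanically. The sketch below translates the two RNA sequences from this slide codon by codon and reports which substitutions change the amino acid. The codon table is deliberately partial (only the six codons that occur here, taken from the standard genetic code, including AUG = Met, which the slide does not label), and the function names are illustrative, not from the HUBIN project.

```python
# Partial codon table: only the codons occurring in the slide's two sequences.
CODON_TO_AMINO = {
    "AUG": "Met", "UUC": "Phe", "UUU": "Phe",
    "CAU": "His", "UAU": "Tyr", "UGU": "Cys",
}

def codons(rna):
    """Split an RNA string into consecutive three-letter codons."""
    return [rna[i:i + 3] for i in range(0, len(rna), 3)]

def translate(rna):
    """Translate an RNA string codon by codon using the partial table."""
    return [CODON_TO_AMINO[c] for c in codons(rna)]

def compare(rna_a, rna_b):
    """List differing codon positions and whether the amino acid changes."""
    changes = []
    for pos, (ca, cb) in enumerate(zip(codons(rna_a), codons(rna_b))):
        if ca != cb:
            aa, ab = CODON_TO_AMINO[ca], CODON_TO_AMINO[cb]
            kind = "coding" if aa != ab else "non-coding"
            changes.append((pos, ca, cb, aa, ab, kind))
    return changes

RNA_A = "AUGUUCCAUUAUUGU"  # first sequence on the slide
RNA_B = "AUGUUUCAUUAUUAU"  # second sequence on the slide

for change in compare(RNA_A, RNA_B):
    print(change)
```

The first reported change (UUC to UUU) leaves Phe unchanged, the second (UGU to UAU) replaces Cys by Tyr, which is why protein A' can differ slightly from protein A.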
Genes studied
• DBH dopamine beta-hydroxylase
• DRD2 dopamine receptor D2 +
• DRD3 dopamine receptor D3
• HTR5A serotonin receptor 5A
• NPY neuropeptide Y
• SLC6A4 serotonin transporter
• BDNF brain derived neurotrophic factor
• NRG1 neuregulin +
Elementary Visualizations: MRI, intracranial volume
[Figure: cumulative distribution of intracranial volume (ml); + = schizo, second marker = controls]
Elementary Visualizations: MRI data
[Figure: cumulative distribution of total CSF volumes (ml); + = schizo, second marker = controls; p < 0.0002]
Blood data: Gamma GT - alcohol marker
[Figure: cumulative distribution of Gamma GT; + = schizo, second marker = controls; p < 0.01]
Gender differences: MRI
[Figure: two panels of subcortical white, "Sub White-men" and "Sub White-women", vertical axis 0 to 1; + = schizo, second marker = controls]
Which methods to use?
• Visualizations, cdf and scatter plots, give intuitive grasp of variables - problems with many interrelated variables
• Statistical modelling required to decide significance of visible trend, and to rank effects
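The cumulative-distribution plots used on the preceding slides are easy to compute by hand. A minimal sketch, assuming plain Python lists of measurements; the function names are illustrative, and the maximum vertical gap is only one possible numerical summary of what the eye sees in such a plot.

```python
def ecdf(sample):
    """Return the sorted values and the cumulative fractions 1/n, 2/n, ..., 1."""
    xs = sorted(sample)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

def ecdf_at(sample, x):
    """Fraction of the sample that is less than or equal to x."""
    return sum(1 for v in sample if v <= x) / len(sample)

def max_cdf_gap(sample_a, sample_b):
    """Largest vertical distance between the two empirical cdfs."""
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf_at(sample_a, x) - ecdf_at(sample_b, x)) for x in points)
```

Plotting the two lists returned by `ecdf` for affected and for controls on the same axes gives the kind of figure shown above; a large `max_cdf_gap` corresponds to clearly separated curves, but deciding significance still needs a statistical model or test.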
Statistical methods
• Bayesian methods intuitive and rational - but conventional testing required for publications
• Linear models - need to account for mixing and over-dispersion (Glenn Lawyer thesis project).
• Discretization and Bayesian analysis of discrete distributions - intuitive, but information lost
• Non-parametric randomization tests - most sensitive and accommodate modern multiple testing paradigms
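A non-parametric randomization test can be sketched in a few lines. The version below is a generic two-sided permutation test on the difference in group means, not necessarily the test used in the HUBIN analyses; the function names, the add-one correction and the default number of permutations are assumptions made for illustration.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided randomization test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabelling of the subjects
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)
```

The same loop works for any test statistic in place of the mean difference, which is what makes randomization tests convenient when many variables of different kinds have to be screened.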
Bayes’ factor
• Choice between two hypotheses, H1 and H2, given experimental/observational data D
$\frac{P(H_1|D)}{P(H_2|D)} = \frac{P(D|H_1)}{P(D|H_2)} \cdot \frac{P(H_1)}{P(H_2)}$
Posterior odds = Bayes factor × prior odds
Hypotheses in test matrix
• H1: (no effect) a data column is generated independently of diagnosis (composite model)
• H2: the data for controls are generated by one composite model, for affected by another one.
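Hierarchical models
• Continuous model: $H: P(x \in X) = \int_X f(x)\,dx$
• Parameterized model: $H_\lambda: P(x \in X) = \int_X f(x|\lambda)\,dx$
• Hierarchical model, prior distribution $f(\lambda)$ for $\lambda$: $H_1: P(D|H_1) = \int_\Lambda f(D|\lambda) f(\lambda)\,d\lambda$
• Inference for the parameter: $f(\lambda|D) \propto f(D|\lambda) f(\lambda)$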
Model adequate?
• Best tested with classical p-values.
• Determine posterior for parameter: $f(\lambda|D) \propto f(D|\lambda) f(\lambda)$
• Design test function: $t: D \to R$
• Compute p-value: $p = P(t(D_r) < t(D))$, where $D_r \sim f(\cdot|\lambda) f(\lambda|D)$
• Reject model if p small, e.g., <1%, <5%
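The model check above can be sketched by simulation. This example assumes a toy setting that is not from the slides: D is a sequence of coin flips, the model is independent flips with head probability λ and a uniform prior (so the posterior is Beta(s+1, f+1)), and the test function t counts how often the outcome switches between adjacent flips. All names are illustrative.

```python
import random

def count_switches(flips):
    """Test function t: number of adjacent positions where the outcome changes."""
    return sum(1 for a, b in zip(flips, flips[1:]) if a != b)

def posterior_predictive_p(flips, n_rep=5000, seed=0):
    """Estimate p = P(t(D_r) < t(D)) with D_r drawn from the fitted model."""
    rng = random.Random(seed)
    s = sum(flips)
    f = len(flips) - s
    t_obs = count_switches(flips)
    below = 0
    for _ in range(n_rep):
        lam = rng.betavariate(s + 1, f + 1)  # lambda ~ f(lambda | D)
        replicate = [1 if rng.random() < lam else 0 for _ in flips]  # D_r ~ f(. | lambda)
        if count_switches(replicate) < t_obs:
            below += 1
    return below / n_rep
```

For a clumped sequence such as ten heads followed by ten tails, almost no replicated data set has fewer switches than the observed one, so p is close to 0 and the independent-flip model is rejected even though the head frequency itself looks unremarkable.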
Bayes’ example
• Result D from test: s heads, f tails, n = s + f
• H0: Coin is balanced, $P(D|H_0) = 2^{-n}$
• $H_p$: Coin has head probability p, $P(D|H_p) = p^s (1-p)^f$
• H1: $H_p$ with uniform prior for p, hierarchical:
$P(D|H_1) = \int_0^1 P(D|H_p)\,dp = \frac{s!\,f!}{(n+1)!}$
[Figure: Bayes factor - ratio between areas]
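The two marginal likelihoods in the coin example can be evaluated exactly. A minimal sketch using exact rational arithmetic; the function names are illustrative, and the Bayes factor is taken here as P(D|H1)/P(D|H0), so values above 1 favour the unbalanced-coin hypothesis.

```python
from fractions import Fraction
from math import factorial

def evidence_h0(s, f):
    """P(D|H0) = 2^(-n) for a balanced coin, n = s + f."""
    return Fraction(1, 2 ** (s + f))

def evidence_h1(s, f):
    """P(D|H1) = s! f! / (n+1)! for a uniform prior on the head probability."""
    return Fraction(factorial(s) * factorial(f), factorial(s + f + 1))

def bayes_factor(s, f):
    """Bayes factor P(D|H1) / P(D|H0)."""
    return evidence_h1(s, f) / evidence_h0(s, f)

print(float(bayes_factor(5, 5)))   # 5 heads, 5 tails
print(float(bayes_factor(10, 0)))  # 10 heads, 0 tails
```

With 5 heads and 5 tails the factor is 256/693 (about 0.37), mildly favouring the balanced coin; with 10 heads and no tails it is 1024/11 (about 93), strongly favouring H1.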
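Graphical models
[Figure: three graphs over the nodes X, Y, Z]
• No edges: $f(x,y,z) = f(x) f(y) f(z)$
• Edges X-Y and X-Z only: $f(x,y,z) = f(x,z) f(x,y) / f(x)$
• All edges: $f(x,y,z)$, no factorization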
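The middle factorization says that Y and Z are independent given X. The sketch below checks this numerically on a small binary distribution that is invented for the purpose (the conditional probabilities are hypothetical, not HUBIN data): the joint is built so that Y and Z each depend only on X, and the two factorizations are then tested exactly.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical binary distribution in which Y and Z depend only on X.
p_x = {0: F(1, 2), 1: F(1, 2)}
p_y_given_x = {0: {0: F(3, 4), 1: F(1, 4)}, 1: {0: F(1, 4), 1: F(3, 4)}}
p_z_given_x = {0: {0: F(2, 3), 1: F(1, 3)}, 1: {0: F(1, 3), 1: F(2, 3)}}

joint = {(x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_x[x][z]
         for x, y, z in product((0, 1), repeat=3)}

def marginal(keep):
    """Sum the joint over all variables whose index is not in `keep`."""
    out = {}
    for outcome, prob in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0) + prob
    return out

f_x, f_y, f_z = marginal((0,)), marginal((1,)), marginal((2,))
f_xy, f_xz = marginal((0, 1)), marginal((0, 2))

# f(x,y,z) = f(x,z) f(x,y) / f(x): holds for the graph with edges X-Y and X-Z.
star_holds = all(joint[x, y, z] == f_xz[x, z] * f_xy[x, y] / f_x[(x,)]
                 for x, y, z in joint)

# f(x,y,z) = f(x) f(y) f(z): the edgeless graph, which fails for this joint.
fully_independent = all(joint[x, y, z] == f_x[(x,)] * f_y[(y,)] * f_z[(z,)]
                        for x, y, z in joint)

print(star_holds, fully_independent)
```

Structure learning for graphical models works in the opposite direction: it looks for the sparsest graph whose factorization is still compatible with the observed data.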
Multivariate characterization by graphical models: MRI volumes, blood, demography
[Figure: graphical model over the nodes Dia, BrsCSF, TemCSF, SubCSF, TotCSF]
Adding Vermis variables
[Figure: graphical model over the nodes Dia, BrsCSF, TemCSF, PSV]
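V-structures, causality
[Figure: X → Y and X ← Y] Indistinguishable: $f(x,y) = f(y|x) f(x) = f(x|y) f(y)$
[Figure: graphs over A, B, C]
• Chain or fork through B: A and C dependent, but A ⊥ C | B. These graphs are indistinguishable from each other.
• V-structure A → B ← C: A ⊥ C, but A and C dependent given B.
V-structures detectable from observational data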
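Why a v-structure leaves a detectable trace can be shown with an exact toy computation. The example is an assumption made for illustration, not taken from the slides: A and C are independent fair coins and B = A OR C, so the graph is A → B ← C. The code checks marginal independence of A and C and their dependence once B is given.

```python
from fractions import Fraction as F
from itertools import product

# Toy v-structure A -> B <- C: A and C are independent fair coins, B = A OR C.
joint = {}
for a, c in product((0, 1), repeat=2):
    joint[(a, a | c, c)] = F(1, 4)

def prob(a=None, b=None, c=None):
    """Probability of the event fixing any subset of A, B, C."""
    total = F(0)
    for (va, vb, vc), p in joint.items():
        if a in (None, va) and b in (None, vb) and c in (None, vc):
            total += p
    return total

# Marginal independence: P(a, c) = P(a) P(c) for all values.
marginally_independent = all(
    prob(a=a, c=c) == prob(a=a) * prob(c=c)
    for a, c in product((0, 1), repeat=2))

# Conditional independence given B: P(a, b, c) P(b) = P(a, b) P(b, c) for all values.
independent_given_b = all(
    prob(a=a, b=b, c=c) * prob(b=b) == prob(a=a, b=b) * prob(b=b, c=c)
    for a, b, c in product((0, 1), repeat=3))

print(marginally_independent, independent_given_b)
```

The pattern "independent, but dependent given B" cannot be produced by a chain or fork through B, which is what makes the orientation of a v-structure recoverable from observational data.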
Pairs associated to Diagnosis
[Figure: four graphs over the nodes Y, Z and D]
Y and Z co-vary differently for Affected and Controls
Age-dependency of Posterior Superior Vermis
[Figure: post sup vermis plotted against age at MRI; + = schizo, second marker = controls]
No co-variation between posterior inferior vermis and parietal white for affected
[Figure: post inf vermis plotted against parietal white (ParWhite); + = schizo, second marker = controls]
PSV has best explanatory power
[Figure: posterior superior vermis (PS Vermis), affected and healthy, vertical axis 0 to 1; + = schizo, second marker = controls]
Decision tree for Diagnosis: MRI Data
[Figure: decision tree; A = schiz, C = controls, () = misclassified]
Classification explains data! (Can Mert thesis project)
[Figure: graphs over the nodes X, Y, Z, one with an added node H, and W]
[Figure: Autoclass1 against total gray; A = schiz, C = controls]
Weak signals in genetics data
• Numerous investigations have indicated ‘almost significant’ associations between SNPs and diagnosis
• Typically, these findings cannot be confirmed in other studies - populations are genetically heterogeneous and measurements are not standardized.
• We try to connect SNPs both to diagnosis and to other phenotypical variables
• Multiple testing and weak signal problems.
Genetics data - weak statistics
Gene     SNP                  type   genotypes
DRD3     SerGly               A/C     49    59    14
DRD2     Ser311Cys            C/G    118     4     0
NPY      Leu7Pro              A/C      1     7   144
DBH      Ala55Ser             G/T     98    24     0
BDNF     Val66Met             A/G      5    37    80
HTR5A    Pro15Ser             C/T    109    11     2
PNOC     Gln172Arg            A/G     11    37    28
SLC6A4   (del(44bp) in pr)    S/L     20    60    42
Empirical distribution by genotype: Gene BDNF (schiz + controls)
[Figure: cumulative distribution of frontal CSF (Frontal-CSF-right-) for the genotypes A/A, A/G and G/G]
Bonferroni-Hochberg-Benjamini methods: MRI and lab data
[Figure: observed p-values plotted against number of p-values, with a ‘no effect’ reference; FDRi 71, FDRd 62]
Benjamini & Yekutieli, Annals of Math Stat, (ta)
Multiple comparisons: what is the significance of min p-values 1, 1, 2, 3% in 20 tests? What is the probability of obtaining a more extreme result?
Compensating multiple comparisons
• Bonferroni 1937: For level α and n tests, use level α/n
• Hochberg 1988: step-up procedure
• Benjamini, Hochberg 1996: False Discovery Rate
• J. Storey, 2002: pFDRi, pFDRd
• Bayesian interpretations being developed (Wasserman & Genovese, 2002)
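The first three procedures in this list differ only in the thresholds they compare the sorted p-values against. A minimal sketch, assuming a plain list of p-values and a level alpha; the function names are illustrative. In the usage lines, the four smallest p-values (1%, 1%, 2%, 3% in 20 tests) come from the slides, while the remaining sixteen values of 0.5 are hypothetical filler, since the slides do not give them.

```python
def bonferroni(p_values, alpha):
    """Reject each hypothesis whose p-value is at most alpha / n."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]

def _step_up(p_values, threshold):
    """Reject the k smallest p-values, k = largest rank with p_(k) <= threshold(k)."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= threshold(rank):
            k_max = rank
    reject = [False] * len(p_values)
    for idx in order[:k_max]:
        reject[idx] = True
    return reject

def hochberg(p_values, alpha):
    """Hochberg step-up procedure: threshold alpha / (n - k + 1) at rank k."""
    n = len(p_values)
    return _step_up(p_values, lambda k: alpha / (n - k + 1))

def benjamini_hochberg(p_values, alpha):
    """Benjamini-Hochberg FDR procedure: threshold k * alpha / n at rank k."""
    n = len(p_values)
    return _step_up(p_values, lambda k: k * alpha / n)

def bonferroni_adjusted_min(p_min, n):
    """Bonferroni-adjusted significance of the smallest of n p-values."""
    return min(1.0, n * p_min)

# Four smallest p-values from the slides; the other sixteen are hypothetical filler.
p_values = [0.01, 0.01, 0.02, 0.03] + [0.5] * 16
print(bonferroni_adjusted_min(min(p_values), len(p_values)))
print(sum(bonferroni(p_values, 0.05)),
      sum(hochberg(p_values, 0.05)),
      sum(benjamini_hochberg(p_values, 0.05)))
```

The Bonferroni-adjusted value for the smallest p-value is 20 × 1% = 20%. At level 5% none of the three procedures rejects anything for these numbers; the step-up procedures gain power mainly when several small p-values occur together.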
Diagnosis-genotype: 21 tests on three genes
bdnf     drd2     nrg1
0.1136   0.0735   0.8709
0.0801   0.2213   0.7666
0.0316   0.0823   0.6426
0.5499   1.0000   0.0244
0.7314   0.7312   0.0103

bdnf     drd2     nrg1
0.1137   0.0749   0.8744
0.7293   0.7276   0.0096

p-values 3%, 2%, 1%, 1% in 20 tests:
7% - not quite significant! But better than Bonferroni: 20%
q-value - FDR rate in prefix
[Figure: four panels titled TPH:SNP000002367, BDNF:SNP000006430, sex and diagnos; horizontal axis 0 to 400, vertical axis 0 to 1]
[Figure: panel grids for VermisMiddle-grey-right-, VermisUpper-grey-, Subcortical-grey-left- and Parietal-CSF-left-, with sub-panels labelled men, women and control; vertical axis 0 to 1; each panel is annotated with a numeric value]
That’s all, folks!
• High-quality databases for medical research of the HUBIN type open the way for intelligent data analysis methods used in engineering and business
• Already with the limited data presently available, interesting clues emerge
• Multiple testing considerations are important
• Long term effort - stable economy and engagement is vital.