Machine Learning Methods for Population Neuroimaging

Thomas E. Nichols

University of Warwick

Neuroimaging: More than eye-candy?

Neuroimaging Frisson

• Brain

– Seat of behavior, cognition, consciousness

– Essential for unraveling mental disorders

• Profound cost to society

– Cost of treatment, welfare/disability & loss of earnings

» UK: £100 billion/year | US: $300 billion/year

• Functional neuroimaging

– Before, we only had neuropsychology

• Wait for cerebral accidents & observe behavior change

– E.g. stroke, industrial accidents

– Now have Functional MRI

Phineas Gage

Magnetic Resonance Imaging

T1 Map T2 Map Grey Matter (GM) White Matter (WM)

Fractional Anisotropy (FA) Mean Diffusivity (MD) Susceptibility Weighting (SWI) Cerebral Blood Volume (CBV)

Different Types of MRI Acquisitions/Parameters

Different Possible Analysis Types

BOLD fMRI

Cortical Thickness

Diffusion Tensors

Tract-Based Analysis of FA

• MRI - rich set of tools

– Multiple “dials” to assess brain anatomy & physiology

“Old” Neuroimaging: Observational Studies of Structure

• “Voxel Based Morphometry” Study of London Taxi cab drivers

– n=16 Taxi drivers

• Mean 2yrs studying “The Knowledge”

– n=30 controls

• Hippocampal volume differences, increasing with time as driver

– Hippocampus key for long-term memory formation

• Methods

– Small, observational sample

– Mass Univariate Model

• Parametric RFT

Maguire et al. (2000). Navigation-related structural change in the hippocampi of taxi drivers. PNAS, 97(8), 4398–403.

Volume Differences vs. Time as Driver

Brain Volume Differences: Taxi Drivers > Controls

Traditional Neuroimaging

• Traditionally a small sample affair

– Typical group size 10!

• Low n = low power

– Power: Prob. of true positive

– But if study has low power, even if true effect, you’ll never replicate it

Random sample of 300 articles using fMRI in 2007. Carp (2012) NeuroImage.

Power of 730 studies in Neuroscience. Median power = 20%. Button et al. (2013) Nature Reviews Neuroscience.
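
To make "low n = low power" concrete, here is a minimal simulation sketch. The effect size (Cohen's d = 0.5), the number of simulations and the function name `simulated_power` are assumptions chosen for illustration, not values from the studies cited above. The critical values are taken from standard Student's t tables.

```python
import numpy as np

def simulated_power(n_per_group, effect_size, t_crit, n_sims=20000, seed=0):
    """Fraction of simulated two-sample t-tests rejecting H0 (two-sided, alpha = 0.05)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
    b = rng.normal(effect_size, 1.0, size=(n_sims, n_per_group))
    # Pooled-variance t statistic for two groups of equal size
    pooled_var = (a.var(axis=1, ddof=1) + b.var(axis=1, ddof=1)) / 2.0
    t = (b.mean(axis=1) - a.mean(axis=1)) / np.sqrt(pooled_var * 2.0 / n_per_group)
    return float(np.mean(np.abs(t) > t_crit))

# Two-sided 5% critical values from t tables: df = 18 -> 2.101, df = 198 -> 1.972
power_small = simulated_power(10, 0.5, t_crit=2.101)    # typical "old" group size
power_large = simulated_power(100, 0.5, t_crit=1.972)   # ten times larger groups
print(f"n=10 per group:  power ~ {power_small:.2f}")
print(f"n=100 per group: power ~ {power_large:.2f}")
```

Under these assumptions a true medium-sized effect is detected in only a small minority of n=10 studies, which is the replication problem the slide describes.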

Population Neuroimaging

• Population Sampling

– Not using a sample of convenience

– A selection method that should equally sample all members of a population

• E.g. based on voter registration, doctor

– Representative of a larger population

– And a larger sample

Machine Learning for Prediction

Prediction – The basics

• Supervised Learning

– Build prediction algorithm that maps inputs to an output

• Input: data/features

– E.g. disease duration, blood work, age, etc.

• Output: labels/target variable

– E.g. “success” (no symptoms for 6 months), symptom score

– Must know ‘truth’, true labels

– Build algorithm using ‘training’ set, evaluate with ‘test’ set

• Unsupervised Learning

– Exploring heterogeneity in data

– Do groups of subjects cluster / segregate?

• Based on which variables?

– After unsupervised learning, may give you idea for supervised model

• Core goal of statistics:

– Make an inference on population sampled

– E.g. On average, do women have longer hair than men?

• Randomly sample men, women, measure hair length

• Test H0: μMensHair = μWomensHair vs. Ha: μMensHair < μWomensHair

• If reject Ho, conclude something about population means

– Very likely with sufficient data

Statistics vs. Machine Learning: Inference vs. Prediction

[Figure: true population distributions of men’s and women’s hair length (SDs σM, σW), and sampling distributions of the mean hair length over 60 men and over 40 women (SDs σM/√60, σW/√40)]

Statistics vs. Machine Learning: Inference vs. Prediction

• Core goal of (supervised) machine learning:

– Make individual predictions, often the inverse question

– Can I use Hair Length to predict gender?

• Yes, but not perfectly

– Population distributions set limit on accuracy

– No increase in precision from averaging to help you
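
The contrast between inference and prediction can be sketched numerically. Everything below is a toy setup: the population means and SD for hair length are hypothetical, both groups are assumed normal with equal variance, and `summarize` is an illustrative name. The standard error of the mean difference shrinks with sample size, while individual classification accuracy stays pinned near the limit set by the overlap of the two population distributions.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
MU_M, MU_W, SD = 10.0, 25.0, 10.0   # hypothetical population values (cm)

def summarize(n):
    """Draw n men and n women; return (SE of mean difference, classification accuracy)."""
    men = rng.normal(MU_M, SD, n)
    women = rng.normal(MU_W, SD, n)
    # Inference: precision of the mean difference improves like 1/sqrt(n)
    se_diff = math.sqrt(men.var(ddof=1) / n + women.var(ddof=1) / n)
    # Prediction: label each individual by thresholding at the midpoint of the sample means
    cut = (men.mean() + women.mean()) / 2.0
    acc = (np.sum(men < cut) + np.sum(women >= cut)) / (2 * n)
    return se_diff, float(acc)

se_small, acc_small = summarize(50)
se_large, acc_large = summarize(50000)

# Best achievable accuracy for two equal-variance normals with equal priors: Phi(d / 2)
d = (MU_W - MU_M) / SD
best_acc = 0.5 * (1.0 + math.erf((d / 2.0) / math.sqrt(2.0)))

print(f"n=50:    SE of difference = {se_small:.3f}, accuracy = {acc_small:.3f}")
print(f"n=50000: SE of difference = {se_large:.3f}, accuracy = {acc_large:.3f}")
print(f"accuracy ceiling from the populations = {best_acc:.3f}")
```

More data makes the population-level conclusion arbitrarily precise, but it does not move the accuracy ceiling for predicting an individual.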



Multivariate Machine Learning

• What if, instead of just using hair length, we could use other variables?

– Number of baseball caps owned?

– Indeed… often considering multiple variables gives better prediction

Common to all predictive models:

• Training data, {labels, features}

• Classifier/predictor

• Key aspects are:

– What form does the classifier/predictor have?

– How do you train/estimate it?

[Figure: scatter plot of # caps against hair length, classifying a new point as M or F; labels: internal prediction, external prediction]
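
As a sketch of why a second feature can help, the example below trains a classifier on hair length alone and then on hair length plus number of caps, and evaluates both on new data. The simulated populations are hypothetical, and Fisher's linear discriminant is used only because it is short to write in NumPy; the slides do not name a specific classifier here.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(n):
    """Hypothetical data; columns are [hair length (cm), number of caps]."""
    men = np.column_stack([rng.normal(10, 10, n), rng.normal(4, 2, n)])
    women = np.column_stack([rng.normal(25, 10, n), rng.normal(2, 2, n)])
    X = np.vstack([men, women])
    y = np.r_[np.zeros(n), np.ones(n)]          # 0 = M, 1 = F
    return X, y

def fit_lda(X, y):
    """Fisher's linear discriminant: returns weight vector w and threshold c."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    # Pooled within-class covariance (atleast_2d handles the single-feature case)
    S = np.atleast_2d((np.cov(X[y == 0], rowvar=False) + np.cov(X[y == 1], rowvar=False)) / 2.0)
    w = np.linalg.solve(S, m1 - m0)
    c = w @ (m0 + m1) / 2.0
    return w, c

def accuracy(model, X, y):
    w, c = model
    return float(np.mean((X @ w > c) == (y == 1)))

X_train, y_train = simulate(1000)
X_test, y_test = simulate(20000)                # "external" data never used for fitting

acc_hair = accuracy(fit_lda(X_train[:, :1], y_train), X_test[:, :1], y_test)
acc_both = accuracy(fit_lda(X_train, y_train), X_test, y_test)
print(f"hair length only: {acc_hair:.3f}")
print(f"hair length + caps: {acc_both:.3f}")
```

The two questions on the slide map onto the code directly: `fit_lda` fixes the form of the predictor (a linear boundary) and how it is estimated (class means and pooled covariance).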

Key Prediction Concept: Overfitting

• Trying to predict continuous response (green curve)

• For this n=10 dataset, an order-9 polynomial fits perfectly!

• But it likely won’t generalize

• Thus essential to use cross-validation schemes

– Must test on held-out data, to estimate generalization accuracy

[Figure legend: true function; fitted function]
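
The overfitting example can be reproduced in a few lines. This is a sketch under assumptions: the true function is taken to be a sine curve and the noise SD is 0.3, neither of which is stated on the slide. An order-9 polynomial through 10 points has zero training error but does much worse on held-out data.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)

def true_function(x):
    # Assumed ground truth for illustration; the slide's actual curve is not specified
    return np.sin(2 * np.pi * x)

noise_sd = 0.3
x_train = np.linspace(0.0, 1.0, 10)                       # n = 10 training points
y_train = true_function(x_train) + rng.normal(0.0, noise_sd, x_train.size)
x_test = rng.uniform(0.0, 1.0, 1000)                      # held-out data
y_test = true_function(x_test) + rng.normal(0.0, noise_sd, x_test.size)

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    # Polynomial.fit rescales x internally, which keeps the order-9 fit numerically stable
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = {"train": mse(model, x_train, y_train),
                       "test": mse(model, x_test, y_test)}
    print(degree, results[degree])
```

Training error alone would pick the order-9 model; only the held-out error reveals that it does not generalize.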

Fitting Predictive Models: Cross Validation

• Avoiding Over-fitting

– To get good estimate of accuracy on truly new, unseen data

• Leave One Out Cross-Validation (LOOCV)

– Hold out one case/subject’s data, treat as “new”

– Fit model on “held in” data

– Make prediction on “held out” data

– Repeat N times, giving ‘held-out’ estimate for each case

– Simplest approach; gives nearly unbiased estimates of accuracy

– Most computationally expensive, gives variable estimates of accuracy
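
The LOOCV steps above can be sketched as follows. Ordinary least squares stands in for whatever model is being evaluated, and the simulated data and the name `loocv_predictions` are illustrative assumptions.

```python
import numpy as np

def loocv_predictions(X, y):
    """Return the held-out prediction for each case under leave-one-out CV."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])      # add intercept column
    preds = np.empty(n)
    for i in range(n):
        held_in = np.arange(n) != i                # everyone except case i
        beta, *_ = np.linalg.lstsq(design[held_in], y[held_in], rcond=None)
        preds[i] = design[i] @ beta                # predict the held-out case
    return preds

rng = np.random.default_rng(4)
n = 20
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(0.0, 1.0, n)

preds = loocv_predictions(X, y)
loocv_mse = float(np.mean((preds - y) ** 2))

# In-sample error from a single fit to all the data, for comparison
design = np.column_stack([np.ones(n), X])
beta_all, *_ = np.linalg.lstsq(design, y, rcond=None)
insample_mse = float(np.mean((design @ beta_all - y) ** 2))
print(f"in-sample MSE = {insample_mse:.3f}, LOOCV MSE = {loocv_mse:.3f}")
```

The model is refit N times, once per held-out case, which is where the computational cost comes from.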

Leave One Out Cross-Validation

[Figure: grid of 20 subjects (subj. 1 to subj. 20) by 20 runs (run 1 to run 20); in each run one subject is the test datum and the remaining subjects are the training data]

Fitting Predictive Models: Cross Validation

• K-fold cross-validation

– Divide data into K “folds”

– Hold out one fold; fit model with remaining K-1 “folds”

– Predict each of the held-out data

– Repeat K times

– More computationally efficient than LOOCV, and less variable

• Generally K-fold CV is recommended

– K=10 typical
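
A minimal K-fold sketch, again with least squares as a placeholder model. The helper names `kfold_indices` and `kfold_mse` and the simulated data are assumptions for illustration.

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and split into k nearly equal folds."""
    return np.array_split(rng.permutation(n), k)

def kfold_mse(X, y, k, rng):
    """Mean squared held-out error over a single K-fold split."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    sq_err = np.empty(n)
    for test_idx in kfold_indices(n, k, rng):
        train = np.ones(n, dtype=bool)
        train[test_idx] = False                    # hold this fold out
        beta, *_ = np.linalg.lstsq(design[train], y[train], rcond=None)
        sq_err[test_idx] = (design[test_idx] @ beta - y[test_idx]) ** 2
    return float(sq_err.mean())

rng = np.random.default_rng(6)
n = 200
X = rng.normal(size=(n, 3))
y = 0.5 + X @ np.array([1.0, 0.0, -2.0]) + rng.normal(0.0, 1.0, n)

folds = kfold_indices(n, 10, rng)
mse_10fold = kfold_mse(X, y, 10, rng)
print(f"10-fold CV MSE = {mse_10fold:.3f}")
```

With K=10 the model is fit 10 times rather than n times; setting k equal to n in this sketch recovers leave-one-out CV.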

[Figure: Illustration of 4-fold CV; in each of the 4 runs a different quarter of the data is the test set and the rest is training data]

Prediction Methods: Ridge & Lasso Regression

[Figure: curve fits for GLM (no regularization), Ridge with λ = 1.5×10⁻⁸, and Ridge with λ = 1]

[Figure: Illustration of 4-fold nested CV; the training data of run 1 is split again into inner runs 1/1 to 1/4, each with its own test fold, and likewise for run 2 onward]

• Use inner test data to optimize λ

• Then use optimal λ to finally predict run 1’s test data

• Revisit curve-fitting… with Ridge Regression

• No automatic way to find λ!

– Must use another cross-validation!

– Nested CVs can be very slooow
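
A sketch of nested CV for choosing the ridge penalty λ. The closed-form ridge solution is standard; the candidate λ grid, fold counts, simulated data and function names are assumptions for illustration. The data are simulated with zero mean, so no intercept is fitted.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: beta = (X'X + lam*I)^(-1) X'y (no intercept; data assumed centered)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_mse(X, y, lam, folds):
    """Mean squared held-out error of ridge with penalty lam over the given folds."""
    err = 0.0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        beta = ridge_fit(X[train], y[train], lam)
        err += np.sum((X[test_idx] @ beta - y[test_idx]) ** 2)
    return err / len(y)

def nested_cv(X, y, lambdas, k_outer=4, k_inner=4, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    sq_err, chosen = np.empty(n), []
    for test_idx in np.array_split(rng.permutation(n), k_outer):
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        X_tr, y_tr = X[train_idx], y[train_idx]
        # Inner CV sees only the outer training data when picking lambda
        inner_folds = np.array_split(rng.permutation(len(y_tr)), k_inner)
        scores = [cv_mse(X_tr, y_tr, lam, inner_folds) for lam in lambdas]
        best = lambdas[int(np.argmin(scores))]
        chosen.append(best)
        # Refit on all outer training data with the chosen lambda, then predict the outer test fold
        beta = ridge_fit(X_tr, y_tr, best)
        sq_err[test_idx] = (X[test_idx] @ beta - y[test_idx]) ** 2
    return float(sq_err.mean()), chosen

rng = np.random.default_rng(5)
n, p = 80, 30
X = rng.normal(size=(n, p))
y = X[:, :3] @ np.array([1.0, -1.0, 0.5]) + rng.normal(0.0, 1.0, n)
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
mse, chosen = nested_cv(X, y, lambdas)
print(f"nested-CV MSE = {mse:.3f}; lambda chosen per outer fold: {chosen}")
```

Each outer fold triggers an inner fit for every candidate λ and every inner fold (here 4 × 5 × 4 = 80 fits plus 4 refits), which is why nested CV becomes slow with richer grids or larger K.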

Population Neuroimaging: HCP Application

Human Connectome Project (HCP) (1)

• Missouri Twin Registry (MOTWIN)

– Ascertainment from birth records

• State of Missouri Division of Vital Statistics

– In 1990’s, parents of all twins born to Missouri residents from 1975 to 1991 invited to participate

– HCP ran from 2010-2015

• Twin ages at study start, 20-36 years

• Families with at least 4 offspring selected, all siblings invited

• “Extended Twin Design”

– Allows heritability to be estimated, also improves power to detect genetic associations

Human Connectome Project (HCP) (2)

• Target sample

– n=1,200 (300 families of 4)

• Extensive testing with standard psychological, health-history tests

• Extensive state-of-the-art MRI

– Structural MRI

• 2× T1w, 2× T2w, 0.7 mm³ isotropic

– Functional MRI, Task & Rest

• TR=0.72s, 2 mm³ isotropic, 4× 15min resting

– Diffusion MRI

• 1.25 mm³ isotropic

HCP Mega Trawl

• Can we predict fundamental subject features with resting-state fMRI connectivity?

• Would like directional “arrows”

– In practice, all we get are undirected

Joint work with Steve Smith, Oxford & HCP team

HCP Mega Trawl

• 50 nodes (defined by ICA)

– Gives 50 × 50 network matrix… clustered

– Partial correlation sparser than full

[Figure: 50 × 50 network matrices, clustered; (full) correlation and partial correlation]
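
Why partial correlation is sparser than full correlation can be seen in a toy chain of three nodes, where node 0 drives node 1 and node 1 drives node 2. The sketch below uses an unregularized inverse covariance; the slides do not say how the HCP partial correlations were estimated, so this is an assumption, as are the simulated time series and function names.

```python
import numpy as np

def full_correlation(ts):
    """ts: (timepoints, nodes) array -> (nodes, nodes) Pearson correlation matrix."""
    return np.corrcoef(ts, rowvar=False)

def partial_correlation(ts):
    """Correlation of each node pair given all other nodes, via the precision matrix."""
    precision = np.linalg.inv(np.cov(ts, rowvar=False))
    d = np.sqrt(np.diag(precision))
    pcorr = -precision / np.outer(d, d)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr

rng = np.random.default_rng(9)
T = 5000
x0 = rng.normal(size=T)
x1 = 0.8 * x0 + rng.normal(size=T)     # node 1 depends on node 0
x2 = 0.8 * x1 + rng.normal(size=T)     # node 2 depends on node 1 only
ts = np.column_stack([x0, x1, x2])

full = full_correlation(ts)
partial = partial_correlation(ts)
print(f"nodes 0-2: full = {full[0, 2]:.2f}, partial = {partial[0, 2]:.2f}")
```

Nodes 0 and 2 are clearly correlated, but once node 1 is accounted for the partial correlation is close to zero, so the indirect edge disappears from the partial network matrix.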

HCP MegaTrawl: Predict each SM with NetMat

• Network Matrix (NetMat) for each subject

– 50×(50-1)/2 = 1,225 unique edges

• Also tried for networks with 25, 50, 100 & 200 nodes

– Partial correlation (r2z) between each pair of nodes

• “Subject Measures” (SM) for each subject

– 280 behavioral and demographic measures

• Fluid IQ, life satisfaction, stress, dexterity, smoking, alcohol/drug use, sleep quality, …
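
The edge count follows from keeping only the upper triangle of a symmetric matrix. The sketch below turns one subject's network matrix into a feature vector, assuming "r2z" denotes the Fisher r-to-z transform (arctanh); the random time series and the name `netmat_to_edge_vector` are illustrative, and a full correlation matrix is used as a stand-in for the partial correlation netmat.

```python
import numpy as np

def netmat_to_edge_vector(netmat):
    """Unique off-diagonal entries of a symmetric netmat, Fisher z-transformed."""
    n_nodes = netmat.shape[0]
    upper = np.triu_indices(n_nodes, k=1)     # strictly above the diagonal
    return np.arctanh(netmat[upper])          # r-to-z

rng = np.random.default_rng(8)
ts = rng.normal(size=(300, 50))               # hypothetical: 300 timepoints, 50 nodes
netmat = np.corrcoef(ts, rowvar=False)
edges = netmat_to_edge_vector(netmat)
print(edges.shape)                            # (1225,)
```

Stacking these vectors over subjects gives the subjects × edges feature matrix used for prediction.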

HCP Mega Trawl

• For each SM

– 10-fold CV to estimate “prediction R2”

– For each fold

• Using held-in (9/10th) folds

– Nuisance regression estimated & applied to SM & netmats

» Using brain/head size, motion, acquisition quarter

– Feature selection

» Each edge used to predict SM alone; best half of edges kept

– Retained edges jointly predict SM in Elastic Net (L1+L2) regression

» Optimized with 10-fold CV, nested within outer CV

• Held-out (1/10th) fold

– Nuisance-adjustment, using pre-estimated model

– Prediction with final model
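
The key point of the procedure above is that every data-dependent step (nuisance regression, feature selection, model fitting) is estimated on held-in data only. The sketch below follows that structure on simulated data. It is not the MegaTrawl code: ridge regression with a fixed penalty stands in for the elastic net with nested tuning, univariate correlation stands in for "each edge used to predict SM alone", and all names and data are hypothetical.

```python
import numpy as np

def fit_nuisance(conf, Z):
    """Least-squares coefficients for regressing Z on confounds (plus intercept)."""
    D = np.column_stack([np.ones(len(conf)), conf])
    beta, *_ = np.linalg.lstsq(D, Z, rcond=None)
    return beta

def remove_nuisance(conf, Z, beta):
    """Subtract the confound fit, using coefficients estimated on held-in data."""
    D = np.column_stack([np.ones(len(conf)), conf])
    return Z - D @ beta

def predict_one_fold(edges, sm, conf, train, test, lam=10.0):
    # 1. Nuisance regression: estimated on held-in data, applied to both parts
    b_edges = fit_nuisance(conf[train], edges[train])
    b_sm = fit_nuisance(conf[train], sm[train])
    E_tr = remove_nuisance(conf[train], edges[train], b_edges)
    E_te = remove_nuisance(conf[test], edges[test], b_edges)
    s_tr = remove_nuisance(conf[train], sm[train], b_sm)
    s_te = remove_nuisance(conf[test], sm[test], b_sm)
    # 2. Feature selection on held-in data only: keep the best half of edges
    #    (E_tr and s_tr are mean-zero after the intercept fit, so this is |correlation|)
    r = np.abs(E_tr.T @ s_tr) / (np.linalg.norm(E_tr, axis=0) * np.linalg.norm(s_tr))
    keep = np.argsort(r)[edges.shape[1] // 2:]
    # 3. Joint model on retained edges (ridge as a stand-in for elastic net)
    X = E_tr[:, keep]
    beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ s_tr)
    return E_te[:, keep] @ beta, s_te

rng = np.random.default_rng(7)
n, p = 200, 100
conf = rng.normal(size=(n, 3))                          # hypothetical confounds
edges = rng.normal(size=(n, p)) + 0.5 * conf[:, [0]]    # edges contaminated by a confound
sm = edges[:, :5].sum(axis=1) + conf[:, 0] + rng.normal(size=n)

pred, target = np.empty(n), np.empty(n)
for test in np.array_split(rng.permutation(n), 10):     # outer 10-fold CV
    train = np.setdiff1d(np.arange(n), test)
    pred[test], target[test] = predict_one_fold(edges, sm, conf, train, test)

r_cv = float(np.corrcoef(pred, target)[0, 1])
print(f"cross-validated correlation between predicted and adjusted SM: {r_cv:.2f}")
```

Running feature selection or nuisance estimation on the full dataset before splitting would leak held-out information into the model and inflate the apparent accuracy.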

Prediction Evaluation Measures: Continuous Responses – MegaTrawl

• Coefficient of Determination (R²) is 4%

– 4% of total variance explained by predicted values

– Note the difference from the squared correlation: 4% ≠ r² = 0.24² = 0.058

• CoD is -4%! No useful prediction, no variance explained
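
The distinction between the coefficient of determination and the squared correlation can be checked with a tiny worked example. The five data points are made up: the predictions are perfectly correlated with the truth but offset by a constant, so r² is 1 while the CoD is strongly negative.

```python
import numpy as np

def coefficient_of_determination(y_true, y_pred):
    """1 - SSE/SST; negative when predictions do worse than predicting the mean."""
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - sse / sst)

def squared_correlation(y_true, y_pred):
    """Squared Pearson correlation; blind to bias and scale errors in the predictions."""
    return float(np.corrcoef(y_true, y_pred)[0, 1] ** 2)

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = y_true + 3.0                       # right ordering, wrong level

cod = coefficient_of_determination(y_true, y_pred)
r2 = squared_correlation(y_true, y_pred)
print(f"CoD = {cod:.2f}, r^2 = {r2:.2f}")   # CoD = -3.50, r^2 = 1.00
```

Here SST = 10 and SSE = 5 × 3² = 45, so CoD = 1 - 45/10 = -3.5. Squared correlation rewards getting the ordering right, whereas the CoD also penalizes errors in level and scale, which is why it can fall below zero.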

HCP Mega Trawl Redux

• Try it out!

– https://db.humanconnectome.org/megatrawl

• Disappointing

– No measure predicted well

• Except boring things like year of data acquisition, gender

• Ideally would like to relate all SMs to all NetMat edges

– Not just using an entire NetMat to predict one SM

HCP CCA Trawl: Relate all of the two sets of variables

• Network Matrix (NetMat) for each subject

– 200×(200-1)/2 = 19,900 unique edges

– Partial correlation (r2z) between each pair of nodes

• “Subject Measures” (SM) for each subject

– 280 behavioral and demographic measures

• Fluid IQ, life satisfaction, stress, dexterity, smoking, alcohol/drug use, sleep quality, …

• Canonical Correlation Analysis (CCA)

– Find some linear combination of NetMat edges that best correlates with some linear combination of SMs
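
A minimal sketch of CCA in NumPy, using the QR-plus-SVD formulation. It assumes more subjects than variables in each set and full column rank, which does not hold for 19,900 raw edges, so some dimensionality reduction would be needed first in practice. The simulated data with one shared latent factor and the function name `cca` are illustrative assumptions.

```python
import numpy as np

def cca(X, Y):
    """Canonical correlations s and weight matrices A (for X) and B (for Y).

    Requires n > p, n > q and full column rank in both X (n x p) and Y (n x q).
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Qx, Rx = np.linalg.qr(X)                   # orthonormal basis for each variable set
    Qy, Ry = np.linalg.qr(Y)
    U, s, Vt = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    A = np.linalg.solve(Rx, U)                 # X @ A gives the canonical variates of X
    B = np.linalg.solve(Ry, Vt.T)              # Y @ B gives the canonical variates of Y
    return s, A, B

rng = np.random.default_rng(10)
n = 500
latent = rng.normal(size=n)                            # one shared mode of variation
X = latent[:, None] + rng.normal(size=(n, 5))          # stand-in for (reduced) NetMat features
Y = latent[:, None] + rng.normal(size=(n, 4))          # stand-in for (reduced) SMs

s, A, B = cca(X, Y)
check = float(np.corrcoef(X @ A[:, 0], Y @ B[:, 0])[0, 1])
print("canonical correlations:", np.round(s, 2))
print(f"correlation of first pair of variates: {check:.2f}")
```

In this toy setup only the first canonical correlation is large, reflecting the single shared latent factor; the remaining ones are at the level expected from noise.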

Smith, et al. (2015). A positive-negative mode of population covariation links brain connectivity, demographics and behavior. Nature Neuroscience.

Exactly 1 mode found!

Conclusions

• Population Neuroimaging

– Need large sample size to overcome intrinsic power limitations of previous studies

– Need representative samples to generalise to a population other than healthy undergraduates

• Machine Learning

– Prediction is the future

– But building accurate predictive models is harder than finding population differences

• And easier to screw up
