data science, statistics, and health · statweb.stanford.edu/~tibs/ftp/talk-columbia.pdf
TRANSCRIPT
Data Science, Statistics, and Health
with a focus on Statistical Learning and Sparsity and applications to biomedicine
Rob Tibshirani, Departments of Biomedical Data Science & Statistics
Stanford University
Columbia Data Science Summit 2020
1 / 56
Outline
1. Some general thoughts about data science and health
2. Some comments about supervised learning, sparsity, and deep learning.
3. General advice for data scientists
4. Example: Predicting platelet usage at Stanford Hospital
5. Some recent advances:
• LassoNet [A feature-sparse Neural Network]
• Lasso and elastic net for GWAS [Polygenic risk scores]
• Contrast trees and boosting [New from Jerry Friedman!]
• Generalized Random Forests [Estimating causal effects]
2 / 56
There is a lot of excitement about Data Science and health
• Artificial intelligence, predictive analytics, precision medicine are all hot areas with huge potential
• A wealth of data is now available in every area of public health and medicine, from new machines and assays, smartphones, smart watches ....
• Already, there have been good successes in data science in pathology, radiology, and other diagnostic specialties.
3 / 56
For Statisticians: 15 minutes of fame
• 2009: “I keep saying the sexy job in the next ten years will be statisticians.” Hal Varian, Chief Economist, Google
• 2012: “Data Scientist: The Sexiest Job of the 21st Century” Harvard Business Review
4 / 56
Some obstacles to this success
• Data siloing is a problem. In the US health care system researchers tend not to share data openly. Access to large, representative datasets is essential for this work.
• The bar is higher in health. There is more at stake in the health area than in, say, a recommendation system for movies. Errors are much more costly.
• The public may be less tolerant of data science/machine errors than human errors (analogous to self-driving cars).
5 / 56
High-level comments about Data Science, Statistics and Machine Learning
• Data Science and Statistics involve much more than simply running a machine learning algorithm on data
• For example:
• What is the question of interest? How can I collect data or run an experiment to address this question?
• What inferences can be drawn from the data?
• What action should I take as a result of what I’ve learned?
• Do I need to worry about bias, confounding, generalizability, concept drift ...?
• → Statistical concepts and training are essential
6 / 56
Important points for data science applications in health
• Models should be kept as simple as possible, as they are easier to understand. And we can more easily anticipate how/when they might break. See Brad Efron’s recent paper “Prediction, Estimation, and Attribution”
• Uncertainty quantification is essential. We must build trust in our models.
• Algorithmic fairness is important
• Example of a key unsolved problem: in classification with high-dimensional features, how can we arrange for our classifier to sometimes “abstain” (because it is extrapolating)?
• Estimation of heterogeneous treatment effects (HTE), e.g. for personalized medicine, is a very important area in need of research. A simple unsolved problem: how can we do internal cross-validation of a model for HTE?
7 / 56
The Supervised Learning Paradigm
Training Data → Fitting → Prediction
Traditional statistics: domain experts work for 10 years to learn good features; they bring the statistician a small clean dataset.
Today’s approach: we start with a large dataset with many features, and use a machine learning algorithm to find the good ones. A huge change.
8 / 56
This talk is about supervised learning: building models from data for predicting an outcome using a collection of input features.
Big data vary in shape. These call for different approaches.
Wide Data (thousands/millions of variables, hundreds of samples) → Lasso & Elastic Net.
We have too many variables; prone to overfitting. Lasso fits linear models to the data that are sparse in the variables. Does automatic variable selection.
Tall Data (tens/hundreds of variables, thousands/tens of thousands of samples) → Random Forests & Gradient Boosting.
Sometimes simple models (linear) don’t suffice. We have enough samples to fit nonlinear models with many interactions, and not too many variables. A Random Forest is an automatic and powerful way to do this.
9 / 56
The Lasso
The Lasso is an estimator defined by the following optimization problem:

minimize over β0, β:   (1/2) Σ_i ( y_i − β0 − Σ_j x_ij β_j )²   subject to   Σ_j |β_j| ≤ s
• Penalty =⇒ sparsity (feature selection)
• Convex problem (good for computation and theory)
• Our lab has written an open-source R language package called glmnet for fitting lasso models (Friedman, Hastie, Simon, Tibs). Available on CRAN.
• glmnet v3.0 (just out) now features the relaxed lasso and other goodies (like a progress bar!)
Almost 2 million downloads
10 / 56
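The sparse path behavior described above can be sketched in a few lines. glmnet itself is an R package; this illustration instead uses scikit-learn's `lasso_path` (an analogous tool, not the talk's software) on synthetic data:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
# only the first 3 features carry signal
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# solutions over a decreasing grid of penalty values (lambda, called "alpha" here)
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)

# larger penalty => more coefficients exactly zero (automatic variable selection)
sparse_end = np.count_nonzero(coefs[:, 0])    # at the largest penalty
dense_end = np.count_nonzero(coefs[:, -1])    # at the smallest penalty
print(sparse_end, dense_end)
```

The constraint form on the slide (Σ|β_j| ≤ s) and the penalized form solved here are equivalent, with a one-to-one correspondence between s and the penalty value.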
The Elephant in the Room: DEEP LEARNING
Will it eat the lasso and other statistical models?
11 / 56
Deep Nets/Deep Learning
[Figure: neural network with input layer L1 (x1, x2, x3, x4), a single hidden layer L2, and output layer L3 producing f(x)]
Neural network diagram with a single hidden layer. The hidden layer derives transformations of the inputs (nonlinear transformations of linear combinations) which are then used to model the output.
12 / 56
What has changed since the invention of Neural nets?
• Much bigger training sets and faster computation (especially GPUs)
• Many clever ideas: convolution filters, stochastic gradient descent, input distortion ...
• Use of multiple layers (the “Deep” part): Tommy Poggio says that the universal approximation theorems for single-layer NNs in the 1980s set the field back 30 years
• Confession: I was Geoff Hinton’s colleague at Univ. of Toronto (1985–1998) and didn’t appreciate the potential of Neural Networks!
13 / 56
What makes Deep Nets so powerful
(and challenging to analyze!)
It’s not one “mathematical model” but a customizable framework: a set of engineering tools that can exploit the special aspects of the problem (weight-sharing, convolution, feedback, recurrence ...)
14 / 56
Will Deep Nets eat the lasso and other statistical models?
Deep Nets are especially powerful when the features have some spatial or temporal organization (signals, images), and SNR is high
But they are not a good approach when
• we have moderate #obs or wide data (#obs < #features),
• SNR is low, or
• interpretability is important
• It’s difficult to find examples where Deep Nets beat lasso or GBM in low-SNR settings, with “generic” features
15 / 56
General advice for Data Scientists
Details matter!
Q: Have you tried method X?
A: I tried that last year and it didn’t work very well!

Q: What do you mean? What exactly did you try? How did you measure performance?
A: I can’t remember.
16 / 56
General tips
• Try a number of methods and use cross-validation to tune and compare models (our “lab” is the computer: experiments are quick and free)
• Be systematic when you run methods and make comparisons: compare methods on the same training and test sets.
• Keep carefully documented scripts and data archives(where practical).
• Your work should be reproducible by you and others,even a few years from now!
17 / 56
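The “same training and test sets” advice above can be made concrete: fix one set of cross-validation folds and score every candidate method on it. A minimal sketch with stand-in data and models (not from the talk):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_score

# synthetic regression problem with a few informative features
X, y = make_regression(n_samples=150, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

# one fixed fold assignment, reused for every method: an apples-to-apples comparison
folds = KFold(n_splits=5, shuffle=True, random_state=0)

scores = {}
for name, model in [("lasso", LassoCV(cv=3)),
                    ("random forest", RandomForestRegressor(n_estimators=50,
                                                            random_state=0))]:
    scores[name] = cross_val_score(model, X, y, cv=folds, scoring="r2").mean()

for name, r2 in sorted(scores.items()):
    print(f"{name}: mean CV R^2 = {r2:.3f}")
```

Keeping the fold object fixed is the whole point: differences in the scores then reflect the methods, not the splits.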
In Praise of Simplicity
“Simplicity is the ultimate sophistication” — Leonardo da Vinci
• Many times I have been asked to review a data analysis by a biology postdoc or a company employee. Almost every time, they are unnecessarily complicated. Multiple steps, each one poorly justified.
• Why? I think we all like to justify, internally and externally, our advanced degrees. And then there’s the “hit everything with deep learning” problem.
• Suggestion: Always try simple methods first. Move on to more complex methods only if necessary.
18 / 56
Outline again
1. Some general thoughts about data science and health
2. Some comments about supervised learning, sparsity, and deep learning
3. General advice for data scientists
4. → Example: Predicting platelet usage at Stanford Hospital
5. Some recent advances:
• LassoNet [A feature-sparse Neural Network]
• Lasso and elastic net for GWAS [Polygenic risk scores]
• Contrast trees and boosting [New from Jerry Friedman!]
• Generalized Random Forests [Estimating causal effects]
19 / 56
How many units of platelets will the Stanford Hospital need tomorrow?
20 / 56
Allison Zemek · Tho Pham · Saurabh Gombar
Leying Guan · Xiaoying Tian · Balasubramanian Narasimhan
21 / 56
22 / 56
Background
• Each day Stanford hospital orders some number of units (bags) of platelets from Stanford blood center, based on the estimated need (roughly 45 units)
• The daily needs are estimated “manually”
• Platelets have just 5 days of shelf-life; they are safety-tested for 2 days. Hence they are usable for just 3 days.
• Currently about 1400 units (bags) are wasted each year. That’s about 8% of the total number ordered.
• There’s rarely any shortage (shortage is bad but not catastrophic)
• Can we do better?
23 / 56
Data overview
[Figure: two time series over days 0–600: three days’ total consumption (roughly 60–160 units) and daily consumption (roughly 20–60 units)]
24 / 56
Data description
Daily platelet use from 2/8/2013 - 2/8/2015.
• Response: number of platelet transfusions on a given day.
• Covariates:
1. Complete blood count (CBC) data: platelet count, white blood cell count, red blood cell count, hemoglobin concentration, number of lymphocytes, . . .
2. Census data: location of the patient, admission date, discharge date, . . .
3. Surgery schedule data: scheduled surgery date, type of surgical services, . . .
4. . . .
25 / 56
Notation
y_i : actual PLT usage on day i.
x_i : amount of new PLT that arrives on day i.
r_i(k) : remaining PLT which can be used in the following k days, k = 1, 2.
w_i : PLT wasted on day i.
s_i : PLT shortage on day i.

• Overall objective: waste as little as possible, with little or no shortage
26 / 56
Our first approach
27 / 56
Our first approach
• Build a supervised learning model (via lasso) to predict use y_i for the next three days (other methods like random forests or gradient boosting didn’t give better accuracy).
• Use the estimates ŷ_i to estimate how many units x_i to order. Add a buffer to predictions to ensure there is no shortage. Do this in a “rolling” manner.
• Worked quite well, reducing waste to 2.8%, but the loss function here is not ideal
28 / 56
More direct approach
This approach minimizes the waste directly:
J(β) = Σ_{i=1}^{n} w_i + λ ||β||_1   (1)

where

three days’ total need:   t_i = z_i^T β,  ∀ i = 1, 2, ..., n   (2)
number to order:   x_{i+3} = t_i − r_i(1) − r_i(2) − x_{i+1} − x_{i+2}   (3)
waste:   w_i = [r_{i−1}(1) − y_i]_+   (4)
actual remaining:   r_i(1) = [r_{i−1}(2) + r_{i−1}(1) − y_i − w_i]_+   (5)
                    r_i(2) = [x_i − [y_i + w_i − r_{i−1}(2) − r_{i−1}(1)]_+]_+   (6)
constraint (fresh bags remaining):   r_i(2) ≥ c_0   (7)
This can be shown to be a convex problem (LP).
29 / 56
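The waste and inventory recursions in equations (4)–(6) are just day-by-day bookkeeping, and can be stepped forward directly. A toy simulation with made-up usage and arrival numbers (not the Stanford data):

```python
def step(y, x, r1_prev, r2_prev):
    """One day of platelet bookkeeping, following eqs. (4)-(6):
    y = units used today, x = fresh units arriving today,
    r1_prev / r2_prev = yesterday's stock with 1 / 2 days of life left."""
    pos = lambda v: max(v, 0)
    w = pos(r1_prev - y)                          # (4) day-old stock that expires unused
    r1 = pos(r2_prev + r1_prev - y - w)           # (5) carried stock, one day of life left
    r2 = pos(x - pos(y + w - r2_prev - r1_prev))  # (6) fresh stock, two days of life left
    return w, r1, r2

r1, r2 = 0, 0
total_waste = 0
for y, x in [(40, 60), (20, 50), (30, 40), (10, 40)]:  # (usage, arrivals) per day
    w, r1, r2 = step(y, x, r1, r2)
    total_waste += w
print(total_waste)  # → 10
```

Waste appears only on the last toy day, when day-old stock (20 units) exceeds that day's usage (10 units); minimizing the sum of these w_i is exactly what objective (1) does.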
Results
Chose sensible features: previous platelet use, day of week, # patients in key wards.
Over 2 years of backtesting: no shortage, reduces waste from 1400 bags/year (8%) to just 339 bags/year (1.9%)
Corresponds to a predicted direct savings at Stanford of $350,000/year. If implemented nationally, could result in approximately $110 million in savings.
30 / 56
Moving forward
• System has just been deployed at the Stanford Blood Center (R Shiny app). But not yet in production
• We are distributing the software around the world, for other centers to train and deploy
• see Platelet inventory R package: https://bnaras.github.io/pip/
• Just this week we learned that the predictions have been way off for the past month! Data problem? Model problem? Not sure yet.
• A remaining challenge: How can we detect if/when the system is no longer working well?
31 / 56
Outline once again
1. Some general thoughts about data science and health
2. Some comments about supervised learning, sparsity, anddeep learning
3. General advice for data scientists
4. Example: Predicting platelet usage at Stanford Hospital
5. → Some recent advances:
• LassoNet [A feature-sparse Neural Network]
• Lasso and elastic net for GWAS [Polygenic risk scores]
• Contrast trees and boosting [New from Jerry Friedman!]
• Generalized Random Forests [Estimating causal effects]
32 / 56
A feature-sparse Neural net
“LassoNet”
Joint work with my Statistics student Ismael Lemhadri and former Stanford Statistics student Feng Ruan
33 / 56
Motivation
• Two methods for Supervised learning:
1. The Lasso fits linear models with sparsity (feature selection). Interpretable but may not predict well (misses non-linearities and interactions)
2. Neural networks fit flexible nonlinear models, but are hard to interpret. Typically all features are used. (Some methods have been proposed for doing “post-mortem” analyses, to try to deduce feature importance.)
Can we marry the two and get the best of both worlds?
34 / 56
The Lasso
For i = 1, 2, . . . , N, let y_i be the output for observation i, and (x_i1, . . . , x_ip) be the features (inputs) from which we will predict y_i.

The Lasso is an estimator defined by the following optimization problem:

minimize over β0, β:   (1/2) Σ_i ( y_i − β0 − Σ_j x_ij β_j )² + λ · Σ_j |β_j|

• The parameters β_j are feature weights that are chosen to minimize the objective function: the penalty delivers sparsity (feature selection).
• The value λ is a user-defined tuning parameter: larger λ =⇒ more sparse. We solve this problem for an entire path of λ values.
• This is a convex problem (good for computation and theory)
35 / 56
LassoNet
We assume the model
y_i = β0 + Σ_j x_ij β_j + Σ_{k=1}^{K} [ α_k + γ_k · f(θ_k^T x_i) ] + ε_i   (9)

with ε ∼ (0, σ²). Here f is a monotone, nonlinear function such as a sigmoid or rectified linear unit, and each θ_k = (θ_1k, . . . , θ_pk) is a p-vector. Our objective is to minimize

J(β, Θ, α, γ) = (1/2) Σ_i ( y_i − ŷ_i )² + λ Σ_j |β_j| + λ Σ_{j,k} |θ_jk|

subject to |θ_jk| ≤ |β_j| ∀ j, k.   ← secret sauce
36 / 56
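The “secret sauce” constraint can be illustrated with a naive clipping step: force each hidden-layer weight |θ_jk| to be no larger than the feature's main effect |β_j|, so a feature with a zero main effect is shut out of the hidden layer entirely. (This is only an illustration; actual LassoNet training uses a dedicated hierarchical proximal operator, not shown here.)

```python
import numpy as np

def enforce_hierarchy(beta, theta):
    """Clip each |theta[j, k]| to at most |beta[j]|: a feature's hidden-layer
    weights can be nonzero only if its main effect is nonzero."""
    bound = np.abs(beta)[:, None]        # shape (p, 1), broadcast over hidden units k
    return np.clip(theta, -bound, bound)

beta = np.array([2.0, 0.0, 0.5])         # feature 2 has zero main effect
theta = np.array([[1.5, -3.0],
                  [0.7,  0.2],
                  [-1.0, 0.4]])
theta_c = enforce_hierarchy(beta, theta)
print(theta_c)  # row for feature 2 is forced to zero; other entries are clipped
```

After clipping, the second feature drops out of the network entirely, which is the feature-sparsity the slide describes.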
In a nutshell
“Features can participate in the hidden layer only if they have non-zero main effects”
• A neural network is a complex model, but is made simpler if the model is a function of only a subset of the features
• E.g. a protein-based blood test for a disease: if only 10 proteins out of 1000 need to be measured, then the complicated neural net mapping of these 10 proteins is still acceptable to scientists
37 / 56
Computational strategy
• We seek the solution over a path of λ values, using current solutions as warm starts. We use a projected proximal gradient procedure at each λ
• Unlike the lasso, the objective is non-convex. Makes optimization much harder. E.g. the starting values for (β, θ) matter!
• In the lasso (our glmnet package), we compute a path of solutions from sparse (λ large) to dense (λ small)
• This worked badly for LassoNet; we tried many tricks; finally Feng suggested that we go from dense to sparse. Bingo!
38 / 56
Toy Example
[Figure: test error (10–40) vs. number of nonzero predictors (0–20), for lasso, NN, and LassoNet]
Boston housing data with 13 noise predictors added. The test error shown for a (single hidden layer) neural net is the minimum achieved over 200 epochs.
39 / 56
Example: Response to HIV drugs
Predictors are ∼ 500 mutations (binary). From Dr. Bob Shafer.
[Figure: test error (0.02–0.12) vs. number of predictors (0–200), for lasso, NN, and LassoNet]
40 / 56
[Figure panels: “A typical 1”, “A typical 2”; misclassification error vs. number of nonzero betas (0–800) for lasso, NN, and LassoNet; “NN weights”, “LassoNet betas”, “LassoNet thetas”]
MNIST example, discriminating 1s from 2s. Shown are typical digits, the test error from lasso, NN and LassoNet, and estimated weights from the latter two methods.
41 / 56
Lasso and elastic net for GWAS
Estimation of Polygenic risk scores
• with current software (e.g. glmnet), lasso and elastic net cannot be applied to data of size say 500K (patients) by 800K (SNPs)
• we have developed a new approach using the idea of strong screening rules (Tibs et al, JRSSB 2012) that successfully carries out this computation in hours (exact, within machine precision)
• Joint work with PhD students Junyang Qian and Yosuke Tanigawa, Trevor Hastie, and the Manny Rivas group at Stanford DBDS
• “A Fast and Flexible Algorithm for Solving the Lasso in Large-scale and Ultrahigh-dimensional Problems”, Qian et al, bioRxiv 2019
42 / 56
Strong Rules
For the lasso fit, with active set A:

|⟨x_j, y − Xβ̂(λ)⟩| = λ  ∀ j ∈ A
|⟨x_j, y − Xβ̂(λ)⟩| ≤ λ  ∀ j ∉ A

So variables nearly in A will have inner products with the residuals nearly equal to λ.

Suppose the fit at λ_ℓ is Xβ̂(λ_ℓ), and we want to compute the fit at λ_{ℓ+1} < λ_ℓ. Strong rules gamble on the set

S_{ℓ+1} = { j : |⟨x_j, y − Xβ̂(λ_ℓ)⟩| > λ_{ℓ+1} − (λ_ℓ − λ_{ℓ+1}) }

glmnet screens at every λ step and, after convergence, checks for any violations. Mostly A_{ℓ+1} ⊆ S_{ℓ+1}.
∗ Tibshirani, Bien, Friedman, Hastie, Simon, Taylor, Tibshirani (JRSSB 2012)
43 / 56
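The sequential strong rule above amounts to one inner-product threshold per feature. A numeric sketch on random data (an illustration of the screening set, not the glmnet implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 200
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + rng.standard_normal(n)   # only feature 0 carries signal

# at the start of the path beta_hat = 0, so the residual is just y
residual = y
grads = np.abs(X.T @ residual)

lam_l = grads.max()       # lambda_max: the value at which all coefficients are zero
lam_next = 0.9 * lam_l    # the next, smaller lambda on the path

# sequential strong rule: keep j with |<x_j, r>| > lam_next - (lam_l - lam_next)
strong_set = np.where(grads > 2 * lam_next - lam_l)[0]
print(len(strong_set), "of", p, "features survive screening")
```

Only features whose inner product with the residual is already near λ survive, which is what makes the subsequent lasso fit so much cheaper on wide data.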
[Figure: number of predictors after filtering (0–5000) vs. number of predictors in the model (0–250), for global STRONG, sequential DPP (exact), and sequential STRONG; top axis shows percent variance explained]
Strong rules inspired by El Ghaoui, Viallon and Rabbani (2010)
Sequential DPP due to Wang, Lin, Gong, Wonka and Ye (2013)
44 / 56
Batch Screening Iterative Lasso
(BASIL)
1. Initialize active variables A_0 = ∅, λ = λ_max
2. For k = 1, 2, . . .
• Screening: at the current λ, form the strong set S_k = A_k ∪ E_k, where A_k is the set of M variables with the largest inner products with the residual and E_k are the ever-active variables
• Fitting: solve the lasso using just the strong set of predictors S_k
• Checking: find the smallest λ_{k+1} so that the KKT conditions are satisfied (i.e. no predictors have been omitted that should be present)

Typically M = 1000; Screening and Checking require the full dataset, and are done efficiently using memory-mapped I/O (bigmemory R package)
45 / 56
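One BASIL-style iteration (screen by inner product, fit on the strong set, check the KKT conditions) can be sketched on toy data. This illustration uses scikit-learn's `Lasso` as the inner solver and small in-memory arrays, unlike the memory-mapped full-scale implementation:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, M = 200, 1000, 50
X = rng.standard_normal((n, p))
y = X[:, :5] @ np.array([3.0, -3.0, 2.5, 2.5, 2.0]) + rng.standard_normal(n)

lam = 0.5

# Screening: the M variables with the largest inner products with the residual
# (here, the residual at the start of the path)
residual = y - y.mean()
grads = np.abs(X.T @ residual) / n
strong = np.argsort(grads)[-M:]

# Fitting: solve the lasso using just the strong set of predictors
fit = Lasso(alpha=lam).fit(X[:, strong], y)
beta = np.zeros(p)
beta[strong] = fit.coef_

# Checking: KKT -- every excluded variable j must satisfy |x_j' r| / n <= lambda;
# any violator was wrongly screened out and must be added back before moving on
r = y - fit.predict(X[:, strong])
excluded = np.setdiff1d(np.arange(p), strong)
violations = excluded[np.abs(X[:, excluded].T @ r) / n > lam + 1e-6]
print(len(violations), "KKT violations among excluded variables")
```

When the check passes, the fit on the 50 screened columns is a solution of the full 1000-column problem, which is the point of the algorithm.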
Algorithmic details in pictures
[Figure: lasso coefficient profiles (coefficients vs. lambda index) for BASIL iterations 1–4; legend: completed fit, new valid fit, new invalid fit]
Figure 1: The lasso coefficient profile that shows the progression of the BASIL algorithm. The previously finished part of the path is colored grey, the newly completed and verified part is in green, and the part that is newly computed but failed the verification is colored red.
At each iteration, we roll out the solution path progressively, which is illustrated in Figure 1. In addition, we propose optimizations specific to the SNP data in the UK Biobank studies to speed up the procedure.
The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license. doi: http://dx.doi.org/10.1101/630079; bioRxiv preprint first posted online May 7, 2019.
46 / 56
GWAS example
• Large cohort of 500K British adults (Bycroft et al 2018)
• Each individual genotyped at 805K locations (AA, Aa, aa or NA)
• 100s of phenotypes measured on each subject
• We looked at the white British subset of 337K, and illustrate with the height phenotype
• Divided the data 60% training, 20% validation, 20% test.

Package snpnet available on Github (link on Hastie’s website, and to 2019 report)
47 / 56
Lasso fit to height
[Figure: R² (0.50–0.85) vs. lambda index (0–80) for Lasso (train), Lasso (val), ReLasso (train), and ReLasso (val); top axis: number of active variables, from 15 up to 47,673]
Figure 3: R2 plot for height. The top axis shows the number of active variables in the model.
[Figure: left, actual vs. predicted height (cm) scatter on the test set; right, histogram of the residuals (cm)]
Figure 4: Left: actual height versus predicted height on 5000 random samples from the test set. The correlation between actual height and predicted height is 0.9416. Right: histogram of the lasso residuals for height. Standard deviation of the residual is 5.05 (cm).
The variance introduced by refitting such large models may be larger than the reduction in bias. The large models confirm the extreme polygenicity of standing height.
bioRxiv preprint first posted online May 7, 2019; doi: http://dx.doi.org/10.1101/630079. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
R-squared or heritability shown for regularization paths
48 / 56
Detailed results
smallest univariate p-values, and a multivariate linear or logistic regression is fit on those variants jointly. The sequence of models is evaluated on the validation set, and the one with the smallest validation error is chosen. We call this method Sequential LR for convenience in the results below (Model 6).
4.4 Results
We present results of the lasso and related methods for quantitative traits, including standing height and BMI, and for qualitative traits, including asthma and high cholesterol. A comparison of the univariate p-values and the lasso coefficients for all these traits is shown in the form of Manhattan plots in Appendix A (Supplementary Figures 13, 14).
4.4.1 Quantitative Traits
Standing Height. Height is a polygenic and heritable trait that has been studied for a long time. It has been used as a model for other quantitative traits, since it is easy to measure reliably. From twin and sibling studies, the narrow-sense heritability is estimated to be 70-80% (Silventoinen et al., 2003; Visscher et al., 2006, 2010). Recent estimates controlling for shared environmental factors present in twin studies put the heritability at 0.69 (Zaitlen et al., 2013; Hemani et al., 2013). A linear model with common SNPs explains 45% of the variance (Yang et al., 2010), and a model including imputed variants explains 56% of the variance, almost matching the estimated heritability (Yang et al., 2015). So far, GWAS have discovered 697 associated variants that explain one fifth of the heritability (Lango Allen et al., 2010; Wood et al., 2014). Recently, a large-sample study was able to identify more variants with low frequencies that are associated with height (Marouli et al., 2017). Using the lasso with the larger UK Biobank dataset allows both a better estimate of the proportion of variance that can be explained by genomic predictors and simultaneous selection of SNPs that may be associated. We obtain R² values that are close to the estimated heritability. The results are summarized in Table 3. The associated R² curves for the lasso and the relaxed lasso are shown in Figure 3. The residuals of the optimal lasso prediction are plotted in Figure 4.
Model   Form                  R²train   R²val    R²test   Size
(1)     Age + Sex             0.5300    0.5260   0.5288   2
(2)     Age + Sex + 10 PCs    0.5344    0.5304   0.5336   12
(3)     Strong Single SNP     0.5364    0.5323   0.5355   13
(4)     10K Combined          0.5482    0.5408   0.5444   10,012
(5)     100K Combined         0.5833    0.5515   0.5551   100,012
(6)     Sequential LR         0.7416    0.6596   0.6601   17,012
(7)     Lasso                 0.8304    0.6992   0.6999   47,673
(8)     Relaxed Lasso         0.7789    0.6718   0.6727   13,395

Table 3: R² values for height. For Sequential LR, the lasso, and the relaxed lasso, the chosen model is based on maximum R² on the validation set. Models (3) to (8) each include Model (2) plus their own specification as stated in the Form column.
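The lasso versus relaxed-lasso contrast in the table can be illustrated with a minimal numpy sketch: fit the lasso (here by plain proximal gradient descent, standing in for the paper's snpnet/glmnet machinery), then refit unpenalized least squares on the selected support. The simulated data and the penalty value `lam` are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sparse regression problem standing in for the genotype data.
n, p = 200, 100
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 2.0
y = X @ beta + rng.standard_normal(n)

def lasso_ista(X, y, lam, n_iter=2000):
    """Lasso via proximal gradient (ISTA): minimize (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n = len(y)
    step = n / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b - step * (X.T @ (X @ b - y) / n)
        b = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # soft-threshold
    return b

b_lasso = lasso_ista(X, y, lam=0.2)
support = np.flatnonzero(b_lasso)

# "Relaxed" step: unpenalized least-squares refit on the selected support,
# which removes the lasso's shrinkage bias on those coefficients.
coef_ols, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
b_relaxed = np.zeros(p)
b_relaxed[support] = coef_ols

print("support size:", len(support))
```

In the paper this is carried out along a whole regularization path with screening rules, and the path point is chosen by validation R²; the single `lam` here keeps the sketch short.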
A large number (47,673) of SNPs need to be selected in order to achieve the optimal R²test = 0.6992 for the lasso. By comparison, the relaxed lasso sacrifices some predictive performance by including a much smaller subset of variables (13,395). Past the optimal point, the additional
49 / 56
Contrast trees and boosting
NEW! Jerome Friedman, arXiv (2019). Especially useful when N >> p.
• Start with data (x_i, y_i, z_i), i = 1, 2, . . . , N.
• x, y are the usual features and outcome; z = f(x) is an auxiliary quantity. Examples: z_i = ŷ_i from some fitted model, or z_i equal to some quantile of the distribution of y_i | x_i.
• A contrast tree is defined as follows. We define a tree split via a discrepancy measure over a node R_m, such as

    d_m = (1/N_m) Σ_{x_i ∈ R_m} |y_i − z_i|
50 / 56
• The split criterion is

    (f_m^l · f_m^r) · max(d_m^l, d_m^r)²

  where f_m^l, f_m^r are the fractions of observations in the "parent" region R_m associated with each of the two daughters.
• Notice the max instead of the usual sum!
• This criterion is used to build a contrast tree, and then forms the basis for contrast boosting.
51 / 56
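The node discrepancy and split criterion above can be sketched directly in numpy. The data-generating setup and function names below are illustrative assumptions (this is not Friedman's implementation); the fitted values z are deliberately biased on one side so the best contrast split lands at the boundary of the poorly fit region:

```python
import numpy as np

def discrepancy(y, z):
    """Node discrepancy d_m: mean absolute difference between outcome and contrast."""
    return np.mean(np.abs(y - z))

def split_criterion(y, z, x, threshold):
    """Contrast split score for the split `x <= threshold` vs `x > threshold`:
    (f_l * f_r) * max(d_l, d_r)^2, with f_l, f_r the daughter fractions."""
    left = x <= threshold
    f_l = left.mean()
    f_r = 1.0 - f_l
    if f_l == 0.0 or f_r == 0.0:
        return -np.inf   # degenerate split: one empty daughter
    d_l = discrepancy(y[left], z[left])
    d_r = discrepancy(y[~left], z[~left])
    return (f_l * f_r) * max(d_l, d_r) ** 2

# Toy illustration: z matches y well for x < 0 but is biased high for x >= 0,
# so the highest-scoring contrast split should sit near x = 0.
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 2000)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(2000)
z = np.sin(3 * x) + np.where(x >= 0, 0.5, 0.0)   # systematic error on the right

thresholds = np.linspace(-0.9, 0.9, 37)
best = max(thresholds, key=lambda t: split_criterion(y, z, x, t))
print("best split at x ≈", round(float(best), 2))
```

Using max(d_l, d_r) rather than the sum rewards splits that isolate one daughter where the model contrast is large, which is exactly the region-of-poor-fit behavior contrast trees are designed to expose.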
Example
• Predicting income > 50,000
• 14 predictor variables consisting of various demographic and financial properties associated with each person. N = 50,000. The overall test error rate from gradient boosting is 13%.
• Choose zi = fitted value from gradient boosting
52 / 56
53 / 56
Generalized Random Forests
Estimates the treatment effect: the predicted difference in outcome under two different actions (treatments).
See the grf (generalized random forests) R package on CRAN/GitHub
(Susan Athey, Julie Tibshirani and Stefan Wager)
54 / 56
Other features of Generalized Random Forests
• non-parametric least-squares regression, quantile regression
• confidence intervals for heterogeneous treatment effects
• customizable split criteria
55 / 56
For further reading
Many of the methods used are described in detail in our books on Statistical Learning (the last one by Efron & Hastie):
[Book covers shown:]
• An Introduction to Statistical Learning, with Applications in R, by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (Springer Texts in Statistics)
• The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (Springer Series in Statistics)
All available online for free
Packages on CRAN/github: SNPnet, lassoNet, grf
56 / 56