Chernozhukov et al. on Double / Debiased Machine Learning

Slides by Chris Felton, Princeton University, Sociology Department
Sociology Statistics Reading Group, November 2018
Papers

Background:

Robinson, Peter. 1988. “Root-N-Consistent Semiparametric Regression,” Econometrica 56(4):931-954.

Main paper:

Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double / Debiased Machine Learning for Treatment and Structural Parameters,” Econometrics Journal 21(1):1-68.
Why This Paper?

Provides a general framework for estimating treatment effects using machine learning (ML) methods

In particular, we can use any (preferably n^(1/4)-consistent) ML estimator with this approach

Enables us to construct valid confidence intervals for our treatment effect estimates

Introduces a √n-consistent estimator

As n → ∞, the estimation error θ̂ − θ0 goes to zero at a rate of n^(−1/2) (or 1/√n)

We really like our estimators to be at least √n-consistent

1/n^(1/2) will approach 0 more quickly than, e.g., 1/n^(1/4) as n grows
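To make the rate comparison concrete, here is a tiny numeric sketch (purely illustrative):

```python
# Illustrative only: n^(-1/2) shrinks much faster than n^(-1/4) as n grows.
for n in [100, 10_000, 1_000_000]:
    print(f"n = {n:>9,}:  n^(-1/2) = {n ** -0.5:.4f}   n^(-1/4) = {n ** -0.25:.4f}")
```

At n = 1,000,000 the √n rate gives an error on the order of 0.001, while the n^(1/4) rate is still around 0.03.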
Why Use ML for Causal Inference, Anyway?

In observational studies, we often estimate causal effects by conditioning on confounders

We typically condition on confounders by making strong assumptions about the functional form of our model

E.g., a standard OLS model assumes a linear and additive conditional expectation function

If we misspecify the functional form, we will end up with biased estimates of treatment effects even in the absence of unmeasured confounding

Our parametric specifications often lack strong substantive justification

ML provides a systematic framework for learning the form of the conditional expectation function from the data
Why Use ML for Causal Inference, Anyway?

We sometimes find ourselves working with high-dimensional data

I.e., we have many covariates p relative to the number of observations n

Two types of high-dimensionality:

1. We simply have many measured covariates (e.g., text data, genetic data)
2. We have few measured covariates but wish to generate many non-linear transformations of and interactions between these covariates (see the sketch after this slide)

ML models perform much better in high dimensions than traditional statistical models do
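As a sketch of the second type, a handful of covariates quickly blows up into a high-dimensional design matrix once we add powers and interactions (assumes scikit-learn; the specific numbers are illustrative):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))   # only p = 10 measured covariates, n = 500

# Degree-3 expansion: all squares, cubes, and pairwise/three-way interactions
expanded = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X)
print(expanded.shape)            # (500, 285): p is now large relative to n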
Why Use ML for Causal Inference, Anyway?

1. ML allows us to do causal inference with minimal assumptions about the functional form of our model

Warning: ML does not help us relax identification assumptions (e.g., no unmeasured confounding, parallel trends, exclusion restriction, etc.)

2. ML allows us to do causal inference with high-dimensional data
Why Using ML for Causal Inference is Tricky

ML methods were designed for prediction

But off-the-shelf ML methods are biased estimators for treatment effects

To minimize MSE = bias² + variance, we trade off variance for bias

Consistent ML methods converge more slowly than 1/√n

Off-the-shelf methods also fail to provide confidence intervals for our treatment effect estimates
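A toy Monte Carlo (illustrative, with made-up numbers) of the MSE = bias² + variance trade-off: shrinking an unbiased estimator toward zero adds bias but cuts variance, and can lower MSE overall:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 2.0                                        # true parameter
draws = rng.normal(mu, 5.0, size=(20_000, 25))  # 20,000 samples of size 25

for shrink in [1.0, 0.8, 0.5]:                  # shrink = 1.0 is the unbiased sample mean
    est = shrink * draws.mean(axis=1)
    bias2 = (est.mean() - mu) ** 2
    var = est.var()
    print(f"shrink={shrink}:  bias^2={bias2:.3f}  var={var:.3f}  MSE={bias2 + var:.3f}")
```

Here shrink = 0.8 beats the unbiased estimator on MSE. This is the trade that regularized ML methods make, and it is why they are biased for treatment effects.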
Key Aims of Double / Debiased Machine Learning (DML)

1. Eliminate the bias
2. Achieve √n-consistency
3. Construct valid confidence intervals
Outline of Presentation

1. Introduce the partially linear model set-up
2. Explain two sources of estimation bias from ML and how we overcome them
   - Correct bias from regularization with Neyman orthogonality
   - We will see that we can achieve Neyman orthogonality using a residuals-on-residuals approach reminiscent of Robinson (1988) and the Frisch–Waugh–Lovell theorem
   - Correct bias from overfitting using sample-splitting
   - Employ cross-fitting to avoid the loss of efficiency that normally comes with sample-splitting
3. Outline a procedure for conducting inference with DML
4. Examine estimators for the ATE and variance that go beyond the partially linear model set-up
Don’t worry if none of that makes sense yet! It’s not as complicated as it sounds, and we will work through it slowly.
The Partially Linear Model Set-up

Y = Dθ0 + g0(X) + U
D = m0(X) + V

Y: Outcome
D: Treatment
X: Measured confounders
U and V are our error terms

We assume zero conditional mean:

E[U | X, D] = 0    E[V | X] = 0
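A minimal simulation of this data-generating process (the particular θ0, g0, and m0 below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000
theta0 = 0.5                                      # true treatment effect

X = rng.normal(size=(n, 3))                       # measured confounders
g0 = np.cos(X[:, 0]) * X[:, 1] + X[:, 2] ** 2     # non-linear, interactive g0(X)
m0 = np.sin(X[:, 0]) + 0.5 * X[:, 1] * X[:, 2]    # non-linear m0(X)

V = rng.normal(size=n)                            # E[V | X] = 0
U = rng.normal(size=n)                            # E[U | X, D] = 0
D = m0 + V                                        # D = m0(X) + V
Y = D * theta0 + g0 + U                           # Y = D*theta0 + g0(X) + U
```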
The Partially Linear Model Set-up

Y = Dθ0 + g0(X) + U
D = m0(X) + V

θ0: The true treatment effect (“theta-naught”)

Warning: not necessarily the Average Treatment Effect. Regression gives us a weighted average of individual treatment effects, where weights are determined by the conditional variance of treatment (see Aronow and Samii 2016)

g0(·): some function mapping X to Y, conditional on D

m0(·): some function mapping X to D
The Partially Linear Model Set-up

Y = Dθ0 + g0(X) + U
D = m0(X) + V

This set-up allows both Y and D to be non-linear and interactive functions of X, in contrast to standard OLS, which assumes a linear and additive model

However, note that this partially linear model assumes that the effect of D on Y is additive and linear

Our confounders can interact with one another, but not with our treatment!

And remember, we’re still making the standard identification assumptions (unconfoundedness conditional on X, positivity, and consistency)
The Partially Linear Model Set-up

Y = Dθ0 + g0(X) + U
D = m0(X) + V

If ML is useful because it allows us to relax linearity and additivity, why would we assume linearity and additivity in D?

We’re just using this model for illustration!

We can assume a fully interactive and non-linear model when we actually use DML

But our partially linear model set-up will allow us to better explain how DML works
Where We Are and Where We’re Going

We’ve introduced the partially linear model set-up

Next, we will introduce an intuitive procedure, “the naive approach,” for estimating θ0 with ML, assuming a partially linear model

We will show that this estimation procedure is biased and not √n-consistent

Then we’re going to illustrate two sources of this bias and show how DML avoids these two types of bias
Causal Inference with ML: The Naive Approach

The naive approach: estimate Y = Dθ0 + g0(X) + U using ML

θ̂0: our naive estimate of θ0, “theta-naught-hat”

How might we estimate θ0 and g0(X)?

Remember, m0(X) = E[D | X], but g0(X) ≠ E[Y | X]
   - That’s because Dθ0 is also included in this model
   - In the paper, the authors use ℓ0(X) for E[Y | X]

To estimate both θ0 and g0(X), we could use an iterative method that alternates between using random forest for estimating g0(X) and OLS for estimating θ0 (sketched below)

Alternatively, we could generate many non-linear transformations of the covariates in X, as well as interactions between these covariates, and use LASSO to estimate the model
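A rough sketch of the iterative random-forest-plus-OLS idea (our own illustration, not the paper's code), reusing X, D, and Y from the simulation sketch above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Alternate between (1) fitting g0 by random forest on the partial residual
# Y - D*theta_hat, and (2) updating theta_hat by OLS of Y - g_hat(X) on D.
theta_hat = 0.0
for _ in range(10):
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    g_hat = rf.fit(X, Y - D * theta_hat).predict(X)      # step (1)
    theta_hat = np.dot(D, Y - g_hat) / np.dot(D, D)      # step (2): no-intercept OLS slope
print(theta_hat)   # typically noticeably off from theta0 = 0.5
```

Note the in-sample prediction in step (1): the forest is fit and evaluated on the same observations, which foreshadows the overfitting problem discussed next.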
Two Sources of Bias in Our Naive Estimator
1 Bias from regularization
To avoid overfitting the data with complex functional forms,ML algorithms use regularizationThis decreases the variance of the estimator and reducesoverfitting......but introduces bias and prevents
√n-consistency
2 Bias from overfitting
Sometimes our efforts to regularize fail to prevent overfittingOverfitting: mistaking noise for signalMore carefully, we overfit when we model the idiosyncrasies ofour particular sample too closely, which may lead to poorout-of-sample performanceOverfitting → bias and slow convergence
![Page 97: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/97.jpg)
Two Sources of Bias in Our Naive Estimator
For clarity, we will isolate each type of bias
When we look at regularization bias, we will assume we haveused sample-splitting to avoid bias from overfitting
When we look at bias from overfitting, we will assume wehave used orthogonalization to prevent regularization bias
We’ll explain how sample-splitting and orthogonalization work soon
![Page 102: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/102.jpg)
Regularization Bias

Let’s start by looking at the scaled estimation error in θ̂0 when we use sample-splitting without orthogonalization
![Page 103: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/103.jpg)
Regularization Bias

√n(θ̂0 − θ0) = (1/n ∑i∈I Di²)⁻¹ · (1/√n) ∑i∈I Di Ui   [:= a]
  + (1/n ∑i∈I Di²)⁻¹ · (1/√n) ∑i∈I Di (g0(Xi) − ĝ0(Xi))   [:= b]

This looks scary, so let’s take it one term at a time

√n(θ̂0 − θ0) represents our scaled estimation error

If we want consistency, we want this estimation error to go to zero

a ⇝ N(0, Σ), i.e. a is asymptotically normal. Great!

b is the sum of terms that do not have mean zero, divided by √n

Specifically, g0(Xi) − ĝ0(Xi) will not have mean zero because ĝ0 is biased

b will approach 0, but too slowly for our estimator to be √n-consistent!
![Page 111: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/111.jpg)
Causal Inference with ML using Orthogonalization

To overcome this regularization bias, let’s use orthogonalization

Instead of fitting one ML model, we fit two:

1 Estimate D = m0(X) + V, our treatment model

2 Estimate Y = Dθ0 + g0(X) + U as we do in the naive approach, our outcome model

3 Regress Y − ĝ0(X) on the treatment residuals V̂ (a sketch follows below)

The resulting θ̌0 (“theta-naught-check”) is free of regularization bias!

We can call this a “partialling-out” approach because we have partialled out the associations between X and D and between Y and X (conditional on D)
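Here is a minimal sketch of this partialling-out recipe, not from the slides. For simplicity it estimates the outcome nuisance by regressing Y directly on X (the ℓ0(X) = E[Y | X] variant discussed later) rather than by the iterative fit; the simulated data and the mildly regularized random forests are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Partially linear model: D = m0(X) + V and Y = D*theta0 + g0(X) + U
n, theta0 = 2000, 0.5
X = rng.normal(size=(n, 5))
D = np.sin(X[:, 0]) + rng.normal(size=n)
Y = theta0 * D + np.cos(X[:, 1]) + rng.normal(size=n)

# 1. Treatment model: estimate m0(X) = E[D | X], keep residuals V_hat
m_hat = RandomForestRegressor(n_estimators=200, min_samples_leaf=20,
                              random_state=0).fit(X, D)
V_hat = D - m_hat.predict(X)

# 2. Outcome model: estimate E[Y | X] (standing in for the g0 fit)
l_hat = RandomForestRegressor(n_estimators=200, min_samples_leaf=20,
                              random_state=0).fit(X, Y)

# 3. Regress Y minus the outcome fit on the treatment residuals
theta_check = (V_hat @ (Y - l_hat.predict(X))) / (V_hat @ D)
print(f"orthogonalized estimate: {theta_check:.3f} (true theta0 = {theta0})")
```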
![Page 119: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/119.jpg)
Causal Inference with ML using Orthogonalization

In notation:

θ̌0 = (1/n ∑i∈I V̂i Di)⁻¹ · (1/n) ∑i∈I V̂i (Yi − ĝ0(Xi))

Look familiar?

What about now?

β̂IV = (Z′D)⁻¹ Z′y

It’s very similar to our standard linear instrumental variable estimator, two-stage least squares!
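To make the analogy concrete, here is a tiny numerical check, not from the slides: plugging the treatment residuals in as the “instrument” Z reproduces the θ̌0 formula exactly. The arrays are hypothetical stand-ins for V̂, D, and Y − ĝ0(X).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
V_hat = rng.normal(size=n)                     # stand-in for treatment residuals
D = V_hat + rng.normal(scale=0.1, size=n)      # treatment
Y_tilde = 0.5 * D + rng.normal(size=n)         # stand-in for Y - g_hat(X)

# theta-check as on the slide: (1/n sum V_i D_i)^-1 * (1/n) sum V_i (Y_i - g_hat(X_i))
theta_check = (V_hat @ Y_tilde) / (V_hat @ D)

# The IV form with Z = V_hat: (Z'D)^-1 Z'y -- the same expression
Z = V_hat
theta_iv = (Z @ Y_tilde) / (Z @ D)

assert np.isclose(theta_check, theta_iv)       # identical by construction
print(theta_check)
```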
![Page 123: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/123.jpg)
How Orthogonalization De-biases

Remember b from the scaled estimation error equation? Now we have b*:

b* = (E[V²])⁻¹ · (1/√n) ∑i∈I (m0(Xi) − m̂0(Xi)) (g0(Xi) − ĝ0(Xi))

where the first factor in each summand is the m0 estimation error and the second is the g0 estimation error

Because this term is based on the product of two estimation errors, it vanishes more quickly

If ĝ0 and m̂0 are each n^{1/4}-consistent, θ̌0 will be √n-consistent

To see why, just note that n^{1/4} × n^{1/4} = n^{1/2}
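For readers who want one more step, here is a heuristic version of that rate calculation, a sketch via Cauchy–Schwarz rather than the paper’s precise conditions: if each nuisance estimate has root-mean-square error of smaller order than n^{−1/4}, the cross-product term b* vanishes.

```latex
% Heuristic: why n^{1/4}-rates for both nuisances kill b*
\[
  |b^*| \le (\mathbb{E}[V^2])^{-1} \sqrt{n}\,
  \Big(\tfrac{1}{n}\textstyle\sum_{i\in I}(\hat m_0(X_i)-m_0(X_i))^2\Big)^{1/2}
  \Big(\tfrac{1}{n}\textstyle\sum_{i\in I}(\hat g_0(X_i)-g_0(X_i))^2\Big)^{1/2}
  = \sqrt{n}\cdot o_p(n^{-1/4})\cdot o_p(n^{-1/4}) = o_p(1).
\]
```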
![Page 129: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/129.jpg)
A Quick Detour

Chernozhukov et al. use a second partialling-out estimator for partially linear models as well

This estimator is very similar to Robinson’s partialling-out estimator, which is in turn very similar to the Frisch–Waugh–Lovell partialling-out estimator

If you’ve taken an introductory statistics course, you’ve probably learned about the Frisch–Waugh–Lovell theorem

But even if you haven’t (or don’t remember!), reviewing the Frisch–Waugh–Lovell theorem and Robinson can help us build intuition for how DML works
![Page 134: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/134.jpg)
The Frisch–Waugh–Lovell Theorem

Let’s say we want to estimate the following model using OLS:

Y = β0 + β1D + β2X + U

The Frisch–Waugh–Lovell theorem shows us that we can recover the OLS estimate of β1 using a residuals-on-residuals OLS regression (a numerical check follows below):

1 Regress D on X using OLS; let D̂ be the predicted values of D and let the residuals V = D − D̂

2 Regress Y on X using OLS; let Ŷ be the predicted values of Y and let the residuals W = Y − Ŷ

3 Regress W on V using OLS

The estimated coefficient on V will be the same as the estimated coefficient β̂1 from regressing Y on D and X using OLS!
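A quick numerical check of the theorem, not from the slides; the simulated data and the small `ols` helper are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
X = rng.normal(size=n)
D = 0.8 * X + rng.normal(size=n)
Y = 1.0 + 0.5 * D + 2.0 * X + rng.normal(size=n)

def ols(y, Z):
    """OLS coefficients of y on the columns of Z (intercept prepended)."""
    Z = np.column_stack([np.ones(len(y)), Z])
    return np.linalg.lstsq(Z, y, rcond=None)[0]

# Full regression: Y on D and X; beta1 is the coefficient on D
beta1_full = ols(Y, np.column_stack([D, X]))[1]

# FWL: residualize D and Y on X, then regress residuals on residuals
a_D = ols(D, X)
V = D - (a_D[0] + a_D[1] * X)
a_Y = ols(Y, X)
W = Y - (a_Y[0] + a_Y[1] * X)
beta1_fwl = ols(W, V)[1]

assert np.isclose(beta1_full, beta1_fwl)   # identical up to floating point
print(beta1_full, beta1_fwl)
```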
![Page 142: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/142.jpg)
Robinson

The Frisch–Waugh–Lovell procedure:

1 Linear regression of D on X
2 Linear regression of Y on X
3 Linear regression of the residuals from 2 on the residuals from 1

Robinson’s innovation: let’s replace the linear regressions from 1 and 2 with some non-parametric regression

Robinson’s procedure (sketched below):

1 Kernel regression of D on X
2 Kernel regression of Y on X
3 Linear regression of the residuals from 2 on the residuals from 1
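Here is a minimal sketch of Robinson’s procedure, not from the slides, assuming statsmodels’ KernelReg for the kernel regressions; the one-dimensional simulated X keeps the kernel fits fast.

```python
import numpy as np
from statsmodels.nonparametric.kernel_regression import KernelReg

rng = np.random.default_rng(4)
n, theta0 = 500, 0.5
X = rng.uniform(-2, 2, size=n)
D = np.sin(X) + rng.normal(size=n)
Y = theta0 * D + X**2 + rng.normal(size=n)

# 1. Kernel regression of D on X; keep the residuals
V = D - KernelReg(endog=D, exog=X, var_type='c').fit(X)[0]

# 2. Kernel regression of Y on X; keep the residuals
W = Y - KernelReg(endog=Y, exog=X, var_type='c').fit(X)[0]

# 3. Linear regression (through the origin) of W on V
theta_robinson = (V @ W) / (V @ V)
print(f"Robinson estimate: {theta_robinson:.3f} (true theta0 = {theta0})")
```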
![Page 152: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/152.jpg)
Another Way to Orthogonalize

DML using residuals-on-residuals regression:

1 Estimate D = m0(X) + V

2 Estimate Y = ℓ0(X) + U

Note the absence of D and the switch from g0(·) to ℓ0(·), which is essentially E[Y | X]

3 Regress Û on V̂ using OLS for an estimate θ̌0 (a sketch follows below)
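A minimal sketch of this residuals-on-residuals version, not from the slides; random forests stand in for “any sufficiently accurate ML model”, and the data are simulated. Note that it fits and evaluates the nuisance models on the same observations, which is exactly the issue the sample-splitting discussion below addresses.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
n, theta0 = 2000, 0.5
X = rng.normal(size=(n, 5))
D = np.sin(X[:, 0]) + rng.normal(size=n)
Y = theta0 * D + np.cos(X[:, 1]) + rng.normal(size=n)

# 1. Treatment residuals: V_hat = D - m_hat(X)
m_hat = RandomForestRegressor(n_estimators=200, min_samples_leaf=20,
                              random_state=0).fit(X, D)
V_hat = D - m_hat.predict(X)

# 2. Outcome residuals: U_hat = Y - l_hat(X), where l_hat estimates E[Y | X]
l_hat = RandomForestRegressor(n_estimators=200, min_samples_leaf=20,
                              random_state=0).fit(X, Y)
U_hat = Y - l_hat.predict(X)

# 3. OLS (through the origin) of outcome residuals on treatment residuals
theta_check = (V_hat @ U_hat) / (V_hat @ V_hat)
print(f"residuals-on-residuals estimate: {theta_check:.3f} (true theta0 = {theta0})")
```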
![Page 157: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/157.jpg)
Another Way to Orthogonalize

Robinson’s procedure:

1 Predict D with X using kernel regression
2 Predict Y with X using kernel regression
3 Linear regression of the residuals from 2 on the residuals from 1

DML residuals-on-residuals procedure:

1 Predict D with X using any n^{1/4}-consistent ML model
2 Predict Y with X using any n^{1/4}-consistent ML model
3 Linear regression of the residuals from 2 on the residuals from 1
![Page 165: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/165.jpg)
Where We Are and Where We’re Going

We saw that we can eliminate regularization bias using orthogonalization

Now we’re going to show how we can eliminate bias from overfitting using sample-splitting and cross-fitting
![Page 168: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/168.jpg)
Bias from Overfitting

ML algorithms sometimes overfit our data due to their flexibility, and this can lead to bias and slower convergence

Let’s say we estimate θ0 using orthogonalization but without sample-splitting

That is, we fit our machine learning models and estimate our target parameter θ0 on the same set of observations

Our scaled estimation error: √n(θ̌0 − θ0) = a* + b* + c*

We looked at b* before, and we don’t have to worry about a*
![Page 174: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/174.jpg)
Bias from Overfitting

But c* contains terms like this:

(1/√n) ∑i∈I Vi (g0(Xi) − ĝ0(Xi))

V: the error term (not the estimated residuals) from D = m0(X) + V

g0(Xi) − ĝ0(Xi): the estimation error in ĝ0 from Y = Dθ0 + g0(X) + U

What happens if we estimate θ0 with the same set of observations we used to fit m̂0(X) and ĝ0(X)?
![Page 178: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/178.jpg)
Bias from Overfitting

But c* contains terms like this:

(1/√n) ∑i∈I Vi (g0(Xi) − ĝ0(Xi))

Overfitting: modeling the noise too closely

When ĝ0 is overfit, it will pick up on some of the noise U from the outcome model

If our noise terms V and U are associated, estimation error in ĝ0 from overfitting might be associated with V

Note: ĝ0 and V might also be associated if we have any unmeasured confounding, but we’re assuming that away
![Page 183: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/183.jpg)
How to Avoid Bias from Overfitting

To break the association between V and ĝ0 and avoid bias from overfitting, we employ sample-splitting (sketched after this list):

1 Randomly partition our data into two subsets
2 Fit our ML models ĝ0 and m̂0 on the first subset
3 Estimate θ0 in the second subset using the ĝ0 and m̂0 functions we fit in the first subset

Key: We never estimate θ0 using the same observations that we used to fit ĝ0 and m̂0
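A minimal sketch of this split-sample recipe, not from the slides, building on the residuals-on-residuals sketch above; the 50/50 split and the random-forest nuisance models are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(6)
n, theta0 = 2000, 0.5
X = rng.normal(size=(n, 5))
D = np.sin(X[:, 0]) + rng.normal(size=n)
Y = theta0 * D + np.cos(X[:, 1]) + rng.normal(size=n)

# 1. Randomly partition the data into two subsets
idx = rng.permutation(n)
fit_idx, est_idx = idx[: n // 2], idx[n // 2:]

# 2. Fit the nuisance models on the first subset only
m_hat = RandomForestRegressor(n_estimators=200, min_samples_leaf=20,
                              random_state=0).fit(X[fit_idx], D[fit_idx])
l_hat = RandomForestRegressor(n_estimators=200, min_samples_leaf=20,
                              random_state=0).fit(X[fit_idx], Y[fit_idx])

# 3. Estimate theta on the second subset, never reusing the fitting observations
V_hat = D[est_idx] - m_hat.predict(X[est_idx])
U_hat = Y[est_idx] - l_hat.predict(X[est_idx])
theta_split = (V_hat @ U_hat) / (V_hat @ V_hat)
print(f"split-sample estimate: {theta_split:.3f} (true theta0 = {theta0})")
```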
![Page 189: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/189.jpg)
Cross-Fitting
The standard approach to sample-splitting will reduce efficiency and statistical power

We can avoid the loss of efficiency and power using cross-fitting (see the sketch after this list):

1. Randomly partition your data into two subsets
2. Fit two ML models g0,1 and m0,1 in the first subset
3. Estimate θ0,1 in the second subset using the g0,1 and m0,1 functions we fit in the first subset
4. Fit two ML models g0,2 and m0,2 in the second subset
5. Estimate θ0,2 in the first subset using the g0,2 and m0,2 functions we fit in the second subset
6. Average our two estimates θ0,1 and θ0,2 for our final estimate θ0
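A minimal two-fold cross-fitting sketch in the same style as before (each half serves once as the fitting set and once as the estimation set, so no observations are wasted):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def crossfit_theta(Y, D, X, rng):
    """Two-fold cross-fitted estimate of theta0; averages the two fold estimates."""
    n = len(Y)
    idx = rng.permutation(n)
    folds = [idx[: n // 2], idx[n // 2:]]
    thetas = []
    for fit, est in [(0, 1), (1, 0)]:                 # swap the roles of the folds
        a, b = folds[fit], folds[est]
        g_hat = RandomForestRegressor().fit(X[a], Y[a])
        m_hat = RandomForestRegressor().fit(X[a], D[a])
        u_res = Y[b] - g_hat.predict(X[b])
        v_res = D[b] - m_hat.predict(X[b])
        thetas.append(np.sum(v_res * u_res) / np.sum(v_res ** 2))
    return np.mean(thetas)                            # step 6: average the estimates
```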
![Page 198: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/198.jpg)
Where We Are and Where We’re Going
We now (hopefully) have a good intuition for how fitting two ML models allows us to remove bias and achieve faster convergence

Next we're going to look at how the authors formally define Neyman orthogonality and DML
![Page 201: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/201.jpg)
Defining Neyman Orthogonality
Remember: orthogonalize D with respect to X → eliminate regularization bias

Now we formalize this "Neyman orthogonality" condition

Let η0 = (g0, m0), η = (g, m)

We have a new friend! Her name is η0 ("eta-naught")

η0 is our nuisance parameter

We don't really care what g0 and m0 are; we just want to use them to get good estimates of θ0

The exact form of our nuisance parameter isn't a scientifically substantive quantity of interest
![Page 209: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/209.jpg)
Defining Neyman Orthogonality
Introduce the following score function:

$$\psi(W; \theta, \eta_0) = \underbrace{(D - m_0(X))}_{V} \times \underbrace{(Y - g_0(X) - (D - m_0(X))\theta)}_{U}$$

Looks scary! Let's break it down

ψ: our score function ("psi")

W: our data

Each underbraced term represents a noise term from our partially linear model
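To make the pieces concrete, here is the score as a function (a sketch: g_hat and m_hat are hypothetical callables returning the fitted predictions):

```python
import numpy as np

def psi_plr(Y, D, X, theta, g_hat, m_hat):
    """Partially-linear-model score psi = V * U, one value per observation."""
    V = D - m_hat(X)              # treatment noise: D minus its predicted value
    U = Y - g_hat(X) - V * theta  # outcome noise implied by this value of theta
    return V * U
```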
![Page 215: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/215.jpg)
Defining Neyman Orthogonality
Introduce the following moment condition:

$$E\big[\psi(W; \theta, \eta_0)\big] = E\big[(D - m_0(X)) \times (Y - g_0(X) - (D - m_0(X))\theta)\big] = 0$$

So we want our score function to equal 0 in expectation. Why?

$$Y = V\theta_0 + g_0(X) + U$$
$$Y = (D - m_0(X))\theta_0 + g_0(X) + U$$

V = (D − m0(X)) is our regressor

U = (Y − g0(X) − (D − m0(X))θ) is our error term
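It helps to solve the moment condition for θ explicitly (a standard one-line derivation, not spelled out on the slides):

$$E\big[V\,(Y - g_0(X) - V\theta)\big] = 0 \;\Longrightarrow\; \theta_0 = \frac{E\big[V\,(Y - g_0(X))\big]}{E[V^2]}$$

The sample analog, $\hat{\theta} = \sum_i \hat{V}_i (Y_i - \hat{g}_0(X_i)) \,/\, \sum_i \hat{V}_i^2$, is exactly the residual-on-residual regression used in the code sketches above.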
![Page 221: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/221.jpg)
Defining Neyman Orthogonality
We're saying we want our regressor V and our error term U to be orthogonal to one another¹

Two vectors are orthogonal to one another when their dot product equals 0

This moment condition is very similar to saying we want our error to be uncorrelated with our regressor

It is also very similar to (but slightly weaker than) the standard zero conditional mean assumption for OLS

¹Moment conditions like this will look more familiar to people who know Generalized Method of Moments (GMM). My understanding is that while GMM is popular in economics, it's less common in political science and sociology, which is why I explain what's going on in a little more depth here.
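A quick numerical illustration of the dot-product idea (a toy example, not from the paper): least-squares residuals are orthogonal to the regressor by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=1000)              # regressor V
y = 2.0 * v + rng.normal(size=1000)    # outcome with true slope 2
theta_hat = (v @ y) / (v @ v)          # least-squares slope
u = y - v * theta_hat                  # residuals
print(v @ u)                           # ~0 up to floating-point error
```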
![Page 225: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/225.jpg)
Defining Neyman Orthogonality
Now we can define Neyman orthogonality!

$$\partial_\eta\, E\big[\psi(W; \theta_0, \eta_0)\big]\,[\eta - \eta_0] = 0$$

In words: the (Gateaux) derivative of our score function with respect to our nuisance parameter is 0

Recall that a derivative represents our instantaneous rate of change

Thus, when the derivative = 0, our score function is robust to small perturbations in η0

It doesn't change much when η0 moves around a little
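As a sanity check (my derivation; the slides assert this without proof), the partially linear score satisfies Neyman orthogonality. Perturbing g0 in the direction g − g0 gives

$$\partial_g E[\psi]\,[g - g_0] = -\,E\big[(D - m_0(X))\,(g - g_0)(X)\big] = -\,E\big[E[V \mid X]\,(g - g_0)(X)\big] = 0$$

since E[V | X] = 0, and the perturbation in m0 gives $E\big[(m - m_0)(X)\,(\theta_0 V - U)\big] = 0$ by the same conditional-mean-zero properties of V and U.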
![Page 232: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/232.jpg)
Now we’re ready to formally define DML!
![Page 233: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/233.jpg)
Defining DML
1. Take a K-fold random partition $(I_k)_{k=1}^{K}$ of observation indices $[N] = \{1, \ldots, N\}$ such that the size of each fold $I_k$ is $n = N/K$. Also, for each $k \in [K] = \{1, \ldots, K\}$, define $I_k^c := \{1, \ldots, N\} \setminus I_k$.

Create K equally sized partitions

$I_k^c$ is the complement of $I_k$: if we have 100 observations and $I_k$ is the set of observations 1–20, then $I_k^c$ is the set of observations 21–100
![Page 234: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/234.jpg)
Defining DML
2. For each $k \in [K]$, construct an ML estimator

$$\hat{\eta}_{0,k} = \hat{\eta}_0\big((W_i)_{i \in I_k^c}\big)$$

Important: The estimator we use for fold k was fit in $I_k^c$! This is the sample splitting we talked about earlier that removes bias from overfitting
![Page 235: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/235.jpg)
Defining DML
3. Construct the estimator $\tilde{\theta}_0$ ("theta-naught-tilde") as the solution to

$$\frac{1}{K} \sum_{k=1}^{K} E_{n,k}\big[\psi(W; \tilde{\theta}_0, \hat{\eta}_{0,k})\big] = 0$$

Note that $\tilde{\theta}_0$ is not indexed by k, but the nuisance parameter $\hat{\eta}_{0,k}$ is

We're finding the one $\tilde{\theta}_0$ that sets the average of the scores across all folds to zero, where the scores vary by fold due to $\hat{\eta}_{0,k}$

This is a slightly different² version of the cross-fitting approach we talked about earlier that enables us to do sample splitting without loss of efficiency (a full K-fold sketch follows)

²In particular, we are no longer taking the average of K different estimates of θ0 but instead finding one estimate that sets the average of the K different score functions to zero. Chernozhukov et al. recommend this latter approach because it behaves better in smaller samples.
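Putting the three steps together for the partially linear model (a sketch under the same assumptions as before: random forests as placeholder learners, and the closed-form root of the pooled moment condition):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_plr(Y, D, X, K=5, seed=0):
    """K-fold DML estimate of theta0 in Y = D*theta0 + g0(X) + U."""
    u_res = np.empty(len(Y))  # out-of-fold outcome residuals
    v_res = np.empty(len(Y))  # out-of-fold treatment residuals
    kf = KFold(n_splits=K, shuffle=True, random_state=seed)
    for comp_idx, fold_idx in kf.split(X):   # comp_idx plays the role of I_k^c
        g_hat = RandomForestRegressor().fit(X[comp_idx], Y[comp_idx])  # step 2
        m_hat = RandomForestRegressor().fit(X[comp_idx], D[comp_idx])
        u_res[fold_idx] = Y[fold_idx] - g_hat.predict(X[fold_idx])
        v_res[fold_idx] = D[fold_idx] - m_hat.predict(X[fold_idx])
    # Step 3: the pooled moment condition (1/N) * sum_i v_i * (u_i - v_i * theta) = 0
    # has the closed-form solution below
    return np.sum(v_res * u_res) / np.sum(v_res ** 2)
```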
![Page 236: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/236.jpg)
The ATE
Even under unconfoundedness, OLS does not (necessarily) give us the ATE

We set up the following moment condition for estimation of the ATE:

$$\psi(W; \theta, \eta) := (g(1,X) - g(0,X)) + \frac{D\,(Y - g(1,X))}{m(X)} - \frac{(1-D)\,(Y - g(0,X))}{1 - m(X)} - \theta$$

Recall the score function has to equal 0 in expectation

So we're saying we want the three big terms in the middle to equal θ

Let's look at them more closely
![Page 242: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/242.jpg)
The ATE
$$\underbrace{(g(1,X) - g(0,X))}_{\substack{\text{Biased treatment effect} \\ \text{estimate from ML models}}} \;+\; \underbrace{\frac{D\,(Y - g(1,X))}{m(X)} \;-\; \frac{(1-D)\,(Y - g(0,X))}{1 - m(X)}}_{\text{Debiasing terms}}$$

g(1,X): predicted outcome given X when D = 1, i.e., predicted outcome for treated units

g(0,X): predicted outcome given X when D = 0, i.e., predicted outcome for control units
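With these pieces defined, here is the full score in code (a sketch: g_hat(d, X) and m_hat(X) are hypothetical callables returning the fitted outcome and propensity predictions):

```python
import numpy as np

def psi_ate(Y, D, X, theta, g_hat, m_hat):
    """Doubly robust score for the ATE, one value per observation."""
    m = m_hat(X)                                      # estimated P(D = 1 | X)
    reg = g_hat(1, X) - g_hat(0, X)                   # plug-in effect estimate
    debias_t = D * (Y - g_hat(1, X)) / m              # reweighted treated residuals
    debias_c = (1 - D) * (Y - g_hat(0, X)) / (1 - m)  # reweighted control residuals
    return reg + debias_t - debias_c - theta

# Setting the sample mean of the score to zero gives
# theta_hat = np.mean(reg + debias_t - debias_c)
```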
![Page 245: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/245.jpg)
The ATE
$$\underbrace{(g(1,X) - g(0,X))}_{\substack{\text{Biased treatment effect} \\ \text{estimate from ML models}}} \;+\; \underbrace{\frac{D\,(Y - g(1,X))}{m(X)}}_{\substack{\text{Residuals for treated} \\ \text{divided by probability} \\ \text{of receiving treatment}}} \;-\; \underbrace{\frac{(1-D)\,(Y - g(0,X))}{1 - m(X)}}_{\substack{\text{Residuals for control} \\ \text{divided by probability} \\ \text{of receiving control}}}$$

Recall that we can define the ATE as E[Y(1) − Y(0)] in potential outcomes notation, where Y(1) and Y(0) represent potential outcomes under treatment and control, respectively

When our prediction of Y(1) is downwardly biased, our ATE estimate will be biased downward

When our prediction of Y(0) is downwardly biased, our ATE estimate will be biased upward
![Page 249: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/249.jpg)
The ATE
That's why we want to add the residuals for the treated units and subtract the residuals for the control units

To see this, note that, e.g., D(Y − g(1,X)) represents the observed potential outcome under treatment minus the predicted potential outcome under treatment
![Page 252: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/252.jpg)
The ATE
Finally, we weight by the inverse probability of treatment, 1/m(X), for the treated units because units with a high probability of treatment will be overrepresented among the treated units

And we weight by the inverse probability of control, 1/(1 − m(X)), for the control units because units with a high probability of control will be overrepresented among the control units
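One line of algebra (standard, though not shown on the slides) makes the reweighting logic precise. For the treated term, using DY = DY(1), unconfoundedness, and E[D | X] = m(X):

$$E\!\left[\frac{D\,Y}{m(X)}\right] = E\!\left[\frac{E[D \mid X]\;E[Y(1) \mid X]}{m(X)}\right] = E\big[Y(1)\big]$$

The control term recovers E[Y(0)] symmetrically, which is why the reweighted residuals correct the plug-in estimate toward the ATE.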
![Page 255: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/255.jpg)
The ATE
Important: be sure to assess common support! (A quick diagnostic sketch follows)

If the probability of treatment or control is very low in some strata of X, the debiasing terms will blow up

Lack of common support → unstable estimates of treatment effects
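A simple diagnostic sketch (my suggestion of one basic check, not a prescription from the paper): look at the estimated propensity scores and flag observations near 0 or 1.

```python
import numpy as np

def check_overlap(m_hat_values, eps=0.01):
    """Report estimated propensity scores outside [eps, 1 - eps]."""
    m = np.asarray(m_hat_values)
    flagged = (m < eps) | (m > 1 - eps)
    print(f"propensity range: [{m.min():.3f}, {m.max():.3f}]")
    print(f"{flagged.sum()} of {len(m)} observations outside [{eps}, {1 - eps}]")
    return flagged  # candidate observations to trim before forming the score
```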
![Page 259: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/259.jpg)
Variance and Confidence Intervals
To get valid confidence intervals, we assume that scores are linear in θ in the following sense:

$$\psi(w; \theta, \eta) = \psi^a(w; \eta)\,\theta + \psi^b(w; \eta) \quad \forall\; w \in \mathcal{W},\; \theta \in \Theta,\; \eta \in T$$

We use this new term $\psi^a(w; \eta)$ to estimate the asymptotic variance of our estimator
![Page 262: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/262.jpg)
Variance and Confidence Intervals
We use the following estimator for the asymptotic variance of DML:

$$\hat{\sigma}^2 = \underbrace{\hat{J}_0^{-1}}_{\hat{J}_0 \text{ inverse}} \left( \frac{1}{K} \sum_{k=1}^{K} E_{n,k}\Big[ \underbrace{\psi(W; \tilde{\theta}_0, \hat{\eta}_{0,k})}_{\text{Score function}} \; \underbrace{\psi(W; \tilde{\theta}_0, \hat{\eta}_{0,k})'}_{\substack{\text{Score function} \\ \text{transpose}}} \Big] \right) \underbrace{(\hat{J}_0^{-1})'}_{\substack{\hat{J}_0 \text{ inverse} \\ \text{transpose}}}$$

where

$$\hat{J}_0 = \frac{1}{K} \sum_{k=1}^{K} E_{n,k}\big[\psi^a(W; \hat{\eta}_{0,k})\big]$$
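For the partially linear score above, ψ is linear in θ with $\psi^a(W; \eta) = -V^2$, so the pieces simplify considerably. A sketch that turns the dml_plr residuals into a confidence interval (my illustration, under the same assumptions as before):

```python
import numpy as np
from scipy import stats

def plr_confint(v_res, u_res, theta, level=0.95):
    """Asymptotic confidence interval for theta in the partially linear model."""
    n = len(v_res)
    psi = v_res * (u_res - v_res * theta)   # score evaluated at the estimate
    J0 = -np.mean(v_res ** 2)               # sample analog of E[psi_a]
    sigma2 = np.mean(psi ** 2) / J0 ** 2    # J0^{-1} E[psi psi'] (J0^{-1})'
    se = np.sqrt(sigma2 / n)
    z = stats.norm.ppf(0.5 + level / 2)
    return theta - z * se, theta + z * se
```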
![Page 265: Chernozhukov et al. on Double / Debiased Machine Learning](https://reader031.vdocuments.us/reader031/viewer/2022012514/618d842c3a2a8662df0a0449/html5/thumbnails/265.jpg)
Wrapping Up
Useful for selection-on-observables with high-dimensional confounding

Avoids strong functional form assumptions

Two sources of bias: regularization and overfitting

Two tools for eliminating bias: orthogonalization and cross-fitting

√n-consistency if both nuisance parameter estimators are n^{1/4}-consistent

Asymptotically valid confidence intervals