Double machine learning
Uses ML to partial out the effect of high-dimensional controls from both the treatment and the outcome, then estimates the causal effect from the residuals. Delivers root-n consistent estimates and valid confidence intervals without specifying the functional form of the nuisance functions.
High-dimensional controls create a dilemma: include everything and you must regularize, which biases the coefficient on the treatment; select variables by hand and you risk omitted variable bias. DML uses ML for the controls and orthogonalization to remove the regularization bias from the causal estimate.
DML applies the Frisch-Waugh-Lovell theorem in a nonparametric setting: regress both Y and T on X using ML (the nuisance functions), then recover the causal effect θ by regressing the Y residual on the T residual, i.e. on the variation in T left unexplained by X.
Each nuisance function is estimated on the other folds and used to predict the held-out fold (cross-fitting). This sample splitting removes the overfitting bias that would arise if the same data were used both to estimate the nuisances and to estimate the causal effect.
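A minimal hand-rolled sketch of this cross-fitted partialling out, using the ranger package. The data frame df and its columns y and d are illustrative assumptions, not objects defined elsewhere in this chapter, and the standard errors from the final lm() are not the DML robust ones (use the DoubleML package below for inference).

library(ranger)

set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))   # random fold assignment
res_y <- numeric(nrow(df))
res_d <- numeric(nrow(df))

for (f in 1:k) {
  train <- df[folds != f, ]                        # nuisance fits on the other folds only
  test  <- df[folds == f, ]
  fit_y <- ranger(y ~ . - d, data = train)         # E[Y | X]
  fit_d <- ranger(d ~ . - y, data = train)         # E[D | X]
  res_y[folds == f] <- test$y - predict(fit_y, data = test)$predictions
  res_d[folds == f] <- test$d - predict(fit_d, data = test)$predictions
}

# Final stage: residual-on-residual regression recovers theta
theta_fit <- lm(res_y ~ 0 + res_d)
coef(theta_fit)   # the partialling-out estimate of theta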
DML PIPELINE — PARTIALLING OUT
Given the observed covariates X, treatment T is independent of potential outcomes. Same as in matching and IPW — DML does not solve unobserved confounding, it handles high-dimensional observed confounding.
HOW TO TEST
Theoretical argument: assess whether every plausible confounder is captured by the observed X (unobserved confounding cannot be tested directly). Sensitivity analysis via partial R² bounds (Cinelli & Hazlett) can quantify how strong a remaining confounder would have to be to overturn the result.
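A hedged illustration of the Cinelli & Hazlett partial R² bounds using the sensemakr package on a linear benchmark model. The variable names mirror the data object constructed later in this chapter, and the choice of income as the benchmark covariate is an assumption for illustration.

library(sensemakr)

# Linear benchmark model with the observed controls
lm_bench <- lm(outcome ~ treatment + age + income + education, data = data)

sens <- sensemakr(model = lm_bench,
                  treatment = "treatment",
                  benchmark_covariates = "income",  # calibrate against an observed confounder
                  kd = 1:3)                         # confounders 1x, 2x, 3x as strong as income
summary(sens)
plot(sens)   # contour plot: how strong must confounding be to overturn the estimate?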
Neyman orthogonality: the score function used to estimate θ is insensitive to small perturbations in the nuisance functions. The partialling-out score and the IRM score satisfy this by construction, so the main requirement is choosing the right DoubleML model class for your setting.
HOW TO TEST
Ensured by the estimator choice. PLR satisfies orthogonality for continuous treatment; IRM satisfies it for binary treatment. Do not use a non-orthogonal score.
Every unit must have a probability of treatment strictly between 0 and 1 (overlap). Extreme propensity scores near 0 or 1 destabilize the IRM estimator, which weights by inverse propensities. PLR is less sensitive to overlap violations and is the natural choice for continuous treatment.
HOW TO TEST
Inspect the distribution of the cross-fitted treatment probabilities from the propensity model (ml_m in the IRM). Trim or flag units with predicted probabilities near 0 or 1.
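A sketch of this check using the obj_irm object constructed in the IRM block below. The store_predictions flag and the element names in $predictions follow the DoubleML documentation and should be verified against your package version.

obj_irm$fit(store_predictions = TRUE)
p_hat <- as.vector(obj_irm$predictions[["ml_m"]])   # cross-fitted P(D = 1 | X)

summary(p_hat)
hist(p_hat, breaks = 40, main = "Cross-fitted propensity scores")

# Flag units with near-deterministic treatment (common cutoffs: 0.01 / 0.99)
mean(p_hat < 0.01 | p_hat > 0.99)

DoubleMLIRM also accepts trimming_rule and trimming_threshold arguments that truncate extreme propensities before they enter the score; check the package documentation for the exact behavior in your version.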
For root-n consistent inference on θ, the nuisance functions must converge to their true values at a rate faster than n^{-1/4}. Random forests and lasso satisfy this under sparsity or smoothness conditions. Deep neural nets may not in small samples.
HOW TO TEST
Check out-of-sample RMSE for both ml_l and ml_m. Poor nuisance fit directly inflates the variance of θ. Consider LASSO or elastic net as a baseline alongside random forests.
High-dimensional controls
DML is most valuable when the control set X is large — many variables, interactions, or flexible transformations. For low-dimensional X, OLS with standard controls is often sufficient and simpler.
Sample size
Cross-fitting trains each nuisance model on roughly (k-1)/k of the data, so small samples strain the nuisance fits. A recommended minimum is ~500 observations for stable nuisance estimates. With very small samples, prefer LASSO over random forests.
Treatment type
PLR handles continuous or multi-valued treatment; IRM handles binary treatment. Choose the correct model class — the orthogonal score differs between them.
library(tidyverse)
library(DoubleML)
library(mlr3)
library(mlr3learners)
data <- read_csv("observational_data.csv")
# DML requires: outcome Y, treatment D, high-dimensional controls X
# X can be many variables — DML handles the selection automatically
# Create DoubleML data object
dml_data <- DoubleMLData$new(
  data = as.data.frame(data),
  y_col = "outcome",               # Y
  d_cols = "treatment",            # D (can be a vector for multiple treatments)
  x_cols = c("age", "income", "education",
             "region", "employment",
             paste0("covar_", 1:20))  # high-dimensional X
)
dml_data  # print the data object to check variable assignments

Two model classes cover most empirical settings. PLR assumes a partially linear structure — the treatment effect θ enters linearly but controls enter flexibly. IRM makes no linearity assumption for binary treatment and estimates ATE or ATT via the doubly robust score.
PARTIALLY LINEAR REGRESSION (PLR)
library(DoubleML)
library(mlr3)
library(mlr3learners)
# Partially linear regression (PLR)
# Y = theta*D + g(X) + epsilon
# D = m(X) + v
# Both nuisance functions estimated via ML
obj_plr <- DoubleMLPLR$new(
  data = dml_data,
  ml_l = lrn("regr.ranger",        # outcome model: E[Y|X]
             num.trees = 200,
             max.depth = 5),
  ml_m = lrn("regr.ranger",        # treatment model: E[D|X]
             num.trees = 200,
             max.depth = 5),
  n_folds = 5,                     # 5-fold cross-fitting
  score = "partialling out"        # Frisch-Waugh partialling
)
obj_plr$fit()
obj_plr$summary()
# Aggregate over folds
obj_plr$coef # theta estimate
obj_plr$se # standard error
obj_plr$confint() # 95% CI

INTERACTIVE REGRESSION MODEL (IRM) — BINARY TREATMENT
library(DoubleML)
library(mlr3)
library(mlr3learners)
# Interactive regression model (IRM)
# For binary treatment — allows heterogeneous effects
# Y = g(D, X) + epsilon (no linearity in D assumed)
# D ~ m(X) (binary treatment model via classification)
obj_irm <- DoubleMLIRM$new(
  data = dml_data,
  ml_g = lrn("regr.ranger"),       # outcome model g(D, X) = E[Y | D, X]
  ml_m = lrn("classif.ranger",     # propensity model m(X) = P(D = 1 | X)
             predict_type = "prob"),
  n_folds = 5,
  score = "ATE"                    # or "ATTE" for the effect on the treated
)
obj_irm$fit()
obj_irm$summary()

Nuisance model RMSE
Check out-of-sample RMSE for both ml_l (outcome model) and ml_m (treatment model) across folds. High RMSE inflates the variance of θ. Try multiple learners and compare.
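A sketch of this RMSE check from the stored cross-fitted predictions of the PLR object above. The $predictions element names (ml_l, ml_m) are taken from the DoubleML documentation and should be checked against your version.

obj_plr$fit(store_predictions = TRUE)

y <- data$outcome
d <- data$treatment
rmse_l <- sqrt(mean((y - as.vector(obj_plr$predictions[["ml_l"]]))^2))  # outcome model fit
rmse_m <- sqrt(mean((d - as.vector(obj_plr$predictions[["ml_m"]]))^2))  # treatment model fit
c(rmse_l = rmse_l, rmse_m = rmse_m)

# Benchmark: an RMSE close to the raw standard deviation means the model explains little
c(sd_y = sd(y), sd_d = sd(d))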
Sensitivity to learner choice
Re-estimate θ using different nuisance learners — lasso, random forest, gradient boosting. Substantial variation in θ across learners suggests the estimate is sensitive to nuisance model specification.
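One way to run this check: re-estimate θ with cross-validated lasso nuisances via regr.cv_glmnet and compare against the random-forest fit. glmnet requires numeric features, so factor controls would need to be encoded first, and the learner key assumes the glmnet package is installed.

obj_plr_lasso <- DoubleMLPLR$new(
  data = dml_data,
  ml_l = lrn("regr.cv_glmnet"),   # lasso for E[Y|X]
  ml_m = lrn("regr.cv_glmnet"),   # lasso for E[D|X]
  n_folds = 5,
  score = "partialling out"
)
obj_plr_lasso$fit()

# Compare point estimates and 95% CIs across nuisance learners
rbind(ranger = c(obj_plr$coef, obj_plr$confint()),
      lasso  = c(obj_plr_lasso$coef, obj_plr_lasso$confint()))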
Number of folds
More folds reduce the efficiency loss from sample splitting. 5-fold is standard; 10-fold is better for small samples but slower. For very small N (<200), consider repeated cross-fitting with multiple random splits.
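A sketch of repeated cross-fitting via the n_rep constructor argument, which re-runs the procedure over several random fold splits and aggregates the results (median by default); the learner settings are illustrative.

obj_rep <- DoubleMLPLR$new(
  data = dml_data,
  ml_l = lrn("regr.ranger", num.trees = 200),
  ml_m = lrn("regr.ranger", num.trees = 200),
  n_folds = 10,
  n_rep = 5          # 5 independent fold splits
)
obj_rep$fit()
obj_rep$summary()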
Aggregation across folds
DoubleML solves the estimating equation pooled over all folds rather than simply averaging fold-wise estimates; with repeated splitting (n_rep > 1), results are further aggregated across repetitions, using the median by default. Check that estimates are not wildly variable across splits: high variance suggests an unstable nuisance fit in some folds.
library(DoubleML)
library(mlr3)
library(mlr3tuning)
library(mlr3learners)
# Tune hyperparameters for the nuisance functions
# Using random search within cross-fitted folds
ml_l_tuned <- lrn("regr.ranger",
  num.trees = to_tune(100, 500),
  max.depth = to_tune(3, 8),
  mtry.ratio = to_tune(0.1, 0.9)
)
# Wrap in auto-tuner (inner CV within each fold)
at_l <- auto_tuner(
  tuner = tnr("random_search", batch_size = 20),
  learner = ml_l_tuned,
  resampling = rsmp("cv", folds = 3),
  measure = msr("regr.mse"),
  term_evals = 20
)
obj_tuned <- DoubleMLPLR$new(
  data = dml_data,
  ml_l = at_l,
  ml_m = at_l$clone(),
  n_folds = 5
)
obj_tuned$fit()
obj_tuned$summary()

My DML estimate differs from OLS — which should I trust?
Trust DML when X is high-dimensional and the conditional ignorability assumption holds. With many controls, unregularized OLS overfits or becomes infeasible, and naively regularizing the controls biases the treatment coefficient; DML is consistent under weaker conditions on the nuisance functions. If the two agree, that is reassuring. If they disagree, investigate whether the OLS specification handles the controls poorly or whether the DML nuisance models are misfitting.
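A minimal sketch of the side-by-side comparison, fitting OLS with the same controls used in the DoubleML data object; the column names are the ones assumed earlier in this chapter.

controls <- c("age", "income", "education", "region", "employment",
              paste0("covar_", 1:20))
f_ols <- reformulate(c("treatment", controls), response = "outcome")
ols_fit <- lm(f_ols, data = data)

# Point estimate and 95% CI from both approaches
cbind(DML = c(obj_plr$coef, obj_plr$confint()),
      OLS = c(coef(ols_fit)["treatment"], confint(ols_fit)["treatment", ]))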
The standard errors are large — is that a problem with cross-fitting?
Large SEs from DML often reflect genuinely weak signal after partialling out controls — not a cross-fitting artifact. Cross-fitting can slightly increase variance relative to a full-sample nuisance fit, but the bias reduction it provides justifies this. If SEs are very large, check nuisance RMSE — poor nuisance fit directly inflates variance.
Can I use DML to estimate heterogeneous treatment effects?
Partially. PLR and IRM estimate an average θ, not unit-level effects. To estimate heterogeneous effects, use the DML framework combined with causal forests (the R-learner) or the DoubleMLIIVM class for interactive IV models. See Chapter 08 (Causal forest) for the full HTE workflow.
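A brief sketch of the causal-forest route using the grf package (the full workflow is in Chapter 08). The control names follow the data object used earlier; the treatment must be numeric 0/1 for causal_forest.

library(grf)

controls <- c("age", "income", "education", "region", "employment",
              paste0("covar_", 1:20))
X <- model.matrix(~ . - 1, data = as.data.frame(data[, controls]))  # dummy-encode factors
cf <- causal_forest(X = X, Y = data$outcome, W = data$treatment)    # W must be numeric 0/1
tau_hat <- predict(cf)$predictions      # unit-level CATE estimates
average_treatment_effect(cf)            # doubly robust ATE from the same forest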
How do I report DML results alongside traditional estimates?
Report the DML estimate, SE, and 95% CI from the summary() output. Alongside it, report the nuisance RMSE for both models, the number of folds, and the learner type. If results are robust to learner choice, note that. If you also ran OLS, present both — the comparison is informative.
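A small sketch that collects the headline numbers into one table for reporting. The learner label and fold count are taken from the PLR fit above; the nuisance RMSE values would come from the diagnostic check shown earlier.

report <- data.frame(
  estimate = unname(obj_plr$coef),
  se       = unname(obj_plr$se),
  ci_low   = obj_plr$confint()[1, 1],
  ci_high  = obj_plr$confint()[1, 2],
  n_folds  = 5,
  learner  = "regr.ranger"
)
report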