Causal Methods
CHAPTER 07
DML

Double machine learning

Uses ML to partial out the effect of high-dimensional controls from both the treatment and outcome, then estimates the causal effect on the residuals. Achieves root-n consistent inference without specifying the functional form of the nuisance functions.

IDENTIFICATION SETUP
01 The problem it solves

High-dimensional controls create a dilemma: include everything in a regularized regression (e.g. lasso) and the treatment coefficient is shrunk toward zero (regularization bias); select variables manually and risk omitted variable bias. DML uses ML for the controls and orthogonalization to remove the regularization bias from the causal estimate.

02 Partialling out

DML applies the Frisch-Waugh-Lovell theorem in a nonparametric setting. Regress both Y and D on X using ML (the nuisance functions). The causal effect θ is recovered by regressing the Y residual on the D residual — the variation in D left unexplained by X.
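The equivalence that partialling out relies on can be checked directly in base R. On simulated data (all names here are illustrative), the coefficient on D from the full regression matches the coefficient from the residual-on-residual regression exactly:

```r
# Frisch-Waugh-Lovell in the linear case: theta from the full OLS
# equals theta from regressing residual on residual.
# Simulated data; true effect is 2. (Illustrative sketch, not DML itself:
# DML replaces lm() with ML learners and adds cross-fitting.)
set.seed(1)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
d  <- 0.5 * x1 - 0.3 * x2 + rnorm(n)   # treatment depends on controls
y  <- 2 * d + x1 + x2 + rnorm(n)       # outcome; theta = 2

theta_full <- coef(lm(y ~ d + x1 + x2))["d"]

d_res <- resid(lm(d ~ x1 + x2))        # variation in D unexplained by X
y_res <- resid(lm(y ~ x1 + x2))        # variation in Y unexplained by X
theta_fwl <- coef(lm(y_res ~ d_res))["d_res"]

all.equal(unname(theta_full), unname(theta_fwl))  # TRUE: identical up to rounding
```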

03 Cross-fitting

Nuisance functions are estimated on held-out folds, then used to predict on the current fold. This sample-splitting removes the overfitting bias that would arise if the same data were used for both nuisance estimation and causal inference.
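The cross-fitting mechanics can be sketched in a few lines of base R, with lm() standing in for the ML learners (in real DML these would be flexible models; the data here is simulated):

```r
# Cross-fitted partialling out: nuisance models are trained on the
# other folds, predictions are made on the held-out fold.
# lm() stands in for an ML learner; simulated data with theta = 2.
set.seed(2)
n <- 500; K <- 5
x1 <- rnorm(n); x2 <- rnorm(n)
d  <- 0.5 * x1 + 0.5 * x2 + rnorm(n)
y  <- 2 * d + x1 + x2 + rnorm(n)
fold <- sample(rep(1:K, length.out = n))   # random fold assignment

d_res <- numeric(n); y_res <- numeric(n)
for (k in 1:K) {
  train <- fold != k
  test  <- fold == k
  m_hat <- lm(d ~ x1 + x2, subset = train)   # nuisance E[D|X] on other folds
  l_hat <- lm(y ~ x1 + x2, subset = train)   # nuisance E[Y|X] on other folds
  nd <- data.frame(x1 = x1[test], x2 = x2[test])
  d_res[test] <- d[test] - predict(m_hat, newdata = nd)
  y_res[test] <- y[test] - predict(l_hat, newdata = nd)
}
theta_hat <- sum(d_res * y_res) / sum(d_res^2)   # OLS of Y-residual on D-residual
```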

DML PIPELINE — PARTIALLING OUT

[Pipeline diagram: covariates X feed two cross-fitted nuisance models, ml_m and ml_l; θ is estimated from their residuals]
ml_m — predicts treatment D from covariates X: E[D | X]
ml_l — predicts outcome Y from covariates X: E[Y | X]
D̃ — treatment residual: D minus its predicted value
Ỹ — outcome residual: Y minus its predicted value
θ — causal effect: OLS of Ỹ on D̃, partialled out
All nuisance predictions are cross-fitted on held-out folds.
ASSUMPTIONS
Conditional ignorability (required)

Given the observed covariates X, treatment D is independent of the potential outcomes. Same as in matching and IPW — DML does not solve unobserved confounding; it handles high-dimensional observed confounding.

HOW TO TEST

Theoretical argument. Inspect whether plausible unobserved confounders are included in X. Sensitivity analysis via partial R² bounds (Cinelli & Hazlett) can quantify robustness.

Neyman orthogonality (required)

The score function used to estimate θ is orthogonal to small perturbations in the nuisance functions. This is satisfied by the partialling-out score and the IRM score by construction — choose the right DoubleML model class for your setting.

HOW TO TEST

Ensured by the estimator choice. PLR satisfies orthogonality for continuous treatment; IRM satisfies it for binary treatment. Do not use a non-orthogonal score.

Overlap / positivity (required)

Every unit must have nonzero probability of receiving treatment. Extreme propensity scores near 0 or 1 cause instability in the IRM estimator. PLR is less sensitive to overlap violations for continuous treatment.

HOW TO TEST

Inspect the distribution of out-of-fold treatment probabilities fitted by the IRM treatment model (ml_m). Trim or flag units with predicted probabilities near 0 or 1.
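A base-R sketch of the check, with a simulated vector standing in for the out-of-fold propensities (in practice these come from the fitted treatment model):

```r
# Overlap diagnostic on fitted propensity scores.
# p_hat is simulated here; in practice use the IRM treatment model's
# out-of-fold predictions of P(D = 1 | X).
set.seed(3)
p_hat <- plogis(rnorm(1000, sd = 2))    # stand-in for fitted propensities

summary(p_hat)                          # look at min/max and quartiles
eps <- 0.02                             # trimming threshold: a judgment call
flagged <- p_hat < eps | p_hat > 1 - eps
mean(flagged)                           # share of units with poor overlap
keep <- !flagged                        # analyze the trimmed subsample, or reweight
```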

Nuisance function rate conditions (recommended)

For root-n consistent inference on θ, the nuisance functions must converge to their true values at a rate faster than n^{-1/4}. Random forests and lasso satisfy this under sparsity or smoothness conditions. Deep neural nets may not in small samples.

HOW TO TEST

Check out-of-sample RMSE for both ml_l and ml_m. Poor nuisance fit directly inflates the variance of θ. Consider LASSO or elastic net as a baseline alongside random forests.

DATA REQUIREMENTS

High-dimensional controls

DML is most valuable when the control set X is large — many variables, interactions, or flexible transformations. For low-dimensional X, OLS with standard controls is often sufficient and simpler.

Sample size

Cross-fitting loses some efficiency — each fold uses roughly (k-1)/k of the data. Recommended minimum is ~500 observations for stable nuisance estimates. With very small samples, consider LASSO over random forests.

Treatment type

PLR handles continuous or multi-valued treatment; IRM handles binary treatment. Choose the correct model class — the orthogonal score differs between them.

01_data_prep.R
library(tidyverse)
library(DoubleML)
library(mlr3)
library(mlr3learners)

data <- read_csv("observational_data.csv")

# DML requires: outcome Y, treatment D, high-dimensional controls X
# X can be many variables — DML handles the selection automatically

# Create DoubleML data object
dml_data <- DoubleMLData$new(
  data = as.data.frame(data),
  y_col = "outcome",      # Y
  d_cols = "treatment",    # D (can be vector for multiple treatments)
  x_cols = c("age", "income", "education",
             "region", "employment",
             paste0("covar_", 1:20))  # high-dimensional X
)

dml_data
ESTIMATION

Two model classes cover most empirical settings. PLR assumes a partially linear structure — the treatment effect θ enters linearly but controls enter flexibly. IRM makes no linearity assumption for binary treatment and estimates ATE or ATT via the doubly robust score.

Partially linear regression (PLR) — continuous or multi-valued D
Y = θ·D + g(X) + ε     D = m(X) + v
Interactive regression model (IRM) — binary D
Y = g(D, X) + ε     D ~ Bernoulli(m(X))    (no linearity in D)

PARTIALLY LINEAR REGRESSION (PLR)

02_plr.R
library(DoubleML)
library(mlr3)
library(mlr3learners)

# Partially linear regression (PLR)
# Y = theta*D + g(X) + epsilon
# D = m(X) + v
# Both nuisance functions estimated via ML

obj_plr <- DoubleMLPLR$new(
  data = dml_data,
  ml_l = lrn("regr.ranger",        # outcome model: E[Y|X]
             num.trees = 200,
             max.depth = 5),
  ml_m = lrn("regr.ranger",        # treatment model: E[D|X]
             num.trees = 200,
             max.depth = 5),
  n_folds = 5,                     # 5-fold cross-fitting
  score = "partialling out"        # Frisch-Waugh-Lovell partialling
)

obj_plr$fit()
obj_plr$summary()

# Aggregate over folds
obj_plr$coef           # theta estimate
obj_plr$se             # standard error
obj_plr$confint()      # 95% CI

INTERACTIVE REGRESSION MODEL (IRM) — BINARY TREATMENT

03_irm.R
library(DoubleML)
library(mlr3)
library(mlr3learners)

# Interactive regression model (IRM)
# For binary treatment — allows heterogeneous effects
# Y = g(D, X) + epsilon  (no linearity in D assumed)
# D ~ m(X) (binary treatment model via classification)

obj_irm <- DoubleMLIRM$new(
  data = dml_data,
  ml_g = lrn("regr.ranger"),       # outcome model E[Y | D, X]
  ml_m = lrn("classif.ranger",     # treatment model P(D = 1 | X)
             predict_type = "prob"),
  n_folds = 5,
  score = "ATE"                    # or "ATTE" for the effect on the treated
)

obj_irm$fit()
obj_irm$summary()
DIAGNOSTICS & TUNING

Nuisance model RMSE

Check out-of-sample RMSE for both ml_l (outcome model) and ml_m (treatment model) across folds. High RMSE inflates the variance of θ. Try multiple learners and compare.
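Out-of-fold RMSE can be computed by hand from any cross-fitting loop. A minimal sketch with lm() as a stand-in learner and simulated data:

```r
# Out-of-fold RMSE for a nuisance model. A useful benchmark is sd(y):
# RMSE close to sd(y) means the model explains almost nothing.
set.seed(5)
n <- 500; K <- 5
x <- rnorm(n)
y <- x + rnorm(n)                       # simulated nuisance target
fold <- sample(rep(1:K, length.out = n))

pred <- numeric(n)
for (k in 1:K) {
  fit <- lm(y ~ x, subset = fold != k)  # train on the other folds
  idx <- fold == k
  pred[idx] <- predict(fit, newdata = data.frame(x = x[idx]))
}
rmse <- sqrt(mean((y - pred)^2))
c(rmse = rmse, sd_y = sd(y))            # here RMSE is well below sd(y)
```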

Sensitivity to learner choice

Re-estimate θ using different nuisance learners — lasso, random forest, gradient boosting. Substantial variation in θ across learners suggests the estimate is sensitive to nuisance model specification.
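The idea can be demonstrated in base R by swapping the nuisance learner: with a confounder that enters nonlinearly, a linear nuisance model biases θ while a loess smoother recovers it (simulated data; cross-fitting omitted for brevity):

```r
# Learner sensitivity: theta under a linear vs. a flexible nuisance fit.
# True effect is 2; the confounder enters through x^2, which lm(v ~ x) misses.
set.seed(4)
n <- 800
x <- rnorm(n)
d <- x^2 + rnorm(n, sd = 0.5)
y <- 2 * d + x^2 + rnorm(n)

partial_out <- function(fit_fun) {
  d_res <- d - fit_fun(d)                  # residualize treatment
  y_res <- y - fit_fun(y)                  # residualize outcome
  sum(d_res * y_res) / sum(d_res^2)        # residual-on-residual OLS
}
theta_lm    <- partial_out(function(v) fitted(lm(v ~ x)))      # misspecified
theta_loess <- partial_out(function(v) fitted(loess(v ~ x)))   # flexible

round(c(lm = theta_lm, loess = theta_loess), 2)  # lm is biased; loess is near 2
```

A large gap like this is the signal the diagnostic looks for: it says the estimate depends on how the nuisance functions are specified.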

Number of folds

More folds reduce the efficiency loss from sample splitting. 5-fold is standard; 10-fold is better for small samples but slower. For very small N (<200), consider repeated cross-fitting with multiple random splits.

Aggregation across folds

Within one cross-fit, DoubleML's default "dml2" procedure pools the orthogonal scores across folds to solve for a single θ ("dml1" instead averages per-fold estimates). Across repeated sample splits (n_rep > 1), estimates are aggregated via the median for robustness. Check that estimates are not wildly variable across splits — high variance suggests an unstable nuisance fit in some folds.

04_tuning.R
library(DoubleML)
library(mlr3)
library(mlr3tuning)
library(mlr3learners)

# Tune hyperparameters for the nuisance functions
# Using random search within cross-fitted folds

ml_l_tuned <- lrn("regr.ranger",
  num.trees = to_tune(100, 500),
  max.depth = to_tune(3, 8),
  mtry.ratio = to_tune(0.1, 0.9)
)

# Wrap in auto-tuner (inner CV within each fold)
at_l <- auto_tuner(
  tuner = tnr("random_search", batch_size = 20),
  learner = ml_l_tuned,
  resampling = rsmp("cv", folds = 3),
  measure = msr("regr.mse"),
  term_evals = 20
)

obj_tuned <- DoubleMLPLR$new(
  data = dml_data,
  ml_l = at_l,
  ml_m = at_l$clone(),
  n_folds = 5
)
obj_tuned$fit()
obj_tuned$summary()
OUTPUT INTERPRETATION

My DML estimate differs from OLS — which should I trust?

Prefer DML if X is high-dimensional and the conditional ignorability assumption holds. With many controls, plain OLS overfits or becomes infeasible, and a regularized regression that includes the treatment shrinks its coefficient (regularization bias); the DML estimate is consistent under weaker conditions on the nuisance functions. If they agree, that is reassuring — if they disagree, investigate whether the traditional estimate is regularization-biased or whether the DML nuisance models are misfitting.

The standard errors are large — is that a problem with cross-fitting?

Large SEs from DML often reflect genuinely weak signal after partialling out controls — not a cross-fitting artifact. Cross-fitting can slightly increase variance relative to a full-sample nuisance fit, but the bias reduction it provides justifies this. If SEs are very large, check nuisance RMSE — poor nuisance fit directly inflates variance.

Can I use DML to estimate heterogeneous treatment effects?

Partially. PLR and IRM estimate an average θ, not unit-level effects. To estimate heterogeneous effects, use the DML framework combined with causal forests (the R-learner) or the DoubleMLIIVM class for interactive IV models. See Chapter 08 (Causal forest) for the full HTE workflow.

How do I report DML results alongside traditional estimates?

Report the DML estimate, SE, and 95% CI from the summary() output. Alongside it, report the nuisance RMSE for both models, the number of folds, and the learner type. If results are robust to learner choice, note that. If you also ran OLS, present both — the comparison is informative.