AIPW / DR learner
Combines an outcome regression model and a propensity score model into a doubly robust estimator — consistent if either nuisance model is correctly specified. Achieves semiparametric efficiency when both models converge at fast enough rates.
The AIPW estimator is consistent if either the outcome model Q̂(W, X) or the propensity score model ê(X) is correctly specified — it does not require both. This is the key advantage over IPW (which requires ê to be correct) or outcome regression (which requires Q̂ to be correct).
For each unit, the doubly robust pseudo-outcome Ψᵢ augments the outcome regression prediction with an inverse-probability-weighted residual. The ATE is simply the mean of Ψ across units. The DR learner extends this to estimate CATEs.
When both nuisance models converge faster than rate n^{−1/4} — so the product of their estimation errors is o(n^{−1/2}) — the AIPW estimator achieves the semiparametric efficiency bound: the smallest asymptotic variance possible for any regular estimator of the ATE under conditional ignorability.
AIPW SCORE CONSTRUCTION
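For binary treatment W, the score has a standard closed form, stated here in the Q̂, ê notation used above:

Ψᵢ = Q̂(1, Xᵢ) − Q̂(0, Xᵢ) + Wᵢ/ê(Xᵢ) · (Yᵢ − Q̂(1, Xᵢ)) − (1 − Wᵢ)/(1 − ê(Xᵢ)) · (Yᵢ − Q̂(0, Xᵢ))

ATE = (1/n) Σᵢ Ψᵢ, with standard error sd(Ψ)/√n

The first term is the outcome-regression prediction; the residual terms are the inverse-probability-weighted correction. If Q̂ is correct, the residual terms have conditional mean zero; if ê is correct, they exactly cancel the error in Q̂. Either way E[Ψᵢ | Xᵢ] equals the CATE, which is the algebra behind double robustness.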
Conditional ignorability (unconfoundedness)
Given X, treatment is independent of the potential outcomes. Double robustness relaxes which nuisance model must be correctly specified, but it does not relax this fundamental identification assumption.
HOW TO TEST
Not testable from data — justify it with a theoretical argument about the assignment mechanism. The same unobserved-confounding concern as in IPW or matching applies here.
Positivity (overlap)
Every unit must have a nonzero probability of receiving each treatment level. The IPW correction term in the DR score divides by ê(X) and 1 − ê(X) — values near 0 or 1 produce extreme weights that inflate variance and bias.
HOW TO TEST
Inspect the cross-fitted propensity score distribution. Trim units with PS < 0.05 or PS > 0.95, and report sensitivity to the trimming threshold.
At least one nuisance model correctly specified
Double robustness guarantees consistency if either Q̂ or ê is correctly specified. If both are misspecified, AIPW is biased. In practice, using flexible ML for both models substantially reduces the risk of joint misspecification (see the simulation sketch after the test below).
HOW TO TEST
Check nuisance model fit via cross-validated RMSE (for Q̂) and AUC (for ê). Use multiple learner types and inspect the sensitivity of the ATE to learner choice.
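A toy simulation makes the double-robustness property concrete: with a correctly specified propensity model, even a grossly misspecified outcome model leaves the score consistent. Everything in this sketch (data-generating process, sample size, coefficients) is an illustrative assumption:

set.seed(1)
n_sim <- 5000
x_sim <- rnorm(n_sim)
a_sim <- rbinom(n_sim, 1, plogis(0.8 * x_sim))  # assignment depends on x
y_sim <- 2 * a_sim + x_sim + rnorm(n_sim)       # true ATE = 2
# Deliberately misspecified outcome model: ignores x (confounded arm means)
q1_bad <- rep(mean(y_sim[a_sim == 1]), n_sim)
q0_bad <- rep(mean(y_sim[a_sim == 0]), n_sim)
# Correctly specified propensity model
e_hat <- fitted(glm(a_sim ~ x_sim, family = binomial))
# The IPW residual terms repair the bad outcome model
psi_sim <- q1_bad - q0_bad +
  a_sim / e_hat * (y_sim - q1_bad) -
  (1 - a_sim) / (1 - e_hat) * (y_sim - q0_bad)
c(naive = mean(y_sim[a_sim == 1]) - mean(y_sim[a_sim == 0]),
  aipw = mean(psi_sim))   # naive is biased upward; AIPW recovers ~2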
Nuisance convergence rates
For semiparametric efficiency, both nuisance models must converge faster than rate n^{−1/4}. Lasso and random forests can achieve this under sparsity or smoothness conditions. Cross-fitting removes the overfitting (own-observation) bias of flexible learners, so only this rate condition is needed for valid inference.
HOW TO TEST
Use flexible ML learners. Compare ATE across learner types. Large differences suggest one or both nuisance models are not converging fast enough.
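A minimal sketch of both checks (nuisance fit sensitivity and ATE stability), reusing the Y, A, W objects prepared in the implementation below; the particular learner menu is an illustrative assumption:

library(AIPW)
library(SuperLearner)
# Refit the estimator under different nuisance learner libraries;
# a stable ATE across rows is reassuring, large swings are a warning
libs <- list(glm = "SL.glm", lasso = "SL.glmnet", forest = "SL.ranger")
for (nm in names(libs)) {
  fit <- AIPW$new(Y = Y, A = A, W = W,
                  Q.SL.library = libs[[nm]], g.SL.library = libs[[nm]],
                  k_split = 5, verbose = FALSE)
  fit$fit()$summary()
  cat("---", nm, "---\n")
  print(fit$result)   # compare the ATE row across learner choices
}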
Binary treatment
AIPW in its standard form handles binary treatment W ∈ {0, 1}. Extensions exist for multi-valued and continuous treatments via generalized propensity scores, but the implementation is substantially more complex.
Sufficient overlap
AIPW is more sensitive to overlap violations than IPW because the augmentation term amplifies extreme weights. Trim or clip propensity scores before estimation and always report the trimming threshold.
Sample size
Cross-fitting requires enough observations in each fold for stable nuisance estimates. A minimum of ~500 observations is practical. With very small samples, consider a single split (no cross-fitting) or targeted learning (TMLE).
library(tidyverse)
data <- read_csv("observational_data.csv")
# AIPW convention: Y = outcome, A = treatment, W = covariates
Y <- data$outcome
A <- data$treatment
W <- data |>
  select(age, income, education, region, employment,
         starts_with("covar_")) |>
  as.data.frame()
# Check overlap — AIPW is sensitive to extreme propensity scores
ps_model <- glm(A ~ ., data = W, family = binomial)
ps <- fitted(ps_model)
# Flag extreme scores
cat("PS < 0.05:", sum(ps < 0.05), "\n")
cat("PS > 0.95:", sum(ps > 0.95), "\n")
# Trim to common support if needed
keep <- ps > 0.05 & ps < 0.95
cat("Units retained after trimming:", sum(keep), "/", length(keep), "\n")Two implementation paths: AIPW for ATE estimation, and the DR learner for CATE estimation. Both use the same doubly robust score — the DR learner extends it by regressing the pseudo-outcome Ψ on covariates X to estimate heterogeneous effects.
AIPW — ATE ESTIMATION
library(AIPW)
library(SuperLearner)
# AIPW with SuperLearner for nuisance functions
# SuperLearner ensembles multiple learners via cross-validation
sl_libs <- c("SL.ranger", "SL.glmnet", "SL.mean")
aipw_obj <- AIPW$new(
  Y = Y,
  A = A,                    # binary treatment
  W = W,                    # covariate data frame
  Q.SL.library = sl_libs,   # outcome model E[Y|A,W]
  g.SL.library = sl_libs,   # propensity model E[A|W]
  k_split = 5,              # 5-fold cross-fitting
  verbose = TRUE
)
aipw_obj$
stratified_fit()$ # fit nuisance models
summary() # ATE and RR estimates
# ATE with 95% CI
aipw_obj$result

DR LEARNER — ATE AND CATE ESTIMATION
library(DoubleML)
library(mlr3)
library(mlr3learners)
# DR learner via DoubleML's IRM (Interactive Regression Model)
# IRM uses the doubly robust (AIPW) score by default for ATE
dml_data <- DoubleMLData$new(
  data = data.table::as.data.table(cbind(W, outcome = Y, treatment = A)),
  y_col = "outcome",
  d_cols = "treatment",
  x_cols = colnames(W)
)
irm <- DoubleMLIRM$new(
  data = dml_data,
  ml_g = lrn("regr.ranger", num.trees = 200),          # E[Y|D,X]
  ml_m = lrn("classif.ranger", predict_type = "prob"), # P(D=1|X)
  n_folds = 5,
  score = "ATE",         # uses the doubly robust / AIPW score
  normalize_ipw = TRUE   # stabilized weights
)
irm$fit()
irm$summary()
irm$confint()

Nuisance model fit
Check cross-validated RMSE for the outcome model and AUC for the propensity model. Poor fit in either nuisance model degrades the DR estimator — and if both are misspecified, double robustness offers no protection.
Propensity score overlap
Overlay histograms of cross-fitted PS for treated and control. Substantial non-overlap or extreme scores near 0 or 1 indicate trimming is needed. The IPW correction amplifies extreme PS values.
EIF mean check
The DR pseudo-outcome Ψ equals the efficient influence function (EIF) shifted by the ATE, so its mean re-estimates the ATE. Disagreement between this EIF-based estimate and the reported ATE, or an EIF dominated by a few extreme values, indicates systematic bias — typically from a misspecified outcome or propensity model.
Sensitivity to trimming
Re-estimate the ATE at different propensity trimming thresholds (0.01, 0.05, 0.10). Stable ATE across thresholds supports robustness; instability indicates the estimate depends heavily on extreme-PS units.
library(AIPW)
library(tidyverse)
# 1. Nuisance model fit — check Q and g model performance using the
#    cross-fitted predictions stored on the fitted object
#    (field names below follow the AIPW package's obs_est list)
ps <- aipw_obj$obs_est$p.score   # cross-fitted P(A=1|W)
mu <- aipw_obj$obs_est$mu        # cross-fitted E[Y|A,W] at the observed A
cat("Propensity AUC:", round(as.numeric(pROC::auc(A, ps)), 3), "\n")  # needs pROC
cat("Outcome RMSE:", round(sqrt(mean((Y - mu)^2)), 3), "\n")
# 2. Overlap / propensity score distribution (cross-fitted PS from above)
hist(ps[A == 0], breaks = 40, col = rgb(0, 0, 1, 0.4),
     main = "Cross-fitted propensity scores by treatment group",
     xlab = "P(A=1|W)")
hist(ps[A == 1], breaks = 40, col = rgb(1, 0, 0, 0.4), add = TRUE)
legend("topright", legend = c("control", "treated"),
       fill = c(rgb(0, 0, 1, 0.4), rgb(1, 0, 0, 0.4)))
abline(v = c(0.05, 0.95), lty = 2, col = "red")
# 3. Efficient influence function (EIF) diagnostics
# aipw_eif1 / aipw_eif0 hold the uncentered per-unit contributions for
# E[Y(1)] / E[Y(0)]; their difference is the DR pseudo-outcome Psi
psi <- aipw_obj$obs_est$aipw_eif1 - aipw_obj$obs_est$aipw_eif0
cat("EIF-based ATE:", round(mean(psi), 4), "\n")
# Disagreement with the reported ATE in aipw_obj$result, or a Psi
# distribution dominated by a few extreme values, signals misfitting
summary(psi)
# 4. Sensitivity to propensity trimming
for (trim in c(0.01, 0.05, 0.10)) {
  keep <- ps > trim & ps < 1 - trim
  cat("Trim:", trim, "| N:", sum(keep),
      "| ATE:", round(mean(psi[keep]), 4), "\n")
}

My AIPW ATE differs from IPW — which should I trust?
AIPW is generally preferred. If both nuisance models are well-specified, AIPW is more efficient than IPW. If the propensity model is misspecified, the outcome model augmentation in AIPW provides a correction that IPW lacks. The AIPW estimate is consistent under a weaker requirement — you need either model right, not both.
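Both estimates can also be computed from the same cross-fitted nuisances to see the gap directly. A sketch reusing the ps and psi vectors from the diagnostics above, with the Horvitz–Thompson form for IPW:

# IPW uses only the propensity score; AIPW adds the outcome-model term
ipw_ate  <- mean(A * Y / ps - (1 - A) * Y / (1 - ps))
aipw_ate <- mean(psi)
cat("IPW ATE: ", round(ipw_ate, 4), "\n")
cat("AIPW ATE:", round(aipw_ate, 4), "\n")
# A large gap means the augmentation term is doing real work; check
# overlap and nuisance fit before preferring either number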
What does it mean for one nuisance model to be 'correctly specified'?
In a nonparametric setting, a model is correctly specified if it converges to the true conditional expectation as sample size grows. With flexible ML learners — random forests, lasso — this is plausible under smoothness or sparsity conditions. It does not require the model to be exactly right in small samples, only that it gets better with more data.
My DR learner CATEs and my causal forest CATEs are different — why?
They solve the same problem but differently. The DR learner regresses the DR pseudo-outcome Ψ on X using a final model — if that model is linear (LinearDRLearner), it estimates the best linear projection of the CATE onto X. The causal forest finds nonparametric local estimates. Agreement is a good sign; divergence suggests the effect surface is nonlinear in ways the final model of the DR learner doesn't capture.
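A minimal sketch of that second stage, reusing the pseudo-outcome psi recovered in the diagnostics above; the ranger random forest as final model is one choice among many:

library(ranger)
# DR learner second stage: regress the DR pseudo-outcome on covariates
cate_data <- data.frame(W, psi = psi)
cate_fit  <- ranger(psi ~ ., data = cate_data, num.trees = 500)
cate_hat  <- cate_fit$predictions   # out-of-bag CATE estimates
summary(cate_hat)                   # distribution of estimated CATEs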
Should I use AIPW, DML, or causal forest for my analysis?
Use AIPW when you want a doubly robust ATE estimate and interpretability of which model is doing the work. Use DML (PLR) when treatment is continuous or you want the Neyman-orthogonal framework explicitly. Use causal forest when HTE is the primary interest and you want nonparametric CATE estimates with honest confidence intervals. For a rigorous analysis, estimate all three and compare — agreement across methods strengthens credibility.