Heterogeneous treatment effects
A family of methods for estimating how treatment effects vary across units and covariates. Covers the meta-learner framework — S, T, X, R, and DR learners — and the validation tools needed to distinguish genuine heterogeneity from noise.
Instead of asking 'what is the average effect?' HTE analysis asks 'for whom does the effect differ?' The CATE τ(x) = E[Y(1) − Y(0) | X = x] is the object of interest — a function over the covariate space rather than a scalar.
Meta-learners are modular frameworks that combine off-the-shelf ML models to estimate CATEs. They differ in how they use the outcome model, propensity model, and pseudo-outcomes — each making different bias-variance tradeoffs.
CATEs are unidentified at the unit level — we never observe both Y(1) and Y(0) for the same unit. All CATE estimates carry noise. Validation tools (calibration, RATE, GATES) assess whether predicted heterogeneity reflects real variation.
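A toy simulation makes the identification problem concrete. The sketch below (illustrative only; the data, seed, and variable names are not from this chapter) generates potential outcomes with a known CATE τ(x) = 1 + x, observes only one potential outcome per unit, and recovers the varying effect at the group level:

```r
# Toy simulation (illustrative): true CATE is tau(x) = 1 + x
set.seed(1)
n  <- 5000
x  <- runif(n, -1, 1)
w  <- rbinom(n, 1, 0.5)       # randomized treatment
y0 <- x + rnorm(n)            # potential outcome under control
y1 <- y0 + (1 + x)            # potential outcome under treatment
y  <- ifelse(w == 1, y1, y0)  # only one potential outcome is ever observed

# Group-level differences in means recover tau(x) on average:
high_x <- x > 0.5
mean(y[w == 1 &  high_x]) - mean(y[w == 0 &  high_x])  # near 1 + E[x | x > 0.5] = 1.75
mean(y[w == 1 & !high_x]) - mean(y[w == 0 & !high_x])  # near 1 + E[x | x <= 0.5] = 0.75
```

No unit-level effect y1 - y0 is available to the analyst; only contrasts between groups are, which is why all the validation tools below operate at the group level.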
META-LEARNER COMPARISON
S-learner
Single model for both treated and control. Simple but prone to regularizing away small effects.
Use when: Large N, small expected heterogeneity
T-learner
Separate models for treated and control, with predictions subtracted. Unbiased but high variance.
Use when: Very different treated / control distributions
X-learner
Imputes individual effects using the other group's model, weighted by propensity score.
Use when: Imbalanced treated / control group sizes
R-learner
Residualizes Y and W on X, then regresses the residuals. Neyman-orthogonal — low bias from nuisance estimation.
Use when: High-dimensional X, flexible ML preferred
DR learner
Uses a doubly robust pseudo-outcome. Consistent if either the outcome model or the propensity model is correct.
Use when: Robustness to nuisance misspecification
Causal forest
Nonparametric local CATE with honest CIs. Best for exploratory heterogeneity analysis.
Use when: Unknown effect surface, need valid CIs
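The doubly robust pseudo-outcome mentioned above is the AIPW score. A minimal sketch of its construction, assuming `mu1_hat`, `mu0_hat`, and `e_hat` are cross-fitted nuisance predictions (illustrative names, not defined in this chapter):

```r
# Doubly robust (AIPW) pseudo-outcome — sketch; mu1_hat, mu0_hat, e_hat
# are assumed cross-fitted estimates of E[Y|X,W=1], E[Y|X,W=0], E[W|X]
gamma <- (mu1_hat - mu0_hat) +
  W * (Y - mu1_hat) / e_hat -
  (1 - W) * (Y - mu0_hat) / (1 - e_hat)
# The DR learner regresses gamma on X with any flexible model;
# E[gamma | X = x] equals tau(x) if either the outcome models or the
# propensity model is correctly specified.
```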
Unconfoundedness
Given X, treatment is independent of potential outcomes. All meta-learners inherit this assumption — HTE analysis does not relax the core identification requirement; it refines the estimand.
HOW TO TEST
Same as for ATE estimation: the assumption is not directly testable, and the same concern about unobserved confounding applies to CATE estimation.
Overlap (positivity)
All units must have a treatment probability strictly between 0 and 1. In subgroups with very low or very high treatment rates, CATE estimates are particularly unreliable. Inspect overlap within the subgroups of interest, not just overall.
HOW TO TEST
Check propensity scores within predicted high-CATE and low-CATE subgroups separately. Trim or flag low-overlap regions before reporting subgroup effects.
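One way to implement this check, assuming estimated propensity scores `e_hat` (e.g. from a regression forest of W on X) and predicted CATEs `tau_hat` are already available (a sketch with illustrative names and cutoffs):

```r
# Sketch: overlap diagnostics within predicted-CATE subgroups
# (assumes e_hat = estimated propensity scores, tau_hat = predicted CATEs)
library(dplyr)
overlap_check <- tibble(e_hat = e_hat, tau_hat = tau_hat) |>
  mutate(cate_group = if_else(tau_hat > median(tau_hat), "high", "low")) |>
  group_by(cate_group) |>
  summarise(
    min_e  = min(e_hat),
    max_e  = max(e_hat),
    p_low  = mean(e_hat < 0.05),   # share near the overlap boundary
    p_high = mean(e_hat > 0.95)
  )
overlap_check
# Trim or flag units with e_hat outside [0.05, 0.95] before reporting
# subgroup effects; 0.05/0.95 is a common convention, not a rule.
```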
No overfitting of heterogeneity
Meta-learners can overfit heterogeneity — finding differences that reflect noise rather than true variation. Honest estimation (causal forest), cross-fitting (DR/R learner), and out-of-sample validation (RATE, calibration) are the primary protections.
HOW TO TEST
Use the calibration test and RATE on held-out data. If RATE is not significantly positive, the predicted CATE ranking is not informative for targeting.
Sufficient final-model flexibility
In S, T, X, and R learners, the final CATE model must be flexible enough to capture the true effect surface. If the final model is linear but the true CATE is nonlinear, the estimates will miss important heterogeneity.
HOW TO TEST
Try multiple final models (linear, random forest, BART). Compare CATE distributions across specifications. Substantial differences indicate sensitivity to the final model.
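A sketch of such a comparison, assuming `tau_lm`, `tau_rf`, and `tau_bart` hold CATE predictions for the same units from three final-stage specifications (the names are illustrative):

```r
# Sketch: sensitivity of CATE estimates to the final model
# (assumes tau_lm, tau_rf, tau_bart are CATE predictions from three
#  final-stage specifications on the same units; names are illustrative)
taus <- cbind(linear = tau_lm, forest = tau_rf, bart = tau_bart)
round(cor(taus, method = "spearman"), 2)            # rank agreement across specs
apply(taus, 2, quantile, probs = c(0.1, 0.5, 0.9))  # compare distributions
# Low rank correlation or very different spreads mean the conclusions
# depend on the final model, not just on the data.
```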
Large sample
Estimating heterogeneity is harder than estimating the ATE. A common heuristic: detecting a subgroup difference of the same magnitude as the main effect requires roughly 16× the sample needed to detect the ATE. Below ~1,000 observations, HTE estimates are unreliable.
Moderating covariates
Include variables that theory suggests might moderate the treatment effect — not just confounders. An analysis that includes no plausible moderators will find no heterogeneity by construction, regardless of the method used.
Pre-registration
Post-hoc subgroup analysis is highly susceptible to multiple testing and cherry-picking. Pre-specify primary moderating variables and the estimand of interest before accessing the data. Report all subgroup analyses, not just significant ones.
library(tidyverse)
library(grf)
data <- read_csv("observational_data.csv")
Y <- data$outcome
W <- data$treatment
X <- data |>
select(age, income, education, region, employment,
starts_with("covar_")) |>
as.matrix()
# HTE analysis requires thinking carefully about:
# (1) Which covariates might moderate the treatment effect?
# (2) Is heterogeneity expected on theoretical grounds?
# (3) What is the policy-relevant subgroup?
# Pre-analysis: inspect treatment rate across strata
data |>
mutate(age_group = cut(age, breaks = c(0, 30, 50, 100),
labels = c("young", "middle", "old"))) |>
group_by(age_group) |>
summarise(
n = n(),
treat_rate = mean(treatment),
mean_Y1 = mean(outcome[treatment == 1]),
mean_Y0 = mean(outcome[treatment == 0]),
naive_diff = mean_Y1 - mean_Y0
  )

Two learners are highlighted here: the R-learner for its Neyman-orthogonal properties and practical flexibility, and the X-learner for imbalanced designs. See Chapter 08 for the causal forest and Chapter 09 for the DR learner — both are also CATE estimators.
R-LEARNER
library(grf)
# R-learner: Robinson (1988) decomposition for CATE
# Residualizes both Y and W on X, then regresses Y-residual on W-residual
# Equivalent to the partially linear model with a heterogeneous coefficient
# Step 1: estimate nuisance functions
m_hat <- regression_forest(X, Y, num.trees = 500)$predictions # E[Y|X]
e_hat <- regression_forest(X, W, num.trees = 500)$predictions # E[W|X]
# Step 2: residualize
Y_tilde <- Y - m_hat
W_tilde <- W - e_hat
# Step 3: fit CATE model on pseudo-outcome Y_tilde / W_tilde
# Using a regression forest on the transformed outcome
tau_forest <- regression_forest(
X = X,
  Y = Y_tilde / W_tilde,        # R-learner pseudo-outcome
  sample.weights = W_tilde^2,   # W_tilde^2 weights offset division by small W_tilde
num.trees = 2000
)
tau_hat_r <- predict(tau_forest)$predictions
# Or: use causal_forest directly (implements a variant of R-learner)
cf <- causal_forest(X, Y, W, num.trees = 2000, tune.parameters = "all")
tau_hat_cf <- predict(cf)$predictions

X-LEARNER
library(grf)
# X-learner: effective when treated and control groups are very different in size
# Imputes individual treatment effects using the opposite group's outcome model
# Step 1: fit outcome models separately for treated and control
mu1 <- regression_forest(X[W==1,], Y[W==1], num.trees=500)
mu0 <- regression_forest(X[W==0,], Y[W==0], num.trees=500)
# Step 2: impute counterfactual outcomes
D1 <- Y[W==1] - predict(mu0, newdata=X[W==1,])$predictions # treated: Y1 - mu0(X)
D0 <- predict(mu1, newdata=X[W==0,])$predictions - Y[W==0] # control: mu1(X) - Y0
# Step 3: fit CATE models on each imputed effect
tau1 <- regression_forest(X[W==1,], D1, num.trees=1000)
tau0 <- regression_forest(X[W==0,], D0, num.trees=1000)
# Step 4: combine using propensity score weights
e_hat <- regression_forest(X, W, num.trees=500)$predictions
tau_hat_x <- e_hat * predict(tau0, X)$predictions +
  (1 - e_hat) * predict(tau1, X)$predictions

Validating CATE estimates is as important as producing them. The fundamental problem is that individual-level treatment effects are unobserved — validation must proceed indirectly through group-level comparisons and calibration checks.
Calibration test
Regresses actual outcomes on predicted CATEs using the forest's internal weights. A mean forest prediction coefficient near 1 indicates a well-calibrated ATE estimate; a significantly positive differential forest prediction coefficient indicates that heterogeneous effects are statistically detectable beyond noise.
RATE (Rank-Weighted ATE)
Estimates the gain from targeting treatment to units with the highest predicted CATEs. A significantly positive RATE means the CATE ranking is policy-useful — treating top predicted responders outperforms random assignment.
GATES (Grouped ATE)
Divide units into bins by predicted CATE (quintiles or deciles) and estimate the actual ATE within each bin. A monotone pattern — low CATE bins have low actual effects, high CATE bins have high actual effects — validates the CATE ordering.
Best linear projection
Projects the CATE onto a small set of covariates using the forest's doubly robust weights. Provides valid standard errors for which variables linearly predict effect heterogeneity. More interpretable than raw CATE estimates.
library(grf)
# 1. Calibration test (causal forest)
test_calibration(cf)
# mean.forest.prediction coef ≈ 1 → well-calibrated ATE
# differential.forest.prediction coef > 0, significant → HTE exists
# 2. RATE (Rank-Weighted Average Treatment Effect)
# Measures value of targeting treatment to high-CATE units
# (for honest inference, estimate the priority ranking on a held-out fold)
rate <- rank_average_treatment_effect(cf, tau_hat_cf)
print(rate)
# 3. Best linear projection — which variables predict CATEs?
blp <- best_linear_projection(cf, A = X[, c("age", "income", "education")])
print(blp)
# 4. Sorted group average treatment effects (GATES)
# Divide into quintiles by predicted CATE; estimate the ATE within each.
# predict(cf) without newdata returns out-of-bag predictions, which keeps
# the ranking from overfitting the observations it is evaluated on.
data$quintile <- ntile(tau_hat_cf, 5)
for (q in 1:5) {
idx <- data$quintile == q
sub_cf <- causal_forest(X[idx,], Y[idx], W[idx], num.trees=500)
ate_q <- average_treatment_effect(sub_cf)
cat("Quintile", q, ": ATE =", round(ate_q["estimate"], 3), "\n")
}

The R-learner and causal forest give different CATE distributions — which is right?
Neither is definitively right — they make different approximations. The R-learner uses a final model that can be linear or nonlinear; the causal forest is nonparametric and local. Agreement across methods strengthens confidence; disagreement is informative about the shape of the effect surface. Run calibration checks for both and compare RATE statistics. The better-calibrated estimator should be given more weight.
How large a sample do I need to detect heterogeneity?
A rough heuristic: detecting heterogeneity at the subgroup level requires approximately 16 times the sample needed to detect the ATE at the same power level. For an effect size that requires N=200 to detect the ATE, you need roughly N=3,200 to reliably characterize subgroup variation. With fewer observations, report the ATE with a pre-specified subgroup analysis rather than exploratory CATE estimation.
My RATE is positive but not significant — should I report subgroup effects?
With caution. A non-significant RATE means the predicted CATE ranking is not statistically distinguishable from random targeting. Reporting specific subgroup effects in this context risks misleading readers. Instead, report the RATE directly, acknowledge the uncertainty, and present subgroup effects as exploratory and hypothesis-generating for future work.
A specific subgroup has a large estimated CATE — is that actionable?
Check the confidence interval first. Individual CATE estimates have wide CIs — group-level estimates (GATES) are more reliable. If the GATES for a specific quintile significantly exceeds the ATE and the overlap in that group is good, the effect is more credible. Consider cost-effectiveness: treating a subgroup is only worthwhile if the expected CATE exceeds the cost of treatment per unit.