Heterogeneous treatment effects
A family of methods for estimating how treatment effects vary across units and covariates. Covers the meta-learner framework — S, T, X, R, and DR learners — and the validation tools needed to distinguish genuine heterogeneity from noise.
Instead of asking 'what is the average effect?' HTE analysis asks 'for whom does the effect differ?' The CATE τ(x) = E[Y(1) − Y(0) | X = x] is the object of interest — a function over the covariate space rather than a scalar.
Meta-learners are modular frameworks that combine off-the-shelf ML models to estimate CATEs. They differ in how they use the outcome model, propensity model, and pseudo-outcomes — each making different bias-variance tradeoffs.
CATEs are unidentified at the unit level — we never observe both Y(1) and Y(0) for the same unit. All CATE estimates carry noise. Validation tools (calibration, RATE, GATES) assess whether predicted heterogeneity reflects real variation.
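A toy simulation makes the identification problem concrete. The sketch below (illustrative only; the data, seed, and variable names are not from this chapter) generates potential outcomes with a known CATE τ(x) = 1 + x, observes only one potential outcome per unit, and recovers the varying effect at the group level:

```r
# Toy simulation (illustrative): true CATE is tau(x) = 1 + x
set.seed(1)
n  <- 5000
x  <- runif(n, -1, 1)
w  <- rbinom(n, 1, 0.5)       # randomized treatment
y0 <- x + rnorm(n)            # potential outcome under control
y1 <- y0 + (1 + x)            # potential outcome under treatment
y  <- ifelse(w == 1, y1, y0)  # only one potential outcome is ever observed

# Group-level differences in means recover tau(x) on average:
high_x <- x > 0.5
mean(y[w == 1 &  high_x]) - mean(y[w == 0 &  high_x])  # near 1 + E[x | x > 0.5] = 1.75
mean(y[w == 1 & !high_x]) - mean(y[w == 0 & !high_x])  # near 1 + E[x | x <= 0.5] = 0.75
```

No unit-level effect y1 - y0 is available to the analyst; only contrasts between groups are, which is why all the validation tools below operate at the group level.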
META-LEARNER COMPARISON
S-learner
Single model for both treated and control. Simple but prone to regularizing away small effects.
Use when: Large N, small expected heterogeneity
T-learner
Separate models for treated and control, with predictions subtracted. Unbiased but high variance.
Use when: Very different treated / control distributions
X-learner
Imputes individual effects using the other group's model, weighted by propensity score.
Use when: Imbalanced treated / control group sizes
R-learner
Residualizes Y and W on X, then regresses the residuals. Neyman-orthogonal — low bias from nuisance estimation.
Use when: High-dimensional X, flexible ML preferred
DR learner
Uses a doubly robust pseudo-outcome. Consistent if either the outcome model or the propensity model is correct.
Use when: Robustness to nuisance misspecification
Causal forest
Nonparametric local CATE with honest CIs. Best for exploratory heterogeneity analysis.
Use when: Unknown effect surface, need valid CIs
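The doubly robust pseudo-outcome mentioned above is the AIPW score. A minimal sketch of its construction, assuming `mu1_hat`, `mu0_hat`, and `e_hat` are cross-fitted nuisance predictions (illustrative names, not defined in this chapter):

```r
# Doubly robust (AIPW) pseudo-outcome — sketch; mu1_hat, mu0_hat, e_hat
# are assumed cross-fitted estimates of E[Y|X,W=1], E[Y|X,W=0], E[W|X]
gamma <- (mu1_hat - mu0_hat) +
  W * (Y - mu1_hat) / e_hat -
  (1 - W) * (Y - mu0_hat) / (1 - e_hat)
# The DR learner regresses gamma on X with any flexible model;
# E[gamma | X = x] equals tau(x) if either the outcome models or the
# propensity model is correctly specified.
```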
Unconfoundedness
Given X, treatment is independent of potential outcomes. All meta-learners inherit this assumption — HTE analysis does not relax the core identification requirement; it refines the estimand.
HOW TO TEST
Same as for ATE estimation: the assumption is not directly testable, and the same concern about unobserved confounding applies to CATE estimation.
Overlap (positivity)
All units must have a treatment probability strictly between 0 and 1. In subgroups with very low or very high treatment rates, CATE estimates are particularly unreliable. Inspect overlap within the subgroups of interest, not just overall.
HOW TO TEST
Check propensity scores within predicted high-CATE and low-CATE subgroups separately. Trim or flag low-overlap regions before reporting subgroup effects.
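One way to implement this check, assuming estimated propensity scores `e_hat` (e.g. from a regression forest of W on X) and predicted CATEs `tau_hat` are already available (a sketch with illustrative names and cutoffs):

```r
# Sketch: overlap diagnostics within predicted-CATE subgroups
# (assumes e_hat = estimated propensity scores, tau_hat = predicted CATEs)
library(dplyr)
overlap_check <- tibble(e_hat = e_hat, tau_hat = tau_hat) |>
  mutate(cate_group = if_else(tau_hat > median(tau_hat), "high", "low")) |>
  group_by(cate_group) |>
  summarise(
    min_e  = min(e_hat),
    max_e  = max(e_hat),
    p_low  = mean(e_hat < 0.05),   # share near the overlap boundary
    p_high = mean(e_hat > 0.95)
  )
overlap_check
# Trim or flag units with e_hat outside [0.05, 0.95] before reporting
# subgroup effects; 0.05/0.95 is a common convention, not a rule.
```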
No overfitting of heterogeneity
Meta-learners can overfit heterogeneity — finding differences that reflect noise rather than true variation. Honest estimation (causal forest), cross-fitting (DR/R learner), and out-of-sample validation (RATE, calibration) are the primary protections.
HOW TO TEST
Use the calibration test and RATE on held-out data. If RATE is not significantly positive, the predicted CATE ranking is not informative for targeting.
Sufficient final-model flexibility
In S, T, X, and R learners, the final CATE model must be flexible enough to capture the true effect surface. If the final model is linear but the true CATE is nonlinear, the estimates will miss important heterogeneity.
HOW TO TEST
Try multiple final models (linear, random forest, BART). Compare CATE distributions across specifications. Substantial differences indicate sensitivity to the final model.
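A sketch of such a comparison, assuming `tau_lm`, `tau_rf`, and `tau_bart` hold CATE predictions for the same units from three final-stage specifications (the names are illustrative):

```r
# Sketch: sensitivity of CATE estimates to the final model
# (assumes tau_lm, tau_rf, tau_bart are CATE predictions from three
#  final-stage specifications on the same units; names are illustrative)
taus <- cbind(linear = tau_lm, forest = tau_rf, bart = tau_bart)
round(cor(taus, method = "spearman"), 2)            # rank agreement across specs
apply(taus, 2, quantile, probs = c(0.1, 0.5, 0.9))  # compare distributions
# Low rank correlation or very different spreads mean the conclusions
# depend on the final model, not just on the data.
```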
Large sample
Estimating heterogeneity is harder than estimating the ATE. A common heuristic: detecting a subgroup difference of the same magnitude as the main effect requires roughly 16× the sample needed to detect the ATE. Below ~1,000 observations, HTE estimates are unreliable.
Moderating covariates
Include variables that theory suggests might moderate the treatment effect — not just confounders. An analysis that includes no plausible moderators will find no heterogeneity by construction, regardless of the method used.
Pre-registration
Post-hoc subgroup analysis is highly susceptible to multiple testing and cherry-picking. Pre-specify primary moderating variables and the estimand of interest before accessing the data. Report all subgroup analyses, not just significant ones.
library(tidyverse)
library(grf)
data <- read_csv("observational_data.csv")
Y <- data$outcome
W <- data$treatment
X <- data |>
select(age, income, education, region, employment,
starts_with("covar_")) |>
as.matrix()
# HTE analysis requires thinking carefully about:
# (1) Which covariates might moderate the treatment effect?
# (2) Is heterogeneity expected on theoretical grounds?
# (3) What is the policy-relevant subgroup?
# Pre-analysis: inspect treatment rate across strata
data |>
mutate(age_group = cut(age, breaks = c(0, 30, 50, 100),
labels = c("young", "middle", "old"))) |>
group_by(age_group) |>
summarise(
n = n(),
treat_rate = mean(treatment),
mean_Y1 = mean(outcome[treatment == 1]),
mean_Y0 = mean(outcome[treatment == 0]),
naive_diff = mean_Y1 - mean_Y0
  )

Two learners are highlighted here: the R-learner for its Neyman-orthogonal properties and practical flexibility, and the X-learner for imbalanced designs. See Chapter 08 for the causal forest and Chapter 09 for the DR learner — both are also CATE estimators.
R-LEARNER
library(grf)
# R-learner: Robinson (1988) decomposition for CATE
# Residualizes both Y and W on X, then regresses Y-residual on W-residual
# Equivalent to the partially linear model with a heterogeneous coefficient
# Step 1: estimate nuisance functions
m_hat <- regression_forest(X, Y, num.trees = 500)$predictions # E[Y|X]
e_hat <- regression_forest(X, W, num.trees = 500)$predictions # E[W|X]
# Step 2: residualize
Y_tilde <- Y - m_hat
W_tilde <- W - e_hat
# Step 3: fit CATE model on pseudo-outcome Y_tilde / W_tilde
# Using a regression forest on the transformed outcome
tau_forest <- regression_forest(
X = X,
  Y = Y_tilde / W_tilde,        # R-learner pseudo-outcome
  sample.weights = W_tilde^2,   # W_tilde^2 weights offset division by small W_tilde
num.trees = 2000
)
tau_hat_r <- predict(tau_forest)$predictions
# Or: use causal_forest directly (implements a variant of R-learner)
cf <- causal_forest(X, Y, W, num.trees = 2000, tune.parameters = "all")
tau_hat_cf <- predict(cf)$predictions

X-LEARNER
library(grf)
# X-learner: effective when treated and control groups are very different in size
# Imputes individual treatment effects using the opposite group's outcome model
# Step 1: fit outcome models separately for treated and control
mu1 <- regression_forest(X[W==1,], Y[W==1], num.trees=500)
mu0 <- regression_forest(X[W==0,], Y[W==0], num.trees=500)
# Step 2: impute counterfactual outcomes
D1 <- Y[W==1] - predict(mu0, newdata=X[W==1,])$predictions # treated: Y1 - mu0(X)
D0 <- predict(mu1, newdata=X[W==0,])$predictions - Y[W==0] # control: mu1(X) - Y0
# Step 3: fit CATE models on each imputed effect
tau1 <- regression_forest(X[W==1,], D1, num.trees=1000)
tau0 <- regression_forest(X[W==0,], D0, num.trees=1000)
# Step 4: combine using propensity score weights
e_hat <- regression_forest(X, W, num.trees=500)$predictions
tau_hat_x <- e_hat * predict(tau0, X)$predictions +
  (1 - e_hat) * predict(tau1, X)$predictions

Validating CATE estimates is as important as producing them. The fundamental problem is that individual-level treatment effects are unobserved — validation must proceed indirectly through group-level comparisons and calibration checks.
Calibration test
Regresses actual outcomes on predicted CATEs using the forest's internal weights. A mean forest prediction coefficient near 1 indicates a well-calibrated ATE estimate; a significantly positive differential forest prediction coefficient indicates that heterogeneous effects are statistically detectable beyond noise.
RATE (Rank-Weighted ATE)
Estimates the gain from targeting treatment to units with the highest predicted CATEs. A significantly positive RATE means the CATE ranking is policy-useful — treating top predicted responders outperforms random assignment.
GATES (Grouped ATE)
Divide units into bins by predicted CATE (quintiles or deciles) and estimate the actual ATE within each bin. A monotone pattern — low CATE bins have low actual effects, high CATE bins have high actual effects — validates the CATE ordering.
Best linear projection
Projects the CATE onto a small set of covariates using the forest's doubly robust weights. Provides valid standard errors for which variables linearly predict effect heterogeneity. More interpretable than raw CATE estimates.
library(grf)
# 1. Calibration test (causal forest)
test_calibration(cf)
# mean.forest.prediction coef ≈ 1 → well-calibrated ATE
# differential.forest.prediction coef > 0, significant → HTE exists
# 2. RATE (Rank-Weighted Average Treatment Effect)
# Measures value of targeting treatment to high-CATE units
# (for honest inference, estimate the priority ranking on a held-out fold)
rate <- rank_average_treatment_effect(cf, tau_hat_cf)
print(rate)
# 3. Best linear projection — which variables predict CATEs?
blp <- best_linear_projection(cf, A = X[, c("age", "income", "education")])
print(blp)
# 4. Sorted group average treatment effects (GATES)
# Divide into quintiles by predicted CATE; estimate the ATE within each.
# predict(cf) without newdata returns out-of-bag predictions, which keeps
# the ranking from overfitting the observations it is evaluated on.
data$quintile <- ntile(tau_hat_cf, 5)
for (q in 1:5) {
idx <- data$quintile == q
sub_cf <- causal_forest(X[idx,], Y[idx], W[idx], num.trees=500)
ate_q <- average_treatment_effect(sub_cf)
cat("Quintile", q, ": ATE =", round(ate_q["estimate"], 3), "\n")
}

The R-learner and causal forest give different CATE distributions — which is right?
Neither is definitively right — they make different approximations. The R-learner uses a final model that can be linear or nonlinear; the causal forest is nonparametric and local. Agreement across methods strengthens confidence; disagreement is informative about the shape of the effect surface. Run calibration checks for both and compare RATE statistics. The better-calibrated estimator should be given more weight.
How large a sample do I need to detect heterogeneity?
A rough heuristic: detecting heterogeneity at the subgroup level requires approximately 16 times the sample needed to detect the ATE at the same power level. For an effect size that requires N=200 to detect the ATE, you need roughly N=3,200 to reliably characterize subgroup variation. With fewer observations, report the ATE with a pre-specified subgroup analysis rather than exploratory CATE estimation.
My RATE is positive but not significant — should I report subgroup effects?
With caution. A non-significant RATE means the predicted CATE ranking is not statistically distinguishable from random targeting. Reporting specific subgroup effects in this context risks misleading readers. Instead, report the RATE directly, acknowledge the uncertainty, and present subgroup effects as exploratory and hypothesis-generating for future work.
A specific subgroup has a large estimated CATE — is that actionable?
Check the confidence interval first. Individual CATE estimates have wide CIs — group-level estimates (GATES) are more reliable. If the GATES for a specific quintile significantly exceeds the ATE and the overlap in that group is good, the effect is more credible. Consider cost-effectiveness: treating a subgroup is only worthwhile if the expected CATE exceeds the cost of treatment per unit.