Causal Methods
CHAPTER 08
Causal Forest (CF)

Extends random forests to estimate heterogeneous treatment effects at the unit level. Uses honest sample-splitting and local centering to produce valid confidence intervals for the conditional average treatment effect (CATE) across the covariate space.

IDENTIFICATION SETUP
01 What it estimates

The CATE: τ(x) = E[Y(1) − Y(0) | X = x] — the expected treatment effect for a unit with covariates x. Unlike DML, which estimates a single average effect θ, a causal forest estimates a separate effect for every unit.

02 How it works

Trees are built to maximize heterogeneity in treatment effects across leaves, not to maximize prediction accuracy. Each leaf groups units with similar covariates; the effect estimated within one leaf may differ from the effects in other leaves.

03 Honest splitting

Tree structure (which splits to make) is determined on one half of the data. Effect estimates within each leaf are computed on the other half. This prevents overfitting the effect surface to noise in the training data.

DISTRIBUTION OF ESTIMATED CATES (τ̂)

[Figure: histogram of estimated CATEs (τ̂) ranging from roughly −0.2 to 0.8, with the ATE marked; low responders cluster at the left, high responders at the right]

CATE τ(x): unit-level treatment effect — varies across covariates
ATE: average of the CATEs — the single number DML gives you
Bimodality: two subgroups respond differently — policy-relevant
τ̂ < 0: some units may be harmed — important for targeting
ASSUMPTIONS
Conditional ignorability (required)

Given X, treatment is independent of potential outcomes. Same as DML and matching — causal forest does not handle unobserved confounding. All variables jointly predicting treatment and outcome must be in X.

HOW TO TEST

Theoretical argument. Use domain knowledge to assess whether any important confounders are missing from X.

Overlap (required)

Every unit must have a nonzero probability of receiving either treatment. Causal forests are local estimators — a leaf in which every unit shares the same treatment status contains no variation to learn from. Extreme propensity scores near 0 or 1 are particularly problematic.

HOW TO TEST

Inspect the propensity score distribution. Trim units outside the region of common support before fitting. grf can also produce overlap-weighted estimates (target.sample = "overlap") that downweight low-overlap regions.
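
A minimal sketch of this check, assuming a fitted forest cf as in the estimation section below — cf$W.hat holds the propensity scores the forest estimates internally, and the 0.05/0.95 band is an illustrative cutoff, not a grf default:

# Propensity scores estimated internally by the causal forest
hist(cf$W.hat, breaks = 40,
     main = "Estimated propensity scores", xlab = "e(x)")

# Flag units outside an (assumed) common-support band
outside <- cf$W.hat < 0.05 | cf$W.hat > 0.95
cat("Units outside common support:", sum(outside), "\n")

# Overlap-weighted ATE downweights low-overlap regions
average_treatment_effect(cf, target.sample = "overlap")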

Honesty (required)

Honest splitting is required for valid confidence intervals. With honesty = FALSE, the forest will still estimate CATEs but the variance estimates are invalid. Always use honesty = TRUE for inference.

HOW TO TEST

Verify honesty = TRUE in the fit call. The effective sample size per leaf is halved by honesty — increase num.trees to compensate.

Sufficient sample size per leaf (recommended)

Each leaf needs enough observations to produce stable local effect estimates. The min.node.size parameter controls this. Too-small leaves overfit; too-large leaves underfit heterogeneity. Auto-tuning selects this via cross-validation.

HOW TO TEST

Check the distribution of leaf sizes via the fitted forest, as in the sketch below. If most leaves have very few observations, increase min.node.size or set tune.parameters = "all".
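
A sketch of such a leaf-size check. It assumes the node structure returned by grf's get_tree() — leaf nodes carrying their honest-sample indices — so verify the field names against your grf version:

# Leaf-size distribution across the first 100 trees (assumed node fields)
leaf_sizes <- unlist(lapply(1:100, function(i) {
  tree <- get_tree(cf, i)
  leaves <- Filter(function(node) node$is_leaf, tree$nodes)
  vapply(leaves, function(node) length(node$samples), integer(1))
}))
summary(leaf_sizes)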

DATA REQUIREMENTS

Binary treatment

Causal forests are designed for binary treatment W ∈ {0, 1}. For continuous treatment, use instrumental forests or the R-learner framework. Multi-arm treatments require separate pairwise forests or the multi-arm causal forest.

Sample size

Reliable CATE estimates require substantially more data than ATE estimates. A minimum of ~1,000 observations is a practical floor; heterogeneity is difficult to detect below ~500. Honesty halves the effective sample per leaf.

Covariate matrix

X should include all potential confounders plus any variables believed to moderate treatment effects. grf handles many covariates well — variable importance output reveals which ones drive heterogeneity.

01_data_prep.R
library(tidyverse)
library(grf)

data <- read_csv("observational_data.csv")

# Causal forest requires matrices, not data frames
Y <- data$outcome                        # outcome vector
W <- data$treatment                      # binary treatment vector
# grf needs a numeric matrix — one-hot encode any factor or
# character columns (e.g. via model.matrix) before as.matrix()
X <- data |>
  select(age, income, education,
         region, employment,
         starts_with("covar_")) |>
  as.matrix()                            # covariate matrix

# grf handles missing data poorly — impute or remove first
stopifnot(!anyNA(X), !anyNA(Y), !anyNA(W))

# Check treatment balance
cat("Treatment rate:", mean(W), "\n")
cat("N treated:", sum(W), " | N control:", sum(1-W), "\n")
ESTIMATION

The causal forest fits a forest of trees, each of which partitions the covariate space to maximize treatment effect heterogeneity within leaves. Unit-level CATE estimates are computed as a weighted local average of outcomes, where weights are determined by how often each unit shares a leaf with the target unit across trees.
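
These weights are not hidden: grf exposes them through get_forest_weights(), so the local-averaging interpretation can be checked directly on the fitted forest cf from the next block (a sketch; the call returns a sparse n × n matrix):

# Row i holds the weights over training units that form unit i's
# CATE estimate; each row sums to approximately 1
alpha <- get_forest_weights(cf)
summary(Matrix::rowSums(alpha))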

02_causal_forest.R
library(grf)

# Fit causal forest
# num.trees: more trees = lower variance, diminishing returns after ~2000
# honesty: sample-splits so effect estimates and tree structure use different obs
cf <- causal_forest(
  X = X,
  Y = Y,
  W = W,
  num.trees = 2000,
  honesty = TRUE,           # honest splitting (default)
  tune.parameters = "all"   # auto-tune min.node.size, mtry, etc.
)

# Overall ATE estimate
ate <- average_treatment_effect(cf, target.sample = "all")
cat("ATE:", round(ate["estimate"], 4),
    "| SE:", round(ate["std.err"], 4), "\n")

# ATT estimate
att <- average_treatment_effect(cf, target.sample = "treated")
cat("ATT:", round(att["estimate"], 4), "\n")

# Unit-level CATE predictions
tau_hat <- predict(cf)$predictions
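
Unit-level confidence intervals come from the forest's variance estimates, which are valid only with honesty = TRUE. A minimal sketch using predict()'s estimate.variance option:

# Pointwise 95% confidence intervals for the unit-level CATEs
pred     <- predict(cf, estimate.variance = TRUE)
tau_se   <- sqrt(pred$variance.estimates)
ci_lower <- tau_hat - 1.96 * tau_se
ci_upper <- tau_hat + 1.96 * tau_se
cat("Share of units with CI excluding zero:",
    mean(ci_lower > 0 | ci_upper < 0), "\n")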
HETEROGENEITY ANALYSIS

The primary output of a causal forest is the unit-level CATE vector. The next step is to characterize where and why treatment effects vary — which covariates drive heterogeneity, and whether predicted heterogeneity reflects genuine variation or noise.

Variable importance

Reports how frequently each covariate is used in tree splits. High importance means the covariate drives effect heterogeneity. Not a formal test — use best linear projection for inference on which variables predict CATEs.

Best linear projection

Estimates the best linear projection of the CATE onto a chosen set of covariates, using doubly robust scores rather than the raw forest predictions. Provides valid standard errors for which variables linearly predict heterogeneity — the formal complement to variable importance.

Calibration test

test_calibration() in grf fits a best linear predictor of the target estimand using the forest's mean prediction and differential prediction as regressors. A coefficient near 1 on the mean prediction indicates the forest is well-calibrated on average; the differential prediction coefficient tests whether HTE is statistically significant.

RATE (Rank-Weighted ATE)

Measures the benefit of targeting treatment to the units with the highest predicted CATEs. If targeting the top quartile of predicted responders improves outcomes substantially, the heterogeneity is actionable.

VARIABLE IMPORTANCE

03_variable_importance.R
library(grf)
library(tidyverse)

# Variable importance — which covariates drive heterogeneity?
# Based on how often each variable is used in tree splits
vim <- variable_importance(cf)
var_names <- colnames(X)

tibble(variable = var_names, importance = vim) |>
  arrange(desc(importance)) |>
  mutate(variable = fct_reorder(variable, importance)) |>
  ggplot(aes(x = importance, y = variable)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  labs(title = "Causal forest: variable importance",
       x = "Importance", y = NULL)

# Best linear projection — which variables linearly predict the CATE?
blp <- best_linear_projection(cf, A = X[, c("age", "income")])
print(blp)

SUBGROUP & CALIBRATION ANALYSIS

04_subgroup_analysis.R
library(grf)
library(tidyverse)   # for ntile(), group_by(), summarise()

# Subgroup ATE estimates using CATE predictions
tau_hat <- predict(cf)$predictions

# Split into quartiles by predicted CATE
data$tau_hat  <- tau_hat
data$quartile <- ntile(tau_hat, 4)

# Mean predicted CATE within each quartile
quartile_summary <- data |>
  group_by(quartile) |>
  summarise(mean_cate = mean(tau_hat), n = n())
print(quartile_summary)

# Doubly robust ATE within each quartile via the subset argument
for (q in 1:4) {
  ate_q <- average_treatment_effect(cf, subset = data$quartile == q)
  cat("Quartile", q, "ATE:", round(ate_q["estimate"], 4),
      "| SE:", round(ate_q["std.err"], 4), "\n")
}

# Calibration test: do predicted CATEs track actual heterogeneity?
# Fits a best linear predictor of the target using the forest's
# mean prediction and differential prediction as regressors
test_calibration(cf)
# Coefficient on mean.forest.prediction should be ≈ 1 (well-calibrated)
# Coefficient on differential.forest.prediction tests HTE significance
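
A sketch of the RATE analysis described earlier, using grf's rank_average_treatment_effect(). For valid inference the priorities must come from a forest fit on separate data, so the sketch splits the sample in half — the 50/50 split and the use of predicted CATEs as priorities are illustrative choices:

05_rate.R
library(grf)

# Split: one forest generates priorities, the other evaluates them
train <- sample(nrow(X), floor(nrow(X) / 2))
cf_train <- causal_forest(X[train, ], Y[train], W[train])
cf_eval  <- causal_forest(X[-train, ], Y[-train], W[-train])

# Priorities: predicted CATEs from the training forest
priorities <- predict(cf_train, X[-train, ])$predictions

# RATE: benefit of treating units in order of predicted CATE
rate <- rank_average_treatment_effect(cf_eval, priorities)
print(rate)
plot(rate)   # Targeting Operator Characteristic (TOC) curve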
OUTPUT INTERPRETATION

The CATE distribution is wide — does that mean there is real heterogeneity?

Not necessarily. Causal forests produce a distribution of estimated CATEs, but some spread is mechanical noise — even with no true heterogeneity, different units receive different local estimates. Use test_calibration() and the RATE statistic to assess whether the predicted heterogeneity explains actual variation in outcomes. A significant differential.forest.prediction coefficient from test_calibration is the most direct evidence.

Some units have negative CATE estimates — should I exclude them?

No — negative CATEs are informative. They indicate units for whom the treatment is estimated to be harmful. If the confidence interval for a negative CATE excludes zero, that is statistically meaningful. Excluding these units would bias the ATE upward and misrepresent the treatment effect distribution.

My ATE from the causal forest differs from the ATE from DML — which is right?

Both are valid under the same identification assumptions, but they target slightly different averages. Causal forests average CATEs computed locally; DML estimates a single coefficient in a partially linear model, which weights units by the conditional variance of treatment. If treatment effects are genuinely heterogeneous, these can differ. Report both and explain the difference — the comparison is informative.

How do I use CATE estimates for policy targeting?

Sort units by predicted CATE (highest to lowest) and compute the policy-relevant treatment rule: treat only units with predicted CATE above a threshold (e.g. zero, or the cost of treatment). Use the RATE function in grf to estimate the value of this targeting rule and compare it to treating all or none. Report uncertainty in CATE rankings — unit-level CIs are wide; group-level targeting is more reliable.