Causal forest
Extends random forests to estimate heterogeneous treatment effects at the unit level. Uses honest sample-splitting and local centering to produce valid confidence intervals for the conditional average treatment effect (CATE) across the covariate space.
The CATE: τ(x) = E[Y(1) − Y(0) | X = x] — the expected treatment effect for a unit with covariates x. Unlike DML, which estimates a single average effect θ, the causal forest estimates a separate effect for every unit.
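As a toy illustration of the estimand, the following base-R sketch simulates data with an assumed effect function τ(x) = 2x (the functional form and all numbers are invented for the example): within a narrow slice of the covariate, the treated-control difference in means approximates τ at that point.

```r
set.seed(1)
n <- 5000
x <- runif(n)                    # single covariate
tau <- function(x) 2 * x         # assumed true CATE: effect grows with x
w <- rbinom(n, 1, 0.5)           # randomized binary treatment
y0 <- x + rnorm(n)               # potential outcome under control
y1 <- y0 + tau(x)                # potential outcome under treatment
y <- ifelse(w == 1, y1, y0)      # observed outcome
# Within a narrow slice around x = 0.75, the treated-control mean
# difference approximates tau(0.75) = 1.5
slice <- abs(x - 0.75) < 0.05
est <- mean(y[slice & w == 1]) - mean(y[slice & w == 0])
est
```

A causal forest automates this idea: its leaves play the role of the covariate slice, and its forest weights smooth the local comparison across many trees.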
Trees are built to maximize heterogeneity in treatment effects across leaves, not to maximize prediction accuracy. Each leaf contains units that are similar in covariates but whose treatment effects may differ from other leaves.
Tree structure (which splits to make) is determined on one half of the data. Effect estimates within each leaf are computed on the other half. This prevents overfitting the effect surface to noise in the training data.
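A minimal base-R sketch of the honesty idea on simulated data (one split instead of a full tree; the step-function effect and all constants are assumptions of the toy): the split location is chosen on one half of the sample, and leaf effects are then estimated on the other half.

```r
set.seed(2)
n <- 10000
x <- runif(n)
w <- rbinom(n, 1, 0.5)
tau_true <- ifelse(x > 0.5, 1, 0)  # assumed: effect only above x = 0.5
y <- w * tau_true + rnorm(n)
split_half <- sample(c(TRUE, FALSE), n, replace = TRUE)
# Difference-in-means effect estimate on a subset of units
eff <- function(keep, idx) {
  mean(y[keep & idx & w == 1]) - mean(y[keep & idx & w == 0])
}
# Step 1: choose the split on the structure half, maximizing the
# effect gap between the two candidate leaves
cuts <- seq(0.1, 0.9, by = 0.1)
gap <- sapply(cuts, function(cut) {
  abs(eff(split_half, x < cut) - eff(split_half, x >= cut))
})
best <- cuts[which.max(gap)]
# Step 2: estimate leaf effects on the held-out half (honesty)
c(split = best,
  left  = eff(!split_half, x < best),
  right = eff(!split_half, x >= best))
```

Because the held-out half never influenced where the split landed, the leaf estimates are not contaminated by the split search, which is what makes the variance estimates valid.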
[Figure: distribution of estimated CATEs (τ̂) across units]
Unconfoundedness
Given X, treatment is independent of potential outcomes. Same as DML and matching — causal forest does not handle unobserved confounding. All variables jointly predicting treatment and outcome must be in X.
HOW TO TEST
Theoretical argument. Use domain knowledge to assess whether any important confounders are missing from X.
Overlap
Every unit must have a nonzero probability of receiving either treatment. Causal forests are local estimators — leaves with units that all have the same treatment have no variation to learn from. Extreme propensity scores near 0 or 1 are particularly problematic.
HOW TO TEST
Inspect the propensity score distribution. Trim units outside the region of common support before fitting. grf reports overlap-weighted estimates that downweight low-overlap regions.
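A base-R sketch of the overlap check on simulated confounded data (the covariates, coefficients, and the 0.05/0.95 trimming thresholds are all assumptions of the example; a fitted causal forest stores its own propensity estimates in cf$W.hat, which can be inspected the same way):

```r
set.seed(3)
n <- 2000
age <- rnorm(n, 40, 10)
income <- rnorm(n, 50, 15)
# Treatment assignment depends on covariates (confounded toy data)
p_true <- plogis(-4 + 0.1 * age)
w <- rbinom(n, 1, p_true)
# Stand-in propensity model; with a fitted forest, use cf$W.hat instead
e_hat <- fitted(glm(w ~ age + income, family = binomial))
summary(e_hat)
hist(e_hat, breaks = 40, main = "Estimated propensity scores")
# Trim units with extreme estimated propensities before fitting
keep <- e_hat > 0.05 & e_hat < 0.95
cat("Dropped", sum(!keep), "of", n, "units outside common support\n")
```

Units in the dropped tail have essentially no comparable counterparts in the other treatment arm, so local effect estimates there rest on extrapolation.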
Honesty
Honest splitting is required for valid confidence intervals. With honesty = FALSE, the forest will still estimate CATEs but the variance estimates are invalid. Always use honesty = TRUE for inference.
HOW TO TEST
Verify honesty = TRUE in the fit call. The effective sample size per leaf is halved by honesty — increase num.trees to compensate.
Minimum leaf size
Each leaf needs enough observations to produce stable local effect estimates. The min.node.size parameter controls this. Too-small leaves overfit; too-large leaves underfit heterogeneity. Auto-tuning selects this via cross-validation.
HOW TO TEST
Check the distribution of leaf sizes via the fitted forest. If most leaves have very few observations, increase min.node.size or use tune.parameters = 'all'.
Binary treatment
Causal forests are designed around binary treatment W ∈ {0, 1}. grf's causal_forest also accepts a continuous W, in which case the estimand becomes a local slope under the R-learner (partially linear) framework rather than a CATE between two arms. Multi-arm treatments require the multi-arm causal forest or separate pairwise forests.
Sample size
Reliable CATE estimates require substantially more data than ATE estimates. A practical floor is around 1,000 observations; below roughly 500, heterogeneity is difficult to detect at all. Honesty halves the effective sample per leaf.
Covariate matrix
X should include all potential confounders plus any variables believed to moderate treatment effects. grf handles many covariates well — variable importance output reveals which ones drive heterogeneity.
library(tidyverse)
library(grf)
data <- read_csv("observational_data.csv")
# Causal forest requires matrices, not data frames
Y <- data$outcome # outcome vector
W <- data$treatment # binary treatment vector
X <- data |>
select(age, income, education,
region, employment,
starts_with("covar_")) |>
as.matrix() # covariate matrix
# grf handles missing data poorly — impute or remove first
stopifnot(!anyNA(X), !anyNA(Y), !anyNA(W))
# Check treatment balance
cat("Treatment rate:", mean(W), "\n")
cat("N treated:", sum(W), " | N control:", sum(1-W), "\n")

The causal forest fits a forest of trees, each of which partitions the covariate space to maximize treatment effect heterogeneity within leaves. Unit-level CATE estimates are computed as a weighted local average of outcomes, where weights are determined by how often each unit shares a leaf with the target unit across trees.
library(grf)
# Fit causal forest
# num.trees: more trees = lower variance, diminishing returns after ~2000
# honesty: sample-splits so effect estimates and tree structure use different obs
cf <- causal_forest(
X = X,
Y = Y,
W = W,
num.trees = 2000,
honesty = TRUE, # honest splitting (default)
tune.parameters = "all" # auto-tune min.node.size, mtry, etc.
)
# Overall ATE estimate
ate <- average_treatment_effect(cf, target.sample = "all")
cat("ATE:", round(ate["estimate"], 4),
"| SE:", round(ate["std.err"], 4), "\n")
# ATT estimate
att <- average_treatment_effect(cf, target.sample = "treated")
cat("ATT:", round(att["estimate"], 4), "\n")
# Unit-level CATE predictions
tau_hat <- predict(cf)$predictions

The primary output of a causal forest is the unit-level CATE vector. The next step is to characterize where and why treatment effects vary — which covariates drive heterogeneity, and whether predicted heterogeneity reflects genuine variation or noise.
Variable importance
Reports how frequently each covariate is used in tree splits. High importance means the covariate drives effect heterogeneity. Not a formal test — use best linear projection for inference on which variables predict CATEs.
Best linear projection
Regresses the CATE on a set of covariates of interest using the forest's internal weights. Provides valid standard errors for which variables linearly predict heterogeneity — the formal complement to variable importance.
Calibration test
test_calibration() in grf regresses actual outcomes on predicted CATEs. A coefficient near 1 on the mean prediction indicates the forest is well-calibrated. The differential prediction coefficient tests whether HTE is statistically significant.
RATE (Rank-Weighted ATE)
Measures the benefit of targeting treatment to the units with the highest predicted CATEs. If targeting the top quartile of predicted responders improves outcomes substantially, the heterogeneity is actionable.
VARIABLE IMPORTANCE
library(grf)
library(tidyverse)
# Variable importance — which covariates drive heterogeneity?
# Based on how often each variable is used in tree splits
vim <- variable_importance(cf)
var_names <- colnames(X)
tibble(variable = var_names, importance = vim) |>
arrange(desc(importance)) |>
mutate(variable = fct_reorder(variable, importance)) |>
ggplot(aes(x = importance, y = variable)) +
geom_col(fill = "steelblue", alpha = 0.7) +
labs(title = "Causal forest: variable importance",
x = "Importance", y = NULL)
# Best linear projection — which variables linearly predict the CATE?
blp <- best_linear_projection(cf, A = X[, c("age", "income")])
print(blp)

SUBGROUP & CALIBRATION ANALYSIS
library(grf)
library(tidyverse)   # ntile(), group_by(), summarise()
# Subgroup ATE estimates using CATE predictions
tau_hat <- predict(cf)$predictions
# Split into quartiles by predicted CATE
data$tau_hat <- tau_hat
data$quartile <- ntile(tau_hat, 4)
# Mean predicted CATE within each quartile; for honest subgroup ATEs,
# use average_treatment_effect(cf, subset = data$quartile == k)
quartile_ates <- data |>
group_by(quartile) |>
summarise(
mean_cate = mean(tau_hat),
n = n()
)
print(quartile_ates)
# Calibration test: do predicted CATEs predict actual heterogeneity?
# regresses actual outcomes on predicted CATEs
test_calibration(cf)
# Coefficient on mean.forest.prediction should ≈ 1 (well-calibrated)
# Coefficient on differential.forest.prediction tests HTE significance

The CATE distribution is wide — does that mean there is real heterogeneity?
Not necessarily. Causal forests produce a distribution of estimated CATEs, but some spread is mechanical noise — even with no true heterogeneity, different units receive different local estimates. Use test_calibration() and the RATE statistic to assess whether the predicted heterogeneity explains actual variation in outcomes. A significant differential.forest.prediction coefficient from test_calibration is the most direct evidence.
Some units have negative CATE estimates — should I exclude them?
No — negative CATEs are informative. They indicate units for whom the treatment is estimated to be harmful. If the confidence interval for a negative CATE excludes zero, that is statistically meaningful. Excluding these units would bias the ATE upward and misrepresent the treatment effect distribution.
My ATE from the causal forest differs from the ATE from DML — which is right?
Both are consistent estimators under the same assumptions. Differences arise from their approach to local vs global averaging. Causal forests average CATEs computed locally; DML estimates a global linear coefficient. If treatment effects are genuinely heterogeneous, these can differ. Report both and explain the difference — the comparison is informative.
How do I use CATE estimates for policy targeting?
Sort units by predicted CATE (highest to lowest) and compute the policy-relevant treatment rule: treat only units with predicted CATE above a threshold (e.g. zero, or the cost of treatment). Use the RATE function in grf to estimate the value of this targeting rule and compare it to treating all or none. Report uncertainty in CATE rankings — unit-level CIs are wide; group-level targeting is more reliable.
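The thresholding logic above can be sketched in base R on simulated data with a known effect (the effect function, noise level, and zero cost are all assumptions of the toy; the formal grf tool for evaluating a targeting rule is rank_average_treatment_effect(), not shown here):

```r
set.seed(4)
n <- 5000
x <- runif(n)
tau_true <- x - 0.3              # assumed: treatment harms units with x < 0.3
# Pretend these are CATE predictions from a fitted forest, with noise
tau_hat <- tau_true + rnorm(n, sd = 0.1)
cost <- 0                        # treat when predicted benefit exceeds cost
treat <- tau_hat > cost
# Mean realized effect per unit under each policy
# (known here only because the simulation constructed tau_true)
value_all    <- mean(tau_true)           # treat everyone
value_target <- mean(tau_true * treat)   # treat only predicted responders
cat("Treat all:", round(value_all, 3),
    "| Targeted:", round(value_target, 3), "\n")
```

Targeting beats treating everyone exactly when the rule screens out units with negative effects, which is why the FAQ above warns against discarding negative CATEs: they are what makes targeting valuable.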