Writing

Blog posts, perspective pieces, and research explainers.

Aggregation Hides Out-of-Distribution Generalization Failures from Spurious Correlations

measurement

generalization

By focusing on aggregated performance metrics, standard domain generalization benchmarks inadvertently mask critical failures in which models rely on spurious correlations, underscoring the need for more granular evaluation to ensure true robustness.

Jan 15, 2025

Measurement to Meaning: A Validity-Centered Framework for AI Evaluation

measurement

AI progress isn’t just about better scores, but about understanding what those scores mean, turning raw measurements of behavior into real insight about what AI systems actually know and can do. This post explores the shift from measurement to meaning in how we evaluate AI.

Jan 10, 2025

Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified?

measurement

generalization

Domain generalization benchmarks with accuracy on the line aren’t testing the kinds of challenging shifts we actually care about. This post explores how such benchmarks can be misspecified, rewarding stability over true robustness and obscuring the harder problems the field aims to solve.

Jan 5, 2025

On Domain Generalization Datasets as Proxy Benchmarks for Causal Representation Learning

measurement

generalization

Not every distribution shift benchmark is a good test of causal representation learning. This post argues that while causal representations aim for worst-case robustness, many domain generalization datasets capture easier kinds of shifts, so using them as proxy benchmarks can misrepresent what causal representation learning is really meant to achieve.

Dec 1, 2024

Causally Inspired Regularization Enables Domain General Representations

intervention

generalization

This post explains how causally inspired regularization can promote representations that generalize across domains by encouraging models to capture stable, causally relevant features rather than spurious patterns tied to specific environments.

May 1, 2024

ImageNot: A contrast with ImageNet preserves model rankings

measurement

generalization

ImageNot tests a counterfactual universe of ImageNet to ask whether the scientific advances we make on public benchmarks truly generalize. Despite concerns about false discoveries, it finds surprising external validity: model rankings remain consistent even in a wholly new benchmark environment.

Apr 1, 2024

Adapting to Latent Subgroup Shifts via Concepts and Proxies

intervention

generalization

This post explores how concepts and proxies can help models handle latent subgroup shifts (cases where hidden subpopulations change in ways that degrade performance) by enabling adaptation to latent structure even without explicit group labels.

Apr 1, 2023
