AI progress isn’t just about better scores, but about understanding what those scores mean—turning raw measurements of behavior into real insight about what AI systems actually know and can do. This post explores the shift from measurement to meaning in how we evaluate AI.
Domain generalization benchmarks where accuracy is "on the line" (out-of-distribution accuracy rises in lockstep with in-distribution accuracy) aren't testing the kinds of challenging shifts we actually care about. This post explores how such benchmarks can be misspecified, rewarding stability over true robustness and obscuring the harder problems the field aims to solve.
(Incoming!) Not every distribution shift benchmark is a good test of causal representation learning. This post argues that while causal representations aim for worst-case robustness, many domain generalization datasets capture easier kinds of shifts, so using them as proxy benchmarks can misrepresent what causal representation learning is really meant to achieve.
ImageNot constructs a counterfactual universe for ImageNet, asking whether the scientific advances we make on public benchmarks truly generalize. Despite concerns about false discoveries, it finds surprising external validity: model rankings remain consistent even in a wholly new benchmark environment.
This post explores how concepts and proxies can help models handle latent subgroup shifts (cases where hidden subpopulations change in ways that degrade performance) by enabling adaptation to the latent structure even without explicit group labels.
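For intuition only, here is a rough sketch (not the post's actual estimator) of one way a proxy can stand in for missing group labels: infer latent subgroup posteriors in the unlabeled target data from the proxy, then use them to mix subgroup-specific predictors. The names `x_target`, `proxy_target`, and `subgroup_models` are hypothetical placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def adapt_with_proxies(x_target, proxy_target, subgroup_models, n_subgroups=2):
    """Mix subgroup-specific predictors using subgroup posteriors inferred
    from an observed proxy; no explicit group labels are needed."""
    # Recover latent subgroup structure in the target data from the proxy alone.
    gmm = GaussianMixture(n_components=n_subgroups, random_state=0).fit(proxy_target)
    resp = gmm.predict_proba(proxy_target)  # (n, K) inferred subgroup posteriors
    # Each hypothetical subgroup model predicts class probabilities from the features.
    preds = np.stack([m.predict_proba(x_target) for m in subgroup_models], axis=1)  # (n, K, C)
    # Mixture-of-experts combination, weighted by inferred subgroup membership.
    return (resp[:, :, None] * preds).sum(axis=1)  # (n, C)
```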
This post explains how causally inspired regularization can promote representations that generalize across domains by encouraging models to capture stable, causally relevant features rather than spurious patterns tied to specific environments.
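As a concrete (if simplified) instance, the sketch below implements a variance-of-risks penalty across training environments, in the spirit of V-REx; the post's own regularizer may differ. `model`, `env_batches`, and `penalty_weight` are assumed names.

```python
import torch
import torch.nn as nn

def causally_regularized_loss(model, env_batches, penalty_weight=10.0):
    """Average classification risk plus a penalty on how much that risk
    varies across environments. Features whose relationship to the label
    is stable across environments (plausibly causal) incur little penalty;
    environment-specific spurious features do not."""
    criterion = nn.CrossEntropyLoss()
    # One empirical risk per training environment's (x, y) batch.
    risks = torch.stack([criterion(model(x), y) for x, y in env_batches])
    return risks.mean() + penalty_weight * risks.var()
```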