AI progress isn’t just about better scores, but about understanding what those scores mean—turning raw measurements of behavior into real insight about what AI systems actually know and can do. This post explores the shift from measurement to meaning in how we evaluate AI.
Domain generalization benchmarks where accuracy is "on the line" (out-of-distribution accuracy rises in lockstep with in-distribution accuracy) aren't testing the kinds of challenging shifts we actually care about. This post explores how such benchmarks can be misspecified, rewarding stability over true robustness and obscuring the harder problems the field aims to solve.
(Incoming!) Not every distribution shift benchmark is a good test of causal representation learning. This post argues that while causal representations aim for worst-case robustness, many domain generalization datasets capture easier kinds of shifts, so using them as proxy benchmarks can misrepresent what causal representation learning is really meant to achieve.
ImageNot constructs a counterfactual universe for ImageNet, asking whether the scientific advances we make on public benchmarks truly generalize. Despite concerns about false discoveries, it finds surprising external validity: model rankings remain consistent even in a wholly new benchmark environment.
This post explores how concepts and proxies can help models handle latent subgroup shifts (cases where hidden subpopulations change in ways that degrade performance) by enabling adaptation to the latent structure even without explicit group labels.
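For intuition only, here is a rough sketch (not the post's actual estimator) of one way a proxy can stand in for missing group labels: infer latent subgroup posteriors in the unlabeled target data from the proxy, then use them to mix subgroup-specific predictors. The names `x_target`, `proxy_target`, and `subgroup_models` are hypothetical placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def adapt_with_proxies(x_target, proxy_target, subgroup_models, n_subgroups=2):
    """Mix subgroup-specific predictors using subgroup posteriors inferred
    from an observed proxy; no explicit group labels are needed."""
    # Recover latent subgroup structure in the target data from the proxy alone.
    gmm = GaussianMixture(n_components=n_subgroups, random_state=0).fit(proxy_target)
    resp = gmm.predict_proba(proxy_target)  # (n, K) inferred subgroup posteriors
    # Each hypothetical subgroup model predicts class probabilities from the features.
    preds = np.stack([m.predict_proba(x_target) for m in subgroup_models], axis=1)  # (n, K, C)
    # Mixture-of-experts combination, weighted by inferred subgroup membership.
    return (resp[:, :, None] * preds).sum(axis=1)  # (n, C)
```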
This post explains how causally inspired regularization can promote representations that generalize across domains by encouraging models to capture stable, causally relevant features rather than spurious patterns tied to specific environments.
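As a concrete (if simplified) instance, the sketch below implements a variance-of-risks penalty across training environments, in the spirit of V-REx; the post's own regularizer may differ. `model`, `env_batches`, and `penalty_weight` are assumed names.

```python
import torch
import torch.nn as nn

def causally_regularized_loss(model, env_batches, penalty_weight=10.0):
    """Average classification risk plus a penalty on how much that risk
    varies across environments. Features whose relationship to the label
    is stable across environments (plausibly causal) incur little penalty;
    environment-specific spurious features do not."""
    criterion = nn.CrossEntropyLoss()
    # One empirical risk per training environment's (x, y) batch.
    risks = torch.stack([criterion(model(x), y) for x, y in env_batches])
    return risks.mean() + penalty_weight * risks.var()
```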