Rethinking Robustness: Are Our AI Benchmarks Asking the Right Questions?
In the quest for artificial intelligence that we can trust in the real world, the goal of "domain generalization" is paramount. We aim to build models that can perform reliably when faced with new, unseen environments—a critical capability for applications from medical diagnostics to autonomous driving. A key obstacle is the problem of "spurious correlations," where models learn to rely on incidental features in the training data that are not truly related to the task.
A classic example is a model trained to diagnose disease from chest X-rays. If the training data comes from different hospitals, the model might learn that a specific marker placed on the X-ray by one hospital's machine is a strong predictor of disease, simply because that hospital treats sicker patients. This model fails when deployed to a new hospital that doesn't use that marker. The model learned a shortcut, not the actual pathology.
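To make this failure mode concrete, here is a minimal synthetic sketch: a logistic regression is given a noisy but genuine "pathology" signal alongside a spurious "marker" feature, leans on the marker, and collapses at a "hospital" where the marker no longer tracks the disease. The feature names, the 95% co-occurrence rate, and the data-generating process are illustrative assumptions, not the clinical setup.

```python
# A minimal, self-contained sketch (synthetic data, illustrative feature names)
# of how a classifier can learn the "hospital marker" shortcut. This is a toy
# reproduction of the failure mode, not the X-ray setting itself.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)


def make_data(n, marker_agreement):
    """Generate labels, a weak 'pathology' signal, and a spurious 'marker' feature."""
    y = rng.integers(0, 2, size=n)
    pathology = y + rng.normal(0.0, 1.0, size=n)        # truly related, but noisy
    agrees = rng.random(n) < marker_agreement
    marker = np.where(agrees, y, 1 - y).astype(float)   # incidental hospital tag
    return np.column_stack([pathology, marker]), y


# Training hospital: the marker co-occurs with disease 95% of the time.
X_train, y_train = make_data(5000, marker_agreement=0.95)
# New hospital: the marker is uninformative (50/50).
X_test, y_test = make_data(5000, marker_agreement=0.50)

clf = LogisticRegression().fit(X_train, y_train)
print("training-hospital accuracy:", clf.score(X_train, y_train))
print("new-hospital accuracy:     ", clf.score(X_test, y_test))
print("weights [pathology, marker]:", clf.coef_[0])  # the marker dominates
```

The accuracy gap between the two "hospitals" is the shortcut in action: the model is rewarded for the marker during training and pays for it at deployment.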
To combat this, the field has developed specialized algorithms designed to ignore these spurious patterns. Yet a puzzling trend has emerged: on many popular benchmarks, standard models with no such safeguards, precisely the ones that should be vulnerable to these shortcuts, often achieve the best out-of-distribution (OOD) performance. Furthermore, many of these benchmarks exhibit "accuracy on the line": a strong positive correlation in which models with higher in-distribution accuracy also achieve higher OOD accuracy.
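The "accuracy on the line" check itself is simple. The sketch below assumes you have already evaluated a pool of models on an in-distribution (ID) test set and an OOD test set; the accuracy values are made-up placeholders, not results from any benchmark.

```python
# A minimal sketch of the "accuracy on the line" check. The (ID, OOD) accuracy
# pairs below are hypothetical placeholders, not numbers from any benchmark.
import numpy as np

# One (ID accuracy, OOD accuracy) pair per trained model in the pool.
id_acc = np.array([0.82, 0.85, 0.88, 0.90, 0.93])
ood_acc = np.array([0.61, 0.64, 0.69, 0.72, 0.78])

r = np.corrcoef(id_acc, ood_acc)[0, 1]
print(f"ID-vs-OOD accuracy correlation: r = {r:.2f}")
# An r close to 1 is the "on the line" pattern: whatever improves in-distribution
# accuracy also improves OOD accuracy, leaving little room for specialized
# domain-generalization algorithms to add value.
```

In the literature this correlation is often computed on probit-transformed accuracies rather than raw ones, but the raw version conveys the idea.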
This paradox has led some to question whether targeted algorithms for domain generalization are necessary at all. In our work, we propose a fundamentally different perspective: the problem may lie not with the algorithms, but with the benchmarks themselves.