Handling Hidden Shifts in Data: A New Strategy for Adaptation
One of the most persistent challenges in deploying machine learning models in the real world is distribution shift. A model trained in one environment—a "source" domain—often fails when applied to a new "target" domain. Consider a model trained to predict patient outcomes using data from Hospital P. When we try to use this model at Hospital Q, its performance may plummet because the two hospitals serve different patient populations, with underlying differences in demographics, socioeconomic status, and patterns of care.
This is the classic problem of unsupervised domain adaptation. Standard approaches often assume the shift is simple. Covariate shift assumes that while the distribution of the features, p(X), changes, the relationship between features and labels, p(Y|X), remains the same. Label shift assumes the label distribution p(Y) changes, but the conditional feature distribution p(X|Y) is stable.
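In symbols (notation ours, with p_S for the source hospital and p_T for the target hospital), the two assumptions amount to different factorizations of the joint distribution, each holding one factor fixed across domains:

```latex
% Covariate shift: the feature marginal changes, the labeling rule does not.
p_S(x, y) = p_S(x)\, p(y \mid x), \qquad p_T(x, y) = p_T(x)\, p(y \mid x)

% Label shift: the label marginal changes, the class-conditional features do not.
p_S(x, y) = p_S(y)\, p(x \mid y), \qquad p_T(x, y) = p_T(y)\, p(x \mid y)
```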
But what happens when the shift is more complex? What if the patient populations at both hospitals are mixtures of underlying, unobserved subgroups (e.g., based on social determinants of health), and it is the prevalence of these subgroups that differs between hospitals? In this scenario, which we call latent subgroup shift, neither the covariate shift assumption nor the label shift assumption holds, rendering standard methods ineffective.
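One way to write this formally (notation ours, with U denoting the unobserved subgroup): both hospitals share the subgroup-conditional distribution, and only the subgroup prevalence changes.

```latex
% Latent subgroup shift: domains share p(x, y | u) for an unobserved
% subgroup U; only the subgroup prevalence p(U) differs across domains.
p_S(x, y) = \sum_{u} p_S(U = u)\, p(x, y \mid U = u)
p_T(x, y) = \sum_{u} p_T(U = u)\, p(x, y \mid U = u)
```

Marginalizing over U generally changes both p(Y|X) and p(X|Y) between domains, which is exactly why neither of the standard corrections applies.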
We tackle this problem head-on. We show that even when the shifting subgroup is unobserved, we can still successfully adapt a model from the source to the target domain by leveraging other forms of auxiliary data.
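To make the goal of adaptation concrete, here is a minimal sketch, not the method developed in this work, of what the correction would look like if the subgroup variable were observed: since only the prevalence of U shifts, reweighting each source example by p_T(u)/p_S(u) recovers the target distribution, just as in label-shift correction applied to U. The hard part, and the point of our approach, is achieving the same effect when U is never observed and must instead be handled through auxiliary data. The synthetic data, subgroup structure, and function names below are hypothetical and purely illustrative; the sketch assumes numpy and scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample_domain(n, subgroup_prevalence):
    """Draw (X, y, u) from a mixture whose components p(x, y | u) are
    shared across domains; only the subgroup prevalence differs."""
    u = rng.choice(len(subgroup_prevalence), size=n, p=subgroup_prevalence)
    # Hypothetical shared mechanism: each subgroup has its own feature
    # mean and its own outcome rate.
    means = np.array([[0.0, 0.0], [2.0, 1.0]])
    outcome_rate = np.array([0.2, 0.7])
    X = means[u] + rng.normal(size=(n, 2))
    y = rng.binomial(1, outcome_rate[u])
    return X, y, u

# Source (Hospital P) and target (Hospital Q) differ only in subgroup prevalence.
p_source, p_target = [0.8, 0.2], [0.3, 0.7]
Xs, ys, us = sample_domain(5000, p_source)
Xt, yt, _ = sample_domain(5000, p_target)

# Oracle correction: if u were observed, weight each source example by
# p_T(u) / p_S(u), exactly as in label-shift correction applied to u.
weights = np.array(p_target)[us] / np.array(p_source)[us]

naive = LogisticRegression().fit(Xs, ys)
reweighted = LogisticRegression().fit(Xs, ys, sample_weight=weights)

print("naive target accuracy:     ", naive.score(Xt, yt))
print("reweighted target accuracy:", reweighted.score(Xt, yt))
```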