I am broadly interested in reliable and safe AI, and I work on the science of measuring and controlling AI capabilities. I develop principles and practices for reliable AI evaluation. This includes studying the external validity of key deep learning benchmarks (e.g., ImageNet), the internal validity of benchmarks for out-of-distribution generalization, and frameworks for the valid evaluation of AI capabilities. I also develop methods that enable AI models to generalize and adapt to new environments that differ from their training data, keeping AI systems reliable and safe in dynamic, real-world settings. Application areas of my work include health and medicine, algorithmic fairness, and AI policy.
My research has been supported by a Sloan Scholarship, a Beckman Graduate Research Fellowship, a GEM Associate Fellowship, and an NSF Miniature Brain Machinery Traineeship. I have also interned at Sandia National Laboratories (w/ Dr. Eric Goodman), Google Brain (now Google DeepMind) (w/ Dr. Alex D’Amour), Cruise LLC, and the Max Planck Institute for Intelligent Systems (w/ Dr. Moritz Hardt).
See Publications for more. * denotes equal contribution. α-β denotes alphabetical order.
Valid Measurement of AI Capabilities: AI evaluation is the science of linking measurements, such as benchmark scores, to the real-world capabilities they are meant to represent. My work develops validity-centered analyses of current practices, laying the foundation for more rigorous, scientifically grounded evaluation methods.
Measurement to Meaning: A Validity-Centered Framework for AI Evaluation
Olawale Salaudeen*, Anka Reuel*, Ahmed Ahmed, Suhana Bedi, Zachary Robertson, Sudharsan Sundar, Ben Domingue, Angelina Wang, Sanmi Koyejo
Working Paper
[arXiv] [webpage]
ImageNot: A contrast with ImageNet preserves model rankings
Olawale Salaudeen, Moritz Hardt
In review, 2025
[arXiv] [code] [webpage]
Toward an Evaluation Science for Generative AI Systems
Laura Weidinger, Inioluwa Deborah Raji, Hanna Wallach, Margaret Mitchell, Angelina Wang, Olawale Salaudeen, Rishi Bommasani, Deep Ganguli, Sanmi Koyejo, William Isaac
In The Bridge, National Academy of Engineering
[arXiv]
Building Generalizable AI Systems: Machine learning models often take shortcuts, failing when spurious background details mislead them in new settings. Domain generalization trains models to focus on core features rather than contextual noise, enabling safer behavior in unfamiliar environments. My work has redefined how out-of-distribution (OOD) generalization is evaluated and developed state-of-the-art causal methods to address it.
Building Adaptable AI Systems: Models often encounter new domains, but environmental cues can guide their adjustment. Test-time domain adaptation uses such auxiliary information to keep models reliable and safe in unfamiliar contexts. My work established the first identifiability results for proxy-based adaptation and developed state-of-the-art algorithms for adapting models on the fly.
Proxy Methods for Domain Generalization
Katherine Tsai, Stephen R. Pfohl, Olawale Salaudeen, Nicole Chiou, Matt J. Kusner, Alexander D’Amour, Sanmi Koyejo, Arthur Gretton
In AISTATS 2024
[arXiv] [code]
Adapting to Latent Subgroup Shifts via Concepts and Proxies
α–β. Ibrahim Alabdulmohsin*, Nicole Chiou*, Alexander D’Amour*, Arthur Gretton*, Sanmi Koyejo*, Matt J. Kusner*, Stephen R. Pfohl*, Olawale Salaudeen*, Jessica Schrouff*, Katherine Tsai*
In AISTATS 2023
[arXiv] [code] [webpage]
I am always happy to discuss new research directions; please reach out if there is shared interest!