Location: NYC
On the 2025–26 academic job market, seeking tenure-track positions beginning Fall 2026.
I work on AI for society through the science of measuring latent AI traits and intervening to steer AI behavior. I bridge theory, algorithms, and real-world impact to enable robust evaluation of AI systems, uncover the spurious and causal mechanisms behind their behavior, and design adaptation methods that steer behavior safely in changing environments. My research has been supported by a Sloan Scholarship, a Beckman Graduate Research Fellowship, a GEM Associate Fellowship, and an NSF Miniature Brain Machinery Traineeship.
I am an AI Institute Fellow in Residence at Schmidt Sciences, a Postdoctoral Affiliate at the Massachusetts Institute of Technology (w/ Prof. Marzyeh Ghassemi), and a Postdoctoral Scholar at the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard. I received my Ph.D. in Computer Science from the University of Illinois at Urbana–Champaign in 2024 (w/ Prof. Sanmi Koyejo), where I was also a Visiting Ph.D. Student at Stanford University (2022–2024). I earned my B.S. in Mechanical Engineering with minors in Mathematics and Computer Science from Texas A&M University in 2019. Additionally, I have interned at Sandia National Laboratories (w/ Dr. Eric Goodman), Google Brain (now Google DeepMind) (w/ Dr. Alex D’Amour), Cruise LLC, and the Max Planck Institute for Intelligent Systems (w/ Dr. Moritz Hardt).
Please see my CV for more, or schedule a chat! Contact: olawale [at] mit [dot] edu.
See Publications (and related Blog Posts) for more. * denotes equal contribution. α-β denotes alphabetical order. I am also very happy to discuss new research directions; please reach out if there is shared interest!
AI systems exhibit jagged intelligence—they excel at some tasks but fail at others that share a common human capability. My recent work aims to develop measurements of AI-specific latent traits to enable less jagged, more predictable performance across real-world settings.
Toward an Evaluation Science for Generative AI Systems. Laura Weidinger, Inioluwa Deborah Raji, Hanna Wallach, Margaret Mitchell, Angelina Wang, Olawale Salaudeen, Rishi Bommasani, Deep Ganguli, Sanmi Koyejo, William Isaac. The Bridge 2025, National Academy of Engineering. [paper]
This paper asserts that evaluating generative AI requires more than benchmark scores—it demands a science of measurement that aligns metrics with real-world uses and evolves over time. It articulates how evaluation tools must shift from static, one-off tests to dynamic instruments embedded in institutional ecosystems. In doing so, it charts a path for generative AI evaluation to support trustworthy claims about capability, safety, and deployment in society.
Measurement to Meaning: A Validity-Centered Framework for AI Evaluation. Olawale Salaudeen*, Anka Reuel*, Ahmed Ahmed, Suhana Bedi, Zachary Robertson, Sudharsan Sundar, Ben Domingue, Angelina Wang, Sanmi Koyejo. Working Paper. A preliminary version is accepted at the NeurIPS 2025 Workshop on LLM Evaluation (to appear). [paper] [webpage] [policy brief]
This paper reframes AI evaluation as a science of measurement. It proposes a validity framework that determines when evaluation results, such as benchmark scores, truly support claims about underlying capabilities like reasoning or intelligence. The same framework extends to assessing AI risks, offering a principled basis for reliable, meaningful evaluation.
ImageNot: A contrast with ImageNet preserves model rankings. Olawale Salaudeen, Moritz Hardt. Preprint. [paper] [code] [webpage]
This paper introduces ImageNot, a dataset intentionally designed to differ from ImageNet while matching its scale, to test whether the deep learning revolution sparked by ImageNet was benchmark-specific. Re-training the full trajectory of architectures from that era from scratch on ImageNot reproduces the same pattern of progress, showing that the common practice of iterative development on public benchmarks can yield externally valid methodological advances—even though similar practices often lead to false discoveries in other scientific fields.
AI models often rely on spurious correlations, latching onto easy but unreliable cues for decision-making. I design methods that help models focus on the stable, causal patterns instead, so they behave more reliably when conditions change.
Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified? Olawale Salaudeen, Nicole Chiou, Shiny Weng, Oluwasanmi Koyejo. TMLR 2025 (Awarded TMLR Journal to Conference [J2C] Certification). [paper] [code] [webpage] [news] (A preliminary version, On Domain Generalization Datasets as Proxy Benchmarks for Causal Representation Learning, appeared as an oral at the NeurIPS 2024 Workshop on Causal Representation Learning.)
This paper revisits the design of domain generalization (DG) benchmarks and shows that the widely observed “Accuracy-on-the-Line” pattern—where in-domain and out-of-domain accuracy are strongly and positively correlated—reveals a deeper misspecification. It argues that current DG datasets capture simple or narrow distribution shifts that miss the complexity of real-world variation, such as geographical differences in COVID-19 diagnosis. The paper provides diagnostic analyses and guidance for constructing DG evaluations that genuinely test generalization beyond training domains.
Aggregation Hides OOD Generalization Failures from Spurious Correlations. Olawale Salaudeen, Haoran Zhang, Kumail Alhamoud, Sara Beery, Marzyeh Ghassemi. NeurIPS 2025 (to appear; Spotlight). [paper]
This paper shows that common aggregation practices in evaluating out-of-distribution (OOD) generalization can mask systematic failures caused by spurious correlations. When performance is averaged across heterogeneous subsets, models can appear robust even while failing sharply on specific environments or attributes. The paper introduces diagnostic methods to disentangle these hidden failures, showing that finer-grained evaluation is essential for detecting and mitigating OOD brittleness.
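As a toy illustration of the aggregation issue (a minimal sketch, not the paper's actual analysis; the environments, sizes, and accuracies below are hypothetical), a size-weighted average across environments can report a comfortable overall score while the one environment where the spurious cue flips fails badly:

```python
import numpy as np

# Hypothetical per-environment results: (environment name, n examples, accuracy).
# In the last environment the spurious cue (e.g., image background) no longer
# predicts the label, so a model that relies on it collapses there.
results = [
    ("env_A (cue aligned)", 5000, 0.94),
    ("env_B (cue aligned)", 4000, 0.91),
    ("env_C (cue flipped)", 1000, 0.38),
]

sizes = np.array([n for _, n, _ in results], dtype=float)
accs = np.array([a for _, _, a in results])

pooled = float(np.average(accs, weights=sizes))  # size-weighted "overall" accuracy
macro = float(accs.mean())                       # unweighted average over environments
worst = float(accs.min())                        # worst-environment accuracy

print(f"pooled accuracy: {pooled:.3f}")  # ~0.87: looks robust
print(f"macro average:   {macro:.3f}")   # ~0.74: already less rosy
print(f"worst group:     {worst:.3f}")   # 0.38: the failure that aggregation hides
for name, n, a in results:
    print(f"  {name:22s} n={n:5d} acc={a:.2f}")
```

Reporting only the pooled number would suggest robustness; the disaggregated, worst-group view exposes the brittleness.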
Causally Inspired Regularization Enables Domain General Representations. Olawale Salaudeen, Oluwasanmi Koyejo. AISTATS 2024. [paper] [code] [webpage]
This paper develops a causal regularization method that is effective across a broad class of domain generalization settings, characterized via Reichenbach’s Common Cause Principle, that were previously treated individually. The method yields representations that generalize across domains even when spurious factors are unobserved, improving robustness to distribution shifts across well-specified state-of-the-art benchmarks.
AI behaviors often become unreliable when they encounter new environments, but they can adapt if provided with the right cues. My work develops methods that utilize context available at inference time to adjust model behavior on the fly, ensuring systems remain reliable and safe when conditions change.
Adapting to Latent Subgroup Shifts via Concepts and Proxies. α–β. Ibrahim Alabdulmohsin*, Nicole Chiou*, Alexander D’Amour*, Arthur Gretton*, Sanmi Koyejo*, Matt J. Kusner*, Stephen R. Pfohl*, Olawale Salaudeen*, Jessica Schrouff*, Katherine Tsai*. AISTATS 2023. [paper] [code] [webpage]
This paper studies domain adaptation under latent subgroup shift, where unobserved subpopulations drive changes in both features and labels. It develops a causal framework showing that the Bayes-optimal target predictor is identifiable using observable concepts and proxy variables, even without target labels. The resulting method adapts across hidden subgroups that confound model performance, improving reliability in critical real-world applications.
Proxy Methods for Domain Adaptation. Katherine Tsai, Stephen R. Pfohl, Olawale Salaudeen, Nicole Chiou, Matt J. Kusner, Alexander D’Amour, Sanmi Koyejo, Arthur Gretton. AISTATS 2024. [paper] [code]
This paper studies domain adaptation under latent shift, where unobserved confounders jointly affect inputs and labels, invalidating standard covariate- or label-shift assumptions. It develops a proxy-based adaptation method grounded in proximal causal inference, using observable proxies or access to multiple environments to mitigate the effects of latent distribution shift and align domains without direct supervision.
Improving Single-round Active Adaptation: A Prediction Variability Perspective. Xiaoyang Wang, Yibo Jacky Zhang, Olawale E Salaudeen, Mingyuan Wu, Hongpeng Guo, Chaoyang He, Klara Nahrstedt, Sanmi Koyejo. TMLR 2025. [paper]
When environmental cues are insufficient, labeling data becomes the only way to adapt. We study this single-round adaptation setting—where only one batch can be labeled—and show that using prediction variability to guide selection improves long-term adaptation efficiency.
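A minimal sketch of the general idea of variability-guided selection (an illustrative stand-in, not the paper's exact criterion; the function name, ensemble source, and budget are assumptions for this example):

```python
import numpy as np

def select_by_prediction_variability(prob_ensemble: np.ndarray, budget: int) -> np.ndarray:
    """Pick the unlabeled points whose predictions vary most across an ensemble.

    prob_ensemble: array of shape (n_models, n_points, n_classes) with softmax
        outputs from models trained with different seeds/checkpoints (one possible
        source of prediction variability).
    budget: number of points that can be labeled in the single round.
    """
    # Variance of predicted class probabilities across models, summed over classes.
    variability = prob_ensemble.var(axis=0).sum(axis=-1)  # shape: (n_points,)
    # Spend the labeling budget on the most variable points first.
    return np.argsort(-variability)[:budget]

# Toy usage with random "predictions" from 5 models on 1000 points, 3 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 1000, 3))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
picked = select_by_prediction_variability(probs, budget=32)
print(picked[:10])
```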
Fall 2025. Our [policy brief] on validating claims about AI is now available!
Fall 2025. Three [papers] are accepted to NeurIPS 2025 (main track), including one spotlight selection! (i) Aggregation Hides OOD Generalization Failures from Spurious Correlations (spotlight), (ii) On Group Sufficiency Under Label Bias, and (iii) Understanding challenges to the interpretation of disaggregated evaluations of algorithmic fairness.
Fall 2025. Two [papers] are accepted at the NeurIPS Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling, including one oral selection! (i) On Evaluating Methods vs. Evaluating Models (oral) and (ii) Measurement to Meaning: A Validity-Centered Framework for AI Evaluation.
Fall 2025. I am co-organizing the [workshop] on The Science of Benchmarking and Evaluating AI at EurIPS 2025 in Copenhagen, Denmark, with Yatong Chen, Moritz Hardt, and Joaquin Vanschoren!
Fall 2025. Our [paper] on single-round active learning – Improving Single-round Active Adaptation: A Prediction Variability Perspective – is accepted at TMLR!
Summer 2025. Our [paper] on the limitations of domain generalization benchmarks and solutions – Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified? – is accepted at TMLR!
Summer 2025. Our [preprint] on the limitations of evaluating AI systems with tests carefully designed for human populations – Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead – is now available on arXiv!
Summer 2025. Our [preprint] on interpreting disaggregated evaluations of algorithmic fairness – Understanding challenges to the interpretation of disaggregated evaluations of algorithmic fairness – is now available on arXiv!
Summer 2025. [service]. I am serving as a program chair for the Machine Learning for Health (ML4H) conference in San Diego, CA, in December. Please reach out if you are interested in sponsoring this great conference!
Summer 2025. [honors/appointment]. I will spend the next year at Schmidt Sciences in NYC as a Visiting Scientist (previously titled AI Institute Fellow) starting this summer! Please reach out if you are in NYC!
Spring 2025. [honors/appointment]. I joined the Eric and Wendy Schmidt Center, led by Prof. Caroline Uhler at the Broad Institute of MIT and Harvard, as a postdoctoral scholar.
Spring 2025. Our [paper] Toward an Evaluation Science for Generative AI Systems appeared in the latest edition of The Bridge (National Academy of Engineering) on "AI Promises & Risks."
Spring 2025. I gave a [talk] on addressing distribution shifts with varying levels of deployment distribution information at the MIT LIDS Postdoc NEXUS meeting!
Spring 2025. Our [preprint] on domain generalization benchmarks – Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified? – is now available on arXiv!
Winter 2025. [service]. I am co-organizing the new AI for Society seminar at MIT.
Winter 2025. Our [paper] titled What’s in a Query: Examining Distribution-based Amortized Fair Ranking will appear at the International World Wide Web Conference (WWW), 2025.
Winter 2025. I was selected as an NYU Tandon Faculty First-Look Fellow; I look forward to visiting and giving a [honors/talk] on our work on distribution shifts at NYU in February; news!
Winter 2025. [service]. I am co-organizing the 30th Annual Sanjoy K. Mitter LIDS Student Conference at MIT.
Winter 2025. I was selected as a Georgia Tech FOCUS Fellow; I look forward to visiting and giving a [honors/talk] on our work on distribution shifts at Georgia Tech in January!
Fall 2024. Our [paper] titled On Domain Generalization Datasets as Proxy Benchmarks for Causal Representation Learning will appear at the NeurIPS 2024 Workshop on Causal Representation Learning as an oral presentation.
Fall 2024. [appointment]. I joined the Healthy ML Lab, led by Prof. Marzyeh Ghassemi, at MIT as a postdoctoral associate!
Summer 2024. I gave a talk on our work on distribution shift at Texas State's Computer Science seminar.
Summer 2024. I gave a [talk] on our work on distribution shift at UT Austin's Institute for Foundations of Machine Learning (IFML).
Summer 2024. I successfully defended my PhD dissertation titled “Towards Externally Valid Machine Learning: A Spurious Correlations Perspective”!
Spring 2024. I gave a [talk] on AI for critical systems at the MobiliT.AI forum (May 28-29)!
Spring 2024. I gave a [talk] at UIUC Machine Learning Seminar on our work on the external validity of ImageNet; artifacts here!
Spring 2024. Our [preprint] demonstrating the external validity of ImageNet model/architecture rankings – ImageNot: A contrast with ImageNet preserves model rankings – is now available on arXiv!
Winter 2024. Two [papers] on machine learning under distribution shift will appear at AISTATS 2024 (see Publications)!
Winter 2024. I have returned to Stanford from MPI!
Fall 2023. I will join the Social Foundations of Computation department at the Max Planck Institute for Intelligent Systems in Tübingen, Germany this fall as a Research Intern working with Dr. Moritz Hardt!
Spring 2023. I passed my PhD Preliminary Exam!
Spring 2023. I will join Cruise LLC's Autonomous Vehicles Behaviors team in San Francisco, CA this summer as a Machine Learning Intern!
Fall 2022. I have moved to Stanford University as a "student of new faculty (SNF)" with Professor Sanmi Koyejo!
Summer 2022. I am honored to be selected as a top reviewer (10%) of ICML 2022!
Summer 2022. I will join Google Brain (now Google DeepMind) in Cambridge, MA this summer as a Research Intern!
Fall 2021. Our [paper] titled Exploiting Causal Chains for Domain Generalization was accepted at the 2021 NeurIPS Workshop on Distribution Shift!
Fall 2021. I was selected as a Miniature Brain Machinery (MBM) NSF Research Trainee!
Summer 2021. I was selected to receive an Illinois GEM Associate Fellowship!
Spring 2021. I gave a [talk] on leveraging causal discovery for fMRI denoising at the Beckman Institute Graduate Student Seminar!
Spring 2021. I passed my Ph.D. qualifying exam!
Spring 2020. I was selected to receive a 2020 Beckman Institute Graduate Fellowship!
I am happy to mentor students with overlapping research interests. Particularly for undergrads at MIT, programs like UROP are a great mechanism for mentorship.
More generally, I am very happy and available to give advice and feedback on applying to and navigating both undergraduate and graduate programs in computer science and related disciplines – especially for those to whom this type of feedback and guidance would be otherwise unavailable.