Pang Wei Koh
twitter | github | google scholar

I'm a PhD student at Stanford working on machine learning with Percy Liang.


I received my BS and MS in Computer Science from Stanford University in 2013, where I worked with Andrew Ng and Daphne Koller in the Stanford AI Lab. I grew up in Singapore and served as an "AI" (armored infantry) officer before coming to Stanford.

In 2012, I joined Coursera as its third employee. I served as Director of Partnerships and Course Operations for two years, during which I built a team of 25 people working with thousands of instructors and staff from 100+ schools, and then as the product manager in charge of university-facing products. I returned to Stanford in 2015, working for a year with Anshul Kundaje on computational biology. In 2016, I started my PhD in Computer Science at Stanford, working with Percy Liang. I'm supported by a Facebook PhD Fellowship.


* = equal contribution.

WILDS: A benchmark of in-the-wild distribution shifts
Pang Wei Koh*, Shiori Sagawa*, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang
ICML 2021
Long talk
WILDS is a benchmark of 10 distribution shift datasets across diverse data modalities and real-world applications, from tumor identification to wildlife monitoring to poverty mapping. It is available as an open-source Python package that automates data downloading and processing and comes with standardized evaluators and default models for all datasets.
Just Train Twice: Improving group robustness without training group information
Evan Zheran Liu*, Behzad Haghgoo*, Annie S. Chen*, Aditi Raghunathan, Pang Wei Koh, Shiori Sagawa, Percy Liang, and Chelsea Finn
ICML 2021
Long talk
We introduce a simple procedure for improving worst-group performance: train an ERM model, upweight the training points that it misclassifies, and then retrain the model.
Accuracy on the line: On the strong correlation between out-of-distribution and in-distribution generalization
John Miller, Rohan Taori, Aditi Raghunathan, Shiori Sagawa, Pang Wei Koh, Vaishaal Shankar, Percy Liang, Yair Carmon, and Ludwig Schmidt
ICML 2021
We show a surprising correlation between out-of-distribution performance and in-distribution performance for a wide range of models and distribution shifts, and discuss potential reasons, its implications, and several exceptions.
Supporting COVID-19 policy response with large-scale mobility-based modeling
Serina Chang, Mandy L. Wilson, Bryan Lewis, Zakaria Mehrab, Komal K. Dudakiya, Emma Pierson, Pang Wei Koh, Jaline Gerardin, Beth Redbird, David Grusky, Madhav Marathe, and Jure Leskovec
KDD (Applied Data Science Track) 2021
Best paper award
We worked with the Virginia Department of Health to design a decision-support tool that utilizes large-scale data and epidemiological modeling to predict the impact of changes in mobility on COVID infection rates.
Selective classification can magnify disparities across groups
Erik Jones*, Shiori Sagawa*, Pang Wei Koh*, Ananya Kumar, and Percy Liang
ICLR 2021
Spotlight talk at the NeurIPS 2020 ICBINB Workshop
Also presented at the NeurIPS 2020 Algorithmic Fairness Workshop
We show that, surprisingly, selective classification -- where models can abstain when they are not confident in a prediction -- can actually decrease model accuracy on minority groups. We study this phenomenon empirically across five datasets and characterize it theoretically as a function of the margin distribution.
Mobility network models of COVID-19 explain inequities and inform reopening
Serina Y Chang*, Emma Pierson*, Pang Wei Koh*, Jaline Gerardin, Beth Redbird, David Grusky, and Jure Leskovec
Nature 2020
Commentary in Nature News and Views by Kevin Ma and Marc Lipsitch
Interactive article in The New York Times by Yaryna Serkez
Other press by The New York Times; The Washington Post; The Telegraph; Bloomberg; CNN; MIT Technology Review; Wired; STAT; and Stanford News
Also presented at NetSci 2021 (oral), the NeurIPS 2020 ML for Health Workshop, and the NeurIPS 2020 COVID-19 Symposium (invited talk).
We develop epidemiological models on top of dynamic mobility networks, derived from US cell phone data, that capture the hourly movements of millions of people from local neighborhoods to points of interest such as restaurants, grocery stores, or religious establishments. These models correctly predict higher infection rates among disadvantaged racial and socioeconomic groups, and enable fine-grained analysis of disease spread that can inform more effective and equitable policy responses to COVID-19.
Concept bottleneck models
Pang Wei Koh*, Thao Nguyen*, Yew Siang Tang*, Steve Mussmann, Emma Pierson, Been Kim, and Percy Liang
ICML 2020
Spotlight talk at the ICML 2020 Workshop on Human Interpretability in Machine Learning
We revisit learning models that first predict concepts (e.g., the presence of a bone spur) and then the label. This enables interaction in terms of these high-level, human-provided concepts. For example, we can ask: would the model have predicted severe arthritis if it thought there was a bone spur in the x-ray?
An investigation of why overparameterization exacerbates spurious correlations
Shiori Sagawa*, Aditi Raghunathan*, Pang Wei Koh*, and Percy Liang
ICML 2020
We analyze why increasing model size can actually hurt model accuracy on minority groups, even though larger models are known to improve average accuracy.
ExpBERT: Representation engineering with natural language explanations
Shikhar Murty, Pang Wei Koh, and Percy Liang
ACL 2020
We use natural language explanations to specify high-level inductive biases like "Couples who go on honeymoons are married", and show that doing so improves accuracy and data efficiency on relation extraction tasks.
Toward trustworthy AI development: Mechanisms for supporting verifiable claims
Miles Brundage*, Shahar Avin*, Jasmine Wang*, Haydn Belfield*, Gretchen Krueger*, Gillian Hadfield, Heidy Khlaaf, Jingying Yang, Helen Toner, Ruth Fong, Tegan Maharaj, Pang Wei Koh, Sara Hooker, ..., Thomas Krendl Gilbert, Lisa Dyer, Saif Khan, Yoshua Bengio, and Markus Anderljung
arXiv 2020
A multi-institution survey of approaches for increasing trust and verifiability in AI systems.
Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization
Shiori Sagawa*, Pang Wei Koh*, Tatsunori B. Hashimoto, and Percy Liang
ICLR 2020
Overparameterized neural networks can be highly accurate on average yet consistently fail on atypical groups of the data (e.g., by learning spurious correlations that hold on average but not in such groups). We show that regularization is critical for worst-group performance in overparameterized models, even if it is not needed for average performance. By coupling strong regularization with distributionally robust optimization, we can learn models that attain substantially higher worst-group accuracies.
On the accuracy of influence functions for measuring group effects
Pang Wei Koh*, Kai-Siang Ang*, Hubert H. K. Teo*, and Percy Liang
NeurIPS 2019
Influence functions are based on a first-order Taylor approximation that is accurate for sufficiently small changes to the model. However, we often want to study the effects of removing large groups of training points, which can result in significant changes to the model. Surprisingly, we find that on many real-world datasets the influence approximation for groups is still strikingly accurate.
Temporal FiLM: Capturing long-range sequence dependencies with feature-wise modulations
Sawyer Birnbaum*, Volodymyr Kuleshov*, S. Zayd Enam, Pang Wei Koh, and Stefano Ermon
NeurIPS 2019
We introduce a deep neural network architecture that captures long-range dependencies in sequential inputs, and apply it to the problems of text classification, audio super-resolution, and the enhancement of functional genomics assays.
Inferring multi-dimensional rates of aging from cross-sectional data
Emma Pierson*, Pang Wei Koh*, Tatsunori B. Hashimoto*, Daphne Koller, Jure Leskovec, Nicholas Eriksson, and Percy Liang
Contributed talk at the ICML/IJCAI 2018 Workshop on Computational Biology
Spotlight talk at the NeurIPS 2018 Workshop on Machine Learning for Health
We study how individuals change over time given only cross-sectional data, i.e., a single observation per person. While this task is impossible in general, we give assumptions under which we can correctly learn a model from cross-sectional data, and demonstrate our method on the UK Biobank human health dataset.
Stronger data poisoning attacks break data sanitization defenses
Pang Wei Koh*, Jacob Steinhardt*, and Percy Liang
arXiv 2018
Machine learning models can be corrupted by data poisoning attacks that inject malicious points into the models' training sets. We develop three new data poisoning attacks that can break data sanitization defenses, including commonly-used anomaly detectors based on nearest neighbors, training loss, and singular-value decomposition. Our results underscore the urgent need to develop more sophisticated and robust defenses against data poisoning attacks.
Certified defenses for data poisoning attacks
Jacob Steinhardt*, Pang Wei Koh*, and Percy Liang
NeurIPS 2017
Can we bound the worst-case loss that a defender suffers against any data poisoning attack? We construct approximate upper bounds on the loss across a broad family of attacks for defenders that first perform outlier removal followed by empirical risk minimization. We also introduce an attack that nearly realizes the bound, giving us a powerful tool for quickly assessing defenses on a given dataset.
Understanding black-box predictions via influence functions
Pang Wei Koh and Percy Liang
ICML 2017
Best paper award
How can we explain the predictions of a black-box model? We use influence functions -- a classic technique from robust statistics -- to trace a model's prediction through the learning algorithm and back to its training data, identifying the points most responsible for a given prediction. Influence functions can be used to understand model behavior, debug models and detect dataset errors, and even identify and exploit vulnerabilities to training-set attacks.
Localized hepatic lobular regeneration by central-vein-associated lineage-restricted progenitors
Jonathan M. Tsai, Pang Wei Koh, Ania Stefanska, Liujing Xing, Graham G. Walmsley, Nicolas Poux, Irving L. Weissman, and Yuval Rinkevich
Proceedings of the National Academy of Sciences (PNAS) 2017
An adult liver can eventually recover organ mass after suffering acute tissue loss, but its morphology and architecture will be permanently damaged. We identify a specific time window after birth where the injured lobe can regenerate and become indistinguishable from normal ones. These results hint at a therapeutic window in which specific cells can undergo clonal expansion to give rise to normal structure and function in the face of injury.
An atlas of transcriptional, chromatin accessibility, and surface marker changes in human mesoderm development
Pang Wei Koh*, Rahul Sinha*, Amira A. Barkal, Rachel M. Morganti, Angela Chen, Irving L. Weissman, Lay Teng Ang, Anshul Kundaje, and Kyle M. Loh
Scientific Data 2016
We study the dynamics of transcription, chromatin accessibility, and surface markers as pluripotent stem cells differentiate through mesoderm intermediates into bone, heart, and other cell types. As a companion to the biology-focused Cell paper below, this paper focuses on data processing, quality control, and computational analysis.
Mapping the pairwise choices leading from pluripotency to human bone, heart, and other mesoderm cell types
Kyle M. Loh*, Angela Chen*, Pang Wei Koh, Tianda Z. Deng, Rahul Sinha, Jonathan M. Tsai, Amira A. Barkal, Kimberle Y. Shen, Rajan Jain, Rachel M. Morganti, Ng Shyh-Chang, Nathaniel B. Fernhoff, Benson M. George, Gerlinde Wernig, Rachel E.A. Salomon, Zhenghao Chen, Hannes Vogel, Jonathan A. Epstein, Anshul Kundaje, William S. Talbot, Philip A. Beachy, Lay Teng Ang, and Irving L. Weissman
Cell 2016
We chart a developmental roadmap that allows us to differentiate pluripotent stem cells into twelve mesodermal lineages, including bone, muscle, and heart. We use this system to produce pure populations of human bone and heart progenitors that successfully engraft in in vivo mouse models, as well as study previously-unobservable events in human embryonic development such as somite segmentation.
Denoising genome-wide histone ChIP-seq with convolutional neural networks
Pang Wei Koh*, Emma Pierson*, and Anshul Kundaje
Intelligent Systems for Molecular Biology (ISMB) / Bioinformatics 2017
Spotlight talk and best poster award at the ICML 2016 Workshop on Computational Biology
Top 10 papers of 2016-2017 in regulatory and systems genomics at RECOMB/ISMB
Can we use the structure in biological data to remove noise? On chromatin immunoprecipitation sequencing (ChIP-seq) experiments for histone modifications, we show that a convolutional neural network trained on matching pairs of noisy and high-quality data can significantly improve data quality. Our approach is applicable to biological problems where it is relatively easy to generate noisy versions of high-quality data, but difficult to analytically characterize the noise or underlying data distributions.
Dissecting an online intervention for cancer survivors
Zhenghao Chen, Pang Wei Koh, Philip L. Ritter, Kate Lorig, Erin O’Carroll Bantum, and Suchi Saria
Health Education & Behavior 2014
The debilitating effects of cancer can last long after initial treatment, even if the cancer is in remission. To cope with this, cancer survivors have increasingly turned towards online peer support groups. We study how online participation affects downstream health outcomes, with an eye towards being able to better design such peer support groups.
Peer and self assessment in massive online classes
Chinmay Kulkarni, Pang Wei Koh, Huy Le, Daniel Chia, Kathryn Papadopoulos, Justin Cheng, Daphne Koller, and Scott Klemmer
ACM Transactions on Computer-Human Interaction 2013
Can we use peer- and self-assessment in MOOCs to scale up assessment and learning in global classrooms? We analyzed data from the first MOOC to use peer- and self-assessment and showed that these forms of assessment are effective and scalable, with peer grades correlating highly with staff grades. We also experimented with giving graders automatic feedback and using data to design better rubrics.
Identifying genetic drivers of cancer morphology
Pang Wei Koh, Andrew Beck, and Daphne Koller
Undergraduate honors thesis 2012
Firestone Medal for Excellence in Research
Ben Wegbreit Prize for Best Undergraduate Honors Thesis in Computer Science
David M. Kennedy Honors Thesis Prize (best thesis in Stanford Engineering & Appl. Sciences)
Undergraduate Award in Computer Science (an international research award)
Cancer cells have both abnormal morphology and anomalous gene expression. How are morphology and gene expression linked? We extract clinically-relevant features from tumor micrographs and develop new multi-task regression methods to associate these image features with gene expression. We use this method to study data from hundreds of breast cancer patients, deriving testable hypotheses about the effect of specific genes on tumor morphology.
Sparse filtering
Jiquan Ngiam, Pang Wei Koh, Zhenghao Chen, Sonia Bhaskar, and Andrew Y. Ng
NeurIPS 2011
Spotlight paper
Many algorithms for unsupervised feature learning require either extensive parameter tuning or are unable to scale to large input sizes. We introduce sparse filtering, a simple feature learning method that scales gracefully and has only one parameter to tune.
Learning deep energy models
Jiquan Ngiam, Zhenghao Chen, Pang Wei Koh, and Andrew Y. Ng
ICML 2011
We introduce deep energy models, a type of deep generative model which uses several layers of feedforward functions to model the probability distribution of data. Our model admits efficient inference and obtains good generative and classification performance on natural and synthetic image data.
On random weights and unsupervised feature learning
Andrew Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Y. Ng
ICML 2011
Some feature learning architectures do well on object recognition tasks even when their feature weights are totally untrained and randomized. Why can random weights do so well? We show that certain architectures can be inherently frequency selective and translation invariant, even with random weights. Based on this, we show how random weights can be used to perform extremely fast architecture searches.
Tiled convolutional neural networks
Quoc V. Le, Jiquan Ngiam, Zhenghao Chen, Daniel Chia, Pang Wei Koh, and Andrew Y. Ng
NeurIPS 2010
Enforcing strict convolution in neural networks (i.e., where each filter is the same at every location) can be overly restrictive. We propose tiled convolutional neural networks that use a regular tiled pattern of tied weights. This flexibility allows us to learn complex invariances and achieve competitive object classification results.
Lower bound on the time complexity of local adiabatic evolution
Zhenghao Chen, Pang Wei Koh, and Zhao Yan
Physical Review A 2006
We present two simple approaches for evaluating the time complexity of local adiabatic evolution using time-independent parameters, which avoids evaluating the entire time-dependent gap function.


At Coursera, we were fortunate to have troves of data on what makes for effective teaching. I spoke frequently at workshops and conferences about online education and worked with many instructors on their courses. My team designed authoring tools and analytics dashboards for our instructors.

In 2012, I was head TA for CS228 at Stanford, Daphne's class on Probabilistic Graphical Models. Together with 8 other TAs, we revamped the class to make it application-focused and auto-gradable, and successfully taught it to 200+ Stanford students and 100,000+ online learners on the Coursera platform.

Before college, Zhenghao Chen and I created and taught a series of 14 full-day workshops for 100+ high school students, covering introductions to programming, artificial intelligence, cryptography, and computer networking.