Pang Wei Koh
CV | linkedin | google scholar

I'm a fourth-year PhD student in Computer Science at Stanford working on machine learning with Percy Liang.


I received my BS and MS in Computer Science from Stanford University in 2013, where I worked with Andrew Ng and Daphne Koller in the Stanford AI Lab. I grew up in Singapore and served as an "AI" (armored infantry) officer before coming to Stanford.

In 2012, I joined Coursera as its third employee. I served as Director of Partnerships and Course Operations for two years, during which I built a team of 25 people working with thousands of instructors and staff from 100+ schools, and then as the product manager in charge of university-facing products. I returned to Stanford in 2015, working for a year with Anshul Kundaje on computational biology. In 2016, I started my PhD in Computer Science at Stanford, working with Percy Liang. I'm supported by a Facebook PhD Fellowship.


For more information on any project, please click on its title. * = equal contribution.

+ Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization
Sagawa S*, Koh PW*, Hashimoto T, Liang P. arXiv. (preprint)
Overparameterized neural networks can be highly accurate on average yet consistently fail on atypical groups of the data (e.g., by learning spurious correlations that hold on average but not in such groups). We show that regularization is critical for worst-group generalization in the overparameterized regime, even if it is not needed for average generalization. By coupling strong regularization with techniques from distributionally robust optimization, we can learn models that attain substantially higher worst-group accuracies.
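To make the worst-group idea concrete, here is a minimal numpy sketch of the objective (the function name and toy data are illustrative, not from the paper):

```python
import numpy as np

def worst_group_loss(losses, groups):
    """Group DRO-style objective: the maximum average loss over groups,
    rather than the overall average loss."""
    group_losses = [losses[groups == g].mean() for g in np.unique(groups)]
    return max(group_losses)

# Toy example: a model that does well on group 0 but poorly on group 1.
losses = np.array([0.1, 0.2, 0.9, 1.1])
groups = np.array([0, 0, 1, 1])
print(losses.mean())                     # the average loss looks moderate
print(worst_group_loss(losses, groups))  # the worst-group loss exposes the failure
```

Minimizing this maximum over groups, rather than the plain average, is what pushes the model to do well on atypical groups.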
+ On the accuracy of influence functions for measuring group effects
Koh PW*, Ang KS*, Teo H*, and Liang P. NeurIPS 2019. (paper) (Github) (Codalab)
Influence functions are based on a first-order Taylor approximation that is guaranteed to be accurate for sufficiently small changes to the model. However, we often want to study the effects of removing large groups of training points, which can result in significant changes to the model. Surprisingly, we find that on many real-world datasets the influence approximation for groups is strikingly accurate, even though our analysis shows that this need not hold in general.
+ Temporal FiLM: Capturing long-range sequence dependencies with feature-wise modulations
Birnbaum S, Kuleshov V, Enam Z, Koh PW, and Ermon S. NeurIPS 2019. (paper)
We introduce a deep neural network architecture that captures long-range dependencies in sequential inputs, and apply it to the problems of text classification, audio super-resolution, and the enhancement of functional genomics assays.
+ Stronger data poisoning attacks break data sanitization defenses
Koh PW*, Steinhardt J*, and Liang P. ICML 2019 Workshop on Security and Privacy of Machine Learning. (preprint) (Github)
Machine learning models trained on data from the outside world can be corrupted by data poisoning attacks that inject malicious points into the models' training sets. A common defense against these attacks is data sanitization: filtering out anomalous training points before the model is trained. Can data poisoning attacks break data sanitization defenses? In this paper, we develop three new attacks that can all bypass a broad range of data sanitization defenses, including commonly-used anomaly detectors based on nearest neighbors, training loss, and singular-value decomposition. Our results underscore the urgent need to develop more sophisticated and robust defenses against data poisoning attacks.
+ Inferring multi-dimensional rates of aging from cross-sectional data
Pierson E*, Koh PW*, Hashimoto T*, Koller D, Leskovec J, Eriksson N, and Liang P. AISTATS 2019. Contributed talk at the ICML/IJCAI 2018 Workshop on Computational Biology. Spotlight talk at the NeurIPS 2018 Workshop on Machine Learning for Health. (paper) (Github)
We study the task of learning how individuals change over time given only cross-sectional data, i.e., a single observation per person. While this task is impossible in general, we give a set of assumptions under which we can correctly learn a model from cross-sectional data, and we demonstrate that it gives reasonable results on the UK Biobank human health dataset.
+ Certified defenses for data poisoning attacks
Steinhardt J*, Koh PW*, and Liang P. NeurIPS 2017. Contributed talk at the ICML 2017 Workshop on Reliable Machine Learning in the Wild. (paper) (Github) (Codalab)
Machine learning systems trained on user-provided data are susceptible to data poisoning attacks, whereby malicious users inject false training data with the aim of corrupting the learned model. While recent work has proposed a number of attacks and defenses, little is understood about the worst-case loss of a defense in the face of a determined attacker. We address this by constructing approximate upper bounds on the loss across a broad family of attacks, for defenders that first perform outlier removal followed by empirical risk minimization. Our bound comes paired with a candidate attack that nearly realizes the bound, giving us a powerful tool for quickly assessing defenses on a given dataset. Empirically, we find that even under a simple defense, the MNIST-1-7 and Dogfish datasets are resilient to attack, while in contrast the IMDB sentiment dataset can be driven from 12% to 23% test error by adding only 3% poisoned data.
+ Understanding black-box predictions via influence functions
Koh PW and Liang P. ICML 2017. Best Paper Award. (paper) (Github) (Codalab)
How can we explain the predictions of a black-box model? In this paper, we use influence functions -- a classic technique from robust statistics -- to trace a model's prediction through the learning algorithm and back to its training data, identifying the points most responsible for a given prediction. On linear models and convolutional neural networks, we demonstrate that influence functions are useful for many different purposes: to understand model behavior, debug models and detect dataset errors, and even identify and exploit vulnerabilities to adversarial training-set attacks.
+ Localized hepatic lobular regeneration by central-vein-associated lineage-restricted progenitors
Tsai JM, Koh PW, Walmsley GG, Poux N, Weissman IL, Rinkevich Y. Proceedings of the National Academy of Sciences 2017. (paper)
When an adult mammalian liver undergoes acute tissue loss (e.g., injury to a lobe through partial hepatectomy), the remaining liver cells undergo a program of expansion and cell division that recovers organ mass but leaves liver morphology and architecture permanently altered. Here, we identify a specific time window after birth in which similar injury instead triggers regeneration that leaves the injured lobe indistinguishable from normal ones. We study this previously unknown program of liver regeneration, using clonal analysis to track the fate of hepatocyte progenitors at the injured sites. These results hint at a therapeutic window in which specific cells can undergo clonal expansion to give rise to normal structure and function in the face of injury.
+ An atlas of transcriptional, chromatin accessibility, and surface marker changes in human mesoderm development
Koh PW*, Sinha R*, Barkal A, Morganti R, Chen A, Weissman I, Ang LT, Kundaje A, Loh K. Scientific Data 2016. (paper)
We study the dynamics of transcription, chromatin accessibility, and surface markers as pluripotent stem cells differentiate through mesoderm intermediates into bone, heart, and other cell types. Using the mesoderm populations described in our related Cell paper, we run bulk-population RNA-seq, single-cell RNA-seq, ATAC-seq, and high-throughput surface marker screening to characterize changes across differentiation. In contrast to the biology-focused Cell paper, this paper focuses on the aspects of data processing, quality control, and computational analysis.
+ A comprehensive roadmap from pluripotency to human bone, heart and other mesoderm cell types
Loh KM*, Chen A*, Koh PW, Deng TZ, Sinha R, ..., Kundaje A, Talbot WS, Beachy PA, Ang LT, Weissman IL. Cell 2016. (paper)
We chart a developmental roadmap that allows us to differentiate pluripotent stem cells into twelve mesodermal lineages, including bone, muscle, and heart. We use this differentiation system to produce pure populations of human bone and heart progenitors that successfully engraft in mouse models in vivo. Our system also allows us to study previously unobservable events in human embryonic development; using single-cell RNA-seq, we discovered a new genetic marker of somite segmentation.
+ Denoising genome-wide histone ChIP-seq with convolutional neural networks
Koh PW*, Pierson E*, Kundaje A. Spotlight talk at ICML 2016 Workshop on Computational Biology (Best Poster Award) and ISMB 2017, and published in Bioinformatics. Selected for reading list of top 10 papers of 2016-2017 in regulatory and systems genomics at RECOMB/ISMB. (paper)
Biological data is often extremely noisy. Can we make use of structure in the data to remove some of the noise? In this work, we focus on chromatin immunoprecipitation sequencing (ChIP-seq) experiments targeting histone modifications and show that a convolutional neural network trained on matching pairs of noisy and high-quality data can significantly improve data quality. This approach is generally applicable to biological problems where it is relatively easy to generate noisy versions of high-quality data, but difficult to analytically characterize the noise or underlying data distributions.
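The paired-data idea can be illustrated with a drastically simplified stand-in: instead of a deep network, fit a single linear 1-D filter by least squares on matched noisy/clean signal pairs (all data below is synthetic and the setup is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, length, k = 200, 64, 9           # number of tracks, track length, filter width
clean = rng.normal(size=(n, length)).cumsum(axis=1)      # smooth-ish "high-quality" tracks
noisy = clean + rng.normal(scale=2.0, size=clean.shape)  # matched noisy versions

# Predict each clean value from a window of the noisy track:
# this fits one learned 1-D convolution filter by least squares.
pad = k // 2
noisy_pad = np.pad(noisy, ((0, 0), (pad, pad)), mode="edge")
X = np.stack([noisy_pad[:, i:i + k] for i in range(length)], axis=1).reshape(-1, k)
y = clean.reshape(-1)
w = np.linalg.lstsq(X, y, rcond=None)[0]    # the learned denoising filter
denoised = (X @ w).reshape(n, length)

print(np.mean((noisy - clean) ** 2))      # error before denoising
print(np.mean((denoised - clean) ** 2))   # error after denoising
```

A deep convolutional network replaces the single linear filter with many stacked, nonlinear ones, but the supervision signal is the same: matched noisy and high-quality versions of the same data.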
+ Identifying genetic drivers of cancer morphology
Koh PW, Beck A, and Koller D. Undergraduate honors thesis. (paper)
Awarded the Firestone Medal for Excellence in Research, the Ben Wegbreit Prize for Best Undergraduate Honors Thesis in CS, the David M. Kennedy Honors Thesis Prize for best thesis across Stanford engineering and applied sciences, and the 2012 Undergraduate Award in Computer Science and Information Technology, an international research award.
Cancer cells have both abnormal morphology and anomalous gene expression. How are morphology and gene expression linked? To answer this, we extracted clinically-relevant features from tumor micrographs, and then developed new multi-task regression methods to associate these image features with gene expression. We used this method to study data from hundreds of breast cancer patients, deriving testable hypotheses about the effect of specific genes on tumor morphology.
+ Peer and self assessment in massive online classes
Kulkarni C, Koh PW, Le H, Chia D, Papadopoulos K, Koller D, Klemmer S. ACM Transactions on Computer-Human Interaction 2013 and Design Thinking Research. (paper)
Can we use peer- and self-assessment in MOOCs to scale up assessment and learning in global classrooms? We analyzed data from the first MOOC to use peer- and self-assessment and showed that these forms of assessment are effective and scalable, with peer grades correlating highly with staff grades. We also experimented with giving graders automatic feedback and using data to design better rubrics, further increasing grading accuracy.
+ Dissecting an online intervention for cancer survivors
Chen Z, Koh PW, Ritter PL, Lorig K, Bantum E, Saria S. Health Education & Behavior 2014. (paper)
The debilitating effects of cancer can last long after initial treatment, even if the cancer is in remission. To cope with this, cancer survivors have increasingly turned towards online peer support groups. Using data from these groups, we studied how online participation affects downstream health outcomes, with an eye towards being able to better design such peer support groups.
+ Sparse filtering
Ngiam J, Koh PW, Chen Z, Bhaskar S, Ng AY. NeurIPS 2011. Spotlight paper. (paper)
Many existing algorithms for unsupervised feature learning require either extensive parameter tuning or are unable to scale to large input sizes. Here, we introduced sparse filtering, a simple new feature learning method that scales gracefully and has only one parameter to tune.
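The objective is simple enough to write down directly; here is a minimal numpy sketch (variable names and the toy data are mine, not the paper's code):

```python
import numpy as np

def sparse_filtering_objective(W, X, eps=1e-8):
    """Sparse filtering objective: soft-absolute-value features,
    normalized per feature (row) and then per example (column),
    scored by their L1 norm. W: features x inputs, X: inputs x examples."""
    F = np.sqrt((W @ X) ** 2 + eps)                       # soft absolute value
    F = F / np.sqrt((F ** 2).sum(axis=1, keepdims=True))  # unit-norm rows
    F = F / np.sqrt((F ** 2).sum(axis=0, keepdims=True))  # unit-norm columns
    return F.sum()                                        # L1 sparsity penalty

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))    # 16 features over 8-dimensional inputs
X = rng.normal(size=(8, 100))   # 100 examples
print(sparse_filtering_objective(W, X))
```

In practice this objective is minimized over W with an off-the-shelf optimizer such as L-BFGS; the single tunable parameter is the number of features.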
+ Learning deep energy models
Ngiam J, Chen Z, Koh PW, Ng AY. ICML 2011. (paper)
We introduced deep energy models, a type of deep generative model which uses several layers of feedforward functions to model the probability distribution of data. Our model admits efficient inference and obtains good generative and classification performance on natural and synthetic image data.
+ On random weights and unsupervised feature learning
Saxe A, Koh PW, Chen Z, Bhand M, Suresh B, Ng AY. ICML 2011. Previously appeared in the Workshop on Deep Learning and Unsupervised Feature Learning, NeurIPS 2010. (paper)
Some feature learning architectures do well on object recognition tasks even when their feature weights are totally untrained and randomized. Why can random weights do so well? We show that certain architectures can be inherently frequency selective and translation invariant, even with random weights; indeed, much of the performance of certain state-of-the-art methods comes from the architecture rather than the training. Based on this, we show how random weights can be used to perform extremely fast architecture searches.
+ Tiled convolutional neural networks
Le Q, Ngiam J, Chen Z, Chia D, Koh PW, Ng AY. NeurIPS 2010. (paper) (visualizations) (code)
Convolutional neural networks, in which small patch-based filters are replicated across the whole image, have seen much success in tasks like digit and object recognition. However, enforcing strict convolution (i.e., each filter is the same at every location) may be unnecessarily restrictive. Here, we proposed tiled convolutional neural networks that use a regular "tiled" pattern of tied weights, avoiding the need for adjacent filters to be identical. This flexibility allows us to learn complex invariances and achieve competitive object classification results.
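In one dimension, the tiling idea amounts to cycling through k distinct filters instead of reusing one filter everywhere; a small illustrative sketch (not the paper's implementation):

```python
import numpy as np

def tiled_conv1d(x, filters):
    """1-D 'tiled' convolution: adjacent positions use different filters,
    repeating with period k = len(filters); k = 1 recovers ordinary convolution."""
    k = len(filters)
    fw = filters.shape[1]                 # filter width
    out = np.empty(len(x) - fw + 1)
    for i in range(len(out)):
        out[i] = filters[i % k] @ x[i:i + fw]
    return out

x = np.arange(6.0)
filters = np.array([[1.0, 0.0], [0.0, 1.0]])  # tile size k = 2
print(tiled_conv1d(x, filters))               # [0. 2. 2. 4. 4.]
```

Setting the tile size to 1 ties all filters together and recovers standard convolution, so the tiled version strictly generalizes it.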
+ Lower bound on the time complexity of local adiabatic evolution
Chen Z, Koh PW, Yan Z. Physical Review A, 2006. (pdf)
We presented two simple approaches for evaluating the time complexity of local adiabatic evolution using time-independent parameters. This lets us calculate the time complexity of algorithms using quantum adiabatic evolution without needing to evaluate the entire time-dependent gap function.


At Coursera, we were fortunate to have troves of data on what makes for effective teaching. I spoke frequently at workshops and conferences about online education and worked with many instructors on their courses. My team designed authoring tools and analytics dashboards for our instructors.

In 2012, I was head TA for CS228 at Stanford, Daphne's class on Probabilistic Graphical Models. Together with 8 other TAs, we revamped the class to make it application-focused and auto-gradable, and successfully taught it to 200+ Stanford students and 100,000+ online learners on the Coursera platform.

Before college, Zhenghao Chen and I created and taught a series of 14 full-day workshops for 100+ high school students, covering introductions to programming, artificial intelligence, cryptography, and computer networking.