CSE599J: Data-centric Machine Learning

Winter 2023-2024
Wednesdays and Fridays, 3pm to 4:20pm
CSE2 371
Gradescope | Ed

Akari Asai

akari@cs.washington.edu

Office hours by appointment.
To contact course staff, please make an Ed post or email both Pang Wei and Akari.
We welcome feedback on the course. If you prefer to leave it anonymously, use this form.

Many advances in machine learning over the past decade have been powered by the increasing availability of larger and more diverse datasets. Where do these datasets come from? What issues are present in these datasets, and how might we deal with them? We will study questions around how we can better use our available data for training, at inference, and for evaluation, as well as ethical, legal, and security issues around data use. The course will be primarily based around paper reading and discussions, with an open-ended course project.

This is a seminar designed for PhD students. Students are expected to be able to read and understand the assigned papers on their own, and they should be familiar with ML and NLP concepts at the level of having taken advanced undergraduate classes.

Schedule

Weekly due dates:

By Monday 11:59pm: Slides for Wednesday's papers (presenters only)
By Tuesday 11:59pm: Paper reflections for Wednesday's papers (everyone)
By Wednesday 11:59pm: Slides for Friday's papers (presenters only)
By Thursday 11:59pm: Paper reflections for Friday's papers (everyone)

1. Data for pretraining

Jan 3 (Wed)

No class

Jan 5 (Fri)

Course overview & dataset construction (slides)

Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

DataComp: In search of the next generation of multimodal datasets

Optional reading

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

LAION-5B: An open large-scale dataset for training next generation image-text models

Demystifying CLIP Data

Jan 10 (Wed)

Scaling laws (slides)

Training Compute-Optimal Large Language Models

Scaling Data-Constrained Language Models

Optional reading

Scaling Laws for Neural Language Models

Jan 12 (Fri)

Data filtering (slides)

Deduplicating Training Data Makes Language Models Better

Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection

Optional reading

SemDeDup: Data-efficient learning at web-scale through semantic deduplication

Beyond neural scaling laws: beating power law scaling via data pruning

Data Filtering Networks

Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks

Confident Learning: Estimating Uncertainty in Dataset Labels

Jan 17 (Wed)

Dataset composition (slides)

A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity

Scaling Laws of Synthetic Images for Model Training ... for Now

Optional reading

Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning

Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP

Data Determines Distributional Robustness in Contrastive Language-Image Pre-training (CLIP)

Improving Multimodal Datasets with Image Captioning

StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners

Jan 19 (Fri)

Biases in datasets (slides)

From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models

Towards Measuring the Representation of Subjective Global Opinions in Language Models

Optional reading

Whose Opinions Do Language Models Reflect?

Can Large Language Models Capture Dissenting Human Voices?

The "Problem" of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation

2. Data for tuning and evaluation

Jan 24 (Wed)

Generative evaluation (slides)

Optional reading

Jan 26 (Fri)

Data for alignment (slides)

Optional reading

Jan 31 (Wed)

Ambiguity and disagreement (slides)

Optional reading

AmbigQA: Answering Ambiguous Open-domain Questions

3. Adapting to different data distributions

Feb 2 (Fri)

Distribution shifts (project proposal due) (slides)

Optional reading

Feb 7 (Wed)

Reweighting data (slides)

Optional reading

Feb 9 (Fri)

Domain adaptation (slides)

Optional reading

4. Linking model output to training data

Feb 14 (Wed)

Data attribution (slides)

Optional reading

Feb 16 (Fri)

Retrieval-based models (slides)

Optional reading

Feb 21 (Wed)

Memorization (slides)

Optional reading

5. Legal, ethical, and security considerations

Feb 23 (Fri)

No class

Feb 28 (Wed)

Copyright (slides)

Talkin' 'Bout AI Generation: Copyright and the Generative-AI Supply Chain

NYT-OpenAI lawsuit

Optional reading

Foundation Models and Fair Use

Glaze: Protecting Artists from Style Mimicry by Text-to-Image Models

Mar 1 (Fri)

Segregating data (slides)

What Does it Mean for a Language Model to Preserve Privacy?

SILO language models: Isolating legal risk in a nonparametric datastore

Mar 6 (Wed)

Data security and robustness

Poisoning Web-Scale Training Datasets is Practical

DECODINGTRUST: A Comprehensive Assessment of Trustworthiness in GPT Models

Optional reading

Poisoning Language Models During Instruction Tuning

Universal and Transferable Adversarial Attacks on Aligned Language Models

Are aligned neural networks adversarially aligned?

Mar 8 (Fri)

Last class; project presentations

Mar 11 (Mon)

No class; project writeups due

Acknowledgements

We are grateful to Sewon Min, Rulin Shao, Ludwig Schmidt, and Tatsunori Hashimoto for their feedback and assistance with this course.

Pang Wei Koh
