CSE599J: Data-centric Machine Learning

Winter 2023-2024
Wednesdays and Fridays, 3pm to 4:20pm
CSE2 371
Gradescope | Ed


Office hours by appointment.
To contact course staff, please make an Ed post or email both Pang Wei and Akari.
We welcome feedback on the course. If you prefer to leave it anonymously, use this form.

Many advances in machine learning over the past decade have been powered by the increasing availability of larger and more diverse datasets. Where do these datasets come from? What issues are present in them, and how might we deal with those issues? We will study how to make better use of available data for training, at inference, and for evaluation, as well as the ethical, legal, and security issues around data use. The course is based primarily on paper readings and discussions, with an open-ended course project.

This is a seminar designed for PhD students. Students are expected to be able to read and understand the assigned papers on their own, and to be familiar with ML and NLP concepts at the level of advanced undergraduate classes.

Schedule

Weekly due dates:
  • By Monday 11:59pm: Slides for Wednesday's papers (presenters only)
  • By Tuesday 11:59pm: Paper reflections for Wednesday's papers (everyone)
  • By Wednesday 11:59pm: Slides for Friday's papers (presenters only)
  • By Thursday 11:59pm: Paper reflections for Friday's papers (everyone)

1. Data for pretraining

Jan 3 (Wed) No class
Jan 5 (Fri) Course overview & dataset construction (slides)
Jan 10 (Wed) Scaling laws (slides)
Jan 12 (Fri) Data filtering (slides)
Jan 17 (Wed) Dataset composition (slides)
Jan 19 (Fri) Biases in datasets (slides)

2. Data for tuning and evaluation

3. Adapting to different data distributions

4. Linking model output to training data

Feb 14 (Wed) Data attribution (slides)
Feb 16 (Fri) Retrieval-based models (slides)
Feb 21 (Wed) Memorization (slides)

5. Legal, ethical, and security considerations


Acknowledgements

We are grateful to Sewon Min, Rulin Shao, Ludwig Schmidt, and Tatsunori Hashimoto for their feedback and assistance with this course.