CSE599J: Data-centric Machine Learning
Winter 2023-2024
Wednesdays and Fridays, 3pm to 4:20pm
CSE2 371
Gradescope | Ed
Office hours by appointment.
To contact course staff, please make an Ed post or email both Pang Wei and Akari.
We welcome feedback on the course. If you prefer to leave it anonymously, use this form.
Many advances in machine learning over the past decade have been powered by the increasing availability of larger and more diverse datasets. Where do these datasets come from? What issues are present in these datasets, and how might we deal with them? We will study questions around how we can better use our available data for training, at inference, and for evaluation, as well as ethical, legal, and security issues around data use. The course will be primarily based around paper reading and discussions, with an open-ended course project.
This is a seminar designed for PhD students. Students are expected to be able to read and understand the assigned papers on their own, and they should be familiar with ML and NLP concepts at the level of having taken advanced undergraduate classes.
Schedule
Weekly due dates:
- By Monday 11:59pm: Slides for Wednesday's papers (presenters only)
- By Tuesday 11:59pm: Paper reflections for Wednesday's papers (everyone)
- By Wednesday 11:59pm: Slides for Friday's papers (presenters only)
- By Thursday 11:59pm: Paper reflections for Friday's papers (everyone)
1. Data for pretraining
2. Data for tuning and evaluation
Jan 24 (Wed) | Generative evaluation (slides) Optional reading |
Jan 26 (Fri) | Data for alignment (slides) Optional reading |
Jan 31 (Wed) | Ambiguity and disagreement (slides) Optional reading |
3. Adapting to different data distributions
Feb 2 (Fri) | Distribution shifts (project proposal due) (slides) Optional reading |
Feb 7 (Wed) | Reweighting data (slides) Optional reading |
Feb 9 (Fri) | Domain adaptation (slides) |
4. Linking model output to training data
5. Legal, ethical, and security considerations
Feb 23 (Fri) | No class |
Feb 28 (Wed) | Copyright (slides) Optional reading |
Mar 1 (Fri) | Segregating data (slides) |
Mar 6 (Wed) | Data security and robustness Optional reading |
Mar 8 (Fri) | Last class; project presentations |
Mar 11 (Mon) | No class; project writeups due |