SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Classification
Publication date: 7 Oct 2024
Topic: Image Classification
Paper:
https://arxiv.org/pdf/2410.05057v1.pdf
GitHub:
https://github.com/jimmyxu123/select
Description:
In this work, we take steps towards a formal evaluation of data curation strategies and introduce SELECT, the first large-scale benchmark of curation strategies for image classification. To generate baseline methods for the SELECT benchmark, we create a new dataset, ImageNet++, which constitutes the largest superset of ImageNet-1K to date. Our dataset extends ImageNet with 5 new training-data shifts, each approximately the size of ImageNet-1K itself and each assembled using a distinct curation strategy. We evaluate our data curation baselines in two ways: (i) using each training-data shift to train identical image classification models from scratch, and (ii) using the data itself to fit a pretrained self-supervised representation.
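The comparison idea in the description (hold the model and the test set fixed, and vary only the training data) can be illustrated with a small, self-contained sketch. This is not the SELECT code or the authors' protocol: the synthetic features stand in for frozen embeddings from some pretrained encoder, the nearest-centroid probe is a stand-in for whatever fitting procedure the paper uses, and the names `shift_a`, `shift_b`, `make_split`, `fit_centroids`, and `accuracy` are hypothetical.

```python
import numpy as np


def fit_centroids(features, labels, num_classes):
    """Return one mean feature vector per class (a nearest-centroid probe)."""
    return np.stack(
        [features[labels == c].mean(axis=0) for c in range(num_classes)]
    )


def accuracy(centroids, features, labels):
    """Assign each sample to its closest centroid and return top-1 accuracy."""
    dists = np.linalg.norm(
        features[:, None, :] - centroids[None, :, :], axis=2
    )
    return float((dists.argmin(axis=1) == labels).mean())


def make_split(rng, class_means, n_per_class, noise):
    """Synthetic stand-in for frozen embeddings of one data split (assumption)."""
    num_classes, dim = class_means.shape
    labels = np.repeat(np.arange(num_classes), n_per_class)
    features = class_means[labels] + noise * rng.normal(size=(labels.size, dim))
    return features, labels


rng = np.random.default_rng(0)
class_means = rng.normal(size=(5, 16))  # 5 classes, 16-dim features

# One shared held-out set, so every training set is scored identically.
test_x, test_y = make_split(rng, class_means, n_per_class=50, noise=1.0)

# Hypothetical training sets that differ only in how noisy their samples are.
shifts = {"shift_a": 0.5, "shift_b": 2.0}

results = {}
for name, noise in shifts.items():
    train_x, train_y = make_split(rng, class_means, n_per_class=100, noise=noise)
    centroids = fit_centroids(train_x, train_y, num_classes=5)
    results[name] = accuracy(centroids, test_x, test_y)

print(results)
```

Because the probe and the test set are identical across runs, any difference between the entries of `results` is attributable to the training data alone, which is the property a curation benchmark needs.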