Overview
We examine multi-task benchmarks in machine learning through the lens of social
choice theory.
We draw an analogy between benchmarks and electoral systems, where models are
candidates and tasks are voters.
Benchmarks are accordingly divided into two classes:
- Cardinal benchmarks, which aggregate each model's numerical scores across
tasks into a single ranking of models.
- Ordinal benchmarks, which first rank the models on each task and then
aggregate these per-task rankings into a single ranking.
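The distinction can be made concrete with a small sketch. The score matrix, the mean-score cardinal rule, and the Borda-style ordinal rule below are illustrative assumptions, not the aggregation rules of any particular benchmark:

```python
# Illustrative sketch: rows are models (candidates), columns are tasks (voters).
import numpy as np

scores = np.array([        # hypothetical scores, shape (n_models, n_tasks)
    [0.82, 0.55, 0.91],
    [0.79, 0.60, 0.91],
    [0.85, 0.40, 0.88],
])

# Cardinal aggregation: average the numerical scores, then rank the models.
cardinal_ranking = np.argsort(-scores.mean(axis=1))      # best model first

# Ordinal aggregation: convert each task to a ranking first, then combine,
# e.g. with a Borda count (a higher rank on a task earns more points).
per_task_ranks = scores.argsort(axis=0).argsort(axis=0)   # 0 = worst on task
borda_points = per_task_ranks.sum(axis=1)
ordinal_ranking = np.argsort(-borda_points)               # best model first

print(cardinal_ranking, ordinal_ranking)
```

With ordinal aggregation, only the order of models on each task matters, so order-preserving changes to a task's scores cannot affect the outcome.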
Inspired by Arrow's theorem, we introduce two quantitative measures of multi-task benchmarks:
- Sensitivity, which quantifies how strongly irrelevant changes to the tasks
affect a benchmark's ranking.
- Diversity, which captures the degree of disagreement between the models'
rankings across tasks.
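As a rough illustration, the sketch below implements simplified proxies for both measures. The specific definitions (normalized Kendall-tau distances, a monotone rescaling of one task as the "irrelevant change") are assumptions made for illustration, not BenchBench's exact formulas:

```python
# Illustrative proxies: diversity as average pairwise rank disagreement across
# tasks, and sensitivity as the change in a mean-score ranking caused by an
# order-preserving (hence "irrelevant") rescaling of a single task's scores.
import numpy as np
from itertools import combinations
from scipy.stats import kendalltau

def tau_distance(a, b):
    """Normalized Kendall-tau distance in [0, 1] between two score vectors."""
    tau, _ = kendalltau(a, b)
    return (1.0 - tau) / 2.0

def diversity(scores):
    """Mean pairwise rank disagreement between tasks (columns of `scores`)."""
    n_tasks = scores.shape[1]
    pairs = combinations(range(n_tasks), 2)
    return float(np.mean([tau_distance(scores[:, i], scores[:, j])
                          for i, j in pairs]))

def sensitivity(scores, task=0, power=2.0):
    """Disagreement between the mean-score rankings before and after a
    monotone rescaling of one task's (nonnegative) scores, a change that
    leaves every per-task ranking unchanged."""
    perturbed = scores.copy()
    perturbed[:, task] = perturbed[:, task] ** power
    return tau_distance(scores.mean(axis=1), perturbed.mean(axis=1))
```

Both helpers return values in [0, 1], with 0 meaning no disagreement (or no rank change) and 1 meaning complete reversal.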
We maintain BenchBench, a benchmark for evaluating the sensitivity and diversity of multi-task benchmarks. Our initial results on seven cardinal benchmarks and eleven ordinal benchmarks demonstrate a clear trade-off between diversity and stability (the inverse of sensitivity): the more diverse a multi-task benchmark is, the more sensitive it is to irrelevant changes to its tasks.