We examine multi-task benchmarks in machine learning through the lens of social choice
theory.
We draw an analogy between benchmarks and electoral systems, where models are candidates
and tasks are voters.
Benchmarks are accordingly divided into two classes:
- Cardinal benchmarks, which aggregate numerical scores into one model ranking.
- Ordinal benchmarks, which aggregate per-task rankings into one model ranking.
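As a minimal sketch of this distinction, the toy example below ranks three hypothetical models on three hypothetical tasks in both ways. The data, the function names, and the choice of aggregation rules (mean score for the cardinal case, Borda count for the ordinal case) are illustrative assumptions, not necessarily the rules used by any particular benchmark.

```python
# Hypothetical toy data: scores[task][model], higher is better.
scores = {
    "task_a": {"m1": 0.95, "m2": 0.40, "m3": 0.30},
    "task_b": {"m1": 0.50, "m2": 0.60, "m3": 0.40},
    "task_c": {"m1": 0.50, "m2": 0.60, "m3": 0.55},
}


def cardinal_ranking(scores):
    """Rank models by mean score across tasks (one possible cardinal rule)."""
    models = sorted(next(iter(scores.values())))
    mean = {m: sum(t[m] for t in scores.values()) / len(scores) for m in models}
    return sorted(models, key=lambda m: -mean[m])


def ordinal_ranking(scores):
    """Rank models by Borda count over per-task rankings (one possible ordinal rule)."""
    models = sorted(next(iter(scores.values())))
    points = {m: 0 for m in models}
    for task_scores in scores.values():
        # Order models from worst to best on this task; a model earns
        # one point for every model it beats.
        worst_to_best = sorted(models, key=lambda m: task_scores[m])
        for position, m in enumerate(worst_to_best):
            points[m] += position
    return sorted(models, key=lambda m: -points[m])


print(cardinal_ranking(scores))  # ['m1', 'm2', 'm3']
print(ordinal_ranking(scores))   # ['m2', 'm1', 'm3']
```

In this toy data, m1 tops the cardinal ranking because of one very high score on task_a, while m2 tops the ordinal ranking because it wins two of the three tasks.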
Inspired by Arrow's theorem, we introduce two quantitative measures of benchmarks:
- Sensitivity, which quantifies the impact that irrelevant changes to tasks have on a benchmark.
- Diversity, which captures the degree of disagreement in model rankings across tasks.
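To make the two notions concrete, the sketch below computes simple stand-ins for both on hypothetical toy data. These are not the measures defined in BenchBench: the diversity proxy (mean pairwise Kendall tau distance between per-task rankings), the sensitivity proxy (how far a mean-score ranking moves when one task's scores are rescaled), and the assumption that such an order-preserving rescaling counts as an irrelevant change are all illustrative choices.

```python
from itertools import combinations

# Hypothetical toy data: scores[task][model], higher is better.
scores = {
    "task_a": {"m1": 0.95, "m2": 0.40, "m3": 0.30},
    "task_b": {"m1": 0.50, "m2": 0.60, "m3": 0.40},
    "task_c": {"m1": 0.50, "m2": 0.60, "m3": 0.55},
}


def kendall_tau_distance(rank_a, rank_b):
    """Fraction of model pairs that the two rankings order differently (0 to 1)."""
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    discordant = sum(
        (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0 for x, y in pairs
    )
    return discordant / len(pairs)


def task_ranking(task_scores):
    """Models ordered from best to worst on a single task."""
    return sorted(task_scores, key=lambda m: -task_scores[m])


def mean_score_ranking(scores):
    """Cardinal aggregate: order models by total (equivalently mean) score."""
    models = list(next(iter(scores.values())))
    total = {m: sum(t[m] for t in scores.values()) for m in models}
    return sorted(models, key=lambda m: -total[m])


def diversity_proxy(scores):
    """Mean pairwise Kendall tau distance between per-task rankings."""
    rankings = [task_ranking(t) for t in scores.values()]
    pairs = list(combinations(rankings, 2))
    return sum(kendall_tau_distance(a, b) for a, b in pairs) / len(pairs)


def sensitivity_proxy(scores, task, factor):
    """Distance the aggregate ranking moves when one task's scores are rescaled.

    Rescaling by a positive factor keeps that task's own ranking intact,
    which is why it is treated here (by assumption) as an irrelevant change.
    """
    rescaled = dict(scores)
    rescaled[task] = {m: s * factor for m, s in scores[task].items()}
    return kendall_tau_distance(
        mean_score_ranking(scores), mean_score_ranking(rescaled)
    )


diversity = diversity_proxy(scores)                       # 4/9 on this data
sensitivity = sensitivity_proxy(scores, "task_b", 10.0)   # 1/3 on this data
```

On this data the three tasks disagree about the model order, and reporting task_b on a 0 to 10 scale instead of 0 to 1 is enough to swap the top two models in the aggregate ranking.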
We maintain BenchBench, a benchmark for evaluating the sensitivity and diversity of
multi-task benchmarks.
Initially, we present results on seven cardinal benchmarks and eleven ordinal benchmarks,
which demonstrate a clear trade-off between diversity and stability:
the more diverse a multi-task benchmark, the more sensitive it is to trivial changes.