BenchBench

A benchmark for evaluating the sensitivity and diversity of multi-task benchmarks.

Overview

We examine multi-task benchmarks in machine learning through the lens of social choice theory. We draw an analogy between benchmarks and electoral systems, where models are candidates and tasks are voters. Benchmarks accordingly fall into two classes, contrasted in the sketch below:
  • Cardinal benchmarks, which aggregate numerical task scores into a single model ranking.
  • Ordinal benchmarks, which first rank models within each task and then aggregate the per-task rankings.
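
To make the distinction concrete, here is a minimal sketch of both aggregation styles. The toy score matrix, the mean-score rule for the cardinal case, and the Borda count for the ordinal case are illustrative assumptions; individual benchmarks may use different aggregation rules.

```python
import numpy as np

# Toy score matrix: rows are models, columns are tasks (higher is better).
scores = np.array([
    [0.8, 0.6, 0.9],  # model A
    [0.7, 0.9, 0.5],  # model B
    [0.9, 0.4, 0.7],  # model C
])

# Cardinal: aggregate the raw numbers first (here, a per-model mean),
# then rank models by the aggregate score.
cardinal_order = np.argsort(-scores.mean(axis=1))  # model indices, best first

# Ordinal: rank models within each task first, then aggregate the
# rankings (here, Borda count: sum of per-task ranks, lower is better).
per_task_ranks = (-scores).argsort(axis=0).argsort(axis=0)  # 0 = best on a task
ordinal_order = np.argsort(per_task_ranks.sum(axis=1))      # model indices, best first

print("cardinal:", cardinal_order)  # [0 1 2] -> A, B, C
print("ordinal: ", ordinal_order)   # [0 2 1] -> A, C, B
```

The two styles can disagree even on this toy example: model B's large score on task 2 lifts its cardinal rank above model C's, while the ordinal rule only sees that B wins one task and finishes last on the other two.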

Inspired by Arrow's theorem, we introduce two quantitative measures of benchmarks, sketched in code after this list:
  • Sensitivity, which quantifies how much irrelevant changes to tasks affect a benchmark's ranking.
  • Diversity, which captures the degree of disagreement in model rankings across tasks.
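
Below is one plausible way to instantiate both measures; treat the Kendall-tau-based distances, the mean-score aggregate, and the order-preserving perturbation here as assumptions for illustration rather than BenchBench's precise definitions.

```python
from itertools import combinations

import numpy as np
from scipy.stats import kendalltau

def tau_distance(r1, r2):
    # Map Kendall's tau (a correlation in [-1, 1]) to a distance in [0, 1].
    tau, _ = kendalltau(r1, r2)
    return (1.0 - tau) / 2.0

def diversity(scores):
    # Mean pairwise disagreement between per-task model rankings:
    # 0 when every task ranks the models identically.
    n_tasks = scores.shape[1]
    return float(np.mean([
        tau_distance(scores[:, i], scores[:, j])
        for i, j in combinations(range(n_tasks), 2)
    ]))

def sensitivity(scores, aggregate, perturb, trials=100):
    # Mean disagreement between the aggregate ranking before and after
    # a (nominally irrelevant) perturbation of the score matrix.
    base = aggregate(scores)
    return float(np.mean([
        tau_distance(base, aggregate(perturb(scores))) for _ in range(trials)
    ]))

rng = np.random.default_rng(0)
scores = rng.random((5, 4))  # 5 models, 4 tasks

def mean_rank(s):
    # Each model's rank under the mean-score (cardinal) aggregate, 0 = best.
    return (-s.mean(axis=1)).argsort().argsort()

def rescale_one_task(s):
    # Monotone rescaling of one random task: it preserves that task's
    # ranking, so an ordinal benchmark would treat it as irrelevant.
    s = s.copy()
    t = rng.integers(s.shape[1])
    s[:, t] = s[:, t] ** 3
    return s

print("diversity:  ", diversity(scores))
print("sensitivity:", sensitivity(scores, mean_rank, rescale_one_task))
```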

We maintain BenchBench, a benchmark for evaluating the sensitivity and diversity of multi-task benchmarks. Our initial release reports results on seven cardinal benchmarks and eleven ordinal benchmarks, which demonstrate a clear trade-off between diversity and stability: the more diverse a multi-task benchmark is, the more sensitive it is to irrelevant changes.