Data

benchbench.data

benchbench.data.load_cardinal_benchmark(dataset_name, do_rerank=True, **kwargs)[source]

Load a cardinal benchmark.

Parameters:
  • dataset_name (str) – Name for the benchmark.

  • do_rerank (bool) – Whether to re-rank the data based on the average score.

  • **kwargs – Additional keyword arguments.

Returns:

A tuple (data, cols), where data is a pd.DataFrame holding the benchmark results and cols is a list of its score columns.

Return type:

tuple
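
Example (a minimal usage sketch; it assumes benchbench is importable and relies only on the signature and return type documented above; treating cols as the score columns is an assumption):

>>> from benchbench.data import load_cardinal_benchmark
>>> data, cols = load_cardinal_benchmark('GLUE')  # 'GLUE' comes from cardinal_benchmark_list below
>>> data[cols].mean(axis=1)  # average over the score columns; the docstring says re-ranking uses the average score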

benchbench.data.load_ordinal_benchmark(dataset_name, do_rerank=True, **kwargs)[source]

Load an ordinal benchmark.

Parameters:
  • dataset_name (str) – Name for the benchmark.

  • do_rerank (bool) – Whether to re-rank the data based on the winning rate.

  • **kwargs – Additional keyword arguments.

Returns:

A tuple (data, cols), where data is a pd.DataFrame holding the benchmark results and cols is a list of its result columns.

Return type:

tuple
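
Example (again a minimal sketch under the same assumptions; 'BigCode' is taken from ordinal_benchmark_list below):

>>> from benchbench.data import load_ordinal_benchmark
>>> data, cols = load_ordinal_benchmark('BigCode', do_rerank=True)
>>> data.head()  # with do_rerank=True, rows should be ordered by winning rate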

benchbench.data.cardinal_benchmark_list = ['GLUE', 'SuperGLUE', 'OpenLLM', 'MMLU', 'BigBenchHard', 'MTEB', 'VTAB']
benchbench.data.ordinal_benchmark_list = ['BigCode', 'HELM-accuracy', 'HELM-bias', 'HELM-calibration', 'HELM-fairness', 'HELM-efficiency', 'HELM-robustness', 'HELM-summarization', 'HELM-toxicity', 'HEIM-alignment_auto', 'HEIM-nsfw', 'HEIM-quality_auto', 'HEIM-aesthetics_auto', 'HEIM-alignment_human', 'HEIM-nudity', 'HEIM-quality_human', 'HEIM-aesthetics_human', 'HEIM-black_out', 'HEIM-originality']
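
The two lists above presumably enumerate the valid dataset_name values for the corresponding loaders, which makes it straightforward to sweep every benchmark. A hedged sketch (it assumes only the module layout documented above; error handling is omitted):

>>> from benchbench import data
>>> for name in data.cardinal_benchmark_list:
...     df, cols = data.load_cardinal_benchmark(name)
...     print(name, df.shape)  # dataset name and table dimensions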