Evaluate model calibration using folktexts
Prerequisite: Install the folktexts package with pip install folktexts, or follow the setup guide in the README.
Summary: This notebook loads a language model from Huggingface, uses folktexts to evaluate the model's calibration, and plots the benchmark results.
1. Check folktexts is installed
[1]:
import folktexts
print(f"{folktexts.__version__=}")
folktexts.__version__='0.6.0'
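If importing the package fails, it can help to check for the installation without importing it. The following is a minimal sketch using only the Python standard library; the helper name installed_version is hypothetical and is not part of folktexts.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Hypothetical helper, not a folktexts function.
found = installed_version("folktexts")
print(found if found else "folktexts is not installed; run: pip install folktexts")
```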
2. Load Model from Huggingface
We use the Mistral 7B (instruct) model for this demo. The workflow can be similarly applied to any model/tokenizer pair.
Note: Set model_name_or_path to the model's name on Huggingface or to the path to a saved pretrained model.
[2]:
from folktexts.llm_utils import load_model_tokenizer
# Note: make sure you have the necessary permissions on Huggingface to download the model
# Note: use gpt2 for the demo if you need a smaller model
# Canonical HF id (gated): "mistralai/Mistral-7B-Instruct-v0.2"
# Using the pre-cached snapshot on this cluster:
model_name_or_path = "/fast/groups/sf/huggingface-models/mistralai--Mistral-7B-Instruct-v0.2"
# model_name_or_path = "gpt2"
model, tokenizer = load_model_tokenizer(model_name_or_path)
3. Create default benchmarking tasks
We generate the ACSIncome benchmark using folktexts.
NOTE: We subsample the reference data for faster runtime. Remove the subsampling to obtain reproducible results on the full data.
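To illustrate what subsampling with a fixed seed means, here is a minimal standard-library sketch. It is not how folktexts implements subsampling internally; the function subsample and its arguments are hypothetical and only show why a fixed seed yields the same subset on every run.

```python
import random


def subsample(rows, ratio, seed=42):
    """Return a reproducible random subset holding about `ratio` of `rows`."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    if not rows:
        return []
    n_keep = max(1, round(len(rows) * ratio))
    rng = random.Random(seed)  # fixed seed -> identical subset on every call
    return rng.sample(rows, n_keep)


rows = list(range(10_000))          # stand-in for the reference data
subset = subsample(rows, ratio=0.01)
print(len(subset))                  # 1% of 10,000 rows
```

A smaller subset runs faster, but metrics computed on it are noisier than those computed on the full data.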
Benchmark configuration
The subsampling and numeric_risk_prompting keyword arguments are examples of optional benchmark configurations. See this page for a list of available configs.
[3]:
%%time
from folktexts.benchmark import Benchmark, BenchmarkConfig
# Note: This argument is optional. Omit, or set to 1 for reproducible benchmarking on the full data
subsampling_ratio = 0.01
bench = Benchmark.make_acs_benchmark(
    model=model,
    tokenizer=tokenizer,
    task_name="ACSIncome",
    subsampling=subsampling_ratio,
    numeric_risk_prompting=True,
    data_dir="/fast/groups/sf/data",  # pre-cached folktables data on this cluster
)
Loading ACS data...
CPU times: user 23.3 s, sys: 13 s, total: 36.3 s
Wall time: 36.4 s
4. Run benchmark
Results will be saved in the folder RESULTS_DIR, which will contain:
a .json file with the evaluated metrics
a .csv file with the risk scores of each datapoint
a folder called imgs/ with the figures
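Once a run has finished, the saved files can be read back with the standard library. The sketch below assumes only the layout described above (JSON metrics, CSV predictions, and an imgs/ folder somewhere under the results folder). The helper collect_results, the file names, and the risk_score column in the demo are hypothetical stand-ins, not names guaranteed by folktexts.

```python
import csv
import json
import tempfile
from pathlib import Path


def collect_results(results_root):
    """Gather metrics (JSON), prediction rows (CSV) and figure paths under a folder."""
    root = Path(results_root)
    metrics = {p.name: json.loads(p.read_text()) for p in sorted(root.rglob("*.json"))}
    predictions = {}
    for p in sorted(root.rglob("*.csv")):
        with p.open(newline="") as f:
            predictions[p.name] = list(csv.DictReader(f))
    figures = sorted(
        str(p) for p in root.rglob("*") if p.is_file() and p.parent.name == "imgs"
    )
    return metrics, predictions, figures


# Demo on a throwaway folder with made-up file names and a made-up column.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "example-run"
    (run_dir / "imgs").mkdir(parents=True)
    (run_dir / "results.json").write_text(json.dumps({"accuracy": 0.5}))
    (run_dir / "predictions.csv").write_text("id,risk_score\n1,0.7\n")
    (run_dir / "imgs" / "roc_curve.pdf").write_bytes(b"%PDF")
    metrics, predictions, figures = collect_results(tmp)

print(sorted(metrics), sorted(predictions), len(figures))
```

In practice, pass the value of RESULTS_DIR instead of the temporary folder.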
[4]:
RESULTS_DIR = "res"
bench.run(results_root_dir=RESULTS_DIR)
[4]:
{'threshold': 0.5,
'n_samples': 1665,
'n_positives': 605,
'n_negatives': 1060,
'model_name': 'mistralai--Mistral-7B-Instruct-v0.2',
'accuracy': 0.6804804804804805,
'tpr': 0.8578512396694215,
'fnr': 0.14214876033057852,
'fpr': 0.4207547169811321,
'tnr': 0.5792452830188679,
'balanced_accuracy': 0.7185482613441447,
'precision': 0.5378238341968912,
'ppr': 0.5795795795795796,
'log_loss': 0.5800311695412786,
'brier_score_loss': 0.19788954954954951,
'tpr_ratio': 0.8217391304347826,
'tpr_diff': 0.1673469387755102,
'fpr_ratio': 0.53475935828877,
'fpr_diff': 0.21750000000000003,
'fnr_ratio': 0.26785714285714285,
'fnr_diff': 0.1673469387755102,
'balanced_accuracy_ratio': 0.9022154478717292,
'balanced_accuracy_diff': 0.07612318751952207,
'precision_ratio': 0.7150197628458498,
'precision_diff': 0.19565807327001355,
'ppr_ratio': 0.5807696212813483,
'ppr_diff': 0.27008110936682367,
'accuracy_ratio': 0.8606341840680588,
'accuracy_diff': 0.10720447379380094,
'tnr_ratio': 0.71,
'tnr_diff': 0.21750000000000003,
'equalized_odds_ratio': 0.26785714285714285,
'equalized_odds_diff': 0.21750000000000003,
'roc_auc': 0.8150732886324653,
'ece': 0.1637657657657665,
'ece_quantile': None,
'threshold_fitted_on': 0,
'sensitive_attribute': 'RAC1P',
'predictions_path': '/lustre/home/acruz/folktexts/notebooks/res/mistralai--Mistral-7B-Instruct-v0.2_bench-3002019411/ACSIncome_subsampled-0.01_seed-42_hash-4175979538.test_predictions.csv',
'config': {'numeric_risk_prompting': True,
'cot_prompting': False,
'enable_thinking': False,
'few_shot_config': None,
'use_chat_template': False,
'chat_prompt': 'default',
'system_prompt': 'default',
'batch_size': None,
'context_size': None,
'correct_order_bias': True,
'feature_subset': None,
'population_filter': None,
'seed': 42,
'prompt_variation': None,
'model_name': 'mistralai--Mistral-7B-Instruct-v0.2',
'model_hash': 3959077460,
'task_name': 'ACSIncome',
'task_hash': 3606936155,
'dataset_name': 'ACSIncome_subsampled-0.01_seed-42_hash-4175979538',
'dataset_subsampling': 0.01,
'dataset_hash': 4175979538},
'benchmark_hash': 3002019411,
'results_dir': '/lustre/home/acruz/folktexts/notebooks/res/mistralai--Mistral-7B-Instruct-v0.2_bench-3002019411',
'results_root_dir': '/lustre/home/acruz/folktexts/notebooks/res',
'current_time': '2026.06.09-16.45.52',
'plots': {'roc_curve_path': '/lustre/home/acruz/folktexts/notebooks/res/mistralai--Mistral-7B-Instruct-v0.2_bench-3002019411/imgs/roc_curve.pdf',
'calibration_curve_path': '/lustre/home/acruz/folktexts/notebooks/res/mistralai--Mistral-7B-Instruct-v0.2_bench-3002019411/imgs/calibration_curve.pdf',
'score_distribution_path': '/lustre/home/acruz/folktexts/notebooks/res/mistralai--Mistral-7B-Instruct-v0.2_bench-3002019411/imgs/score_distribution.pdf',
'score_distribution_per_label_path': '/lustre/home/acruz/folktexts/notebooks/res/mistralai--Mistral-7B-Instruct-v0.2_bench-3002019411/imgs/score_distribution_per_label.pdf',
'roc_curve_per_subgroup_path': '/lustre/home/acruz/folktexts/notebooks/res/mistralai--Mistral-7B-Instruct-v0.2_bench-3002019411/imgs/roc_curve_per_subgroup.pdf',
'calibration_curve_per_subgroup_path': '/lustre/home/acruz/folktexts/notebooks/res/mistralai--Mistral-7B-Instruct-v0.2_bench-3002019411/imgs/calibration_curve_per_subgroup.pdf'}}
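Several of the metrics above are tied together by the confusion matrix. As a sanity check, the sketch below recomputes accuracy, balanced accuracy, precision, and the positive prediction rate (ppr) from the reported tpr, tnr, n_positives, and n_negatives, using the standard textbook definitions. The function name derive_metrics is hypothetical, and the definitions are assumed rather than taken from the folktexts source.

```python
def derive_metrics(tpr, tnr, n_positives, n_negatives):
    """Recompute summary metrics from per-class rates and class counts."""
    tp = tpr * n_positives        # true positives
    tn = tnr * n_negatives        # true negatives
    fp = n_negatives - tn         # false positives
    n = n_positives + n_negatives
    return {
        "accuracy": (tp + tn) / n,
        "balanced_accuracy": (tpr + tnr) / 2,
        "precision": tp / (tp + fp),
        "ppr": (tp + fp) / n,     # share of samples predicted positive
    }


# Inputs copied from the benchmark output above.
derived = derive_metrics(
    tpr=0.8578512396694215,
    tnr=0.5792452830188679,
    n_positives=605,
    n_negatives=1060,
)
for name, value in derived.items():
    print(f"{name}: {value:.4f}")
```

The recomputed values agree with the 'accuracy', 'balanced_accuracy', 'precision', and 'ppr' entries in the output, up to floating-point rounding.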
5. Visualize results
We can also visualize the results inline:
[5]:
bench.plot_results()
[5]:
{'roc_curve_path': '/lustre/home/acruz/folktexts/notebooks/res/mistralai--Mistral-7B-Instruct-v0.2_bench-3002019411/imgs/roc_curve.pdf',
'calibration_curve_path': '/lustre/home/acruz/folktexts/notebooks/res/mistralai--Mistral-7B-Instruct-v0.2_bench-3002019411/imgs/calibration_curve.pdf',
'score_distribution_path': '/lustre/home/acruz/folktexts/notebooks/res/mistralai--Mistral-7B-Instruct-v0.2_bench-3002019411/imgs/score_distribution.pdf',
'score_distribution_per_label_path': '/lustre/home/acruz/folktexts/notebooks/res/mistralai--Mistral-7B-Instruct-v0.2_bench-3002019411/imgs/score_distribution_per_label.pdf',
'roc_curve_per_subgroup_path': '/lustre/home/acruz/folktexts/notebooks/res/mistralai--Mistral-7B-Instruct-v0.2_bench-3002019411/imgs/roc_curve_per_subgroup.pdf',
'calibration_curve_per_subgroup_path': '/lustre/home/acruz/folktexts/notebooks/res/mistralai--Mistral-7B-Instruct-v0.2_bench-3002019411/imgs/calibration_curve_per_subgroup.pdf'}
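The calibration curve and the 'ece' metric both compare predicted risk scores with observed outcome rates. The following is a minimal sketch of expected calibration error with equal-width bins, written with the standard library only. The binning that folktexts uses for its reported 'ece' value may differ, so treat this as an illustration of the idea rather than a reimplementation.

```python
def expected_calibration_error(y_true, scores, n_bins=10):
    """Weighted mean gap between average score and positive rate per score bin."""
    bins = [[] for _ in range(n_bins)]
    for label, score in zip(y_true, scores):
        idx = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((label, score))

    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        positive_rate = sum(label for label, _ in members) / len(members)
        mean_score = sum(score for _, score in members) / len(members)
        ece += len(members) / n * abs(positive_rate - mean_score)
    return ece


# Toy data: scores that match the observed positive rates exactly.
ece_calibrated = expected_calibration_error(
    [1, 0, 0, 0, 1, 1, 1, 0], [0.25] * 4 + [0.75] * 4
)
# Toy data: overconfident scores (0.9 predicted, half of the labels positive).
ece_overconfident = expected_calibration_error([1, 1, 0, 0], [0.9] * 4)
print(ece_calibrated, round(ece_overconfident, 3))
```

A well-calibrated model has a small gap in every bin, which on the calibration curve corresponds to points lying close to the diagonal.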