Evaluate model calibration using folktexts
Prerequisite: Install the folktexts package with optional model API dependencies: pip install 'folktexts[apis]'
Summary: This notebook demonstrates how to use folktexts to evaluate the calibration of a model hosted through a web API.
1. Check folktexts is installed
[1]:
import folktexts
print(f"{folktexts.__version__=}")
folktexts.__version__='0.0.21'
2. Load model API using litellm
We use OpenAI’s GPT-4o-mini model for this demo. The workflow can be similarly applied to any compatible model.
Note: Set model_name to the model's name. See the litellm list of compatible web-API providers and models.
[2]:
model_name = "openai/gpt-4o-mini"
3. Set OPENAI_API_KEY (or the key for the respective API provider)
[ ]:
import os
os.environ["OPENAI_API_KEY"] = "your-key-here" # NOTE: Substitute with your key here!
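Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the key can instead be exported in the shell beforehand and only checked here. The helper `require_api_key` and the `PROVIDER_KEY_VARS` mapping are hypothetical names introduced for illustration; they are not part of folktexts or litellm.

```python
import os

# Hypothetical mapping from the provider prefix of a litellm-style model name
# to the environment variable holding that provider's key. Extend as needed.
PROVIDER_KEY_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def require_api_key(model_name, env=None):
    """Return the name of the env variable that holds the key for `model_name`.

    Raises RuntimeError if the provider is unknown, or if the key is missing
    or still set to the placeholder used in this notebook.
    """
    env = os.environ if env is None else env
    provider = model_name.split("/", 1)[0]  # "openai/gpt-4o-mini" -> "openai"
    if provider not in PROVIDER_KEY_VARS:
        raise RuntimeError(f"No known API key variable for provider {provider!r}")
    var_name = PROVIDER_KEY_VARS[provider]
    value = env.get(var_name, "")
    if not value or value == "your-key-here":
        raise RuntimeError(f"Set {var_name} before running the benchmark")
    return var_name
```

Calling `require_api_key(model_name)` before creating the benchmark fails early with a readable message instead of failing on the first API request.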
3. Create default benchmarking tasks
We generate the ACSIncome benchmark using folktexts.
NOTE: We subsample the reference data for faster runtime. Remove the subsampling to obtain reproducible results.
Benchmark configuration
The subsampling and numeric_risk_prompting keyword arguments are examples of optional benchmark configurations. See this page for a list of available configs.
[4]:
%%time
from folktexts.benchmark import Benchmark, BenchmarkConfig
# Note: This argument is optional. Omit, or set to 1 for reproducible benchmarking on the full data
subsampling_ratio = 0.005
bench = Benchmark.make_acs_benchmark(
    model=model_name,
    task_name="ACSIncome",
    subsampling=subsampling_ratio,
    numeric_risk_prompting=True,
)
WARNING:root:Received non-standard ACS argument 'subsampling' (using subsampling=0.005 instead of default subsampling=None). This may affect reproducibility.
Loading ACS data...
Using zero-shot prompting.
CPU times: user 52.6 s, sys: 1min 23s, total: 2min 16s
Wall time: 2min 21s
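To make the effect of the subsampling ratio concrete, the sketch below draws a seeded random subset from a list of rows. It only illustrates the idea of reproducible subsampling; it is not how folktexts implements the `subsampling` argument, and the function name `subsample` is an assumption made for this example.

```python
import random


def subsample(rows, ratio, seed=42):
    """Return a reproducible random subset holding about `ratio` of `rows`."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    n_keep = max(1, round(len(rows) * ratio))
    rng = random.Random(seed)  # fixed seed: the same subset on every run
    return rng.sample(rows, n_keep)


rows = list(range(10_000))  # stand-in for the rows of a dataset
subset = subsample(rows, ratio=0.005)
print(len(subset))  # 50
```

With a ratio of 0.005, only one row in two hundred is kept, which is why metrics computed on the subsample are noisier than those computed on the full data.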
4. Run benchmark
Results will be saved in the folder RESULTS_DIR: a .json file contains the evaluated metrics, a .csv file contains the risk scores of each datapoint, and a folder called imgs/ contains the figures.
[5]:
RESULTS_DIR = "res"
bench.run(results_root_dir=RESULTS_DIR)
WARNING:root:Failed to compute ECE quantile: The smallest edge difference is numerically 0.
[5]:
{'threshold': 0.5,
'n_samples': 832,
'n_positives': 305,
'n_negatives': 527,
'model_name': 'openai/gpt-4o-mini',
'accuracy': 0.7884615384615384,
'tpr': 0.6885245901639344,
'fnr': 0.3114754098360656,
'fpr': 0.15370018975332067,
'tnr': 0.8462998102466793,
'balanced_accuracy': 0.7674122002053069,
'precision': 0.7216494845360825,
'ppr': 0.34975961538461536,
'log_loss': 0.8249689807687466,
'brier_score_loss': np.float64(0.15596153846153846),
'tpr_ratio': 0.0,
'tpr_diff': 0.782608695652174,
'precision_ratio': 0.0,
'precision_diff': 0.9,
'tnr_ratio': 0.8177339901477833,
'tnr_diff': 0.18226600985221675,
'fnr_ratio': 0.21739130434782608,
'fnr_diff': 0.782608695652174,
'ppr_ratio': 0.0,
'ppr_diff': 0.47619047619047616,
'accuracy_ratio': 0.7,
'accuracy_diff': 0.30000000000000004,
'balanced_accuracy_ratio': 0.5961800818553888,
'balanced_accuracy_diff': 0.33867276887871856,
'fpr_ratio': 0.0,
'fpr_diff': 0.18226600985221675,
'equalized_odds_ratio': 0.0,
'equalized_odds_diff': 0.782608695652174,
'roc_auc': np.float64(0.8337263197187919),
'ece': 0.032091346153846276,
'ece_quantile': None,
'predictions_path': '/lustre/home/acruz/folktexts/notebooks/res/openai/gpt-4o-mini_bench-551234521/ACSIncome_subsampled-0.005_seed-42_hash-305608976.test_predictions.csv',
'config': {'numeric_risk_prompting': True,
'few_shot': None,
'reuse_few_shot_examples': False,
'batch_size': None,
'context_size': None,
'correct_order_bias': True,
'feature_subset': None,
'population_filter': None,
'seed': 42,
'model_name': 'openai/gpt-4o-mini',
'model_hash': 920159687,
'task_name': 'ACSIncome',
'task_hash': 127998692,
'dataset_name': 'ACSIncome_subsampled-0.005_seed-42_hash-305608976',
'dataset_hash': 305608976},
'plots': {'roc_curve_path': '/lustre/home/acruz/folktexts/notebooks/res/openai/gpt-4o-mini_bench-551234521/imgs/roc_curve.pdf',
'calibration_curve_path': '/lustre/home/acruz/folktexts/notebooks/res/openai/gpt-4o-mini_bench-551234521/imgs/calibration_curve.pdf',
'score_distribution_path': '/lustre/home/acruz/folktexts/notebooks/res/openai/gpt-4o-mini_bench-551234521/imgs/score_distribution.pdf',
'score_distribution_per_label_path': '/lustre/home/acruz/folktexts/notebooks/res/openai/gpt-4o-mini_bench-551234521/imgs/score_distribution_per_label.pdf',
'roc_curve_per_subgroup_path': '/lustre/home/acruz/folktexts/notebooks/res/openai/gpt-4o-mini_bench-551234521/imgs/roc_curve_per_subgroup.pdf',
'calibration_curve_per_subgroup_path': '/lustre/home/acruz/folktexts/notebooks/res/openai/gpt-4o-mini_bench-551234521/imgs/calibration_curve_per_subgroup.pdf'}}
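The threshold-based metrics in this dictionary are all derived from a single confusion matrix at the reported threshold of 0.5. As a sanity check, the sketch below recovers the integer counts from the reported rates and class sizes, then recomputes accuracy, precision, positive prediction rate, and balanced accuracy. The function name `confusion_counts` is introduced here for illustration only.

```python
def confusion_counts(n_positives, n_negatives, tpr, tnr):
    """Recover integer confusion-matrix counts from rates and class sizes."""
    tp = round(tpr * n_positives)
    tn = round(tnr * n_negatives)
    return {"tp": tp, "fn": n_positives - tp, "tn": tn, "fp": n_negatives - tn}


# Values copied from the results dictionary above.
tpr, tnr = 0.6885245901639344, 0.8462998102466793
counts = confusion_counts(n_positives=305, n_negatives=527, tpr=tpr, tnr=tnr)
n_samples = 305 + 527

accuracy = (counts["tp"] + counts["tn"]) / n_samples
precision = counts["tp"] / (counts["tp"] + counts["fp"])
ppr = (counts["tp"] + counts["fp"]) / n_samples  # share of positive predictions
balanced_accuracy = (tpr + tnr) / 2

print(counts)  # {'tp': 210, 'fn': 95, 'tn': 446, 'fp': 81}
print(accuracy, precision, ppr, balanced_accuracy)
```

The recomputed values agree with the 'accuracy', 'precision', 'ppr', and 'balanced_accuracy' entries reported by the benchmark.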
4. Visualize results
We can also visualize the results inline:
[6]:
bench.plot_results()
[Six inline figures: notebooks_minimal-example_web-API-model_12_0.png to notebooks_minimal-example_web-API-model_12_5.png]
[6]:
{'roc_curve_path': '/lustre/home/acruz/folktexts/notebooks/res/openai/gpt-4o-mini_bench-551234521/imgs/roc_curve.pdf',
'calibration_curve_path': '/lustre/home/acruz/folktexts/notebooks/res/openai/gpt-4o-mini_bench-551234521/imgs/calibration_curve.pdf',
'score_distribution_path': '/lustre/home/acruz/folktexts/notebooks/res/openai/gpt-4o-mini_bench-551234521/imgs/score_distribution.pdf',
'score_distribution_per_label_path': '/lustre/home/acruz/folktexts/notebooks/res/openai/gpt-4o-mini_bench-551234521/imgs/score_distribution_per_label.pdf',
'roc_curve_per_subgroup_path': '/lustre/home/acruz/folktexts/notebooks/res/openai/gpt-4o-mini_bench-551234521/imgs/roc_curve_per_subgroup.pdf',
'calibration_curve_per_subgroup_path': '/lustre/home/acruz/folktexts/notebooks/res/openai/gpt-4o-mini_bench-551234521/imgs/calibration_curve_per_subgroup.pdf'}
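The calibration curve and the 'ece' metric both compare predicted risk scores with observed outcome frequencies. The sketch below shows a common textbook definition of the expected calibration error (ECE) with equal-width bins: the sample-weighted average of the absolute gap between mean label and mean score per bin. It is an illustration under that assumption and may differ in detail from the binning folktexts uses; it does not attempt the equal-mass variant reported as 'ece_quantile'.

```python
def expected_calibration_error(y_true, scores, n_bins=10):
    """ECE with equal-width bins: the sample-weighted mean of
    |mean label - mean score| over the non-empty bins."""
    if len(y_true) != len(scores) or not scores:
        raise ValueError("y_true and scores must be non-empty and equally long")
    bins = [[] for _ in range(n_bins)]
    for label, score in zip(y_true, scores):
        idx = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((label, score))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins carry no weight
        mean_label = sum(label for label, _ in members) / len(members)
        mean_score = sum(score for _, score in members) / len(members)
        ece += len(members) / len(scores) * abs(mean_label - mean_score)
    return ece


# Toy data: four samples scored 0.8 (three of them positive)
# and four samples scored 0.2 (none of them positive).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.8, 0.8, 0.8, 0.8, 0.2, 0.2, 0.2, 0.2]
ece = expected_calibration_error(y_true, scores)
print(ece)
```

In the toy data, the 0.8 bin is off by 0.05 and the 0.2 bin by 0.2; each holds half the samples, so the ECE is approximately 0.125. A perfectly calibrated model would score 0.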