Notebooks Gallery

Evaluate model calibration using folktexts

Example: Run ACS benchmark task

Configuring prompts — typed & composable prompt variations

Run folktexts benchmark with a different data source

Run folktexts benchmark with a custom ACS task

Render paper plots and tables

Example: ACS benchmark with a web-API model (gpt-5-mini)

Fetch and parse ACS benchmark results under a given directory