folktexts package

Subpackages

Submodules

folktexts.benchmark module

A benchmark class for measuring and evaluating LLM calibration.

class folktexts.benchmark.Benchmark(llm_clf, dataset, config=BenchmarkConfig(numeric_risk_prompting=False, cot_prompting=False, enable_thinking=False, few_shot=None, reuse_few_shot_examples=False, balance_few_shot_examples=False, use_chat_template=False, chat_prompt=None, system_prompt=None, batch_size=None, context_size=None, correct_order_bias=True, feature_subset=None, population_filter=None, seed=42))[source]

Bases: object

Measures and evaluates risk scores produced by an LLM.

A benchmark object to measure and evaluate risk scores produced by an LLM.

Parameters:
  • llm_clf (LLMClassifier) – A language model classifier object (can be local or web-hosted).

  • dataset (Dataset) – The dataset object to use for the benchmark.Γ·

  • config (BenchmarkConfig, optional) – The configuration object used to create the benchmark parameters. NOTE: This is used to uniquely identify the benchmark object for reproducibility; it will not be used to change the benchmark behavior. To configure the benchmark, pass a configuration object to the Benchmark.make_benchmark method.

ACS_DATASET_CONFIGS = {'horizon': '1-Year', 'seed': 42, 'subsampling': None, 'survey': 'person', 'survey_year': '2018', 'test_size': 0.1, 'val_size': 0.1}
property configs_dict: dict
classmethod make_acs_benchmark(task_name, *, model, tokenizer=None, data_dir=None, max_api_rpm=None, config=BenchmarkConfig(numeric_risk_prompting=False, cot_prompting=False, enable_thinking=False, few_shot=None, reuse_few_shot_examples=False, balance_few_shot_examples=False, use_chat_template=False, chat_prompt=None, system_prompt=None, batch_size=None, context_size=None, correct_order_bias=True, feature_subset=None, population_filter=None, seed=42), backend=None, model_name_or_path=None, **kwargs)[source]

Create a standardized calibration benchmark on ACS data.

Parameters:
  • task_name (str) – The name of the ACS task to use.

  • model (AutoModelForCausalLM | str) – The transformers language model to use, or the model ID for a webAPI hosted model (e.g., β€œopenai/gpt-4o-mini”).

  • tokenizer (AutoTokenizer, optional) – The tokenizer used to train the model (if using a transformers model). Not required for webAPI models.

  • data_dir (str | Path, optional) – Path to the directory to load data from and save data in.

  • max_api_rpm (int, optional) – The maximum number of API requests per minute for webAPI models.

  • config (BenchmarkConfig, optional) – Extra benchmark configurations, by default will use BenchmarkConfig.default_config().

  • **kwargs – Additional arguments passed to ACSDataset and BenchmarkConfig. By default will use a set of standardized configurations for reproducibility.

Returns:

bench – The ACS calibration benchmark object.

Return type:

Benchmark

classmethod make_benchmark(*, task, dataset, model, tokenizer=None, max_api_rpm=None, config=BenchmarkConfig(numeric_risk_prompting=False, cot_prompting=False, enable_thinking=False, few_shot=None, reuse_few_shot_examples=False, balance_few_shot_examples=False, use_chat_template=False, chat_prompt=None, system_prompt=None, batch_size=None, context_size=None, correct_order_bias=True, feature_subset=None, population_filter=None, seed=42), backend=None, model_name_or_path=None, **kwargs)[source]

Create a calibration benchmark from a given configuration.

Parameters:
  • task (TaskMetadata | str) – The task metadata object or name of the task to use.

  • dataset (Dataset) – The dataset to use for the benchmark.

  • model (AutoModelForCausalLM | str) – The transformers language model to use, or the model ID for a webAPI hosted model (e.g., β€œopenai/gpt-4o-mini”).

  • tokenizer (AutoTokenizer, optional) – The tokenizer used to train the model (if using a transformers model). Not required for webAPI models.

  • max_api_rpm (int, optional) – The maximum number of API requests per minute for webAPI models.

  • config (BenchmarkConfig, optional) – Extra benchmark configurations, by default will use BenchmarkConfig.default_config().

  • **kwargs – Additional arguments for easier configuration of the benchmark. Will simply use these values to update the config object.

Returns:

bench – The calibration benchmark object.

Return type:

Benchmark

property model_name
plot_results(*, show_plots=True)[source]

Render evaluation plots and save to disk.

Parameters:

show_plots (bool, optional) – Whether to show plots, by default True.

Returns:

plots_paths – The paths to the saved plots.

Return type:

dict[str, str]

property results
property results_dir: Path

Get the results directory for this benchmark.

property results_root_dir: Path
run(results_root_dir, fit_threshold=0)[source]

Run the calibration benchmark experiment.

Parameters:
  • results_root_dir (str | Path) – Path to root directory under which results will be saved.

  • fit_threshold (int | bool, optional) – Whether to fit the binarization threshold on a given number of training samples, by default 0 (will not fit the threshold).

Returns:

The benchmark metric value. By default this is the ECE score.

Return type:

float

save_results(results_root_dir=None)[source]

Save the benchmark results to disk.

Parameters:

results_root_dir (str | Path, optional) – Path to root directory under which results will be saved. By default will use self.results_root_dir.

property task
class folktexts.benchmark.BenchmarkConfig(numeric_risk_prompting=False, cot_prompting=False, enable_thinking=False, few_shot=None, reuse_few_shot_examples=False, balance_few_shot_examples=False, use_chat_template=False, chat_prompt=None, system_prompt=None, batch_size=None, context_size=None, correct_order_bias=True, feature_subset=None, population_filter=None, seed=42)[source]

Bases: object

A dataclass to hold the configuration for risk-score benchmark.

numeric_risk_prompting

Whether to prompt for numeric risk-estimates instead of multiple-choice Q&A, by default False.

Type:

bool, optional

cot_prompting

Whether to use chain-of-thought prompting: the model generates free-form reasoning text and ends with a Probability: X% line that is recovered via regex. Works on any model regardless of chat template. By default False.

Type:

bool, optional

enable_thinking

Whether to enable thinking mode for tokenizers that support it (e.g., Qwen3). Only applies when cot_prompting=True. When enabled, calls apply_chat_template(…, enable_thinking=True) and the resulting <think>…</think> block is stripped before regex extraction. Default is False.

Type:

bool, optional

few_shot

Whether to use few-shot prompting with a given number of examples, by default None.

Type:

int | None, optional

reuse_few_shot_examples

Whether to reuse the same samples for few-shot prompting (or sample new ones every time), by default False.

Type:

bool, optional

balance_few_shot_examples

Whether to balance the samples for few-shot prompting with respect to their labels, by default False.

Type:

bool, optional

use_chat_template

Whether to format prompts using the tokenizer’s chat template, by default False. Only supported for local transformers models.

Type:

bool, optional

chat_prompt

The assistant prefill text to use with chat templates. If None, uses the appropriate default for the prompting mode (ANTHROPIC_CHAT_PROMPT for multiple-choice, NUMERIC_CHAT_PROMPT for numeric).

Type:

str | None, optional

system_prompt

Custom system prompt text to use with chat templates. If None, uses the appropriate default for the prompting mode (SYSTEM_PROMPT for multiple-choice, NUMERIC_SYSTEM_PROMPT for numeric).

Type:

str | None, optional

batch_size

The batch size to use for inference.

Type:

int | None, optional

context_size

The maximum context size when prompting the LLM.

Type:

int | None, optional

correct_order_bias

Whether to correct the ordering bias in multiple-choice Q&A when prompting the LLM, by default True.

Type:

bool, optional

feature_subset

Whether to use a subset of the standard feature set for the task. The list should contain the names of the columns of features to use.

Type:

list[str] | None, optional

population_filter

Optional population filter for this benchmark; must follow the format {β€œcolumn_name”: β€œvalue”}.

Type:

dict | None, optional

seed

Random seed – to set for reproducibility.

Type:

int, optional

balance_few_shot_examples: bool = False
batch_size: int | None = None
chat_prompt: str | None = None
context_size: int | None = None
correct_order_bias: bool = True
cot_prompting: bool = False
classmethod default_config(**changes)[source]

Returns the default configuration with optional changes.

enable_thinking: bool = False
feature_subset: list[str] | None = None
few_shot: int | None = None
classmethod load_from_disk(path)[source]

Load the configuration from disk.

numeric_risk_prompting: bool = False
population_filter: dict | None = None
reuse_few_shot_examples: bool = False
save_to_disk(path)[source]

Save the configuration to disk.

seed: int = 42
system_prompt: str | None = None
update(**changes)[source]

Update the configuration with new values.

Return type:

BenchmarkConfig

use_chat_template: bool = False

folktexts.col_to_text module

class folktexts.col_to_text.ColumnToText(name, short_description, value_map=None, question=None, connector_verb='is:', missing_value_fill='N/A', use_value_map_only=False)[source]

Bases: object

Maps a single column’s values to natural text.

Constructs a ColumnToText object.

Parameters:
  • name (str) – The column’s name.

  • short_description (str) – A short description of the column to be used before different values. For example, short_description=”yearly income” will result in β€œThe yearly income is […]”.

  • value_map (dict[int | str, str] | Callable, optional) – A map between column values and their textual meaning. If not provided, will try to infer a mapping from the question.

  • question (QAInterface, optional) – A question associated with the column. If not provided, will try to infer a multiple-choice question from the value_map.

  • connector_verb (str, optional) – Which verb to use when connecting the column’s description to its value; by default β€œis”.

  • missing_value_fill (str, optional) – The value to use when the column’s value is not found in the value_map, by default β€œN/A”.

  • use_value_map_only (bool, optional) – Whether to only use the value_map for mapping values to text, or whether natural language representation should be generated using the connector_verb and short_description as well. By default (False) will construct a natural language representation of the form: β€œThe [short_description] [connector_verb] [value_map.get(val)]”.

get_text(value)[source]

Returns the natural text representation of the given data value.

Return type:

str

property name: str
property question: QAInterface
property short_description: str
property value_map: Callable

Returns the value map function for this column.

folktexts.dataset module

General Dataset functionality for text-based datasets.

class folktexts.dataset.Dataset(data, task, test_size=0.1, val_size=0.1, subsampling=None, seed=42)[source]

Bases: object

Construct a Dataset object.

Parameters:
  • data (pd.DataFrame) – The dataset’s data in pandas DataFrame format.

  • task (TaskMetadata) – The metadata for the prediction task.

  • test_size (float, optional) – The size of the test set, as a fraction of the total dataset size, by default 0.1.

  • val_size (float, optional) – The size of the validation set, as a fraction of the total dataset size, by default 0.1.

  • subsampling (float, optional) – Whether to use sub-sampling, and which fraction of the data to keep. By default will not use sub-sampling (subsampling=None).

  • seed (int, optional) – The random state seed, by default 42.

property data: DataFrame
filter(population_feature_values)[source]

Filter dataset rows in-place.

get_data_split(split)[source]
Return type:

tuple[DataFrame, Series]

get_features_data()[source]
Return type:

DataFrame

get_sensitive_attribute_data()[source]
Return type:

Series

get_target_data()[source]
Return type:

Series

get_test()[source]
get_train()[source]
get_val()[source]
property name: str

A unique name for this dataset.

sample_n_train_examples(n, reuse_examples=False, class_balancing=False)[source]

Return a set of samples from the training set.

Parameters:
  • n (int) – The number of example rows to return.

  • reuse_examples (bool, optional) – Whether to reuse the same examples for consistency. By default will sample new examples each time (reuse_examples=False).

Returns:

X, y – The features and target data for the sampled examples.

Return type:

tuple[pd.DataFrame, pd.Series]

property seed: int
subsample(subsampling)[source]

Subsamples this dataset in-place.

property subsampling: float
property task: TaskMetadata
property test_size: float
property train_size: float
property val_size: float

folktexts.evaluation module

Module to map risk-estimates to a variety of evaluation metrics.

Notes

Code based on the error_parity.evaluation module, at: https://github.com/socialfoundations/error-parity/blob/main/error_parity/evaluation.py

folktexts.evaluation.bootstrap_estimate(eval_func, *, y_true, y_pred_scores, sensitive_attribute=None, k=200, confidence_pct=95, seed=42)[source]

Computes bootstrap estimates of the given evaluation function.

Parameters:
  • eval_func (Callable[[np.ndarray, np.ndarray, np.ndarray], dict[str, float]]) – The evaluation function to run for each bootstrap sample. Must follow the signature eval_func(y_true, y_pred_scores, sensitive_attribute).

  • y_true (np.ndarray) – The true labels.

  • y_pred_scores (np.ndarray) – The predicted scores.

  • sensitive_attribute (np.ndarray, optional) – Optionally, provide the sensitive attribute data to compute fairness metrics, by default None.

  • k (int, optional) – How many bootstrap samples to draw, by default 200.

  • confidence_pct (float, optional) – The confidence interval to use, in percentage, by default 95.

  • seed (int, optional) – The random seed, by default 42.

Returns:

results – A dictionary containing bootstrap estimates for a variety of metrics.

Return type:

dict[str, float]

folktexts.evaluation.compute_best_threshold(y_true, y_pred_scores, *, false_pos_cost=1.0, false_neg_cost=1.0)[source]

Computes the binarization threshold that maximizes accuracy.

Parameters:
  • y_true (np.ndarray) – The true class labels.

  • y_pred_scores (np.ndarray) – The predicted risk scores.

  • false_pos_cost (float, optional) – The cost of a false positive error, by default 1.0

  • false_neg_cost (float, optional) – The cost of a false negative error, by default 1.0

Returns:

best_threshold – The threshold value that maximizes accuracy for the given predictions.

Return type:

float

folktexts.evaluation.evaluate_binary_predictions(y_true, y_pred)[source]

Evaluates the provided binary predictions on common performance metrics.

Parameters:
  • y_true (np.ndarray) – The true class labels.

  • y_pred (np.ndarray) – The binary predictions.

Returns:

A dictionary with key-value pairs of (metric name, metric value).

Return type:

dict

folktexts.evaluation.evaluate_binary_predictions_fairness(y_true, y_pred, sensitive_attribute, return_groupwise_metrics=False, min_group_size=0.04)[source]

Evaluates fairness of the given predictions.

Fairness metrics are computed as the ratios between group-wise performance metrics.

Parameters:
  • y_true (np.ndarray) – The true class labels.

  • y_pred (np.ndarray) – The discretized predictions.

  • sensitive_attribute (np.ndarray) – The sensitive attribute (protected group membership).

  • return_groupwise_metrics (bool, optional) – Whether to return group-wise performance metrics (bool: True) or only the ratios between these metrics (bool: False), by default False.

  • min_group_size (float, optional) – The minimum fraction of samples (as a fraction of the total number of samples) that a group must have to be considered for fairness evaluation, by default 0.04. This is meant to avoid evaluating metrics on very small groups which leads to noisy and inconsistent results.

Returns:

A dictionary with key-value pairs of (metric name, metric value).

Return type:

dict

folktexts.evaluation.evaluate_predictions(y_true, y_pred_scores, *, sensitive_attribute=None, threshold='best', model_name=None)[source]

Evaluates predictions on common performance and fairness metrics.

Parameters:
  • y_true (np.ndarray) – The true class labels.

  • y_pred_scores (np.ndarray) – The predicted scores.

  • sensitive_attribute (np.ndarray, optional) – The sensitive attribute data. Will compute fairness metrics if provided.

  • threshold (float | str, optional) – The threshold to use for binarizing the predictions, or β€œbest” to infer which threshold maximizes accuracy.

  • model_name (str, optional) – The name of the model to be used on the plots, by default None.

Returns:

results – A dictionary with key-value pairs of (metric name, metric value).

Return type:

dict

folktexts.evaluation.evaluate_predictions_bootstrap(y_true, y_pred_scores, *, sensitive_attribute=None, threshold='best', k=200, confidence_pct=95, seed=42)[source]

Computes bootstrap estimates of classification metrics for the given predictions.

Parameters:
  • y_true (np.ndarray) – The true labels.

  • y_pred_scores (np.ndarray) – The score predictions.

  • sensitive_attribute (np.ndarray, optional) – The sensitive attribute data. Will compute fairness metrics if provided.

  • threshold (float | str, optional) – The threshold to use for binarizing the predictions, or β€œbest” to infer which threshold maximizes accuracy, by default β€œbest”.

  • k (int, optional) – How many bootstrap samples to draw, by default 200.

  • confidence_pct (float, optional) – How large of a confidence interval to use when reporting lower and upper bounds, by default 95 (i.e., 2.5 to 97.5 percentile of results).

  • seed (int, optional) – The random seed, by default 42.

Returns:

results – A dictionary containing bootstrap estimates for a variety of metrics.

Return type:

dict[str, float]

folktexts.llm_utils module

Common functions to use with transformer LLMs.

folktexts.llm_utils.add_pad_token(tokenizer)[source]

Add a pad token to the model and tokenizer if it doesn’t already exist.

Here we’re using the end-of-sentence token as the pad token. Both the model weights and tokenizer vocabulary are untouched.

Another possible way would be to add a new token [PAD] to the tokenizer and update the tokenizer vocabulary and model weight embeddings accordingly. The embedding for the new pad token would be the average of all other embeddings.

folktexts.llm_utils.decode_topk_logprobs_to_risk_estimate(per_pass_topk, *, tokenizer_vocab, vocab_dim, question)[source]

Convert top-K log-probabilities into a single risk-estimate float.

Parameters:
  • per_pass_topk (list[dict[int, float]]) – One dict per generated token position, mapping token_id -> log-prob. The token_ids must match the values in tokenizer_vocab. Tokens absent from the top-K are assumed to have probability ~0.

  • tokenizer_vocab (dict[str, int]) – Token string -> token_id map used by the QA decoder for prefix-variant lookup (MultipleChoiceQA) or digit/decimal lookup (DirectNumericQA).

  • vocab_dim (int) – Size of the linear-probability array’s vocab axis. For local backends this is model.config.vocab_size (the logits axis); for the synthetic WebAPI path it is the size of the synthesised vocab.

  • question (MultipleChoiceQA | DirectNumericQA) – The QA interface used to interpret the probabilities.

Returns:

risk_estimate – Risk score in [0, 1] from question.get_answer_from_model_output.

Return type:

float

Notes

Both the WebAPI backend (top_logprobs=20 from OpenAI-style responses) and the vLLM backend (top-K logprobs from SamplingParams(logprobs=K)) call this helper. The transformers backend reads the full softmax directly and bypasses this path; see query_model_batch_multiple_passes.

folktexts.llm_utils.generate_text_batch(text_inputs, model, tokenizer, max_new_tokens=1024, context_size=None, enable_thinking=None)[source]

Generate text completions for a batch of prompts.

Uses the model’s generate() method for autoregressive text generation, suitable for chain-of-thought Q&A where the model needs to produce free-form text before outputting a probability estimate. Generation is greedy (do_sample=False) so runs are reproducible β€” matches the web-API path’s temperature=0 contract.

Parameters:
  • text_inputs (list[str]) – The input prompts as a list of strings.

  • model (AutoModelForCausalLM) – The model to use for generation.

  • tokenizer (AutoTokenizer) – The tokenizer used to encode/decode text.

  • max_new_tokens (int, optional) – Maximum number of new tokens to generate, by default 1024.

  • context_size (int, optional) – The maximum context size for input tokens. If None, no truncation is applied to inputs.

  • enable_thinking (bool, optional) –

    Controls chat template application and thinking mode: - None: Do not apply chat template (use raw prompts, for base models) - False: Apply chat template WITHOUT thinking mode (for instruction-tuned models) - True: Apply chat template WITH thinking mode, and extract response

    content after </think> marker (for thinking models like Qwen3)

Returns:

generated_texts – The generated text completions for each input prompt. Only the newly generated tokens are returned (not the input prompt).

Return type:

list[str]

folktexts.llm_utils.get_model_folder_path(model_name, root_dir='/tmp')[source]

Returns the folder where the model is saved.

Return type:

str

folktexts.llm_utils.get_model_size_B(model_name, default=None)[source]

Get the model size from the model name, in Billions of parameters.

Return type:

int

folktexts.llm_utils.is_bf16_compatible()[source]

Checks if the current environment is bfloat16 compatible.

Return type:

bool

folktexts.llm_utils.load_model_tokenizer(model_name_or_path, **kwargs)[source]

Load a model and tokenizer from the given local path (or using the model name).

Parameters:
  • model_name_or_path (str | Path) – Model name or local path to the model folder.

  • kwargs (dict) – Additional keyword arguments to pass to the model from_pretrained call.

Returns:

The loaded model and tokenizer, respectively.

Return type:

tuple[AutoModelForCausalLM, AutoTokenizer]

folktexts.llm_utils.load_vllm_model(model_name_or_path, *, dtype='auto', gpu_memory_utilization=0.85, max_model_len=None, tensor_parallel_size=1, trust_remote_code=True, seed=42, max_logprobs=50, **kwargs)[source]

Load a vLLM LLM engine and its tokenizer.

Mirrors load_model_tokenizer for the vLLM backend. vLLM allocates the KV cache statically at startup based on gpu_memory_utilization and max_model_len; tune these per-GPU. vllm is an optional install β€” if it is not importable, this function raises a pointed error.

Parameters:
  • model_name_or_path (str | Path) – Model name or local path to the model folder. Pre-cached snapshots under /fast/groups/sf/huggingface-models/ work without download.

  • dtype (str, optional) – Compute dtype: "auto" (default; vLLM picks bf16/fp16 from the config), "bfloat16", "float16", or "float32".

  • gpu_memory_utilization (float, optional) – Fraction of GPU VRAM vLLM may use for weights + KV cache. Default 0.85 (vLLM’s own default is 0.9, which is aggressive on shared cluster nodes). vLLM fails fast at startup if this isn’t enough β€” bump down if you hit OOM at LLM().

  • max_model_len (int, optional) – Maximum number of tokens (input + output) per request. If None, vLLM reads it from the model config β€” which on some Llama checkpoints is 131072 and will allocate enormous KV cache. Pass an explicit value sized as context_size + max_new_tokens + buffer for the workload.

  • tensor_parallel_size (int, optional) – Number of GPUs to shard the model across; default 1. Set higher when the cluster job grants multiple GPUs and the model fits with tensor-parallel sharding.

  • trust_remote_code (bool, optional) – Forwarded to vLLM (mirrors load_model_tokenizer).

  • seed (int, optional) – Random seed for vLLM. Doesn’t affect greedy (temperature=0) decoding but pinned for safety.

  • max_logprobs (int, optional) – Engine-level cap on top-K logprobs SamplingParams may request. Default 50 β€” must be β‰₯ VLLMClassifier._TOPK_LOGPROBS or the engine rejects the request at predict time (VLLMValidationError: Requested sample logprobs of K, which is greater than max allowed).

  • **kwargs – Additional keyword arguments forwarded verbatim to vllm.LLM(...).

Returns:

Loaded engine and its tokenizer. The tokenizer has had add_pad_token applied so it matches the transformers path’s tokenizer state.

Return type:

tuple[vllm.LLM, AutoTokenizer]

folktexts.llm_utils.query_model_batch(text_inputs, model, tokenizer, context_size)[source]

Queries the model with a batch of text inputs.

Parameters:
  • text_inputs (list[str]) – The inputs to the model as a list of strings.

  • model (AutoModelForCausalLM) – The model to query.

  • tokenizer (AutoTokenizer) – The tokenizer used to encode the text inputs.

  • context_size (int) – The maximum context size to consider for each input (in tokens).

Returns:

last_token_probs – Model’s last token linear probabilities for each input as an np.array of shape (batch_size, vocab_size).

Return type:

np.array

folktexts.llm_utils.query_model_batch_multiple_passes(text_inputs, model, tokenizer, context_size, n_passes, digits_only=False)[source]

Queries an LM for multiple forward passes.

Greedy token search over multiple forward passes: Each forward pass takes the highest likelihood token from the previous pass.

NOTE: could use model.generate in the future!

Parameters:
  • text_inputs (list[str]) – The batch inputs to the model as a list of strings.

  • model (AutoModelForCausalLM) – The model to query.

  • tokenizer (AutoTokenizer) – The tokenizer used to encode the text inputs.

  • context_size (int) – The maximum context size to consider for each input (in tokens).

  • n_passes (int, optional) – The number of forward passes to run.

  • digits_only (bool, optional) – Whether to only sample for digit tokens.

Returns:

last_token_probs – Last token linear probabilities for each forward pass, for each text in the input batch. The output has shape (batch_size, n_passes, vocab_size).

Return type:

np.array

folktexts.plotting module

Module to plot evaluation results.

folktexts.plotting.render_evaluation_plots(y_true, y_pred_scores, *, eval_results={}, model_name=None, imgs_dir=None, show_plots=False)[source]

Renders evaluation plots for the given predictions.

Return type:

dict

folktexts.plotting.render_fairness_plots(y_true, y_pred_scores, *, sensitive_attribute, eval_results={}, model_name=None, group_value_map, group_size_threshold=0.04, imgs_dir=None, show_plots=False)[source]

Renders fairness plots for the given predictions.

Return type:

dict

folktexts.plotting.save_fig(fig, fig_name, imgs_dir, format='pdf')[source]

Helper to save a matplotlib figure to disk.

Return type:

str

folktexts.prompting module

Module to map risk-estimation questions to different prompting techniques.

e.g., - multiple-choice Q&A vs direct numeric Q&A; - zero-shot vs few-shot vs CoT;

folktexts.prompting.apply_chat_template(tokenizer, user_prompt, system_prompt=None, chat_prompt=None, **kwargs)[source]

Apply the tokenizer’s chat template to assemble a single prompt string.

Return type:

str

Notes

system_prompt is treated as β€œinclude” iff it is not None. This means an empty string β€œβ€ will inject an empty system message rather than be treated as β€œno system role” β€” pass None (or omit the argument) to skip the system role entirely.

chat_prompt is the assistant prefill. When provided, the returned prompt is trimmed so it ends exactly with chat_prompt, preserving the last-token scoring contract relied on by LLMClassifier. If the chat template mutates or strips the prefill (so it cannot be located verbatim in the rendered output), a ValueError is raised rather than silently returning a corrupted prompt.

When chat_prompt is None, add_generation_prompt=True is used and the model is left to generate freely; this is not appropriate for the benchmark scoring path (the last token will be a template-emitted role header, not the prefill).

folktexts.prompting.encode_row_prompt(row, task, question=None, custom_prompt_prefix=None, add_task_description=True, with_answer_prefill=True)[source]

Encode a question regarding a given row.

with_answer_prefill is forwarded to question.get_question_prompt. The chat-template path passes False so the prefill is supplied as a separate assistant turn rather than baked into the user message.

Return type:

str

folktexts.prompting.encode_row_prompt_chat(row, task, tokenizer, system_prompt=<object object>, chat_prompt=<object object>, numeric=False, question=None, custom_prompt_prefix=None)[source]

Encode a row prompt using the tokenizer’s chat template.

Parameters:
  • row (pd.Series) – The row that the question will be about.

  • task (TaskMetadata) – The task metadata object.

  • tokenizer (AutoTokenizer) – The tokenizer whose chat template will be applied.

  • system_prompt (str | None, optional) – System prompt text. If omitted, the mode-appropriate default selected by numeric is used. Pass None explicitly to disable the system role (e.g. for Gemma-style templates that reject it).

  • chat_prompt (str | None, optional) – Assistant prefill text. If omitted, the mode-appropriate default selected by numeric is used. Pass None explicitly to skip the assistant prefill β€” note that this routes inference through add_generation_prompt=True and breaks the last-token scoring assumption used by LLMClassifier, so it is not appropriate for the benchmark path.

  • numeric (bool, optional) – Whether numeric risk prompting is being used. Selects which default prompts are applied when system_prompt / chat_prompt are omitted.

  • question (QAInterface, optional) – The question interface to use.

  • custom_prompt_prefix (str, optional) – A custom prompt prefix to prepend.

Returns:

The fully formatted chat-template prompt.

Return type:

str

folktexts.prompting.encode_row_prompt_few_shot(row, task, dataset, n_shots, question=None, reuse_examples=False, class_balancing=False, custom_prompt_prefix=None)[source]

Encode a question regarding a given row using few-shot prompting.

Parameters:
  • row (pd.Series) – The row that the question will be about.

  • task (TaskMetadata) – The task that the row belongs to.

  • n_shots (int, optional) – The number of example questions and answers to use before prompting about the given row, by default 3.

  • reuse_examples (bool, optional) – Whether to reuse the same examples for consistency. By default will resample new examples each time (reuse_examples=False).

Returns:

prompt – The encoded few-shot prompt.

Return type:

str

folktexts.prompting.resolve_chat_defaults(numeric, system_prompt=None, chat_prompt=None)[source]

Resolve default system_prompt / chat_prompt for chat-template prompting.

A None value means β€œuse the default for this mode”. To explicitly disable a role downstream, override the resolved value with None after calling this function (which is what Benchmark.make_benchmark does for tokenizers that reject the system role).

Return type:

tuple[str, str]

folktexts.prompting.tokenizer_supports_system_prompt(tokenizer)[source]

Check whether the tokenizer’s chat template supports system messages.

Some models (e.g. Gemma) raise a TemplateError when a system role is used. Other templates surface this with different exception types depending on transformers / Jinja versions (e.g. RuntimeError, KeyError, or a template-defined exception macro), so we treat any failure of the probe as β€œsystem role not supported” rather than letting it propagate and crash the benchmark.

Return type:

bool

folktexts.qa_interface module

Interface for question-answering with LLMs.

  • Create different types of questions (direct numeric, multiple-choice, chain-of-thought).

  • Encode questions and decode model outputs.

  • Compute risk-estimate from model outputs.

class folktexts.qa_interface.ChainOfThoughtQA(column, text, num_forward_passes=-1, max_new_tokens=8000, enable_thinking=False)[source]

Bases: QAInterface

A chain-of-thought (CoT) question interface.

The model is instructed to reason step-by-step in free-form text and end with an explicit Probability: X% line; the probability is recovered via regex. This works on any model regardless of chat template.

Orthogonal to the tokenizer’s enable_thinking chat-template kwarg: CoT prompting always uses free-form generation, and enable_thinking=True additionally activates the <think>…</think> block on tokenizers that support it (e.g., Qwen3-Thinking) β€” the block is stripped before regex extraction.

Notes

Unlike DirectNumericQA and MultipleChoiceQA which use token probabilities, this interface uses full text generation. The num_forward_passes is set to -1 to signal text-generation mode instead of token-probability extraction.

The regex extraction is flexible and accepts multiple formats: - β€œProbability: 80%” -> 0.80 - β€œProbability: 0.80” -> 0.80 - β€œProbability: 80 percent” -> 0.80 - β€œβ€¦ 75%” (at end of text) -> 0.75

enable_thinking

Whether to enable thinking mode for tokenizers that support it (e.g., Qwen3). When True, the tokenizer’s apply_chat_template is called with enable_thinking=True. Default is False.

Type:

bool

enable_thinking: bool = False
static extract_probability_from_text(generated_text)[source]

Extract a probability value from generated text using regex patterns.

The extraction prioritizes (in order): the explicit β€œProbability: X[%]” anchor, last loose percentage, β€œX percent”, then a bare 0.XX decimal. Returns a float in [0, 1] or None if nothing matched.

Return type:

float | None

get_answer_from_model_output(generated_text, tokenizer_vocab=None)[source]

Extract the probability answer from the model’s generated text.

Parameters:
  • generated_text (str) – The full text generated by the model, including reasoning and the final probability estimate.

  • tokenizer_vocab (dict[str, int], optional) – The tokenizer’s vocabulary. Not used for ChainOfThoughtQA but included for interface compatibility.

Returns:

answer – The extracted probability as a float between 0 and 1.

Return type:

float

Raises:

ValueError – If no valid probability could be extracted from the generated text.

get_question_prompt(with_answer_prefill=True)[source]

Returns the CoT question prompt.

The with_answer_prefill parameter is accepted for interface compatibility with QAInterface but has no effect: CoT prompts produce free-form text and have no answer prefill to strip.

Return type:

str

max_new_tokens: int = 8000
num_forward_passes: int = -1
class folktexts.qa_interface.Choice(text, data_value, numeric_value=None)[source]

Bases: object

Represents a choice in multiple-choice Q&A.

text

The text of the choice. E.g., β€œ25-34 years old”.

Type:

str

data_value

The categorical value corresponding to this choice in the data.

Type:

object

numeric_value

A meaningful numeric value for the choice. E.g., if the choice is β€œ25-34 years old”, the numeric value could be 30. The choice with the highest numeric value can be used as a proxy for the positive class. If not provided, will try to use the choice.value.

Type:

float, optional

data_value: object
get_numeric_value()[source]

Returns the numeric value of the choice.

Return type:

float

numeric_value: float = None
text: str
class folktexts.qa_interface.DirectNumericQA(column, text, num_forward_passes=2, answer_probability=True)[source]

Bases: QAInterface

Represents a direct numeric question.

Notes

For example, the prompt could be ” Q: What is 2 + 2? A: ” With the expected answer being β€œ4”.

If looking for a direct numeric probability, the answer prompt will be framed as so: ” Q: What is the probability, between 0 and 1, of getting heads on a coin flip? A: 0.” So that we can extract a numeric answer with at most 2 forward passes. This is done automatically by passing the kwarg answer_probability=True.

Note that some models have multi-digit tokens in their vocabulary, so we need to correctly assess which tokens in the vocabulary correspond to valid numeric answers.

answer_probability: bool = True
get_answer_from_model_output(last_token_probs, tokenizer_vocab)[source]

Outputs a numeric answer inferred from the model’s output.

Parameters:
  • last_token_probs (np.ndarray) – The last token probabilities of the model for the question. The first dimension must correspond to the number for forward passes as specified by num_forward_passes.

  • tokenizer_vocab (dict[str, int],) – The tokenizer’s vocabulary.

Returns:

answer – The numeric answer to the question.

Return type:

float | int

Notes

Eventually we could run a search algorithm to find the most likely answer over multiple forward passes, but for now we’ll just take the argmax on each forward pass.

get_question_prompt(with_answer_prefill=True)[source]

Returns the question text.

with_answer_prefill=True (the default) bakes the answer prefill into the returned string β€” required by the zero-shot / few-shot last-token scoring path, which reads probabilities from the very next token after the prefill. Set to False for chat-template prompting, where the prefill is supplied separately as the assistant turn (otherwise the same string ends up emitted twice and silently degrades scoring).

Return type:

str

num_forward_passes: int = 2
class folktexts.qa_interface.MultipleChoiceQA(column, text, num_forward_passes=1, choices=<factory>, _answer_keys_source=<factory>)[source]

Bases: QAInterface

Represents a multiple-choice question and its answer keys.

property answer_keys: tuple[str, ...]
property choice_to_key: dict[Choice, str]
choices: tuple[Choice]
classmethod create_answer_keys_permutations(question)[source]

Yield questions with all permutations of answer keys.

Parameters:

question (Question) – The template question whose answer keys will be permuted.

Returns:

permutations – A generator of questions with all permutations of answer keys.

Return type:

Iterator[Question]

classmethod create_question_from_value_map(column, value_map, attribute, **kwargs)[source]

Constructs a question from a value map.

Return type:

MultipleChoiceQA

get_answer_from_model_output(last_token_probs, tokenizer_vocab)[source]

Decodes the model’s output into an answer for the given question.

Parameters:
  • last_token_probs (np.ndarray) – The model’s last token probabilities for the question. The first dimension corresponds to the number of forward passes as specified by self.num_forward_passes.

  • tokenizer_vocab (dict[str, int],) – The tokenizer’s vocabulary.

Returns:

answer – The answer to the question.

Return type:

float

get_answer_from_text(text)[source]
Return type:

Choice

get_answer_key_from_value(value)[source]

Returns the answer key corresponding to the given data value.

Return type:

str

get_question_prompt(with_answer_prefill=True)[source]

Returns the question text.

with_answer_prefill=True (the default) bakes the answer prefill into the returned string β€” required by the zero-shot / few-shot last-token scoring path, which reads probabilities from the very next token after the prefill. Set to False for chat-template prompting, where the prefill is supplied separately as the assistant turn (otherwise the same string ends up emitted twice and silently degrades scoring).

Return type:

str

get_value_to_text_map()[source]

Returns the map from choice data value to choice textual representation.

Return type:

dict[object, str]

property key_to_choice: dict[str, Choice]
num_forward_passes: int = 1
class folktexts.qa_interface.QAInterface(column, text, num_forward_passes)[source]

Bases: ABC

An interface for a question-answering system.

column: str
get_answer_from_model_output(last_token_probs, tokenizer_vocab)[source]

Decodes the model’s output into an answer for the given question.

Parameters:
  • last_token_probs (np.ndarray) – The model’s last token probabilities for the question. The first dimension corresponds to the number of forward passes as specified by self.num_forward_passes.

  • tokenizer (dict[str, int]) – The tokenizer’s vocabulary.

Returns:

answer – The answer to the question.

Return type:

float

get_question_prompt(with_answer_prefill=True)[source]

Returns the question text.

with_answer_prefill=True (the default) bakes the answer prefill into the returned string β€” required by the zero-shot / few-shot last-token scoring path, which reads probabilities from the very next token after the prefill. Set to False for chat-template prompting, where the prefill is supplied separately as the assistant turn (otherwise the same string ends up emitted twice and silently degrades scoring).

Return type:

str

num_forward_passes: int
text: str

folktexts.task module

Definition of a generic TaskMetadata class.

class folktexts.task.TaskMetadata(name, features, target, cols_to_text, sensitive_attribute=None, target_threshold=None, multiple_choice_qa=None, direct_numeric_qa=None, cot_qa=None, description=None, _use_numeric_qa=False, _use_cot_qa=False)[source]

Bases: object

A base class to hold information on a prediction task.

check_task_columns_are_available(available_cols, raise_=True)[source]

Checks if all columns required by this task are available.

Parameters:
  • available_cols (list[str]) – The list of column names available in the dataset.

  • raise (bool, optional) – Whether to raise an error if some columns are missing, by default True.

Returns:

all_available – True if all required columns are present in the given list of available columns, False otherwise.

Return type:

bool

cols_to_text: dict[str, ColumnToText]

A mapping between column names and their textual descriptions.

cot_qa: ChainOfThoughtQA = None

The chain-of-thought (CoT) question and answer interface for this task.

create_task_with_feature_subset(feature_subset)[source]

Creates a new task with a subset of the original features.

description: str = None

A description of the task, including the population to which the task pertains to.

direct_numeric_qa: DirectNumericQA = None

The direct numeric question and answer interface for this task.

features: list[str]

The names of the features used in the task.

get_row_description(row)[source]

Encode a description of a given data row in textual form.

Return type:

str

get_target()[source]

Resolves the name of the target column depending on self.target_threshold.

Return type:

str

classmethod get_task(name, use_numeric_qa=False)[source]

Fetches a previously created task by its name.

Parameters:
  • name (str) – The name of the task to fetch.

  • use_numeric_qa (bool, optional) – Whether to set the retrieved task to use verbalized numeric Q&A instead of the default multiple-choice Q&A prompts. Default is False.

Returns:

task – The task object with the given name.

Return type:

TaskMetadata

Raises:

ValueError – Raised if the task with the given name has not been created yet.

multiple_choice_qa: MultipleChoiceQA = None

The multiple-choice question and answer interface for this task.

name: str

The name of the task.

property question: QAInterface

Getter for the Q&A interface for this task.

sensitive_attribute: str = None

The name of the column used as the sensitive attribute data (if provided).

sensitive_attribute_value_map()[source]

Returns a mapping between sensitive attribute values and their descriptions.

Return type:

Callable

set_question(question)[source]

Sets the Q&A interface for this task.

target: str

The name of the target column.

target_threshold: Threshold = None

The threshold used to binarize the target column (if provided).

property use_cot_qa: bool

Getter for whether to use chain-of-thought (CoT) Q&A prompts.

property use_numeric_qa: bool

Getter for whether to use numeric Q&A instead of multiple-choice Q&A prompts.

folktexts.threshold module

Helper function for defining binarization thresholds.

class folktexts.threshold.Threshold(value, op)[source]

Bases: object

A class to represent a threshold value and its comparison operator.

value

The threshold value to compare against.

Type:

float | int

op

The comparison operator to use. One of β€˜>’, β€˜<’, β€˜>=’, β€˜<=’, β€˜==’, β€˜!=’.

Type:

str

apply_to_column_data(data)[source]

Applies the threshold operation to a pandas Series or scalar value.

Return type:

int | Series

apply_to_column_name(column_name)[source]

Standardizes naming of thresholded columns.

Return type:

str

op: str
valid_ops: ClassVar[dict] = {'!=': <built-in function ne>, '<': <built-in function lt>, '<=': <built-in function le>, '==': <built-in function eq>, '>': <built-in function gt>, '>=': <built-in function ge>}
value: float | int

Module contents