folktexts packageο
Subpackagesο
- folktexts.acs package
- folktexts.classifier package
- Submodules
- folktexts.classifier.base module
LLMClassifierLLMClassifier.DEFAULT_INFERENCE_KWARGSLLMClassifier.compute_risk_estimates_for_dataframe()LLMClassifier.compute_risk_estimates_for_dataset()LLMClassifier.correct_order_biasLLMClassifier.custom_prompt_prefixLLMClassifier.encode_rowLLMClassifier.fit()LLMClassifier.inference_kwargsLLMClassifier.model_nameLLMClassifier.predict()LLMClassifier.predict_proba()LLMClassifier.seedLLMClassifier.set_fit_request()LLMClassifier.set_inference_kwargs()LLMClassifier.set_predict_proba_request()LLMClassifier.set_predict_request()LLMClassifier.set_score_request()LLMClassifier.taskLLMClassifier.threshold
- folktexts.classifier.transformers_classifier module
- folktexts.classifier.vllm_classifier module
- folktexts.classifier.web_api_classifier module
- Module contents
- folktexts.cli package
Submodulesο
folktexts.benchmark moduleο
A benchmark class for measuring and evaluating LLM calibration.
- class folktexts.benchmark.Benchmark(llm_clf, dataset, config=BenchmarkConfig(numeric_risk_prompting=False, cot_prompting=False, enable_thinking=False, few_shot=None, reuse_few_shot_examples=False, balance_few_shot_examples=False, use_chat_template=False, chat_prompt=None, system_prompt=None, batch_size=None, context_size=None, correct_order_bias=True, feature_subset=None, population_filter=None, seed=42))[source]ο
Bases:
objectMeasures and evaluates risk scores produced by an LLM.
A benchmark object to measure and evaluate risk scores produced by an LLM.
- Parameters:
llm_clf (LLMClassifier) β A language model classifier object (can be local or web-hosted).
dataset (Dataset) β The dataset object to use for the benchmark.Γ·
config (BenchmarkConfig, optional) β The configuration object used to create the benchmark parameters. NOTE: This is used to uniquely identify the benchmark object for reproducibility; it will not be used to change the benchmark behavior. To configure the benchmark, pass a configuration object to the Benchmark.make_benchmark method.
- ACS_DATASET_CONFIGS = {'horizon': '1-Year', 'seed': 42, 'subsampling': None, 'survey': 'person', 'survey_year': '2018', 'test_size': 0.1, 'val_size': 0.1}ο
- property configs_dict: dictο
- classmethod make_acs_benchmark(task_name, *, model, tokenizer=None, data_dir=None, max_api_rpm=None, config=BenchmarkConfig(numeric_risk_prompting=False, cot_prompting=False, enable_thinking=False, few_shot=None, reuse_few_shot_examples=False, balance_few_shot_examples=False, use_chat_template=False, chat_prompt=None, system_prompt=None, batch_size=None, context_size=None, correct_order_bias=True, feature_subset=None, population_filter=None, seed=42), backend=None, model_name_or_path=None, **kwargs)[source]ο
Create a standardized calibration benchmark on ACS data.
- Parameters:
task_name (str) β The name of the ACS task to use.
model (AutoModelForCausalLM | str) β The transformers language model to use, or the model ID for a webAPI hosted model (e.g., βopenai/gpt-4o-miniβ).
tokenizer (AutoTokenizer, optional) β The tokenizer used to train the model (if using a transformers model). Not required for webAPI models.
data_dir (str | Path, optional) β Path to the directory to load data from and save data in.
max_api_rpm (int, optional) β The maximum number of API requests per minute for webAPI models.
config (BenchmarkConfig, optional) β Extra benchmark configurations, by default will use BenchmarkConfig.default_config().
**kwargs β Additional arguments passed to ACSDataset and BenchmarkConfig. By default will use a set of standardized configurations for reproducibility.
- Returns:
bench β The ACS calibration benchmark object.
- Return type:
- classmethod make_benchmark(*, task, dataset, model, tokenizer=None, max_api_rpm=None, config=BenchmarkConfig(numeric_risk_prompting=False, cot_prompting=False, enable_thinking=False, few_shot=None, reuse_few_shot_examples=False, balance_few_shot_examples=False, use_chat_template=False, chat_prompt=None, system_prompt=None, batch_size=None, context_size=None, correct_order_bias=True, feature_subset=None, population_filter=None, seed=42), backend=None, model_name_or_path=None, **kwargs)[source]ο
Create a calibration benchmark from a given configuration.
- Parameters:
task (TaskMetadata | str) β The task metadata object or name of the task to use.
dataset (Dataset) β The dataset to use for the benchmark.
model (AutoModelForCausalLM | str) β The transformers language model to use, or the model ID for a webAPI hosted model (e.g., βopenai/gpt-4o-miniβ).
tokenizer (AutoTokenizer, optional) β The tokenizer used to train the model (if using a transformers model). Not required for webAPI models.
max_api_rpm (int, optional) β The maximum number of API requests per minute for webAPI models.
config (BenchmarkConfig, optional) β Extra benchmark configurations, by default will use BenchmarkConfig.default_config().
**kwargs β Additional arguments for easier configuration of the benchmark. Will simply use these values to update the config object.
- Returns:
bench β The calibration benchmark object.
- Return type:
- property model_nameο
- plot_results(*, show_plots=True)[source]ο
Render evaluation plots and save to disk.
- Parameters:
show_plots (bool, optional) β Whether to show plots, by default True.
- Returns:
plots_paths β The paths to the saved plots.
- Return type:
dict[str, str]
- property resultsο
- property results_dir: Pathο
Get the results directory for this benchmark.
- property results_root_dir: Pathο
- run(results_root_dir, fit_threshold=0)[source]ο
Run the calibration benchmark experiment.
- Parameters:
results_root_dir (str | Path) β Path to root directory under which results will be saved.
fit_threshold (int | bool, optional) β Whether to fit the binarization threshold on a given number of training samples, by default 0 (will not fit the threshold).
- Returns:
The benchmark metric value. By default this is the ECE score.
- Return type:
float
- save_results(results_root_dir=None)[source]ο
Save the benchmark results to disk.
- Parameters:
results_root_dir (str | Path, optional) β Path to root directory under which results will be saved. By default will use self.results_root_dir.
- property taskο
- class folktexts.benchmark.BenchmarkConfig(numeric_risk_prompting=False, cot_prompting=False, enable_thinking=False, few_shot=None, reuse_few_shot_examples=False, balance_few_shot_examples=False, use_chat_template=False, chat_prompt=None, system_prompt=None, batch_size=None, context_size=None, correct_order_bias=True, feature_subset=None, population_filter=None, seed=42)[source]ο
Bases:
objectA dataclass to hold the configuration for risk-score benchmark.
- numeric_risk_promptingο
Whether to prompt for numeric risk-estimates instead of multiple-choice Q&A, by default False.
- Type:
bool, optional
- cot_promptingο
Whether to use chain-of-thought prompting: the model generates free-form reasoning text and ends with a Probability: X% line that is recovered via regex. Works on any model regardless of chat template. By default False.
- Type:
bool, optional
- enable_thinkingο
Whether to enable thinking mode for tokenizers that support it (e.g., Qwen3). Only applies when cot_prompting=True. When enabled, calls apply_chat_template(β¦, enable_thinking=True) and the resulting <think>β¦</think> block is stripped before regex extraction. Default is False.
- Type:
bool, optional
- few_shotο
Whether to use few-shot prompting with a given number of examples, by default None.
- Type:
int | None, optional
- reuse_few_shot_examplesο
Whether to reuse the same samples for few-shot prompting (or sample new ones every time), by default False.
- Type:
bool, optional
- balance_few_shot_examplesο
Whether to balance the samples for few-shot prompting with respect to their labels, by default False.
- Type:
bool, optional
- use_chat_templateο
Whether to format prompts using the tokenizerβs chat template, by default False. Only supported for local transformers models.
- Type:
bool, optional
- chat_promptο
The assistant prefill text to use with chat templates. If None, uses the appropriate default for the prompting mode (ANTHROPIC_CHAT_PROMPT for multiple-choice, NUMERIC_CHAT_PROMPT for numeric).
- Type:
str | None, optional
- system_promptο
Custom system prompt text to use with chat templates. If None, uses the appropriate default for the prompting mode (SYSTEM_PROMPT for multiple-choice, NUMERIC_SYSTEM_PROMPT for numeric).
- Type:
str | None, optional
- batch_sizeο
The batch size to use for inference.
- Type:
int | None, optional
- context_sizeο
The maximum context size when prompting the LLM.
- Type:
int | None, optional
- correct_order_biasο
Whether to correct the ordering bias in multiple-choice Q&A when prompting the LLM, by default True.
- Type:
bool, optional
- feature_subsetο
Whether to use a subset of the standard feature set for the task. The list should contain the names of the columns of features to use.
- Type:
list[str] | None, optional
- population_filterο
Optional population filter for this benchmark; must follow the format {βcolumn_nameβ: βvalueβ}.
- Type:
dict | None, optional
- seedο
Random seed β to set for reproducibility.
- Type:
int, optional
-
balance_few_shot_examples:
bool= Falseο
-
batch_size:
int|None= Noneο
-
chat_prompt:
str|None= Noneο
-
context_size:
int|None= Noneο
-
correct_order_bias:
bool= Trueο
-
cot_prompting:
bool= Falseο
- classmethod default_config(**changes)[source]ο
Returns the default configuration with optional changes.
-
enable_thinking:
bool= Falseο
-
feature_subset:
list[str] |None= Noneο
-
few_shot:
int|None= Noneο
-
numeric_risk_prompting:
bool= Falseο
-
population_filter:
dict|None= Noneο
-
reuse_few_shot_examples:
bool= Falseο
-
seed:
int= 42ο
-
system_prompt:
str|None= Noneο
-
use_chat_template:
bool= Falseο
folktexts.col_to_text moduleο
- class folktexts.col_to_text.ColumnToText(name, short_description, value_map=None, question=None, connector_verb='is:', missing_value_fill='N/A', use_value_map_only=False)[source]ο
Bases:
objectMaps a single columnβs values to natural text.
Constructs a ColumnToText object.
- Parameters:
name (str) β The columnβs name.
short_description (str) β A short description of the column to be used before different values. For example, short_description=βyearly incomeβ will result in βThe yearly income is [β¦]β.
value_map (dict[int | str, str] | Callable, optional) β A map between column values and their textual meaning. If not provided, will try to infer a mapping from the question.
question (QAInterface, optional) β A question associated with the column. If not provided, will try to infer a multiple-choice question from the value_map.
connector_verb (str, optional) β Which verb to use when connecting the columnβs description to its value; by default βisβ.
missing_value_fill (str, optional) β The value to use when the columnβs value is not found in the value_map, by default βN/Aβ.
use_value_map_only (bool, optional) β Whether to only use the value_map for mapping values to text, or whether natural language representation should be generated using the connector_verb and short_description as well. By default (False) will construct a natural language representation of the form: βThe [short_description] [connector_verb] [value_map.get(val)]β.
- get_text(value)[source]ο
Returns the natural text representation of the given data value.
- Return type:
str
- property name: strο
- property question: QAInterfaceο
- property short_description: strο
- property value_map: Callableο
Returns the value map function for this column.
folktexts.dataset moduleο
General Dataset functionality for text-based datasets.
- class folktexts.dataset.Dataset(data, task, test_size=0.1, val_size=0.1, subsampling=None, seed=42)[source]ο
Bases:
objectConstruct a Dataset object.
- Parameters:
data (pd.DataFrame) β The datasetβs data in pandas DataFrame format.
task (TaskMetadata) β The metadata for the prediction task.
test_size (float, optional) β The size of the test set, as a fraction of the total dataset size, by default 0.1.
val_size (float, optional) β The size of the validation set, as a fraction of the total dataset size, by default 0.1.
subsampling (float, optional) β Whether to use sub-sampling, and which fraction of the data to keep. By default will not use sub-sampling (subsampling=None).
seed (int, optional) β The random state seed, by default 42.
- property data: DataFrameο
- property name: strο
A unique name for this dataset.
- sample_n_train_examples(n, reuse_examples=False, class_balancing=False)[source]ο
Return a set of samples from the training set.
- Parameters:
n (int) β The number of example rows to return.
reuse_examples (bool, optional) β Whether to reuse the same examples for consistency. By default will sample new examples each time (reuse_examples=False).
- Returns:
X, y β The features and target data for the sampled examples.
- Return type:
tuple[pd.DataFrame, pd.Series]
- property seed: intο
- property subsampling: floatο
- property task: TaskMetadataο
- property test_size: floatο
- property train_size: floatο
- property val_size: floatο
folktexts.evaluation moduleο
Module to map risk-estimates to a variety of evaluation metrics.
Notes
Code based on the error_parity.evaluation module, at: https://github.com/socialfoundations/error-parity/blob/main/error_parity/evaluation.py
- folktexts.evaluation.bootstrap_estimate(eval_func, *, y_true, y_pred_scores, sensitive_attribute=None, k=200, confidence_pct=95, seed=42)[source]ο
Computes bootstrap estimates of the given evaluation function.
- Parameters:
eval_func (Callable[[np.ndarray, np.ndarray, np.ndarray], dict[str, float]]) β The evaluation function to run for each bootstrap sample. Must follow the signature eval_func(y_true, y_pred_scores, sensitive_attribute).
y_true (np.ndarray) β The true labels.
y_pred_scores (np.ndarray) β The predicted scores.
sensitive_attribute (np.ndarray, optional) β Optionally, provide the sensitive attribute data to compute fairness metrics, by default None.
k (int, optional) β How many bootstrap samples to draw, by default 200.
confidence_pct (float, optional) β The confidence interval to use, in percentage, by default 95.
seed (int, optional) β The random seed, by default 42.
- Returns:
results β A dictionary containing bootstrap estimates for a variety of metrics.
- Return type:
dict[str, float]
- folktexts.evaluation.compute_best_threshold(y_true, y_pred_scores, *, false_pos_cost=1.0, false_neg_cost=1.0)[source]ο
Computes the binarization threshold that maximizes accuracy.
- Parameters:
y_true (np.ndarray) β The true class labels.
y_pred_scores (np.ndarray) β The predicted risk scores.
false_pos_cost (float, optional) β The cost of a false positive error, by default 1.0
false_neg_cost (float, optional) β The cost of a false negative error, by default 1.0
- Returns:
best_threshold β The threshold value that maximizes accuracy for the given predictions.
- Return type:
float
- folktexts.evaluation.evaluate_binary_predictions(y_true, y_pred)[source]ο
Evaluates the provided binary predictions on common performance metrics.
- Parameters:
y_true (np.ndarray) β The true class labels.
y_pred (np.ndarray) β The binary predictions.
- Returns:
A dictionary with key-value pairs of (metric name, metric value).
- Return type:
dict
- folktexts.evaluation.evaluate_binary_predictions_fairness(y_true, y_pred, sensitive_attribute, return_groupwise_metrics=False, min_group_size=0.04)[source]ο
Evaluates fairness of the given predictions.
Fairness metrics are computed as the ratios between group-wise performance metrics.
- Parameters:
y_true (np.ndarray) β The true class labels.
y_pred (np.ndarray) β The discretized predictions.
sensitive_attribute (np.ndarray) β The sensitive attribute (protected group membership).
return_groupwise_metrics (bool, optional) β Whether to return group-wise performance metrics (bool: True) or only the ratios between these metrics (bool: False), by default False.
min_group_size (float, optional) β The minimum fraction of samples (as a fraction of the total number of samples) that a group must have to be considered for fairness evaluation, by default 0.04. This is meant to avoid evaluating metrics on very small groups which leads to noisy and inconsistent results.
- Returns:
A dictionary with key-value pairs of (metric name, metric value).
- Return type:
dict
- folktexts.evaluation.evaluate_predictions(y_true, y_pred_scores, *, sensitive_attribute=None, threshold='best', model_name=None)[source]ο
Evaluates predictions on common performance and fairness metrics.
- Parameters:
y_true (np.ndarray) β The true class labels.
y_pred_scores (np.ndarray) β The predicted scores.
sensitive_attribute (np.ndarray, optional) β The sensitive attribute data. Will compute fairness metrics if provided.
threshold (float | str, optional) β The threshold to use for binarizing the predictions, or βbestβ to infer which threshold maximizes accuracy.
model_name (str, optional) β The name of the model to be used on the plots, by default None.
- Returns:
results β A dictionary with key-value pairs of (metric name, metric value).
- Return type:
dict
- folktexts.evaluation.evaluate_predictions_bootstrap(y_true, y_pred_scores, *, sensitive_attribute=None, threshold='best', k=200, confidence_pct=95, seed=42)[source]ο
Computes bootstrap estimates of classification metrics for the given predictions.
- Parameters:
y_true (np.ndarray) β The true labels.
y_pred_scores (np.ndarray) β The score predictions.
sensitive_attribute (np.ndarray, optional) β The sensitive attribute data. Will compute fairness metrics if provided.
threshold (float | str, optional) β The threshold to use for binarizing the predictions, or βbestβ to infer which threshold maximizes accuracy, by default βbestβ.
k (int, optional) β How many bootstrap samples to draw, by default 200.
confidence_pct (float, optional) β How large of a confidence interval to use when reporting lower and upper bounds, by default 95 (i.e., 2.5 to 97.5 percentile of results).
seed (int, optional) β The random seed, by default 42.
- Returns:
results β A dictionary containing bootstrap estimates for a variety of metrics.
- Return type:
dict[str, float]
folktexts.llm_utils moduleο
Common functions to use with transformer LLMs.
- folktexts.llm_utils.add_pad_token(tokenizer)[source]ο
Add a pad token to the model and tokenizer if it doesnβt already exist.
Here weβre using the end-of-sentence token as the pad token. Both the model weights and tokenizer vocabulary are untouched.
Another possible way would be to add a new token [PAD] to the tokenizer and update the tokenizer vocabulary and model weight embeddings accordingly. The embedding for the new pad token would be the average of all other embeddings.
- folktexts.llm_utils.decode_topk_logprobs_to_risk_estimate(per_pass_topk, *, tokenizer_vocab, vocab_dim, question)[source]ο
Convert top-K log-probabilities into a single risk-estimate float.
- Parameters:
per_pass_topk (list[dict[int, float]]) β One dict per generated token position, mapping token_id -> log-prob. The token_ids must match the values in tokenizer_vocab. Tokens absent from the top-K are assumed to have probability ~0.
tokenizer_vocab (dict[str, int]) β Token string -> token_id map used by the QA decoder for prefix-variant lookup (MultipleChoiceQA) or digit/decimal lookup (DirectNumericQA).
vocab_dim (int) β Size of the linear-probability arrayβs vocab axis. For local backends this is model.config.vocab_size (the logits axis); for the synthetic WebAPI path it is the size of the synthesised vocab.
question (MultipleChoiceQA | DirectNumericQA) β The QA interface used to interpret the probabilities.
- Returns:
risk_estimate β Risk score in [0, 1] from question.get_answer_from_model_output.
- Return type:
float
Notes
Both the WebAPI backend (top_logprobs=20 from OpenAI-style responses) and the vLLM backend (top-K logprobs from SamplingParams(logprobs=K)) call this helper. The transformers backend reads the full softmax directly and bypasses this path; see query_model_batch_multiple_passes.
- folktexts.llm_utils.generate_text_batch(text_inputs, model, tokenizer, max_new_tokens=1024, context_size=None, enable_thinking=None)[source]ο
Generate text completions for a batch of prompts.
Uses the modelβs generate() method for autoregressive text generation, suitable for chain-of-thought Q&A where the model needs to produce free-form text before outputting a probability estimate. Generation is greedy (do_sample=False) so runs are reproducible β matches the web-API pathβs temperature=0 contract.
- Parameters:
text_inputs (list[str]) β The input prompts as a list of strings.
model (AutoModelForCausalLM) β The model to use for generation.
tokenizer (AutoTokenizer) β The tokenizer used to encode/decode text.
max_new_tokens (int, optional) β Maximum number of new tokens to generate, by default 1024.
context_size (int, optional) β The maximum context size for input tokens. If None, no truncation is applied to inputs.
enable_thinking (bool, optional) β
Controls chat template application and thinking mode: - None: Do not apply chat template (use raw prompts, for base models) - False: Apply chat template WITHOUT thinking mode (for instruction-tuned models) - True: Apply chat template WITH thinking mode, and extract response
content after </think> marker (for thinking models like Qwen3)
- Returns:
generated_texts β The generated text completions for each input prompt. Only the newly generated tokens are returned (not the input prompt).
- Return type:
list[str]
- folktexts.llm_utils.get_model_folder_path(model_name, root_dir='/tmp')[source]ο
Returns the folder where the model is saved.
- Return type:
str
- folktexts.llm_utils.get_model_size_B(model_name, default=None)[source]ο
Get the model size from the model name, in Billions of parameters.
- Return type:
int
- folktexts.llm_utils.is_bf16_compatible()[source]ο
Checks if the current environment is bfloat16 compatible.
- Return type:
bool
- folktexts.llm_utils.load_model_tokenizer(model_name_or_path, **kwargs)[source]ο
Load a model and tokenizer from the given local path (or using the model name).
- Parameters:
model_name_or_path (str | Path) β Model name or local path to the model folder.
kwargs (dict) β Additional keyword arguments to pass to the model from_pretrained call.
- Returns:
The loaded model and tokenizer, respectively.
- Return type:
tuple[AutoModelForCausalLM, AutoTokenizer]
- folktexts.llm_utils.load_vllm_model(model_name_or_path, *, dtype='auto', gpu_memory_utilization=0.85, max_model_len=None, tensor_parallel_size=1, trust_remote_code=True, seed=42, max_logprobs=50, **kwargs)[source]ο
Load a vLLM LLM engine and its tokenizer.
Mirrors load_model_tokenizer for the vLLM backend. vLLM allocates the KV cache statically at startup based on gpu_memory_utilization and max_model_len; tune these per-GPU. vllm is an optional install β if it is not importable, this function raises a pointed error.
- Parameters:
model_name_or_path (str | Path) β Model name or local path to the model folder. Pre-cached snapshots under /fast/groups/sf/huggingface-models/ work without download.
dtype (str, optional) β Compute dtype:
"auto"(default; vLLM picks bf16/fp16 from the config),"bfloat16","float16", or"float32".gpu_memory_utilization (float, optional) β Fraction of GPU VRAM vLLM may use for weights + KV cache. Default 0.85 (vLLMβs own default is 0.9, which is aggressive on shared cluster nodes). vLLM fails fast at startup if this isnβt enough β bump down if you hit OOM at LLM().
max_model_len (int, optional) β Maximum number of tokens (input + output) per request. If
None, vLLM reads it from the model config β which on some Llama checkpoints is 131072 and will allocate enormous KV cache. Pass an explicit value sized ascontext_size + max_new_tokens + bufferfor the workload.tensor_parallel_size (int, optional) β Number of GPUs to shard the model across; default 1. Set higher when the cluster job grants multiple GPUs and the model fits with tensor-parallel sharding.
trust_remote_code (bool, optional) β Forwarded to vLLM (mirrors load_model_tokenizer).
seed (int, optional) β Random seed for vLLM. Doesnβt affect greedy (temperature=0) decoding but pinned for safety.
max_logprobs (int, optional) β Engine-level cap on top-K logprobs SamplingParams may request. Default 50 β must be β₯
VLLMClassifier._TOPK_LOGPROBSor the engine rejects the request at predict time (VLLMValidationError: Requested sample logprobs of K, which is greater than max allowed).**kwargs β Additional keyword arguments forwarded verbatim to
vllm.LLM(...).
- Returns:
Loaded engine and its tokenizer. The tokenizer has had add_pad_token applied so it matches the transformers pathβs tokenizer state.
- Return type:
tuple[vllm.LLM, AutoTokenizer]
- folktexts.llm_utils.query_model_batch(text_inputs, model, tokenizer, context_size)[source]ο
Queries the model with a batch of text inputs.
- Parameters:
text_inputs (list[str]) β The inputs to the model as a list of strings.
model (AutoModelForCausalLM) β The model to query.
tokenizer (AutoTokenizer) β The tokenizer used to encode the text inputs.
context_size (int) β The maximum context size to consider for each input (in tokens).
- Returns:
last_token_probs β Modelβs last token linear probabilities for each input as an np.array of shape (batch_size, vocab_size).
- Return type:
np.array
- folktexts.llm_utils.query_model_batch_multiple_passes(text_inputs, model, tokenizer, context_size, n_passes, digits_only=False)[source]ο
Queries an LM for multiple forward passes.
Greedy token search over multiple forward passes: Each forward pass takes the highest likelihood token from the previous pass.
NOTE: could use model.generate in the future!
- Parameters:
text_inputs (list[str]) β The batch inputs to the model as a list of strings.
model (AutoModelForCausalLM) β The model to query.
tokenizer (AutoTokenizer) β The tokenizer used to encode the text inputs.
context_size (int) β The maximum context size to consider for each input (in tokens).
n_passes (int, optional) β The number of forward passes to run.
digits_only (bool, optional) β Whether to only sample for digit tokens.
- Returns:
last_token_probs β Last token linear probabilities for each forward pass, for each text in the input batch. The output has shape (batch_size, n_passes, vocab_size).
- Return type:
np.array
folktexts.plotting moduleο
Module to plot evaluation results.
- folktexts.plotting.render_evaluation_plots(y_true, y_pred_scores, *, eval_results={}, model_name=None, imgs_dir=None, show_plots=False)[source]ο
Renders evaluation plots for the given predictions.
- Return type:
dict
folktexts.prompting moduleο
Module to map risk-estimation questions to different prompting techniques.
e.g., - multiple-choice Q&A vs direct numeric Q&A; - zero-shot vs few-shot vs CoT;
- folktexts.prompting.apply_chat_template(tokenizer, user_prompt, system_prompt=None, chat_prompt=None, **kwargs)[source]ο
Apply the tokenizerβs chat template to assemble a single prompt string.
- Return type:
str
Notes
system_prompt is treated as βincludeβ iff it is not None. This means an empty string ββ will inject an empty system message rather than be treated as βno system roleβ β pass None (or omit the argument) to skip the system role entirely.
chat_prompt is the assistant prefill. When provided, the returned prompt is trimmed so it ends exactly with chat_prompt, preserving the last-token scoring contract relied on by LLMClassifier. If the chat template mutates or strips the prefill (so it cannot be located verbatim in the rendered output), a ValueError is raised rather than silently returning a corrupted prompt.
When chat_prompt is None, add_generation_prompt=True is used and the model is left to generate freely; this is not appropriate for the benchmark scoring path (the last token will be a template-emitted role header, not the prefill).
- folktexts.prompting.encode_row_prompt(row, task, question=None, custom_prompt_prefix=None, add_task_description=True, with_answer_prefill=True)[source]ο
Encode a question regarding a given row.
with_answer_prefill is forwarded to question.get_question_prompt. The chat-template path passes False so the prefill is supplied as a separate assistant turn rather than baked into the user message.
- Return type:
str
- folktexts.prompting.encode_row_prompt_chat(row, task, tokenizer, system_prompt=<object object>, chat_prompt=<object object>, numeric=False, question=None, custom_prompt_prefix=None)[source]ο
Encode a row prompt using the tokenizerβs chat template.
- Parameters:
row (pd.Series) β The row that the question will be about.
task (TaskMetadata) β The task metadata object.
tokenizer (AutoTokenizer) β The tokenizer whose chat template will be applied.
system_prompt (str | None, optional) β System prompt text. If omitted, the mode-appropriate default selected by numeric is used. Pass None explicitly to disable the system role (e.g. for Gemma-style templates that reject it).
chat_prompt (str | None, optional) β Assistant prefill text. If omitted, the mode-appropriate default selected by numeric is used. Pass None explicitly to skip the assistant prefill β note that this routes inference through add_generation_prompt=True and breaks the last-token scoring assumption used by LLMClassifier, so it is not appropriate for the benchmark path.
numeric (bool, optional) β Whether numeric risk prompting is being used. Selects which default prompts are applied when system_prompt / chat_prompt are omitted.
question (QAInterface, optional) β The question interface to use.
custom_prompt_prefix (str, optional) β A custom prompt prefix to prepend.
- Returns:
The fully formatted chat-template prompt.
- Return type:
str
- folktexts.prompting.encode_row_prompt_few_shot(row, task, dataset, n_shots, question=None, reuse_examples=False, class_balancing=False, custom_prompt_prefix=None)[source]ο
Encode a question regarding a given row using few-shot prompting.
- Parameters:
row (pd.Series) β The row that the question will be about.
task (TaskMetadata) β The task that the row belongs to.
n_shots (int, optional) β The number of example questions and answers to use before prompting about the given row, by default 3.
reuse_examples (bool, optional) β Whether to reuse the same examples for consistency. By default will resample new examples each time (reuse_examples=False).
- Returns:
prompt β The encoded few-shot prompt.
- Return type:
str
- folktexts.prompting.resolve_chat_defaults(numeric, system_prompt=None, chat_prompt=None)[source]ο
Resolve default system_prompt / chat_prompt for chat-template prompting.
A None value means βuse the default for this modeβ. To explicitly disable a role downstream, override the resolved value with None after calling this function (which is what Benchmark.make_benchmark does for tokenizers that reject the system role).
- Return type:
tuple[str,str]
- folktexts.prompting.tokenizer_supports_system_prompt(tokenizer)[source]ο
Check whether the tokenizerβs chat template supports system messages.
Some models (e.g. Gemma) raise a TemplateError when a system role is used. Other templates surface this with different exception types depending on transformers / Jinja versions (e.g. RuntimeError, KeyError, or a template-defined exception macro), so we treat any failure of the probe as βsystem role not supportedβ rather than letting it propagate and crash the benchmark.
- Return type:
bool
folktexts.qa_interface moduleο
Interface for question-answering with LLMs.
Create different types of questions (direct numeric, multiple-choice, chain-of-thought).
Encode questions and decode model outputs.
Compute risk-estimate from model outputs.
- class folktexts.qa_interface.ChainOfThoughtQA(column, text, num_forward_passes=-1, max_new_tokens=8000, enable_thinking=False)[source]ο
Bases:
QAInterfaceA chain-of-thought (CoT) question interface.
The model is instructed to reason step-by-step in free-form text and end with an explicit Probability: X% line; the probability is recovered via regex. This works on any model regardless of chat template.
Orthogonal to the tokenizerβs enable_thinking chat-template kwarg: CoT prompting always uses free-form generation, and enable_thinking=True additionally activates the <think>β¦</think> block on tokenizers that support it (e.g., Qwen3-Thinking) β the block is stripped before regex extraction.
Notes
Unlike DirectNumericQA and MultipleChoiceQA which use token probabilities, this interface uses full text generation. The num_forward_passes is set to -1 to signal text-generation mode instead of token-probability extraction.
The regex extraction is flexible and accepts multiple formats: - βProbability: 80%β -> 0.80 - βProbability: 0.80β -> 0.80 - βProbability: 80 percentβ -> 0.80 - ββ¦ 75%β (at end of text) -> 0.75
- enable_thinkingο
Whether to enable thinking mode for tokenizers that support it (e.g., Qwen3). When True, the tokenizerβs apply_chat_template is called with enable_thinking=True. Default is False.
- Type:
bool
-
enable_thinking:
bool= Falseο
- static extract_probability_from_text(generated_text)[source]ο
Extract a probability value from generated text using regex patterns.
The extraction prioritizes (in order): the explicit βProbability: X[%]β anchor, last loose percentage, βX percentβ, then a bare 0.XX decimal. Returns a float in [0, 1] or None if nothing matched.
- Return type:
float|None
- get_answer_from_model_output(generated_text, tokenizer_vocab=None)[source]ο
Extract the probability answer from the modelβs generated text.
- Parameters:
generated_text (str) β The full text generated by the model, including reasoning and the final probability estimate.
tokenizer_vocab (dict[str, int], optional) β The tokenizerβs vocabulary. Not used for ChainOfThoughtQA but included for interface compatibility.
- Returns:
answer β The extracted probability as a float between 0 and 1.
- Return type:
float
- Raises:
ValueError β If no valid probability could be extracted from the generated text.
- get_question_prompt(with_answer_prefill=True)[source]ο
Returns the CoT question prompt.
The with_answer_prefill parameter is accepted for interface compatibility with QAInterface but has no effect: CoT prompts produce free-form text and have no answer prefill to strip.
- Return type:
str
-
max_new_tokens:
int= 8000ο
-
num_forward_passes:
int= -1ο
- class folktexts.qa_interface.Choice(text, data_value, numeric_value=None)[source]ο
Bases:
objectRepresents a choice in multiple-choice Q&A.
- textο
The text of the choice. E.g., β25-34 years oldβ.
- Type:
str
- data_valueο
The categorical value corresponding to this choice in the data.
- Type:
object
- numeric_valueο
A meaningful numeric value for the choice. E.g., if the choice is β25-34 years oldβ, the numeric value could be 30. The choice with the highest numeric value can be used as a proxy for the positive class. If not provided, will try to use the choice.value.
- Type:
float, optional
-
data_value:
objectο
-
numeric_value:
float= Noneο
-
text:
strο
- class folktexts.qa_interface.DirectNumericQA(column, text, num_forward_passes=2, answer_probability=True)[source]ο
Bases:
QAInterfaceRepresents a direct numeric question.
Notes
For example, the prompt could be β Q: What is 2 + 2? A: β With the expected answer being β4β.
If looking for a direct numeric probability, the answer prompt will be framed as so: β Q: What is the probability, between 0 and 1, of getting heads on a coin flip? A: 0.β So that we can extract a numeric answer with at most 2 forward passes. This is done automatically by passing the kwarg answer_probability=True.
Note that some models have multi-digit tokens in their vocabulary, so we need to correctly assess which tokens in the vocabulary correspond to valid numeric answers.
-
answer_probability:
bool= Trueο
- get_answer_from_model_output(last_token_probs, tokenizer_vocab)[source]ο
Outputs a numeric answer inferred from the modelβs output.
- Parameters:
last_token_probs (np.ndarray) β The last token probabilities of the model for the question. The first dimension must correspond to the number for forward passes as specified by num_forward_passes.
tokenizer_vocab (dict[str, int],) β The tokenizerβs vocabulary.
- Returns:
answer β The numeric answer to the question.
- Return type:
float | int
Notes
Eventually we could run a search algorithm to find the most likely answer over multiple forward passes, but for now weβll just take the argmax on each forward pass.
- get_question_prompt(with_answer_prefill=True)[source]ο
Returns the question text.
with_answer_prefill=True (the default) bakes the answer prefill into the returned string β required by the zero-shot / few-shot last-token scoring path, which reads probabilities from the very next token after the prefill. Set to False for chat-template prompting, where the prefill is supplied separately as the assistant turn (otherwise the same string ends up emitted twice and silently degrades scoring).
- Return type:
str
-
num_forward_passes:
int= 2ο
-
answer_probability:
- class folktexts.qa_interface.MultipleChoiceQA(column, text, num_forward_passes=1, choices=<factory>, _answer_keys_source=<factory>)[source]ο
Bases:
QAInterfaceRepresents a multiple-choice question and its answer keys.
- property answer_keys: tuple[str, ...]ο
- classmethod create_answer_keys_permutations(question)[source]ο
Yield questions with all permutations of answer keys.
- Parameters:
question (Question) β The template question whose answer keys will be permuted.
- Returns:
permutations β A generator of questions with all permutations of answer keys.
- Return type:
Iterator[Question]
- classmethod create_question_from_value_map(column, value_map, attribute, **kwargs)[source]ο
Constructs a question from a value map.
- Return type:
- get_answer_from_model_output(last_token_probs, tokenizer_vocab)[source]ο
Decodes the modelβs output into an answer for the given question.
- Parameters:
last_token_probs (np.ndarray) β The modelβs last token probabilities for the question. The first dimension corresponds to the number of forward passes as specified by self.num_forward_passes.
tokenizer_vocab (dict[str, int],) β The tokenizerβs vocabulary.
- Returns:
answer β The answer to the question.
- Return type:
float
- get_answer_key_from_value(value)[source]ο
Returns the answer key corresponding to the given data value.
- Return type:
str
- get_question_prompt(with_answer_prefill=True)[source]ο
Returns the question text.
with_answer_prefill=True (the default) bakes the answer prefill into the returned string β required by the zero-shot / few-shot last-token scoring path, which reads probabilities from the very next token after the prefill. Set to False for chat-template prompting, where the prefill is supplied separately as the assistant turn (otherwise the same string ends up emitted twice and silently degrades scoring).
- Return type:
str
- get_value_to_text_map()[source]ο
Returns the map from choice data value to choice textual representation.
- Return type:
dict[object,str]
-
num_forward_passes:
int= 1ο
- class folktexts.qa_interface.QAInterface(column, text, num_forward_passes)[source]ο
Bases:
ABCAn interface for a question-answering system.
-
column:
strο
- get_answer_from_model_output(last_token_probs, tokenizer_vocab)[source]ο
Decodes the modelβs output into an answer for the given question.
- Parameters:
last_token_probs (np.ndarray) β The modelβs last token probabilities for the question. The first dimension corresponds to the number of forward passes as specified by self.num_forward_passes.
tokenizer (dict[str, int]) β The tokenizerβs vocabulary.
- Returns:
answer β The answer to the question.
- Return type:
float
- get_question_prompt(with_answer_prefill=True)[source]ο
Returns the question text.
with_answer_prefill=True (the default) bakes the answer prefill into the returned string β required by the zero-shot / few-shot last-token scoring path, which reads probabilities from the very next token after the prefill. Set to False for chat-template prompting, where the prefill is supplied separately as the assistant turn (otherwise the same string ends up emitted twice and silently degrades scoring).
- Return type:
str
-
num_forward_passes:
intο
-
text:
strο
-
column:
folktexts.task moduleο
Definition of a generic TaskMetadata class.
- class folktexts.task.TaskMetadata(name, features, target, cols_to_text, sensitive_attribute=None, target_threshold=None, multiple_choice_qa=None, direct_numeric_qa=None, cot_qa=None, description=None, _use_numeric_qa=False, _use_cot_qa=False)[source]ο
Bases:
objectA base class to hold information on a prediction task.
- check_task_columns_are_available(available_cols, raise_=True)[source]ο
Checks if all columns required by this task are available.
- Parameters:
available_cols (list[str]) β The list of column names available in the dataset.
raise (bool, optional) β Whether to raise an error if some columns are missing, by default True.
- Returns:
all_available β True if all required columns are present in the given list of available columns, False otherwise.
- Return type:
bool
-
cols_to_text:
dict[str,ColumnToText]ο A mapping between column names and their textual descriptions.
-
cot_qa:
ChainOfThoughtQA= Noneο The chain-of-thought (CoT) question and answer interface for this task.
- create_task_with_feature_subset(feature_subset)[source]ο
Creates a new task with a subset of the original features.
-
description:
str= Noneο A description of the task, including the population to which the task pertains to.
-
direct_numeric_qa:
DirectNumericQA= Noneο The direct numeric question and answer interface for this task.
-
features:
list[str]ο The names of the features used in the task.
- get_row_description(row)[source]ο
Encode a description of a given data row in textual form.
- Return type:
str
- get_target()[source]ο
Resolves the name of the target column depending on self.target_threshold.
- Return type:
str
- classmethod get_task(name, use_numeric_qa=False)[source]ο
Fetches a previously created task by its name.
- Parameters:
name (str) β The name of the task to fetch.
use_numeric_qa (bool, optional) β Whether to set the retrieved task to use verbalized numeric Q&A instead of the default multiple-choice Q&A prompts. Default is False.
- Returns:
task β The task object with the given name.
- Return type:
- Raises:
ValueError β Raised if the task with the given name has not been created yet.
-
multiple_choice_qa:
MultipleChoiceQA= Noneο The multiple-choice question and answer interface for this task.
-
name:
strο The name of the task.
- property question: QAInterfaceο
Getter for the Q&A interface for this task.
-
sensitive_attribute:
str= Noneο The name of the column used as the sensitive attribute data (if provided).
- sensitive_attribute_value_map()[source]ο
Returns a mapping between sensitive attribute values and their descriptions.
- Return type:
Callable
-
target:
strο The name of the target column.
-
target_threshold:
Threshold= Noneο The threshold used to binarize the target column (if provided).
- property use_cot_qa: boolο
Getter for whether to use chain-of-thought (CoT) Q&A prompts.
- property use_numeric_qa: boolο
Getter for whether to use numeric Q&A instead of multiple-choice Q&A prompts.
folktexts.threshold moduleο
Helper function for defining binarization thresholds.
- class folktexts.threshold.Threshold(value, op)[source]ο
Bases:
objectA class to represent a threshold value and its comparison operator.
- valueο
The threshold value to compare against.
- Type:
float | int
- opο
The comparison operator to use. One of β>β, β<β, β>=β, β<=β, β==β, β!=β.
- Type:
str
- apply_to_column_data(data)[source]ο
Applies the threshold operation to a pandas Series or scalar value.
- Return type:
int|Series
- apply_to_column_name(column_name)[source]ο
Standardizes naming of thresholded columns.
- Return type:
str
-
op:
strο
-
valid_ops:
ClassVar[dict] = {'!=': <built-in function ne>, '<': <built-in function lt>, '<=': <built-in function le>, '==': <built-in function eq>, '>': <built-in function gt>, '>=': <built-in function ge>}ο
-
value:
float|intο