Configuring prompts

Overview

You configure every prompt through two frozen dataclasses, built once and passed down the call stack unchanged (they replace the older approach of threading individual keyword arguments through every call):

  • PromptConfig — how one row is rendered: value mapping, ordering, the label↔value connector, the final layout, an optional custom prefix/suffix, and the system prompt.

  • FewShotConfig — whether and how in-context examples are prepended.

Every prompt is then composed of three independently-built parts:

[PREFIX]  task description                (constant across rows)
[INFO]    serialized feature-value pairs  (row-specific)
[SUFFIX]  question text + answer prefill  (constant)

The answer prefill is the short lead-in the prompt ends on, so the model’s next token is the answer we score. Its exact text depends on the question type: multiple-choice ends on "Answer:", numeric ends on "Answer (between 0 and 1): 0.", and chain-of-thought generates free-form text. In chat mode the prefill becomes the assistant’s opening turn instead.
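As a rough illustration of the three-part layout, the sketch below joins a constant prefix and suffix around a row-specific block. The helper name compose_prompt and all of the prompt wording are hypothetical and not taken from the library; only the PREFIX / INFO / SUFFIX split comes from the description above.

```python
def compose_prompt(prefix: str, info: str, suffix: str) -> str:
    """Join the three independently built parts into one prompt string."""
    return f"{prefix}\n{info}\n{suffix}"


# Constant across rows (hypothetical wording, for illustration only).
prefix = "The following data describes a survey respondent."
suffix = "Question: Is this person's income above $50,000?\nAnswer:"

# Row-specific [INFO] blocks, one per data row.
rows = [
    "- age is: 41\n- education is: Bachelor's degree",
    "- age is: 23\n- education is: High school diploma",
]

prompts = [compose_prompt(prefix, info, suffix) for info in rows]
```

Every prompt built this way ends on the answer prefill, so only the middle block differs between rows.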

Both configs are hashable, so each distinct configuration gets its own results-file name and runs never silently overwrite one another. The defaults reproduce the original paper’s prompts exactly; read on only to change how prompts are rendered. The command-line equivalents are summarized in the README.
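To see why a hashable, frozen config makes per-configuration file names possible, consider the minimal sketch below. DemoPromptConfig and results_filename are hypothetical stand-ins, and the naming scheme is an assumption; the library's actual file-name format is not described here. The sketch uses a content digest rather than the built-in hash(), because Python randomizes string hashes between processes.

```python
import hashlib
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DemoPromptConfig:
    """Hypothetical stand-in for a frozen config; not the library's PromptConfig."""
    format: str
    connector: str


def results_filename(config: DemoPromptConfig) -> str:
    # Digest the field values so that equal configs map to the same name
    # in every process, and different configs map to different names.
    payload = repr(sorted(asdict(config).items())).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:8]
    return f"results.{digest}.json"
```

Two runs with equal configs resolve to the same file, while changing any field yields a different name.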

Question modes

A run asks the model one of three kinds of question. The mode is a single choice that determines what the model is asked to produce and how that output becomes a probability:

| Mode | Activate with | What the model does |
| --- | --- | --- |
| Multiple-choice (default) | (nothing; it is the default) | Answers a multiple-choice question; we score the answer-letter tokens and read the probability off them. Order-bias correction is on by default (correct_order_bias). |
| Numeric | numeric_risk_prompting=True · --numeric-risk-prompting | Reports the probability directly: the prompt ends on "Answer (between 0 and 1): 0." and we read the digit tokens. |
| Chain-of-thought | cot_prompting=True · --cot-prompting | Generates free-form reasoning and ends with a "Probability: X%" line, recovered by regex. Works on any model, with or without a chat template. |

enable_thinking / --enable-thinking is a sub-option of chain-of-thought: it turns on a tokenizer’s native thinking mode (e.g. Qwen3) via apply_chat_template(..., enable_thinking=True), and the resulting <think>…</think> block is stripped before extraction. It only applies when CoT is on — setting it alone implicitly enables CoT (and warns). If both numeric_risk_prompting and cot_prompting are set, chain-of-thought wins.
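The extraction step for chain-of-thought output can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the regular expressions, the choice to take the last matching line, and the function name extract_probability are all hypothetical. Only the <think>…</think> stripping and the "Probability: X%" convention come from the text above.

```python
import re
from typing import Optional

# Matches a native thinking block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)
# Matches e.g. "Probability: 72%" or "Probability: 12.5 %".
PROBABILITY_LINE = re.compile(r"Probability:\s*(\d+(?:\.\d+)?)\s*%")


def extract_probability(generation: str) -> Optional[float]:
    """Return the reported probability in [0, 1], or None if none is found."""
    visible = THINK_BLOCK.sub("", generation)   # drop the thinking block first
    matches = PROBABILITY_LINE.findall(visible)
    if not matches:
        return None
    value = float(matches[-1]) / 100.0          # assumption: the last line wins
    return min(max(value, 0.0), 1.0)            # clamp out-of-range answers
```

Stripping the thinking block before matching keeps a tentative figure inside the model's reasoning from being mistaken for the final answer.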

The mode is separate from how the prompt is delivered — zero-shot (the default), few-shot (FewShotConfig / --few-shot), or chat-template formatting (use_chat_template / --use-chat-template). The allowed pairings:

| | zero-shot | few-shot | chat-template |
| --- | --- | --- | --- |
| Multiple-choice | ✓ | ✓ | ✓ |
| Numeric | ✓ | ✓ | ✓ |
| Chain-of-thought | ✓ | – | ✗ |

✓ supported · ✗ raises ValueError · – not a supported combination. Two pairings are rejected at config time: few-shot + chat-template, and chain-of-thought + chat-template (CoT already applies the chat template internally, so an outer one would double-wrap the prompt). Because few-shot and chat-template are mutually exclusive, a run uses exactly one delivery path. Chain-of-thought is a standalone path: it runs zero-shot and is not combined with few-shot.
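The mode precedence and the two rejected pairings can be summarized in a small sketch. The function names resolve_mode and validate_delivery and the error messages are hypothetical; the rules themselves (chain-of-thought wins over numeric, enable_thinking alone enables CoT with a warning, and the two ValueError cases) are taken from the text above. What happens for chain-of-thought with few-shot is not specified beyond "not supported", so the sketch leaves that case alone.

```python
import warnings


def resolve_mode(numeric_risk_prompting: bool = False,
                 cot_prompting: bool = False,
                 enable_thinking: bool = False) -> str:
    """Pick the single question mode implied by the flags."""
    if enable_thinking and not cot_prompting:
        warnings.warn("enable_thinking implies chain-of-thought; enabling it.")
        cot_prompting = True
    if cot_prompting:
        return "cot"              # wins even if numeric is also set
    if numeric_risk_prompting:
        return "numeric"
    return "multiple-choice"


def validate_delivery(mode: str, few_shot: bool = False,
                      use_chat_template: bool = False) -> None:
    """Reject the two pairings that the text says fail at config time."""
    if few_shot and use_chat_template:
        raise ValueError("few-shot cannot be combined with the chat template")
    if mode == "cot" and use_chat_template:
        raise ValueError("chain-of-thought already applies the chat template")
```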

The variation pipeline

The [INFO] block is produced by a pipeline of Vary* stages whose order is enforced by their return types — each stage’s output type is the next stage’s input type, so they compose in exactly one order:

VaryValueMap → VaryOrder → VaryConnector → VaryFormat
(list→list)    (list→list)  (list→list)    (list→str)

VaryFormat collapses the feature list into a single string, which is why no per-item stage can run after it. You don’t instantiate these stages yourself — set the keys in the last column below (via --variation or PromptConfig.from_dict).

| INFO-pipeline stage | Controls | --variation key |
| --- | --- | --- |
| VaryValueMap | How raw column values render as text; low granularity coarsens ACS values into broader bins (age ranges, grouped occupations). | granularity |
| VaryOrder | Feature ordering (named columns first, the rest appended). | order |
| VaryConnector | The label↔value separator (is:, is, =, :, …). | connector |
| VaryFormat | Final layout (textbullet, bullet, comma, text). | format |
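The way return types force a single stage order can be illustrated with plain functions. This is a simplified sketch, not the library's Vary* classes: the function names, the age-binning rule, and the bullet and comma renderings are assumptions made for the example. The point it shows is that the last stage returns a str, so nothing that expects a list can follow it.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def vary_value_map(pairs: List[Tuple[str, object]], granularity: str) -> List[Pair]:
    """list -> list: render raw values as text, coarsening when granularity is low."""
    out = []
    for name, value in pairs:
        if granularity == "low" and name == "age":
            decade = int(value) // 10 * 10          # hypothetical binning rule
            out.append((name, f"{decade}-{decade + 9}"))
        else:
            out.append((name, str(value)))
    return out


def vary_order(pairs: List[Pair], order: str = "") -> List[Pair]:
    """list -> list: named columns first, the rest appended in original order."""
    first = [name for name in order.split(",") if name]
    by_name = dict(pairs)
    named = [(name, by_name[name]) for name in first if name in by_name]
    rest = [pair for pair in pairs if pair[0] not in first]
    return named + rest


def vary_connector(pairs: List[Pair], connector: str) -> List[str]:
    """list -> list: join each label and value with the chosen connector."""
    return [f"{name} {connector} {value}" for name, value in pairs]


def vary_format(lines: List[str], style: str) -> str:
    """list -> str: collapse the feature list into the final layout."""
    if style == "bullet":
        return "\n".join(f"- {line}" for line in lines)
    if style == "comma":
        return ", ".join(lines)
    raise ValueError(f"unknown format: {style!r}")


raw = [("age", 41), ("education", "Bachelor's degree"), ("occupation", "Teacher")]
info = vary_format(
    vary_connector(vary_order(vary_value_map(raw, "low"), "occupation"), "is:"),
    "comma",
)
```

Swapping any two stages here would hand a function the wrong input type, which is the property the real pipeline relies on.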

The [PREFIX] and [SUFFIX] are built separately: VaryPrefix and VarySuffix each return their str directly, and VarySystemPrompt holds the optional system-role string for the chat path.

| Prompt part | Controls | Key |
| --- | --- | --- |
| VaryPrefix | Task description + optional custom prefix. | custom_prompt_prefix |
| VarySuffix | Question text / answer prefill. | custom_prompt_suffix, show_question |
| VarySystemPrompt | Optional system-role string (chat path). | system_prompt= / --system-prompt (not a --variation key) |

PromptConfig

PromptConfig holds one instance of each Vary* stage: one each for the prefix, suffix, and system prompt, plus the four-stage pipeline for the feature block. Build one from a dictionary of overrides whose keys are the seven --variation keys from the tables above (format, connector, granularity, order, custom_prompt_prefix, custom_prompt_suffix, show_question). The overrides are validated against DEFAULT_PROMPT_STYLE: an unknown key raises ValueError, as does an unrecognized granularity or format value. The task argument must be a TaskMetadata object; resolve a task name with TaskMetadata.get_task:

from folktexts import TaskMetadata
from folktexts.prompting import PromptConfig

task = TaskMetadata.get_task("ACSIncome")
prompt_config = PromptConfig.from_dict(
    {
        "format": "bullet",
        "connector": "=",
        "order": "AGEP,SCHL,COW",
        "custom_prompt_prefix": "Consider the following person.",
    },
    task=task,
)

from_dict also takes two optional keyword arguments: question= overrides the task’s default question interface, and add_task_description= (default True) can be set to False to drop the task description from the prefix.

Pass it straight to any classifier:

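from folktexts.classifier import VLLMClassifier

# Assumes `llm` and `tokenizer` are already loaded, and that `prompt_config`
# is the PromptConfig built above.
clf = VLLMClassifier(
    llm=llm, tokenizer=tokenizer, task="ACSIncome",
    prompt_config=prompt_config,
)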
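The override validation described for from_dict can be approximated by the sketch below. It is not the library's implementation: DEMO_DEFAULT_STYLE stands in for DEFAULT_PROMPT_STYLE, its default values are placeholders chosen for the example, and merge_overrides is a hypothetical name. The seven key names and the four format values are the ones listed in this section.

```python
# Placeholder defaults; only the key names are taken from the documentation.
DEMO_DEFAULT_STYLE = {
    "format": "textbullet",
    "connector": "is:",
    "granularity": "original",
    "order": "",
    "custom_prompt_prefix": "",
    "custom_prompt_suffix": "",
    "show_question": True,
}
ALLOWED_FORMATS = {"textbullet", "bullet", "comma", "text"}


def merge_overrides(overrides: dict) -> dict:
    """Validate overrides against the defaults and return the merged style."""
    unknown = set(overrides) - set(DEMO_DEFAULT_STYLE)
    if unknown:
        raise ValueError(f"unknown variation key(s): {sorted(unknown)}")
    style = {**DEMO_DEFAULT_STYLE, **overrides}
    if style["format"] not in ALLOWED_FORMATS:
        raise ValueError(f"unrecognized format: {style['format']!r}")
    return style
```

Failing loudly on an unknown key means a typo such as "conector" is caught when the config is built rather than silently producing default prompts.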

The PROMPT_DEFAULT sentinel

system_prompt (the system-role text) and chat_prompt (the assistant-turn prefill the model continues from in chat mode) have three modes: omit the argument for the built-in default, pass None to remove the role entirely (needed for Gemma-style templates that reject a system turn), or pass your own string. The “built-in default” mode is spelled with the public sentinel PROMPT_DEFAULT — which is distinct from None:

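from folktexts.prompting import PROMPT_DEFAULT, PromptConfig

# `task` is the TaskMetadata object resolved earlier with TaskMetadata.get_task.
PromptConfig.from_dict({}, task=task)                                # default system prompt (argument omitted)
PromptConfig.from_dict({}, task=task, system_prompt=PROMPT_DEFAULT)  # same default, spelled explicitly
PromptConfig.from_dict({}, task=task, system_prompt=None)            # no system role at all
PromptConfig.from_dict({}, task=task, system_prompt="...")           # custom system prompt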

These defaults are ClassVars on the QAInterface hierarchy: multiple-choice questions use the base QAInterface defaults, DirectNumericQA overrides them with numeric-specific prompts, and ChainOfThoughtQA sets them to None (free-form generation). The question type therefore supplies the right default, which is why there is no longer a separate numeric= argument to pass — pick the mode as described under Question modes above.
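The three-way distinction between "default", "disabled", and "custom" is a standard sentinel pattern, sketched below. All names here (DEMO_PROMPT_DEFAULT, the Demo* classes, resolve_system_prompt) and the prompt strings are hypothetical; the sketch only mirrors the behaviour described above, where class-level defaults are looked up when the sentinel is seen and None is passed through as an explicit "no role".

```python
from typing import ClassVar, Optional


class _DefaultSentinel:
    """Unique marker meaning 'use the built-in default'; distinct from None."""
    def __repr__(self) -> str:
        return "DEMO_PROMPT_DEFAULT"


DEMO_PROMPT_DEFAULT = _DefaultSentinel()


class DemoQA:
    # Base default, used by multiple-choice questions in this sketch.
    default_system_prompt: ClassVar[Optional[str]] = "Answer the question."


class DemoNumericQA(DemoQA):
    default_system_prompt = "Answer with a probability between 0 and 1."


class DemoCoTQA(DemoQA):
    default_system_prompt = None    # free-form generation: no system role


def resolve_system_prompt(question: DemoQA, system_prompt=DEMO_PROMPT_DEFAULT):
    """Sentinel -> class default; None -> no role; str -> custom text."""
    if system_prompt is DEMO_PROMPT_DEFAULT:
        return type(question).default_system_prompt
    return system_prompt
```

Because the default lives on the question class, callers never need to say which mode they are in when asking for the default prompt.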

FewShotConfig

Few-shot prompting is configured with a single frozen dataclass:

from folktexts.prompting import FewShotConfig
from folktexts.benchmark import Benchmark

# Assumes `llm` and `tokenizer` are already loaded.
bench = Benchmark.make_acs_benchmark(
    "ACSIncome", model=llm, tokenizer=tokenizer, data_dir="~/data",
    few_shot_config=FewShotConfig(
        n_shots=4,
        # "random" (default) | "balanced" | per-class counts in label order,
        # e.g. (2, 2) = 2 examples of class 0 + 2 of class 1
        compose="balanced",
        example_order=(3, 2, 1, 0),      # optional permutation of the example indices
        reuse_examples=True,             # default False (resamples per row); True reuses the same examples
        show_question_in_examples=True,  # default True; set False for answer-only examples
    ),
)

Few-shot prompting cannot be combined with the chat-template path (use_chat_template=True) — that combination raises ValueError.
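How the compose and example_order options might interact is sketched below. This is an illustrative guess at the semantics, not the library's sampler: pick_examples is a hypothetical function, the seed handling is an assumption, and the sketch simply raises when a balanced split does not divide evenly, since the source does not say how remainders are handled.

```python
import random
from typing import List, Optional, Sequence, Tuple, Union


def pick_examples(labels: Sequence[int], n_shots: int,
                  compose: Union[str, Tuple[int, ...]] = "random",
                  example_order: Optional[Sequence[int]] = None,
                  seed: int = 0) -> List[int]:
    """Return the indices of the training rows to use as in-context examples."""
    rng = random.Random(seed)
    if compose == "random":
        chosen = rng.sample(range(len(labels)), n_shots)
    else:
        classes = sorted(set(labels))
        if compose == "balanced":
            counts = [n_shots // len(classes)] * len(classes)
        else:
            counts = list(compose)              # per-class counts in label order
        if len(counts) != len(classes) or sum(counts) != n_shots:
            raise ValueError("per-class counts must match classes and sum to n_shots")
        chosen = []
        for cls, count in zip(classes, counts):
            pool = [i for i, label in enumerate(labels) if label == cls]
            chosen += rng.sample(pool, count)
    if example_order is not None:
        if sorted(example_order) != list(range(n_shots)):
            raise ValueError("example_order must be a permutation of range(n_shots)")
        chosen = [chosen[i] for i in example_order]
    return chosen
```

For instance, with compose=(3, 1) the result holds three class-0 rows and one class-1 row, and example_order=(3, 2, 1, 0) reverses whatever order sampling produced.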

Migrating from the flat-keyword API

Earlier versions configured prompts through scattered keyword arguments. Those have been consolidated into PromptConfig / FewShotConfig. Passing a removed keyword to a constructor or encode_row_prompt* now raises TypeError instead of being silently ignored. Saved benchmark configs from before the change still load: BenchmarkConfig.load_from_disk translates the legacy few-shot keys and ignores any other unknown keys with a warning.

| Old | New |
| --- | --- |
| custom_prompt_prefix="..." (classifier / encode_row_prompt*) | prompt_config=PromptConfig.from_dict({"custom_prompt_prefix": "..."}, task) or CLI --variation custom_prompt_prefix=... |
| add_task_description=False | now an argument to PromptConfig.from_dict(...) |
| few_shot=N, reuse_few_shot_examples=..., balance_few_shot_examples=... (BenchmarkConfig) | few_shot_config=FewShotConfig(n_shots=N, reuse_examples=..., compose="balanced") |
| class_balancing=True (sample_n_train_examples / encode_row_prompt_few_shot) | compose="balanced" / CLI --compose-few-shot-examples balanced |
| CLI --balance-few-shot-examples | CLI --compose-few-shot-examples balanced |
| numeric=True (encode_row_prompt_chat / resolve_chat_defaults) | removed; the default is derived from the QAInterface subclass (DirectNumericQA) |
| encode_row_prompt(row, task, question_obj) (positional question) | question= is now keyword-only |
| system_prompt=None / chat_prompt=None to mean “default” | PROMPT_DEFAULT means “default”; None now means “explicitly disable” |
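The legacy-config handling described in this section (translate the old few-shot keys, warn about and ignore anything else unknown) could look roughly like the sketch below. It is an assumption-laden illustration: translate_legacy_config is a hypothetical function operating on plain dictionaries, whereas the real logic lives in BenchmarkConfig.load_from_disk and may differ in detail. The old and new key names follow the migration table.

```python
import warnings


def translate_legacy_config(saved: dict, known_keys: set) -> dict:
    """Map legacy few-shot keys to a few_shot_config entry; drop unknown keys."""
    config = dict(saved)
    n_shots = config.pop("few_shot", None)
    reuse = config.pop("reuse_few_shot_examples", False)
    balance = config.pop("balance_few_shot_examples", False)
    if n_shots:
        # A plain dict stands in for FewShotConfig in this sketch.
        config["few_shot_config"] = {
            "n_shots": n_shots,
            "reuse_examples": reuse,
            "compose": "balanced" if balance else "random",
        }
    for key in sorted(set(config) - set(known_keys)):
        warnings.warn(f"ignoring unknown config key: {key!r}")
        del config[key]
    return config
```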

The top-level public API (Benchmark, BenchmarkConfig, the classifiers, the QAInterface subclasses, TaskMetadata, ACSDataset) is unchanged.