Configuring prompts
Overview
You configure every prompt through two frozen dataclasses, built once and passed down the call stack unchanged (they replace the older approach of threading individual keyword arguments through every call):
- PromptConfig: how one row is rendered (value mapping, ordering, the label↔value connector, the final layout, an optional custom prefix/suffix, and the system prompt).
- FewShotConfig: whether and how in-context examples are prepended.
Every prompt is then composed of three independently-built parts:
[PREFIX] task description (constant across rows)
[INFO] serialized feature-value pairs (row-specific)
[SUFFIX] question text + answer prefill (constant)
The answer prefill is the short lead-in the prompt ends on, so the model’s next
token is the answer we score; its exact text depends on the question type
(multiple-choice ends on Answer:, numeric on Answer (between 0 and 1): 0.,
chain-of-thought generates free-form). In chat mode it becomes the assistant’s
opening turn instead.
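The three-part composition can be sketched in a few lines. This is a minimal illustration, not the library's code: `compose_prompt` is a hypothetical helper and the prompt texts are made up.

```python
def compose_prompt(prefix: str, info: str, suffix: str) -> str:
    """Join the three independently built parts into one prompt string."""
    # Empty parts are skipped so that a dropped prefix leaves no blank line.
    return "\n".join(part for part in (prefix, info, suffix) if part)


# Made-up texts, for illustration only.
prefix = "The following data describes a survey respondent."  # constant across rows
info = "- Age: 42\n- Occupation: teacher"  # row-specific
suffix = "Question: Is this person's income above the threshold?\nAnswer:"  # ends on the prefill

prompt = compose_prompt(prefix, info, suffix)
print(prompt)
```

Because the prompt ends on the prefill, the next token the model produces is the one that gets scored.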
Both configs are hashable, so each distinct configuration gets its own results-file name and runs never silently overwrite one another. The defaults reproduce the original paper’s prompts exactly; read on only to change how prompts are rendered. The command-line equivalents are summarized in the README.
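To see why a frozen, hashable config lends itself to per-configuration file names, consider the sketch below. It assumes a hypothetical `ToyPromptConfig` with placeholder field values and a hypothetical `results_file_name` helper; the naming scheme the library actually uses may differ.

```python
import hashlib
import json
from dataclasses import FrozenInstanceError, asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyPromptConfig:
    """Hypothetical stand-in; field values are placeholders, not library defaults."""

    format: str = "bullet"
    connector: str = ":"
    order: Tuple[str, ...] = ()


def results_file_name(config: ToyPromptConfig) -> str:
    """Derive a file name that is stable across processes and unique per config."""
    # hash() of a str is salted per process, so a content digest is used instead.
    payload = json.dumps(asdict(config), sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:10]
    return f"results.{digest}.csv"


default = ToyPromptConfig()
variant = ToyPromptConfig(connector="=", order=("AGEP", "SCHL"))

print(results_file_name(default))
print(results_file_name(variant))
```

Equal configs map to the same name and any changed field yields a different one, which is the property that keeps runs from overwriting each other.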
Question modes
A run asks the model one of three kinds of question. The mode is a single choice that determines what the model is asked to produce and how that output becomes a probability:
| Mode | Activate with | What the model does |
|---|---|---|
| Multiple-choice (default) | (nothing: it is the default) | Answers a multiple-choice question; we score the answer-letter tokens and read the probability off them. Order-bias correction is on by default (…). |
| Numeric | numeric_risk_prompting | Reports the probability directly; the prompt ends on Answer (between 0 and 1): 0. |
| Chain-of-thought | cot_prompting | Generates free-form reasoning and ends with a … |
enable_thinking / --enable-thinking is a sub-option of chain-of-thought: it
turns on a tokenizer’s native thinking mode (e.g. Qwen3) via
apply_chat_template(..., enable_thinking=True), and the resulting
<think>…</think> block is stripped before extraction. It only applies when CoT
is on — setting it alone implicitly enables CoT (and warns). If both
numeric_risk_prompting and cot_prompting are set, chain-of-thought wins.
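The precedence rules above can be summarized in a small function. This is a sketch under stated assumptions: `resolve_question_mode` is a hypothetical helper that only mirrors the rules described here, and the warning text is invented.

```python
import warnings


def resolve_question_mode(
    numeric_risk_prompting: bool = False,
    cot_prompting: bool = False,
    enable_thinking: bool = False,
) -> str:
    """Return the effective question mode for a combination of flags."""
    if enable_thinking and not cot_prompting:
        # Thinking mode is a sub-option of chain-of-thought, so it switches CoT on.
        warnings.warn("enable_thinking implies chain-of-thought; enabling cot_prompting.")
        cot_prompting = True
    if cot_prompting:
        return "chain-of-thought"  # wins over numeric when both are set
    if numeric_risk_prompting:
        return "numeric"
    return "multiple-choice"


print(resolve_question_mode())
print(resolve_question_mode(numeric_risk_prompting=True, cot_prompting=True))
```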
The mode is separate from how the prompt is delivered — zero-shot (the
default), few-shot (FewShotConfig / --few-shot), or chat-template formatting
(use_chat_template / --use-chat-template). The allowed pairings:
| | zero-shot | few-shot | chat-template |
|---|---|---|---|
| Multiple-choice | ✓ | ✓ | ✓ |
| Numeric | ✓ | ✓ | ✓ |
| Chain-of-thought | ✓ | – | ✗ |
✓ supported · ✗ raises ValueError · – not a supported combination.

Two pairings are rejected at config time: few-shot + chat-template, and **chain-of-thought + chat-template** (CoT already applies the chat template internally, so an outer one would double-wrap the prompt). Because few-shot and chat-template are mutually exclusive, a run uses exactly one delivery path. Chain-of-thought is a standalone path: it runs zero-shot and is not combined with few-shot.
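A config-time check for the two rejected pairings might look like the sketch below. `validate_delivery` is a hypothetical function and the error messages are invented. The text does not say what happens for chain-of-thought + few-shot beyond it being unsupported, so the sketch leaves that case unchecked.

```python
def validate_delivery(mode: str, few_shot: bool = False, use_chat_template: bool = False) -> None:
    """Raise ValueError for the two pairings that are rejected at config time."""
    if few_shot and use_chat_template:
        raise ValueError("few-shot cannot be combined with the chat-template path")
    if mode == "chain-of-thought" and use_chat_template:
        # CoT applies the chat template itself; an outer one would double-wrap.
        raise ValueError("chain-of-thought cannot be combined with use_chat_template")


validate_delivery("numeric", few_shot=True)  # allowed
validate_delivery("multiple-choice", use_chat_template=True)  # allowed
```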
The variation pipeline
The [INFO] block is produced by a pipeline of Vary* stages whose order is
enforced by their return types — each stage’s output type is the next stage’s
input type, so they compose in exactly one order:
VaryValueMap → VaryOrder → VaryConnector → VaryFormat
(list→list) (list→list) (list→list) (list→str)
VaryFormat collapses the feature list into a single string, which is why no
per-item stage can run after it. You don’t instantiate these stages yourself —
set the keys in the last column below (via --variation or
PromptConfig.from_dict).
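| INFO-pipeline stage | Controls | --variation key |
|---|---|---|
| VaryValueMap | How raw column values render as text; … | granularity |
| VaryOrder | Feature ordering (named columns first, the rest appended). | order |
| VaryConnector | The label↔value separator (…). | connector |
| VaryFormat | Final layout (…). | format |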
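The type-enforced ordering of the four stages can be illustrated with plain functions. This is a simplified sketch, not the Vary* classes themselves: the function names, the `Feature` alias, and the fallback layout are hypothetical, and the value mapping is made up.

```python
from typing import Dict, List, Tuple

Feature = Tuple[str, str]  # (label, value text)


def value_map_stage(features: List[Feature], mapping: Dict[str, Dict[str, str]]) -> List[Feature]:
    """Replace raw values by readable text where a mapping exists."""
    return [(label, mapping.get(label, {}).get(value, value)) for label, value in features]


def order_stage(features: List[Feature], first: List[str]) -> List[Feature]:
    """Put the named columns first and append the rest in their original order."""
    by_label = dict(features)
    named = [(name, by_label[name]) for name in first if name in by_label]
    rest = [(label, value) for label, value in features if label not in first]
    return named + rest


def connector_stage(features: List[Feature], connector: str) -> List[str]:
    """Join each label and value with the connector."""
    return [f"{label}{connector}{value}" for label, value in features]


def format_stage(lines: List[str], layout: str) -> str:
    """Collapse the list into one string; no per-item stage can run after this."""
    if layout == "bullet":
        return "\n".join(f"- {line}" for line in lines)
    return ", ".join(lines)  # hypothetical fallback layout


row = [("COW", "1"), ("AGEP", "42"), ("SCHL", "21")]
mapped = value_map_stage(row, {"COW": {"1": "private employee"}})
ordered = order_stage(mapped, ["AGEP", "SCHL"])
info = format_stage(connector_stage(ordered, "="), "bullet")
print(info)
```

Since the last stage returns a `str`, a type checker would flag any attempt to feed its output back into a list-based stage.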
The [PREFIX] and [SUFFIX] are built separately: VaryPrefix and VarySuffix
each return their str directly, and VarySystemPrompt holds the optional
system-role string for the chat path.
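| Prompt part | Controls | Key |
|---|---|---|
| VaryPrefix | Task description + optional custom prefix. | custom_prompt_prefix |
| VarySuffix | Question text / answer prefill. | custom_prompt_suffix, show_question |
| VarySystemPrompt | Optional system-role string (chat path). | system_prompt |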
PromptConfig
PromptConfig holds one instance of each Vary* stage — one each for the
prefix, suffix, and system prompt, plus the four-stage pipeline for the feature
block. Build one from a dictionary of overrides whose keys are the seven
--variation keys from the tables above (format, connector, granularity,
order, custom_prompt_prefix, custom_prompt_suffix, show_question),
validated against DEFAULT_PROMPT_STYLE — an unknown key raises ValueError, as
does an unrecognized granularity or format value.
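The validation behaviour described above can be sketched as follows. The seven key names come from this page, but `TOY_DEFAULT_STYLE`, its placeholder values, the `TOY_FORMATS` set, and `merge_overrides` are hypothetical and do not reflect the contents of DEFAULT_PROMPT_STYLE.

```python
# Placeholder values only; the real defaults live in DEFAULT_PROMPT_STYLE.
TOY_DEFAULT_STYLE = {
    "format": "bullet",
    "connector": ":",
    "granularity": "original",
    "order": None,
    "custom_prompt_prefix": None,
    "custom_prompt_suffix": None,
    "show_question": True,
}
TOY_FORMATS = {"bullet", "comma"}  # hypothetical set of accepted layouts


def merge_overrides(overrides: dict) -> dict:
    """Validate override keys against the defaults, then merge them in."""
    unknown = sorted(set(overrides) - set(TOY_DEFAULT_STYLE))
    if unknown:
        raise ValueError(f"Unknown prompt-style key(s): {unknown}")
    style = {**TOY_DEFAULT_STYLE, **overrides}
    if style["format"] not in TOY_FORMATS:
        raise ValueError(f"Unrecognized format: {style['format']!r}")
    return style


style = merge_overrides({"connector": "=", "order": "AGEP,SCHL,COW"})
print(style["connector"], style["order"])
```

Failing on unknown keys means a misspelled override is reported immediately instead of silently falling back to the default.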
The task argument must be a TaskMetadata object; resolve a task name with
TaskMetadata.get_task:
from folktexts import TaskMetadata
from folktexts.prompting import PromptConfig

task = TaskMetadata.get_task("ACSIncome")
prompt_config = PromptConfig.from_dict(
    {
        "format": "bullet",
        "connector": "=",
        "order": "AGEP,SCHL,COW",
        "custom_prompt_prefix": "Consider the following person.",
    },
    task=task,
)
from_dict also takes two optional keyword arguments: question= overrides the
task’s default question interface, and add_task_description= (default True)
controls whether the task description appears in the prefix; set it to False to
drop it.
Pass it straight to any classifier:
from folktexts.classifier import VLLMClassifier

# llm and tokenizer are assumed to be loaded already.
clf = VLLMClassifier(
    llm=llm, tokenizer=tokenizer, task="ACSIncome",
    prompt_config=prompt_config,
)
The PROMPT_DEFAULT sentinel
system_prompt (the system-role text) and chat_prompt (the assistant-turn
prefill the model continues from in chat mode) each have three modes: omit the
argument for the built-in default, pass None to remove the role entirely (needed
for Gemma-style templates that reject a system turn), or pass your own string.
The built-in default is spelled with the public sentinel PROMPT_DEFAULT, which
is distinct from None:
from folktexts.prompting import PROMPT_DEFAULT, PromptConfig

# Omitting the argument selects the built-in default (spelled PROMPT_DEFAULT).
PromptConfig.from_dict({}, task=task)                       # default system prompt
PromptConfig.from_dict({}, task=task, system_prompt=None)   # no system role at all
PromptConfig.from_dict({}, task=task, system_prompt="...")  # custom system prompt
These defaults are ClassVars on the QAInterface hierarchy: multiple-choice
questions use the base QAInterface defaults, DirectNumericQA overrides them
with numeric-specific prompts, and ChainOfThoughtQA sets them to None
(free-form generation). The question type therefore supplies the right default,
which is why there is no longer a separate numeric= argument to pass — pick the
mode as described under Question modes above.
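The interplay between the sentinel and the per-question defaults can be sketched with toy classes. Everything here is hypothetical: `TOY_DEFAULT`, the `ToyQA` hierarchy, and the prompt strings are stand-ins, not the library's PROMPT_DEFAULT or QAInterface classes.

```python
from typing import ClassVar, Optional


class _Default:
    """Sentinel type: lets None keep its own meaning ('no system role')."""

    def __repr__(self) -> str:
        return "TOY_DEFAULT"


TOY_DEFAULT = _Default()


class ToyQA:
    default_system_prompt: ClassVar[Optional[str]] = "You answer multiple-choice questions."


class ToyNumericQA(ToyQA):
    default_system_prompt = "You answer with a probability between 0 and 1."


class ToyChainOfThoughtQA(ToyQA):
    default_system_prompt = None  # free-form generation, no default


def resolve_system_prompt(question: type, system_prompt=TOY_DEFAULT) -> Optional[str]:
    """Argument omitted: class default. None: no role. Anything else: custom text."""
    if system_prompt is TOY_DEFAULT:
        return question.default_system_prompt
    return system_prompt


print(resolve_system_prompt(ToyNumericQA))
print(resolve_system_prompt(ToyNumericQA, None))
```

A dedicated sentinel is needed because None is itself a meaningful value here, so it cannot double as "argument not given".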
FewShotConfig
Few-shot prompting is configured with a single frozen dataclass:
from folktexts.prompting import FewShotConfig
from folktexts.benchmark import Benchmark

bench = Benchmark.make_acs_benchmark(
    "ACSIncome", model=llm, tokenizer=tokenizer, data_dir="~/data",
    few_shot_config=FewShotConfig(
        n_shots=4,
        # "random" (default) | "balanced" | per-class counts in label order,
        # e.g. (2, 2) = 2 of class 0 + 2 of class 1
        compose="balanced",
        example_order=(3, 2, 1, 0),  # optional permutation of the example indices
        reuse_examples=True,  # default False (resamples per row); True reuses the same examples
        show_question_in_examples=True,  # default True; set False for answer-only examples
    ),
)
Few-shot prompting cannot be combined with the chat-template path
(use_chat_template=True) — that combination raises ValueError.
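How compose and example_order might interact is sketched below. This is an interpretation, not the library's sampler: `per_class_counts` and `pick_examples` are hypothetical, "balanced" is assumed to mean an equal number of examples per class, and example_order is assumed to index into the list of selected examples.

```python
import random
from typing import List, Optional, Sequence, Tuple, Union


def per_class_counts(
    n_shots: int, compose: Union[str, Sequence[int]], n_classes: int = 2
) -> Optional[Tuple[int, ...]]:
    """Turn a compose setting into per-class counts (None means unconstrained)."""
    if compose == "random":
        return None
    if compose == "balanced":
        if n_shots % n_classes:
            raise ValueError("n_shots must be divisible by the number of classes")
        return (n_shots // n_classes,) * n_classes
    counts = tuple(compose)
    if sum(counts) != n_shots:
        raise ValueError("per-class counts must sum to n_shots")
    return counts


def pick_examples(labels, n_shots, compose="random", example_order=None, seed=0) -> List[int]:
    """Return indices of the selected examples, in presentation order."""
    rng = random.Random(seed)
    counts = per_class_counts(n_shots, compose)
    if counts is None:
        chosen = rng.sample(range(len(labels)), n_shots)
    else:
        chosen = []
        for cls, k in enumerate(counts):  # counts are given in label order
            pool = [i for i, y in enumerate(labels) if y == cls]
            chosen += rng.sample(pool, k)
    if example_order is not None:
        if sorted(example_order) != list(range(n_shots)):
            raise ValueError("example_order must be a permutation of range(n_shots)")
        chosen = [chosen[j] for j in example_order]
    return chosen


labels = [0, 0, 0, 1, 1, 1]
chosen = pick_examples(labels, 4, compose="balanced", example_order=(3, 2, 1, 0))
print([labels[i] for i in chosen])
```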
Migrating from the flat-keyword API
Earlier versions configured prompts through scattered keyword arguments. Those
have been consolidated into PromptConfig / FewShotConfig. Passing a removed
keyword to a constructor or encode_row_prompt* now raises TypeError instead
of being silently ignored. Saved benchmark configs from before the change still
load: BenchmarkConfig.load_from_disk translates the legacy few-shot keys and
ignores any other unknown keys with a warning.
| Old | New |
|---|---|
| … | … |
| … | now an argument to … |
| … | … |
| … | … |
| CLI … | CLI … |
| … | removed; the default is derived from the … |
| … | … |
| … | … |
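The backward-compatible loading behaviour (translate legacy few-shot keys, warn about and ignore other unknown keys) can be sketched as follows. The key names in `LEGACY_FEW_SHOT_KEYS` and `CURRENT_KEYS` are hypothetical placeholders, and `translate_saved_config` is not the body of BenchmarkConfig.load_from_disk.

```python
import warnings

# Hypothetical names: the real legacy and current keys are not listed here.
LEGACY_FEW_SHOT_KEYS = {"few_shot": "n_shots", "reuse_few_shot_examples": "reuse_examples"}
CURRENT_KEYS = {"few_shot_config", "seed"}


def translate_saved_config(saved: dict) -> dict:
    """Map legacy few-shot keys into a nested few-shot config; warn on the rest."""
    config, few_shot = {}, {}
    for key, value in saved.items():
        if key in LEGACY_FEW_SHOT_KEYS:
            few_shot[LEGACY_FEW_SHOT_KEYS[key]] = value
        elif key in CURRENT_KEYS:
            config[key] = value
        else:
            warnings.warn(f"Ignoring unknown config key: {key!r}")
    if few_shot:
        config["few_shot_config"] = few_shot
    return config


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loaded = translate_saved_config({"few_shot": 4, "seed": 42, "obsolete_flag": True})

print(loaded)
print(len(caught), "warning(s)")
```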
The top-level public API (Benchmark, BenchmarkConfig, the classifiers, the
QAInterface subclasses, TaskMetadata, ACSDataset) is unchanged.