Configuring prompts — typed & composable prompt variations

This notebook demonstrates the prompt-configuration feature added in PR #35. Every prompt folktexts builds for a tabular row is composed of three parts:

[PREFIX]  task description                (constant across rows)
[INFO]    serialized feature-value pairs  (row-specific)
[SUFFIX]  question text + answer prefill  (constant)

These are configured through two small, frozen and hashable dataclasses:

``PromptConfig`` — how one row is rendered (value mapping, ordering, the label↔value connector, the final layout, optional custom prefix/suffix, and the system prompt).
``FewShotConfig`` — whether/how in-context examples are prepended.

Because both are hashable, each distinct configuration gets its own results-file name, so runs never silently overwrite one another. The defaults reproduce the original paper’s prompts exactly; you only need the options shown in this notebook if you want to change how prompts are rendered.
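As a rough illustration of why hashability gives each configuration its own file name, the sketch below uses a hypothetical `DemoPromptConfig` (not the real folktexts class) and derives a short digest from its fields. The field names, the default values, the `config_digest` helper, and the file-name pattern are assumptions made for this example only; folktexts computes its own benchmark hash.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DemoPromptConfig:
    """Hypothetical stand-in for a frozen, hashable prompt config."""
    format: str = "textbullet"
    connector: str = "is:"
    granularity: str = "original"  # placeholder default, not the library's


def config_digest(cfg: DemoPromptConfig) -> str:
    """Stable short digest of the config's fields (illustrative only)."""
    payload = json.dumps(asdict(cfg), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:10]


default_cfg = DemoPromptConfig()
comma_cfg = DemoPromptConfig(format="comma", granularity="low")

# Frozen dataclasses are hashable, so equal configs collapse in a set or dict.
assert len({default_cfg, DemoPromptConfig(), comma_cfg}) == 2

for cfg in (default_cfg, comma_cfg):
    print(f"results.bench-{config_digest(cfg)}.json")
```

The digest is taken over the serialized fields rather than Python's built-in `hash()`, because string hashes are randomized per process and would not give stable file names across runs.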

Full reference: `docs/configuring_prompts.md <https://socialfoundations.github.io/folktexts/configuring_prompts.html>`__. The command-line equivalents are the --variation, --few-shot, --numeric-risk-prompting, --cot-prompting, and --use-chat-template flags of run_acs_benchmark (see the README).

0. Setup

We use the vLLM backend (the default for local models since v0.6.0) with a small, fast instruct model. Sections 1–4 only render prompts and run on CPU; the GPU engine is loaded lazily in section 5.

vLLM runtime note (this cluster). vLLM needs the full CUDA toolkit to be loaded and the FP8 warmup to be disabled. Launch the kernel from a shell where you have run:
source /etc/profile.d/modules.sh && module load cuda/13.2
export VLLM_USE_DEEP_GEMM=0
(See CLAUDE.md → “vLLM runtime env”.)

[1]:

import folktexts
print(f"{folktexts.__version__=}")

folktexts.__version__='0.6.0'

[2]:

from pathlib import Path

# Pre-cached snapshot + folktables cache on this cluster (keep these /fast paths).
MODEL_PATH = "/fast/groups/sf/huggingface-models/meta-llama--Llama-3.2-3B-Instruct"
DATA_DIR = Path("/fast/groups/sf/data")          # ACSDataset appends "folktables/"
RESULTS_DIR = Path("results") / "configuring-prompts-example"
RESULTS_DIR.mkdir(parents=True, exist_ok=True)
TASK_NAME = "ACSIncome"

1. Inspecting the feature block — `PromptConfig.from_dict`

The [INFO] block is produced by a pipeline of Vary* stages whose order is fixed by their return types:

VaryValueMap → VaryOrder → VaryConnector → VaryFormat
(granularity)   (order)     (connector)     (format)

You never instantiate those stages yourself — you pass a dict of overrides to PromptConfig.from_dict(...). Valid keys: format, connector, granularity, order, custom_prompt_prefix, custom_prompt_suffix, show_question.
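To make the four-stage pipeline concrete, here is a minimal sketch that chains four plain functions in the same order (value map, order, connector, format). It is not how folktexts implements the Vary* stages: the function names, the `COLUMN_TEXT` and `VALUE_MAPS` tables, and the two-column row are all hypothetical. Each stage consumes the previous stage's return type, which is what pins down the order.

```python
from typing import Callable

Row = dict[str, object]

# Hypothetical column descriptions and value maps for two ACS-like columns.
COLUMN_TEXT = {"AGEP": "age", "SCHL": "highest educational attainment"}
VALUE_MAPS: dict[str, Callable[[object], str]] = {
    "AGEP": lambda v: f"{v} years old",
    "SCHL": lambda v: {21: "Bachelor's degree"}.get(v, str(v)),
}


def map_values(row: Row) -> dict[str, str]:
    """Stage 1 (granularity): raw value -> text."""
    return {col: VALUE_MAPS[col](val) for col, val in row.items()}


def apply_order(values: dict[str, str], first: list[str]) -> list[tuple[str, str]]:
    """Stage 2 (order): listed columns first, the rest in original order."""
    cols = first + [c for c in values if c not in first]
    return [(c, values[c]) for c in cols]


def apply_connector(pairs: list[tuple[str, str]], connector: str) -> list[str]:
    """Stage 3 (connector): join each column label with its value."""
    return [f"{COLUMN_TEXT[c]} {connector} {v}" for c, v in pairs]


def apply_format(lines: list[str]) -> str:
    """Stage 4 (format): lay the lines out as plain bullets."""
    return "\n".join(f"- {line}" for line in lines)


row = {"SCHL": 21, "AGEP": 53}
info_block = apply_format(
    apply_connector(apply_order(map_values(row), first=["AGEP"]), connector="=")
)
print(info_block)
```

In the real library the dict passed to `PromptConfig.from_dict(...)` plays the role of the arguments threaded through these calls.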

Let’s load the task + dataset once and render the same row under several variations.

[3]:

from folktexts import TaskMetadata
from folktexts.acs import ACSDataset

# MC task is the default; we reuse this same cached task object throughout.
task = TaskMetadata.get_task(TASK_NAME)

# Loads ACSIncome from the folktables cache (no download). We grab one row to render;
# the full dataset is reused (subsampled) for the benchmark runs in section 5.
dataset = ACSDataset.make_from_task(task=task, cache_dir=DATA_DIR)
X_sample, _ = dataset.sample_n_train_examples(n=1, reuse_examples=True)
row = X_sample.iloc[0]
print(f"dataset size = {len(dataset.data):,} rows; rendering 1 example row")

Loading ACS data...
dataset size = 1,664,500 rows; rendering 1 example row

[4]:

from folktexts.prompting import PromptConfig, encode_row_prompt

def show(title, pv):
    # Render `row` under a PromptConfig built from variation dict `pv`.
    cfg = PromptConfig.from_dict(pv, task=task)
    print(f"{'='*70}\n{title}\n  variation = {pv or '(defaults)'}\n{'-'*70}")
    print(encode_row_prompt(row, task, prompt_config=cfg))
    print()

show("Default (textbullet, connector 'is:')", {})

======================================================================
Default (textbullet, connector 'is:')
  variation = (defaults)
----------------------------------------------------------------------
The following data corresponds to a survey respondent. The survey was conducted among US residents in 2018. Please answer the question based on the information provided. The data provided is enough to reach an approximate answer.

Information:
- The age is: 53 years old.
- The class of worker is: Owner of non-incorporated business, professional practice, or farm.
- The highest educational attainment is: Bachelor's degree.
- The marital status is: Married.
- The occupation is: Musicians and singers.
- The place of birth is: New York.
- The relationship to the reference person in the survey is: The reference person itself.
- The usual number of hours worked per week is: 20 hours.
- The sex is: Male.
- The race is: White.

Question: What is this person's estimated yearly income?
A. Below $50,000.
B. Above $50,000.
Answer:

[5]:

# Plain bullets, "=" connector, age/education/class-of-worker first.
show("format=bullet, connector='=', order=AGEP,SCHL,COW",
     {"format": "bullet", "connector": "=", "order": "AGEP,SCHL,COW"})

======================================================================
format=bullet, connector='=', order=AGEP,SCHL,COW
  variation = {'format': 'bullet', 'connector': '=', 'order': 'AGEP,SCHL,COW'}
----------------------------------------------------------------------
The following data corresponds to a survey respondent. The survey was conducted among US residents in 2018. Please answer the question based on the information provided. The data provided is enough to reach an approximate answer.

Information:
- age = 53 years old
- highest educational attainment = Bachelor's degree
- class of worker = Owner of non-incorporated business, professional practice, or farm
- marital status = Married
- occupation = Musicians and singers
- place of birth = New York
- relationship to the reference person in the survey = The reference person itself
- usual number of hours worked per week = 20 hours
- sex = Male
- race = White

Question: What is this person's estimated yearly income?
A. Below $50,000.
B. Above $50,000.
Answer:

[6]:

# Coarser ACS feature values (age ranges, grouped occupations), comma-separated.
show("granularity=low, format=comma",
     {"granularity": "low", "format": "comma"})

======================================================================
granularity=low, format=comma
  variation = {'granularity': 'low', 'format': 'comma'}
----------------------------------------------------------------------
The following data corresponds to a survey respondent. The survey was conducted among US residents in 2018. Please answer the question based on the information provided. The data provided is enough to reach an approximate answer.

Information:
age is: 50-59 years old, class of worker is: Self-employed, highest educational attainment is: Bachelor's degree, marital status is: Married, occupation is: Arts, Design, Entertainment, Sports, and Media, place of birth is: Northeast USA, relationship to the reference person in the survey is: Reference person, usual number of hours worked per week is: 20-29 hours, sex is: Male, race is: White

Question: What is this person's estimated yearly income?
A. Below $50,000.
B. Above $50,000.
Answer:

[7]:

# Inject extra text after the task description and at the very end of the prompt
# (the suffix follows the "Answer:" prefill directly, with no separator).
show("custom prefix + suffix",
     {"custom_prompt_prefix": "Consider the following US census respondent.",
      "custom_prompt_suffix": "Answer with a single letter."})

======================================================================
custom prefix + suffix
  variation = {'custom_prompt_prefix': 'Consider the following US census respondent.', 'custom_prompt_suffix': 'Answer with a single letter.'}
----------------------------------------------------------------------
The following data corresponds to a survey respondent. The survey was conducted among US residents in 2018. Please answer the question based on the information provided. The data provided is enough to reach an approximate answer.
Consider the following US census respondent.

Information:
- The age is: 53 years old.
- The class of worker is: Owner of non-incorporated business, professional practice, or farm.
- The highest educational attainment is: Bachelor's degree.
- The marital status is: Married.
- The occupation is: Musicians and singers.
- The place of birth is: New York.
- The relationship to the reference person in the survey is: The reference person itself.
- The usual number of hours worked per week is: 20 hours.
- The sex is: Male.
- The race is: White.

Question: What is this person's estimated yearly income?
A. Below $50,000.
B. Above $50,000.
Answer:Answer with a single letter.

2. Question modes — multiple-choice vs numeric vs chain-of-thought

The question mode is orthogonal to the feature-block variations above. It changes what the model is asked to produce and how the answer is read off. The right system prompt / answer-prefill is supplied automatically by the QAInterface subclass, so there is no separate flag to pass — you just select the mode:

Mode	How to select
Multiple-choice (default)	nothing
Numeric	`task.use_numeric_qa = True` · `BenchmarkConfig(numeric_risk_prompting=True)`
Chain-of-thought	`task.set_question(ChainOfThoughtQA(...))` · `BenchmarkConfig(cot_prompting=True)`

Note: TaskMetadata.get_task returns a cached singleton, so flipping a mode mutates the shared task object — we flip it back to multiple-choice after each demo below.
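One way to make the "flip it, then flip it back" step harder to forget is a small context manager that restores the attribute on exit. This is a generic sketch, not a folktexts utility: `temporary_attr` is a hypothetical helper, and a `SimpleNamespace` stands in for the shared task object so the example runs without the library.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_attr(obj, name, value):
    """Set obj.<name> to value inside the block, then restore the old value."""
    previous = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        # Runs even if the body raises, so the shared object is never left flipped.
        setattr(obj, name, previous)


# Stand-in for the shared task object (hypothetical; not a real TaskMetadata).
shared_task = SimpleNamespace(use_numeric_qa=False)

with temporary_attr(shared_task, "use_numeric_qa", True):
    inside = shared_task.use_numeric_qa

print(inside, shared_task.use_numeric_qa)
```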

[8]:

# Multiple-choice (default) vs numeric: note the different answer prefill at the end.
task.use_numeric_qa = False
mc = encode_row_prompt(row, task, prompt_config=PromptConfig.from_dict({}, task=task))
task.use_numeric_qa = True
num = encode_row_prompt(row, task, prompt_config=PromptConfig.from_dict({}, task=task))
task.use_numeric_qa = False  # reset the shared task

print("=== MULTIPLE-CHOICE (last lines) ===")
print("\n".join(mc.splitlines()[-6:]))
print("\n=== NUMERIC (last lines) ===")
print("\n".join(num.splitlines()[-4:]))

=== MULTIPLE-CHOICE (last lines) ===
- The race is: White.

Question: What is this person's estimated yearly income?
A. Below $50,000.
B. Above $50,000.
Answer:

=== NUMERIC (last lines) ===
- The race is: White.

Question: What is the probability that this person's estimated yearly income is above $50,000 ?
Answer (between 0 and 1): 0.

[9]:

# Chain-of-thought: free-form reasoning ending in a "Probability: X%" line.
# ACS tasks ship MC + numeric questions; the CoT interface is built on demand from
# the numeric question (exactly what BenchmarkConfig(cot_prompting=True) does internally).
from folktexts.qa_interface import ChainOfThoughtQA

base_q = task.direct_numeric_qa
task.set_question(ChainOfThoughtQA(column=base_q.column, text=base_q.text))
cot_prompt = encode_row_prompt(row, task, prompt_config=PromptConfig.from_dict({}, task=task))
task.use_cot_qa = False        # reset the shared task back to multiple-choice
task.use_numeric_qa = False

print("=== CHAIN-OF-THOUGHT (last lines) ===")
print("\n".join(cot_prompt.splitlines()[-8:]))

=== CHAIN-OF-THOUGHT (last lines) ===

Think step-by-step about the factors that could influence the answer to this question. After reasoning through the relevant information, provide your final probability estimate.

Your response MUST end with your probability estimate in the following format:
Probability: X%
where X is a number between 0 and 100.

Reasoning:
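
The mandatory final line, Probability: X%, is what makes a free-form chain-of-thought answer machine-readable. The sketch below shows one way such a line could be read off a model response; the regular expression and the `parse_probability` helper are assumptions for illustration and not necessarily how folktexts parses its outputs.

```python
import re
from typing import Optional

# Matches e.g. "Probability: 72%" or "Probability: 12.5 %".
_PROB_RE = re.compile(r"Probability:\s*(\d+(?:\.\d+)?)\s*%")


def parse_probability(response: str) -> Optional[float]:
    """Return the last 'Probability: X%' in the response as a value in [0, 1]."""
    matches = _PROB_RE.findall(response)
    if not matches:
        return None  # the model did not follow the required format
    value = float(matches[-1]) / 100.0
    return value if 0.0 <= value <= 1.0 else None


print(parse_probability("Works 20 hours a week, self-employed...\nProbability: 35%"))
print(parse_probability("I cannot tell."))
```

Taking the last match rather than the first guards against percentages that the model mentions in the middle of its reasoning.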

3. Few-shot examples — `FewShotConfig`

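FewShotConfig prepends n_shots in-context examples drawn from the dataset’s training split. compose controls class balance ("random", "balanced", or per-class counts), reuse_examples fixes the same examples across rows, and example_order permutes them. Because the demo row was itself sampled from the training split in cell [3], it can reappear among the in-context examples, as it does in the output below.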
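To illustrate what a "balanced" composition means, the following sketch picks the same number of examples per class from a toy labeled pool, using a fixed seed so that every query row sees the same examples. The `pick_balanced` helper and the toy data are hypothetical; the library's own sampling logic may differ in its details.

```python
import random
from collections import defaultdict


def pick_balanced(examples, n_shots, seed=42):
    """Pick n_shots (row, label) pairs with equal counts per label."""
    by_label = defaultdict(list)
    for row, label in examples:
        by_label[label].append((row, label))

    labels = sorted(by_label)
    if n_shots % len(labels) != 0:
        raise ValueError("n_shots must be a multiple of the number of classes")
    per_class = n_shots // len(labels)

    rng = random.Random(seed)  # fixed seed -> same examples for every query row
    picked = []
    for label in labels:
        picked.extend(rng.sample(by_label[label], per_class))
    rng.shuffle(picked)        # avoid always listing one class first
    return picked


# Toy pool: label 1 for ages 40 and above, label 0 otherwise.
train = [({"age": a}, int(a >= 40)) for a in range(20, 60, 5)]
shots = pick_balanced(train, n_shots=2)
print(sorted(label for _, label in shots))
```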

[10]:

from folktexts.prompting import FewShotConfig, encode_row_prompt_few_shot

few_shot_prompt = encode_row_prompt_few_shot(
    row, task, dataset,
    few_shot_config=FewShotConfig(
        n_shots=2,
        compose="balanced",     # one example per class
        reuse_examples=True,
    ),
)
print(few_shot_prompt)

The following data corresponds to different survey respondents. The survey was conducted among US residents in 2018. Please answer each question based on the information provided. The data provided is enough to reach an approximate answer for each person.

Information:
- The age is: 43 years old.
- The class of worker is: Working for a for-profit private company or organization.
- The highest educational attainment is: Some college, 1 or more years, no degree.
- The marital status is: Divorced.
- The occupation is: Sales representatives of services, except advertising, insurance, financial services, and travel.
- The place of birth is: Michigan.
- The relationship to the reference person in the survey is: The reference person itself.
- The usual number of hours worked per week is: 40 hours.
- The sex is: Male.
- The race is: Two or more races.

Question: What is this person's estimated yearly income?
A. Below $50,000.
B. Above $50,000.
Answer: B

Information:
- The age is: 53 years old.
- The class of worker is: Owner of non-incorporated business, professional practice, or farm.
- The highest educational attainment is: Bachelor's degree.
- The marital status is: Married.
- The occupation is: Musicians and singers.
- The place of birth is: New York.
- The relationship to the reference person in the survey is: The reference person itself.
- The usual number of hours worked per week is: 20 hours.
- The sex is: Male.
- The race is: White.

Question: What is this person's estimated yearly income?
A. Below $50,000.
B. Above $50,000.
Answer: A

Information:
- The age is: 53 years old.
- The class of worker is: Owner of non-incorporated business, professional practice, or farm.
- The highest educational attainment is: Bachelor's degree.
- The marital status is: Married.
- The occupation is: Musicians and singers.
- The place of birth is: New York.
- The relationship to the reference person in the survey is: The reference person itself.
- The usual number of hours worked per week is: 20 hours.
- The sex is: Male.
- The race is: White.

Question: What is this person's estimated yearly income?
A. Below $50,000.
B. Above $50,000.
Answer:

4. Chat template + system prompt

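For instruct/chat models, use_chat_template formats the prompt with the tokenizer’s chat template. The system-role text has three modes; the first two spellings below select the same one, either by omission or with the public PROMPT_DEFAULT sentinel:

omit it → the QA type’s built-in default,
PROMPT_DEFAULT → same as omitting (explicit),
None → no system message is passed (needed for templates that reject a system turn),
any string → your own system prompt.

Some templates add their own system header regardless: in the outputs below, the Llama 3.2 template still emits its date preamble when system_prompt=None, only without any instruction text.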
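
The three system-prompt modes rely on a sentinel that is distinct from both None and any string. A minimal sketch of that pattern follows; `DEMO_DEFAULT`, `BUILTIN_SYSTEM_PROMPT`, and `build_messages` are hypothetical names for this example and do not mirror the internals of PROMPT_DEFAULT or PromptBuilder.

```python
class _DefaultSentinel:
    """Hypothetical sentinel type; distinct from None and from any string."""

    def __repr__(self):
        return "DEMO_DEFAULT"


DEMO_DEFAULT = _DefaultSentinel()
BUILTIN_SYSTEM_PROMPT = "You are a helpful assistant."  # placeholder default text


def build_messages(user_prompt, system_prompt=DEMO_DEFAULT):
    """Return chat messages, resolving the three system-prompt modes."""
    if system_prompt is DEMO_DEFAULT:
        system_prompt = BUILTIN_SYSTEM_PROMPT  # omitted or explicit default
    messages = []
    if system_prompt is not None:              # None -> no system turn at all
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


for value in (DEMO_DEFAULT, None, "You are a meticulous social scientist."):
    print([m["role"] for m in build_messages("Question: ...", value)])
```

A plain `None` default could not express both "use the built-in text" and "send no system turn", which is why a separate sentinel object is needed.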

[11]:

from transformers import AutoTokenizer
from folktexts.prompting import PROMPT_DEFAULT, PromptBuilder

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
builder = PromptBuilder(task)

for label, sys_prompt in [
    ("default system prompt", PROMPT_DEFAULT),
    ("no system role (system_prompt=None)", None),
    ("custom system prompt", "You are a meticulous social scientist."),
]:
    cfg = PromptConfig.from_dict({}, task=task, system_prompt=sys_prompt)
    chat = builder.build_chat(row, cfg, tokenizer)
    print(f"{'='*70}\n{label}\n{'-'*70}\n{chat}\n")

======================================================================
default system prompt
----------------------------------------------------------------------
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 09 Jun 2026

You are a helpful assistant. You answer multiple-choice questions based on the information provided. Respond with a single answer choice.<|eot_id|><|start_header_id|>user<|end_header_id|>

The following data corresponds to a survey respondent. The survey was conducted among US residents in 2018. Please answer the question based on the information provided. The data provided is enough to reach an approximate answer.

Information:
- The age is: 53 years old.
- The class of worker is: Owner of non-incorporated business, professional practice, or farm.
- The highest educational attainment is: Bachelor's degree.
- The marital status is: Married.
- The occupation is: Musicians and singers.
- The place of birth is: New York.
- The relationship to the reference person in the survey is: The reference person itself.
- The usual number of hours worked per week is: 20 hours.
- The sex is: Male.
- The race is: White.

Question: What is this person's estimated yearly income?
A. Below $50,000.
B. Above $50,000.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

If had to select one of the options, my answer would be

======================================================================
no system role (system_prompt=None)
----------------------------------------------------------------------
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 09 Jun 2026

<|eot_id|><|start_header_id|>user<|end_header_id|>

The following data corresponds to a survey respondent. The survey was conducted among US residents in 2018. Please answer the question based on the information provided. The data provided is enough to reach an approximate answer.

Information:
- The age is: 53 years old.
- The class of worker is: Owner of non-incorporated business, professional practice, or farm.
- The highest educational attainment is: Bachelor's degree.
- The marital status is: Married.
- The occupation is: Musicians and singers.
- The place of birth is: New York.
- The relationship to the reference person in the survey is: The reference person itself.
- The usual number of hours worked per week is: 20 hours.
- The sex is: Male.
- The race is: White.

Question: What is this person's estimated yearly income?
A. Below $50,000.
B. Above $50,000.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

If had to select one of the options, my answer would be

======================================================================
custom system prompt
----------------------------------------------------------------------
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 09 Jun 2026

You are a meticulous social scientist.<|eot_id|><|start_header_id|>user<|end_header_id|>

The following data corresponds to a survey respondent. The survey was conducted among US residents in 2018. Please answer the question based on the information provided. The data provided is enough to reach an approximate answer.

Information:
- The age is: 53 years old.
- The class of worker is: Owner of non-incorporated business, professional practice, or farm.
- The highest educational attainment is: Bachelor's degree.
- The marital status is: Married.
- The occupation is: Musicians and singers.
- The place of birth is: New York.
- The relationship to the reference person in the survey is: The reference person itself.
- The usual number of hours worked per week is: 20 hours.
- The sex is: Male.
- The race is: White.

Question: What is this person's estimated yearly income?
A. Below $50,000.
B. Above $50,000.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

If had to select one of the options, my answer would be

5. Running benchmarks — comparing variations on the GPU

Now we actually score the model under a few configurations and compare ROC AUC / ECE. We load the vLLM engine once and reuse it (and the dataset) across variations, so the GPU model is loaded a single time.
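
For readers unfamiliar with ECE (expected calibration error), the sketch below computes a standard equal-width binned version for binary risk scores: within each bin it compares the observed positive rate with the mean predicted score, then averages the gaps weighted by bin size. The bin count and binning scheme here are assumptions; the exact variant folktexts reports may differ.

```python
def expected_calibration_error(y_true, y_score, n_bins=10):
    """Equal-width binned ECE: sum over bins of (n_bin / n) * |positive rate - mean score|."""
    if len(y_true) != len(y_score) or len(y_true) == 0:
        raise ValueError("y_true and y_score must be non-empty and equally long")

    bins = [[] for _ in range(n_bins)]
    for label, score in zip(y_true, y_score):
        idx = min(int(score * n_bins), n_bins - 1)  # score == 1.0 goes in the last bin
        bins[idx].append((label, score))

    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        positive_rate = sum(label for label, _ in members) / len(members)
        mean_score = sum(score for _, score in members) / len(members)
        ece += (len(members) / n) * abs(positive_rate - mean_score)
    return ece


labels = [0, 0, 1, 1]
print(expected_calibration_error(labels, [0.0, 0.0, 1.0, 1.0]))  # perfectly calibrated
print(expected_calibration_error(labels, [1.0, 1.0, 1.0, 1.0]))  # over-confident
```

Lower is better: a score of 0 means predicted probabilities match observed frequencies in every bin, independently of how well the model ranks rows (which is what ROC AUC measures).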

The prompt variation is carried by the BenchmarkConfig (prompt_variation, numeric_risk_prompting, …). Because the config is part of the benchmark hash, each variation writes a distinct results.bench-{hash}.json — no collisions.

[12]:

from folktexts.llm_utils import load_vllm_model

# Engine load reserves GPU memory (gpu_memory_utilization defaults to 0.85).
# max_model_len is small here: MC/numeric need only a few generated tokens.
llm, vllm_tokenizer = load_vllm_model(MODEL_PATH, max_model_len=1024)

INFO 06-09 16:42:46 [utils.py:233] non-default args: {'trust_remote_code': True, 'seed': 42, 'max_model_len': 1024, 'gpu_memory_utilization': 0.85, 'max_logprobs': 50, 'logprobs_mode': 'processed_logprobs', 'disable_log_stats': True, 'model': '/fast/groups/sf/huggingface-models/meta-llama--Llama-3.2-3B-Instruct'}
INFO 06-09 16:42:46 [model.py:555] Resolved architecture: LlamaForCausalLM
INFO 06-09 16:42:46 [model.py:1680] Using max model len 1024
INFO 06-09 16:42:46 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 06-09 16:42:46 [vllm.py:840] Asynchronous scheduling is enabled.
INFO 06-09 16:42:46 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(EngineCore pid=1858566) INFO 06-09 16:42:49 [core.py:109] Initializing a V1 LLM engine (v0.20.1) with config: model='/fast/groups/sf/huggingface-models/meta-llama--Llama-3.2-3B-Instruct', speculative_config=None, tokenizer='/fast/groups/sf/huggingface-models/meta-llama--Llama-3.2-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=42, served_model_name=/fast/groups/sf/huggingface-models/meta-llama--Llama-3.2-3B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 
'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [16384], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=1858566) INFO 06-09 16:42:49 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
(EngineCore pid=1858566) WARNING 06-09 16:42:49 [nixl_utils.py:34] NIXL is not available
(EngineCore pid=1858566) WARNING 06-09 16:42:49 [nixl_utils.py:44] NIXL agent config is not available
(EngineCore pid=1858566) INFO 06-09 16:42:50 [parallel_state.py:1402] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://0.0.0.0:44077 backend=nccl
(EngineCore pid=1858566) INFO 06-09 16:42:50 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=1858566) INFO 06-09 16:42:51 [gpu_model_runner.py:4777] Starting to load model /fast/groups/sf/huggingface-models/meta-llama--Llama-3.2-3B-Instruct...
(EngineCore pid=1858566) INFO 06-09 16:42:52 [cuda.py:368] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'FLASH_ATTN', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=1858566) INFO 06-09 16:42:52 [selector.py:136] Using HND KV cache layout for FLASHINFER backend.
(EngineCore pid=1858566) INFO 06-09 16:42:52 [weight_utils.py:904] Filesystem type for checkpoints: LUSTRE. Checkpoint size: 5.98 GiB. Available RAM: 131.86 GiB.
(EngineCore pid=1858566) INFO 06-09 16:42:52 [weight_utils.py:874] Prefetching checkpoint files into page cache started (in background)
(EngineCore pid=1858566) INFO 06-09 16:42:52 [weight_utils.py:851] Prefetching checkpoint files: 10% (1/2)
(EngineCore pid=1858566) INFO 06-09 16:42:52 [weight_utils.py:851] Prefetching checkpoint files: 20% (2/2)
(EngineCore pid=1858566) INFO 06-09 16:42:52 [weight_utils.py:869] Prefetching checkpoint files into page cache finished in 0.22s
(EngineCore pid=1858566) INFO 06-09 16:42:54 [default_loader.py:384] Loading weights took 1.82 seconds
(EngineCore pid=1858566) INFO 06-09 16:42:55 [gpu_model_runner.py:4879] Model loading took 6.02 GiB memory and 2.601799 seconds
(EngineCore pid=1858566) INFO 06-09 16:42:58 [backends.py:1069] Using cache directory: /home/acruz/.cache/vllm/torch_compile_cache/1af2a64564/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=1858566) INFO 06-09 16:42:58 [backends.py:1128] Dynamo bytecode transform time: 2.55 s
(EngineCore pid=1858566) INFO 06-09 16:43:08 [backends.py:290] Directly load the compiled graph(s) for compile range (1, 16384) from the cache, took 10.499 s
(EngineCore pid=1858566) INFO 06-09 16:43:08 [decorators.py:305] Directly load AOT compilation from path /home/acruz/.cache/vllm/torch_compile_cache/torch_aot_compile/90f71e708c0050aa91a746f42d5d689764eb4ed31798b6f47744696f821a44cc/rank_0_0/model
(EngineCore pid=1858566) INFO 06-09 16:43:08 [monitor.py:53] torch.compile took 13.21 s in total
(EngineCore pid=1858566) INFO 06-09 16:43:08 [monitor.py:81] Initial profiling/warmup run took 0.13 s
(EngineCore pid=1858566) INFO 06-09 16:43:09 [utils.py:60] `_KV_CACHE_LAYOUT_OVERRIDE` variable detected. Setting KV cache layout to HND.
(EngineCore pid=1858566) INFO 06-09 16:43:09 [gpu_model_runner.py:5963] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=51 (largest=512)
(EngineCore pid=1858566) WARNING 06-09 16:43:09 [flashinfer.py:405] Using TRTLLM prefill attention (auto-detected).
(EngineCore pid=1858566) INFO 06-09 16:43:10 [gpu_model_runner.py:6042] Estimated CUDA graph memory: 0.68 GiB total
(EngineCore pid=1858566) INFO 06-09 16:43:11 [gpu_worker.py:440] Available KV cache memory: 142.28 GiB
(EngineCore pid=1858566) INFO 06-09 16:43:11 [gpu_worker.py:455] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.8500 is equivalent to --gpu-memory-utilization=0.8462 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.8538. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
(EngineCore pid=1858566) INFO 06-09 16:43:11 [kv_cache_utils.py:1708] GPU KV cache size: 1,332,080 tokens
(EngineCore pid=1858566) INFO 06-09 16:43:11 [kv_cache_utils.py:1709] Maximum concurrency for 1,024 tokens per request: 1300.86x
(EngineCore pid=1858566) INFO 06-09 16:43:11 [kernel_warmup.py:69] Warming up FlashInfer attention.
(EngineCore pid=1858566) INFO 06-09 16:43:13 [gpu_model_runner.py:6133] Graph capturing finished in 3 secs, took 0.31 GiB
(EngineCore pid=1858566) INFO 06-09 16:43:13 [gpu_worker.py:599] CUDA graph pool memory: 0.31 GiB (actual), 0.68 GiB (estimated), difference: 0.36 GiB (115.5%).
(EngineCore pid=1858566) INFO 06-09 16:43:13 [core.py:299] init engine (profile, create kv cache, warmup model) took 18.78 s (compilation: 13.21 s)
(EngineCore pid=1858566) INFO 06-09 16:43:14 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])

[13]:

# Subsample the (already-loaded) dataset for a quick demo run.
dataset.subsample(0.01)
print(f"{dataset.subsampling=}; test rows = {len(dataset.get_test()[0]):,}")

dataset.subsampling=0.01; test rows = 1,665

[14]:

from folktexts.benchmark import Benchmark, BenchmarkConfig

VARIATIONS = [
    ("default (textbullet, MC)", BenchmarkConfig.default_config(
        batch_size=64, context_size=600)),
    ("low granularity, comma (MC)", BenchmarkConfig.default_config(
        prompt_variation={"granularity": "low", "format": "comma"},
        batch_size=64, context_size=600)),
    ("numeric risk prompting", BenchmarkConfig.default_config(
        numeric_risk_prompting=True, batch_size=64, context_size=600)),
]

rows = []
for label, cfg in VARIATIONS:
    bench = Benchmark.make_benchmark(
        task=TASK_NAME, dataset=dataset,
        model=llm, tokenizer=vllm_tokenizer,
        backend="vllm", model_name_or_path=MODEL_PATH,
        config=cfg,
    )
    # fit_threshold fits the 0/1 decision threshold on a few train rows so the
    # `accuracy` column is meaningful (ROC AUC and ECE are threshold-independent).
    res = bench.run(results_root_dir=RESULTS_DIR, fit_threshold=100)
    rows.append({
        "variation": label,
        "roc_auc": res["roc_auc"],
        "ece": res["ece"],
        "accuracy": res["accuracy"],
        "benchmark_hash": res["benchmark_hash"],
    })
    print(f"[done] {label}: AUC={res['roc_auc']:.3f} ECE={res['ece']:.3f}")

[done] default (textbullet, MC): AUC=0.800 ECE=0.394
[done] low granularity, comma (MC): AUC=0.817 ECE=0.420
[done] numeric risk prompting: AUC=0.663 ECE=0.252
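
The fit_threshold=100 argument in the cell above fits the decision threshold on a few training rows. The sketch below shows one simple way such a threshold could be chosen, by trying each observed score as a cut-off and keeping the one with the highest accuracy. It is not folktexts' actual procedure; `fit_accuracy_threshold` and the toy scores are hypothetical.

```python
def fit_accuracy_threshold(y_true, y_score):
    """Return (t, accuracy) for the cut-off t that maximizes accuracy of (score >= t)."""
    candidates = sorted(set(y_score)) + [float("inf")]  # inf -> predict all negative
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        correct = sum(int(s >= t) == y for y, s in zip(y_true, y_score))
        acc = correct / len(y_true)
        if acc > best_acc:  # strict '>' keeps the lowest of several tied thresholds
            best_t, best_acc = t, acc
    return best_t, best_acc


# Scores squeezed into a narrow band above 0.5, as can happen with a miscalibrated model.
train_labels = [0, 0, 0, 1, 1]
train_scores = [0.61, 0.64, 0.66, 0.71, 0.75]
threshold, accuracy = fit_accuracy_threshold(train_labels, train_scores)
print(threshold, accuracy)
```

With these toy scores a fixed 0.5 cut-off would predict the positive class for every row, which is why a fitted threshold can make the accuracy column more informative while leaving ROC AUC and ECE unchanged.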

[15]:

import pandas as pd
comparison = pd.DataFrame(rows).set_index("variation")
comparison

[15]:

variation	roc_auc	ece	accuracy	benchmark_hash
default (textbullet, MC)	0.799535	0.394215	0.691892	2306273009
low granularity, comma (MC)	0.817429	0.419900	0.697898	3849493214
numeric risk prompting	0.663173	0.252028	0.702102	3308184234

[16]:

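# Each variation produced a distinct results file (distinct hash → no overwrite).
# The listing also includes files left in RESULTS_DIR by earlier runs
# (hashes that do not appear in the table above).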
for p in sorted(RESULTS_DIR.glob("**/results.bench-*.json")):
    print(p.relative_to(RESULTS_DIR))

meta-llama--Llama-3.2-3B-Instruct_bench-2173322335/results.bench-2173322335.json
meta-llama--Llama-3.2-3B-Instruct_bench-2306273009/results.bench-2306273009.json
meta-llama--Llama-3.2-3B-Instruct_bench-265318770/results.bench-265318770.json
meta-llama--Llama-3.2-3B-Instruct_bench-3308184234/results.bench-3308184234.json
meta-llama--Llama-3.2-3B-Instruct_bench-3369999377/results.bench-3369999377.json
meta-llama--Llama-3.2-3B-Instruct_bench-3849493214/results.bench-3849493214.json

Summary

``PromptConfig.from_dict({…}, task=task)`` controls how a row is rendered, through the keys format, connector, granularity, order, custom_prompt_prefix, custom_prompt_suffix, and show_question.
Question mode (multiple-choice / numeric / chain-of-thought) is selected via the task (use_numeric_qa, use_cot_qa) or BenchmarkConfig (numeric_risk_prompting, cot_prompting) — the right system prompt / prefill follows automatically.
``FewShotConfig`` adds in-context examples; ``use_chat_template`` + ``system_prompt`` drive the chat path (PROMPT_DEFAULT vs None vs a custom string).
Configs are hashable, so each variation gets its own results file.

CLI equivalents (run_acs_benchmark):

run_acs_benchmark --model "$MODEL" --task ACSIncome --results-dir results \
    --variation format=bullet connector== order=AGEP,SCHL,COW  # section 1 ('connector==' sets the connector to '=')
run_acs_benchmark ... --numeric-risk-prompting                          # section 2 (numeric)
run_acs_benchmark ... --cot-prompting                                   # section 2 (CoT)
run_acs_benchmark ... --few-shot 2 --compose-few-shot-examples balanced # section 3
run_acs_benchmark ... --use-chat-template --system-prompt "..."         # section 4
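
The connector== token above reads more naturally once you see key=value tokens split on the first "=" only. The sketch below is a hypothetical parser written for this explanation, not the actual argument handling in run_acs_benchmark; it turns the section 1 tokens into the same dict that was passed to PromptConfig.from_dict.

```python
def parse_variation(tokens):
    """Turn ['format=bullet', 'connector==', ...] into a dict, splitting on the first '='."""
    variation = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        variation[key] = value  # everything after the first '=' is the value
    return variation


print(parse_variation(["format=bullet", "connector==", "order=AGEP,SCHL,COW"]))
```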

See `docs/configuring_prompts.md <https://socialfoundations.github.io/folktexts/configuring_prompts.html>`__ for the full reference and a migration note from the older flat-keyword API.