Updates
Release notes summarising user-visible changes between versions. Older changes not yet listed here can be reconstructed from the git log.
v0.5.0 — rename reasoning_* → cot_* / chain_of_thought_*
Two distinct concepts shared the word “reasoning” in the public API: the
free-form chain-of-thought (CoT) prompt template and the HF chat-template
enable_thinking kwarg. The dependency runs one way only:
enable_thinking=True requires the CoT prompt path, but the CoT prompt
path runs on any model, including those without thinking-mode support.
To remove the ambiguity, we renamed the CoT side. enable_thinking is
unchanged because that name matches the HF kwarg and stays aligned with
it.
Renames (hard cut, no aliases)
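| Before | After |
|---|---|
| ReasoningQA | ChainOfThoughtQA |
| reasoning_prompting | cot_prompting |
| CLI flag --reasoning-prompting | --cot-prompting |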
The previous symbols raise AttributeError / TypeError instead of warning
and forwarding to the new names, so update all call sites in one commit.
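To illustrate what a hard cut looks like from the caller's side, the sketch below uses a stand-in dataclass. DemoConfig is hypothetical and is not the real BenchmarkConfig; it only carries the post-rename field names. Passing the old keyword fails with TypeError, and reading the old attribute fails with AttributeError.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoConfig:
    """Hypothetical stand-in for a config after the rename (not folktexts code)."""
    cot_prompting: bool = False
    enable_thinking: bool = False


def try_old_kwarg() -> str:
    """Return the name of the exception raised by the pre-rename keyword."""
    try:
        DemoConfig(reasoning_prompting=True)  # old name, no alias exists
    except TypeError as err:
        return type(err).__name__
    return "no error"


def try_old_attribute() -> str:
    """Return the name of the exception raised by the pre-rename attribute."""
    try:
        DemoConfig(cot_prompting=True).reasoning_prompting  # old name
    except AttributeError as err:
        return type(err).__name__
    return "no error"


print(try_old_kwarg(), try_old_attribute())
```

Because nothing forwards the old names, a project-wide search for reasoning_ before upgrading is the simplest way to find every call site.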
Migration
# Before
from folktexts.benchmark import BenchmarkConfig
from folktexts.qa_interface import ReasoningQA
config = BenchmarkConfig(reasoning_prompting=True, enable_thinking=True)
# After
from folktexts.benchmark import BenchmarkConfig
from folktexts.qa_interface import ChainOfThoughtQA
config = BenchmarkConfig(cot_prompting=True, enable_thinking=True)
# Before
run_acs_benchmark --model <m> --task ACSIncome --data-dir <d> --reasoning-prompting
# After
run_acs_benchmark --model <m> --task ACSIncome --data-dir <d> --cot-prompting
Notes
- enable_thinking is unchanged (dataclass field, CLI flag, and class attribute on ChainOfThoughtQA). It still requires cot_prompting=True and warns + auto-enables CoT mode if you forget.
- The ChainOfThoughtQA.max_new_tokens=8000 value is preserved.
- Result JSON files (results.bench-*.json) written before the rename carry "config": {"reasoning_prompting": true, ...}. They remain readable: sweep helpers (scripts/cot_e2e_sweep.py, cot_sweep.py, audit_cot_failures.py, extended_sweep.py, multi_seed_stability.py, validate_pr26.py) accept either key when scanning existing results.
- Hash stability: not preserved. BenchmarkConfig.__hash__ uses dataclasses.asdict(self), so the hash includes the field name. New runs write to fresh results.bench-{hash}.json paths; pre-rename cached paths stay readable but won’t be short-circuited by a hash match.
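The two result-file notes above can be sketched as follows. This is an illustration under stated assumptions, not the code in the sweep helpers: is_cot_run, config_hash, OldConfig and NewConfig are hypothetical names, and the digest scheme (SHA-256 over sorted JSON) is assumed only to show why a hash built from dataclasses.asdict changes when a field is renamed.

```python
import dataclasses
import hashlib
import json


def is_cot_run(config: dict) -> bool:
    """Accept either the new key or the pre-rename key from a result file."""
    return bool(config.get("cot_prompting", config.get("reasoning_prompting", False)))


def config_hash(config_obj) -> str:
    """Illustrative digest over the dataclass dict; field names feed the hash."""
    payload = json.dumps(dataclasses.asdict(config_obj), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


@dataclasses.dataclass(frozen=True)
class OldConfig:
    """Hypothetical pre-rename shape."""
    reasoning_prompting: bool = True


@dataclasses.dataclass(frozen=True)
class NewConfig:
    """Hypothetical post-rename shape."""
    cot_prompting: bool = True


# Same value, different field name: the two digests differ.
print(config_hash(OldConfig()) == config_hash(NewConfig()))
print(is_cot_run({"reasoning_prompting": True}), is_cot_run({"cot_prompting": True}))
```

The key lookup prefers the new name and only falls back to the old one, so a file that somehow carries both keys is interpreted by its post-rename value.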
v0.4.0 — vLLM backend
folktexts v0.4.0 introduces local inference via vLLM alongside the existing
HuggingFace transformers backend, typically delivering a 5–30× throughput
improvement on GPU benchmarks while preserving the full score-extraction
contract (multiple-choice, direct-numeric, and chain-of-thought prompting).
What’s new
- VLLMClassifier: a new top-K-logprobs classifier in folktexts.classifier.vllm_classifier, parallel to TransformersLLMClassifier. Both feed the same QA decoders, so result semantics are unchanged.
- load_vllm_model in folktexts.llm_utils: helper that initialises a vLLM LLM engine + tokenizer with sensible defaults for this benchmark (BF16, gpu_memory_utilization=0.85, logprobs_mode="processed_logprobs").
- CLI flag --inference-backend {transformers,vllm}: selects the local backend. Default is now vllm. Pass --inference-backend transformers to fall back to the previous path; the transformers code is unchanged and remains a fully supported alternative.
- vLLM-specific CLI flags: --gpu-memory-utilization, --max-model-len, --vllm-dtype, --tensor-parallel-size. The CLI auto-derives a max_model_len from --context-size + ChainOfThoughtQA.max_new_tokens + 256 when the user does not pass --max-model-len explicitly.
- Optional install group [vllm]: pip install folktexts[vllm] pulls in the vLLM wheel. The base install is unchanged for users on the transformers path.
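The max_model_len derivation described above is plain arithmetic, and a worked example helps when sizing GPU memory. The sketch below is not the CLI's code: derive_max_model_len and the constant names are hypothetical, and the 8000-token value is the ChainOfThoughtQA.max_new_tokens figure quoted in the v0.5.0 notes.

```python
from typing import Optional

COT_MAX_NEW_TOKENS = 8000  # ChainOfThoughtQA.max_new_tokens, as stated in these notes
EXTRA_TOKENS = 256         # the fixed +256 term from the formula above


def derive_max_model_len(context_size: int, explicit: Optional[int] = None) -> int:
    """Return the explicit --max-model-len if given, else the derived value."""
    if explicit is not None:
        return explicit
    return context_size + COT_MAX_NEW_TOKENS + EXTRA_TOKENS


# With --context-size 1000 and no explicit override: 1000 + 8000 + 256
print(derive_max_model_len(1000))        # 9256
print(derive_max_model_len(1000, 4096))  # 4096 (explicit value wins)
```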
Architecture
- Two classifiers, one decoder. VLLMClassifier, TransformersLLMClassifier, and WebAPILLMClassifier all hand answers to the QA-decoder methods on MultipleChoiceQA, DirectNumericQA, and ChainOfThoughtQA. The new helper decode_topk_logprobs_to_risk_estimate in folktexts.llm_utils factors out the top-K decoding logic shared by vLLM and the WebAPI; the transformers path (which has full-vocab logits) bypasses this helper, as before.
- Backend dispatch. Benchmark.make_*_benchmark(...) accepts a backend= argument ("transformers", "vllm", "webapi", or None for autodetect). When None, autodetect uses str → webapi, duck-typed LLM-shaped → vllm, else transformers.
- VLLMClassifier.__hash__ includes a "vllm" tag so cached result paths (results.bench-{hash}.json) cannot collide with transformers runs of the same model. Predictions can drift by ~1e-3 across backends due to attention-kernel differences; mixing them in one CSV would be a silent mistake.
- Numeric mode uses vLLM’s allowed_token_ids to restrict generation to digit tokens (mirroring the transformers digits_only=True mask). Multiple-choice mode runs unmasked; the QA decoder’s prefix-variant matching handles renormalisation across answer letters.
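As a rough picture of what top-K decoding with prefix-variant matching and renormalisation involves, consider the sketch below. It is not the signature or body of decode_topk_logprobs_to_risk_estimate; topk_to_risk and prefix_variants are hypothetical, the variant set (bare and space-prefixed token) is an assumption, and so is the 0.5 fallback when no answer token appears in the top-K.

```python
import math
from typing import Dict, Iterable, Tuple


def prefix_variants(answer: str) -> Tuple[str, str]:
    """Token spellings counted as the same answer (assumed: bare and space-prefixed)."""
    return (answer, " " + answer)


def answer_mass(topk_logprobs: Dict[str, float], answer: str) -> float:
    """Sum the probability of every variant of `answer` present in the top-K."""
    return sum(
        math.exp(topk_logprobs[token])
        for token in prefix_variants(answer)
        if token in topk_logprobs
    )


def topk_to_risk(
    topk_logprobs: Dict[str, float], positive: str, negatives: Iterable[str]
) -> float:
    """Renormalise the positive answer's mass over all answer letters."""
    p_pos = answer_mass(topk_logprobs, positive)
    p_neg = sum(answer_mass(topk_logprobs, neg) for neg in negatives)
    total = p_pos + p_neg
    if total == 0.0:
        return 0.5  # assumed fallback: no answer token made it into the top-K
    return p_pos / total


# "A" and " A" together hold 0.4 of the mass, "B" holds 0.1, the rest is ignored.
example = {"A": math.log(0.3), " A": math.log(0.1), "B": math.log(0.1), "the": math.log(0.5)}
print(round(topk_to_risk(example, "A", ["B"]), 6))
```

Renormalising over answer tokens only is what lets multiple-choice mode run unmasked: probability spent on unrelated tokens drops out of the ratio.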
Cluster runtime requirements (B200 / Hopper / vllm 0.20.1 wheel)
The vLLM 0.20.1 wheel is built against CUDA 13. On clusters where the default toolkit is older, two environment steps are required for any vLLM invocation:
source /etc/profile.d/modules.sh
module load cuda/13.2 # provides libcudart.so.13
export VLLM_USE_DEEP_GEMM=0 # skips an FP8 warmup that needs deep_gemm
# (not on PyPI); harmless on BF16 models
Without these, import vllm._C and engine init both crash on Hopper+
GPUs.
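A small preflight check can make this failure mode easier to diagnose before the engine starts. The sketch below is an assumption-laden illustration: the function names are hypothetical, it assumes the runtime library is named libcudart.so.13 as in the comment above, and it only reports whether the loader can find that library. It does not replace module load, which has to run in the shell before Python starts.

```python
import ctypes
import os


def cuda13_runtime_available(lib_name: str = "libcudart.so.13") -> bool:
    """Return True if the dynamic loader can open the CUDA 13 runtime."""
    try:
        ctypes.CDLL(lib_name)
    except OSError:
        return False
    return True


def prepare_vllm_env() -> bool:
    """Set the DeepGEMM opt-out if unset, then report runtime availability.

    Call this before importing vllm so the variable is already in place.
    """
    os.environ.setdefault("VLLM_USE_DEEP_GEMM", "0")
    return cuda13_runtime_available()


if not prepare_vllm_env():
    print("libcudart.so.13 not found: run 'module load cuda/13.2' before starting Python")
```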
Validation
The migration was validated across 38 cross-backend cells covering the
paper’s Table 1 (8 models × 2-4 modes), a modern + thinking-model sweep
(gemma-3-1b-it, Qwen3-1.7B, Qwen3-4B, Qwen3-4B-Instruct-2507,
Qwen3-4B-Thinking-2507), and a chat-template extension on
Mistral-7B-Instruct-v0.2 and Yi-34B-Chat. Multi-seed stability was
verified across 4 seeds × 2 backends on Llama-3-8B-Instruct and
Qwen3-4B-Thinking-2507.
36/38 cells fall within the strict gates |ΔAUC| ≤ 0.015 and
|ΔECE| ≤ 0.025. The two remaining outliers are characterised:
- Llama-3-8B base × numeric (zero-shot): vLLM +0.017 AUC, −0.041 ECE; vLLM is slightly better. The model is essentially near-random on this prompt (TF AUC 0.559); the delta is within the kernel-noise band of a near-random model.
- Qwen3-1.7B × chat-MCQ: vLLM +0.190 AUC, +0.265 ECE; vLLM is much better. The transformers path collapses to 3 unique scores on this combination; vLLM produces 425 unique scores with broad spread. The bug is on the transformers side and does not reproduce on Qwen3-4B / Qwen3-4B-Instruct / Qwen3-4B-Thinking-2507.
Phase 7 robustness checks (1-row DataFrame, sequential model swap in the same Python process, near- and over-cap inputs, tied-logit cross-backend agreement, and OOM clean failure) all pass.
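The strict gates used in the validation above are easy to apply to your own cross-backend comparisons. The sketch below is a minimal check, not the validation script itself: within_gates is a hypothetical name, and the two example rows are the outlier deltas quoted in this section.

```python
AUC_GATE = 0.015  # |ΔAUC| threshold from the validation text
ECE_GATE = 0.025  # |ΔECE| threshold from the validation text


def within_gates(delta_auc: float, delta_ece: float) -> bool:
    """True when both absolute deltas are inside the strict gates."""
    return abs(delta_auc) <= AUC_GATE and abs(delta_ece) <= ECE_GATE


# Deltas as reported above for the two outlier cells.
outliers = {
    "Llama-3-8B base x numeric": (0.017, -0.041),
    "Qwen3-1.7B x chat-MCQ": (0.190, 0.265),
}

for cell, (d_auc, d_ece) in outliers.items():
    status = "pass" if within_gates(d_auc, d_ece) else "outside gates"
    print(f"{cell}: dAUC={d_auc:+.3f} dECE={d_ece:+.3f} -> {status}")
```

Both gates must hold for a cell to pass, which is why the Llama-3-8B cell counts as an outlier even though its AUC delta exceeds the gate by only 0.002.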
Backwards compatibility
- The CLI accepts the same flags as before plus the new --inference-backend / --gpu-memory-utilization / --max-model-len / --vllm-dtype / --tensor-parallel-size. All new flags have safe defaults; existing scripts work unchanged on the vLLM backend, or on the transformers backend with --inference-backend transformers.
- Benchmark.make_*_benchmark(...) accepts an optional backend= kwarg. Existing callers that pass model= as a HuggingFace PreTrainedModel continue to be routed to TransformersLLMClassifier.
- Result CSVs from previous runs (transformers) are not invalidated; the new vLLM hash tag means vLLM runs save to a fresh path rather than overwriting transformers numbers.
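The routing rule behind the optional backend= kwarg (an explicit value wins; otherwise str → webapi, LLM-shaped object → vllm, anything else → transformers) can be sketched as below. This is not the dispatch code in Benchmark: resolve_backend and looks_like_vllm_engine are hypothetical, and the duck-typing test (a callable generate plus an llm_engine attribute) is an assumption about what "LLM-shaped" means.

```python
from typing import Optional

VALID_BACKENDS = ("transformers", "vllm", "webapi")


def looks_like_vllm_engine(model: object) -> bool:
    """Assumed duck-typing test: a generate() method plus an llm_engine attribute."""
    return callable(getattr(model, "generate", None)) and hasattr(model, "llm_engine")


def resolve_backend(model: object, backend: Optional[str] = None) -> str:
    """Pick a backend name, honouring an explicit choice over autodetection."""
    if backend is not None:
        if backend not in VALID_BACKENDS:
            raise ValueError(f"unknown backend {backend!r}; expected one of {VALID_BACKENDS}")
        return backend
    if isinstance(model, str):
        return "webapi"  # a model given by name is served through the web API
    if looks_like_vllm_engine(model):
        return "vllm"
    return "transformers"  # e.g. a HuggingFace PreTrainedModel instance


class FakeVLLM:
    """Minimal object with the attributes the duck-typing test looks for."""
    llm_engine = object()

    def generate(self, prompts):
        return []


class FakeHFModel:
    """Has generate() but no llm_engine, so it routes to transformers."""

    def generate(self, **kwargs):
        return []


print(resolve_backend("some-hosted-model"), resolve_backend(FakeVLLM()), resolve_backend(FakeHFModel()))
```

Checking for more than generate() matters here because HuggingFace models expose generate() as well; a test on that method alone would misroute them.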
Migration notes
If you previously installed folktexts and want to use the new vLLM
backend:
pip install --upgrade 'folktexts[vllm]'
# or, from a checkout:
pip install -e '.[vllm]'
Then either accept the new default (vllm) or stay on transformers
explicitly:
run_acs_benchmark --model <path> --task ACSIncome --data-dir <path> \
--inference-backend transformers