# Updates

Release notes summarising user-visible changes between versions. Older changes
not yet listed here can be reconstructed from the git log.

## v0.5.0 — rename `reasoning_*` → `cot_*` / `chain_of_thought_*`

Two distinct concepts shared the word "reasoning" in the public API: the
free-form chain-of-thought prompt template and the HF chat-template
`enable_thinking` kwarg. They're orthogonal — `enable_thinking=True`
requires the CoT prompt path, but the CoT prompt path runs on any model
with no thinking-mode support — so we renamed the CoT side to drop the
ambiguity. `enable_thinking` is unchanged because that name matches the HF
kwarg and stays aligned with it.

### Renames (hard cut, no aliases)

| Before                                | After                                  |
|---------------------------------------|----------------------------------------|
| `ReasoningQA` class                   | `ChainOfThoughtQA`                     |
| `BenchmarkConfig.reasoning_prompting` | `BenchmarkConfig.cot_prompting`        |
| CLI flag `--reasoning-prompting`      | `--cot-prompting`                      |
| `TaskMetadata.reasoning_qa` / `use_reasoning_qa` | `TaskMetadata.cot_qa` / `use_cot_qa` |

The previous symbols raise `AttributeError` / `TypeError` rather than
warning + forwarding — update callsites in one commit.

### Migration

```py
# Before
from folktexts.qa_interface import ReasoningQA
config = BenchmarkConfig(reasoning_prompting=True, enable_thinking=True)

# After
from folktexts.qa_interface import ChainOfThoughtQA
config = BenchmarkConfig(cot_prompting=True, enable_thinking=True)
```

```bash
# Before
run_acs_benchmark --model <m> --task ACSIncome --data-dir <d> --reasoning-prompting
# After
run_acs_benchmark --model <m> --task ACSIncome --data-dir <d> --cot-prompting
```

### Notes

- **`enable_thinking` is unchanged** (dataclass field, CLI flag, and class
  attribute on `ChainOfThoughtQA`). It still requires `cot_prompting=True`
  and warns + auto-enables CoT mode if you forget.
- **`ChainOfThoughtQA.max_new_tokens=8000`** value is preserved.
- **Result JSON files** (`results.bench-*.json`) written before the rename
  carry `"config": {"reasoning_prompting": true, ...}`. They remain
  readable: sweep helpers (`scripts/cot_e2e_sweep.py`, `cot_sweep.py`,
  `audit_cot_failures.py`, `extended_sweep.py`, `multi_seed_stability.py`,
  `validate_pr26.py`) accept either key when scanning existing results.
- **Hash stability**: not preserved. `BenchmarkConfig.__hash__` uses
  `dataclasses.asdict(self)`, so the hash includes the field name. New
  runs write to fresh `results.bench-{hash}.json` paths; pre-rename
  cached paths stay readable but won't be short-circuited by a hash
  match.

## v0.4.0 — vLLM backend

`folktexts` v0.4.0 introduces local inference via [vLLM] alongside the existing
HuggingFace `transformers` backend, typically delivering a 5–30× throughput
improvement on GPU benchmarks while preserving the full score-extraction
contract (multiple-choice, direct-numeric, and chain-of-thought prompting).

[vLLM]: https://docs.vllm.ai/

### What's new

- **`VLLMClassifier`**: a new top-K-logprobs classifier in
  `folktexts.classifier.vllm_classifier`, parallel to
  `TransformersLLMClassifier`. Both feed the same QA decoders, so result
  semantics are unchanged.
- **`load_vllm_model`** in `folktexts.llm_utils`: helper that initialises a
  vLLM `LLM` engine + tokenizer with sensible defaults for this benchmark
  (BF16, `gpu_memory_utilization=0.85`, `logprobs_mode="processed_logprobs"`).
- **CLI flag `--inference-backend {transformers,vllm}`**: selects the local
  backend. **Default is now `vllm`.** Pass `--inference-backend transformers`
  to fall back to the previous path; the transformers code is unchanged and
  remains a fully supported alternative.
- **vLLM-specific CLI flags**: `--gpu-memory-utilization`, `--max-model-len`,
  `--vllm-dtype`, `--tensor-parallel-size`. The CLI auto-derives a
  `max_model_len` from `--context-size + ChainOfThoughtQA.max_new_tokens + 256`
  when the user does not pass `--max-model-len` explicitly.
- **Optional install group `[vllm]`**: `pip install folktexts[vllm]` pulls in
  the vLLM wheel. The base install is unchanged for users on the transformers
  path.

### Architecture

- **Two classifiers, one decoder.** `VLLMClassifier`,
  `TransformersLLMClassifier`, and `WebAPILLMClassifier` all hand answers to
  the QA-decoder methods on `MultipleChoiceQA`, `DirectNumericQA`, and
  `ChainOfThoughtQA`. The new helper `decode_topk_logprobs_to_risk_estimate` in
  `folktexts.llm_utils` factors out the top-K decoding logic shared by vLLM
  and the WebAPI; the transformers path (which has full-vocab logits)
  bypasses this helper, as before.
- **Backend dispatch.** `Benchmark.make_*_benchmark(...)` accepts a
  `backend=` argument (`"transformers"`, `"vllm"`, `"webapi"`, or
  `None` for autodetect). When `None`, autodetect uses
  `str → webapi`, duck-typed `LLM-shaped → vllm`, else `transformers`.
- **`VLLMClassifier.__hash__`** includes a `"vllm"` tag so cached result
  paths (`results.bench-{hash}.json`) cannot collide with transformers runs
  of the same model. Predictions can drift by ~1e-3 across backends due to
  attention-kernel differences; mixing them in one CSV would be a silent
  mistake.
- **Numeric mode** uses vLLM's `allowed_token_ids` to restrict generation to
  digit tokens (mirroring the transformers `digits_only=True` mask).
  Multiple-choice mode runs unmasked; the QA decoder's prefix-variant
  matching handles renormalisation across answer letters.

### Cluster runtime requirements (B200 / Hopper / vllm 0.20.1 wheel)

The vLLM 0.20.1 wheel is built against CUDA 13. On clusters where the
default toolkit is older, two environment steps are required for any
vLLM invocation:

```bash
source /etc/profile.d/modules.sh
module load cuda/13.2          # provides libcudart.so.13
export VLLM_USE_DEEP_GEMM=0    # skips an FP8 warmup that needs deep_gemm
                               # (not on PyPI); harmless on BF16 models
```

Without these, `import vllm._C` and engine init both crash on Hopper+
GPUs.

### Validation

The migration was validated across 38 cross-backend cells covering the
paper's Table 1 (8 models × 2-4 modes), a modern + thinking-model sweep
(`gemma-3-1b-it`, `Qwen3-1.7B`, `Qwen3-4B`, `Qwen3-4B-Instruct-2507`,
`Qwen3-4B-Thinking-2507`), and a chat-template extension on
`Mistral-7B-Instruct-v0.2` and `Yi-34B-Chat`. Multi-seed stability was
verified across 4 seeds × 2 backends on Llama-3-8B-Instruct and
Qwen3-Thinking-2507.

**36/38 cells fall within the strict gates** `|ΔAUC| ≤ 0.015` and
`|ΔECE| ≤ 0.025`. The two remaining outliers are characterised:

- `Llama-3-8B` base × `numeric` (zero-shot): vLLM `+0.017` AUC, `−0.041`
  ECE — vLLM is slightly *better*. The model is essentially near-random
  on this prompt (TF AUC 0.559); the delta is within the kernel-noise
  band of a near-random model.
- `Qwen3-1.7B` × `chat-MCQ`: vLLM `+0.190` AUC, `+0.265` ECE — vLLM
  is *much* better. The transformers path collapses to 3 unique scores
  on this combination; vLLM produces 425 unique scores with broad
  spread. The bug is on the transformers side and does not reproduce on
  Qwen3-4B / Qwen3-4B-Instruct / Qwen3-4B-Thinking-2507.

Phase 7 robustness checks (1-row DataFrame, sequential model swap in
the same Python process, near- and over-cap inputs, tied-logit
cross-backend agreement, and OOM clean failure) all pass.

### Backwards compatibility

- The CLI accepts the same flags as before plus the new
  `--inference-backend` / `--gpu-memory-utilization` /
  `--max-model-len` / `--vllm-dtype` / `--tensor-parallel-size`. All new
  flags have safe defaults; existing scripts work unchanged on the
  vLLM backend, or on the transformers backend with
  `--inference-backend transformers`.
- `Benchmark.make_*_benchmark(...)` accepts an optional `backend=` kwarg.
  Existing callers that pass `model=` as a HuggingFace `PreTrainedModel`
  continue to be routed to `TransformersLLMClassifier`.
- Result CSVs from previous runs (transformers) are not invalidated; the
  new vLLM hash tag means vLLM runs save to a fresh path rather than
  overwriting transformers numbers.

### Migration notes

If you previously installed `folktexts` and want to use the new vLLM
backend:

```bash
pip install --upgrade 'folktexts[vllm]'
# or, from a checkout:
pip install -e .[vllm]
```

Then either accept the new default (`vllm`) or stay on transformers
explicitly:

```bash
run_acs_benchmark --model <path> --task ACSIncome --data-dir <path> \
    --inference-backend transformers
```