{ "cells": [ { "cell_type": "markdown", "id": "e6fb9997", "metadata": {}, "source": [ "# Configuring prompts — typed & composable prompt variations\n", "\n", "This notebook demonstrates the prompt-configuration feature added in\n", "[PR #35](https://github.com/socialfoundations/folktexts/pull/35). Every prompt\n", "`folktexts` builds for a tabular row is composed of three parts:\n", "\n", "```\n", "[PREFIX] task description (constant across rows)\n", "[INFO] serialized feature-value pairs (row-specific)\n", "[SUFFIX] question text + answer prefill (constant)\n", "```\n", "\n", "These are configured through two small, **frozen and hashable** dataclasses:\n", "\n", "- **`PromptConfig`** — how one row is rendered (value mapping, ordering, the\n", " label↔value connector, the final layout, optional custom prefix/suffix, and the\n", " system prompt).\n", "- **`FewShotConfig`** — whether/how in-context examples are prepended.\n", "\n", "Because both are hashable, **each distinct configuration gets its own results-file\n", "name** — runs never silently overwrite one another. The defaults reproduce the\n", "original paper's prompts exactly; you only need this notebook to *change* how\n", "prompts are rendered.\n", "\n", "> Full reference: [`docs/configuring_prompts.md`](https://socialfoundations.github.io/folktexts/configuring_prompts.html).\n", "> The command-line equivalents are the `--variation`, `--few-shot`,\n", "> `--numeric-risk-prompting`, `--cot-prompting`, and `--use-chat-template` flags of\n", "> `run_acs_benchmark` (see the README)." ] }, { "cell_type": "markdown", "id": "d2f980c9", "metadata": {}, "source": [ "## 0. Setup\n", "\n", "We use the **vLLM backend** (the default for local models since v0.6.0) with a small,\n", "fast instruct model. Sections 1–4 only *render* prompts and run on CPU; the GPU engine\n", "is loaded lazily in section 5.\n", "\n", "> **vLLM runtime note (this cluster).** vLLM needs the full CUDA toolkit and the\n", "> FP8 warmup disabled. 
Launch the kernel from a shell where you have run:\n", "> ```bash\n", "> source /etc/profile.d/modules.sh && module load cuda/13.2\n", "> export VLLM_USE_DEEP_GEMM=0\n", "> ```\n", "> (See `CLAUDE.md` → \"vLLM runtime env\".)" ] }, { "cell_type": "code", "execution_count": 1, "id": "ae58b753", "metadata": { "execution": { "iopub.execute_input": "2026-06-09T14:41:58.673448Z", "iopub.status.busy": "2026-06-09T14:41:58.673360Z", "iopub.status.idle": "2026-06-09T14:42:05.583879Z", "shell.execute_reply": "2026-06-09T14:42:05.583388Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "folktexts.__version__='0.6.0'\n" ] } ], "source": [ "import folktexts\n", "print(f\"{folktexts.__version__=}\")" ] }, { "cell_type": "code", "execution_count": 2, "id": "e4e5e67c", "metadata": { "execution": { "iopub.execute_input": "2026-06-09T14:42:05.585540Z", "iopub.status.busy": "2026-06-09T14:42:05.585263Z", "iopub.status.idle": "2026-06-09T14:42:05.587920Z", "shell.execute_reply": "2026-06-09T14:42:05.587588Z" } }, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "# Pre-cached snapshot + folktables cache on this cluster (keep these /fast paths).\n", "MODEL_PATH = \"/fast/groups/sf/huggingface-models/meta-llama--Llama-3.2-3B-Instruct\"\n", "DATA_DIR = Path(\"/fast/groups/sf/data\") # ACSDataset appends \"folktables/\"\n", "RESULTS_DIR = Path(\"results\") / \"configuring-prompts-example\"\n", "RESULTS_DIR.mkdir(parents=True, exist_ok=True)\n", "TASK_NAME = \"ACSIncome\"" ] }, { "cell_type": "markdown", "id": "65eaa8a1", "metadata": {}, "source": [ "## 1. 
Inspecting the feature block — `PromptConfig.from_dict`\n", "\n", "The `[INFO]` block is produced by a pipeline of `Vary*` stages whose order is fixed\n", "by their return types:\n", "\n", "```\n", "VaryValueMap → VaryOrder → VaryConnector → VaryFormat\n", "(granularity) (order) (connector) (format)\n", "```\n", "\n", "You never instantiate those stages yourself — you pass a dict of overrides to\n", "`PromptConfig.from_dict(...)`. Valid keys: `format`, `connector`, `granularity`,\n", "`order`, `custom_prompt_prefix`, `custom_prompt_suffix`, `show_question`.\n", "\n", "Let's load the task + dataset once and render the **same row** under several\n", "variations." ] }, { "cell_type": "code", "execution_count": 3, "id": "721fbe79", "metadata": { "execution": { "iopub.execute_input": "2026-06-09T14:42:05.588831Z", "iopub.status.busy": "2026-06-09T14:42:05.588738Z", "iopub.status.idle": "2026-06-09T14:42:43.009339Z", "shell.execute_reply": "2026-06-09T14:42:43.008798Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading ACS data...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "dataset size = 1,664,500 rows; rendering 1 example row\n" ] } ], "source": [ "from folktexts import TaskMetadata\n", "from folktexts.acs import ACSDataset\n", "\n", "# MC task is the default; we reuse this same cached task object throughout.\n", "task = TaskMetadata.get_task(TASK_NAME)\n", "\n", "# Loads ACSIncome from the folktables cache (no download). 
We grab one row to render;\n", "# the full dataset is reused (subsampled) for the benchmark runs in section 5.\n", "dataset = ACSDataset.make_from_task(task=task, cache_dir=DATA_DIR)\n", "X_sample, _ = dataset.sample_n_train_examples(n=1, reuse_examples=True)\n", "row = X_sample.iloc[0]\n", "print(f\"dataset size = {len(dataset.data):,} rows; rendering 1 example row\")" ] }, { "cell_type": "code", "execution_count": 4, "id": "34cb955e", "metadata": { "execution": { "iopub.execute_input": "2026-06-09T14:42:43.010879Z", "iopub.status.busy": "2026-06-09T14:42:43.010747Z", "iopub.status.idle": "2026-06-09T14:42:43.016045Z", "shell.execute_reply": "2026-06-09T14:42:43.015691Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "======================================================================\n", "Default (textbullet, connector 'is:')\n", " variation = (defaults)\n", "----------------------------------------------------------------------\n", "The following data corresponds to a survey respondent. The survey was conducted among US residents in 2018. Please answer the question based on the information provided. The data provided is enough to reach an approximate answer.\n", "\n", "Information:\n", "- The age is: 53 years old.\n", "- The class of worker is: Owner of non-incorporated business, professional practice, or farm.\n", "- The highest educational attainment is: Bachelor's degree.\n", "- The marital status is: Married.\n", "- The occupation is: Musicians and singers.\n", "- The place of birth is: New York.\n", "- The relationship to the reference person in the survey is: The reference person itself.\n", "- The usual number of hours worked per week is: 20 hours.\n", "- The sex is: Male.\n", "- The race is: White.\n", "\n", "Question: What is this person's estimated yearly income?\n", "A. Below $50,000.\n", "B. 
Above $50,000.\n", "Answer:\n", "\n" ] } ], "source": [ "from folktexts.prompting import PromptConfig, encode_row_prompt\n", "\n", "def show(title, pv):\n", " # Render `row` under a PromptConfig built from variation dict `pv`.\n", " cfg = PromptConfig.from_dict(pv, task=task)\n", " print(f\"{'='*70}\\n{title}\\n variation = {pv or '(defaults)'}\\n{'-'*70}\")\n", " print(encode_row_prompt(row, task, prompt_config=cfg))\n", " print()\n", "\n", "show(\"Default (textbullet, connector 'is:')\", {})" ] }, { "cell_type": "code", "execution_count": 5, "id": "18d0ef89", "metadata": { "execution": { "iopub.execute_input": "2026-06-09T14:42:43.016960Z", "iopub.status.busy": "2026-06-09T14:42:43.016857Z", "iopub.status.idle": "2026-06-09T14:42:43.019134Z", "shell.execute_reply": "2026-06-09T14:42:43.018795Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "======================================================================\n", "format=bullet, connector='=', order=AGEP,SCHL,COW\n", " variation = {'format': 'bullet', 'connector': '=', 'order': 'AGEP,SCHL,COW'}\n", "----------------------------------------------------------------------\n", "The following data corresponds to a survey respondent. The survey was conducted among US residents in 2018. Please answer the question based on the information provided. The data provided is enough to reach an approximate answer.\n", "\n", "Information:\n", "- age = 53 years old\n", "- highest educational attainment = Bachelor's degree\n", "- class of worker = Owner of non-incorporated business, professional practice, or farm\n", "- marital status = Married\n", "- occupation = Musicians and singers\n", "- place of birth = New York\n", "- relationship to the reference person in the survey = The reference person itself\n", "- usual number of hours worked per week = 20 hours\n", "- sex = Male\n", "- race = White\n", "\n", "Question: What is this person's estimated yearly income?\n", "A. Below $50,000.\n", "B. 
Above $50,000.\n", "Answer:\n", "\n" ] } ], "source": [ "# Plain bullets, \"=\" connector, age/education/class-of-worker first.\n", "show(\"format=bullet, connector='=', order=AGEP,SCHL,COW\",\n", " {\"format\": \"bullet\", \"connector\": \"=\", \"order\": \"AGEP,SCHL,COW\"})" ] }, { "cell_type": "code", "execution_count": 6, "id": "79b22545", "metadata": { "execution": { "iopub.execute_input": "2026-06-09T14:42:43.020151Z", "iopub.status.busy": "2026-06-09T14:42:43.020042Z", "iopub.status.idle": "2026-06-09T14:42:43.024383Z", "shell.execute_reply": "2026-06-09T14:42:43.024073Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "======================================================================\n", "granularity=low, format=comma\n", " variation = {'granularity': 'low', 'format': 'comma'}\n", "----------------------------------------------------------------------\n", "The following data corresponds to a survey respondent. The survey was conducted among US residents in 2018. Please answer the question based on the information provided. The data provided is enough to reach an approximate answer.\n", "\n", "Information:\n", "age is: 50-59 years old, class of worker is: Self-employed, highest educational attainment is: Bachelor's degree, marital status is: Married, occupation is: Arts, Design, Entertainment, Sports, and Media, place of birth is: Northeast USA, relationship to the reference person in the survey is: Reference person, usual number of hours worked per week is: 20-29 hours, sex is: Male, race is: White\n", "\n", "Question: What is this person's estimated yearly income?\n", "A. Below $50,000.\n", "B. 
Above $50,000.\n", "Answer:\n", "\n" ] } ], "source": [ "# Coarser ACS feature values (age ranges, grouped occupations), comma-separated.\n", "show(\"granularity=low, format=comma\",\n", " {\"granularity\": \"low\", \"format\": \"comma\"})" ] }, { "cell_type": "code", "execution_count": 7, "id": "20583f11", "metadata": { "execution": { "iopub.execute_input": "2026-06-09T14:42:43.025259Z", "iopub.status.busy": "2026-06-09T14:42:43.025166Z", "iopub.status.idle": "2026-06-09T14:42:43.027361Z", "shell.execute_reply": "2026-06-09T14:42:43.027041Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "======================================================================\n", "custom prefix + suffix\n", " variation = {'custom_prompt_prefix': 'Consider the following US census respondent.', 'custom_prompt_suffix': 'Answer with a single letter.'}\n", "----------------------------------------------------------------------\n", "The following data corresponds to a survey respondent. The survey was conducted among US residents in 2018. Please answer the question based on the information provided. The data provided is enough to reach an approximate answer.\n", "Consider the following US census respondent.\n", "\n", "Information:\n", "- The age is: 53 years old.\n", "- The class of worker is: Owner of non-incorporated business, professional practice, or farm.\n", "- The highest educational attainment is: Bachelor's degree.\n", "- The marital status is: Married.\n", "- The occupation is: Musicians and singers.\n", "- The place of birth is: New York.\n", "- The relationship to the reference person in the survey is: The reference person itself.\n", "- The usual number of hours worked per week is: 20 hours.\n", "- The sex is: Male.\n", "- The race is: White.\n", "\n", "Question: What is this person's estimated yearly income?\n", "A. Below $50,000.\n", "B. 
Above $50,000.\n", "Answer:Answer with a single letter.\n", "\n" ] } ], "source": [ "# Inject extra text after the task description (prefix) and after the answer prefill (suffix).\n", "show(\"custom prefix + suffix\",\n", " {\"custom_prompt_prefix\": \"Consider the following US census respondent.\",\n", " \"custom_prompt_suffix\": \"Answer with a single letter.\"})" ] }, { "cell_type": "markdown", "id": "868e04eb", "metadata": {}, "source": [ "## 2. Question modes — multiple-choice vs numeric vs chain-of-thought\n", "\n", "The *question mode* is orthogonal to the feature-block variations above. It changes\n", "what the model is asked to produce and how the answer is read off. The right system\n", "prompt / answer-prefill is supplied automatically by the `QAInterface` subclass, so\n", "there is no separate flag to pass — you just select the mode:\n", "\n", "| Mode | How to select |\n", "|:---|:---|\n", "| Multiple-choice *(default)* | nothing |\n", "| Numeric | `task.use_numeric_qa = True` · `BenchmarkConfig(numeric_risk_prompting=True)` |\n", "| Chain-of-thought | `task.set_question(ChainOfThoughtQA(...))` · `BenchmarkConfig(cot_prompting=True)` |\n", "\n", "> Note: `TaskMetadata.get_task` returns a **cached singleton**, so flipping a mode\n", "> mutates the shared `task` object — we flip it back to multiple-choice after each\n", "> demo below." ] }, { "cell_type": "code", "execution_count": 8, "id": "34a54f97", "metadata": { "execution": { "iopub.execute_input": "2026-06-09T14:42:43.028164Z", "iopub.status.busy": "2026-06-09T14:42:43.028076Z", "iopub.status.idle": "2026-06-09T14:42:43.031094Z", "shell.execute_reply": "2026-06-09T14:42:43.030796Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== MULTIPLE-CHOICE (last lines) ===\n", "- The race is: White.\n", "\n", "Question: What is this person's estimated yearly income?\n", "A. Below $50,000.\n", "B. 
Above $50,000.\n", "Answer:\n", "\n", "=== NUMERIC (last lines) ===\n", "- The race is: White.\n", "\n", "Question: What is the probability that this person's estimated yearly income is above $50,000 ?\n", "Answer (between 0 and 1): 0.\n" ] } ], "source": [ "# Multiple-choice (default) vs numeric: note the different answer prefill at the end.\n", "task.use_numeric_qa = False\n", "mc = encode_row_prompt(row, task, prompt_config=PromptConfig.from_dict({}, task=task))\n", "task.use_numeric_qa = True\n", "num = encode_row_prompt(row, task, prompt_config=PromptConfig.from_dict({}, task=task))\n", "task.use_numeric_qa = False # reset the shared task\n", "\n", "print(\"=== MULTIPLE-CHOICE (last lines) ===\")\n", "print(\"\\n\".join(mc.splitlines()[-6:]))\n", "print(\"\\n=== NUMERIC (last lines) ===\")\n", "print(\"\\n\".join(num.splitlines()[-4:]))" ] }, { "cell_type": "code", "execution_count": 9, "id": "b83637e7", "metadata": { "execution": { "iopub.execute_input": "2026-06-09T14:42:43.032017Z", "iopub.status.busy": "2026-06-09T14:42:43.031927Z", "iopub.status.idle": "2026-06-09T14:42:43.034523Z", "shell.execute_reply": "2026-06-09T14:42:43.034253Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== CHAIN-OF-THOUGHT (last lines) ===\n", "\n", "Think step-by-step about the factors that could influence the answer to this question. 
After reasoning through the relevant information, provide your final probability estimate.\n", "\n", "Your response MUST end with your probability estimate in the following format:\n", "Probability: X%\n", "where X is a number between 0 and 100.\n", "\n", "Reasoning:\n" ] } ], "source": [ "# Chain-of-thought: free-form reasoning ending in a \"Probability: X%\" line.\n", "# ACS tasks ship MC + numeric questions; the CoT interface is built on demand from\n", "# the numeric question (exactly what BenchmarkConfig(cot_prompting=True) does internally).\n", "from folktexts.qa_interface import ChainOfThoughtQA\n", "\n", "base_q = task.direct_numeric_qa\n", "task.set_question(ChainOfThoughtQA(column=base_q.column, text=base_q.text))\n", "cot_prompt = encode_row_prompt(row, task, prompt_config=PromptConfig.from_dict({}, task=task))\n", "task.use_cot_qa = False # reset the shared task back to multiple-choice\n", "task.use_numeric_qa = False\n", "\n", "print(\"=== CHAIN-OF-THOUGHT (last lines) ===\")\n", "print(\"\\n\".join(cot_prompt.splitlines()[-8:]))" ] }, { "cell_type": "markdown", "id": "128d2ce8", "metadata": {}, "source": [ "## 3. Few-shot examples — `FewShotConfig`\n", "\n", "`FewShotConfig` prepends `n_shots` in-context examples drawn from the dataset's\n", "training split. `compose` controls class balance (`\"random\"`, `\"balanced\"`, or\n", "per-class counts), `reuse_examples` fixes the same examples across rows, and\n", "`example_order` permutes them." ] }, { "cell_type": "code", "execution_count": 10, "id": "6dbf5ddf", "metadata": { "execution": { "iopub.execute_input": "2026-06-09T14:42:43.035340Z", "iopub.status.busy": "2026-06-09T14:42:43.035254Z", "iopub.status.idle": "2026-06-09T14:42:43.067066Z", "shell.execute_reply": "2026-06-09T14:42:43.066630Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The following data corresponds to different survey respondents. The survey was conducted among US residents in 2018. 
Please answer each question based on the information provided. The data provided is enough to reach an approximate answer for each person.\n", "\n", "Information:\n", "- The age is: 43 years old.\n", "- The class of worker is: Working for a for-profit private company or organization.\n", "- The highest educational attainment is: Some college, 1 or more years, no degree.\n", "- The marital status is: Divorced.\n", "- The occupation is: Sales representatives of services, except advertising, insurance, financial services, and travel.\n", "- The place of birth is: Michigan.\n", "- The relationship to the reference person in the survey is: The reference person itself.\n", "- The usual number of hours worked per week is: 40 hours.\n", "- The sex is: Male.\n", "- The race is: Two or more races.\n", "\n", "Question: What is this person's estimated yearly income?\n", "A. Below $50,000.\n", "B. Above $50,000.\n", "Answer: B\n", "\n", "Information:\n", "- The age is: 53 years old.\n", "- The class of worker is: Owner of non-incorporated business, professional practice, or farm.\n", "- The highest educational attainment is: Bachelor's degree.\n", "- The marital status is: Married.\n", "- The occupation is: Musicians and singers.\n", "- The place of birth is: New York.\n", "- The relationship to the reference person in the survey is: The reference person itself.\n", "- The usual number of hours worked per week is: 20 hours.\n", "- The sex is: Male.\n", "- The race is: White.\n", "\n", "Question: What is this person's estimated yearly income?\n", "A. Below $50,000.\n", "B. 
Above $50,000.\n", "Answer: A\n", "\n", "Information:\n", "- The age is: 53 years old.\n", "- The class of worker is: Owner of non-incorporated business, professional practice, or farm.\n", "- The highest educational attainment is: Bachelor's degree.\n", "- The marital status is: Married.\n", "- The occupation is: Musicians and singers.\n", "- The place of birth is: New York.\n", "- The relationship to the reference person in the survey is: The reference person itself.\n", "- The usual number of hours worked per week is: 20 hours.\n", "- The sex is: Male.\n", "- The race is: White.\n", "\n", "Question: What is this person's estimated yearly income?\n", "A. Below $50,000.\n", "B. Above $50,000.\n", "Answer:\n" ] } ], "source": [ "from folktexts.prompting import FewShotConfig, encode_row_prompt_few_shot\n", "\n", "few_shot_prompt = encode_row_prompt_few_shot(\n", " row, task, dataset,\n", " few_shot_config=FewShotConfig(\n", " n_shots=2,\n", " compose=\"balanced\", # one example per class\n", " reuse_examples=True,\n", " ),\n", ")\n", "print(few_shot_prompt)" ] }, { "cell_type": "markdown", "id": "97439074", "metadata": {}, "source": [ "## 4. Chat template + system prompt\n", "\n", "For instruct/chat models, `use_chat_template` formats the prompt with the tokenizer's\n", "chat template. The system-role text has three modes; the default can also be\n", "spelled explicitly with the public `PROMPT_DEFAULT` sentinel:\n", "\n", "- omit it → the QA type's built-in default,\n", "- `PROMPT_DEFAULT` → same as omitting (explicit),\n", "- `None` → no system role at all (needed for templates that reject a system turn),\n", "- any string → your own system prompt." 
] }, { "cell_type": "code", "execution_count": 11, "id": "2f29c27a", "metadata": { "execution": { "iopub.execute_input": "2026-06-09T14:42:43.068099Z", "iopub.status.busy": "2026-06-09T14:42:43.067995Z", "iopub.status.idle": "2026-06-09T14:42:43.975582Z", "shell.execute_reply": "2026-06-09T14:42:43.975164Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "======================================================================\n", "default system prompt\n", "----------------------------------------------------------------------\n", "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n", "\n", "Cutting Knowledge Date: December 2023\n", "Today Date: 09 Jun 2026\n", "\n", "You are a helpful assistant. You answer multiple-choice questions based on the information provided. Respond with a single answer choice.<|eot_id|><|start_header_id|>user<|end_header_id|>\n", "\n", "The following data corresponds to a survey respondent. The survey was conducted among US residents in 2018. Please answer the question based on the information provided. The data provided is enough to reach an approximate answer.\n", "\n", "Information:\n", "- The age is: 53 years old.\n", "- The class of worker is: Owner of non-incorporated business, professional practice, or farm.\n", "- The highest educational attainment is: Bachelor's degree.\n", "- The marital status is: Married.\n", "- The occupation is: Musicians and singers.\n", "- The place of birth is: New York.\n", "- The relationship to the reference person in the survey is: The reference person itself.\n", "- The usual number of hours worked per week is: 20 hours.\n", "- The sex is: Male.\n", "- The race is: White.\n", "\n", "Question: What is this person's estimated yearly income?\n", "A. Below $50,000.\n", "B. 
Above $50,000.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n", "\n", "If had to select one of the options, my answer would be\n", "\n", "======================================================================\n", "no system role (system_prompt=None)\n", "----------------------------------------------------------------------\n", "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n", "\n", "Cutting Knowledge Date: December 2023\n", "Today Date: 09 Jun 2026\n", "\n", "<|eot_id|><|start_header_id|>user<|end_header_id|>\n", "\n", "The following data corresponds to a survey respondent. The survey was conducted among US residents in 2018. Please answer the question based on the information provided. The data provided is enough to reach an approximate answer.\n", "\n", "Information:\n", "- The age is: 53 years old.\n", "- The class of worker is: Owner of non-incorporated business, professional practice, or farm.\n", "- The highest educational attainment is: Bachelor's degree.\n", "- The marital status is: Married.\n", "- The occupation is: Musicians and singers.\n", "- The place of birth is: New York.\n", "- The relationship to the reference person in the survey is: The reference person itself.\n", "- The usual number of hours worked per week is: 20 hours.\n", "- The sex is: Male.\n", "- The race is: White.\n", "\n", "Question: What is this person's estimated yearly income?\n", "A. Below $50,000.\n", "B. 
Above $50,000.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n", "\n", "If had to select one of the options, my answer would be\n", "\n", "======================================================================\n", "custom system prompt\n", "----------------------------------------------------------------------\n", "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n", "\n", "Cutting Knowledge Date: December 2023\n", "Today Date: 09 Jun 2026\n", "\n", "You are a meticulous social scientist.<|eot_id|><|start_header_id|>user<|end_header_id|>\n", "\n", "The following data corresponds to a survey respondent. The survey was conducted among US residents in 2018. Please answer the question based on the information provided. The data provided is enough to reach an approximate answer.\n", "\n", "Information:\n", "- The age is: 53 years old.\n", "- The class of worker is: Owner of non-incorporated business, professional practice, or farm.\n", "- The highest educational attainment is: Bachelor's degree.\n", "- The marital status is: Married.\n", "- The occupation is: Musicians and singers.\n", "- The place of birth is: New York.\n", "- The relationship to the reference person in the survey is: The reference person itself.\n", "- The usual number of hours worked per week is: 20 hours.\n", "- The sex is: Male.\n", "- The race is: White.\n", "\n", "Question: What is this person's estimated yearly income?\n", "A. Below $50,000.\n", "B. 
Above $50,000.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n", "\n", "If had to select one of the options, my answer would be\n", "\n" ] } ], "source": [ "from transformers import AutoTokenizer\n", "from folktexts.prompting import PROMPT_DEFAULT, PromptBuilder\n", "\n", "tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)\n", "builder = PromptBuilder(task)\n", "\n", "for label, sys_prompt in [\n", " (\"default system prompt\", PROMPT_DEFAULT),\n", " (\"no system role (system_prompt=None)\", None),\n", " (\"custom system prompt\", \"You are a meticulous social scientist.\"),\n", "]:\n", " cfg = PromptConfig.from_dict({}, task=task, system_prompt=sys_prompt)\n", " chat = builder.build_chat(row, cfg, tokenizer)\n", " print(f\"{'='*70}\\n{label}\\n{'-'*70}\\n{chat}\\n\")" ] }, { "cell_type": "markdown", "id": "f5053610", "metadata": {}, "source": [ "## 5. Running benchmarks — comparing variations on the GPU\n", "\n", "Now we actually score the model under a few configurations and compare ROC AUC / ECE.\n", "We **load the vLLM engine once** and reuse it (and the dataset) across variations, so\n", "the GPU model is loaded a single time.\n", "\n", "The prompt variation is carried by the `BenchmarkConfig` (`prompt_variation`,\n", "`numeric_risk_prompting`, …). Because the config is part of the benchmark hash, each\n", "variation writes a **distinct** `results.bench-{hash}.json` — no collisions." 
] }, { "cell_type": "code", "execution_count": 12, "id": "eae03f23", "metadata": { "execution": { "iopub.execute_input": "2026-06-09T14:42:43.976953Z", "iopub.status.busy": "2026-06-09T14:42:43.976834Z", "iopub.status.idle": "2026-06-09T14:43:14.367443Z", "shell.execute_reply": "2026-06-09T14:43:14.366845Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:46 [utils.py:233] non-default args: {'trust_remote_code': True, 'seed': 42, 'max_model_len': 1024, 'gpu_memory_utilization': 0.85, 'max_logprobs': 50, 'logprobs_mode': 'processed_logprobs', 'disable_log_stats': True, 'model': '/fast/groups/sf/huggingface-models/meta-llama--Llama-3.2-3B-Instruct'}\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:46 [model.py:555] Resolved architecture: LlamaForCausalLM\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:46 [model.py:1680] Using max model len 1024\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:46 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=16384.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:46 [vllm.py:840] Asynchronous scheduling is enabled.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:46 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:49 [core.py:109] Initializing a V1 LLM engine (v0.20.1) with config: model='/fast/groups/sf/huggingface-models/meta-llama--Llama-3.2-3B-Instruct', speculative_config=None, tokenizer='/fast/groups/sf/huggingface-models/meta-llama--Llama-3.2-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, 
dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=42, served_model_name=/fast/groups/sf/huggingface-models/meta-llama--Llama-3.2-3B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': , 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 
'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [16384], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': , 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': , 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:49 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING 06-09 16:42:49 [nixl_utils.py:34] NIXL is not available\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING 06-09 16:42:49 [nixl_utils.py:44] NIXL agent config is not 
available\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:50 [parallel_state.py:1402] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://0.0.0.0:44077 backend=nccl\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:50 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:51 [gpu_model_runner.py:4777] Starting to load model /fast/groups/sf/huggingface-models/meta-llama--Llama-3.2-3B-Instruct...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:52 [cuda.py:368] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'FLASH_ATTN', 'TRITON_ATTN', 'FLEX_ATTENTION'].\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:52 [selector.py:136] Using HND KV cache layout for FLASHINFER backend.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:52 [weight_utils.py:904] Filesystem type for checkpoints: LUSTRE. Checkpoint size: 5.98 GiB. 
Available RAM: 131.86 GiB.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:52 [weight_utils.py:874] Prefetching checkpoint files into page cache started (in background)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:52 [weight_utils.py:851] Prefetching checkpoint files: 10% (1/2)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:52 [weight_utils.py:851] Prefetching checkpoint files: 20% (2/2)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:52 [weight_utils.py:869] Prefetching checkpoint files into page cache finished in 0.22s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:54 [default_loader.py:384] Loading weights took 1.82 seconds\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:55 [gpu_model_runner.py:4879] Model loading took 6.02 GiB memory and 2.601799 seconds\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:58 [backends.py:1069] Using cache directory: /home/acruz/.cache/vllm/torch_compile_cache/1af2a64564/rank_0_0/backbone for vLLM's torch.compile\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:42:58 
[backends.py:1128] Dynamo bytecode transform time: 2.55 s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:43:08 [backends.py:290] Directly load the compiled graph(s) for compile range (1, 16384) from the cache, took 10.499 s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:43:08 [decorators.py:305] Directly load AOT compilation from path /home/acruz/.cache/vllm/torch_compile_cache/torch_aot_compile/90f71e708c0050aa91a746f42d5d689764eb4ed31798b6f47744696f821a44cc/rank_0_0/model\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:43:08 [monitor.py:53] torch.compile took 13.21 s in total\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:43:08 [monitor.py:81] Initial profiling/warmup run took 0.13 s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:43:09 [utils.py:60] `_KV_CACHE_LAYOUT_OVERRIDE` variable detected. 
Setting KV cache layout to HND.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:43:09 [gpu_model_runner.py:5963] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=51 (largest=512)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING 06-09 16:43:09 [flashinfer.py:405] Using TRTLLM prefill attention (auto-detected).\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:43:10 [gpu_model_runner.py:6042] Estimated CUDA graph memory: 0.68 GiB total\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:43:11 [gpu_worker.py:440] Available KV cache memory: 142.28 GiB\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:43:11 [gpu_worker.py:455] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.8500 is equivalent to --gpu-memory-utilization=0.8462 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.8538. 
To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:43:11 [kv_cache_utils.py:1708] GPU KV cache size: 1,332,080 tokens\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:43:11 [kv_cache_utils.py:1709] Maximum concurrency for 1,024 tokens per request: 1300.86x\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:43:11 [kernel_warmup.py:69] Warming up FlashInfer attention.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:43:13 [gpu_model_runner.py:6133] Graph capturing finished in 3 secs, took 0.31 GiB\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:43:13 [gpu_worker.py:599] CUDA graph pool memory: 0.31 GiB (actual), 0.68 GiB (estimated), difference: 0.36 GiB (115.5%).\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:43:13 [core.py:299] init engine (profile, create kv cache, warmup model) took 18.78 s (compilation: 13.21 s)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(EngineCore pid=1858566) " ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO 06-09 16:43:14 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])\n" ] } ], "source": [ "from folktexts.llm_utils import load_vllm_model\n", "\n", "# Engine load reserves GPU memory 
(gpu_memory_utilization defaults to 0.85).\n", "# max_model_len is small here: MC/numeric need only a few generated tokens.\n", "llm, vllm_tokenizer = load_vllm_model(MODEL_PATH, max_model_len=1024)" ] }, { "cell_type": "code", "execution_count": 13, "id": "b9dbeb87", "metadata": { "execution": { "iopub.execute_input": "2026-06-09T14:43:14.369600Z", "iopub.status.busy": "2026-06-09T14:43:14.368898Z", "iopub.status.idle": "2026-06-09T14:43:14.388438Z", "shell.execute_reply": "2026-06-09T14:43:14.388066Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dataset.subsampling=0.01; test rows = 1,665\n" ] } ], "source": [ "# Subsample the (already-loaded) dataset for a quick demo run.\n", "dataset.subsample(0.01)\n", "print(f\"{dataset.subsampling=}; test rows = {len(dataset.get_test()[0]):,}\")" ] }, { "cell_type": "code", "execution_count": 14, "id": "e4cd0f75", "metadata": { "execution": { "iopub.execute_input": "2026-06-09T14:43:14.389774Z", "iopub.status.busy": "2026-06-09T14:43:14.389654Z", "iopub.status.idle": "2026-06-09T14:43:22.592616Z", "shell.execute_reply": "2026-06-09T14:43:22.592119Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[done] default (textbullet, MC): AUC=0.800 ECE=0.394\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[done] low granularity, comma (MC): AUC=0.817 ECE=0.420\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[done] numeric risk prompting: AUC=0.663 ECE=0.252\n" ] } ], "source": [ "from folktexts.benchmark import Benchmark, BenchmarkConfig\n", "\n", "VARIATIONS = [\n", " (\"default (textbullet, MC)\", BenchmarkConfig.default_config(\n", " batch_size=64, context_size=600)),\n", " (\"low granularity, comma (MC)\", BenchmarkConfig.default_config(\n", " prompt_variation={\"granularity\": \"low\", \"format\": \"comma\"},\n", " batch_size=64, context_size=600)),\n", " (\"numeric risk prompting\", BenchmarkConfig.default_config(\n", " 
numeric_risk_prompting=True, batch_size=64, context_size=600)),\n", "]\n", "\n", "rows = []\n", "for label, cfg in VARIATIONS:\n", " bench = Benchmark.make_benchmark(\n", " task=TASK_NAME, dataset=dataset,\n", " model=llm, tokenizer=vllm_tokenizer,\n", " backend=\"vllm\", model_name_or_path=MODEL_PATH,\n", " config=cfg,\n", " )\n", " # fit_threshold fits the 0/1 decision threshold on a few train rows so the\n", " # `accuracy` column is meaningful (ROC AUC and ECE are threshold-independent).\n", " res = bench.run(results_root_dir=RESULTS_DIR, fit_threshold=100)\n", " rows.append({\n", " \"variation\": label,\n", " \"roc_auc\": res[\"roc_auc\"],\n", " \"ece\": res[\"ece\"],\n", " \"accuracy\": res[\"accuracy\"],\n", " \"benchmark_hash\": res[\"benchmark_hash\"],\n", " })\n", " print(f\"[done] {label}: AUC={res['roc_auc']:.3f} ECE={res['ece']:.3f}\")" ] }, { "cell_type": "code", "execution_count": 15, "id": "f9055fec", "metadata": { "execution": { "iopub.execute_input": "2026-06-09T14:43:22.593880Z", "iopub.status.busy": "2026-06-09T14:43:22.593761Z", "iopub.status.idle": "2026-06-09T14:43:22.602031Z", "shell.execute_reply": "2026-06-09T14:43:22.601672Z" } }, "outputs": [ { "data": { "text/html": [ "
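<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\">\n",
 "      <th></th>\n",
 "      <th>roc_auc</th>\n",
 "      <th>ece</th>\n",
 "      <th>accuracy</th>\n",
 "      <th>benchmark_hash</th>\n",
 "    </tr>\n",
 "    <tr>\n",
 "      <th>variation</th>\n",
 "      <th></th>\n",
 "      <th></th>\n",
 "      <th></th>\n",
 "      <th></th>\n",
 "    </tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr>\n",
 "      <th>default (textbullet, MC)</th>\n",
 "      <td>0.799535</td>\n",
 "      <td>0.394215</td>\n",
 "      <td>0.691892</td>\n",
 "      <td>2306273009</td>\n",
 "    </tr>\n",
 "    <tr>\n",
 "      <th>low granularity, comma (MC)</th>\n",
 "      <td>0.817429</td>\n",
 "      <td>0.419900</td>\n",
 "      <td>0.697898</td>\n",
 "      <td>3849493214</td>\n",
 "    </tr>\n",
 "    <tr>\n",
 "      <th>numeric risk prompting</th>\n",
 "      <td>0.663173</td>\n",
 "      <td>0.252028</td>\n",
 "      <td>0.702102</td>\n",
 "      <td>3308184234</td>\n",
 "    </tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "</div>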
" ], "text/plain": [ " roc_auc ece accuracy benchmark_hash\n", "variation \n", "default (textbullet, MC) 0.799535 0.394215 0.691892 2306273009\n", "low granularity, comma (MC) 0.817429 0.419900 0.697898 3849493214\n", "numeric risk prompting 0.663173 0.252028 0.702102 3308184234" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "comparison = pd.DataFrame(rows).set_index(\"variation\")\n", "comparison" ] }, { "cell_type": "code", "execution_count": 16, "id": "2d04998c", "metadata": { "execution": { "iopub.execute_input": "2026-06-09T14:43:22.602956Z", "iopub.status.busy": "2026-06-09T14:43:22.602856Z", "iopub.status.idle": "2026-06-09T14:43:22.612480Z", "shell.execute_reply": "2026-06-09T14:43:22.612162Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "meta-llama--Llama-3.2-3B-Instruct_bench-2173322335/results.bench-2173322335.json\n", "meta-llama--Llama-3.2-3B-Instruct_bench-2306273009/results.bench-2306273009.json\n", "meta-llama--Llama-3.2-3B-Instruct_bench-265318770/results.bench-265318770.json\n", "meta-llama--Llama-3.2-3B-Instruct_bench-3308184234/results.bench-3308184234.json\n", "meta-llama--Llama-3.2-3B-Instruct_bench-3369999377/results.bench-3369999377.json\n", "meta-llama--Llama-3.2-3B-Instruct_bench-3849493214/results.bench-3849493214.json\n" ] } ], "source": [ "# Each variation produced a distinct results file (distinct hash → no overwrite):\n", "for p in sorted(RESULTS_DIR.glob(\"**/results.bench-*.json\")):\n", " print(p.relative_to(RESULTS_DIR))" ] }, { "cell_type": "markdown", "id": "128f51de", "metadata": {}, "source": [ "## Summary\n", "\n", "- **`PromptConfig.from_dict({...}, task=task)`** controls the feature block:\n", " `format`, `connector`, `granularity`, `order`, `custom_prompt_prefix`,\n", " `custom_prompt_suffix`, `show_question`.\n", "- **Question mode** (multiple-choice / numeric / chain-of-thought) is selected via the\n", " task 
(`use_numeric_qa`, `use_cot_qa`) or `BenchmarkConfig`\n", " (`numeric_risk_prompting`, `cot_prompting`); the matching system prompt and answer\n", " prefill follow automatically.\n", "- **`FewShotConfig`** adds in-context examples; **`use_chat_template` + `system_prompt`**\n", " drive the chat path (`PROMPT_DEFAULT` vs `None` vs a custom string).\n", "- Configs are hashable, so each distinct configuration gets its own results file and\n", " runs never overwrite one another.\n", "\n", "**CLI equivalents** (`run_acs_benchmark`):\n", "\n", "```bash\n", "run_acs_benchmark --model \"$MODEL\" --task ACSIncome --results-dir results \\\n", " --variation format=bullet connector== order=AGEP,SCHL,COW # section 1 ('connector==' sets the connector to '=')\n", "run_acs_benchmark ... --numeric-risk-prompting # section 2 (numeric)\n", "run_acs_benchmark ... --cot-prompting # section 2 (CoT)\n", "run_acs_benchmark ... --few-shot 2 --compose-few-shot-examples balanced # section 3\n", "run_acs_benchmark ... --use-chat-template --system-prompt \"...\" # section 4\n", "```\n", "\n", "See [`docs/configuring_prompts.md`](https://socialfoundations.github.io/folktexts/configuring_prompts.html) for the full\n", "reference and a migration note from the older flat-keyword API." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (folktexts)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.20" } }, "nbformat": 4, "nbformat_minor": 5 }