folktexts package

Subpackages

Submodules

folktexts.benchmark module

A benchmark class for measuring and evaluating LLM calibration.

class folktexts.benchmark.Benchmark(llm_clf, dataset, config=BenchmarkConfig(numeric_risk_prompting=False, few_shot=None, reuse_few_shot_examples=False, balance_few_shot_examples=False, batch_size=None, context_size=None, correct_order_bias=True, feature_subset=None, population_filter=None, seed=42))[source]

Bases: object

Measures and evaluates risk scores produced by an LLM.

A benchmark object to measure and evaluate risk scores produced by an LLM.

Parameters:

llm_clf (LLMClassifier) – A language model classifier object (can be local or web-hosted).
dataset (Dataset) – The dataset object to use for the benchmark.
config (BenchmarkConfig, optional) – The configuration object used to create the benchmark parameters. NOTE: This is used to uniquely identify the benchmark object for reproducibility; it will not be used to change the benchmark behavior. To configure the benchmark, pass a configuration object to the Benchmark.make_benchmark method.

ACS_DATASET_CONFIGS = {'horizon': '1-Year', 'seed': 42, 'subsampling': None, 'survey': 'person', 'survey_year': '2018', 'test_size': 0.1, 'val_size': 0.1}

property configs_dict: dict

classmethod make_acs_benchmark(task_name, *, model, tokenizer=None, data_dir=None, max_api_rpm=None, config=BenchmarkConfig(numeric_risk_prompting=False, few_shot=None, reuse_few_shot_examples=False, balance_few_shot_examples=False, batch_size=None, context_size=None, correct_order_bias=True, feature_subset=None, population_filter=None, seed=42), **kwargs)[source]

Create a standardized calibration benchmark on ACS data.

Parameters:

task_name (str) – The name of the ACS task to use.
model (AutoModelForCausalLM | str) – The transformers language model to use, or the model ID for a webAPI hosted model (e.g., “openai/gpt-4o-mini”).
tokenizer (AutoTokenizer, optional) – The tokenizer used to train the model (if using a transformers model). Not required for webAPI models.
data_dir (str | Path, optional) – Path to the directory to load data from and save data in.
max_api_rpm (int, optional) – The maximum number of API requests per minute for webAPI models.
config (BenchmarkConfig, optional) – Extra benchmark configurations, by default will use BenchmarkConfig.default_config().
**kwargs – Additional arguments passed to ACSDataset and BenchmarkConfig. By default will use a set of standardized configurations for reproducibility.

Returns:

bench – The ACS calibration benchmark object.

Return type:

Benchmark
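A minimal usage sketch of make_acs_benchmark, assuming the folktexts package is installed. The task name "ACSIncome", the rate limit, and the helper name run_acs_benchmark are assumptions made for illustration; the model ID is the example given in the parameter description. The import is deferred into the function body so that defining the helper has no side effects.

```python
from __future__ import annotations

from pathlib import Path


def run_acs_benchmark(results_root_dir: str | Path, data_dir: str | Path) -> float:
    """Hypothetical helper: build, run, and plot an ACS calibration benchmark."""
    # Deferred import: requires the folktexts package to be installed.
    from folktexts.benchmark import Benchmark

    bench = Benchmark.make_acs_benchmark(
        "ACSIncome",                 # assumed task name
        model="openai/gpt-4o-mini",  # web-API model ID, so no tokenizer is passed
        data_dir=data_dir,
        max_api_rpm=60,              # illustrative rate limit
    )
    # run() is documented to return the benchmark metric (ECE by default).
    metric = bench.run(results_root_dir=results_root_dir)
    # Plots are saved to disk; show_plots=False avoids opening windows.
    bench.plot_results(show_plots=False)
    return metric
```

Calling the helper triggers data loading and API requests, so it is best tried with a small rate limit first.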

classmethod make_benchmark(*, task, dataset, model, tokenizer=None, max_api_rpm=None, config=BenchmarkConfig(numeric_risk_prompting=False, few_shot=None, reuse_few_shot_examples=False, balance_few_shot_examples=False, batch_size=None, context_size=None, correct_order_bias=True, feature_subset=None, population_filter=None, seed=42), **kwargs)[source]

Create a calibration benchmark from a given configuration.

Parameters:

task (TaskMetadata | str) – The task metadata object or name of the task to use.
dataset (Dataset) – The dataset to use for the benchmark.
model (AutoModelForCausalLM | str) – The transformers language model to use, or the model ID for a webAPI hosted model (e.g., “openai/gpt-4o-mini”).
tokenizer (AutoTokenizer, optional) – The tokenizer used to train the model (if using a transformers model). Not required for webAPI models.
max_api_rpm (int, optional) – The maximum number of API requests per minute for webAPI models.
config (BenchmarkConfig, optional) – Extra benchmark configurations, by default will use BenchmarkConfig.default_config().
**kwargs – Additional arguments for easier configuration of the benchmark. Will simply use these values to update the config object.

Returns:

bench – The calibration benchmark object.

Return type:

Benchmark

property model_name

plot_results(*, show_plots=True)[source]

Render evaluation plots and save to disk.

Parameters:: show_plots (bool, optional) – Whether to show plots, by default True.
Returns:: plots_paths – The paths to the saved plots.
Return type:: dict[str, str]

property results

property results_dir: Path: Get the results directory for this benchmark.

property results_root_dir: Path

run(results_root_dir, fit_threshold=0)[source]

Run the calibration benchmark experiment.

Parameters:

results_root_dir (str | Path) – Path to root directory under which results will be saved.
fit_threshold (int | bool, optional) – Whether to fit the binarization threshold on a given number of training samples, by default 0 (will not fit the threshold).

Returns:

The benchmark metric value. By default this is the ECE score.

Return type:

float

save_results(results_root_dir=None)[source]

Save the benchmark results to disk.

Parameters:: results_root_dir (str | Path, optional) – Path to root directory under which results will be saved. By default will use self.results_root_dir.

property task

class folktexts.benchmark.BenchmarkConfig(numeric_risk_prompting=False, few_shot=None, reuse_few_shot_examples=False, balance_few_shot_examples=False, batch_size=None, context_size=None, correct_order_bias=True, feature_subset=None, population_filter=None, seed=42)[source]

Bases: object

A dataclass to hold the configuration for a risk-score benchmark.

numeric_risk_prompting

Whether to prompt for numeric risk-estimates instead of multiple-choice Q&A, by default False.

Type:: bool, optional

few_shot

Whether to use few-shot prompting with a given number of examples, by default None.

Type:: int | None, optional

reuse_few_shot_examples

Whether to reuse the same samples for few-shot prompting (or sample new ones every time), by default False.

Type:: bool, optional

balance_few_shot_examples

Whether to balance the samples for few-shot prompting with respect to their labels, by default False.

Type:: bool, optional

batch_size

The batch size to use for inference.

Type:: int | None, optional

context_size

The maximum context size when prompting the LLM.

Type:: int | None, optional

correct_order_bias

Whether to correct the ordering bias in multiple-choice Q&A when prompting the LLM, by default True.

Type:: bool, optional

feature_subset

Whether to use a subset of the standard feature set for the task. The list should contain the names of the columns of features to use.

Type:: list[str] | None, optional

population_filter

Optional population filter for this benchmark; must follow the format {“column_name”: “value”}.

Type:: dict | None, optional

seed

Random seed, set for reproducibility.

Type:: int, optional

balance_few_shot_examples: bool = False

batch_size: int | None = None

context_size: int | None = None

correct_order_bias: bool = True

classmethod default_config(**changes)[source]: Returns the default configuration with optional changes.

feature_subset: list[str] | None = None

few_shot: int | None = None

classmethod load_from_disk(path)[source]: Load the configuration from disk.

numeric_risk_prompting: bool = False

population_filter: dict | None = None

reuse_few_shot_examples: bool = False

save_to_disk(path)[source]: Save the configuration to disk.

seed: int = 42

update(**changes)[source]

Update the configuration with new values.

Return type:: BenchmarkConfig
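The update pattern can be illustrated with a small stand-in dataclass. SketchConfig is hypothetical and carries only a few of the fields listed above. Whether the real update mutates the object or returns a copy is not stated here; this sketch assumes a copy is returned.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in with a few of the BenchmarkConfig fields."""

    numeric_risk_prompting: bool = False
    few_shot: int | None = None
    correct_order_bias: bool = True
    seed: int = 42

    def update(self, **changes) -> SketchConfig:
        # Reject unknown keys early rather than silently ignoring them.
        unknown = set(changes) - set(self.__dataclass_fields__)
        if unknown:
            raise ValueError(f"Unknown config keys: {sorted(unknown)}")
        # Assumption: return a modified copy and leave the original untouched.
        return replace(self, **changes)


base = SketchConfig()
few_shot_cfg = base.update(few_shot=5, seed=7)
print(few_shot_cfg)
```

Returning a copy keeps the original configuration usable as a stable identifier for earlier runs.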

folktexts.col_to_text module

class folktexts.col_to_text.ColumnToText(name, short_description, value_map=None, question=None, connector_verb='is:', missing_value_fill='N/A', use_value_map_only=False)[source]

Bases: object

Maps a single column’s values to natural text.

Constructs a ColumnToText object.

Parameters:

name (str) – The column’s name.
short_description (str) – A short description of the column to be used before different values. For example, short_description=”yearly income” will result in “The yearly income is […]”.
value_map (dict[int | str, str] | Callable, optional) – A map between column values and their textual meaning. If not provided, will try to infer a mapping from the question.
question (QAInterface, optional) – A question associated with the column. If not provided, will try to infer a multiple-choice question from the value_map.
connector_verb (str, optional) – Which verb to use when connecting the column’s description to its value; by default “is:” (as in the class signature).
missing_value_fill (str, optional) – The value to use when the column’s value is not found in the value_map, by default “N/A”.
use_value_map_only (bool, optional) – Whether to only use the value_map for mapping values to text, or whether natural language representation should be generated using the connector_verb and short_description as well. By default (False) will construct a natural language representation of the form: “The [short_description] [connector_verb] [value_map.get(val)]”.

get_text(value)[source]

Returns the natural text representation of the given data value.

Return type:: str
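The template “The [short_description] [connector_verb] [value_map.get(val)]” described in the constructor parameters can be sketched as a plain function. The function name column_value_to_text, the trailing period, and the example column are assumptions; this is not the library's implementation.

```python
from typing import Callable, Union

ValueMap = Union[dict, Callable[[object], str]]


def column_value_to_text(
    value,
    short_description: str,
    value_map: ValueMap,
    connector_verb: str = "is",
    missing_value_fill: str = "N/A",
    use_value_map_only: bool = False,
) -> str:
    """Hypothetical re-implementation of the documented text template."""
    if callable(value_map):
        value_text = value_map(value)
    else:
        # Values missing from the map fall back to the fill string.
        value_text = value_map.get(value, missing_value_fill)
    if use_value_map_only:
        return value_text
    # The trailing period is an assumption of this sketch.
    return f"The {short_description} {connector_verb} {value_text}."


sex_map = {1: "Male", 2: "Female"}
print(column_value_to_text(2, "sex", sex_map))  # The sex is Female.
print(column_value_to_text(9, "sex", sex_map))  # The sex is N/A.
```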

property name: str

property question: QAInterface

property short_description: str

property value_map: Callable: Returns the value map function for this column.

folktexts.dataset module

General Dataset functionality for text-based datasets.

class folktexts.dataset.Dataset(data, task, test_size=0.1, val_size=0.1, subsampling=None, seed=42)[source]

Bases: object

Construct a Dataset object.

Parameters:

data (pd.DataFrame) – The dataset’s data in pandas DataFrame format.
task (TaskMetadata) – The metadata for the prediction task.
test_size (float, optional) – The size of the test set, as a fraction of the total dataset size, by default 0.1.
val_size (float, optional) – The size of the validation set, as a fraction of the total dataset size, by default 0.1.
subsampling (float, optional) – Whether to use sub-sampling, and which fraction of the data to keep. By default will not use sub-sampling (subsampling=None).
seed (int, optional) – The random state seed, by default 42.
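How test_size, val_size, subsampling, and seed interact can be sketched on row indices alone. The helper split_indices is hypothetical and uses only the standard library; the real class operates on a pandas DataFrame and may shuffle or split differently.

```python
import random


def split_indices(n_rows, test_size=0.1, val_size=0.1, subsampling=None, seed=42):
    """Hypothetical helper: shuffle row indices and cut train/val/test splits."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    if subsampling is not None:
        # Keep only the requested fraction of the shuffled rows.
        indices = indices[: round(len(indices) * subsampling)]
    n_test = round(len(indices) * test_size)
    n_val = round(len(indices) * val_size)
    test = indices[:n_test]
    val = indices[n_test : n_test + n_val]
    train = indices[n_test + n_val :]  # the remainder is the training set
    return train, val, test


train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # 800 100 100
```

Under these assumptions the training fraction is whatever remains after the test and validation fractions are removed, and a fixed seed reproduces the same split.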

property data: DataFrame

filter(population_feature_values)[source]: Filter dataset rows in-place.

get_data_split(split)[source]

Return type:: tuple[DataFrame, Series]

get_features_data()[source]

Return type:: DataFrame

get_sensitive_attribute_data()[source]

Return type:: Series

get_target_data()[source]

Return type:: Series

get_test()[source]

get_train()[source]

get_val()[source]

property name: str: A unique name for this dataset.

sample_n_train_examples(n, reuse_examples=False, class_balancing=False)[source]

Return a set of samples from the training set.

Parameters:

n (int) – The number of example rows to return.
reuse_examples (bool, optional) – Whether to reuse the same examples for consistency. By default will sample new examples each time (reuse_examples=False).

Returns:

X, y – The features and target data for the sampled examples.

Return type:

tuple[pd.DataFrame, pd.Series]

property seed: int

subsample(subsampling)[source]: Subsamples this dataset in-place.

property subsampling: float

property task: TaskMetadata

property test_size: float

property train_size: float

property val_size: float

folktexts.evaluation module

Module to map risk-estimates to a variety of evaluation metrics.

Notes

Code based on the error_parity.evaluation module, at: https://github.com/socialfoundations/error-parity/blob/main/error_parity/evaluation.py

folktexts.evaluation.bootstrap_estimate(eval_func, *, y_true, y_pred_scores, sensitive_attribute=None, k=200, confidence_pct=95, seed=42)[source]

Computes bootstrap estimates of the given evaluation function.

Parameters:

eval_func (Callable[[np.ndarray, np.ndarray, np.ndarray], dict[str, float]]) – The evaluation function to run for each bootstrap sample. Must follow the signature eval_func(y_true, y_pred_scores, sensitive_attribute).
y_true (np.ndarray) – The true labels.
y_pred_scores (np.ndarray) – The predicted scores.
sensitive_attribute (np.ndarray, optional) – Optionally, provide the sensitive attribute data to compute fairness metrics, by default None.
k (int, optional) – How many bootstrap samples to draw, by default 200.
confidence_pct (float, optional) – The confidence interval to use, in percentage, by default 95.
seed (int, optional) – The random seed, by default 42.

Returns:

results – A dictionary containing bootstrap estimates for a variety of metrics.

Return type:

dict[str, float]
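The bootstrap procedure can be sketched with the standard library. Both functions below are hypothetical: accuracy_metrics is an example eval_func with the documented three-argument signature, and bootstrap_sketch is a plain percentile bootstrap whose result key names are assumptions, not the library's.

```python
import random
import statistics


def accuracy_metrics(y_true, y_pred_scores, sensitive_attribute=None):
    """Example eval_func: accuracy at a fixed 0.5 threshold."""
    correct = sum((s >= 0.5) == bool(y) for y, s in zip(y_true, y_pred_scores))
    return {"accuracy": correct / len(y_true)}


def bootstrap_sketch(eval_func, *, y_true, y_pred_scores, k=200, confidence_pct=95, seed=42):
    """Hypothetical percentile bootstrap over k resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    samples = {}
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        metrics = eval_func([y_true[i] for i in idx], [y_pred_scores[i] for i in idx], None)
        for name, value in metrics.items():
            samples.setdefault(name, []).append(value)

    tail = (100 - confidence_pct) / 2 / 100  # 0.025 for a 95% interval
    results = {}
    for name, values in samples.items():
        values.sort()
        results[f"{name}_mean"] = statistics.fmean(values)
        results[f"{name}_low"] = values[int(tail * (k - 1))]
        results[f"{name}_high"] = values[int((1 - tail) * (k - 1))]
    return results


labels = [0, 1] * 25
scores = [0.2, 0.8, 0.6, 0.4, 0.1, 0.9, 0.3, 0.7, 0.45, 0.55] * 5
boot = bootstrap_sketch(accuracy_metrics, y_true=labels, y_pred_scores=scores)
print(boot)
```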

folktexts.evaluation.compute_best_threshold(y_true, y_pred_scores, *, false_pos_cost=1.0, false_neg_cost=1.0)[source]

Computes the binarization threshold that maximizes accuracy.

Parameters:

y_true (np.ndarray) – The true class labels.
y_pred_scores (np.ndarray) – The predicted risk scores.
false_pos_cost (float, optional) – The cost of a false positive error, by default 1.0
false_neg_cost (float, optional) – The cost of a false negative error, by default 1.0

Returns:

best_threshold – The threshold value that maximizes accuracy for the given predictions.

Return type:

float
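One way to picture the role of the two cost parameters is a brute-force scan over candidate thresholds. The function best_threshold_sketch is hypothetical and not the library's implementation. With equal costs, minimizing the weighted error count is the same as maximizing accuracy.

```python
def best_threshold_sketch(y_true, y_pred_scores, *, false_pos_cost=1.0, false_neg_cost=1.0):
    """Hypothetical exhaustive scan; predicts positive when score >= threshold."""
    best_t, best_cost = None, float("inf")
    # Candidates are the observed scores, so "predict all negative" is not considered.
    for t in sorted(set(y_pred_scores)):
        fp = sum(1 for y, s in zip(y_true, y_pred_scores) if s >= t and y == 0)
        fn = sum(1 for y, s in zip(y_true, y_pred_scores) if s < t and y == 1)
        cost = false_pos_cost * fp + false_neg_cost * fn
        if cost < best_cost:  # strict comparison keeps the lowest tied threshold
            best_t, best_cost = t, cost
    return best_t


y = [0, 1, 0, 1]
s = [0.2, 0.3, 0.5, 0.8]
print(best_threshold_sketch(y, s))                      # 0.3
print(best_threshold_sketch(y, s, false_pos_cost=5.0))  # 0.8
```

Raising false_pos_cost pushes the chosen threshold upward, since fewer samples are then labeled positive.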

folktexts.evaluation.evaluate_binary_predictions(y_true, y_pred)[source]

Evaluates the provided binary predictions on common performance metrics.

Parameters:

y_true (np.ndarray) – The true class labels.
y_pred (np.ndarray) – The binary predictions.

Returns:

A dictionary with key-value pairs of (metric name, metric value).

Return type:

dict

folktexts.evaluation.evaluate_binary_predictions_fairness(y_true, y_pred, sensitive_attribute, return_groupwise_metrics=False, min_group_size=0.04)[source]

Evaluates fairness of the given predictions.

Fairness metrics are computed as the ratios between group-wise performance metrics.

Parameters:

y_true (np.ndarray) – The true class labels.
y_pred (np.ndarray) – The discretized predictions.
sensitive_attribute (np.ndarray) – The sensitive attribute (protected group membership).
return_groupwise_metrics (bool, optional) – Whether to return group-wise performance metrics in addition to the ratios between them (True), or only the ratios (False), by default False.
min_group_size (float, optional) – The minimum fraction of samples (as a fraction of the total number of samples) that a group must have to be considered for fairness evaluation, by default 0.04. This is meant to avoid evaluating metrics on very small groups which leads to noisy and inconsistent results.

Returns:

A dictionary with key-value pairs of (metric name, metric value).

Return type:

dict
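The idea of a ratio between group-wise metrics, together with the min_group_size filter, can be sketched for a single metric. The function accuracy_ratio_sketch is hypothetical, and defining the ratio as minimum over maximum group accuracy is an assumption of this sketch.

```python
from collections import defaultdict


def accuracy_ratio_sketch(y_true, y_pred, sensitive_attribute, min_group_size=0.04):
    """Hypothetical group-wise accuracy and its min/max ratio across groups."""
    by_group = defaultdict(list)
    for y, p, g in zip(y_true, y_pred, sensitive_attribute):
        by_group[g].append(int(y == p))

    n_total = len(y_true)
    groupwise = {
        g: sum(hits) / len(hits)
        for g, hits in by_group.items()
        if len(hits) / n_total >= min_group_size  # skip very small groups
    }
    worst, best = min(groupwise.values()), max(groupwise.values())
    ratio = worst / best if best > 0 else 0.0
    return ratio, groupwise


y_true = [1, 1, 0, 0, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0, 0]
groups = ["A"] * 4 + ["B"] * 4 + ["C"]
ratio, per_group = accuracy_ratio_sketch(y_true, y_pred, groups, min_group_size=0.2)
print(ratio, per_group)  # 0.5 {'A': 1.0, 'B': 0.5}
```

Group "C" holds one of nine samples, so it is dropped at min_group_size=0.2; at the default 0.04 it would be kept and its single error would pull the ratio to zero, which is the kind of noise the filter is meant to avoid.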

folktexts.evaluation.evaluate_predictions(y_true, y_pred_scores, *, sensitive_attribute=None, threshold='best', model_name=None)[source]

Evaluates predictions on common performance and fairness metrics.

Parameters:

y_true (np.ndarray) – The true class labels.
y_pred_scores (np.ndarray) – The predicted scores.
sensitive_attribute (np.ndarray, optional) – The sensitive attribute data. Will compute fairness metrics if provided.
threshold (float | str, optional) – The threshold to use for binarizing the predictions, or “best” to infer which threshold maximizes accuracy.
model_name (str, optional) – The name of the model to be used on the plots, by default None.

Returns:

results – A dictionary with key-value pairs of (metric name, metric value).

Return type:

dict

folktexts.evaluation.evaluate_predictions_bootstrap(y_true, y_pred_scores, *, sensitive_attribute=None, threshold='best', k=200, confidence_pct=95, seed=42)[source]

Computes bootstrap estimates of classification metrics for the given predictions.

Parameters:

y_true (np.ndarray) – The true labels.
y_pred_scores (np.ndarray) – The score predictions.
sensitive_attribute (np.ndarray, optional) – The sensitive attribute data. Will compute fairness metrics if provided.
threshold (float | str, optional) – The threshold to use for binarizing the predictions, or “best” to infer which threshold maximizes accuracy, by default “best”.
k (int, optional) – How many bootstrap samples to draw, by default 200.
confidence_pct (float, optional) – How large of a confidence interval to use when reporting lower and upper bounds, by default 95 (i.e., 2.5 to 97.5 percentile of results).
seed (int, optional) – The random seed, by default 42.

Returns:

results – A dictionary containing bootstrap estimates for a variety of metrics.

Return type:

dict[str, float]

folktexts.llm_utils module

Common functions to use with transformer LLMs.

folktexts.llm_utils.add_pad_token(tokenizer)[source]

Add a pad token to the model and tokenizer if it doesn’t already exist.

Here we’re using the end-of-sequence (EOS) token as the pad token. Both the model weights and tokenizer vocabulary are untouched.

Another possible way would be to add a new token [PAD] to the tokenizer and update the tokenizer vocabulary and model weight embeddings accordingly. The embedding for the new pad token would be the average of all other embeddings.

folktexts.llm_utils.get_model_folder_path(model_name, root_dir='/tmp')[source]

Returns the folder where the model is saved.

Return type:: str

folktexts.llm_utils.get_model_size_B(model_name, default=None)[source]

Get the model size from the model name, in billions of parameters.

Return type:: int

folktexts.llm_utils.is_bf16_compatible()[source]

Checks if the current environment is bfloat16 compatible.

Return type:: bool

folktexts.llm_utils.load_model_tokenizer(model_name_or_path, **kwargs)[source]

Load a model and tokenizer from the given local path (or using the model name).

Parameters:

model_name_or_path (str | Path) – Model name or local path to the model folder.
kwargs (dict) – Additional keyword arguments to pass to the model from_pretrained call.

Returns:

The loaded model and tokenizer, respectively.

Return type:

tuple[AutoModelForCausalLM, AutoTokenizer]

folktexts.llm_utils.query_model_batch(text_inputs, model, tokenizer, context_size)[source]

Queries the model with a batch of text inputs.

Parameters:

text_inputs (list[str]) – The inputs to the model as a list of strings.
model (AutoModelForCausalLM) – The model to query.
tokenizer (AutoTokenizer) – The tokenizer used to encode the text inputs.
context_size (int) – The maximum context size to consider for each input (in tokens).

Returns:

last_token_probs – Model’s last token linear probabilities for each input as an np.array of shape (batch_size, vocab_size).

Return type:

np.array

folktexts.llm_utils.query_model_batch_multiple_passes(text_inputs, model, tokenizer, context_size, n_passes, digits_only=False)[source]

Queries an LM for multiple forward passes.

Greedy token search over multiple forward passes: Each forward pass takes the highest likelihood token from the previous pass.

NOTE: could use model.generate in the future!

Parameters:

text_inputs (list[str]) – The batch inputs to the model as a list of strings.
model (AutoModelForCausalLM) – The model to query.
tokenizer (AutoTokenizer) – The tokenizer used to encode the text inputs.
context_size (int) – The maximum context size to consider for each input (in tokens).
n_passes (int) – The number of forward passes to run.
digits_only (bool, optional) – Whether to only sample for digit tokens.

Returns:

last_token_probs – Last token linear probabilities for each forward pass, for each text in the input batch. The output has shape (batch_size, n_passes, vocab_size).

Return type:

np.array

folktexts.plotting module

Module to plot evaluation results.

folktexts.plotting.render_evaluation_plots(y_true, y_pred_scores, *, eval_results={}, model_name=None, imgs_dir=None, show_plots=False)[source]

Renders evaluation plots for the given predictions.

Return type:: dict

folktexts.plotting.render_fairness_plots(y_true, y_pred_scores, *, sensitive_attribute, eval_results={}, model_name=None, group_value_map, group_size_threshold=0.04, imgs_dir=None, show_plots=False)[source]

Renders fairness plots for the given predictions.

Return type:: dict

folktexts.plotting.save_fig(fig, fig_name, imgs_dir, format='pdf')[source]

Helper to save a matplotlib figure to disk.

Return type:: str

folktexts.prompting module

Module to map risk-estimation questions to different prompting techniques.

For example:
- multiple-choice Q&A vs. direct numeric Q&A;
- zero-shot vs. few-shot vs. CoT.

folktexts.prompting.apply_chat_template(tokenizer, user_prompt, system_prompt=None, chat_prompt='If had to select one of the options, my answer would be', **kwargs)[source]

Return type:: str

folktexts.prompting.encode_row_prompt(row, task, question=None, custom_prompt_prefix=None, add_task_description=True)[source]

Encode a question regarding a given row.

Return type:: str

folktexts.prompting.encode_row_prompt_chat(row, task, tokenizer, question=None, **chat_template_kwargs)[source]

Return type:: str

folktexts.prompting.encode_row_prompt_few_shot(row, task, dataset, n_shots, question=None, reuse_examples=False, class_balancing=False, custom_prompt_prefix=None)[source]

Encode a question regarding a given row using few-shot prompting.

Parameters:

row (pd.Series) – The row that the question will be about.
task (TaskMetadata) – The task that the row belongs to.
n_shots (int) – The number of example questions and answers to use before prompting about the given row.
reuse_examples (bool, optional) – Whether to reuse the same examples for consistency. By default will resample new examples each time (reuse_examples=False).

Returns:

prompt – The encoded few-shot prompt.

Return type:

str

folktexts.qa_interface module

Interface for question-answering with LLMs.

Create different types of questions (direct numeric, multiple-choice).
Encode questions and decode model outputs.
Compute risk-estimate from model outputs.

class folktexts.qa_interface.Choice(text, data_value, numeric_value=None)[source]

Bases: object

Represents a choice in multiple-choice Q&A.

text

The text of the choice. E.g., “25-34 years old”.

Type:: str

data_value

The categorical value corresponding to this choice in the data.

Type:: object

numeric_value

A meaningful numeric value for the choice. E.g., if the choice is “25-34 years old”, the numeric value could be 30. The choice with the highest numeric value can be used as a proxy for the positive class. If not provided, will try to use the choice’s data_value.

Type:: float, optional

data_value: object

get_numeric_value()[source]

Returns the numeric value of the choice.

Return type:: float

numeric_value: float = None

text: str

class folktexts.qa_interface.DirectNumericQA(column, text, num_forward_passes=2, answer_probability=True)[source]

Bases: QAInterface

Represents a direct numeric question.

Notes

For example, the prompt could be “Q: What is 2 + 2? A: ”, with the expected answer being “4”.

If looking for a direct numeric probability, the answer prompt will be framed like so: “Q: What is the probability, between 0 and 1, of getting heads on a coin flip? A: 0.”, so that a numeric answer can be extracted with at most 2 forward passes. This is done automatically by passing the kwarg answer_probability=True.

Note that some models have multi-digit tokens in their vocabulary, so we need to correctly assess which tokens in the vocabulary correspond to valid numeric answers.

answer_probability: bool = True

get_answer_from_model_output(last_token_probs, tokenizer_vocab)[source]

Outputs a numeric answer inferred from the model’s output.

Parameters:

last_token_probs (np.ndarray) – The last token probabilities of the model for the question. The first dimension must correspond to the number of forward passes as specified by num_forward_passes.
tokenizer_vocab (dict[str, int]) – The tokenizer’s vocabulary.

Returns:

answer – The numeric answer to the question.

Return type:

float | int

Notes

Eventually we could run a search algorithm to find the most likely answer over multiple forward passes, but for now we’ll just take the argmax on each forward pass.
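The per-pass argmax described in the note above can be sketched as follows. The function decode_probability_answer is hypothetical; it assumes the answer prompt already ends in "0." (the answer_probability=True case) and restricts each pass to tokens that are purely numeric, which covers multi-digit tokens.

```python
def decode_probability_answer(last_token_probs, tokenizer_vocab):
    """Hypothetical decoder: one numeric token per pass, read as '0.<digits>'."""
    # Keep only vocabulary entries made of ASCII digits (may be multi-digit).
    digit_tokens = {
        tok: idx for tok, idx in tokenizer_vocab.items()
        if tok.isascii() and tok.isdigit()
    }
    digits = ""
    for pass_probs in last_token_probs:
        # Argmax restricted to numeric tokens for this forward pass.
        best_token = max(digit_tokens, key=lambda tok: pass_probs[digit_tokens[tok]])
        digits += best_token
    return float(f"0.{digits}")


vocab = {"a": 0, "7": 1, "3": 2, "42": 3, "the": 4}
probs = [
    [0.50, 0.30, 0.10, 0.05, 0.05],  # pass 1: "7" is the most likely numeric token
    [0.10, 0.10, 0.60, 0.15, 0.05],  # pass 2: "3"
]
print(decode_probability_answer(probs, vocab))  # 0.73
```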

get_question_prompt()[source]

Returns a question and answer key.

Return type:: str

num_forward_passes: int = 2

class folktexts.qa_interface.MultipleChoiceQA(column, text, num_forward_passes=1, choices=<factory>, _answer_keys_source=<factory>)[source]

Bases: QAInterface

Represents a multiple-choice question and its answer keys.

property answer_keys: tuple[str, ...]

property choice_to_key: dict[Choice, str]

choices: tuple[Choice]

classmethod create_answer_keys_permutations(question)[source]

Yield questions with all permutations of answer keys.

Parameters:: question (Question) – The template question whose answer keys will be permuted.
Returns:: permutations – A generator of questions with all permutations of answer keys.
Return type:: Iterator[Question]
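Enumerating answer-key permutations can be sketched with itertools. The generator answer_key_permutations is hypothetical and works on plain strings rather than question objects; it presumably mirrors what is needed to average out ordering effects, though how the library consumes these permutations is not stated here.

```python
from itertools import permutations


def answer_key_permutations(choice_texts, answer_keys=("A", "B", "C", "D")):
    """Hypothetical generator: every assignment of answer keys to the choices."""
    keys = answer_keys[: len(choice_texts)]
    for ordering in permutations(choice_texts):
        # Each ordering pairs the same keys with a different choice order.
        yield dict(zip(keys, ordering))


for mapping in answer_key_permutations(["Yes", "No"]):
    print(mapping)
# {'A': 'Yes', 'B': 'No'}
# {'A': 'No', 'B': 'Yes'}
```

The number of permutations grows factorially with the number of choices, so this is only practical for short answer lists.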

classmethod create_question_from_value_map(column, value_map, attribute, **kwargs)[source]

Constructs a question from a value map.

Return type:: MultipleChoiceQA

get_answer_from_model_output(last_token_probs, tokenizer_vocab)[source]

Decodes the model’s output into an answer for the given question.

Parameters:

last_token_probs (np.ndarray) – The model’s last token probabilities for the question. The first dimension corresponds to the number of forward passes as specified by self.num_forward_passes.
tokenizer_vocab (dict[str, int]) – The tokenizer’s vocabulary.

Returns:

answer – The answer to the question.

Return type:

float
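One plausible way to turn answer-key probabilities into a single float is to renormalize over the answer keys and take the expected numeric value of the choices. The function multiple_choice_risk_score is hypothetical, takes the probability vector of a single forward pass, and is not claimed to be the library's decoding rule.

```python
def multiple_choice_risk_score(last_token_probs, tokenizer_vocab, key_to_numeric_value):
    """Hypothetical decoder: expected numeric value over the answer-key tokens."""
    key_probs = {
        key: last_token_probs[tokenizer_vocab[key]] for key in key_to_numeric_value
    }
    total = sum(key_probs.values())
    if total == 0:
        raise ValueError("Model assigned zero probability to every answer key.")
    # Renormalize so that only the answer keys share the probability mass.
    return sum(
        (prob / total) * key_to_numeric_value[key] for key, prob in key_probs.items()
    )


vocab = {"A": 0, "B": 1, "the": 2}
probs = [0.6, 0.2, 0.2]
score = multiple_choice_risk_score(probs, vocab, {"A": 1.0, "B": 0.0})
print(round(score, 3))  # 0.75
```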

get_answer_from_text(text)[source]

Return type:: Choice

get_answer_key_from_value(value)[source]

Returns the answer key corresponding to the given data value.

Return type:: str

get_question_prompt()[source]

Returns a question and answer key.

Return type:: str

get_value_to_text_map()[source]

Returns the map from choice data value to choice textual representation.

Return type:: dict[object, str]

property key_to_choice: dict[str, Choice]

num_forward_passes: int = 1

class folktexts.qa_interface.QAInterface(column, text, num_forward_passes)[source]

Bases: ABC

An interface for a question-answering system.

column: str

get_answer_from_model_output(last_token_probs, tokenizer_vocab)[source]

Decodes the model’s output into an answer for the given question.

Parameters:

last_token_probs (np.ndarray) – The model’s last token probabilities for the question. The first dimension corresponds to the number of forward passes as specified by self.num_forward_passes.
tokenizer_vocab (dict[str, int]) – The tokenizer’s vocabulary.

Returns:

answer – The answer to the question.

Return type:

float

get_question_prompt()[source]

Returns a question and answer key.

Return type:: str

num_forward_passes: int

text: str

folktexts.task module

Definition of a generic TaskMetadata class.

class folktexts.task.TaskMetadata(name, features, target, cols_to_text, sensitive_attribute=None, target_threshold=None, multiple_choice_qa=None, direct_numeric_qa=None, description=None, _use_numeric_qa=False)[source]

Bases: object

A base class to hold information on a prediction task.

check_task_columns_are_available(available_cols, raise_=True)[source]

Checks if all columns required by this task are available.

Parameters:

available_cols (list[str]) – The list of column names available in the dataset.
raise_ (bool, optional) – Whether to raise an error if some columns are missing, by default True.

Returns:

all_available – True if all required columns are present in the given list of available columns, False otherwise.

Return type:

bool

cols_to_text: dict[str, ColumnToText]: A mapping between column names and their textual descriptions.

create_task_with_feature_subset(feature_subset)[source]: Creates a new task with a subset of the original features.

description: str = None: A description of the task, including the population to which the task pertains.

direct_numeric_qa: DirectNumericQA = None: The direct numeric question and answer interface for this task.

features: list[str]: The names of the features used in the task.

get_row_description(row)[source]

Encode a description of a given data row in textual form.

Return type:: str

get_target()[source]

Resolves the name of the target column depending on self.target_threshold.

Return type:: str

classmethod get_task(name, use_numeric_qa=False)[source]

Fetches a previously created task by its name.

Parameters:

name (str) – The name of the task to fetch.
use_numeric_qa (bool, optional) – Whether to set the retrieved task to use verbalized numeric Q&A instead of the default multiple-choice Q&A prompts. Default is False.

Returns:

task – The task object with the given name.

Return type:

TaskMetadata

Raises:

ValueError – Raised if the task with the given name has not been created yet.

multiple_choice_qa: MultipleChoiceQA = None: The multiple-choice question and answer interface for this task.

name: str: The name of the task.

property question: QAInterface: Getter for the Q&A interface for this task.

sensitive_attribute: str = None: The name of the column used as the sensitive attribute data (if provided).

sensitive_attribute_value_map()[source]

Returns a mapping between sensitive attribute values and their descriptions.

Return type:: Callable

set_question(question)[source]: Sets the Q&A interface for this task.

target: str: The name of the target column.

target_threshold: Threshold = None: The threshold used to binarize the target column (if provided).

property use_numeric_qa: bool: Getter for whether to use numeric Q&A instead of multiple-choice Q&A prompts.

folktexts.threshold module

Helper function for defining binarization thresholds.

class folktexts.threshold.Threshold(value, op)[source]

Bases: object

A class to represent a threshold value and its comparison operator.

value

The threshold value to compare against.

Type:: float | int

op

The comparison operator to use. One of ‘>’, ‘<’, ‘>=’, ‘<=’, ‘==’, ‘!=’.

Type:: str

apply_to_column_data(data)[source]

Applies the threshold operation to a pandas Series or scalar value.

Return type:: int | Series

apply_to_column_name(column_name)[source]

Standardizes naming of thresholded columns.

Return type:: str

op: str

valid_ops: ClassVar[dict] = {'!=': <built-in function ne>, '<': <built-in function lt>, '<=': <built-in function le>, '==': <built-in function eq>, '>': <built-in function gt>, '>=': <built-in function ge>}

value: float | int
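The class can be pictured with a small stand-in built on the operator module. ThresholdSketch is hypothetical; it handles scalars only (the documented class also accepts a pandas Series), and the column naming scheme and the "PINCP" column name are assumptions for illustration.

```python
import operator
from dataclasses import dataclass
from typing import ClassVar


@dataclass(frozen=True)
class ThresholdSketch:
    """Hypothetical stand-in mirroring the documented Threshold fields."""

    value: float
    op: str

    valid_ops: ClassVar[dict] = {
        ">": operator.gt, "<": operator.lt, ">=": operator.ge,
        "<=": operator.le, "==": operator.eq, "!=": operator.ne,
    }

    def __post_init__(self):
        if self.op not in self.valid_ops:
            raise ValueError(f"Invalid comparison operator: {self.op!r}")

    def apply_to_value(self, data) -> int:
        # Binarize a scalar: 1 if the comparison holds, else 0.
        return int(self.valid_ops[self.op](data, self.value))

    def apply_to_column_name(self, column_name: str) -> str:
        # Naming scheme is an assumption of this sketch.
        return f"{column_name}{self.op}{self.value}"


income_threshold = ThresholdSketch(50000, ">")
print(income_threshold.apply_to_value(60000))          # 1
print(income_threshold.apply_to_column_name("PINCP"))  # PINCP>50000
```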
