folktexts.classifier package

Submodules

folktexts.classifier.base module

Module containing the base class for all LLM risk classifiers.

class folktexts.classifier.base.LLMClassifier(model_name, task, custom_prompt_prefix=None, encode_row=None, threshold=0.5, correct_order_bias=True, seed=42, **inference_kwargs)

Bases: BaseEstimator, ClassifierMixin, ABC

An interface to produce risk scores and class predictions with an LLM.

Creates an LLMClassifier object.

Parameters:
  • model_name (str) – The model name or ID.

  • task (TaskMetadata | str) – The task metadata object or name of an already created task.

  • custom_prompt_prefix (str, optional) – A custom prompt prefix to supply to the model before the encoded row data, by default None.

  • encode_row (Callable[[pd.Series], str], optional) – The function used to encode tabular rows into natural text. If not provided, will use the default encoding function for the task.

  • threshold (float, optional) – The classification threshold to use when outputting binary predictions, by default 0.5. Must be between 0 and 1. Will be re-calibrated if fit is called.

  • correct_order_bias (bool, optional) – Whether to correct ordering bias in multiple-choice Q&A questions, by default True.

  • seed (int, optional) – The random seed, used for reproducibility, by default 42.

  • **inference_kwargs – Additional keyword arguments to be used at inference time. Options include context_size and batch_size.
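As an illustration of the encode_row parameter described above, the following is a minimal sketch of a function that turns one tabular row into natural text. The function name, the wording of the header sentence, and the example columns are hypothetical; the package's default encoders are task-specific and may look quite different. The sketch accepts any mapping; a pd.Series exposes the same items() interface.

```python
from typing import Any, Mapping


def encode_row_as_text(row: Mapping[str, Any]) -> str:
    """Render one tabular row as a short bulleted description (hypothetical format)."""
    lines = [f"- {column}: {value}" for column, value in row.items()]
    return "The following data describes one individual:\n" + "\n".join(lines)


# Hypothetical example row; real tasks define their own columns.
example_row = {"age": 42, "occupation": "teacher", "hours_per_week": 40}
print(encode_row_as_text(example_row))
```

A callable with this shape could then be passed as encode_row, assuming it accepts the row type the classifier supplies.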

DEFAULT_INFERENCE_KWARGS = {'batch_size': 16, 'context_size': 600}
compute_risk_estimates_for_dataframe(df)

Compute risk estimates for a specific dataframe (internal helper function).

Parameters:

df (pd.DataFrame) – The dataframe to compute risk estimates for.

Returns:

risk_scores – The risk estimates for each row in the dataframe.

Return type:

np.ndarray
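The correct_order_bias option suggests that risk estimates can depend on the order in which answer choices are listed. One common way to reduce such a bias, sketched below, is to score each row under both orderings and average the results. This is an assumption about the general technique, not a description of the package's internals; score_with_order and biased_scorer are hypothetical stand-ins for a model call.

```python
from statistics import fmean
from typing import Callable, Sequence


def order_corrected_risk(
    score_with_order: Callable[[Sequence[str]], float],
    choices: Sequence[str],
) -> float:
    """Average the positive-class score over the original and reversed choice order."""
    orderings = [list(choices), list(reversed(choices))]
    return fmean([score_with_order(order) for order in orderings])


def biased_scorer(order: Sequence[str]) -> float:
    # Hypothetical stand-in for an LLM call that favors the first-listed choice.
    base = 0.60
    return base + 0.10 if order[0] == "Yes" else base - 0.10


print(order_corrected_risk(biased_scorer, ["Yes", "No"]))
```

Averaging cancels a bias that is symmetric across the two orderings, at the price of two model queries per row.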

compute_risk_estimates_for_dataset(dataset)

Computes risk estimates for each row in the dataset.

Parameters:

dataset (Dataset) – The dataset to compute risk estimates for.

Returns:

results – The risk estimates for each data type in the dataset (usually “train”, “val”, “test”).

Return type:

dict[str, np.ndarray]
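The shape of the returned mapping can be sketched as follows: one array of risk estimates per data split. Plain lists stand in for np.ndarray here, and both the splits dictionary and constant_scorer are hypothetical; the real Dataset object is not modeled.

```python
from typing import Callable, Dict, List, Mapping, Sequence

Row = Mapping[str, object]


def risk_estimates_by_split(
    splits: Mapping[str, Sequence[Row]],
    score_rows: Callable[[Sequence[Row]], List[float]],
) -> Dict[str, List[float]]:
    """Apply the same scoring function to every split and key results by split name."""
    return {name: score_rows(rows) for name, rows in splits.items()}


def constant_scorer(rows: Sequence[Row]) -> List[float]:
    # Hypothetical stand-in for the per-dataframe scoring step.
    return [0.5 for _ in rows]


splits = {"train": [{"age": 30}, {"age": 51}], "test": [{"age": 42}]}
results = risk_estimates_by_split(splits, constant_scorer)
print({name: len(scores) for name, scores in results.items()})
```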

property correct_order_bias: bool
property custom_prompt_prefix: str
property encode_row: Callable[[Series], str]
fit(X, y, *, false_pos_cost=1.0, false_neg_cost=1.0, **kwargs)

Uses the provided data sample to fit the prediction threshold.
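To make the role of false_pos_cost and false_neg_cost concrete, here is a minimal sketch of cost-sensitive threshold fitting: it sweeps candidate thresholds and keeps the one with the lowest total misclassification cost on the sample. This is one plausible approach under stated assumptions (scores in [0, 1], labels in {0, 1}, a score at or above the threshold counts as positive); the package may select its threshold differently.

```python
from typing import Sequence


def fit_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    false_pos_cost: float = 1.0,
    false_neg_cost: float = 1.0,
) -> float:
    """Return the candidate threshold with the lowest total misclassification cost."""
    # Candidates: every observed score plus the two extremes of the [0, 1] range.
    candidates = sorted({0.0, 1.0, *scores})

    def total_cost(threshold: float) -> float:
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        false_neg = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        return false_pos * false_pos_cost + false_neg * false_neg_cost

    # min() keeps the first (lowest) threshold among ties.
    return min(candidates, key=total_cost)


scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(fit_threshold(scores, labels))
print(fit_threshold(scores, labels, false_pos_cost=5.0))
```

Raising false_pos_cost pushes the chosen threshold up, so fewer rows are predicted positive; raising false_neg_cost has the opposite effect.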

property inference_kwargs: dict
property model_name: str
predict(data, predictions_save_path=None, labels=None)

Returns binary predictions for the given data.

Return type:

ndarray | dict[str, ndarray]
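Binary predictions follow from risk scores by comparison against the classification threshold. The sketch below mirrors the two documented return shapes (a single array, or a mapping from split name to array), using plain lists in place of ndarray. Whether the comparison is inclusive (>=) is an assumption, and binarize is a hypothetical helper name.

```python
from typing import Dict, List, Sequence, Union

Scores = Union[Sequence[float], Dict[str, Sequence[float]]]
Predictions = Union[List[int], Dict[str, List[int]]]


def binarize(scores: Scores, threshold: float = 0.5) -> Predictions:
    """Map risk scores to 0/1 predictions, preserving the input's shape."""
    if isinstance(scores, dict):
        # One list of predictions per data split.
        return {split: binarize(values, threshold) for split, values in scores.items()}
    return [int(score >= threshold) for score in scores]


print(binarize([0.2, 0.5, 0.9]))
print(binarize({"test": [0.1, 0.7]}, threshold=0.6))
```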

predict_proba(data, predictions_save_path=None, labels=None)

Returns probability estimates for the given data.

Parameters:
  • data (pd.DataFrame) – The DataFrame to compute risk estimates for.

  • predictions_save_path (str | Path, optional) – If provided, will save the computed risk scores to this path on disk. If the path already exists, will attempt to load pre-computed predictions from it instead.

  • labels (pd.Series | np.ndarray, optional) – The labels corresponding to the provided data. Not required to compute predictions. Will only be used to save alongside predictions to disk.

Returns:

risk_scores – The risk scores for the given data.

Return type:

np.ndarray
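The load-or-compute behavior described for predictions_save_path can be sketched as below. The JSON file layout, the key names, and the predict_proba_cached and score_rows names are all assumptions made for illustration; the package's on-disk format is not documented here.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Optional, Sequence, Union


def predict_proba_cached(
    rows: Sequence[object],
    score_rows: Callable[[Sequence[object]], Sequence[float]],
    predictions_save_path: Optional[Union[str, Path]] = None,
    labels: Optional[Sequence[int]] = None,
) -> List[float]:
    """Load risk scores from disk if present; otherwise compute and save them."""
    path = Path(predictions_save_path) if predictions_save_path is not None else None
    if path is not None and path.exists():
        return json.loads(path.read_text())["risk_score"]

    scores = list(score_rows(rows))
    if path is not None:
        payload = {"risk_score": scores}
        if labels is not None:
            # Labels are stored only for later reference; they do not affect scores.
            payload["label"] = list(labels)
        path.write_text(json.dumps(payload))
    return scores


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "predictions.json"
    computed = predict_proba_cached([1, 2], lambda rows: [0.25 for _ in rows], cache_file)
    reloaded = predict_proba_cached([1, 2], lambda rows: [0.99 for _ in rows], cache_file)
    print(computed == reloaded)
```

Note the consequence of this design: once a file exists at the path, the scoring function is not called again, so a stale file silently wins over fresh inputs.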

property seed: int
set_fit_request(*, false_neg_cost: bool | None | str = '$UNCHANGED$', false_pos_cost: bool | None | str = '$UNCHANGED$') LLMClassifier

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • false_neg_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_neg_cost parameter in fit.

  • false_pos_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_pos_cost parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_inference_kwargs(**kwargs)

Set inference kwargs for the model.
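A minimal sketch of how user-supplied inference kwargs might be layered over the documented DEFAULT_INFERENCE_KWARGS values. It assumes new values update the existing ones rather than replace the whole dictionary, which this page does not confirm; merge_inference_kwargs is a hypothetical helper.

```python
from typing import Any, Dict, Mapping

# Values copied from the DEFAULT_INFERENCE_KWARGS attribute documented above.
DEFAULT_INFERENCE_KWARGS = {"batch_size": 16, "context_size": 600}


def merge_inference_kwargs(current: Mapping[str, Any], **overrides: Any) -> Dict[str, Any]:
    """Return a new dict in which overrides take precedence over current values."""
    merged = dict(current)  # copy, so the defaults are never mutated
    merged.update(overrides)
    return merged


print(merge_inference_kwargs(DEFAULT_INFERENCE_KWARGS, batch_size=4))
```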

set_predict_proba_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') LLMClassifier

Request metadata passed to the predict_proba method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict_proba.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict_proba.

  • labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict_proba.

  • predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict_proba.

Returns:

self – The updated object.

Return type:

object

set_predict_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') LLMClassifier

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict.

  • labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict.

  • predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict.

Returns:

self – The updated object.

Return type:

object

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') LLMClassifier

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

property task: TaskMetadata
property threshold: float

folktexts.classifier.transformers_classifier module

Module for using huggingface transformers models as classifiers.

class folktexts.classifier.transformers_classifier.TransformersLLMClassifier(model, tokenizer, task, custom_prompt_prefix=None, encode_row=None, threshold=0.5, correct_order_bias=True, seed=42, **inference_kwargs)

Bases: LLMClassifier

Use a huggingface transformers model to produce risk scores.

Creates an LLMClassifier based on a huggingface transformers model.

Parameters:
  • model (AutoModelForCausalLM) – The torch language model to use for inference.

  • tokenizer (AutoTokenizer) – The tokenizer used to train the model.

  • task (TaskMetadata | str) – The task metadata object or name of an already created task.

  • custom_prompt_prefix (str, optional) – A custom prompt prefix to supply to the model before the encoded row data, by default None.

  • encode_row (Callable[[pd.Series], str], optional) – The function used to encode tabular rows into natural text. If not provided, will use the default encoding function for the task.

  • threshold (float, optional) – The classification threshold to use when outputting binary predictions, by default 0.5. Must be between 0 and 1. Will be re-calibrated if fit is called.

  • correct_order_bias (bool, optional) – Whether to correct ordering bias in multiple-choice Q&A questions, by default True.

  • seed (int, optional) – The random seed, used for reproducibility, by default 42.

  • **inference_kwargs – Additional keyword arguments to be used at inference time. Options include context_size and batch_size.
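With a causal language model, a risk score for a two-choice question is often derived from the next-token logits of the two answer tokens, renormalized so that they sum to one. The sketch below shows that renormalization only; it is an assumption about the general technique rather than a description of this class's internals, and the logit values are made up.

```python
import math


def risk_score_from_answer_logits(logit_positive: float, logit_negative: float) -> float:
    """Softmax restricted to the two answer tokens; returns P(positive answer)."""
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(logit_positive, logit_negative)
    exp_positive = math.exp(logit_positive - shift)
    exp_negative = math.exp(logit_negative - shift)
    return exp_positive / (exp_positive + exp_negative)


# Hypothetical logits for the tokens that encode the positive and negative answers.
print(risk_score_from_answer_logits(2.0, 0.0))
```

Restricting the softmax to the answer tokens discards probability mass the model places on unrelated tokens, so the two scores are directly comparable across rows.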

property model: AutoModelForCausalLM
set_fit_request(*, false_neg_cost: bool | None | str = '$UNCHANGED$', false_pos_cost: bool | None | str = '$UNCHANGED$') TransformersLLMClassifier

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • false_neg_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_neg_cost parameter in fit.

  • false_pos_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_pos_cost parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_predict_proba_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') TransformersLLMClassifier

Request metadata passed to the predict_proba method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict_proba.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict_proba.

  • labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict_proba.

  • predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict_proba.

Returns:

self – The updated object.

Return type:

object

set_predict_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') TransformersLLMClassifier

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict.

  • labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict.

  • predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict.

Returns:

self – The updated object.

Return type:

object

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') TransformersLLMClassifier

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

property tokenizer: AutoTokenizer

folktexts.classifier.web_api_classifier module

Module for using a language model through a web API for risk classification.

class folktexts.classifier.web_api_classifier.WebAPILLMClassifier(model_name, task, custom_prompt_prefix=None, encode_row=None, threshold=0.5, correct_order_bias=True, max_api_rpm=5000, seed=42, **inference_kwargs)

Bases: LLMClassifier

Use an LLM through a web API to produce risk scores.

Creates an LLMClassifier object that uses a web API for inference.

Parameters:
  • model_name (str) – The model ID to be resolved by litellm.

  • task (TaskMetadata | str) – The task metadata object or name of an already created task.

  • custom_prompt_prefix (str, optional) – A custom prompt prefix to supply to the model before the encoded row data, by default None.

  • encode_row (Callable[[pd.Series], str], optional) – The function used to encode tabular rows into natural text. If not provided, will use the default encoding function for the task.

  • threshold (float, optional) – The classification threshold to use when outputting binary predictions, by default 0.5. Must be between 0 and 1. Will be re-calibrated if fit is called.

  • correct_order_bias (bool, optional) – Whether to correct ordering bias in multiple-choice Q&A questions, by default True.

  • max_api_rpm (int, optional) – The maximum number of requests per minute allowed for the API, by default 5000.

  • seed (int, optional) – The random seed, used for reproducibility, by default 42.

  • **inference_kwargs – Additional keyword arguments to be used at inference time. Options include context_size and batch_size.
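A requests-per-minute cap such as max_api_rpm can be enforced client-side by spacing calls at least 60 / max_rpm seconds apart. The following is a minimal sketch of that idea; RequestRateLimiter is a hypothetical class, and the package may throttle requests in another way. The clock and sleep functions are injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable, Optional


class RequestRateLimiter:
    """Space out calls so that at most max_rpm requests start per minute."""

    def __init__(
        self,
        max_rpm: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if max_rpm <= 0:
            raise ValueError("max_rpm must be positive")
        self.min_interval = 60.0 / max_rpm
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> None:
        """Block until the next request is allowed, then reserve the following slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


limiter = RequestRateLimiter(max_rpm=5000)
print(limiter.min_interval)  # seconds between consecutive requests
```

Calling limiter.wait() immediately before each API request would keep the request rate at or below the configured cap.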

static check_webAPI_deps()

Check if litellm dependencies are available.

Return type:

bool
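An optional-dependency check of this kind can be written with the standard library alone. The sketch below tests whether top-level modules are importable without actually importing them; dependencies_available is a hypothetical helper, and the real method may check differently.

```python
import importlib.util


def dependencies_available(*module_names: str) -> bool:
    """Return True only if every named top-level module can be found."""
    # find_spec returns None for a missing top-level module. Dotted names are
    # avoided here because they raise if the parent package is missing.
    return all(importlib.util.find_spec(name) is not None for name in module_names)


print(dependencies_available("json"))
print(dependencies_available("litellm"))  # depends on the local environment
```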

set_fit_request(*, false_neg_cost: bool | None | str = '$UNCHANGED$', false_pos_cost: bool | None | str = '$UNCHANGED$') WebAPILLMClassifier

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • false_neg_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_neg_cost parameter in fit.

  • false_pos_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_pos_cost parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_predict_proba_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') WebAPILLMClassifier

Request metadata passed to the predict_proba method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict_proba.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict_proba.

  • labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict_proba.

  • predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict_proba.

Returns:

self – The updated object.

Return type:

object

set_predict_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') WebAPILLMClassifier

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict.

  • labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict.

  • predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict.

Returns:

self – The updated object.

Return type:

object

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') WebAPILLMClassifier

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

track_cost_callback(kwargs, completion_response, start_time, end_time)

Callback function to track the cost of API calls.
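The four-argument signature matches the shape of a litellm custom success callback. A minimal sketch of a cost accumulator with that signature follows. It assumes the per-call cost arrives under the "response_cost" key of kwargs, as in litellm's custom-callback examples; CostTracker and its attributes are hypothetical, and the callback is invoked by hand here rather than by litellm.

```python
from typing import Any, Mapping


class CostTracker:
    """Accumulate the reported cost of API calls."""

    def __init__(self) -> None:
        self.total_cost = 0.0
        self.num_calls = 0

    def track_cost_callback(
        self,
        kwargs: Mapping[str, Any],
        completion_response: Any,
        start_time: Any,
        end_time: Any,
    ) -> None:
        # Assumption: the cost is reported under "response_cost"; a missing or
        # None value is counted as zero.
        self.total_cost += float(kwargs.get("response_cost") or 0.0)
        self.num_calls += 1


tracker = CostTracker()
tracker.track_cost_callback({"response_cost": 0.002}, None, None, None)
tracker.track_cost_callback({}, None, None, None)
print(tracker.num_calls, tracker.total_cost)
```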

Module contents