folktexts.classifier package

Submodules

folktexts.classifier.base module

Module containing the base class for all LLM risk classifiers.

class folktexts.classifier.base.LLMClassifier(model_name, task, custom_prompt_prefix=None, encode_row=None, threshold=0.5, correct_order_bias=True, seed=42, **inference_kwargs)

Bases: BaseEstimator, ClassifierMixin, ABC

An interface to produce risk scores and class predictions with an LLM.

Creates an LLMClassifier object.

Parameters:
  • model_name (str) – The model name or ID.

  • task (TaskMetadata | str) – The task metadata object or name of an already created task.

  • custom_prompt_prefix (str, optional) – A custom prompt prefix to supply to the model before the encoded row data, by default None.

  • encode_row (Callable[[pd.Series], str], optional) – The function used to encode tabular rows into natural text. If not provided, will use the default encoding function for the task.

  • threshold (float, optional) – The classification threshold to use when outputting binary predictions, by default 0.5. Must be between 0 and 1. Will be re-calibrated if fit is called.

  • correct_order_bias (bool, optional) – Whether to correct ordering bias in multiple-choice Q&A questions, by default True.

  • seed (int, optional) – The random seed, used for reproducibility, by default 42.

  • **inference_kwargs – Additional keyword arguments to be used at inference time. Options include context_size and batch_size.
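
As a rough illustration of the encode_row and custom_prompt_prefix parameters, the sketch below turns one tabular row into text and places a prefix before it. A plain dict stands in for the pd.Series row (a Series also exposes .items()). The column names, the encode_row_as_text and build_prompt helpers, and the prompt layout are all hypothetical; the library's default encoders and prompt template may differ.

```python
from typing import Mapping, Optional


def encode_row_as_text(row: Mapping[str, object]) -> str:
    """Hypothetical encoder: one '- column: value' line per feature."""
    return "\n".join(f"- {column}: {value}" for column, value in row.items())


def build_prompt(
    row: Mapping[str, object],
    custom_prompt_prefix: Optional[str] = None,
) -> str:
    """Place the optional prefix before the encoded row (assumed layout)."""
    encoded = encode_row_as_text(row)
    if custom_prompt_prefix is None:
        return encoded
    return f"{custom_prompt_prefix}\n{encoded}"


example_row = {"age": 42, "occupation": "nurse"}  # made-up columns
prompt = build_prompt(example_row, custom_prompt_prefix="Survey respondent profile:")
print(prompt)
```

Any callable with the same shape (row in, string out) could be supplied as encode_row under this reading of the parameter description.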

DEFAULT_INFERENCE_KWARGS = {'batch_size': 16, 'context_size': 600}
compute_risk_estimates_for_dataframe(df)

Compute risk estimates for a specific dataframe (internal helper function).

Parameters:

df (pd.DataFrame) – The dataframe to compute risk estimates for.

Returns:

risk_scores – The risk estimates for each row in the dataframe.

Return type:

np.ndarray
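
The batch_size inference option suggests that rows are scored in groups rather than one at a time. A minimal sketch of that idea follows, assuming simple consecutive batching; iter_batches, compute_scores, and toy_score_batch are hypothetical names, and the toy scorer merely stands in for a model call.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(rows: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def compute_scores(
    rows: Sequence[str],
    score_batch: Callable[[Sequence[str]], List[float]],
    batch_size: int = 16,
) -> List[float]:
    """Score every row, one batch at a time, preserving row order."""
    scores: List[float] = []
    for batch in iter_batches(rows, batch_size):
        scores.extend(score_batch(batch))
    return scores


def toy_score_batch(batch: Sequence[str]) -> List[float]:
    # Placeholder for a model call: score grows with text length, capped at 1.
    return [min(1.0, len(text) / 10) for text in batch]


rows = [f"row {i}" for i in range(40)]
batch_lengths = [len(batch) for batch in iter_batches(rows, 16)]
scores = compute_scores(rows, toy_score_batch, batch_size=16)
print(batch_lengths, len(scores))
```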

compute_risk_estimates_for_dataset(dataset)

Computes risk estimates for each row in the dataset.

Parameters:

dataset (Dataset) – The dataset to compute risk estimates for.

Returns:

results – The risk estimates for each data split in the dataset (usually “train”, “val”, “test”).

Return type:

dict[str, np.ndarray]

property correct_order_bias: bool
property custom_prompt_prefix: str
property encode_row: Callable[[Series], str]
fit(X, y, *, false_pos_cost=1.0, false_neg_cost=1.0, **kwargs)

Uses the provided data sample to fit the prediction threshold.
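
One way to picture threshold fitting with asymmetric error costs is to try a grid of candidate thresholds and keep the one with the lowest total misclassification cost on the sample. The sketch below works under that assumption and is not necessarily the procedure fit uses. The total_cost and fit_threshold helpers, the 0.01 grid, the >= comparison, and the toy data are all hypothetical.

```python
from typing import Sequence


def total_cost(
    scores: Sequence[float],
    labels: Sequence[int],
    threshold: float,
    false_pos_cost: float = 1.0,
    false_neg_cost: float = 1.0,
) -> float:
    """Sum the cost of every false positive and false negative."""
    cost = 0.0
    for score, label in zip(scores, labels):
        predicted = int(score >= threshold)
        if predicted == 1 and label == 0:
            cost += false_pos_cost
        elif predicted == 0 and label == 1:
            cost += false_neg_cost
    return cost


def fit_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    false_pos_cost: float = 1.0,
    false_neg_cost: float = 1.0,
) -> float:
    """Return the lowest grid threshold that minimizes the total cost."""
    candidates = [i / 100 for i in range(101)]  # 0.00, 0.01, ..., 1.00
    return min(
        candidates,
        key=lambda t: total_cost(scores, labels, t, false_pos_cost, false_neg_cost),
    )


toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
balanced = fit_threshold(toy_scores, toy_labels)
cautious = fit_threshold(toy_scores, toy_labels, false_pos_cost=5.0)
print(balanced, cautious)
```

Raising false_pos_cost pushes the chosen threshold upward in this toy sample, since predicting the positive class wrongly becomes more expensive.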

property inference_kwargs: dict
property model_name: str
predict(data, predictions_save_path=None, labels=None)

Returns binary predictions for the given data.

Return type:

ndarray | dict[str, ndarray]
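
Binary predictions follow from risk scores by comparing each score with the classification threshold. A minimal sketch, assuming a score at or above the threshold maps to the positive class (the source does not say whether the comparison is strict); plain lists stand in for arrays and to_binary_predictions is a hypothetical helper.

```python
from typing import List, Sequence


def to_binary_predictions(risk_scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Map each risk score to 1 if it reaches the threshold, else 0."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return [int(score >= threshold) for score in risk_scores]


default_predictions = to_binary_predictions([0.2, 0.5, 0.9])
stricter_predictions = to_binary_predictions([0.2, 0.5, 0.9], threshold=0.6)
print(default_predictions, stricter_predictions)
```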

predict_proba(data, predictions_save_path=None, labels=None)

Returns probability estimates for the given data.

Parameters:
  • data (pd.DataFrame) – The DataFrame to compute risk estimates for.

  • predictions_save_path (str | Path, optional) – If provided, the computed risk scores are saved to this path on disk. If the path already exists, pre-computed predictions are loaded from it instead.

  • labels (pd.Series | np.ndarray, optional) – The labels corresponding to the provided data. Not required to compute predictions; only used to save the labels alongside the predictions on disk.

Returns:

risk_scores – The risk scores for the given data.

Return type:

np.ndarray
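
The load-or-compute behaviour described for predictions_save_path can be sketched as follows. This is an illustration only: the JSON layout, the risk_score and label keys, and the load_or_compute_scores helper are assumptions, and the library's on-disk format may differ.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Optional, Sequence


def load_or_compute_scores(
    rows: Sequence[str],
    compute_scores: Callable[[Sequence[str]], List[float]],
    predictions_save_path: Optional[Path] = None,
    labels: Optional[Sequence[int]] = None,
) -> List[float]:
    """Reuse saved scores when the file exists; otherwise compute and save."""
    path = Path(predictions_save_path) if predictions_save_path is not None else None
    if path is not None and path.exists():
        return json.loads(path.read_text())["risk_score"]

    scores = list(compute_scores(rows))
    if path is not None:
        payload = {"risk_score": scores}
        if labels is not None:
            payload["label"] = list(labels)  # stored only for later analysis
        path.write_text(json.dumps(payload))
    return scores


compute_calls = []


def toy_compute(rows: Sequence[str]) -> List[float]:
    compute_calls.append(len(rows))  # record how often the "model" runs
    return [0.25 for _ in rows]


with tempfile.TemporaryDirectory() as tmp_dir:
    save_path = Path(tmp_dir) / "predictions.json"
    first = load_or_compute_scores(["a", "b"], toy_compute, save_path, labels=[0, 1])
    second = load_or_compute_scores(["a", "b"], toy_compute, save_path)

print(first, second, compute_calls)
```

The second call reads the file written by the first, so the toy scorer runs only once.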

property seed: int
set_fit_request(*, false_neg_cost: bool | None | str = '$UNCHANGED$', false_pos_cost: bool | None | str = '$UNCHANGED$') → LLMClassifier

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • false_neg_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_neg_cost parameter in fit.

  • false_pos_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_pos_cost parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_inference_kwargs(**kwargs)

Set inference kwargs for the model.
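
A small sketch of how user-supplied inference kwargs might be layered over DEFAULT_INFERENCE_KWARGS. It assumes that set_inference_kwargs updates the stored options rather than replacing them wholesale, which the source does not state; InferenceConfig is a hypothetical stand-in class, not part of the package.

```python
DEFAULT_INFERENCE_KWARGS = {"batch_size": 16, "context_size": 600}


class InferenceConfig:
    """Hypothetical holder that mimics the inference-kwargs accessors."""

    def __init__(self, **inference_kwargs):
        # Start from a copy so the class-level defaults are never mutated.
        self._inference_kwargs = dict(DEFAULT_INFERENCE_KWARGS)
        self.set_inference_kwargs(**inference_kwargs)

    @property
    def inference_kwargs(self) -> dict:
        return dict(self._inference_kwargs)  # return a copy to callers

    def set_inference_kwargs(self, **kwargs) -> None:
        # Assumed semantics: new values override, untouched keys are kept.
        self._inference_kwargs.update(kwargs)


config = InferenceConfig(batch_size=4)
config.set_inference_kwargs(context_size=1000)
print(config.inference_kwargs)
```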

set_predict_proba_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') → LLMClassifier

Configure whether metadata should be requested to be passed to the predict_proba method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict_proba.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict_proba.

  • labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict_proba.

  • predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict_proba.

Returns:

self – The updated object.

Return type:

object

set_predict_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') → LLMClassifier

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict.

  • labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict.

  • predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict.

Returns:

self – The updated object.

Return type:

object

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → LLMClassifier

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

property task: TaskMetadata
property threshold: float

folktexts.classifier.transformers_classifier module

Module for using huggingface transformers models as classifiers.

class folktexts.classifier.transformers_classifier.TransformersLLMClassifier(model, tokenizer, task, custom_prompt_prefix=None, encode_row=None, threshold=0.5, correct_order_bias=True, seed=42, **inference_kwargs)

Bases: LLMClassifier

Use a huggingface transformers model to produce risk scores.

Creates an LLMClassifier based on a huggingface transformers model.

Parameters:
  • model (AutoModelForCausalLM) – The torch language model to use for inference.

  • tokenizer (AutoTokenizer) – The tokenizer used to train the model.

  • task (TaskMetadata | str) – The task metadata object or name of an already created task.

  • custom_prompt_prefix (str, optional) – A custom prompt prefix to supply to the model before the encoded row data, by default None.

  • encode_row (Callable[[pd.Series], str], optional) – The function used to encode tabular rows into natural text. If not provided, will use the default encoding function for the task.

  • threshold (float, optional) – The classification threshold to use when outputting binary predictions, by default 0.5. Must be between 0 and 1. Will be re-calibrated if fit is called.

  • correct_order_bias (bool, optional) – Whether to correct ordering bias in multiple-choice Q&A questions, by default True.

  • seed (int, optional) – The random seed, used for reproducibility, by default 42.

  • **inference_kwargs – Additional keyword arguments to be used at inference time. Options include context_size and batch_size.
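
To illustrate what correct_order_bias is meant to address, the sketch below models an answerer whose probability for an option is inflated when that option is listed first, then averages the estimates over both answer orderings. The averaging scheme is an assumption (the source does not spell out the correction), and biased_model and risk_with_order_correction are hypothetical toy functions rather than library code.

```python
def biased_model(true_prob_yes: float, yes_listed_first: bool, bias: float = 0.1) -> float:
    """Toy stand-in for an LLM: P('Yes') shifts up when 'Yes' is listed first."""
    shift = bias if yes_listed_first else -bias
    return min(1.0, max(0.0, true_prob_yes + shift))


def risk_with_order_correction(true_prob_yes: float) -> float:
    """Query both answer orderings and average the two estimates."""
    yes_first = biased_model(true_prob_yes, yes_listed_first=True)
    yes_second = biased_model(true_prob_yes, yes_listed_first=False)
    return (yes_first + yes_second) / 2


uncorrected = biased_model(0.3, yes_listed_first=True)
corrected = risk_with_order_correction(0.3)
print(uncorrected, corrected)
```

In this toy setting the positional shift is symmetric, so averaging removes it; with a real model the cancellation would only be approximate, and each row costs two queries instead of one.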

property model: AutoModelForCausalLM
set_fit_request(*, false_neg_cost: bool | None | str = '$UNCHANGED$', false_pos_cost: bool | None | str = '$UNCHANGED$') → TransformersLLMClassifier

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • false_neg_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_neg_cost parameter in fit.

  • false_pos_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_pos_cost parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_predict_proba_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') → TransformersLLMClassifier

Configure whether metadata should be requested to be passed to the predict_proba method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict_proba.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict_proba.

  • labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict_proba.

  • predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict_proba.

Returns:

self – The updated object.

Return type:

object

set_predict_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') → TransformersLLMClassifier

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict.

  • labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict.

  • predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict.

Returns:

self – The updated object.

Return type:

object

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → TransformersLLMClassifier

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

property tokenizer: AutoTokenizer

folktexts.classifier.web_api_classifier module

Module for using a language model through a web API for risk classification.

class folktexts.classifier.web_api_classifier.WebAPILLMClassifier(model_name, task, custom_prompt_prefix=None, encode_row=None, threshold=0.5, correct_order_bias=True, max_api_rpm=5000, seed=42, **inference_kwargs)

Bases: LLMClassifier

Use an LLM through a web API to produce risk scores.

Creates an LLMClassifier object that uses a web API for inference.

Parameters:
  • model_name (str) – The model ID to be resolved by litellm.

  • task (TaskMetadata | str) – The task metadata object or name of an already created task.

  • custom_prompt_prefix (str, optional) – A custom prompt prefix to supply to the model before the encoded row data, by default None.

  • encode_row (Callable[[pd.Series], str], optional) – The function used to encode tabular rows into natural text. If not provided, will use the default encoding function for the task.

  • threshold (float, optional) – The classification threshold to use when outputting binary predictions, by default 0.5. Must be between 0 and 1. Will be re-calibrated if fit is called.

  • correct_order_bias (bool, optional) – Whether to correct ordering bias in multiple-choice Q&A questions, by default True.

  • max_api_rpm (int, optional) – The maximum number of requests per minute allowed for the API, by default 5000.

  • seed (int, optional) – The random seed, used for reproducibility, by default 42.

  • **inference_kwargs – Additional keyword arguments to be used at inference time. Options include context_size and batch_size.
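
A requests-per-minute cap such as max_api_rpm can be honoured by spacing out request starts. The sketch below shows one simple way to do that (a fixed minimum interval of 60 / max_rpm seconds); it is an assumption about how such a limit could be enforced, not a description of the library's internals. RequestsPerMinuteLimiter and its injectable clock and sleep arguments are hypothetical.

```python
import time
from typing import Callable, Optional


class RequestsPerMinuteLimiter:
    """Spaces request starts at least 60 / max_rpm seconds apart."""

    def __init__(
        self,
        max_rpm: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ):
        if max_rpm <= 0:
            raise ValueError("max_rpm must be positive")
        self.min_interval = 60.0 / max_rpm  # seconds between request starts
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None  # earliest next start time

    def wait(self) -> float:
        """Block until the next request may start; return the seconds waited."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay


limiter = RequestsPerMinuteLimiter(max_rpm=5000)  # 5000 rpm -> 12 ms spacing
for request_id in range(3):
    waited = limiter.wait()
    # A real client would send the API request here.
    print(f"request {request_id} started after waiting {waited:.4f}s")
```

The clock and sleep parameters exist only so the limiter can be exercised without real delays.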

static check_webAPI_deps()

Check if litellm dependencies are available.

Return type:

bool
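
A boolean dependency check of this kind can be written without importing the optional package. The following is a sketch of one common approach using the standard library; is_package_available is a hypothetical helper, and the library's own check may work differently.

```python
import importlib.util


def is_package_available(package_name: str) -> bool:
    """Return True if a top-level package can be found, without importing it."""
    return importlib.util.find_spec(package_name) is not None


# Prints True only when litellm is installed in the current environment.
print(is_package_available("litellm"))
```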

set_fit_request(*, false_neg_cost: bool | None | str = '$UNCHANGED$', false_pos_cost: bool | None | str = '$UNCHANGED$') → WebAPILLMClassifier

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • false_neg_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_neg_cost parameter in fit.

  • false_pos_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_pos_cost parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_predict_proba_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') → WebAPILLMClassifier

Configure whether metadata should be requested to be passed to the predict_proba method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict_proba.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict_proba.

  • labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict_proba.

  • predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict_proba.

Returns:

self – The updated object.

Return type:

object

set_predict_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') → WebAPILLMClassifier

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict.

  • labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict.

  • predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict.

Returns:

self – The updated object.

Return type:

object

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → WebAPILLMClassifier

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

track_cost_callback(kwargs, completion_response, start_time, end_time)

Callback function to track the cost of API calls.
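
The four-argument signature has the shape of a litellm custom success callback. The sketch below accumulates cost across calls, assuming the per-call cost arrives under a response_cost key in kwargs; that key should be checked against the litellm version in use. CostTracker is a hypothetical class, and the callback is invoked by hand here instead of being registered with an API client.

```python
class CostTracker:
    """Hypothetical accumulator for the cost of completed API calls."""

    def __init__(self):
        self.total_cost = 0.0
        self.num_calls = 0

    def track_cost_callback(self, kwargs, completion_response, start_time, end_time):
        # Assumption: the client reports the call's cost under "response_cost".
        # A missing or None value is counted as zero cost.
        cost = kwargs.get("response_cost") or 0.0
        self.total_cost += cost
        self.num_calls += 1


tracker = CostTracker()
tracker.track_cost_callback({"response_cost": 0.002}, None, 0.0, 0.5)
tracker.track_cost_callback({"response_cost": None}, None, 0.5, 0.9)
tracker.track_cost_callback({}, None, 0.9, 1.2)
print(tracker.num_calls, tracker.total_cost)
```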

Module contents