folktexts.classifier package
Submodules
folktexts.classifier.base module
Module containing the base class for all LLM risk classifiers.
- class folktexts.classifier.base.LLMClassifier(model_name, task, custom_prompt_prefix=None, encode_row=None, threshold=0.5, correct_order_bias=True, seed=42, **inference_kwargs)[source]
Bases:
BaseEstimator, ClassifierMixin, ABC
An interface to produce risk scores and class predictions with an LLM.
Creates an LLMClassifier object.
- Parameters:
model_name (str) – The model name or ID.
task (TaskMetadata | str) – The task metadata object or name of an already created task.
custom_prompt_prefix (str, optional) – A custom prompt prefix to supply to the model before the encoded row data, by default None.
encode_row (Callable[[pd.Series], str], optional) – The function used to encode tabular rows into natural text. If not provided, will use the default encoding function for the task.
threshold (float, optional) – The classification threshold to use when outputting binary predictions, by default 0.5. Must be between 0 and 1. Will be re-calibrated if fit is called.
correct_order_bias (bool, optional) – Whether to correct ordering bias in multiple-choice Q&A questions, by default True.
seed (int, optional) – The random seed, used for reproducibility, by default 42.
**inference_kwargs – Additional keyword arguments to be used at inference time. Options include context_size and batch_size.
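A custom encode_row only has to map one row to a string. The sketch below is illustrative and is not the default encoder of any task: the function name, the column names, and the sentence template are all hypothetical, and a plain dict stands in for a pd.Series because both support .items() iteration.

```python
from typing import Any, Mapping


def encode_row_as_sentences(row: Mapping[str, Any]) -> str:
    """Encode one tabular row as one short sentence per column.

    Hypothetical example of an `encode_row` callable; in practice a
    pd.Series would be passed, which supports the same `.items()` loop.
    """
    lines = [f"The {column} is: {value}." for column, value in row.items()]
    return "\n".join(lines)


# Demo row with made-up column names.
example_row = {"age": 42, "occupation": "teacher"}
encoded = encode_row_as_sentences(example_row)
print(encoded)
```

Such a function would be passed as encode_row=encode_row_as_sentences when constructing the classifier.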
- DEFAULT_INFERENCE_KWARGS = {'batch_size': 16, 'context_size': 600}
- compute_risk_estimates_for_dataframe(df)[source]
Compute risk estimates for a specific dataframe (internal helper function).
- Parameters:
df (pd.DataFrame) – The dataframe to compute risk estimates for.
- Returns:
risk_scores – The risk estimates for each row in the dataframe.
- Return type:
np.ndarray
- compute_risk_estimates_for_dataset(dataset)[source]
Computes risk estimates for each row in the dataset.
- Parameters:
dataset (Dataset) – The dataset to compute risk estimates for.
- Returns:
results – The risk estimates for each data type in the dataset (usually “train”, “val”, “test”).
- Return type:
dict[str, np.ndarray]
- property correct_order_bias: bool
- property custom_prompt_prefix: str
- property encode_row: Callable[[Series], str]
- fit(X, y, *, false_pos_cost=1.0, false_neg_cost=1.0, **kwargs)[source]
Uses the provided data sample to fit the prediction threshold.
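As a rough illustration of what fitting a threshold under asymmetric error costs can look like, the sketch below scans candidate thresholds and keeps the one with the lowest total cost on the sample. It is an assumption-laden sketch, not the library's implementation: the function name, the candidate grid, and the score >= threshold rule are all hypothetical.

```python
def fit_threshold(scores, labels, false_pos_cost=1.0, false_neg_cost=1.0):
    """Return the candidate threshold with the lowest total error cost.

    Illustrative only: candidates are the observed scores plus 0.0 and 1.0,
    and a row is predicted positive when score >= threshold.
    """
    candidates = sorted(set(scores) | {0.0, 1.0})
    best_threshold, best_cost = None, float("inf")
    for threshold in candidates:
        cost = 0.0
        for score, label in zip(scores, labels):
            predicted = int(score >= threshold)
            if predicted == 1 and label == 0:
                cost += false_pos_cost  # false positive
            elif predicted == 0 and label == 1:
                cost += false_neg_cost  # false negative
        # Strict comparison keeps the lowest threshold among ties.
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold


scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
equal_costs = fit_threshold(scores, labels)
costly_false_alarms = fit_threshold(scores, labels, false_pos_cost=5.0)
```

Raising false_pos_cost pushes the chosen threshold upward, since predicting the positive class becomes more expensive when it is wrong.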
- property inference_kwargs: dict
- property model_name: str
- predict(data, predictions_save_path=None, labels=None)[source]
Returns binary predictions for the given data.
- Return type:
ndarray | dict[str, ndarray]
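The two return shapes can be pictured with a small sketch that applies a threshold either to a flat sequence of risk scores or to a dict of scores keyed by split name. The helper name is hypothetical, lists stand in for arrays, and comparing with >= is an assumption rather than documented behavior.

```python
def binarize(risk_scores, threshold=0.5):
    """Turn risk scores into 0/1 predictions.

    Accepts either a flat sequence of scores or a dict mapping a split name
    (for example "train" or "test") to a sequence of scores. Using >= for
    the comparison is an assumption made for this sketch.
    """
    if isinstance(risk_scores, dict):
        return {
            split: binarize(split_scores, threshold)
            for split, split_scores in risk_scores.items()
        }
    return [int(score >= threshold) for score in risk_scores]


flat = binarize([0.2, 0.5, 0.9])
per_split = binarize({"train": [0.7, 0.1], "test": [0.4]}, threshold=0.3)
```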
- predict_proba(data, predictions_save_path=None, labels=None)[source]
Returns probability estimates for the given data.
- Parameters:
data (pd.DataFrame) – The DataFrame to compute risk estimates for.
predictions_save_path (str | Path, optional) – If provided, will save the computed risk scores to this path on disk. If the path already exists, will attempt to load pre-computed predictions from it.
labels (pd.Series | np.ndarray, optional) – The labels corresponding to the provided data. Not required to compute predictions. Will only be used to save alongside predictions to disk.
- Returns:
risk_scores – The risk scores for the given data.
- Return type:
np.ndarray
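The load-or-compute behavior described for predictions_save_path follows a common caching pattern, sketched below with the standard library only. The helper name and the JSON file layout are assumptions made for illustration; the on-disk format actually used by the library is not specified here.

```python
import json
import tempfile
from pathlib import Path


def load_or_compute_scores(compute_fn, save_path=None):
    """Return cached scores from `save_path` if present, else compute and save.

    The JSON layout is an assumption made for this sketch only.
    """
    if save_path is not None and Path(save_path).exists():
        return json.loads(Path(save_path).read_text())["risk_scores"]
    scores = compute_fn()
    if save_path is not None:
        Path(save_path).write_text(json.dumps({"risk_scores": scores}))
    return scores


calls = []


def expensive_inference():
    """Stand-in for model inference; records each real invocation."""
    calls.append(1)
    return [0.12, 0.87]


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "predictions.json"
    first = load_or_compute_scores(expensive_inference, cache_file)
    second = load_or_compute_scores(expensive_inference, cache_file)
```

The second call reads the saved file, so the stand-in inference function runs only once.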
- property seed: int
- set_fit_request(*, false_neg_cost: bool | None | str = '$UNCHANGED$', false_pos_cost: bool | None | str = '$UNCHANGED$') → LLMClassifier
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
false_neg_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_neg_cost parameter in fit.
false_pos_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_pos_cost parameter in fit.
- Returns:
self – The updated object.
- Return type:
object
- set_predict_proba_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') → LLMClassifier
Configure whether metadata should be requested to be passed to the predict_proba method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict_proba.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict_proba.
labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict_proba.
predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict_proba.
- Returns:
self – The updated object.
- Return type:
object
- set_predict_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') → LLMClassifier
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict.
labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict.
predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → LLMClassifier
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
- Returns:
self – The updated object.
- Return type:
object
- property task: TaskMetadata
- property threshold: float
folktexts.classifier.transformers_classifier module
Module for using huggingface transformers models as classifiers.
- class folktexts.classifier.transformers_classifier.TransformersLLMClassifier(model, tokenizer, task, custom_prompt_prefix=None, encode_row=None, threshold=0.5, correct_order_bias=True, seed=42, **inference_kwargs)[source]
Bases:
LLMClassifier
Use a huggingface transformers model to produce risk scores.
Creates an LLMClassifier based on a huggingface transformers model.
- Parameters:
model (AutoModelForCausalLM) – The torch language model to use for inference.
tokenizer (AutoTokenizer) – The tokenizer used to train the model.
task (TaskMetadata | str) – The task metadata object or name of an already created task.
custom_prompt_prefix (str, optional) – A custom prompt prefix to supply to the model before the encoded row data, by default None.
encode_row (Callable[[pd.Series], str], optional) – The function used to encode tabular rows into natural text. If not provided, will use the default encoding function for the task.
threshold (float, optional) – The classification threshold to use when outputting binary predictions, by default 0.5. Must be between 0 and 1. Will be re-calibrated if fit is called.
correct_order_bias (bool, optional) – Whether to correct ordering bias in multiple-choice Q&A questions, by default True.
seed (int, optional) – The random seed, used for reproducibility, by default 42.
**inference_kwargs – Additional keyword arguments to be used at inference time. Options include context_size and batch_size.
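One common way to correct ordering bias in a two-choice question is to query the model with the choices in both orders and average the probability given to the positive answer. The sketch below shows that idea only; whether the library does exactly this is not stated here, and answer_probability, toy_model, and the "Yes"/"No" labels are hypothetical stand-ins.

```python
def debiased_risk_score(answer_probability, positive="Yes", negative="No"):
    """Average the positive-answer probability over both choice orderings.

    `answer_probability(ordered_choices, answer)` is a hypothetical callable
    returning the model's probability for `answer` when the choices are
    listed in the given order.
    """
    orderings = [(positive, negative), (negative, positive)]
    probabilities = [answer_probability(order, positive) for order in orderings]
    return sum(probabilities) / len(probabilities)


def toy_model(ordered_choices, answer):
    """Stand-in model that favors whichever choice is listed first."""
    base = 0.5
    bonus = 0.25 if ordered_choices[0] == answer else -0.25
    return base + bonus


score = debiased_risk_score(toy_model)
```

The toy model is purely position-driven, so averaging over both orderings cancels its preference and leaves an uninformative score.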
- property model: AutoModelForCausalLM
- set_fit_request(*, false_neg_cost: bool | None | str = '$UNCHANGED$', false_pos_cost: bool | None | str = '$UNCHANGED$') → TransformersLLMClassifier
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
false_neg_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_neg_cost parameter in fit.
false_pos_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_pos_cost parameter in fit.
- Returns:
self – The updated object.
- Return type:
object
- set_predict_proba_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') → TransformersLLMClassifier
Configure whether metadata should be requested to be passed to the predict_proba method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict_proba.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict_proba.
labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict_proba.
predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict_proba.
- Returns:
self – The updated object.
- Return type:
object
- set_predict_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') → TransformersLLMClassifier
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict.
labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict.
predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → TransformersLLMClassifier
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
- Returns:
self – The updated object.
- Return type:
object
- property tokenizer: AutoTokenizer
folktexts.classifier.web_api_classifier module
Module for using a language model through a web API for risk classification.
- class folktexts.classifier.web_api_classifier.WebAPILLMClassifier(model_name, task, custom_prompt_prefix=None, encode_row=None, threshold=0.5, correct_order_bias=True, max_api_rpm=5000, seed=42, **inference_kwargs)[source]
Bases:
LLMClassifier
Use an LLM through a web API to produce risk scores.
Creates an LLMClassifier object that uses a web API for inference.
- Parameters:
model_name (str) – The model ID to be resolved by litellm.
task (TaskMetadata | str) – The task metadata object or name of an already created task.
custom_prompt_prefix (str, optional) – A custom prompt prefix to supply to the model before the encoded row data, by default None.
encode_row (Callable[[pd.Series], str], optional) – The function used to encode tabular rows into natural text. If not provided, will use the default encoding function for the task.
threshold (float, optional) – The classification threshold to use when outputting binary predictions, by default 0.5. Must be between 0 and 1. Will be re-calibrated if fit is called.
correct_order_bias (bool, optional) – Whether to correct ordering bias in multiple-choice Q&A questions, by default True.
max_api_rpm (int, optional) – The maximum number of requests per minute allowed for the API, by default 5000.
seed (int, optional) – The random seed, used for reproducibility, by default 42.
**inference_kwargs – Additional keyword arguments to be used at inference time. Options include context_size and batch_size.
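A requests-per-minute cap such as max_api_rpm can be enforced by spacing consecutive calls at least 60 / max_rpm seconds apart. The limiter below is a minimal sketch of that idea under stated assumptions: the class name and its clock and sleep hooks are hypothetical, and it does not describe how the library or litellm actually throttles requests.

```python
import time


class RequestsPerMinuteLimiter:
    """Space out calls so at most `max_rpm` start per minute (sketch)."""

    def __init__(self, max_rpm, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_rpm
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # No request has been sent yet.

    def wait(self):
        """Block until the next request may be sent; return seconds waited."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay


# Demo with a fake clock so nothing actually sleeps.
fake_time = [0.0]
limiter = RequestsPerMinuteLimiter(
    max_rpm=120,
    clock=lambda: fake_time[0],
    sleep=lambda seconds: fake_time.__setitem__(0, fake_time[0] + seconds),
)
waits = [limiter.wait() for _ in range(3)]
```

With max_rpm=120 the minimum spacing is half a second, so the first call goes through immediately and each later back-to-back call waits 0.5 seconds.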
- set_fit_request(*, false_neg_cost: bool | None | str = '$UNCHANGED$', false_pos_cost: bool | None | str = '$UNCHANGED$') → WebAPILLMClassifier
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
false_neg_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_neg_cost parameter in fit.
false_pos_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_pos_cost parameter in fit.
- Returns:
self – The updated object.
- Return type:
object
- set_predict_proba_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') → WebAPILLMClassifier
Configure whether metadata should be requested to be passed to the predict_proba method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict_proba.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict_proba.
labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict_proba.
predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict_proba.
- Returns:
self – The updated object.
- Return type:
object
- set_predict_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') → WebAPILLMClassifier
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict.
labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict.
predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → WebAPILLMClassifier
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
- Returns:
self – The updated object.
- Return type:
object