folktexts.classifier package
Submodules
folktexts.classifier.base module
Module containing the base class for all LLM risk classifiers.
- class folktexts.classifier.base.LLMClassifier(model_name, task, custom_prompt_prefix=None, encode_row=None, threshold=0.5, correct_order_bias=True, seed=42, **inference_kwargs)
Bases:
BaseEstimator, ClassifierMixin, ABC
An interface to produce risk scores and class predictions with an LLM.
Creates an LLMClassifier object.
- Parameters:
model_name (str) – The model name or ID.
task (TaskMetadata | str) – The task metadata object or name of an already created task.
custom_prompt_prefix (str, optional) – A custom prompt prefix to supply to the model before the encoded row data, by default None.
encode_row (Callable[[pd.Series], str], optional) – The function used to encode tabular rows into natural text. If not provided, will use the default encoding function for the task.
threshold (float, optional) – The classification threshold to use when outputting binary predictions, by default 0.5. Must be between 0 and 1. Will be re-calibrated if fit is called.
correct_order_bias (bool, optional) – Whether to correct ordering bias in multiple-choice Q&A questions, by default True.
seed (int, optional) – The random seed, used for reproducibility, by default 42.
**inference_kwargs – Additional keyword arguments to be used at inference time. Options include context_size and batch_size.
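As an illustration of the encode_row parameter, the sketch below shows what a custom row-encoding function could look like. The function name, the introductory sentence, and the column names are hypothetical; only the shape of the callable (one row in, one string out) comes from the parameter description above. The function relies solely on .items(), which pd.Series shares with plain dictionaries, so the example runs without pandas.

```python
def encode_row_as_bullets(row) -> str:
    """Encode one tabular row as a bulleted natural-text description.

    `row` is expected to be a pd.Series; any object exposing `.items()`
    (such as a dict) is handled the same way.
    """
    lines = [f"- {name}: {value}" for name, value in row.items()]
    return "The following data describes a survey respondent:\n" + "\n".join(lines)


# Hypothetical row; the column names are made up for this example.
example_row = {"age": 42, "occupation": "teacher"}
encoded = encode_row_as_bullets(example_row)
print(encoded)
```

A function like this would be supplied as encode_row=encode_row_as_bullets when creating a classifier.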
- DEFAULT_INFERENCE_KWARGS = {'batch_size': 16, 'context_size': 600}
- compute_risk_estimates_for_dataframe(df)
Compute risk estimates for a specific dataframe (internal helper function).
- Parameters:
df (pd.DataFrame) – The dataframe to compute risk estimates for.
- Returns:
risk_scores – The risk estimates for each row in the dataframe.
- Return type:
np.ndarray
- compute_risk_estimates_for_dataset(dataset)
Computes risk estimates for each row in the dataset.
- Parameters:
dataset (Dataset) – The dataset to compute risk estimates for.
- Returns:
results – The risk estimates for each data type in the dataset (usually “train”, “val”, “test”).
- Return type:
dict[str, np.ndarray]
- property correct_order_bias: bool
- property custom_prompt_prefix: str
- property encode_row: Callable[[Series], str]
- fit(X, y, *, false_pos_cost=1.0, false_neg_cost=1.0, **kwargs)
Uses the provided data sample to fit the prediction threshold.
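The description above does not say how the threshold is fitted. As a rough sketch only, one common way to choose a cost-sensitive threshold is to scan candidate thresholds and keep the one with the lowest total misclassification cost on the sample. The helper name and the exhaustive scan are assumptions made for illustration, not the library's implementation.

```python
def pick_threshold(scores, labels, false_pos_cost=1.0, false_neg_cost=1.0):
    """Return the observed score that minimises total misclassification cost.

    A row is predicted positive when its score is >= the threshold.
    Ties are resolved in favour of the lowest threshold.
    """
    best_threshold, best_cost = None, float("inf")
    for candidate in sorted(set(scores)):
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= candidate and y == 0)
        false_neg = sum(1 for s, y in zip(scores, labels) if s < candidate and y == 1)
        cost = false_pos * false_pos_cost + false_neg * false_neg_cost
        if cost < best_cost:
            best_threshold, best_cost = candidate, cost
    return best_threshold


# Toy risk scores and binary labels, invented for this example.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]

balanced = pick_threshold(scores, labels)
cautious = pick_threshold(scores, labels, false_pos_cost=5.0)
print(balanced, cautious)
```

Raising false_pos_cost pushes the chosen threshold upwards, so fewer rows are labelled positive.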
- property inference_kwargs: dict
- property model_name: str
- predict(data, predictions_save_path=None, labels=None)
Returns binary predictions for the given data.
- Return type:
ndarray | dict[str, ndarray]
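To make the two documented return shapes concrete, here is a minimal sketch of how risk scores could be turned into binary predictions with a threshold, both for a single collection of scores and for a dict keyed by data split. The helper name is hypothetical, plain lists stand in for NumPy arrays, and treating a score equal to the threshold as positive is an assumption.

```python
def binarize(risk_scores, threshold=0.5):
    """Turn risk scores into 0/1 predictions.

    Accepts either a sequence of scores or a dict mapping a split name
    (e.g. "train", "test") to a sequence of scores.
    Assumption: a score equal to the threshold counts as positive.
    """
    if isinstance(risk_scores, dict):
        return {name: binarize(scores, threshold) for name, scores in risk_scores.items()}
    return [int(score >= threshold) for score in risk_scores]


single = binarize([0.2, 0.5, 0.9])
per_split = binarize({"val": [0.3], "test": [0.1, 0.7]}, threshold=0.6)
print(single, per_split)
```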
- predict_proba(data, predictions_save_path=None, labels=None)
Returns probability estimates for the given data.
- Parameters:
data (pd.DataFrame) – The DataFrame to compute risk estimates for.
predictions_save_path (str | Path, optional) – If provided, the computed risk scores are saved to this path on disk. If the path already exists, pre-computed predictions are loaded from it instead.
labels (pd.Series | np.ndarray, optional) – The labels corresponding to the provided data. Not required to compute predictions; only used to save the labels alongside the predictions on disk.
- Returns:
risk_scores – The risk scores for the given data.
- Return type:
np.ndarray
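The load-or-compute behaviour described for predictions_save_path can be sketched as follows. This is an illustration under stated assumptions: the function name, the JSON file layout, and the stand-in model are all hypothetical and do not reflect the format the library actually writes.

```python
import json
import tempfile
from pathlib import Path


def load_or_compute_scores(compute_fn, rows, save_path=None, labels=None):
    """Return risk scores, reusing a previous result stored at `save_path`."""
    path = Path(save_path) if save_path is not None else None
    if path is not None and path.exists():
        # Reuse pre-computed scores instead of querying the model again.
        return json.loads(path.read_text())["risk_score"]

    scores = [float(compute_fn(row)) for row in rows]
    if path is not None:
        payload = {"risk_score": scores}
        if labels is not None:
            # Labels are only stored next to the scores, never used to compute them.
            payload["label"] = list(labels)
        path.write_text(json.dumps(payload))
    return scores


calls = []


def fake_model(row):
    """Stand-in for an expensive LLM call; records every invocation."""
    calls.append(row)
    return 0.25


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "scores.json"
    first = load_or_compute_scores(fake_model, ["a", "b"], cache_file, labels=[0, 1])
    second = load_or_compute_scores(fake_model, ["a", "b"], cache_file)

print(first, second, len(calls))
```

The second call returns the cached scores without invoking the model again, which is the point of passing a save path when inference is expensive.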
- property seed: int
- set_fit_request(*, false_neg_cost: bool | None | str = '$UNCHANGED$', false_pos_cost: bool | None | str = '$UNCHANGED$') → LLMClassifier
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
false_neg_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_neg_cost parameter in fit.
false_pos_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_pos_cost parameter in fit.
- Returns:
self – The updated object.
- Return type:
object
- set_predict_proba_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') → LLMClassifier
Request metadata passed to the predict_proba method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict_proba.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict_proba.
labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict_proba.
predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict_proba.
- Returns:
self – The updated object.
- Return type:
object
- set_predict_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') → LLMClassifier
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict.
labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict.
predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → LLMClassifier
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
- Returns:
self – The updated object.
- Return type:
object
- property task: TaskMetadata
- property threshold: float
folktexts.classifier.transformers_classifier module
Module for using huggingface transformers models as classifiers.
- class folktexts.classifier.transformers_classifier.TransformersLLMClassifier(model, tokenizer, task, custom_prompt_prefix=None, encode_row=None, threshold=0.5, correct_order_bias=True, seed=42, **inference_kwargs)
Bases:
LLMClassifier
Use a huggingface transformers model to produce risk scores.
Creates an LLMClassifier based on a huggingface transformers model.
- Parameters:
model (AutoModelForCausalLM) – The torch language model to use for inference.
tokenizer (AutoTokenizer) – The tokenizer used to train the model.
task (TaskMetadata | str) – The task metadata object or name of an already created task.
custom_prompt_prefix (str, optional) – A custom prompt prefix to supply to the model before the encoded row data, by default None.
encode_row (Callable[[pd.Series], str], optional) – The function used to encode tabular rows into natural text. If not provided, will use the default encoding function for the task.
threshold (float, optional) – The classification threshold to use when outputting binary predictions, by default 0.5. Must be between 0 and 1. Will be re-calibrated if fit is called.
correct_order_bias (bool, optional) – Whether to correct ordering bias in multiple-choice Q&A questions, by default True.
seed (int, optional) – The random seed, used for reproducibility, by default 42.
**inference_kwargs – Additional keyword arguments to be used at inference time. Options include context_size and batch_size.
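The correct_order_bias option listed above concerns the tendency of a model to favour an answer because of its position among the choices. The documentation does not describe the correction itself; the sketch below shows one plausible approach, assumed for illustration only: ask the same question under both answer orderings and average the probability given to the positive answer. All names, including prob_of_choice_fn and the toy model, are hypothetical.

```python
def debiased_risk(prob_of_choice_fn, question, positive="Yes", negative="No"):
    """Average the positive-answer probability over both answer orderings.

    `prob_of_choice_fn(question, choices)` is a hypothetical callable that
    returns a dict mapping each choice text to the model's probability.
    """
    orderings = [(positive, negative), (negative, positive)]
    probs = [prob_of_choice_fn(question, choices)[positive] for choices in orderings]
    return sum(probs) / len(probs)


def biased_toy_model(question, choices):
    """Toy model that always puts 0.7 on whichever answer is listed first."""
    first, second = choices
    return {first: 0.7, second: 0.3}


risk = debiased_risk(biased_toy_model, "Is this person's income above $50,000?")
print(risk)
```

Because the toy model's preference depends only on position, averaging over both orderings cancels it out and leaves an uninformative score of about 0.5.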
- property model: AutoModelForCausalLM
- set_fit_request(*, false_neg_cost: bool | None | str = '$UNCHANGED$', false_pos_cost: bool | None | str = '$UNCHANGED$') → TransformersLLMClassifier
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
false_neg_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_neg_cost parameter in fit.
false_pos_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_pos_cost parameter in fit.
- Returns:
self – The updated object.
- Return type:
object
- set_predict_proba_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') → TransformersLLMClassifier
Request metadata passed to the predict_proba method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict_proba.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict_proba.
labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict_proba.
predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict_proba.
- Returns:
self – The updated object.
- Return type:
object
- set_predict_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') → TransformersLLMClassifier
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict.
labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict.
predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → TransformersLLMClassifier
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
- Returns:
self – The updated object.
- Return type:
object
- property tokenizer: AutoTokenizer
folktexts.classifier.web_api_classifier module
Module for using a language model through a web API for risk classification.
- class folktexts.classifier.web_api_classifier.WebAPILLMClassifier(model_name, task, custom_prompt_prefix=None, encode_row=None, threshold=0.5, correct_order_bias=True, max_api_rpm=5000, seed=42, **inference_kwargs)
Bases:
LLMClassifier
Use an LLM through a web API to produce risk scores.
Creates an LLMClassifier object that uses a web API for inference.
- Parameters:
model_name (str) – The model ID to be resolved by litellm.
task (TaskMetadata | str) – The task metadata object or name of an already created task.
custom_prompt_prefix (str, optional) – A custom prompt prefix to supply to the model before the encoded row data, by default None.
encode_row (Callable[[pd.Series], str], optional) – The function used to encode tabular rows into natural text. If not provided, will use the default encoding function for the task.
threshold (float, optional) – The classification threshold to use when outputting binary predictions, by default 0.5. Must be between 0 and 1. Will be re-calibrated if fit is called.
correct_order_bias (bool, optional) – Whether to correct ordering bias in multiple-choice Q&A questions, by default True.
max_api_rpm (int, optional) – The maximum number of requests per minute allowed for the API, by default 5000.
seed (int, optional) – The random seed, used for reproducibility, by default 42.
**inference_kwargs – Additional keyword arguments to be used at inference time. Options include context_size and batch_size.
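To illustrate what a requests-per-minute cap such as max_api_rpm implies for a client, the sketch below spaces calls evenly so that the cap is never exceeded. This is a generic pattern with hypothetical names, not the mechanism used by this class; even spacing is also stricter than a true per-minute budget, which would allow bursts.

```python
import time


class RequestsPerMinuteLimiter:
    """Space out calls so that at most `max_rpm` happen per minute."""

    def __init__(self, max_rpm, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_rpm  # seconds between two requests
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next request may start

    def wait(self):
        """Block until the next request is allowed, then reserve a slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


# Demonstration with a fake clock so the example finishes instantly.
fake_now = [0.0]
slept = []


def fake_sleep(seconds):
    """Record the requested pause and advance the fake clock."""
    slept.append(seconds)
    fake_now[0] += seconds


limiter = RequestsPerMinuteLimiter(max_rpm=120, clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(3):
    limiter.wait()  # an API request would be issued right after this call

print(slept)
```

With the default clock and sleep arguments the same class pauses for real, so limiter.wait() can be called directly before each API request.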
- set_fit_request(*, false_neg_cost: bool | None | str = '$UNCHANGED$', false_pos_cost: bool | None | str = '$UNCHANGED$') → WebAPILLMClassifier
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
false_neg_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_neg_cost parameter in fit.
false_pos_cost (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for false_pos_cost parameter in fit.
- Returns:
self – The updated object.
- Return type:
object
- set_predict_proba_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') → WebAPILLMClassifier
Request metadata passed to the predict_proba method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict_proba.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict_proba.
labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict_proba.
predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict_proba.
- Returns:
self – The updated object.
- Return type:
object
- set_predict_request(*, data: bool | None | str = '$UNCHANGED$', labels: bool | None | str = '$UNCHANGED$', predictions_save_path: bool | None | str = '$UNCHANGED$') → WebAPILLMClassifier
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in predict.
labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in predict.
predictions_save_path (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_save_path parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → WebAPILLMClassifier
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
- Returns:
self – The updated object.
- Return type:
object