Evaluation and metrics

This page covers performance and fairness evaluation helpers and their bootstrap variants.

Performance metrics

error_parity.evaluation.evaluate_performance(y_true, y_pred)[source]

Evaluates the provided predictions on common performance metrics.

Parameters:
  • y_true (np.ndarray) – The true class labels.

  • y_pred (np.ndarray) – The discretized predictions.

Returns:

A dictionary with key-value pairs of (metric name, metric value).

Return type:

dict
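
A minimal usage sketch (the synthetic labels below are illustrative only and not part of the library's documentation; evaluate_performance expects discretized predictions, not scores):

    import numpy as np
    from error_parity.evaluation import evaluate_performance

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=1_000)   # binary ground-truth labels
    y_pred = rng.integers(0, 2, size=1_000)   # discretized (thresholded) predictions

    perf = evaluate_performance(y_true, y_pred)
    for name, value in perf.items():
        print(f"{name}: {value:.3f}")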

Fairness metrics

error_parity.evaluation.evaluate_fairness(y_true, y_pred, sensitive_attribute, return_groupwise_metrics=False)[source]

Evaluates fairness as the ratios between group-wise performance metrics.

Parameters:
  • y_true (np.ndarray) – The true class labels.

  • y_pred (np.ndarray) – The discretized predictions.

  • sensitive_attribute (np.ndarray) – The sensitive attribute (protected group membership).

  • return_groupwise_metrics (bool, optional) – Whether to also return the group-wise performance metrics (True) or only the ratios between these metrics (False), by default False.

Returns:

A dictionary with key-value pairs of (metric name, metric value).

Return type:

dict
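
A hedged sketch of calling this function with and without the group-wise toggle; the binary group encoding below is an assumption made for illustration:

    import numpy as np
    from error_parity.evaluation import evaluate_fairness

    rng = np.random.default_rng(0)
    n = 1_000
    y_true = rng.integers(0, 2, size=n)
    y_pred = rng.integers(0, 2, size=n)
    group = rng.integers(0, 2, size=n)         # protected-group membership per sample

    # Default: only the ratios between group-wise metrics are returned.
    ratios = evaluate_fairness(y_true, y_pred, sensitive_attribute=group)

    # With return_groupwise_metrics=True the per-group metrics are included as well.
    detailed = evaluate_fairness(
        y_true, y_pred,
        sensitive_attribute=group,
        return_groupwise_metrics=True,
    )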

End-to-end evaluation

error_parity.evaluation.evaluate_predictions(y_true, y_pred_scores, sensitive_attribute=None, return_groupwise_metrics=False, **threshold_target)[source]

Evaluates the given predictions on both performance and fairness metrics.

Will only evaluate fairness if sensitive_attribute is provided.

Note

The value of log_loss may be inaccurate when using scikit-learn<1.2.

Parameters:
  • y_true (np.ndarray) – The true labels.

  • y_pred_scores (np.ndarray) – The predicted scores.

  • sensitive_attribute (np.ndarray, optional) – The sensitive attribute, i.e., which protected group each sample belongs to. If not provided, fairness metrics will not be computed.

  • return_groupwise_metrics (bool) – Whether to return group-wise performance metrics (requires providing sensitive_attribute).

Returns:

A dictionary with key-value pairs of (metric name, metric value).

Return type:

dict
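
An illustrative end-to-end sketch; the synthetic data is made up, and passing threshold=0.5 through **threshold_target to binarize the scores is an assumption rather than documented behavior:

    import numpy as np
    from error_parity.evaluation import evaluate_predictions

    rng = np.random.default_rng(0)
    n = 1_000
    y_true = rng.integers(0, 2, size=n)
    y_scores = rng.random(size=n)              # continuous predicted scores in [0, 1)
    group = rng.integers(0, 2, size=n)

    results = evaluate_predictions(
        y_true=y_true,
        y_pred_scores=y_scores,
        sensitive_attribute=group,             # omit to skip fairness metrics
        threshold=0.5,                         # forwarded via **threshold_target (assumed usage)
    )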

Bootstrap estimates

error_parity.evaluation.evaluate_predictions_bootstrap(y_true, y_pred_scores, sensitive_attribute, k=200, confidence_pct=95, seed=42, **threshold_target)[source]

Computes bootstrap estimates of several metrics for the given predictions.

Parameters:
  • y_true (np.ndarray) – The true labels.

  • y_pred_scores (np.ndarray) – The score predictions.

  • sensitive_attribute (np.ndarray) – The sensitive attribute data.

  • k (int, optional) – How many bootstrap samples to draw, by default 200.

  • confidence_pct (float, optional) – How large of a confidence interval to use when reporting lower and upper bounds, by default 95 (i.e., 2.5 to 97.5 percentile of results).

  • seed (int, optional) – The random seed, by default 42.

Returns:

A dictionary of bootstrap results, including the lower and upper confidence bounds for each metric.

Return type:

dict
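
A sketch of the bootstrap variant under the same assumptions as above (synthetic data, threshold=0.5 forwarded through **threshold_target); the exact keys of the returned dictionary are not reproduced here:

    import numpy as np
    from error_parity.evaluation import evaluate_predictions_bootstrap

    rng = np.random.default_rng(0)
    n = 1_000
    y_true = rng.integers(0, 2, size=n)
    y_scores = rng.random(size=n)
    group = rng.integers(0, 2, size=n)

    boot = evaluate_predictions_bootstrap(
        y_true=y_true,
        y_pred_scores=y_scores,
        sensitive_attribute=group,
        k=200,                 # number of bootstrap resamples
        confidence_pct=95,     # report the 2.5th to 97.5th percentiles
        seed=42,
        threshold=0.5,         # forwarded via **threshold_target (assumed usage)
    )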