Cyrus Cousins

Axiomatically Justified and Statistically Sound Fair Machine Learning

Project Overview

Overview

A symphony in three parts, this line of work explores welfare-centric group-fair machine learning, sampling, and statistical estimation problems, wherein the goal is to establish a cardinal objective function over per-group average utility or disutility (risk) values, and then to optimize or estimate the value of this objective from data. In this work, I generally adopt the philosophy of “flexible and non-judgmental algorithmics,” wherein we seek to show what is possible for a broad class of objectives, with the understanding that fairness concepts vary interpersonally and across cultures, and it is not necessarily our place as algorithm designers to make normative claims as to how one should view fairness. However, I do see axiomatic reasoning as a path toward better understanding the space of reasonable fairness concepts: should one accept certain precepts, one can at least restrict attention to particular classes of fairness concepts. In this work, the axiomatic class of interest is usually the weighted power-mean, defined for real-valued $p$ as \[ \operatorname{M}_{p}(\boldsymbol{u}; \boldsymbol{w}) \doteq \sqrt[p]{\sum_{i=1}^{g} \boldsymbol{w}_{i} {\boldsymbol{u}_{i}^{{p}}}} \enspace. \]
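For concreteness, here is a minimal sketch of this family in Python (the function name and structure are mine, purely illustrative), covering the limiting cases $p = 0$ (the weighted geometric mean) and $p = \pm\infty$ (minimum and maximum):

    import numpy as np

    def weighted_power_mean(u, w, p):
        """Weighted power mean M_p(u; w) for nonnegative u and weights w summing to 1."""
        u, w = np.asarray(u, dtype=float), np.asarray(w, dtype=float)
        if p == 0.0:                    # limiting case: weighted geometric mean
            return float(np.exp(np.sum(w * np.log(u))))
        if np.isinf(p):                 # limiting cases: max (p = +inf) or min (p = -inf)
            return float(np.max(u) if p > 0 else np.min(u))
        return float(np.sum(w * u ** p) ** (1.0 / p))

    # Two groups with risks 0.1 and 0.3, weighted equally:
    print(weighted_power_mean([0.1, 0.3], [0.5, 0.5], 1))       # 0.2 (utilitarian average)
    print(weighted_power_mean([0.1, 0.3], [0.5, 0.5], np.inf))  # 0.3 (worst-case group)

Here $p = 1$ recovers the utilitarian average of per-group values, while $p = \infty$ (for malfare) recovers the egalitarian worst-case group risk.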

Welfare-centric learning methods contrast with constraint-based fairness methods, which are generally intuitively motivated rather than axiomatically justified, and frequently run afoul of statistical quandaries (overfitting to fairness). In particular, it is difficult to ensure that fairness constraints hold with high probability over the underlying distribution while also ensuring that the learned function approximately optimizes the objective subject to those constraints. Intuitively put, even selecting between two functions from data can be impossible: if one function exactly satisfies a fairness constraint, while the other satisfies it with some slack but obtains a worse objective value, then empirical estimates of the constraint cannot reliably distinguish the two. Cardinal fairness methods resolve this dilemma by encoding the objective and the fairness concept in a single function, after which one need only show that this function can be efficiently optimized and estimated from data.

The Core Papers

Part one [NeurIPS 2021] finds that the power-mean family with $p \geq 1$ is the only axiomatically justified class for aggregating loss values, hence termed malfare functions. It then shows that methods from convex optimization are sufficient to optimize this class in polynomial time, and that Rademacher averages and tools from statistical learning theory are sufficient to show generalization bounds, not just on the per-group risk values, but also on the cardinal objective function itself, despite its nonlinear dependence on per-group samples.

The second movement [FAccT 2022] extends this theory, giving sharper generalization guarantees under broader conditions, and exploring dynamic (adaptive) sampling strategies for more realistic sampling models that consider the intricacies of populations comprising multiple groups (e.g., sampling random sets, or sampling conditionally from various groups under some cost model). It also gives similar treatment to a generalized class of regret-based objectives, wherein each group considers the difference between the optimal (dis)utility it could obtain in an unshared model and the (dis)utility it obtains in a shared model, and we optimize the malfare of this quantity.

Finally, the third part [AIStats 2023] extends this analysis to welfare functions ($p \leq 1$ weighted power-means), and shows that, while not always Lipschitz continuous, this class is locally Hölder continuous. As a result, we obtain finite sample complexity bounds for bounded learning problems where uniform convergence holds (as in the malfare case), and polynomial sample complexity is preserved so long as $|p|$ is bounded away from $0$ and no group’s weight is too small.
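The convex-optimization claim of part one can be made concrete with a short sketch. The following subgradient-descent routine (names are hypothetical; squared loss and a shared linear model are my simplifying assumptions, not the papers' setting) minimizes the power-mean malfare of per-group empirical risks for $p \geq 1$:

    import numpy as np

    def malfare_gradient(risks, w, p):
        """M_p(r; w) and its gradient dM_p/dr_i, for p >= 1 and r > 0."""
        s = np.sum(w * risks ** p)
        return s ** (1.0 / p), w * risks ** (p - 1) * s ** (1.0 / p - 1.0)

    def minimize_malfare(groups, w, p, steps=2000, lr=0.5):
        """Subgradient descent on theta -> M_p(R_1(theta), ..., R_g(theta); w),
        where R_i is group i's mean squared error under a shared linear model.
        For p >= 1, M_p is convex and monotone on the nonnegative orthant, so
        the composition with convex per-group risks is convex in theta."""
        theta = np.zeros(groups[0][0].shape[1])
        for t in range(steps):
            risks = np.array([np.mean((X @ theta - y) ** 2) for X, y in groups])
            _, dm = malfare_gradient(risks, w, p)
            # chain rule: weight each group's risk gradient by dM_p/dr_i
            grad = sum(d * 2 * X.T @ (X @ theta - y) / len(y)
                       for d, (X, y) in zip(dm, groups))
            theta -= lr / np.sqrt(t + 1) * grad
        return theta

This sketch merely illustrates that first-order methods apply; the papers handle nonsmoothness and the $p = \infty$ (minimax) case with more care.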

Extensions

Many practical student projects and collaborations have also developed from this line of work; of note are projects in fair learning with unknown group IDs, robust fair learning, and fair reinforcement learning, all of which have produced workshop papers. In subsequent work, I have further explored these themes, with the goal of sharper per-group generalization bounds via more sophisticated techniques. In ongoing work, I am studying more efficient convex optimization algorithms for power-mean optimization: SGD is difficult due to the nonlinear objective and the resulting biased gradient estimators (see the computation below), and first-order methods suffer from a general lack of smoothness, even under smooth losses, as the $p=\infty$ power-mean is the maximum of the per-group losses. I am also working to apply these methods in practical large-scale machine learning settings, such as facial recognition, verification, and classification, as well as recommender systems, wherein sampling models exhibit additional subtlety, as we generally see multiple datapoints associated with each individual, and must aggregate these data across all individuals in a group.
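To see the bias issue concretely, the chain rule gives (my notation, with $R_{i}(\theta)$ the risk of group $i$ under model parameters $\theta$) \[ \nabla_{\theta} \operatorname{M}_{p}(\mathbf{R}(\theta); \boldsymbol{w}) = \Bigl( \sum_{i=1}^{g} \boldsymbol{w}_{i} R_{i}^{p}(\theta) \Bigr)^{\frac{1}{p} - 1} \sum_{i=1}^{g} \boldsymbol{w}_{i} R_{i}^{p-1}(\theta) \, \nabla_{\theta} R_{i}(\theta) \enspace. \] Minibatch estimates of each $R_{i}(\theta)$ enter this expression nonlinearly whenever $p \neq 1$, so plugging them in yields a biased gradient estimate, as in general $\mathbb{E}[f(\hat{X})] \neq f(\mathbb{E}[\hat{X}])$ for nonlinear $f$.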


An Axiomatic Theory of Provably-Fair Welfare-Centric Machine Learning

Cyrus Cousins

Abstract

We address an inherent difficulty in welfare-theoretic fair machine learning by proposing an equivalently axiomatically justified alternative and studying the resulting computational and statistical learning questions. Welfare metrics quantify overall wellbeing across a population of one or more groups, and welfare-based objectives and constraints have recently been proposed to incentivize fair machine learning methods to produce satisfactory solutions that consider the diverse needs of multiple groups. Unfortunately, many machine-learning problems are more naturally cast as loss minimization tasks, rather than utility maximization, which complicates direct application of welfare-centric methods to fair machine learning. In this work, we define a complementary measure, termed malfare, measuring overall societal harm (rather than wellbeing), with axiomatic justification via the standard axioms of cardinal welfare. We then cast fair machine learning as malfare minimization over the risk values (expected losses) of each group. Surprisingly, the axioms of cardinal welfare (malfare) dictate that this is not equivalent to simply defining utility as negative loss. Building upon these concepts, we define fair-PAC learning, where a fair-PAC learner is an algorithm that learns an ε-δ malfare-optimal model with bounded sample complexity, for any data distribution, and for any (axiomatically justified) malfare concept. Finally, we show broad conditions under which, with appropriate modifications, standard PAC-learners may be converted to fair-PAC learners. This places fair-PAC learning on firm theoretical ground, as it yields statistical and computational efficiency guarantees for many well-studied machine-learning models, and is also practically relevant, as it democratizes fair machine learning by providing concrete training algorithms and rigorous generalization guarantees for these models.
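Paraphrasing the definition sketched above (the notation and additive-error form are my reading, simplified): a fair-PAC learner for hypothesis class $\mathcal{H}$ is an algorithm that, for every tuple of per-group distributions, every malfare concept $\operatorname{M}_{p}$ with $p \geq 1$, and every $\varepsilon, \delta > 0$, draws boundedly many samples from each group and outputs $\hat{h} \in \mathcal{H}$ such that \[ \Pr\Bigl[ \operatorname{M}_{p}\bigl(R_{1}(\hat{h}), \ldots, R_{g}(\hat{h}); \boldsymbol{w}\bigr) \leq \inf_{h \in \mathcal{H}} \operatorname{M}_{p}\bigl(R_{1}(h), \ldots, R_{g}(h); \boldsymbol{w}\bigr) + \varepsilon \Bigr] \geq 1 - \delta \enspace, \] where $R_{i}(h)$ denotes the risk (expected loss) of $h$ on group $i$.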


Keywords

Fair Machine Learning ♦ Cardinal Welfare Theory ♣ PAC-Learning ♥ Computational Learning Theory ♠ Statistical Learning Theory


Read the full paper on arXiv

Read the (deliciously concise) NeurIPS 2021 conference paper


NeurIPS 2021


Presentation Recording


Slide Deck


EC 2021 Poster Session


Slide Deck


Uncertainty and the Social Planner’s Problem: Why Sample Complexity Matters

Cyrus Cousins

Abstract

Welfare measures overall utility across a population, whereas malfare measures overall disutility, and the social planner’s problem can be cast either as maximizing the former or minimizing the latter. We show novel bounds on the expectations and tail probabilities of estimators of welfare, malfare, and regret of per-group (dis)utility values, where estimates are made from a finite sample drawn from each group. In particular, we consider estimating these quantities for individual functions (e.g., allocations or classifiers) with standard probabilistic bounds, and optimizing and bounding generalization error over hypothesis classes (i.e., we quantify overfitting) using Rademacher averages. We then study algorithmic fairness through the lens of sample complexity, finding that because marginalized or minority groups are often understudied, and fewer data are therefore available, the social planner is more likely to overfit to these groups; thus even models that seem fair in training can be systematically biased against such groups. We argue that this effect can be mitigated by ensuring sufficient sample sizes for each group, and our sample complexity analysis characterizes these sample sizes. Motivated by these conclusions, we present progressive sampling algorithms to efficiently optimize various fairness objectives.
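A minimal sketch of the progressive-sampling idea follows (the papers use sharper, data-dependent Rademacher bounds; this Python sketch substitutes a simple Hoeffding-style bound, and the callback draw_losses is hypothetical):

    import numpy as np

    def progressive_risk_estimates(draw_losses, g, eps, delta, n0=128, max_rounds=20):
        """Double each group's sample until a Hoeffding-style bound certifies
        every per-group mean loss (losses in [0, 1]) to within eps. Since
        p >= 1 power means are 1-Lipschitz in the sup norm over per-group
        risks, the plug-in malfare estimate is then also eps-accurate."""
        losses = [draw_losses(i, n0) for i in range(g)]
        for _ in range(max_rounds):
            n = len(losses[0])
            # union bound over g groups and max_rounds rounds
            eps_hat = np.sqrt(np.log(2 * g * max_rounds / delta) / (2 * n))
            if eps_hat <= eps:
                return [float(np.mean(l)) for l in losses], eps_hat
            losses = [np.concatenate([l, draw_losses(i, n)]) for i, l in enumerate(losses)]
        raise RuntimeError("sampling budget exceeded before reaching eps")

Here draw_losses(i, n) would return n i.i.d. loss values from group i, e.g., losses of a fixed allocation or classifier, matching the “individual functions” setting of the abstract.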


Keywords

Fair Machine Learning ♦ Cardinal Welfare Theory ♣ Minimax Fair Learning ♥ Multi-Group Agnostic PAC Learning ♠ Statistical Learning Theory


Read the extended paper

Read the FAccT 2022 conference paper


FAccT 2022 Poster


Presentation Video


Revisiting Fair-PAC Learning and the Axioms of Cardinal Welfare

Cyrus Cousins

Abstract

Cardinal objectives like welfare and malfare have recently enjoyed increased attention in fair machine learning as computationally, statistically, and philosophically sound alternatives to constraint-based methods. Welfare summarizes utility over a population of $g$ groups, whereas malfare measures overall disutility. Subject to standard axioms, both welfare and malfare functions are restricted to the family of $\boldsymbol{w}$-weighted $p$-power-means, i.e., $\operatorname{M}_{p}(\boldsymbol{u}; \boldsymbol{w}) \doteq \sqrt[p]{\sum_{i=1}^{g} \boldsymbol{w}_{i} \smash{\boldsymbol{u}_{i}^{{p}}}}$, with $p \leq 1$ for welfare (utility $\boldsymbol{u}$), or $p \geq 1$ for malfare (disutility $\boldsymbol{u}$). This work revisits said axioms, finding that a weaker basis is sufficient to show the same, and furthermore that strengthening these axioms partitions the welfare half of the spectrum (i.e., $p \leq 1$) into a few cases by further limiting $p$. It is known that $p \geq 1$ power means (i.e., malfare functions) are Lipschitz continuous, and thus statistically easy to estimate or learn (i.e., each $\boldsymbol{u}_{i}$ can be approximated with a sample estimate). We show that all power means are at least locally Hölder continuous, i.e., $\lvert \operatorname{M}_{p}(\boldsymbol{u}; \boldsymbol{w}) - \operatorname{M}_{p}(\boldsymbol{u}'; \boldsymbol{w}) \rvert \leq \lambda \lVert \boldsymbol{u} - \boldsymbol{u}' \rVert^{\alpha}$ for some constants $\lambda > 0$, $\alpha \in (0, 1]$, and some norm $\lVert \cdot \rVert$. Furthermore, $\lambda$ and $\smash{\frac{1}{\alpha}}$ are bounded except as $p \to 0$ or $\min_{i} \boldsymbol{w}_{i} \to 0$, and via this analysis we bound the sample complexity of estimating or optimizing welfare functions. This yields a novel concept of fair-PAC learning, with dependence on the quantities $\smash{\frac{1}{\lvert p \rvert}}$ and/or $\smash{\frac{1}{\min_{i} \boldsymbol{w}_{i}}}$ (which measure closeness to the challenging $p = 0$ case and the inverse minimum group weight, respectively), wherein welfare functions are only polynomially harder to optimize than malfare functions, except when $p \approx 0$ or $\min_{i} \boldsymbol{w}_{i} \approx 0$, in which case they may be exponentially harder. These challenging cases may be avoided by assuming the appropriate strengthened axioms. In all cases, we show that if a bounded quantity is learnable with finite sample complexity, then so too is the welfare of said quantity. This brings the estimation and learning of welfare objectives to near-parity with malfare objectives, as all such objectives are uniformly learnable.
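To see concretely why Lipschitz continuity fails for $p < 1$, consider a two-group illustration (my example, not taken from the paper): with $\boldsymbol{w} = (\frac{1}{2}, \frac{1}{2})$ and $p = \frac{1}{2}$, \[ \operatorname{M}_{1/2}\bigl((\epsilon, 1); \boldsymbol{w}\bigr) - \operatorname{M}_{1/2}\bigl((0, 1); \boldsymbol{w}\bigr) = \Bigl( \frac{\sqrt{\epsilon} + 1}{2} \Bigr)^{2} - \frac{1}{4} = \frac{\sqrt{\epsilon}}{2} + \frac{\epsilon}{4} \in \Theta(\sqrt{\epsilon}) \enspace, \] so a perturbation of size $\epsilon$ to one group's utility moves the welfare by $\Theta(\sqrt{\epsilon})$: Hölder continuity with exponent $\alpha = \frac{1}{2}$, but no finite Lipschitz constant suffices near $\boldsymbol{u}_{i} = 0$.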


Keywords

Fair Machine Learning ♦ Algorithmic Fairness ♣ Fair PAC Learning ♥ Statistical Learning Theory ♠ Social Planner’s Problem ♦ Cardinal Welfare Theory ♣ Weighted Power Means ♥ Hölder Continuity


Read the extended paper

Read the AIStats 2023 conference paper


AIStats 2023 Poster


Presentation Video

Slide Deck