An overview of machine learning fairness
This article introduces machine learning fairness and aims to give the reader a view of the whole problem. To be fair, this study is tinged by my own biased understanding of the field. My goal here is above all to ask the right questions. It is a vast field, and where it seems important I link to much more in-depth studies that describe in detail the causes, the effects and the measures of fairness.
In 2016, COMPAS, a risk assessment AI software used to predict the risk of a criminal reoffending, was analyzed and a severe racial bias was discovered: being Black drastically increased the predicted risk of reoffending.
In 2018, a Reuters article exposed a serious gender bias in Amazon's AI resume screening. The model attached a strong positive importance to the words used more often by men and a negative importance to those used more often by women. This is the result of the data being unequally distributed between men and women: the former greatly outnumber the latter at Amazon, and the model reproduced these inequalities in its predictions.

The general problem of unwanted biases is not only technological. It contains several topics that need to be discussed on a case-by-case basis. The list of unfair biases is local to the problem: in a hiring machine learning system, an age bias seems unfair, but in a model predicting the survival of a COVID-19 patient, age is an important variable to take into account. There is also the question of how fairness is measured: different methods exist, and they will not have the same relevance in all cases. This makes fairness a delicate subject.
Strangely enough, almost all biases are a good thing to have in data. In fact, machine learning classifiers are designed to base their decisions on statistical biases.
Discrimination on biases: the natural way of Machine Learning
…. So is bias a good thing?
Let's take a closer look at what a neural network is. In this article, we take this algorithm as representative of all machine learning algorithms. Artificial neural networks are composed of single artificial neurons arranged in a particular way; these are the elementary computation units of the network. Let's look at them in detail.
This computation unit takes the weighted sum of its inputs and applies an activation function, h(x), which is often non-linear. Here are the most common ones:
We distinguish two families: step-like functions (sigmoid, tanh) and ramp-like functions (ReLU and its variants). These are simplifications to illustrate what is happening here. The step-like functions have two distinct states, active (1) or inactive (0 or -1). The ramp-like functions also have two states: linear (for inputs ≥ 0) or inactive (0 for negative inputs). The output of the activation function reacts in two distinct ways depending on the input, which discriminates the input into two cases. These neuronal models try to mimic real neurons and how they work: brain neurons are units that either fire an electrical impulse (active / excited) or not (inactive / inhibited). In that sense, neural networks are complex architectures of interconnected discriminative units. No wonder machine learning is prone to discrimination… it is designed for it! But does this mean that their creators were Machiavellian white men inventing a diabolical algorithm? Of course not.
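To make this concrete, here is a minimal sketch in plain Python / NumPy of such a unit: a weighted sum followed by a step-like or a ramp-like activation. The weights and inputs are made up, purely illustrative.

```python
import numpy as np

def sigmoid(z):
    # Step-like activation: two saturated states, close to 0 (inactive) or 1 (active).
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Ramp-like activation: inactive (0) for negative inputs, linear otherwise.
    return np.maximum(0.0, z)

def neuron(x, w, b, activation):
    # The weighted sum of the inputs followed by the non-linear activation h(x).
    return activation(np.dot(w, x) + b)

# Purely illustrative values: a 3-input neuron with arbitrary weights.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
print(neuron(x, w, b=0.2, activation=sigmoid))  # ~0.33: weakly active
print(neuron(x, w, b=0.2, activation=relu))     # 0.0: inactive
```

Either way, the unit splits its input space into two regimes and treats them differently.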
Discrimination and generalization are the basis of learning, for machines as for the humans before them. A neural network is an algorithm designed to find biases in the data, distinguish cases and apply different processing to them; that is how it solves problems. Applied to data linked to humans, it will inevitably find the biases that enable it to solve the given problem.
But what happens when human data, created from biased human experience, is inherently biased? The machine simply learns those biases and extrapolates (i.e. generalizes). When the data biases are unfair, the machine learns human discrimination (in the societal sense).
Sometimes biases are unfair
…. Who is to blame?! Well, it is complicated.
Note: an in-depth study of the causes of biases and their intertwinings can be found here.
The first reflex when facing such an outrageous problem is to find who is to blame. Is it the fault of the engineers who used biased data? Of the people who created this biased data? Of those who distributed it? The project manager? The company that used this algorithm? The government that has not imposed a legal framework?
Let's try something else: let's try to understand the problem in depth, with its nuances. Let's start from the assumption that this is a mistake and coldly analyze its causes and effects. In the Amazon case, most technical employees are male, and the algorithm was trained on the text of the resumes of all previous technical employees. By analyzing what the neural network learned, the AI team discovered that “The algorithms learned to assign little significance to skills that were common across IT applicants, such as the ability to write various computer codes [...]. Instead, the technology favored candidates who described themselves using verbs more commonly found on male engineers’ resumes, such as “executed” and “captured,” [...].” (src).
We see that fundamental human characteristics, like gender, are inextricably linked to observations of our behavior, here our way of writing. Gender can be deduced from the resumes even if it is not directly given to the algorithm. Hence, hiding such variables by removing them is not effective. A simple solution would be to balance the number of male and female resumes in the training dataset. Would it be sufficient? How do we measure the bias? If we can measure it, we can zero it! But is zeroing a bias measure sufficient to obliterate an unfair bias?
Unfortunately, there is no single way to measure bias, and we will see that the different metrics are in competition. As we unfold the fairness problem over several levels, we start to unveil its layers of complexity.
Bias has multiple measures, each related to a valid understanding of what equality should be
Note: an in-depth study of the many fairness measures that have appeared in the literature can be found here. A gentler in-depth explanation of most measures can be found here.
Let's take a binary variable, G (e.g. gender), that we want to protect from machine learning discrimination. This variable can take two values (0 / 1). The samples that exhibit the same value of G are grouped together: considering only G, there are two groups in the population, one considered privileged (e.g. the male group) and the other unprivileged (the female group). Now consider the result of a binary classifier. One of the output classes is considered a positive outcome (e.g. getting hired by Amazon) and the other a negative outcome (e.g. not getting hired by Amazon). We will illustrate each fairness measure with a related affirmation that helps grasp its meaning and with an example based on the Amazon hiring process discussed earlier.
Note²: I do not go into the math of the measures; I simply relate them to the ideal case of their paradigm of equality, which is the case when the metric is valued at 1. When it is valued 0, we are as far as we can imagine from this ideal case.
Demographic Parity² - “There should be an equal number of positive outcomes in the privileged group and in the unprivileged group”
Ex: if 80 males and 20 females apply for 10 open positions in Amazon,
5 should be given to males and 5 should be given to females.
Equality of Opportunity² - “There should be an equal rate of positive outcomes in the privileged group and in the unprivileged group”
Ex: if 80 males and 20 females apply for 10 open positions in Amazon,
8 should be given to males and 2 should be given to females.
Equalized Odds²³ - “There should be an equal rate of true and false positive outcomes in the privileged group and in the unprivileged group”
Ex: if 50 unqualified males, 30 qualified males, 15 unqualified females and 5 qualified females apply for 10 open positions in Amazon, only 3 scenarios are valid:
4 unqualified males + 4 qualified males + 1 unqualified female + 1 qualified female
8 qualified males + 2 qualified females
8 unqualified males + 2 unqualified females.
Note³: in the Amazon hiring process example, a true positive is a hired person who deserves the job given the unbiased data from their resume; a false positive is a hired person who does not deserve it. We simplify here with ‘qualified’ (for the job).
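To make these three criteria concrete, here is a small Python sketch that checks them on the hypothetical applicant counts of the Amazon example, using the simplified readings of this section (the helper names are mine):

```python
from dataclasses import dataclass

@dataclass
class GroupOutcome:
    applicants: int         # applicants in the group
    hired_qualified: int    # true positives among the hires
    hired_unqualified: int  # false positives among the hires

    @property
    def hired(self):
        return self.hired_qualified + self.hired_unqualified

def demographic_parity(a, b):
    # Simplified reading used here: same *number* of positive outcomes in both groups.
    return a.hired == b.hired

def equality_of_opportunity(a, b):
    # Simplified reading used here: same *rate* of positive outcomes in both groups
    # (compared by cross-multiplication to avoid floating-point issues).
    return a.hired * b.applicants == b.hired * a.applicants

def equalized_odds(a, b):
    # Simplified reading used here: equal hiring rate AND the same
    # qualified / unqualified mix among the hires of both groups.
    same_mix = a.hired_qualified * b.hired == b.hired_qualified * a.hired
    return equality_of_opportunity(a, b) and same_mix

# First valid scenario from the text: 80 male and 20 female applicants, 10 positions,
# 4 qualified + 4 unqualified men and 1 qualified + 1 unqualified woman hired.
males = GroupOutcome(applicants=80, hired_qualified=4, hired_unqualified=4)
females = GroupOutcome(applicants=20, hired_qualified=1, hired_unqualified=1)

print(demographic_parity(males, females))       # False: 8 hires vs 2 hires
print(equality_of_opportunity(males, females))  # True: 10% hiring rate in both groups
print(equalized_odds(males, females))           # True: half of the hires qualified in both
```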
These are the 3 most common measures, but there are many more (linked under the title of this section). We clearly see that, as soon as the two groups have different sizes, demographic parity is incompatible with equality of opportunity and equalized odds. Moreover, equalized odds implies equality of opportunity, but the opposite is false. To convince yourself of this last point, recall the Amazon hiring example for equality of opportunity: the hired people are not required to be qualified for the job. In that sense, 1 qualified man + 7 unqualified men + 2 qualified women is a perfectly valid solution for equality of opportunity, but not for equalized odds. Each measure embodies a point of view of what is fair for the case at hand, which is subject to debate. Here are instances where each measure seems appropriate, in my opinion:
A salsa party for straight, single people: you should ensure demographic parity between men and women, otherwise it would be a lonely party for those who do not get partners.
Random checks at country borders should have equality of opportunity: every segment of the population should face the same rate of police checks at borders, whatever their appearance. This prevents police harassment, and this way nobody feels above the law.
Amazon’s hiring process should have equalized odds: people should be hired for their skills, past experience and recommendations. That way, there should be roughly the same rate of qualified people among hired males and hired females, and the ratio of hired males to hired females should be about the same as the ratio among applicants.
And all of this gets a little confusing in a modern ML pipeline
In a modern machine learning pipeline, things can be even more complicated than in these simple cases. Consider the area of NLP (Natural Language Processing). In the same machine learning pipeline, solving a single task, it is very common to find several different machine learning algorithms, some of which are already trained, others pre-trained, and the last ones trained from scratch (a rough sketch follows the list below):
The words in each sentence can be classified with trained POS (Part Of Speech) or NER (Named Entity Recognition) models.
Words / sentences can be transformed to embedding vectors with trained embedding matrices.
These embeddings and tokens can be used as inputs in another model that needs to be trained for the final task.
Words / sentences / documents can be classified with pre-trained models (BERT, GPT, T5).
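As a rough sketch of such a mixed pipeline, here is what the combination of a pre-trained component and a from-scratch component can look like. The tiny vocabulary, the random stand-in embeddings and the toy labels are mine, purely illustrative; scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# 1) A pre-trained component: a word -> vector lookup table. In a real pipeline this
#    would be loaded from a trained resource (GloVe, fastText, ...); here we fake it
#    with random vectors so the sketch runs on its own.
vocabulary = ["executed", "captured", "organized", "the", "migration",
              "team", "events", "requirements"]
embeddings = {word: rng.normal(size=50) for word in vocabulary}

def sentence_vector(sentence):
    # Average the vectors of the known words; crude, but a common baseline.
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(50)

# 2) A component trained from scratch for the final task (e.g. resume screening).
texts = ["executed the migration", "organized the team events"]
labels = np.array([1, 0])  # toy labels, purely illustrative
classifier = LogisticRegression().fit([sentence_vector(t) for t in texts], labels)

# Whatever bias is baked into the embeddings flows into the final decision.
print(classifier.predict([sentence_vector("captured the requirements")]))
```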
When you deal with several machine learning algorithms trained on different datasets, the final result depends on the biases learned in each subtask. Training embeddings is an unsupervised task, so traditional fairness metrics won’t help us here, because there is no clear positive or negative outcome.
It has been observed that embeddings have interesting properties when they are well trained: the final embedding set preserves some of the semantics between words in the form of local, quasi-linear semantic relationships. Exploiting these properties, we can construct a gender vector as the mean of the differences between gender-opposed word sets such as {“dad”, “man”, “father”} - {“mom”, “woman”, “mother”}. By removing this gender component from all the word embeddings in the dictionary, we can approximately suppress the gender bias. To measure the overall bias of an embedding set, we use a metric called WEAT.
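Before looking at WEAT, here is a minimal sketch of the gender-direction removal just described. The random vectors stand in for a real trained embedding set.

```python
import numpy as np

def gender_direction(embeddings, male_words, female_words):
    # Mean of the differences between gender-opposed word pairs,
    # e.g. ("dad" - "mom"), ("man" - "woman"), ("father" - "mother").
    diffs = [embeddings[m] - embeddings[f] for m, f in zip(male_words, female_words)]
    direction = np.mean(diffs, axis=0)
    return direction / np.linalg.norm(direction)

def remove_component(vector, direction):
    # Project the vector onto the gender direction and subtract that component.
    return vector - np.dot(vector, direction) * direction

# Toy example with random vectors standing in for trained embeddings.
rng = np.random.default_rng(0)
words = ["dad", "mom", "man", "woman", "father", "mother", "engineer"]
embeddings = {w: rng.normal(size=50) for w in words}

g = gender_direction(embeddings, ["dad", "man", "father"], ["mom", "woman", "mother"])
debiased = {w: remove_component(v, g) for w, v in embeddings.items()}
print(np.dot(debiased["engineer"], g))  # ~0: nothing left along the gender direction
```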
Word Embedding Association Test (WEAT)²⁴ - “There should be no difference in connotation (positive / negative) for semantically neutral words”
In the word embeddings of the Amazon hiring process, the following word vector distance relation should be true:
dist(“man”, “good”) - dist(“man”, “bad”) ≃ dist(“woman”, “good”) - dist(“woman”, “bad”)
Note⁴: this is a simplification to convey the essence of the measure; the real math is more complex. To compute the actual WEAT score, we would define a set of masculine words, an opposed set of feminine words, a set of positive words and an opposed set of negative words. Then we would compute the mean pairwise distances between the masculine and positive sets, the masculine and negative sets, the feminine and positive sets, and lastly the feminine and negative sets.
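To illustrate note⁴, here is a simplified sketch in the same spirit. It is not the exact WEAT statistic from the literature (which also involves an effect size and a permutation test); cosine similarity stands in for the distances, and the toy vectors for real embeddings.

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def association(word_vec, positive_vecs, negative_vecs):
    # How much closer a word is to the positive set than to the negative set.
    return (np.mean([cosine(word_vec, p) for p in positive_vecs]) -
            np.mean([cosine(word_vec, n) for n in negative_vecs]))

def weat_like_score(masculine, feminine, positive, negative):
    # Difference between the mean association of the masculine set and that of the
    # feminine set; a value near 0 means no gendered connotation.
    return (np.mean([association(m, positive, negative) for m in masculine]) -
            np.mean([association(f, positive, negative) for f in feminine]))

# Toy vectors standing in for real embeddings.
rng = np.random.default_rng(0)
vec = lambda: rng.normal(size=50)
masculine = [vec(), vec(), vec()]  # e.g. "man", "father", "dad"
feminine = [vec(), vec(), vec()]   # e.g. "woman", "mother", "mom"
positive = [vec(), vec()]          # e.g. "good", "brilliant"
negative = [vec(), vec()]          # e.g. "bad", "mediocre"
print(weat_like_score(masculine, feminine, positive, negative))
```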
My own experiments confirm the intuition that removing the bias of the embedding set is not enough to remove the whole bias from a complete NLP pipeline. There is also the bias of the dataset of the problem we are trying to solve. Since the gender bias cannot be totally removed in the first step (the gender vector suppression is an approximate process), the tiny differences that remain along the gender direction get amplified during the training of the final model.
Now, what countermeasures are available to enhance these fairness metrics?
What solutions?
Note: a more exhaustive list of possible solutions in different fields can be found here.
While most of the research on fairness has focused on debiasing the embeddings, I find that we miss the goal here. The downstream applications of embeddings contain datasets and models to debias as well. Hence, debiasing the whole process is more than necessary; it is the primary target. The AIF360 project does a great job here: it shows in an educational way the different debiasing strategies and their impact on the fairness metrics:
Balancing the number of privileged and unprivileged samples. For most problems this is a really good strategy, as a large share of the bias comes simply from the difference in group representation. The model and the problem to solve are not transformed, so it generally does not change the performance of the model much. It has been used successfully for mitigating bias in toxic comment detection.
Penalizing the bias by adding a term to the model loss. For each pair of samples (x, y), (x’, y’), where x belongs to the privileged group and x’ to the unprivileged group, add a |y - y’| term to the loss, y and y’ being the model’s outputs for x and x’. As this changes the problem the model is solving, we need to make sure it makes sense in the context (a sketch is given after this list). This technique has been used across various problems such as racial bias mitigation and gender bias mitigation.
Adversarial debiasing. It changes the model by adding an adversarial network that tries to detect, from the output of the model, whether a sample comes from the privileged or the unprivileged group; the model is then trained to fool this adversary. It has been used for mitigating gender bias in classification and in word embedding training.
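As an example of the second strategy (the loss penalty), here is a hedged PyTorch sketch. The architecture, the pairing of privileged / unprivileged samples and the fairness weight are all assumptions of mine, not the exact method of the cited works.

```python
import torch
import torch.nn as nn

# Hypothetical setup: a small classifier and paired batches (x_priv, x_unpriv) drawn
# from the privileged and unprivileged groups.
model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
bce = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
fairness_weight = 0.5  # how hard we penalize group-dependent outputs (to be tuned)

def training_step(x_priv, y_priv, x_unpriv, y_unpriv):
    optimizer.zero_grad()
    out_priv = model(x_priv)
    out_unpriv = model(x_unpriv)
    # Usual task loss on both groups...
    task_loss = bce(out_priv, y_priv) + bce(out_unpriv, y_unpriv)
    # ...plus a penalty on the |y - y'| difference between paired outputs.
    fairness_penalty = torch.mean(torch.abs(out_priv - out_unpriv))
    loss = task_loss + fairness_weight * fairness_penalty
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy paired batch, purely illustrative.
x_p, x_u = torch.randn(32, 20), torch.randn(32, 20)
y_p, y_u = torch.rand(32, 1).round(), torch.rand(32, 1).round()
print(training_step(x_p, y_p, x_u, y_u))
```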
So, we have the means to obliterate unfairness, let’s do it! Not so fast...
Balance is everything here, but not too much
Note: an interesting study of what happens when we push some metrics too far towards equality, and of when to use them, can be found here.
When trying to mitigate bias, we could be tempted to optimize fairness metrics to their maximum. Unfortunately, the fairness problem is not an optimization problem, and doing so can lead to paradoxical states of the system. When we push fairness enforcement too far, trying to perfectly balance the privileged and unprivileged groups, more problems may arise. The metrics we optimize are not fairness itself; they are measures of how far we are from an ideal fair case. Hence, it is wise not to confuse what is fair with a number we are optimizing. First, we need to precisely define our ideal case of fairness and assess its limitations. For instance, hiring only unqualified people could be a perfectly valid ideal case from the point of view of equality of opportunity and equalized odds. That is why it is advised to check the output of models manually and statistically, from different angles, on a carefully built validation dataset, to make sure the mitigation measures we enforce do more good than harm.
The subject of fairness is both interesting and disturbing because it forces us to define what is fair, what is not, and how a situation could be made fairer. It is a case-by-case discussion in the first place. It makes more sense to see fairness as a safety net against extreme or unjust system behaviors than as an ideal to pursue at all costs. The search for unfair bias can be endless: it is always possible to multiply the dimensions of injustice. For example, in sentiment analysis, simply looking at racial bias in detail is tricky. How many races should we consider? How do we define and represent them? If we take the problem at the surface, we are only concerned with mitigating biases along the white - black dimension. If one really wants to be fair, one should also consider the Hispanic and Asian groups. And it does not stop there: most studies are done in America, and this introduces a bias of its own, because they often consider only American people, who are very different from European and African people. This poses representation problems. All of these groups and sub-groups could have biased relationships with each other, which multiplies the dimensions along which fairness measures must be applied (Hispanic - white American, African Black - Hispanic, Asian - European white, Asian - American Black, etc.).
Hence, before being a technical problem, fairness should be a continuous, pragmatic and calm discussion that avoids objectification.