I just came up with the following interview question for data scientists:
You apply for a position at a company and interview with five individuals, each of whom then casts a hire / no-hire vote. You don't get the job and decide to file a discrimination lawsuit against the company. You don't specify which of the five interviewers was being discriminatory or whether you were being discriminated against based on gender, race, religion, age, or sexual orientation. The judge forces the company to disclose historical hire / no-hire decisions for each of the five interviewers and declares that if you can show that one of the interviewers is significantly prejudiced (at p = 0.05) with respect to one of the five characteristics, then they will rule in your favor. Assuming that none of the interviewers are actually biased, what is your chance of winning the case?
Can you guess the answer?
The answer is: approximately 72%. Yep, you are almost three times as likely to win the case as to lose it, even when no actual discrimination is taking place.
This apparent paradox is a typical example of data dredging, and it is what people mean when they warn against the perils of big data, or even the downfall of science. The problem is not intrinsic to big data or science, but stems from what most problems tend to stem from: people doing things without fully understanding what they are doing.
So I thought I would take this chance to explain some of the key concepts behind statistical significance and data dredging using the discrimination lawsuit paradox as an illustration.
Statistical significance
Let's say you have historical interview decision data for interviewer X and it looks like this:
Candidate  Gender  X's decision 

1  female  hire 
2  female  no hire 
3  male  hire 
4  female  hire 
5  male  no hire 
6  male  no hire 
7  female  hire 
8  female  no hire 
9  female  no hire 
10  female  hire 
11  male  no hire 
12  male  no hire 
13  female  no hire 
14  male  no hire 
15  female  hire 
16  male  no hire 
17  male  hire 
18  male  hire 
19  male  no hire 
20  male  no hire 
The question is: is X discriminating based on gender, or in other words, does X have a bias towards men or women?
We can see that he (or she?) voted hire for 3 out of 11 men (27%) and 5 out of 9 women (56%) — more than double the rate. So it looks like he may have some bias against men. But does this really mean that he is discriminating, or did he just happen to come across some really qualified female candidates and some really unqualified male ones?
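The two hire rates quoted above can be re-derived by tallying the table directly. This is only a minimal sketch: the `decisions` list is a hand transcription of X's table, and the helper name `hire_rate` is made up for illustration.

```python
from collections import Counter

# (gender, decision) for candidates 1..20, transcribed from X's table.
decisions = [
    ("female", "hire"), ("female", "no hire"), ("male", "hire"),
    ("female", "hire"), ("male", "no hire"), ("male", "no hire"),
    ("female", "hire"), ("female", "no hire"), ("female", "no hire"),
    ("female", "hire"), ("male", "no hire"), ("male", "no hire"),
    ("female", "no hire"), ("male", "no hire"), ("female", "hire"),
    ("male", "no hire"), ("male", "hire"), ("male", "hire"),
    ("male", "no hire"), ("male", "no hire"),
]

# Count candidates and hire votes per gender.
totals = Counter(gender for gender, _ in decisions)
hires = Counter(gender for gender, decision in decisions if decision == "hire")


def hire_rate(gender):
    """Fraction of candidates of the given gender who got a hire vote."""
    return hires[gender] / totals[gender]


for gender in ("male", "female"):
    print(f"{gender}: {hires[gender]} of {totals[gender]} hired "
          f"({hire_rate(gender):.0%})")
```

The ratio of the two rates comes out at a little over 2, which is where "more than double" comes from.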
This is the question which statistical significance attempts to answer, but it is important to realize that no mathematical system can actually answer this question conclusively, so we go after "the next best thing" and instead try to answer a related question:
"Let's give X the benefit of the doubt and assume that he is not discriminating (the 'null hypothesis'). What are the odds that he would be so unlucky with his male candidates to end up with this amount (or more) of skew towards women?"
To answer this question, let's assume that X was completely blind to the gender when he gave away his eight hire votes and that gender was randomly assigned to the candidates after this fact. If that were the case, the chance of men getting x of the hire votes and women getting 8 − x would be:

$$P(x) = \binom{11}{x} \binom{9}{8 - x} \Big/ \binom{20}{8}$$

so:

Hires for men  Hires for women  Probability 

0  8  0.007% 
1  7  0.314% 
2  6  3.668% 
3  5  16.504% 
4  4  33.008% 
5  3  30.807% 
6  2  13.203% 
7  1  2.358% 
8  0  0.131% 
The p-value is simply the sum of the probabilities in the row in question (in this case 16.504%) and all the rows which are even more "extreme" (in this case, even more biased against men). So in this case the p-value is around 20%, or 0.2, which is usually not considered "statistically significant". The lower the p-value, the more statistically significant the result. Typical p-value thresholds for achieving statistical significance include 0.1, 0.05, and 0.01. So if there had been only two men among the eight that X voted hire for, the bias would have been considered significant, with a p-value of around 0.04. But since he (or she) voted hire for three men, the bias should not be considered statistically significant.
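The probability table and the two p-values can be reproduced in a few lines. This is a sketch of the calculation described above (a one-sided hypergeometric test), using only the standard library; the function names `prob_men_hired` and `p_value` are illustrative, not from any library.

```python
from math import comb

MEN, WOMEN, HIRES = 11, 9, 8  # counts from interviewer X's table


def prob_men_hired(x):
    """Chance that exactly x of the 8 hire votes went to men,
    assuming gender was assigned at random (the null hypothesis)."""
    return comb(MEN, x) * comb(WOMEN, HIRES - x) / comb(MEN + WOMEN, HIRES)


def p_value(observed):
    """One-sided p-value: probability of `observed` or fewer hire votes
    for men, i.e. the observed row plus every row more biased against men."""
    return sum(prob_men_hired(x) for x in range(observed + 1))


for x in range(HIRES + 1):
    print(f"{x} men, {HIRES - x} women: {prob_men_hired(x):.3%}")

print(f"p-value for 3 men hired: {p_value(3):.3f}")  # about 0.205
print(f"p-value for 2 men hired: {p_value(2):.3f}")  # about 0.040
```

Note that this is a one-sided test (bias against men only), matching the definition of "more extreme" used in the text; a two-sided test would also count the rows that are at least as skewed in the other direction.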
A common misconception at this point is to assume that this means that the probability that X is biased is 1 − p = 80%. But that's simply not how probability works, which we can easily see by applying Bayes' rule:
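$$
\begin{aligned}
& P(\text{X is biased} \mid \text{this or more extreme result}) \\
&= P(\text{NOT null hypothesis} \mid \text{this or more extreme result}) \\
&= 1 - P(\text{null hypothesis} \mid \text{this or more extreme result}) \\
&= 1 - P(\text{this or more extreme result} \mid \text{null hypothesis}) \cdot P(\text{null hypothesis}) \,/\, P(\text{this or more extreme result}) \\
&= 1 - p \cdot P(\text{null hypothesis}) \,/\, P(\text{this or more extreme result}) \\
&\neq 1 - p
\end{aligned}
$$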
This misconception keeps getting repeated, leading to a lot of confusion, but the truth is that you just can't compute the actual probability that X is biased unless you have access to all the different parallel universes, in some of which X is biased and in others not.
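To make the Bayes' rule argument concrete, here is a small sketch that plugs numbers into the formula. The two extra inputs, `prior_biased` and `prob_result_if_biased`, are exactly the quantities we do not know in practice; the values used below are invented purely for illustration.

```python
def prob_biased(p, prior_biased, prob_result_if_biased):
    """P(X is biased | this or more extreme result) via Bayes' rule.

    p                     -- P(result | null hypothesis), i.e. the p-value
    prior_biased          -- ASSUMED P(X is biased) before seeing any data
    prob_result_if_biased -- ASSUMED P(result | X is biased)
    """
    prior_null = 1 - prior_biased
    # Law of total probability for the denominator P(result).
    prob_result = p * prior_null + prob_result_if_biased * prior_biased
    return 1 - p * prior_null / prob_result


p = 0.2  # roughly the p-value found for interviewer X
for prior in (0.01, 0.1, 0.5):  # hypothetical priors
    posterior = prob_biased(p, prior, prob_result_if_biased=0.6)
    print(f"prior {prior:.2f} -> P(biased | result) = {posterior:.3f}")
```

With the same p-value of 0.2, the answer swings from about 3% to 75% depending on the made-up prior, and none of the three cases gives 1 − p = 80%.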
Note: In real life we would also have to account for the overall rate of hire vs. no-hire decisions for the two genders, which may not actually be equal for reasons unrelated to discrimination. For example, girls may tend to choose different colleges than boys, and some colleges may be better than others at preparing candidates for the job in question. But this would make the math more complicated, so for the purpose of this example, let's assume that the overall rates are the same for men and women in the general population.
Data dredging
Coming back to the original interview question: we have five different interviewers, and each of the five interviewers can be prejudiced with respect to five different possible factors (gender, race, religion, age and sexual orientation). This makes 25 possible combinations of interviewer and prejudice. For each of these 25 combinations we can compute the p-value just like we did in the previous section. Assuming that none of the interviewers is actually biased, the probability of any one of these results showing statistically significant bias at a p-value threshold of 0.05 is 0.05; that is simply the definition of the p-value. But this means that, treating the 25 tests as independent, the probability of at least one of the 25 showing statistically significant bias is 1 − (1 − 0.05)^{25} ≈ 0.72.
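The 72% figure can be checked both with the formula and with a quick simulation. This sketch rests on two simplifying assumptions that are mine, not the judge's: the 25 tests are independent, and under the null hypothesis each test's p-value is uniformly distributed between 0 and 1 (with small discrete samples like X's table, the real false positive rate per test can be somewhat below the threshold).

```python
import random

ALPHA = 0.05  # p-value threshold set by the judge
TESTS = 25    # 5 interviewers x 5 characteristics

# Closed form: 1 minus the chance that all 25 tests come out non-significant.
analytic = 1 - (1 - ALPHA) ** TESTS


def simulate(trials=100_000, seed=0):
    """Fraction of simulated lawsuits in which at least one of the 25
    tests falls below ALPHA purely by chance."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # ASSUMPTION: each null p-value is an independent uniform draw.
        if any(rng.random() < ALPHA for _ in range(TESTS)):
            wins += 1
    return wins / trials


print(f"analytic:  {analytic:.4f}")
print(f"simulated: {simulate():.4f}")
```

The two numbers should agree to within simulation noise, both landing near 0.72.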
In general, the more hypotheses you test in the hope of uncovering a "statistically significant" one, the more likely it is that at least one of them will show significance due to pure chance even if no underlying pattern exists (a false positive). This chance gets lower if you lower the p-value threshold, but no matter how low the p-value threshold is, given enough possible hypotheses, the probability of at least one false positive quickly approaches 1 (see graph below).
Conclusion
This shows that you should always take every statistically significant result with a grain of salt and ask yourself: how many hypotheses were tested before this seemingly significant result was found? You can see many examples of this in medicine (testing numerous drugs to see what works, or testing multiple combinations of genes and conditions to find a correlation), as well as in software development (A/B testing multiple features and adopting only the ones which show "significant" improvement). There is nothing wrong with computing p-values, but taking action without knowing exactly what they mean can quite literally have perilous results.
Having gone through this whole chain of thought, I decided that it probably wasn't a good idea to give this interview question to actual candidates, just in case it gives them any ideas... ;)