9 Ways to Prevent Data Bias in Predictive Models
Amazon recently noticed something strange about a predictive model it used to estimate the likelihood of employee success. The model consistently favored male applicants over female applicants, despite comparable qualifications. Why?
The culprit was data bias. The model used the company’s employment history to make its predictions, and because Amazon historically hired mostly men, the data consisted primarily of male applicants. As a result, the model extrapolated that maleness was a desired employee criterion. The model observed more Jacks and Bills than Susies and Jills, and it assumed HR hired that way on purpose.
This is an illustration of how biased data can lead to unfavorable outcomes in predictive modeling. Fortunately, there are several ways to reduce the presence of data bias in a predictive model to improve outcomes. In fact, predictive modeling helps mitigate it when properly applied. Here are 9 ways to prevent data bias in predictive models.
- Decide on your goals and establish clear parameters
- Stay involved in the project
- Put a diverse team in place to review your work
- Ensure that your data-collection tools are working
- Examine your data with a skeptical eye
- Balance the data before putting it to use
- Use a predictive modeling technology that allows for user oversight and control
- Utilize dummy data, with real-world applications, to train your model and test for biases
- Build the model itself to mitigate bias
Types of Data Bias
Before we dive into how to correct for biases in your data, you should first be aware of the three primary types of data bias:
Sample Bias: A form of selection bias wherein the data does not accurately reflect the population (example: a model predicts that the Browns are the greatest football team of all time after polling only Clevelanders)
Prejudice Bias: Data informed by stereotypes (example: a model predicts that the most-used phrases in Australia are “G’day Mate” and “put another shrimp on the barbie”)
Measurement Bias: Data tainted by a poor or improper measuring device (example: using a malfunctioning radar to measure average speeds on a highway)
Let’s move on to how to prevent data bias from skewing your results:
1. Decide on Your Goals
At the outset of your project, determine what you intend to gain from your model’s results. A panel of data experts convened by New America suggests setting clear goals and considering unintended consequences in the planning phase to ensure your model is achieving its goal. If you can’t establish clear goals for your use of a model… well, there’s a larger problem you’ll need to address before moving forward. Once you’ve decided how predictive modeling can help you answer your question…
2. Stay Involved
A McKinsey study states: “Business users would do better to view the application of machine-learning algorithms like the creation and tending of a garden. Much human oversight is needed.” Predictive modeling is not a set-it-and-forget-it approach; it’s a method for humans to gain and interpret insight. Don’t take results at face value. Plan post-modeling evaluation before taking any action based on the results.
With that understood…
3. Diversify Your Team
Detecting data bias is easier when you’re working with a team of analysts and technicians from a variety of backgrounds. A panel of experts convened by the Brookings Institution noted, “Bringing together experts from various departments, disciplines, and sectors will help facilitate accountability standards and strategies for mitigating online biases, including from engineering, legal, marketing, strategy, and communications.” For small companies or teams consisting of a solo data scientist, an email to colleagues or friends who have different backgrounds can highlight problems. Include a cute picture of a dog to increase your reply rate.
With your team in place…
4. Evaluate Your Instruments
To avoid Measurement Bias, it’s critical that the tools used to capture the data were functioning properly while collecting information. It’s not necessary to personally check measurement devices, but if something jumps out as strange, it’s worth investigating the issue and collection procedures.
If you are collecting data yourself:
- Monitor collection as it proceeds if your method allows for it.
- Periodically check on equipment at established checkpoints if your method allows.
- If blindness is important to your test, inspect the equipment before and after the collection period, and review records for anomalies.
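One way to "review records for anomalies" is a quick robust outlier scan. The sketch below is only an illustration, not a prescribed procedure: it flags readings whose modified z-score (based on the median and the median absolute deviation) exceeds a cutoff. The function name `flag_anomalies`, the sample radar speeds, and the 3.5 cutoff (a common convention) are all assumptions made for the example.

```python
from statistics import median


def flag_anomalies(readings, threshold=3.5):
    """Return indices of readings with a modified z-score above threshold.

    Uses the median and the median absolute deviation (MAD), which are
    less distorted by a few bad readings than the mean and standard deviation.
    """
    med = median(readings)
    mad = median(abs(r - med) for r in readings)
    if mad == 0:
        # At least half the readings are identical; flag anything that differs.
        return [i for i, r in enumerate(readings) if r != med]
    return [
        i for i, r in enumerate(readings)
        if abs(0.6745 * (r - med) / mad) > threshold
    ]


# Hypothetical highway speeds (mph) logged by a radar unit.
speeds = [62, 65, 61, 64, 63, 190, 66, 60]
print(flag_anomalies(speeds))  # prints [5]
```

A flagged reading is a prompt to check the device and the collection log, not proof that the reading is wrong.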
Once you’re confident you’re working with accurate data…
5. Be Skeptical of Your Data
If there’s a bias in your predictive model, the source is usually your data; you’re either missing something, or you included something that’s skewing the results. This is when Sample Bias appears.
An algorithm designed to predict the ideal pet built on a sample set that does not include any dog owners is unable to confirm that dogs are the ideal pet (disclaimer: the author of this post has a pro-dog bias).
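A simple way to catch this kind of gap is to compare each group's share of your sample with its share of the population you care about. The following is a minimal sketch under stated assumptions: `representation_gaps` is a hypothetical helper, and the pet-ownership shares are made-up numbers used only to show the idea.

```python
from collections import Counter


def representation_gaps(sample_groups, population_shares):
    """Return sample share minus population share for each group.

    Negative values mean the group is under-represented in the sample.
    """
    if not sample_groups:
        raise ValueError("sample_groups must not be empty")
    counts = Counter(sample_groups)
    total = len(sample_groups)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }


# Hypothetical survey of pet owners that contains no dog owners at all.
sample = ["cat"] * 70 + ["fish"] * 30
# Hypothetical reference shares for the full population of pet owners.
population = {"dog": 0.45, "cat": 0.35, "fish": 0.20}

for group, gap in representation_gaps(sample, population).items():
    print(f"{group}: {gap:+.2f}")
```

A group with a large negative gap, like the missing dog owners here, is one the model has little or no evidence about.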
Wharton statistics professor James Johndrow said, “Any biases that exist in the data, those things will eventually show up in the predictions.” Johndrow developed a process that removes certain information from a dataset that might result in racial or gender-based bias. This leaves a dataset that should be less likely to lead to discriminatory outcomes. Consider whether critical data is missing or whether data present in the sample is creating misleading results.
“If the data you work with has some inherent biases, the model will not only learn those biases but will end up amplifying them,” says Stas Sajin at Towards Data Science. Prevent the model from drawing conclusions that amplify stereotypes as a result of data bias.
If you set out to remove signifiers of ethnicity, gender, or another particular trait, don’t forget to watch for proxy data. These are data points that don’t directly state information, but stand in for that information. If a particular zip code has a majority Hispanic population, including address information in your sample could serve the same function as ethnicity and lead to a similarly biased result. Consider using census data instead; this can help to generalize your findings to more universal characteristics that any zip code could have.
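To get a rough sense of whether a field is acting as a proxy, you can ask how much better you could guess the sensitive attribute once you know that field. The sketch below is one simple way to do that, not a standard test: `proxy_strength` is a hypothetical helper, and the ZIP codes and group labels are invented.

```python
from collections import Counter, defaultdict


def proxy_strength(feature_values, protected_values):
    """Return (baseline, informed) accuracy for guessing the protected attribute.

    baseline: always guess the most common protected value overall.
    informed: guess the most common protected value within each feature value.
    """
    n = len(feature_values)
    baseline = Counter(protected_values).most_common(1)[0][1] / n

    by_value = defaultdict(Counter)
    for feature, protected in zip(feature_values, protected_values):
        by_value[feature][protected] += 1
    informed = sum(c.most_common(1)[0][1] for c in by_value.values()) / n
    return baseline, informed


# Hypothetical records: ZIP code alongside a sensitive attribute.
zips = ["03801"] * 4 + ["03802"] * 4
groups = ["A", "A", "A", "B", "B", "B", "B", "A"]

print(proxy_strength(zips, groups))  # prints (0.5, 0.75)
```

A large jump from baseline to informed accuracy suggests the field carries much of the same information as the attribute you removed. Because the measure is computed on the same rows it describes, it will look inflated for fields with many distinct values, so treat it as a screening signal only.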
When you’ve decided what data to include…
6. Balance Your Data
It is critical to ensure that you’re working with balanced data. Make sure populations are not over- or under-represented. This is where Prejudice Bias can unconsciously work its way into your data.
When compiling a set of data that measures activities commonly conducted by either men or women, you might consider oversampling women working in traditionally male roles, or vice versa, to prevent the model from extrapolating towards a biased outcome.
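Oversampling can be as simple as duplicating randomly chosen rows from the smaller groups until every group is the same size. Here is a minimal sketch of random oversampling with replacement; the `oversample` helper, the `gender` field, and the example rows are assumptions made for illustration.

```python
import random
from collections import defaultdict


def oversample(rows, group_key, seed=0):
    """Randomly duplicate rows so every group matches the largest group's size."""
    rng = random.Random(seed)  # fixed seed keeps the result repeatable
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)
        # Draw extra rows with replacement to close the gap (k may be 0).
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Hypothetical training rows for a traditionally male role.
rows = [{"gender": "m", "role": "engineer"}] * 8 + [{"gender": "f", "role": "engineer"}] * 2
balanced = oversample(rows, "gender")
print(len(balanced))  # prints 16
```

Apply this only to the data used to build the model, after any holdout set has been split off; otherwise duplicated rows can land in both sets and make test results look better than they are.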
Using your balanced data…
7. Build the Model Yourself
Ensure your predictive model is bias-free by building it yourself! It’s much less complicated than it sounds, using today’s intuitive predictive modeling tools. Rapid Insight has a user-friendly interface that anyone can use to build a predictive model while ensuring that bias doesn’t taint its utility.
You’re able to control exactly which inputs and variables the model is considering when making its decisions. Unlike working with consultants or using pre-built models, you have the ability to tweak your model at any step in the process and adjust for or remove parts of your data set that you observe introducing bias into your outcomes.
As reported in Towards Data Science: “A lot of algorithms are also a black box, you just use it ready-made, meaning that if you’re not directly developing the algorithm you have no idea what’s in there, surely you won’t be able to assess whether they’re unbiased or not.” Without first-hand knowledge of the construction of your model or the ability to make changes to it, you can’t identify what may be causing a problem.
With a self-built model in place…
8. Deploy Dummy Data Before Implementation
Work with a set of “dummy data” before implementing the algorithm. If the dummy data outputs a prediction biased towards a particular ethnicity, gender, or other group, evaluate the data and your algorithm before proceeding.
Test your models on holdout data and comparable data from external sources. Running simulations on the same dataset that built the model has inherent limitations, and often leads to the bias you’re hoping to prevent. A TechCrunch blog suggests, “you should be simulating real-world applications as much as possible when building algorithms.”
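The holdout check can be sketched in a few lines: set aside rows the model never sees during building, score them, and compare how often each group receives a favorable prediction. Everything named here (`train_holdout_split`, `selection_rates`, `disparity_ratio`, and the sample predictions) is a hypothetical illustration rather than a feature of any particular tool.

```python
import random
from collections import defaultdict


def train_holdout_split(rows, holdout_fraction=0.2, seed=0):
    """Shuffle rows and set aside a fraction the model never trains on."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * holdout_fraction)
    return shuffled[:cut], shuffled[cut:]


def selection_rates(predictions, groups):
    """Return the share of favorable (1) predictions for each group."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred)
    return {group: positives[group] / totals[group] for group in totals}


def disparity_ratio(rates):
    """Lowest group rate divided by the highest; 1.0 means equal rates."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest else 1.0


# Hypothetical model outputs on holdout rows (1 = recommended for hire).
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["m", "m", "m", "m", "f", "f", "f", "f"]

rates = selection_rates(predictions, groups)
print(rates)
print(round(disparity_ratio(rates), 2))
```

A ratio well below 1.0 means one group is being favored and the data and model deserve another look. US employment guidance often cites a four-fifths (0.8) rule of thumb for this kind of ratio, but the right threshold depends on your context.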
With testing complete…
9. Use Your Algorithm to Remove Bias
Rather than removing data points that could lead to bias (which may not work in some cases), another approach is to deploy the model on the data and evaluate biases it displays. Then, program the model to offset the impact of whatever is leading to the biased outcome. Towards Data Science suggests a reliable method.
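One published technique for offsetting a skew like this is reweighing: each training row gets a weight so that, after weighting, group membership and the outcome look statistically independent. This may or may not be the method the article had in mind; the sketch below simply shows the idea, with `reweighing_weights` and the example hiring history as assumptions.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Return one weight per row: P(group) * P(label) / P(group, label).

    Rows from (group, label) pairs that are rarer than independence would
    predict get weights above 1; over-represented pairs get weights below 1.
    """
    n = len(groups)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    pair_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * pair_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


# Hypothetical hiring history: 1 = hired, 0 = not hired.
groups = ["m"] * 6 + ["f"] * 4
labels = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

weights = reweighing_weights(groups, labels)
for g, y, w in zip(groups, labels, weights):
    print(g, y, round(w, 2))
```

In this example the weighted hire rate works out to the same value for both groups. Many model-training tools accept per-row weights, so the output can be supplied as training weights where your tool supports them. Re-run your bias checks afterward, since reweighing changes what the model learns and does not guarantee fair predictions.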
Predictive modeling can even identify biases that exist in human decision-makers. The HR consultancy SHL highlighted this story about a model that produced surprising results: “A recent model created for a call center representative role revealed that candidates with call center experience were actually likely to perform poorly. This is counter-intuitive, but technology quickly spotted this, whereas the human mind would be unlikely to do so… what gets a person hired is not always what makes them good at their job. But algorithms, unlike people, are good at identifying the difference between the two.”
So the next time someone accuses you of being unqualified, fire back by asking if they ran their accusation through a predictive model first.
It is incumbent upon anyone employing predictive modeling to be thoughtful with this powerful tool. When used correctly, predictive modeling can reduce the influence of data bias, leading to fairer outcomes and quantifiably justified results.
Rapid Insight’s platform for predictive modeling can help you reach your goals. Licenses include free, unlimited support from our dedicated team of data analysts to help guide you in the right direction.