Bias exists in all data and influences critical decisions every day. Bias in data can incite social outrage, cause discrimination, produce incorrect analytics, and yield monetary losses, even within top companies in the tech industry. Considering the far-reaching impacts of bias in data, understanding its origins and how to address it enables organizations to better serve customers, minimize expenses, and maximize efficiency. Bias can enter data in numerous places: how the data is interpreted, who is looking at it, and where it comes from, to name a few. Ironically, one of the key sources of bias in data also points to its solution: bias often originates in a lack of equity and diversity within data management, and restoring both is essential to addressing it.
Where Does Data Bias Occur?
Technology Innovation: A Real-World Example
One common area where data bias occurs is technology innovation, as illustrated by an example from Tanya Hannah, the Director and Chief Information Officer of King County, Washington. In 2018, King County created a senior tax relief program to help alleviate the financial burden of high housing taxes. Relying on historical data to inform the design and implementation of a sign-up process, analysts concluded that residents were likely to use and be comfortable with new technologies. As a result, data engineers built a sign-up portal that featured state-of-the-art technology. After the portal launched, the King County analysts performed an impact review and, much to their surprise, found that older residents were not interested in the technology and instead cared mainly about being able to submit their materials.
There was clearly a gap between the initial analysis, which suggested that new technology would be preferable, and the outcome of the impact review. The gap arose because the analysts failed to account for population bias in the historical data used for planning: that data described county residents as a whole, while the program served only seniors. Older residents found the portal difficult to use and had trouble navigating the system and registering their information because they were less familiar with new technologies. Failing to recognize the bias inherent in using a full dataset to inform a decision that would affect only a subset of the population left King County’s software effectively useless in practice.
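The effect of population bias can be shown with a small sketch. The records, the age cutoff, and the names below are hypothetical and are not King County data; the point is only that a rate computed over a full dataset can differ sharply from the same rate computed over the subset a program actually serves.

```python
# Hypothetical survey records: (age, comfortable_with_new_technology).
# These values are invented for illustration only.
records = [
    (28, True), (34, True), (41, True), (45, True), (52, True),
    (38, True), (59, False), (67, False), (71, False), (74, True),
    (80, False), (69, False),
]

SENIOR_AGE = 65  # assumed cutoff for the target population


def comfort_rate(rows):
    """Share of respondents who report being comfortable with new technology."""
    rows = list(rows)
    return sum(1 for _, comfortable in rows if comfortable) / len(rows)


# Rate over everyone in the historical data.
overall = comfort_rate(records)

# Rate over only the people the program is meant to serve.
seniors = comfort_rate(row for row in records if row[0] >= SENIOR_AGE)

print(f"All residents: {overall:.0%} comfortable")
print(f"Seniors only:  {seniors:.0%} comfortable")
```

In this made-up sample, a majority of all respondents are comfortable with new technology while only a minority of seniors are, so a decision based on the overall figure would not fit the target group.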
In the end, the county had built a fancy (and expensive) portal when a simpler data collection and registration method would have been at least as effective. Because engineers did not consider seniors when planning the original software, the county wasted money and time developing an unnecessary tool. Promoting equity within business decisions or, in King County’s case, attending to the demographic differences between the historical data and the target population would have allowed the county to develop a solution that fit the needs of the program’s participants at lower cost and with greater efficiency.
Why Does This Matter?
Because King County did not account for the unintentional bias created by the mismatch between its historical data and its target population, it spent more money, had a difficult program launch, and ended up with less satisfied residents. Similar issues arise throughout the technology development industry, where historical data is often relied upon to inform updates to outdated technology. Without careful consideration of the biases within historical data and a clear view of how they affect target populations, this approach can lead to more expensive, less efficient outcomes.
Artificial Intelligence & Machine Learning: A Real-World Example
Racial bias in data has come under scrutiny in recent years. In one high-profile case in 2016, the nonprofit news organization ProPublica analyzed COMPAS, risk assessment software used to forecast which defendants were more likely to become repeat offenders. ProPublica found that the widely used software was racially biased: it falsely flagged Black defendants as high risk at a higher rate than white defendants, while white defendants were more often incorrectly identified as low risk. After multiple Black defendants were falsely identified as high risk and the results were used as evidence in court, the Wisconsin Supreme Court expressed hesitation in its ruling about using the software in the future unless its limitations are highlighted.
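One way to make this kind of disparity concrete is to compare error rates across groups. The sketch below is not ProPublica's methodology and does not use COMPAS data; the records, group labels, and function name are hypothetical. It computes, for each group, the false positive rate (people flagged high risk who did not reoffend) and the false negative rate (people labeled low risk who did reoffend).

```python
from collections import Counter

# Hypothetical records: (group, predicted_high_risk, reoffended).
# Invented for illustration; not real case data.
records = [
    ("group_a", True, False), ("group_a", True, False), ("group_a", True, True),
    ("group_a", False, False), ("group_a", False, True), ("group_a", True, True),
    ("group_b", False, False), ("group_b", False, False), ("group_b", True, False),
    ("group_b", False, True), ("group_b", False, True), ("group_b", True, True),
]


def error_rates(rows):
    """Return (false_positive_rate, false_negative_rate) for the given rows.

    Assumes the rows contain at least one person who reoffended and at
    least one who did not; otherwise a rate would be undefined.
    """
    counts = Counter()
    for _, predicted_high, reoffended in rows:
        if reoffended:
            counts["tp" if predicted_high else "fn"] += 1
        else:
            counts["fp" if predicted_high else "tn"] += 1
    fpr = counts["fp"] / (counts["fp"] + counts["tn"])
    fnr = counts["fn"] / (counts["fn"] + counts["tp"])
    return fpr, fnr


groups = sorted({group for group, _, _ in records})
rates = {g: error_rates([r for r in records if r[0] == g]) for g in groups}

for group, (fpr, fnr) in rates.items():
    print(f"{group}: false positive rate {fpr:.2f}, false negative rate {fnr:.2f}")
```

In this invented sample, one group is wrongly flagged as high risk twice as often as the other, while the other is wrongly labeled low risk twice as often. A gap of that shape is a signal to investigate, not proof of its cause.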
One approach experts suggest for curbing racial bias is to involve the public and people of color in the creation and application of such algorithms. In a study from Columbia University accepted to the Navigating the Broader Impacts of AI Research Workshop at the 2020 Conference on Neural Information Processing Systems (NeurIPS), researchers concluded that biased predictions are most often caused by imbalanced data but that the demographics of engineers also play a role. When an algorithm is created, especially an artificial intelligence (AI) system that learns and changes based on its environment, incorporating different views, types, and sources of data is critical to achieving less biased machine learning (ML). The environment of the data encompasses who is using the data, who is creating the data, where the data is coming from, and every step of the data management and analytics process.
If more Black technologists and analysts had been included in the development of the COMPAS software, more Black criminal justice investigators had been using it, more Black engineers had been coding it, and more users had been considering racial bias, there likely would have been less bias in the software and its predictions. Decreasing racial bias in AI systems requires a transparent approach that promotes diversity, whether by opening algorithms to public review as open source or by gathering input from individuals with different views who can understand, monitor, and suggest improvements to them.
Why Does This Matter?
Underlying racial biases within AI systems are detrimental to society, customers, constituents, members, and companies as well as modern technology itself. From companies at the top of the tech industry like Google and Facebook to AI systems like COMPAS, technologies can suffer from racial data bias, causing skewed search results, incorrect forecasts, and ultimately mimicking, creating, and modeling biases that result in end users making misinformed decisions.
Beating Bias With Equity & Diversity
While it is difficult to fully eradicate data bias considering the degree to which AI and data are built into complex social systems, there are several crucial steps and actions businesses can take to mitigate bias and avoid its negative impacts on customers, society, and the bottom line.
- Hire a diverse team. A diverse team pushes individuals to think through issues more thoroughly, include different views, and reach a more comprehensive outcome. Diverse teams lead to less data bias because members can talk through issues together, which allows different views to be incorporated into algorithms and AI systems (such as the COMPAS software in the ProPublica case) to help mitigate data bias.
- Structure data analysis to allow for different opinions. How data is analyzed presents another important opportunity to mitigate bias. There are often several different yet valid ways to view a single dataset, many of which are missed by teams drawn predominantly from one demographic group. Structuring teams and analysis processes so that analysts from different backgrounds can share their interpretations and analyses broadens the scope of possible methods and interpretations, thereby reducing the likelihood of bias in the data and data processes.
- Include data from diverse sources and use diverse datasets. Regardless of how data is analyzed and interpreted by a diverse team, without the right data, diverse datasets, and enough data, bias will inevitably creep into the dataset. Incorporating data from diverse sources and utilizing datasets that represent a diverse population for a more comprehensive, 360-degree view mitigates the risks of different biases within and between individual datasets and ensures that data models are more flexible and representative.
- Consider ALL end users and ALL of those affected. Data must relate to the target population. The decisions being made based on the data may be directed at particular segments of a population, in which case attention should be paid to the data from members of those segments. In the case of King County, doing so would have prevented the county from spending time and money developing a tool that ultimately did not meet the needs of the target population. On the other hand, if data is being used to make decisions that affect large and diverse groups of people, the data used should be representative of that diversity.
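The last two points can be checked with a simple comparison of who is in a dataset against who will be affected by the decision. The following is a minimal sketch under stated assumptions: the age bands, sample counts, target shares, and the five-percentage-point threshold are all hypothetical, and `representation_gaps` is an illustrative helper, not a standard library function.

```python
from collections import Counter


def representation_gaps(sample_groups, target_shares):
    """Return each group's share in the sample minus its share in the target.

    A negative value means the group is underrepresented in the sample
    relative to the population the decision will affect.
    """
    counts = Counter(sample_groups)
    total = len(sample_groups)
    return {
        group: counts[group] / total - share
        for group, share in target_shares.items()
    }


# Hypothetical historical dataset, labeled by age band.
sample = ["18-39"] * 50 + ["40-64"] * 40 + ["65+"] * 10

# Hypothetical makeup of the population the program is meant to serve.
target = {"18-39": 0.10, "40-64": 0.20, "65+": 0.70}

gaps = representation_gaps(sample, target)

# Flag groups that fall short by more than an assumed 5 percentage points.
underrepresented = [group for group, gap in gaps.items() if gap < -0.05]

for group, gap in gaps.items():
    print(f"{group}: {gap:+.0%} relative to target population")
print("Underrepresented:", underrepresented)
```

A check like this does not remove bias on its own, but it shows early when the data behind a decision does not resemble the people the decision is for.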
Bias is nearly impossible to eradicate from decisions and data entirely, but being conscious of diversity, equity, and inclusion can help restrain the effects of bias while also saving decision makers time and money.