Data Bias Is a People Problem

Learn how prejudice can influence the logic in data‑driven technology.

It’s easy to think that computer technology’s neutral logic would free it from the prejudices of humankind. However, in some ways, machine-learning programs and similar initiatives are more at risk of bias than people are, because they derive their "logic" entirely from patterns in the data they are given.

Computer “thinking” is based on data mined from people. With the increasing value of machine learning technology, the data that feeds it is becoming more and more lucrative. In fact, some have begun to dub data the “new oil,” not only because it’s the fuel source for a major commodity, but also because both its mining and the aftereffects have far-reaching consequences.

How human bias influences data

Algorithms built to mimic the processes of learning and drawing conclusions do so by processing data gathered from human users. Massive amounts of data are processed to identify patterns, which algorithms can then use to do things like identify common preferences or even mimic human behaviors. These algorithms have a wide range of applications for companies, from lead generation based on targeted marketing to more sophisticated artificial intelligence operations.

Bias is a component of the human thought process, and data collected from humans therefore inherently reflects that bias. This makes it incredibly difficult to gather and adjust data so that it omits bias while retaining its accuracy—especially since the determination of what constitutes bias is often subjective.

Bias in data collection and data analysis

The source material is not the only means through which bias can enter data. Data collection and data analysis methods can also introduce it. There are many biases that can negatively impact the data, including:

  • Confirmation bias: Confirmation bias is an error that involves allowing a preconceived notion to impact how you prioritize or interpret information. An example of confirmation bias would be if you had a strong opinion that most people preferred vanilla ice cream over chocolate ice cream and, as a result, gave more weight to data that supported that conclusion.
  • Selection bias: Selection bias is an error that stems from using population samples that don’t accurately represent the entire target group. For example, data taken from one neighborhood would not accurately represent a large city. There are many reasons selection bias arises—some intentional, some not—including voluntary participation, limiting factors for participation, or insufficient sample size.
  • Poor interpretation of outliers: Outliers can significantly skew data. For example, when analyzing income in the United States, there are a few extremely wealthy individuals whose income can warp any calculation of averages. For this reason, a median value is often a more accurate representation of the larger population.
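The income example above is easy to see in a few lines of code. The sketch below uses hypothetical income figures (not real data) to show how a single extreme value drags the mean far from what a "typical" member of the group earns, while the median stays put:

```python
# Hypothetical incomes in USD; the last value is an extreme outlier.
incomes = [32_000, 41_000, 45_000, 38_000, 52_000, 47_000, 5_000_000]

# Mean: sum of all values divided by the count.
mean_income = sum(incomes) / len(incomes)

# Median: the middle value once the data is sorted
# (or the average of the two middle values for an even count).
sorted_incomes = sorted(incomes)
mid = len(sorted_incomes) // 2
if len(sorted_incomes) % 2:
    median_income = sorted_incomes[mid]
else:
    median_income = (sorted_incomes[mid - 1] + sorted_incomes[mid]) / 2

print(f"Mean:   ${mean_income:,.0f}")    # inflated by the single outlier
print(f"Median: ${median_income:,.0f}")  # close to a typical income
```

Here the mean lands above $750,000 even though six of the seven people earn under $55,000; the median, $45,000, is a far better summary of the group. This is why median income, not mean income, is the standard figure in population statistics.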

Ethics in data collection

Ethical matters regarding the collection of data are increasingly being raised by the public, especially where consumer privacy is concerned. While consumer data is used by CRM systems and similar technology to improve customer experience, companies can also use, buy, or sell such data in ways that bump up against the edge of what’s legal or ethical, eroding consumer trust across the board.

In fact, there is such widespread concern that many laws and regulations have been enacted on the subject across the globe, such as the European Union’s General Data Protection Regulation (GDPR). Those who want to work ethically with mined consumer data may find it helpful to seek out businesses that are compliant with GDPR and/or similar codes.

Data bias in AI

The impact of biased data on applications such as artificial intelligence is not always theoretical, or even subtle. A famous example is Microsoft’s Tay, a chatbot released by Microsoft in 2016 that used AI to generate and post its own tweets on Twitter. Soon after going live, Tay began tweeting concerning content, much of it discriminatory in nature.

After deactivating Tay, the Microsoft team released a statement about the incident. This statement pointed to Twitter users intentionally spamming Tay’s conversational threads with inflammatory statements as the source of its behavior. Tay used those threads as a means of data mining to influence its output. Although this incident was at least partially caused by intentional sabotage from users, it illustrates how discrimination can take form in the data that is increasingly being put to work in our day-to-day lives.

The impact of biased data

Because data-driven technology is now so omnipresent, biased data can have a wide range of consequences, including complex social repercussions. If we are constantly feeding prejudices back into our cultural consciousness through the vehicle of data-driven technology, these prejudices may be subconsciously reinforced, creating a loop we can only break with concerted effort. The advantage that humans have over machine learning is that humans, at least in groups, have the capacity for cultural evolution, providing some level of checks and balances against prejudice.
