Is Prejudice Infecting Data? How Human Problems Become Technological Problems

Data is being called the new oil; can its mining have just as many repercussions? This information may be compromised by the prejudice of the humans it's collected from.

It is easy to assume that computer technology's cold, logical neutrality would free it from the prejudices of humankind. Unfortunately, this is not the case. In fact, in some ways, machine learning programs and similar systems are more at risk of bias than people are, because the logical patterns they build are derived entirely from human-generated data.

Computer “thinking” is based on data mined from people. With the increasing value of machine learning technology, the data that feeds it is becoming more and more lucrative. In fact, some have begun to dub data “the new oil,” not only because it is the fuel source for a major commodity, but also because both its mining and its aftereffects have far-reaching consequences.

How Human Bias Influences Data

Algorithms built to mimic learning and decision-making do so by processing large quantities of data gathered from human users. By identifying patterns in that data, they can do things like surface common preferences or even mimic human behaviors. These algorithms have a wide range of applications for companies, from lead generation based on targeted marketing to more sophisticated artificial intelligence operations.
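To make that concrete, here is a minimal sketch of pattern identification, using hypothetical click data invented for illustration: events are tallied to surface the most common preference, the same basic idea behind targeted marketing and recommendation systems.

```python
from collections import Counter

# Hypothetical click records, invented for illustration.
events = [
    {"user": "a", "clicked": "running shoes"},
    {"user": "b", "clicked": "running shoes"},
    {"user": "c", "clicked": "dress shoes"},
    {"user": "d", "clicked": "running shoes"},
]

# Tally how often each item appears across all users.
preferences = Counter(event["clicked"] for event in events)

# The "learned" pattern is simply the most frequent preference,
# and it is only as representative as the users who generated it.
top_item, count = preferences.most_common(1)[0]
print(f"Most common preference: {top_item} ({count} of {len(events)} events)")
```

Whatever pattern the tally surfaces is inherited directly from the users in the dataset; if their behavior is skewed, the "learned" preference is skewed with it.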

Bias is a component of the human thought process, and because data is generated by humans, it inherently carries that bias. This makes it incredibly difficult to gather and adjust data so that it omits bias while retaining its accuracy, especially since determining what counts as a negative bias is often subjective.

Bias in Data Collection and Data Analysis Methods

The source material is not the only means through which bias can enter data. Data collection and data analysis methods can also introduce bias. There are many mistakes that can negatively impact the data, including:

  • Confirmation Bias: Confirmation bias is an error that involves allowing a preconceived notion to impact how you prioritize or interpret information. An example of confirmation bias would be if you had a strong opinion that most people preferred vanilla ice cream over chocolate ice cream and, as a result, gave more weight to data that supported that conclusion.
  • Selection Bias: Selection bias is an error that involves using population samples that don’t accurately represent everyone in the target group. For example, data taken from one neighborhood would not be an accurate data sample to represent the city at large. Selection bias can be caused by many mistakes, including voluntary participation, limiting factors for participation, insufficient sample size, or intentional bias.
  • Poor Interpretation of Outliers: Outliers can significantly skew data. For example, when analyzing income in the United States, a few extremely wealthy individuals can pull the mean far above what a typical earner makes. For this reason, the median wage is often a more accurate representation of the larger population. The sketch after this list demonstrates both this outlier effect and selection bias.
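Both selection bias and outlier skew are easy to demonstrate in a few lines. The sketch below uses a simulated income dataset in which all figures are invented: sampling only the wealthiest residents stands in for surveying a single affluent neighborhood, and a handful of extreme earners shows why the median resists distortion better than the mean.

```python
import random
import statistics

random.seed(0)

# Hypothetical city of 10,000 residents (all figures simulated):
# most earn modest incomes, but ten extremely wealthy outliers do not.
city = [random.gauss(50_000, 12_000) for _ in range(9_990)]
city += [random.uniform(5_000_000, 50_000_000) for _ in range(10)]

# Selection bias: surveying only a wealthy neighborhood instead of
# the whole city produces an unrepresentative sample.
wealthy_neighborhood = sorted(city)[-500:]   # biased sample
random_sample = random.sample(city, 500)     # representative sample

print(f"Biased sample mean: {statistics.mean(wealthy_neighborhood):,.0f}")
print(f"Random sample mean: {statistics.mean(random_sample):,.0f}")

# Outlier skew: a few huge incomes pull the city-wide mean far above
# what a typical resident earns, while the median stays representative.
print(f"City-wide mean:     {statistics.mean(city):,.0f}")
print(f"City-wide median:   {statistics.median(city):,.0f}")
```

The biased sample overstates typical income by orders of magnitude, and even over the full city the mean lands well above the median, which stays close to what most residents actually earn.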

Ethical questions about the collection of data are increasingly being raised by the public, especially where consumer privacy is concerned. While consumer data can be utilized in CRM systems and similar technology to improve the customer experience, it can also be exploited by companies purely for their own gain.

In fact, concern is widespread enough that laws and regulations have been enacted on the subject across the globe, such as the European Union's General Data Protection Regulation (GDPR). Those who want to practice ethical data mining may therefore find it helpful to seek out businesses that comply with the GDPR and similar codes.

Examples of Data Bias in A.I.

The impact of biased data on applications such as A.I. is not always theoretical, or even subtle. A famous example is Tay, a chatbot Microsoft released in 2016 that used artificial intelligence to compose and post Tweets. Soon after going live, Tay began posting Tweets with concerning content, much of it discriminatory in nature.

After deactivating Tay, the Microsoft team released a statement about the incident, pointing to Twitter users who intentionally spammed Tay's conversational threads with inflammatory statements as the source of the behavior: Tay mined those threads for data that shaped its output. Although the incident was at least partly the result of deliberate sabotage by users, it illustrates how discriminatory thinking can take form in the data that is increasingly woven into our day-to-day lives.

The Impact of Biased Data

Because data-driven technology is now so pervasive, biased data can have a wide range of consequences. Some are the obvious problems inherent in marketing a faulty product. However, there may also be more complex social repercussions.

As stated previously, machine learning can be even more susceptible to bias than humans are. The advantage humankind has over machine learning is that humans, at least in groups, are capable of cultural evolution, which provides some check on prejudice. However, if we constantly feed prejudice back into our cultural consciousness through the vehicle of data-driven technology, prejudices may be subconsciously reinforced and that natural social stabilization process stunted.
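A toy simulation can illustrate that feedback loop. In the sketch below, every parameter is invented for illustration: a model that amplifies the majority view in its training data feeds its output back into the next round of data, and a modest 55/45 skew compounds toward near-total uniformity within a few generations.

```python
# Toy simulation of the feedback loop described above; every number
# here is invented for illustration, not drawn from a real system.

def retrain(bias: float, feedback_weight: float = 0.5) -> float:
    """Blend existing data with the model's own amplified output."""
    model_output = 1.0 if bias > 0.5 else 0.0  # the model parrots the majority view
    return (1 - feedback_weight) * bias + feedback_weight * model_output

bias = 0.55  # a modest initial skew in the training data
for generation in range(6):
    print(f"Generation {generation}: bias = {bias:.2f}")
    bias = retrain(bias)
```

Real systems are far more complex, but the dynamic is the same: without an external correction, the loop reinforces whatever skew the data started with.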