In our daily lives, we are constantly making tiny decisions: what to eat, what to wear, and so on. Whether we realize it or not, all of these decisions are based on data. The kind and amount of data we have determine how confident we can be in our decisions.
This raises the question of how bad data influences our decision-making.
Bad data still wears the guise of data. It therefore instills a false sense of confidence, and the decisions made with it are effectively uninformed. Every step that follows suffers from this “phantom error”, which takes considerable resources to identify and correct.
What is “bad data”?
First we need to understand what bad data is, because you cannot expect to find something you do not understand in the first place.
A basic definition of data is “a series of observations and measurements”. Bad data is therefore born when those observations and measurements are done poorly: subpar observations and measurements lead to a set of bad data.
Key Reasons for Bad Data
Confirmation bias
Confirmation bias is the tendency to interpret data so that it fits our preconceived notions. It is very common in daily life: our minds make any information that confirms what we already believe look more significant, even if the data itself is flimsy. It can render good data useless and exaggerate the prominence of bad data.
A simple example is a student preparing a dissertation who only gathers data that fits their conclusion. Suppose they are writing about the biological effects of minerals in naturally occurring water, and they already believe that minerals in water can only be a good thing. All the data they gather will then be about the positive effects, and any negative effects will be ignored. Readers of the dissertation may make uninformed decisions about water consumption, which is a health risk.
Uncalibrated equipment and/or methodologies
The resources we use to acquire data should always be subject to scrutiny and maintenance, because every instrument and method has an inherent error associated with it. Identifying and compensating for such errors is therefore of great importance.
Using data from uncalibrated resources introduces an error into our data, and that error is compounded with every new decision we base on it. For example, climate models need to be constantly adjusted based on incoming data; if that data is inaccurate, so are the models built on it. The result is bad data undermining our efforts to fight climate change.
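To make the idea of compensating for a known instrument error concrete, here is a minimal sketch of a two-point linear calibration. It assumes the instrument's error is linear (a gain and an offset), which is not always true; the function name and the thermometer readings are hypothetical and chosen only for illustration.

```python
def two_point_calibration(raw_low, raw_high, ref_low, ref_high):
    """Build a correction function from two reference measurements.

    raw_low / raw_high: what the instrument reported at the two references.
    ref_low / ref_high: the known true values at those references.
    Assumes the instrument error is linear (gain plus offset).
    """
    if raw_high == raw_low:
        raise ValueError("the two raw reference readings must differ")
    gain = (ref_high - ref_low) / (raw_high - raw_low)
    offset = ref_low - gain * raw_low

    def correct(raw):
        # Map a raw reading onto the reference scale.
        return gain * raw + offset

    return correct


# Hypothetical thermometer: reads 1.5 in ice water (true value 0.0)
# and 98.5 in boiling water (true value 100.0).
correct = two_point_calibration(1.5, 98.5, 0.0, 100.0)
print(round(correct(25.0), 2))
```

Every later reading is passed through `correct` before it is used, so the known error is removed once instead of being carried into each decision that follows.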
P-hacking
P-hacking is an example of how a good data set can be turned into bad data. It is almost always a deliberate attempt to manipulate data so that an outcome looks statistically significant. It is done by analyzing the data in several different ways until the desired outcome appears.
For example, someone might try to show that coffee can cure cancer. They sift through large sets of data repeatedly until they find a subset that seems to confirm it. Even though that subset is a very small sample and statistically insignificant, the finding is exaggerated by extrapolation and scaling, which makes it look more probable than it is. This may result in poor healthcare decisions based on, and reinforced by, this bad data.
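Why repeated analysis produces a "finding" so reliably can be shown with a little arithmetic. The sketch below is an illustration, not a model of any real study: it assumes each analysis is independent and that the data is pure chance (a fair coin), and all function names are hypothetical.

```python
import math
import random


def binomial_tail(n, k, p=0.5):
    """Probability of k or more successes in n independent trials."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def chance_of_false_positive(num_analyses, alpha):
    """Chance that at least one of several independent analyses of pure
    noise crosses the significance threshold alpha."""
    return 1 - (1 - alpha) ** num_analyses


def simulate_subgroups(num_subgroups, flips=20, threshold=15, seed=0):
    """Count subgroups of fair coin flips that look 'significant' by luck."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_subgroups):
        heads = sum(rng.random() < 0.5 for _ in range(flips))
        if heads >= threshold:
            hits += 1
    return hits


# One honest analysis at the conventional 5% threshold.
print(round(chance_of_false_positive(1, 0.05), 2))   # 0.05
# Twenty different slices of the same noise.
print(round(chance_of_false_positive(20, 0.05), 2))  # 0.64
# How often 15+ heads out of 20 happens with a fair coin.
print(round(binomial_tail(20, 15), 4))               # 0.0207
# Number of "lucky" subgroups among 100 groups of pure chance.
print(simulate_subgroups(100))
```

With twenty ways of slicing the data, a result that looks significant is more likely than not to appear even when there is no real effect, which is why reporting only the slice that "worked" produces bad data.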
Noise
Noise is an unavoidable part of acquiring data: random fluctuations, and the occasional outright outlier, within an otherwise predictable stream of data. The larger a data set grows, the more noisy points it contains, so noise needs to be identified and filtered; otherwise it can skew the true picture of the data set.
Noise tends to occur most often when data is acquired by measurement. For example, when recording the temperature at a specific point, the temperature of the sensor itself fluctuates. These fluctuations are superimposed on the temperature being measured, and the longer we record, the more noisy readings accumulate in the data.
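One simple way to suppress isolated spikes in a stream of readings is a moving median filter. This is only one of many de-noising techniques and is offered here as a sketch; the readings are invented, and the window size is an assumption that would need tuning for a real sensor.

```python
from statistics import median


def median_filter(readings, window=3):
    """Replace each reading with the median of its neighbourhood.

    The window shrinks at the edges so the output has the same length
    as the input.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    filtered = []
    for i in range(len(readings)):
        lo = max(0, i - half)
        hi = min(len(readings), i + half + 1)
        filtered.append(median(readings[lo:hi]))
    return filtered


# Hypothetical temperature log with one spurious spike (35.0).
temps = [21.0, 21.1, 20.9, 35.0, 21.0, 21.2, 21.1]
print(median_filter(temps))
```

The spike at 35.0 is replaced by a value close to its neighbours, while the ordinary readings are left almost unchanged. A median is used rather than a mean because a single extreme value cannot drag it far.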
Random errors and failures
Random errors and failures can occur at any time and must therefore be taken into account. They can arise, for instance, during data transfer between two devices, where some of the data may be altered or lost completely. Data damaged in this way is unusable and cannot support any decision.
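A common safeguard against data being altered in transit is to send a checksum along with the data and recompute it on arrival. The sketch below uses SHA-256 from Python's standard `hashlib` module; the payload format and the function names are hypothetical. Note that a checksum only detects corruption, it does not repair it.

```python
import hashlib


def checksum(payload: bytes) -> str:
    """Return the SHA-256 digest of the payload as a hex string."""
    return hashlib.sha256(payload).hexdigest()


def verify(payload: bytes, expected: str) -> bool:
    """Check that a received payload matches the digest sent with it."""
    return checksum(payload) == expected


# The sender computes a digest before transmitting (hypothetical payload).
sent = b"temperature=21.4;unit=C"
digest = checksum(sent)

# The receiver recomputes the digest and compares.
received_ok = b"temperature=21.4;unit=C"
received_bad = b"temperature=27.4;unit=C"  # one character changed in transit
print(verify(received_ok, digest))   # True
print(verify(received_bad, digest))  # False
```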
Real Life Examples of Bad Data and Bad Decisions that Followed
- Tetraethyllead in Petrol
In the 1920s, in a bid to increase engine reliability and efficiency, a lead compound (tetraethyllead) was added to petrol. This led to a sharp increase in lead poisoning. To shift the blame away from leaded fuel, scientists were hired to produce studies showing that lead occurs naturally within the human body. This bad data convinced the public that the fuel was safe to use. It remained in use for the next 50 years, resulting in countless deaths from lead poisoning.
- V-2 Rocket in World War 2
Towards the end of the Second World War, in 1944, Nazi Germany had developed the world’s first guided ballistic missile for use against large cities in the United Kingdom. To blunt its effect, the Allies ran a disinformation campaign that fed the Germans false reports about where the rockets were landing, claiming they were off by a few miles. This bad data led the Germans to recalibrate their equipment to compensate for an error that did not exist, with the result that the missiles missed their intended targets.
- Coca-Cola’s “New Coke”
A new formula of Coke was created in the 1980s and blind-tested with 200,000 volunteers. The volunteers consistently chose New Coke over both the old Coke and Pepsi. The executives at Coca-Cola focused on this single data point, ignored all others, and replaced the old Coke with the new. They had not considered the backlash: taste is not all that motivates a consumer to buy something, and classic Coke had nostalgia associated with it. This and other factors made New Coke a total flop.
- Mars Climate Orbiter
In 1999 NASA lost contact with the Mars Climate Orbiter as it attempted to enter orbit around Mars. It was a $125 million mistake. The reason was that the engineers at Lockheed Martin were using US customary units, while the engineers at NASA were using the metric system. This mismatch of units fed bad data into the spacecraft's trajectory calculations, which led to incorrect decisions on its orbital adjustments and caused the spacecraft to be lost.
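The unit mix-up in the last example is the kind of error that can be guarded against by never passing a bare number between systems. The sketch below is purely illustrative and is not how any NASA or Lockheed Martin software worked: the `Impulse` class and its unit labels are hypothetical. It simply attaches a unit to each value and converts explicitly, using the standard factor of 4.4482216152605 newtons per pound-force.

```python
from dataclasses import dataclass

# One pound-force second expressed in newton-seconds.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse value that always carries its unit with it."""
    value: float
    unit: str  # "N*s" or "lbf*s" (hypothetical labels)

    def to_newton_seconds(self) -> float:
        if self.unit == "N*s":
            return self.value
        if self.unit == "lbf*s":
            return self.value * LBF_S_TO_N_S
        # Refuse to guess: an unknown unit is an error, not a number.
        raise ValueError(f"unknown impulse unit: {self.unit!r}")


reported = Impulse(10.0, "lbf*s")
print(reported.to_newton_seconds())
```

Read as a bare number, the value 10.0 would be taken as 10 newton-seconds, understating the real impulse by a factor of about 4.45. With the unit attached, the conversion is explicit and an unlabeled or unknown unit fails loudly.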
We now understand what bad data is, and history has shown us the outcome of making decisions based on it. What we have not yet examined is how bad data affects our decision-making.
How does bad data lead to decisions that cause systemic damage?
Systemic damage occurs when our simulations or hypotheticals (which are based on data) do not match real life at all. In the best case, a company’s product woefully underperforms in the hands of the consumer. In the worst case, it creates a safety hazard for the consumer.
An artificial intelligence algorithm can only work as well as the data provided to it. A large majority of companies use such algorithms to make important decisions, such as who to give insurance to. Since we are giving AI more and more decision-making power, bad data has the potential to cause ever greater damage.
PULSE was an AI model created at Duke University. Its main goal was to remove blurriness from photographs and bring back the original details. The model took a blurred picture of a volunteer of African descent, processed the data, and produced a clean image. There was just one issue: it had given the face Caucasian features. This was because minorities were under-represented in its training data set. Fortunately, the model was for academic use, but it is a warning sign of the damage bad data can truly cause.
A population census is a very important decision-making tool in a democracy. It is what a government uses to allocate resources fairly. Cases have occurred all around the world where minorities have been under-represented in a census. This causes severe economic loss to those communities and holds back their social development. All of this is very easily caused by decisions based on bad data. Because governments tend to place great trust in their census methodologies, the errors go unchallenged and produce a long-lasting domino effect.
Domino effect of bad data and the decisions influenced by it
There is a considerable anti-vaccination movement which, although started by only a handful of people, has allowed diseases that had been all but eliminated to make a comeback. The bad data behind it will have long-term effects on the health of the entire world. Since the world’s health care systems are already under strain due to COVID-19, additional cases of contagious diseases drain resources even further.
The health-related domino effect can play out like this: more sick people lead to more spending on health care, which burdens personal wealth, reduces buying power, and hurts the local economy. It can have many more unpredictable consequences.
How to combat the collection of bad data?
Understanding bad data is the first step to combating it. Unfortunately, there is no secret sauce that removes all data-related woes, but several steps can be taken.
- Have your data set independently reviewed
Outsourcing the audit of your data to a trustworthy entity can help remove confirmation and organizational biases. Do remember to explain what the data is to be used for, because context is extremely important when interpreting data.
- Use de-noising algorithms for your real-time data acquisition systems.
There are pre-built systems that include mathematical and machine learning models to filter out noise.
- Ensure all equipment and methodologies are calibrated.
Use widely accepted standards to adjust your observation and measurement equipment.
- Be vigilant about data manipulation and disinformation.
Never use a single data set to arrive at conclusions. Take the average of several, as this reduces your margin of error.
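The last point, that averaging several data sets reduces the margin of error, can be checked with a small simulation. This sketch assumes the data sets are independent and that their errors are random and unbiased (modelled here as Gaussian noise); averaging does nothing against an error that all the data sets share. The function name and all the numbers are hypothetical.

```python
import random
from statistics import mean


def typical_error(num_sets, trials=2000, true_value=100.0, noise_sd=5.0, seed=1):
    """Average absolute error when the estimates from num_sets independent
    data sets are averaged together."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        # Each data set yields one noisy estimate of the true value.
        estimates = [rng.gauss(true_value, noise_sd) for _ in range(num_sets)]
        errors.append(abs(mean(estimates) - true_value))
    return mean(errors)


# Compare relying on 1 data set with averaging 4 or 16 of them.
for k in (1, 4, 16):
    print(k, round(typical_error(k), 2))
```

Under these assumptions the typical error shrinks roughly with the square root of the number of data sets, so four independent sets about halve it.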
Is unethical data bad data?
We now have a surface understanding of the other aspects of bad data and its consequences. Still, there remains a grey area: does data acquired through unethical means lie in the realm of bad data?
Technically, and by definition, unethically acquired data that follows all the proper standards does not lie in the realm of bad data. In real terms, however, when tech companies and governments acquire data without consent, it is not only an infringement on privacy; it also lies firmly at the depths of the bad data valley.
Many people do not mind their data being collected without much transparency, but over time these organizations become more and more intrusive, to the point where the collection can legally be considered spying.
Since our dawn, human beings have depended on data for survival, and we continue to do so. Data will remain at the heart of our society. Because it is a tool, it has been and will be misused. We can make small personal changes, such as being vigilant and double-checking, to keep bad data from negatively influencing our decisions.