Step-by-Step Guide to Preparing Data for Machine Learning Algorithms

At the heart of every machine learning system lies data, guiding the system and helping it improve constantly. And like the heart of any system, the data in our machine learning algorithms has a structure too.

You may be wondering now, “How can I prepare this data for my machine learning model?”

Think of the table of contents in a book, where all the chapters are labeled. Preparing data for machine learning algorithms has to follow a similar path, though not a complete replica. How the data is assembled, both virtually and physically, determines the performance of our machine learning algorithm.

Understanding data

Before walking, one has to learn to crawl. So before understanding how data can be prepared, we need to ask ourselves: what is data?

For us human beings, data is the memories we have. For example, how do you remember the exact path that leads to your house? You have used that path so many times that your brain has formed a chain of neurons, which we refer to as a memory.

The way these memories work is that whenever you arrive at a point near your home, such as a signboard with specific text, this visual data activates the chain of neurons that leads you to your house. But if you blindfolded yourself, removing that first piece of visual data, your odds of making it back home would be slim.

Before we move on to the types of data, we need to appreciate one more thing: all the data in a dataset is interconnected in several ways, and how we use these relationships makes all the difference.

Types of Data 

There are two main categories of data we need to consider, and each has two subsets. The primary categories of data do not overlap. They are:

Quantitative data

Quantitative data is acquired through measurement of some sort and is represented by numbers. An example is measuring the amount of water in a storage tank: the amount is represented by a number, which is a quantity.

Two subsets of quantitative data are:

Discrete Data

Discrete quantitative data can only take separate, countable values. An example is the number of students in a classroom: you would expect to find a whole number like 20 or 16 students. You cannot have π students, because π is not a whole number.

Continuous Data

Continuous quantitative data can hold any value within a range. For example, as temperature increases from 20 degrees to 23 degrees, it passes through every value in between: at one point it will be 20.13577, at another 20.143605863, and so on, to arbitrarily many decimal places.
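
To make the distinction concrete, here is a minimal sketch (using pandas; the column names and values are invented) of how the two subtypes typically appear in a dataset:

```python
# Discrete vs. continuous quantitative data (invented example values).
import pandas as pd

df = pd.DataFrame({
    "students_in_class": [20, 16, 25],              # discrete: whole counts only
    "room_temperature_c": [20.13577, 21.9, 22.46],  # continuous: any value in range
})

print(df.dtypes)  # counts load as int64, measurements as float64
```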

Qualitative data

This is data acquired through observation that cannot be quantified using numbers or variables. For example, the taste of ice cream or ketchup: if you take 10 volunteers and ask them to review the taste, they will say things like “disgusting”, “vile”, or “delicious”. You will not expect anyone to say it tasted like 45, because a taste can only be observed, not measured.

There are two subsets of Qualitative Data:

Nominal Qualitative Data

Nominal qualitative data is a way of labeling and describing data without any regard to order or rank. Take someone’s name, for example: you cannot claim that person A has a greater name than person B. You can only state that person A is called by one name and person B by another.

Ordinal Qualitative Data

Ordinal qualitative data labels and ranks data without requiring numeric measurements. For example, if you ask someone to arrange objects by how hot or cold they feel, starting with the coldest, they will end up with labels like coldest, cold, warm, and warmest.

When preparing data for any machine learning algorithm, you may have to use all four types. But all decisions are eventually made using quantitative data, so qualitative data must first be encoded into numbers, as the sketch below illustrates.
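
Here is a minimal sketch (using pandas; the names and categories are invented) of one common way to encode nominal and ordinal data:

```python
# Converting qualitative data into quantitative form before training.
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Cara"],                 # nominal: no order
    "taste": ["disgusting", "delicious", "vile"],     # nominal labels
    "warmth": ["cold", "warmest", "coldest"],         # ordinal: has a rank
})

# Nominal data: one-hot encode, because the categories have no order.
nominal = pd.get_dummies(df["taste"], prefix="taste")

# Ordinal data: map each label to its rank, preserving the order.
warmth_rank = {"coldest": 0, "cold": 1, "warm": 2, "warmest": 3}
ordinal = df["warmth"].map(warmth_rank)

print(nominal.join(ordinal))
```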

Methods of data collection

After understanding data, you now have to find a way to collect it. Understand that there is no right or wrong way to collect data; how you collect it depends on the resources you have. So do not worry if you are not using real-time telemetry from sensors or any other fancy equipment. A pen and a notebook can be a very powerful data-collection method if used properly.

We will break data collection methods into two types here. For ease of understanding, we will stick to empirical data only.

  1. Data acquired through simulation

This method is still in its adolescence but growing rapidly. The data is collected via computer simulations: for example, how the temperature of a room changes with time can be simulated with great precision using mathematical models. Done properly, simulation can imitate data derived from the real world with an accuracy of 90% or greater, and it can provide you with data quickly.

The major pitfall is that it requires a great deal of computational power; furthermore, the mathematical models have to be validated against real-life experiments.

  2. Data acquired through physical experimentation and observation

This is a mature method of data collection, but it is still improving with time. We collect this data through our interactions with the physical world. For example, we can measure the change in temperature within a room by placing several temperature-recording devices and taking readings at fixed intervals of time. This is undoubtedly still the yardstick when it comes to acquiring data.

The pitfall is that producing good data requires a lot of resources and time.

So you may be wondering: which method is best? The best practice is to use the strengths of both. Use the physical method to provide the initial data for your mathematical models, then use simulation to extrapolate more data, occasionally checking it by conducting physical experiments and tweaking your model. As an aside, this feedback loop is also how machine learning itself is conducted.
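
As a toy illustration of this hybrid approach, the sketch below uses Newton's law of cooling (chosen here purely for illustration) to simulate room-temperature data, then compares it against a few hypothetical physical measurements:

```python
# Simulated data calibrated against a handful of physical readings.
import numpy as np

def simulate_cooling(t_start, t_ambient, k, hours):
    """Temperature over time under Newton's law of cooling."""
    t = np.arange(hours + 1)
    return t_ambient + (t_start - t_ambient) * np.exp(-k * t)

simulated = simulate_cooling(t_start=30.0, t_ambient=20.0, k=0.35, hours=8)

# Hypothetical physical measurements at hours 0, 4, and 8 for sanity-checking.
measured = {0: 30.1, 4: 22.4, 8: 20.6}
for hour, value in measured.items():
    print(f"hour {hour}: simulated {simulated[hour]:.2f}, measured {value:.2f}")
```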

So far, we have covered what data is; now we can look at how to prepare it for machine learning. But first, we need a short introduction to machine learning itself.

Understanding Machine Learning

The phrase machine learning gets thrown about a lot. It is extremely popular but poorly understood. To put it simply, machine learning is the procedure we perform to teach artificial intelligence, or allow it to learn, much like a student attending classes to become better at a craft.

There are two well-known methods of machine learning:

  1. Unsupervised learning

This is where you provide your algorithm with unlabeled data and allow it to discover patterns within that data on its own. Here, the structure of the provided data is what the algorithm derives its rules from, and statistical models are used to optimize the output. This type of learning is widely used for tasks such as clustering and pattern discovery.

  2. Supervised learning

Here all data is labeled, and the trainers decide the rules the algorithm must use: it is taught which outcomes are correct and which are not. With more training, the accuracy of the system increases until it becomes self-sufficient. This type of machine learning is commonly found in predictive analytics, such as predicting the failure of industrial tools.
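
To make the contrast concrete, here is a minimal sketch using scikit-learn on toy data; the arrays and model choices are illustrative, not a recommendation:

```python
# Unsupervised vs. supervised learning on the same tiny dataset.
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]])

# Unsupervised: no labels are given; the algorithm discovers two clusters.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)

# Supervised: labels are provided; the model learns the mapping we define.
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)
print(clusters, model.predict([[1.0, 1.0], [5.0, 5.0]]))
```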

How we organize the data differs completely between the two learning models, so to prepare our data for either, we need to understand both methods.

Step-by-step guide to preparing data for machine learning

Now we are going to dive into the technicalities of organizing your data so that you can deploy it with ease.

  1. Backing up the raw uncompressed data

It is good practice to always keep two uncompressed backups of your data. One backup should be kept with you on a separate storage device; do not store it on the same device, since a single failure could wipe out both copies. Second, keep an off-site backup: this could mean uploading the data to a cloud service provider or keeping an encrypted storage device with someone you trust.

All this sounds mundane and like common sense, but unfortunately data loss does happen, and we are mostly unprepared for it.
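
A minimal sketch of this backup routine in Python follows; all paths are placeholders you would adapt to your own setup:

```python
# Keep two uncompressed backups: one local, one staged for off-site upload.
import shutil
from pathlib import Path

raw = Path("data/raw")                           # original, uncompressed dataset
local_backup = Path("/mnt/backup_drive/raw")     # separate storage device (placeholder)
offsite_stage = Path("offsite_upload/raw")       # to be synced to a cloud provider

for target in (local_backup, offsite_stage):
    shutil.copytree(raw, target, dirs_exist_ok=True)
```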

  2. Noise Reduction

Noise is unwanted variance in our data that can occur due to things like electrostatic discharge. Fortunately, several tools can remove noise from your data while preserving its integrity.
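
As one example of a simple noise-reduction technique, the sketch below applies a rolling median (using pandas, on invented readings), which smooths out spikes while largely preserving the underlying signal:

```python
# A rolling median removes one-off spikes such as discharge artifacts.
import pandas as pd

readings = pd.Series([20.1, 20.2, 35.0, 20.3, 20.2, 20.4])  # 35.0 is a spike
denoised = readings.rolling(window=3, center=True).median()
print(denoised)  # the spike at index 2 is replaced by a plausible value
```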

  3. Data Compression

Raw data contains information that is not required by our machine learning algorithms, and compression can remove that unwanted information. This reduces the storage requirements of the dataset, and smaller data volumes result in faster computation.
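
A minimal sketch of two simple compression steps follows; the file paths and the dropped column are hypothetical, and writing Parquet assumes the pyarrow library is installed:

```python
# Drop columns the model does not need, then write a compressed format.
import pandas as pd

df = pd.read_csv("data/raw/readings.csv")         # placeholder path
df = df.drop(columns=["operator_notes"])          # hypothetical unused column
df.to_parquet("data/prepared/readings.parquet",   # columnar and compressed
              compression="gzip")
```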

  4. Split the data into two sections

Training data

Training data is usually 80% of your dataset, and it is what you will use to train your machine learning algorithm.

Testing data

To gauge the performance of your machine learning algorithm, some of your data, approximately 20%, must be used to validate that the algorithm is doing what you expect it to do.

Splitting your data this way saves resources, because you will not have to repeat the data acquisition. Remember to keep the training and testing data in separate files: if training data is used as testing data, any results produced are meaningless, because the model has already seen those examples.
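
Here is a minimal sketch of the 80/20 split using scikit-learn's train_test_split on toy arrays:

```python
# An 80/20 train/test split; X and y stand in for your features and labels.
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(20).reshape(10, 2)   # toy feature matrix
y = np.arange(10) % 2              # toy labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
print(len(X_train), len(X_test))   # 8 2
```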

Data Visualization

Once your data has been pre-processed, it is good practice to display it visually. This can give you insight into your data well before it is processed by the machine learning algorithm. There are several ways data can be visualized.

  1. Tables

Think of Excel or Google Sheets: a table stores data referenced by rows and columns. This is a quick way to lay out all your data, although it is very difficult for the layman to spot any patterns in it.

  2. Pie charts

Pie charts can only be used for data that has been labeled; otherwise, no one can tell which slice belongs to which cluster. A pie chart displays what percentage of the circle each data cluster occupies.

  3. Scatter Plots

Scatter plots are mostly used for regression analysis. They are a way to find patterns between the dependent variable (y-axis) and the independent variable (x-axis) in two-dimensional analysis. Each record in the data is represented as a point on the plot (see the sketch after this list).

  4. Heat maps

Heat maps display the magnitude of data within two dimensions, although they can be used for three-dimensional analysis as well. Take the example of a person walking around a fixed area: the longer they stay at a spot, the darker that spot's color becomes.
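
As promised above, here is a minimal sketch of a scatter plot with matplotlib; the study-hours data is invented purely for illustration:

```python
# Independent variable on the x-axis, dependent variable on the y-axis.
import matplotlib.pyplot as plt

hours_studied = [1, 2, 3, 4, 5, 6]        # independent variable (x)
exam_scores = [52, 55, 61, 70, 74, 81]    # dependent variable (y)

plt.scatter(hours_studied, exam_scores)
plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.show()
```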

Hardware Requirements for Running Machine Learning Algorithms

At the end of the day, your data will be run on a computer, so using equipment optimized for machine learning can save you time and a lot of headaches.

For AI and deep learning, high-spec hardware is advisable: 16 GB of ECC RAM, a good GPU with at least 2 GB of memory, and at least a 7th-generation Intel Core i7 are recommended.

  1. Try to use an embedded Codec

Currently, the large majority of data processing for machine learning algorithms runs on the x86 instruction set, so try to ensure that your data can be processed by a codec (coder-decoder) that has been optimized for x86 architecture.

The best-case scenario is that your data can be read by the codec embedded within a modern CPU, because the more unnecessary data processing you can avoid, the better. Data processing always carries a risk of introducing errors into your data, which can poison your dataset and require time-consuming error checking and correction.
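
One cheap way to guard against silent corruption is to checksum your files before and after each processing step. A minimal sketch (with a placeholder file path) follows:

```python
# Hash a file's bytes to confirm a processing step did not alter them.
import hashlib
from pathlib import Path

def checksum(path):
    """Return the SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

before = checksum("data/raw/readings.csv")   # placeholder path
# ... run your processing or transcoding step here ...
after = checksum("data/raw/readings.csv")
print("unchanged" if before == after else "data was modified")
```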

  2. Use ECC RAM

RAM is the system memory that your CPU communicates with directly, so it has very low latency; keeping all your data in system memory can speed up all your processes.

“ECC” stands for Error-Correcting Code. The more data you run through system memory, the higher the chance of an error. ECC RAM can detect and correct these errors, saving you from hard-to-trace bugs and glitches.

Remember that the type and usage of your data affect the performance of your machine learning model, so take the data aspect very seriously. You can never be too cautious.

