Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is the first and most crucial step in building one.
When working on a machine learning project, we rarely come across clean, well-formatted data. Before doing any operation with the data, it is essential to clean it and put it into a usable form, and that is exactly what data preprocessing does.
Real-world data generally contains noise and missing values, and may be in a format that cannot be used directly by a machine learning model. Data preprocessing cleans the data and makes it suitable for the model, which also improves the model's accuracy and efficiency.
There are four main steps in preprocessing data: splitting the data into train and test sets, handling missing values, encoding categorical features, and normalizing the dataset.
Train test split is one of the important steps in machine learning. Your model needs to be evaluated before it is deployed, and that evaluation must be done on unseen data, because once the model is deployed, all incoming data is unseen.
The main idea behind the train test split is to divide the original dataset into two parts:
the train set, consisting of training data and training labels, and the test set, consisting of testing data and testing labels.
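A minimal sketch of this split using scikit-learn's `train_test_split` (the feature matrix `X` and labels `y` here are dummy data for illustration):

```python
from sklearn.model_selection import train_test_split

# Dummy dataset: 10 samples with one feature each, and binary labels.
X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Hold out 20% of the data as the unseen test set; random_state
# makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 8 2
```

The model is then fit on `X_train`/`y_train` only, and `X_test`/`y_test` is touched just once, for the final evaluation.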
There is a famous machine learning phrase that you might have heard:
Garbage in, garbage out
If your dataset is full of NaNs and garbage values, then your model will surely perform like garbage too, so taking care of such missing values is important. Let's take a dummy dataset to see how we can tackle this problem.
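One common way to handle NaNs, sketched here with scikit-learn's `SimpleImputer` on a made-up array, is to replace each missing value with the mean of its column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Dummy dataset with missing values in both columns.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each NaN with the mean of the non-missing values in its column.
imputer = SimpleImputer(strategy="mean")
X_clean = imputer.fit_transform(X)
print(X_clean)
# [[1.  2. ]
#  [4.  3. ]
#  [7.  2.5]]
```

Other strategies such as `"median"` or `"most_frequent"` can be more robust when a column contains outliers.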
Taking Care of Categorical Features:
We can handle categorical features by converting them to integers. There are two common ways to do so.
With LabelEncoder, we convert the categorical values into numerical labels.
With OneHotEncoder, we create a new column for each unique categorical value; a row gets a 1 in the column corresponding to its category and a 0 in every other column.
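Both encoders live in `sklearn.preprocessing`; a small sketch on a made-up list of colors:

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = ["red", "green", "blue", "green"]

# LabelEncoder: each unique category becomes one integer
# (labels are assigned in alphabetical order of the categories).
le = LabelEncoder()
labels = le.fit_transform(colors)
print(labels)  # [2 1 0 1]

# OneHotEncoder: one column per unique category, 1 where that
# category is present in the row, 0 elsewhere.
ohe = OneHotEncoder()
onehot = ohe.fit_transform([[c] for c in colors]).toarray()
print(onehot)
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
```

LabelEncoder implies an ordering (blue < green < red here), which can mislead some models; one-hot encoding avoids that at the cost of extra columns.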
This brings us to the last part of data preprocessing: normalization of the dataset. Experiments have repeatedly shown that machine learning and deep learning models perform far better on a normalized dataset than on one that is not normalized.
The goal of normalization is to bring values onto a common scale without distorting the differences between their ranges.
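A common choice is min-max scaling, which maps each column to the [0, 1] range; a sketch with scikit-learn's `MinMaxScaler` on dummy data with two very differently scaled features:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Rescale each column independently to [0, 1]:
# (x - column_min) / (column_max - column_min)
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)
print(X_norm)
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]
```

After scaling, both columns share the same range, so neither feature dominates distance-based or gradient-based models simply because of its units.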