What is data preprocessing?
Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is the first and a crucial step when creating a machine learning model.
When working on a machine learning project, we do not always come across clean, well-formatted data. Before doing any operation with the data, it is essential to clean it and put it into a structured format. This is what the data preprocessing step is for.
Why do we need data preprocessing?
Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be fed directly to a machine learning model. Data preprocessing cleans the data and makes it suitable for a machine learning model, which also improves the model's accuracy and efficiency.
Important steps of data preprocessing:
There are four main steps in the preprocessing of data.
- Splitting the dataset into training and test sets
- Taking care of Missing values
- Taking care of Categorical Features
- Normalization of the data set
Train Test Split:
Train test split is one of the important steps in machine learning. Your model needs to be evaluated before it is deployed, and that evaluation needs to be done on unseen data, because once the model is deployed, all incoming data is unseen.
The main idea behind the train test split is to divide the original dataset into two parts: a training set, consisting of training data and training labels, and a test set, consisting of test data and test labels.
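As a minimal sketch, the split above can be done with scikit-learn's `train_test_split`; the feature matrix `X` and labels `y` here are made-up toy data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features (toy data)
y = np.array([0, 1] * 5)           # toy binary labels

# Hold out 20% of the rows as an unseen test set.
# random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Note that the split shuffles rows by default, so the test set is a random sample rather than simply the last rows of the dataset.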
Taking Care of Missing Values:
There is a famous machine learning phrase you might have heard:
"Garbage in, garbage out"
If your dataset is full of NaNs and garbage values, your model will surely produce garbage too, so taking care of such missing values is important. Let's use a small dummy dataset to see how this problem can be tackled.
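One common tactic is to replace missing values with a summary statistic of the column. The sketch below uses a made-up DataFrame and scikit-learn's `SimpleImputer` to fill NaNs with each column's mean.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Dummy dataset with missing entries.
df = pd.DataFrame({
    "age":    [25, np.nan, 35, 40],
    "salary": [50000, 60000, np.nan, 80000],
})

# Replace each NaN with the mean of its column
# ("median" or "most_frequent" are common alternatives).
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(filled["age"].tolist())  # the NaN becomes the mean of 25, 35, 40
```

For small jobs, `df.fillna(df.mean())` in plain pandas gives the same result; the imputer is handy when the same fill values must later be applied to test data.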
Taking Care of Categorical Features:
We can take care of categorical features by converting them to integers. There are two common ways to do so.
- Label Encoding
- One Hot Encoding
With label encoding, we convert categorical values into numerical labels.
With one hot encoding, we create a new column for each unique categorical value; that column is 1 for rows where the value is present and 0 otherwise.
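Both encodings can be sketched on a toy colour column: `LabelEncoder` from scikit-learn for integer labels, and pandas' `get_dummies` for one hot columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["red", "green", "blue", "green"])  # toy categorical feature

# Label encoding: each unique category gets an integer
# (assigned in alphabetical order: blue=0, green=1, red=2).
le = LabelEncoder()
print(le.fit_transform(colors).tolist())  # [2, 1, 0, 1]

# One hot encoding: one new 0/1 column per unique category.
print(pd.get_dummies(colors))
```

Label encoding imposes an artificial ordering on the categories, so it suits tree-based models; one hot encoding avoids that ordering at the cost of extra columns.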
Normalizing the Dataset:
This brings us to the last part of data preprocessing, the normalization of the dataset. Experience shows that machine learning and deep learning models generally perform much better on a normalized dataset than on one that is not normalized.
The goal of normalization is to change values to a common scale without distorting the difference between the range of values.
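A common way to achieve this is min-max scaling, which maps each feature to the range [0, 1] while preserving the relative spacing of values. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two toy features on very different scales.
X = np.array([[1.0,  100.0],
              [2.0,  500.0],
              [3.0, 1000.0]])

# MinMaxScaler rescales each column independently to [0, 1]
# via (x - min) / (max - min).
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled[:, 0].tolist())  # [0.0, 0.5, 1.0]
```

Standardization (zero mean, unit variance) is the main alternative; which one works better depends on the model and the data.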
Importance of Data Preprocessing:
- Most machine learning algorithms slow down if features are not scaled. Suppose you have two features, one on a scale of 0-2 and the other on a scale of 0-1,000,000, and you run a regression on top of them. Gradient-based training will then need many iterations of adjustment to the regression coefficients before it reaches an accurate prediction, which increases training time. If you scale the features uniformly, training is far more efficient.
- We should always handle unexpected values in the dataset. For example, many random forest implementations do not support null values, so replacing such values with something meaningful also falls under data preprocessing.
- The dataset should be in a condition where we can easily swap the underlying machine learning algorithm. Preprocessing converts the data into a format compatible with any such algorithm.
- We have to convert categorical data into numeric form, since, under the hood, machine learning algorithms work on numeric data.
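The scaling point above can be illustrated with standardization: after `StandardScaler`, two features that started on wildly different scales (the toy 0-2 and 0-1,000,000 columns below are made up) contribute comparably to gradient-based training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features: one roughly in 0-2, the other in 0-1,000,000.
X = np.array([[0.5,  200000.0],
              [1.0,  600000.0],
              [1.5, 1000000.0]])

# Standardization: subtract each column's mean, divide by its std.
X_std = StandardScaler().fit_transform(X)

# Both columns now have mean 0 and unit variance, so a regression
# no longer spends extra iterations compensating for the scale gap.
print(np.round(X_std.mean(axis=0), 6))
print(np.round(X_std.std(axis=0), 6))
```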