In this article we will learn how to normalize data in R, which here means rescaling values to the range between 0 and 1. We will discuss why normalization is done and what it implies for the results, motivate it with a height/weight example, and then create our own function for normalizing data in R and apply it to a sample dataset.



Theory

In this article, “normalization” means transforming the values of a variable so that they fall in the range between 0 and 1.

Why do we do it?

Certain machine learning algorithms (such as SVM and KNN) are more sensitive to the scale of data than others because they rely on the distances between data points. A variable measured on a larger scale contributes more to those distances and can dominate the result.

In order to avoid this problem we bring the variables to a common scale (between 0 and 1) while keeping the shape of each variable's distribution the same. This is often referred to as min-max scaling.

Suppose we are working with a dataset that has 2 variables: height and weight, where height is measured in inches and weight is measured in pounds.

Even before looking at any numbers, you can tell that weight will take larger values, over a wider range, than height.

In our daily life we would think that the range for height can be somewhere between 65 and 75 inches (my assumption), while the range for weight can be somewhere between 120 and 220 pounds (also my assumption).
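To make the scale problem concrete, here is a minimal sketch with five made-up people whose heights and weights fall in the ranges assumed above. The values are illustrative assumptions, not real data. It compares Euclidean distances computed on the raw values with distances computed after each column is rescaled to [0, 1].

```r
# Hypothetical heights (inches) and weights (pounds), made up for illustration
people <- data.frame(
  height = c(65, 75, 65, 70, 70),
  weight = c(150, 150, 170, 120, 220)
)
rownames(people) <- c("A", "B", "C", "D", "E")

# Euclidean distances on the raw scale
raw_dist <- as.matrix(dist(people))

# Rescale each column to [0, 1], then recompute the distances
scaled <- as.data.frame(lapply(people, function(x) (x - min(x)) / (max(x) - min(x))))
rownames(scaled) <- rownames(people)
scaled_dist <- as.matrix(dist(scaled))

# A and B differ by the full height range; A and C differ by 20 pounds
raw_dist["A", c("B", "C")]      # raw: C looks twice as far from A as B does
scaled_dist["A", c("B", "C")]   # scaled: B is now much farther from A than C is
```

On the raw scale the 20-pound difference outweighs the 10-inch difference simply because of the units, which is the effect normalization is meant to remove.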

In this article I will show how to bring the data to a common scale and plot histograms to show that the distribution of the data is preserved.



Application

Below are the steps we are going to take to learn how to normalize data in R:

  1. Loading sample dataset: cars
  2. Creating a function to normalize data in R
  3. Normalize data in R
  4. Visualization of normalized data in R



Part 1. Loading sample dataset: cars

The dataset I will use in this article is the data on the speed of cars and the distances they took to stop.

It contains 50 observations on speed (mph) and distance (ft).

This dataset is built into R, so we don’t need to import it from any external source; we just “call” it into our environment.

As usual, I will store it as “mydata”:


mydata <- cars

Now you can take a look at the dataset and the variables it contains:


View(mydata)

Running the basic descriptive statistics in R for the 2 variables in this dataset will provide you with some interesting information:

  • Average speed is 15.4 mph
  • Range of speed is 4 to 25 mph
  • Average distance is 42.98 ft
  • Range of distance is 2 to 120 ft

The ranges for these two variables are very different from each other, and therefore may affect the performance of distance-sensitive algorithms.
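The figures above can be reproduced with a few base R calls. This is a minimal sketch; summary() is just one of several ways to get descriptive statistics.

```r
mydata <- cars

# Mean and range of each variable
sapply(mydata, mean)    # speed 15.40, dist 42.98
sapply(mydata, range)   # speed 4 to 25, dist 2 to 120

# summary() reports the same values together with the quartiles
summary(mydata)
```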



Part 2. Creating a function to normalize data in R

Now, let's dive into some of the technical stuff!

As I mentioned earlier, what we are going to do is rescale the data points for the 2 variables (speed and distance) to be between 0 and 1 (0 ≤ x ≤ 1).

What we need to do now is to create a function in R that will normalize the data according to the following formula:

$$x_{norm} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

Applying this formula to a column does the following: it takes every observation one by one and subtracts the smallest value in the column from it. This difference is then divided by the difference between the largest data point and the smallest data point, which scales it to the range [0, 1].

Logically, the rescaled value of the smallest data point will be 0 and the rescaled value of the largest data point will be 1.
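As a quick check of the arithmetic, here is the formula applied step by step to a small made-up vector. The values are hypothetical and chosen only so the result is easy to verify by hand.

```r
x <- c(2, 4, 6, 10)   # hypothetical values

x_min <- min(x)       # 2
x_max <- max(x)       # 10

# Subtract the minimum, then divide by the range (10 - 2 = 8)
x_norm <- (x - x_min) / (x_max - x_min)

x_norm   # 0.00 0.25 0.50 1.00
```

The smallest value (2) maps to 0, the largest (10) maps to 1, and everything else lands proportionally in between.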

Now, let's code this process out!

I believe there are packages that have this command built in by now, but when I was working on my first algorithm, I made my own function for it due to time constraints.
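One package that offers this is scales, whose rescale() function maps a numeric vector onto a target range. The sketch below assumes scales may or may not be installed on your machine, so the comparison runs only if the package is available.

```r
x <- cars$speed

# Base R version of the formula
manual <- (x - min(x)) / (max(x) - min(x))

# Compare against scales::rescale() only if the package is installed
# (its availability is an assumption about your setup)
if (requireNamespace("scales", quietly = TRUE)) {
  from_pkg <- scales::rescale(x, to = c(0, 1))
  print(all.equal(manual, from_pkg))
}
```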

Let's call our function normalize().


normalize <- function(x) {
  # Min-max scaling: min(x) maps to 0 and max(x) maps to 1
  return((x - min(x)) / (max(x) - min(x)))
}

The above function does exactly what I described with the formula.
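Two edge cases are worth knowing about: normalize() returns NA for every element if the column contains a missing value, and NaN if all values in the column are equal (the denominator is zero). A minimal sketch of a more defensive variant follows. The name normalize_safe() and the choice to return 0 for a constant column are assumptions made for illustration, not part of the function above.

```r
# normalize_safe() is a hypothetical name; returning 0 for a constant
# column is one possible convention among several
normalize_safe <- function(x) {
  rng <- range(x, na.rm = TRUE)   # min and max, ignoring missing values
  if (rng[1] == rng[2]) {
    # All non-missing values are equal: avoid dividing by zero
    return(ifelse(is.na(x), NA_real_, 0))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

normalize_safe(c(2, NA, 6, 10))   # 0.0  NA 0.5 1.0
normalize_safe(c(5, 5, 5))        # 0 0 0
```

The cars dataset has no missing values and no constant columns, so the simpler normalize() is sufficient for the rest of this article.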



Part 3. Normalize data in R

Now that we have the function ready, it's time to discuss the procedure.

So what are we going to do? And where will we store the new data points?

I would prefer to create new columns in the same data frame with the normalized data for each of the variables.

Let's call the new columns "speed_norm" and "dist_norm".

The function is vectorized: given a column, it converts every data point in that column to a normalized data point in a single call.

The code for each of the columns is the following:


mydata$speed_norm <- normalize(mydata$speed)
mydata$dist_norm <- normalize(mydata$dist)

We have just created two new columns with normalized data for the "speed" and "dist" variables.
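With only two variables, one line per column is fine. For a data frame with many columns, a sketch of an alternative is to apply normalize() to every column with lapply(). This assumes all columns are numeric, and the name mydata_norm is chosen here just for illustration.

```r
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

mydata <- cars

# lapply() calls normalize() once per column;
# as.data.frame() turns the resulting list back into a data frame
mydata_norm <- as.data.frame(lapply(mydata, normalize))
names(mydata_norm) <- paste0(names(mydata), "_norm")

# Put the original and normalized columns side by side
mydata <- cbind(mydata, mydata_norm)
names(mydata)   # "speed" "dist" "speed_norm" "dist_norm"
```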

Take a look at your dataset now:


View(mydata)
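View() opens an interactive viewer, so it is mostly useful in an IDE such as RStudio. In a script or a plain console, a sketch of the same check is to print the first rows and confirm that each new column spans exactly 0 to 1.

```r
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

mydata <- cars
mydata$speed_norm <- normalize(mydata$speed)
mydata$dist_norm <- normalize(mydata$dist)

head(mydata)               # first six rows, original and normalized columns

range(mydata$speed_norm)   # 0 1
range(mydata$dist_norm)    # 0 1
```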

Great! We have completed the task!

Covering the technical part is important, but I always want to intuitively explain what we found and why we did it.

The best way to show the changes is to visualize the dataset.



Part 4. Visualization of normalized data in R

When we normalized the dataset in R we rescaled it to the range [0, 1].

A natural question that arises is "How did it affect our distribution of data points?"

Let's take a look!

Below I am going to show the histograms for both the original (not normalized) and the normalized data.

Here is the visualization for the "speed" and "speed_norm" variables:

[Figure: histograms for normalized and not normalized data on speed of cars]

We observe histograms of identical shape even though the X axis is rescaled.

This shows that normalization didn't affect the shape of the distribution of the rescaled data.

The same holds for "dist" and "dist_norm".
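A sketch of how such histograms can be drawn with base R's hist() follows. One assumption here: the bin edges are fixed explicitly (7 equal-width bins spanning each variable's range), because by default hist() picks "pretty" break points for each scale separately, which can bin the two versions of the data differently.

```r
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

mydata <- cars
mydata$speed_norm <- normalize(mydata$speed)

# 7 equal-width bins over the full range of each variable (8 break points)
breaks_raw  <- seq(min(mydata$speed), max(mydata$speed), length.out = 8)
breaks_norm <- seq(0, 1, length.out = 8)

par(mfrow = c(1, 2))   # two plots side by side
h_raw  <- hist(mydata$speed, breaks = breaks_raw,
               main = "speed", xlab = "speed (mph)")
h_norm <- hist(mydata$speed_norm, breaks = breaks_norm,
               main = "speed_norm", xlab = "speed_norm")
par(mfrow = c(1, 1))   # restore the default layout

# Compare the bin counts of the two histograms
all(h_raw$counts == h_norm$counts)
```

With matching bins, each bar contains the same observations in both plots; only the labels on the X axis change.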

To reinforce the point, we can also plot the variables against each other in a scatter plot for both the not normalized and the normalized values.

[Figure: scatter plot of the not normalized data in R]

[Figure: scatter plot of the normalized data in R]

Both plots look identical. We might have expected rescaling to affect the relationship between the variables, but it is unchanged.
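The scatter plots can be sketched with base R's plot(). Because min-max scaling is a linear transformation with a positive slope, the Pearson correlation between the two variables is also unchanged, which gives a numeric check to go with the visual one.

```r
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

mydata <- cars
mydata$speed_norm <- normalize(mydata$speed)
mydata$dist_norm <- normalize(mydata$dist)

par(mfrow = c(1, 2))   # two plots side by side
plot(mydata$speed, mydata$dist,
     main = "Not normalized", xlab = "speed", ylab = "dist")
plot(mydata$speed_norm, mydata$dist_norm,
     main = "Normalized", xlab = "speed_norm", ylab = "dist_norm")
par(mfrow = c(1, 1))   # restore the default layout

# Correlation before and after rescaling
cor_raw  <- cor(mydata$speed, mydata$dist)
cor_norm <- cor(mydata$speed_norm, mydata$dist_norm)
all.equal(cor_raw, cor_norm)   # TRUE
```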



This concludes our article on how to normalize data in R. You can learn more about data preparation and algorithms in the Machine Learning section.