In this article we will learn how to standardize data in R using the scale() function.



Theory

Once you start your journey in machine learning, you will often hear the word “normalization”.

The term is often used loosely for any kind of rescaling, including standardization.

Two procedures usually fall under this umbrella: min-max normalization and Z-score standardization.

Strictly speaking, only the min-max method is normalization: it rescales values to lie in the range between 0 and 1.

Z-score standardization, by contrast, is a procedure to “standardize” data to have mean = 0 and standard deviation = 1.

These two procedures should not be confused, since they are used for different purposes and lead to different results.
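To make the difference concrete, here is a minimal sketch that applies both procedures by hand to a small vector. The vector x and the names x_minmax and x_z are made up for illustration.

```r
# Toy vector (made up for illustration)
x <- c(2, 4, 6, 8, 10)

# Min-max normalization: the minimum maps to 0 and the maximum maps to 1
x_minmax <- (x - min(x)) / (max(x) - min(x))

# Z-score standardization: the result has mean 0 and standard deviation 1
x_z <- (x - mean(x)) / sd(x)

range(x_minmax)  # 0 1
mean(x_z)        # 0 (up to floating-point error)
sd(x_z)          # 1
```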

So why do we need to standardize data?

Generally speaking, it all depends on your data!

The min-max normalization procedure squeezes every value into the range between 0 and 1, but it doesn’t handle outliers very well: a single extreme value defines one end of the range and compresses all the other values together.

For example, think of housing prices: the vast majority of real estate properties will be under $1 million, but you also have that one luxurious penthouse that is worth $9 million.

This “penthouse” outlier won’t be treated very well by the min-max normalization procedure, and when we know we have such outliers, Z-score standardization comes in handy.
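The housing example can be sketched with a handful of made-up prices. These numbers are hypothetical, chosen only to include one extreme value, and are not taken from any real dataset.

```r
# Hypothetical prices in thousands of dollars; the last one is the penthouse
prices <- c(250, 400, 550, 700, 850, 9000)

# Min-max normalization: everything is forced into [0, 1]
minmax <- (prices - min(prices)) / (max(prices) - min(prices))

# Z-score standardization with scale(); as.numeric() drops the matrix wrapper
zscore <- as.numeric(scale(prices))

round(minmax, 3)
round(zscore, 3)
```

With these numbers, min-max normalization puts the penthouse at 1 and squeezes the five ordinary prices below 0.07. The Z-score is still influenced by the penthouse, but it isn't forced into a fixed range, so the outlier is simply reported as lying about two standard deviations above the mean.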

The best way to learn these things about your data is to compute some descriptive statistics in R for it, as well as to create a scatter plot in R.

Enough of theory, let’s get to the applied part!



Application

Below are the steps we are going to take to learn how to standardize data in R:

  1. Loading sample dataset: cars
  2. Basic scale() command description
  3. Standardize data in R
  4. Visualization of standardized data in R



Part 1. Loading sample dataset: cars

The dataset I will use in this article is the data on the speed of cars and the distances they took to stop.

It is the same dataset I used in my min-max normalization article (you can compare the results later) with the same descriptive statistics.

As usual, I will store it as “mydata”:


mydata <- cars

Now you can take a look at the dataset and the variables it contains:


View(mydata)

You will see that it contains 50 observations on speed (mph) and distance (ft).

The range of "speed" is [4; 25] and the range of "dist" is [2; 120].

The scatter plot below shows that we have several observations that are either very low or very high for both of these variables:

[Figure: not standardized data in R]

Distribution-wise, we can expect the majority of observations to fall between 10 and 20 for "speed" and between 15 and 60 for "dist".
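If you would rather check these numbers than read them off the plot, a short sketch like the one below should do. It only uses base R functions; the choice of the 25th and 75th percentiles as a stand-in for "the majority of observations" is my own assumption.

```r
mydata <- cars

# Number of observations and variables
dim(mydata)

# Minimum and maximum of each variable
sapply(mydata, range)

# Middle half of each variable (25th and 75th percentiles)
sapply(mydata, quantile, probs = c(0.25, 0.75))

# Minimum, quartiles, mean, and maximum in one table
summary(mydata)
```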

We can also see groups of outliers on this scatter plot.

Therefore, we can see the need to standardize this data.



Part 2. Basic scale() command description

Unlike min-max normalization (where we had to create the function ourselves), for the purpose of standardization, R has a built-in command scale().

The basic syntax of the function is the following:

scale(x, center = TRUE, scale = TRUE)

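Here, "x" refers to the object you are rescaling: a numeric matrix-like object. A plain numeric vector, such as a single data frame column, also works and is treated as a one-column matrix.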

Notice the important parameters "center" and "scale".

The "center" parameter (when set to TRUE) is responsible for subtracting the mean of the numeric object from each observation.

The "scale" parameter (when set to TRUE) is responsible for dividing the resulting difference by the standard deviation of the numeric object. (If "center" is set to FALSE, scale() divides by the root mean square instead.)

In our case, we are performing a Z-score standardization in R, therefore both of these parameters should be set to TRUE.
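As a quick illustration of what the two parameters do, here is a small sketch on a toy vector. The vector x and the variable names are made up; the "scaled:center" and "scaled:scale" attributes are what scale() attaches to its result to record the values it used.

```r
# Toy vector (made up for illustration)
x <- c(1, 2, 3, 4, 5)

# Centering only: subtract the mean, keep the original spread
centered_only <- scale(x, center = TRUE, scale = FALSE)

# Full Z-score standardization (both parameters default to TRUE)
z <- scale(x)

as.numeric(centered_only)  # -2 -1  0  1  2
attr(z, "scaled:center")   # 3, the mean that was subtracted
attr(z, "scaled:scale")    # the standard deviation that was divided by
```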



Part 3. Standardize data in R

I understand that parts of the above sounded very technical.

It is a lot of information to take in at once, so here is a more intuitive way to understand it.

Let's take a look at the Z-score formula:

$$z_i = \frac{x_i - \bar{x}}{\sigma}$$

With reference to the command description, the "center" parameter set to TRUE ensures that we subtract "x bar"; while the "scale" parameter set to TRUE ensures that we divide by "sigma".
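To convince yourself that scale() really implements this formula, you can compute the Z-scores by hand and compare. This is only a sketch; z_manual and z_scale are names I made up. Note that R's sd() is the sample standard deviation (it divides by n - 1), so that is what plays the role of "sigma" here.

```r
speed <- cars$speed

# Z-score by hand: subtract the mean, divide by the standard deviation
z_manual <- (speed - mean(speed)) / sd(speed)

# Same thing with scale(); as.numeric() drops the matrix wrapper
z_scale <- as.numeric(scale(speed))

all.equal(z_manual, z_scale)  # TRUE
```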

After clarifying the theoretical part of this article, it's time to focus on the procedure.

So what are we going to do? And where will we store the new data points?

I would prefer to create new columns in the same data frame with the standardized data for each of the variables.

Let's call the new columns "speed_scaled" and "dist_scaled".

The function will run through each row of the column we set it to work on and convert each data point to a Z-score standardized data point.

The code for each of the columns is the following:


mydata$speed_scaled <- scale(mydata$speed)
mydata$dist_scaled <- scale(mydata$dist)

We have just created two new columns with standardized data for the "speed" and "dist" variables.
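A quick sanity check is to confirm that each new column has mean 0 and standard deviation 1. The sketch below repeats the two lines above so that it runs on its own. It also shows one detail that is easy to miss: scale() returns a one-column matrix, so the new columns are matrices rather than plain vectors. Converting them with as.numeric() is optional and is my own suggestion, not a required step.

```r
mydata <- cars
mydata$speed_scaled <- scale(mydata$speed)
mydata$dist_scaled  <- scale(mydata$dist)

# Each standardized column should have mean 0 (up to rounding) and sd 1
mean(mydata$speed_scaled); sd(mydata$speed_scaled)
mean(mydata$dist_scaled);  sd(mydata$dist_scaled)

# The new columns are one-column matrices
is.matrix(mydata$speed_scaled)  # TRUE

# Optional: store them as plain numeric vectors instead
mydata$speed_scaled <- as.numeric(mydata$speed_scaled)
mydata$dist_scaled  <- as.numeric(mydata$dist_scaled)
```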

Take a look at your dataset now:


View(mydata)

Great! We have completed the task!



Part 4. Visualization of standardized data in R

Recall that in Part 1 of this article I provided a scatter plot of the "Not Standardized" data.

Now, let's take another look at it and compare it to the "Standardized Data" that we calculated.

[Figure: scatter plots of not standardized and standardized data in R]
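One possible way to draw such a comparison yourself is with base R graphics, placing the two scatter plots side by side. This is a sketch; the panel titles and axis labels are my own choice and the original figure may look different.

```r
mydata <- cars
mydata$speed_scaled <- as.numeric(scale(mydata$speed))
mydata$dist_scaled  <- as.numeric(scale(mydata$dist))

# Two panels side by side; remember the old settings to restore them later
old_par <- par(mfrow = c(1, 2))

plot(mydata$speed, mydata$dist,
     xlab = "speed", ylab = "dist", main = "Not standardized")

plot(mydata$speed_scaled, mydata$dist_scaled,
     xlab = "speed_scaled", ylab = "dist_scaled", main = "Standardized")

# Restore the previous plotting layout
par(old_par)
```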

You can clearly see that the relationship between the two variables is preserved.

Yet, the difference is in the scales of the axes and the weight assigned to outliers.
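A numerical way to back up this visual impression is to compare the correlation before and after standardization. Subtracting a constant and dividing by a positive constant does not change the Pearson correlation, so the two values should agree. The names cor_raw and cor_scaled are made up for this sketch.

```r
# Correlation between the original variables
cor_raw <- cor(cars$speed, cars$dist)

# Correlation between the standardized variables
cor_scaled <- cor(as.numeric(scale(cars$speed)),
                  as.numeric(scale(cars$dist)))

all.equal(cor_raw, cor_scaled)  # TRUE
```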

Recall that in min-max normalization the resulting range was [0; 1] for both "speed_norm" and "dist_norm", with each minimum mapped to 0 and each maximum mapped to 1.

Here, the ranges for "speed_scaled" and "dist_scaled" aren't identical, which makes it possible to account for the outliers in the dataset.
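You can inspect the two ranges directly. The sketch below recomputes the standardized values so that it runs on its own.

```r
speed_scaled <- as.numeric(scale(cars$speed))
dist_scaled  <- as.numeric(scale(cars$dist))

# Smallest and largest Z-score for each variable
range(speed_scaled)
range(dist_scaled)
```

In my reading of the output, the largest stopping distance lies close to three standard deviations above the mean, while no speed value is that far from its own mean.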



This concludes our article on how to standardize data in R. You can learn more about data preparation and algorithms in the Machine Learning section.