In this article we will learn how to calculate summary statistics for subsets of data using the aggregate() function in R.
Theory
After we import the dataset into R, we often want to do some further data manipulation and analysis.
We can always start by looking at the descriptive statistics of the whole dataset, which will often give us some meaningful insights right away.
But what if the task is a little more complicated?
What if in order to get some insights we need to look at summary statistics of the subsets of the dataset?
To put it into context, assume you have data on salaries at the individual level for a city.
You know four pieces of information: name, department, job title, and salary.
The simplest analysis would be to just calculate the average salary in that city. And there is nothing wrong with that.
It’s a start. But as data people we always want to get as many insights as we can, which often involves looking at certain layers of the dataset.
Wouldn’t it be more interesting to look at the average salaries by department? Or by job title?
It certainly would! So let’s go ahead and take a look at the example I provide below!
Application
Below are the steps we are going to take to master the skill of finding summary statistics for subsets of data using the aggregate() function in R:
- Basic aggregate() function description
- Loading sample dataset: 2018-bloomington-civil-city-projected-salaries.csv
- Calculate mean values for subsets of data using aggregate() function in R
Part 1. Basic aggregate() function description
The aggregate() function is part of the stats package, which ships with R and is loaded by default, so we don’t need to install any additional packages.
In brief, the function has the following form:
aggregate(x, by= , FUN= )
Here, “x” refers to the dataset (a data frame) you want to calculate summary statistics of subsets for.
The “by= ” component is the variable, or several variables, that you would like to group by. It must be passed as a list, for example list(mydata$group).
The “FUN= ” component is the function you want to apply to calculate the summary statistics for the subsets of data, for example mean.
In simple words, the function follows this logic:
- Choose the dataset to work with
- Choose the grouping variable
- Choose a function to apply
It should be quite intuitive to understand the procedure that the function follows.
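The three steps above can be sketched on a tiny made-up data frame. The data and the names toy, team, score and toy_means are assumptions for illustration only; they are not part of the salaries file:

```r
# Toy data, invented for illustration (not the Bloomington dataset)
toy <- data.frame(
  team  = c("A", "A", "B", "B", "B"),
  score = c(10, 20, 30, 40, 50)
)

# x:   the column(s) to summarise
# by:  a *list* of grouping vectors (naming the element names the output column)
# FUN: the summary function applied to each group
toy_means <- aggregate(x = toy["score"],
                       by = list(team = toy$team),
                       FUN = mean)

print(toy_means)  # team A averages 15, team B averages 40
```

Passing toy["score"] rather than the whole data frame keeps the grouping column itself from being averaged.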
Let’s import the dataset and get to an example of using the aggregate() function in R!
Part 2. Loading the sample dataset
In this article I will be working with the dataset that shows the salaries by department and by occupation in the city of Bloomington in 2018.
The list of datasets for different years is available here.
You can either download the needed file from the link above or from my attachment: 2018-bloomington-civil-city-projected-salaries.csv
If you have your own data that you want to work with right away, you can import your dataset and follow the same procedures as in this article.
So first we need to import the .csv file into R and then go ahead and work with the aggregate() function.
Let’s call our dataset “salaries” and import it into our environment:
salaries<-read.csv(file="C:/Users/DataSharkie/Desktop/2018-bloomington-civil-city-projected-salaries.csv", header=TRUE, sep=",")
Note: I use Windows, so if you are using a Mac, your file path will look different.
You can take a look at the imported dataset using the following code:
View(salaries)
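View() opens a spreadsheet-style viewer. If you prefer to inspect the data in the console, head(), str() and names() do a similar job. The sketch below uses a small made-up CSV written to a temporary file so that it runs on its own; the rows, the header text and the name toy_salaries are assumptions and do not come from the real file:

```r
# Write a tiny invented CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Department,Job Title,Projected 2018 Salary",
  "Parks,Gardener,40000",
  "Parks,Manager,60000",
  "Police,Officer,55000"
), tmp)

toy_salaries <- read.csv(file = tmp, header = TRUE, sep = ",")

head(toy_salaries)   # first rows, printed in the console
str(toy_salaries)    # column names and types
names(toy_salaries)  # read.csv() turns spaces in the header into dots

# mean() needs a numeric column, so it is worth checking the type up front
is.numeric(toy_salaries$Projected.2018.Salary)
```

With your own file, replace tmp with the path to your .csv file.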
Also, to make things look a little nicer, let's only select the columns we will be using.
We will need the dplyr package and its select() function.
install.packages("dplyr")
Note: you can skip this step if you already have the dplyr package installed.
Next, we load the package and keep only two columns: "Department", which we will group by, and "Projected.2018.Salary".
library(dplyr)
salaries<-select(salaries, Department, Projected.2018.Salary)
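If you would rather not install dplyr for this one step, the same column selection can be done in base R. This is a minimal sketch on made-up data; toy_salaries, toy_small and the values are assumptions used only for illustration:

```r
# Invented data with one extra column we want to drop
toy_salaries <- data.frame(
  Name = c("Ann", "Bob", "Cy"),
  Department = c("Parks", "Parks", "Police"),
  Projected.2018.Salary = c(40000, 60000, 55000)
)

# Keep only the two columns needed for the aggregation
toy_small <- toy_salaries[, c("Department", "Projected.2018.Salary")]

names(toy_small)
```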
The dataset is added and ready. Now let's get into examples of how to aggregate data in R!
Part 3. Calculate mean values for subsets of data using aggregate() function in R
In this article I chose to calculate the mean as the summary statistic for the grouped subsets of data.
And I will group the dataset by department (in our dataset it's the "Department" variable).
Using the aggregate() function will allow us to create a dataset that includes each unique department and the corresponding average salary.
Let's call our resulting dataset "agg_salaries".
Use this code to arrive at the final dataset:
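agg_salaries <- aggregate(salaries["Projected.2018.Salary"],  # summarise only the salary column (must be numeric)
by = list(Department = salaries$Department),  # group by department and name the grouping column
FUN = mean)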
Looking at the new dataset using View(agg_salaries), we see that the data is now aggregated by department and shows the corresponding average salary for each department.
This concludes the article on how to use the aggregate() function in R.
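aggregate() also accepts a formula, which some readers find easier to read, and FUN can be any summary function, not only mean. The sketch below runs on made-up data; toy_salaries, toy_agg, toy_med and the values are assumptions for illustration and are not results from the Bloomington file:

```r
# Invented data for illustration
toy_salaries <- data.frame(
  Department = c("Parks", "Parks", "Parks", "Police", "Police"),
  Projected.2018.Salary = c(40000, 50000, 90000, 55000, 65000)
)

# Formula interface: "salary by department"
toy_agg <- aggregate(Projected.2018.Salary ~ Department,
                     data = toy_salaries,
                     FUN = mean)

# Same grouping, different summary statistic
toy_med <- aggregate(Projected.2018.Salary ~ Department,
                     data = toy_salaries,
                     FUN = median)

print(toy_agg)  # mean:   Parks 60000, Police 60000
print(toy_med)  # median: Parks 50000, Police 60000
```

By default, the formula interface drops rows with missing values before aggregating (na.action = na.omit).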
If you liked this article, I encourage you to take a look at the Data Manipulation in R section where you will find a lot of useful information and master the skill of data wrangling.