In this article we will learn how to create a histogram in R using the ggplot2 package.
Theory
When we get a new dataset for our analysis or research, we often want to learn how the values of the variable of interest are distributed, that is, how frequently each range of values occurs.
This is where the skill of creating histograms in R comes in handy.
A histogram is a plot made of rectangles: the height of each rectangle represents the frequency or “count” of observations that fall into an interval, and its width is equal to the size of that grouping interval.
I understand this may sound too techy, so it will be easier to grasp by following the plots we are going to create further in the article.
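To make the definition concrete, here is a minimal sketch in base R of the counting that sits behind a histogram. The vector `values` is made up purely for illustration and is not part of the article's dataset.

```r
# Hypothetical measurements, invented for this illustration only
values <- c(1.2, 1.8, 2.5, 3.1, 3.3, 3.9, 4.4, 5.6)

# Split the range into intervals of width 1: (1,2], (2,3], ..., (5,6]
intervals <- cut(values, breaks = 1:6)

# Count how many values fall into each interval;
# these counts are the heights of the histogram's rectangles
counts <- table(intervals)
print(counts)
```

Each interval becomes one rectangle: its width is the interval size (1 here) and its height is the count printed for it.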
Below I will show an example using ggplot2, a popular R visualization package.
Application
Below are the steps we are going to take to master the skill of creating a histogram in R:
- Installing ggplot2 package
- Loading sample dataset: trees
- Creating a histogram in R
Part 1. Installing ggplot2 package
R does have a built-in base function, hist(), which allows you to create histograms. Yet, I personally prefer to create most (if not all) of my visualizations using the ggplot2 package.
You can learn more about the ggplot2 package here.
To install the package and load it into your workspace, use the following code:
install.packages("ggplot2")
library(ggplot2)
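If you rerun the script often, you may not want to reinstall the package every time. A minimal sketch of one way to handle this, assuming a CRAN mirror is already configured for your R session:

```r
# Install ggplot2 only when it is not already available on this machine.
# Assumption: a CRAN mirror is configured, otherwise install.packages()
# will ask for one (or fail in a non-interactive session).
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Attach the package so its functions can be called directly
library(ggplot2)
```

Installation is needed only once per machine, while library(ggplot2) has to be run in every new R session.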
Part 2. Loading sample dataset: trees
R has a variety of datasets already built in. Although the step of “loading” this dataset isn’t required, it’s a good practice to get familiar with 🙂
I prefer to call the data I work with “mydata”, so here is the command you would use for that:
mydata <- trees
This built-in dataset contains measurements of the diameter, height and volume of black cherry trees.
Note: in this article I use one of R's built-in datasets. If you have your own data in CSV or Excel files, you can follow the same procedure to arrive at the result.
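For data stored in a CSV file, loading it might look like the sketch below. To keep the example self-contained, it first writes the trees dataset to a temporary file; the file and the name `mydata_csv` are placeholders for illustration, so substitute the path to your own file.

```r
# Create a stand-in CSV file so the example runs anywhere.
# With real data you would skip this step and use your own file path.
csv_path <- tempfile(fileext = ".csv")
write.csv(trees, csv_path, row.names = FALSE)

# Read the CSV file into a data frame
mydata_csv <- read.csv(csv_path)

# Check that the columns arrived as expected
names(mydata_csv)
```

Excel files need an additional package to read, so exporting the sheet to CSV first is often the simplest route.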
Take a look at the dataset and the variables it contains:
View(mydata)
We see a 31x3 data frame which contains three variables:
- Girth - diameter of trees in inches
- Height - height of trees in feet
- Volume - volume of trees in cubic feet
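View() opens an interactive viewer, which is not always available (for example when running a script from the command line). A short sketch of console-based alternatives from base R:

```r
mydata <- trees

# Number of rows and columns (31 and 3 for this dataset)
dim(mydata)

# Column names, types and the first few values of each variable
str(mydata)

# Five-number summary plus the mean for the variable we will plot
summary(mydata$Girth)
```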
In this article we will be plotting the distribution of "Girth" of the trees.
Part 3. Creating a histogram in R
Our goal is to create a histogram to draw some insights about the distribution of the "Girth" variable (or the frequency of occurrence of similar values).
Let's set up the graph theme first (this step isn't necessary; it's my personal preference for aesthetic purposes).
theme_set(theme_light())
If you are interested, the ggplot2 package has a variety of built-in themes to choose from, such as theme_minimal(), theme_bw() and theme_classic().
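theme_set() changes the default for every plot in the session. If you would rather try a theme on a single plot, one possible approach is sketched below; the choice of theme_minimal() is arbitrary and only for illustration.

```r
library(ggplot2)

# Option A: change the session default and remember the previous theme
old_theme <- theme_set(theme_light())

# Option B: add a theme to one plot only, leaving the default untouched
p <- ggplot(trees, aes(x = Girth)) +
  geom_histogram(binwidth = 2) +
  theme_minimal()

# Restore the previous default theme when you are done experimenting
theme_set(old_theme)
```

Printing `p` draws the histogram with the minimal theme, while any other plot in the session keeps using the default.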
Now we are all set to create a histogram in R.
Use the following code to arrive at our histogram:
ggplot(mydata, aes(x = Girth)) +
  geom_histogram()
Well done! Our first draft of the histogram in R is created!
You will notice that the "bins" in this histogram are very narrow, which leaves gaps between some of them. This happens because geom_histogram() uses 30 bins by default when no bin width is given, and ggplot2 prints a message suggesting that you pick a better value with binwidth.
We should make the plot more "grouped", don't you think?
So we will adjust the width of the bins and see how it looks afterwards.
Let's give it a try!
To change the width of the bins, we need to pass the binwidth argument (a number, in the units of the variable) to geom_histogram().
Use the following code to create a histogram with a custom bin width (I chose 2, meaning each bin spans 2 inches of girth):
ggplot(mydata, aes(x = Girth)) +
  geom_histogram(binwidth = 2)
This looks better!
By increasing the bin width, we increased the "grouping": each bin now covers a wider range of values and therefore holds more observations.
As you know from my other data visualization articles, I always try to add more details to the graphs/plots, especially colouring (in order to align it with the colours of the blog).
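The effect of the bin width can also be checked numerically. The sketch below uses ggplot2's layer_data() to pull out the bins each plot would draw; `bins = 30` is written out explicitly because it matches the default and avoids the message about picking a bin width.

```r
library(ggplot2)

# Same data, two binning choices
p_default <- ggplot(trees, aes(x = Girth)) + geom_histogram(bins = 30)
p_wide    <- ggplot(trees, aes(x = Girth)) + geom_histogram(binwidth = 2)

# One row per bin, including its edges (xmin, xmax) and its count
bins_default <- layer_data(p_default)
bins_wide    <- layer_data(p_wide)

# Fewer bins with the wider setting...
nrow(bins_default)
nrow(bins_wide)

# ...and more observations in the fullest bin
max(bins_default$count)
max(bins_wide$count)
```

Both versions account for the same 31 trees; only the way they are grouped changes.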
Next, I would like to add (black) borders to each bin and change the fill colour (to turquoise).
This is an easy step and only requires passing the colour (or color; ggplot2 accepts both spellings) and fill arguments to geom_histogram():
ggplot(mydata, aes(x = Girth)) +
  geom_histogram(binwidth = 2, colour = "black", fill = "#00AFBB")
This is amazing! Congratulations!
Also, from the histogram above we can see that the highest frequency among the observations in the dataset is where "Girth" is between 11 and 13 inches, which is represented by the tallest bar.
If you are interested in learning more about data visualization in R, you can find more articles here.
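That reading of the plot can be cross-checked without drawing anything. The sketch below uses base R's hist() with plot = FALSE and assumes the bin edges sit at 7, 9, 11, ... 21, which is an assumption chosen to match the 11 to 13 bar described above.

```r
# Compute the bins without plotting; intervals are closed on the right,
# so a girth of exactly 11 falls into the 9-11 bin
girth_bins <- hist(trees$Girth, breaks = seq(7, 21, by = 2), plot = FALSE)

# Position of the bin with the most observations
tallest <- which.max(girth_bins$counts)

# Lower and upper edge of that bin
girth_bins$breaks[c(tallest, tallest + 1)]

# Number of trees in that bin
girth_bins$counts[tallest]
```

With these edges, the fullest bin runs from 11 to 13 and holds 9 of the 31 trees.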