In this article we will learn how to create a scatter plot in R using the ggplot2 package.


Theory

We often get a dataset with many observations and several columns of variables.

We look at it and get lost in what the dataset describes and, especially, in how one variable relates to another.

This is where scatter plots come in handy. A scatter plot shows the relationship between two variables by drawing one point for each row of the dataset, with one variable on the x-axis and the other on the y-axis.

Below I will show an example using ggplot2, a popular R visualization package.



Application

Below are the steps we are going to take to master creating a scatter plot in R:

  1. Installing ggplot2 package
  2. Loading sample dataset: trees
  3. Creating a scatter plot in R



Part 1. Installing ggplot2 package

R does have a built-in base function, plot(), which allows you to create scatter plots. Yet, I personally prefer to create most (if not all) of my visualizations using the ggplot2 package.
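For comparison, here is a minimal sketch of the base R approach, using the built-in trees dataset that appears later in this article. The axis labels and title are my own wording, not part of the dataset.

```r
# Base R scatter plot of the built-in trees data (no extra packages needed)
plot(trees$Girth, trees$Height,
     xlab = "Girth (inches)",
     ylab = "Height (feet)",
     main = "Black cherry trees")
```

This is fine for a quick look; the rest of the article builds the same plot with ggplot2.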

You can learn more about ggplot2 package here.

In order to install and “call” the package into your workspace, you should use the following code:


install.packages("ggplot2")  # download and install the package; only needed once
library(ggplot2)             # load the package into the current session
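Since install.packages() downloads the package again every time it runs, you may prefer to install only when the package is missing. A minimal sketch of that idea, assuming you have an internet connection and a CRAN mirror configured:

```r
# Install ggplot2 only when it is not already available, then load it
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)
```

This way the script can be re-run without reinstalling the package each time.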



Part 2. Loading sample dataset: trees

R has a variety of datasets already built in. Although the step of “loading” this dataset isn’t required, it’s a good practice to get familiar with 🙂

I prefer to call the data I work with “mydata”, so here is the command you would use for that:


mydata <- trees

This built-in dataset contains the diameter, height and volume of 31 black cherry trees.

Note: in this article I use a built-in dataset. If you have your own data in a CSV or Excel file, you can follow the same procedure to arrive at the result.
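As a rough sketch of what loading your own CSV file could look like: the example below first writes the trees data to a temporary file (csv_path is a made-up location, used only so the example runs on its own) and then reads it back with read.csv(). With your own data you would skip the writing step and pass your file's path instead.

```r
# Write the built-in trees data to a temporary CSV file so the example is
# self-contained; in practice you would already have your own file.
csv_path <- tempfile(fileext = ".csv")  # hypothetical file location
write.csv(trees, csv_path, row.names = FALSE)

# Read the CSV into a data frame, just as you would with your own data
mydata <- read.csv(csv_path)
head(mydata)
```

For Excel files, one option is read_excel() from the readxl package, which also returns a data frame.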

Take a look at the dataset and the variables it contains:


View(mydata)

We see a 31x3 data frame which contains three variables:

  1. Girth - diameter of trees in inches
  2. Height - height of trees in feet
  3. Volume - volume of trees in cubic feet
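View() opens a spreadsheet-style viewer, so it is mostly useful in an interactive session such as RStudio. If you are working in a plain console or a script, the same information can be checked with a few base R functions. A small sketch:

```r
mydata <- trees

dim(mydata)      # number of rows and columns
str(mydata)      # column names and their types
head(mydata)     # first six rows
summary(mydata)  # minimum, quartiles, mean and maximum of each column
```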

In this tutorial we will plot "Girth" on the x-axis against "Height" on the y-axis.



Part 3. Creating a scatter plot in R

Our goal is to plot these two variables to gain some insight into the relationship between them.

Let's set up the graph theme first (this step isn't necessary; it's my personal preference for aesthetic purposes).


theme_set(theme_light())

If you are interested, the ggplot2 package has a variety of themes to choose from.

Now we are all set to create a scatter plot in R.
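If you want to try another theme without losing your current default, one approach is to keep the value that theme_set() returns, since it hands back the previously active theme. A minimal sketch, with theme_minimal() picked purely as an example:

```r
library(ggplot2)

# theme_set() returns the previously active theme, so it can be restored later
old_theme <- theme_set(theme_minimal())  # switch the global default theme

# ... any plots created here would use theme_minimal() ...

theme_set(old_theme)                     # restore the earlier default
```

Other built-in options include theme_bw(), theme_classic() and theme_dark().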

Use the following code to arrive at our scatter plot:


ggplot(mydata, aes(x = Girth, y = Height)) +
  geom_point()

Now let's add a little colour (for aesthetic purposes I prefer to colour the scatter plot in the colours of my blog):


ggplot(mydata, aes(x = Girth, y = Height)) +
  geom_point(colour = "#00AFBB")

Overall, we can identify a moderate upward trend. In simple words, we observe that trees with a larger girth tend to be taller (which is quite natural if we think about it).
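To back up the visual impression, one option is to compute the correlation between the two variables and to overlay a fitted straight line on the plot. The sketch below is my own addition rather than part of the original walkthrough; the names r and p_trend and the grey line colour are arbitrary choices.

```r
library(ggplot2)
mydata <- trees

# Pearson correlation between girth and height (a value between -1 and 1)
r <- cor(mydata$Girth, mydata$Height)
print(r)

# Same scatter plot with a fitted straight line and its confidence band
p_trend <- ggplot(mydata, aes(x = Girth, y = Height)) +
  geom_point(colour = "#00AFBB") +
  geom_smooth(method = "lm", formula = y ~ x, colour = "grey40")
p_trend
```

A positive value of r agrees with the upward trend, and the width of the band around the line gives a sense of how uncertain that trend is with only 31 trees.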



If you are interested in learning more about data visualization in R, you can find more articles here.