In this article we will learn how to do left outer join in R for two tables using merge() command.
Theory
Did you ever come across a problem of merging two tables together but only for matching rows in “left” table?
For example, we are looking at a sales report and all we see is “customer_id” and “sale_amount”.
We want to find out which location it was shipped to, but that data with customer locations is stored in another table.
The issue is: you have 8 customers who made a purchase based on the sales report. Yet your complete list of customers’ locations includes 14 individual customers.
You care about only getting the locations of the 8 on the sales report though.
Sounds quite common right?
It’s time to perform a left outer join in R!
Below I will show an example of the usage of popular R base command merge().
Application
Below are the steps we are going to take to make sure we do master the skill of doing left outer join in R:
- Basic merge() command description
- Loading the sales.csv and locations.csv files into R
- Performing left outer join in R for these two tables
Part 1. Basic merge() command description
The short theoretical explanation of the function is the following:
merge(x, y, by, by.x, by.y, sort=TRUE)
Here “x” and “y” are tables that we will be merging together.
“by”, “by.x”, and “by.y” allows us to specify what we will be merging by.
You can learn more about merge() command here.
Part 2. Loading the sales.csv and locations.csv into R
For the purpose of this tutorial, I created two sample .csv datasets that you can use to practice doing inner join in R.
The files can be downloaded here: sales.csv, locations.csv
Now let’s go ahead and load it into R using the following commands:
sales<-read.csv("/Users/DataSharkie/Desktop/sales.csv")
locations<-read.csv("/Users/DataSharkie/Desktop/locations.csv")
Note: remember that your location of the file will be different from mine, so adjust the code accordingly!
If you would like to learn more on how to import .csv files into R you can find an article here.
Once the datasets are imported, we can view the table using the following command:
View(sales)
View(locations)
After you familiarized yourself with the two tables, it's time to do left outer join in R!
Part 3. Performing left outer join in R for these two tables
Our goal here is to create a new table "left_join", where we will only have the entries with matching rows from the "left" table.
By default, the merge() command in R performs an inner join, so we will need additional specification in terms of identifying the parameters that we will be merging by.
At this point we are all set to do the left outer join in R.
You can do it using the following code:
left_join<-merge(sales, locations, by = "customer_id", all.x = TRUE)
We are merging two tables based on the "customer_id" column and setting "all.x=TRUE" makes sure that only those rows from the "right" table that match with rows in the "left" table will be added to the new table.
You can take a look at your new table by using this command:
View(left_join)
If you are interested to learn more about different types of merging in R, you can find more articles here.