In this article we will learn how to do linear regression in R using the lm() command. The article covers the theoretical part of linear regression (including some math) as well as an applied example showing how to run a simple linear regression with a few lines of code you can reuse in your own work.


Theory

Linear regression is one of the most basic approaches in predictive analytics.

The main goal of this approach is to determine whether one or more independent variables (Xs) do a good job of predicting the dependent variable (Y), and whether the coefficients on the Xs are statistically significantly different from zero.

The simplest form of linear regression (simple linear regression) includes one dependent variable (Y) and one independent variable (X) and takes the form:

Y = beta_0 + beta_1*X + epsilon

If you have more than one independent variable (Xs), there will be additional beta and X terms in the formula (for example, Y = beta_0 + beta_1*X_1 + beta_2*X_2 + ... + epsilon), and this is what we call multiple linear regression.

In the formula above:
Y = dependent variable;
X = independent variable;
Beta = coefficient on X;
Epsilon = error term.

Two major insights we are (generally) trying to get from a regression:

  1. The “strength” of the impact of the independent variables (Xs) on the dependent variable (Y);
  2. Forecasting an outcome using the existing model given a new value of the independent variable (a new X value).

Below I will show how to do simple linear regression in R using a dataset built into R as well as provide basic regression analysis.

Now it’s time to run some regressions in R!



Application

Below are the steps we are going to take to learn how to do linear regression in R:

  1. Loading sample dataset: women
  2. Basic lm() command description
  3. Performing linear regression in R
  4. Basic analysis of regression results in R
  5. Interpreting linear regression coefficients in R
  6. Significance of coefficients in R
  7. Goodness of fit in R



Part 1. Loading sample dataset: women

R has a variety of datasets already built into it. Although the step of “loading” this dataset isn’t strictly required, it’s good practice to get familiar with 🙂

If you have your own dataset that you would like to practice with while following the steps in this article, you can learn about importing different types of files into R here.

I prefer to call the data I work with “mydata”, so here is the command you would use for that:


mydata <- women

Now we can take a look at the dataset and the variables it contains:


View(mydata)

This dataset has two columns: "height" (the height of a person in inches) and "weight" (the weight of a person in pounds).
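
If you prefer to inspect the data in the console rather than in the spreadsheet-style viewer, the base R functions head() and str() work just as well:

head(mydata)  # first six rows of height and weight
str(mydata)   # structure: 15 observations of 2 numeric variables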



Part 2. Basic lm() command description

The very brief theoretical explanation of the function is the following:

lm(formula, data)

"formula" part requires us to specify the dependent and independent variables we will be regressing.
"data" part requires us to specify the dataset from which the dependent and independent variables are.

If you would like to learn the full description of the command, you can read about it here.
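
As a rough sketch of how the two arguments fit together (the names y, x1, x2 and your_data_frame below are placeholders, not variables from our dataset):

# dependent variable on the left of the tilde, independent variable on the right
model <- lm(y ~ x1, data = your_data_frame)

# with several independent variables, separate them with "+"
model <- lm(y ~ x1 + x2, data = your_data_frame)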



Part 3. Performing linear regression in R

So far we have established our dataset and the command we will use.

Now let's talk about what kind of relationship between the variables we will try to find using linear regression in R.

For the purpose of this article, the question I propose is: "Does height of a person have an impact on their weight?"

If so, what is the magnitude of this effect?

So let's get to the code!

To store the regression results, I will save them in an object called "reg1".

To run this regression in R, you will use the following code:


reg1 <- lm(weight ~ height, data = mydata)

Voilà! We just ran a simple linear regression in R!

Let's take a look and interpret our findings in the next section.



Part 4. Basic analysis of regression results in R

Now let's get into the analytics part of the linear regression in R.

In my opinion this part is more fun as it involves some coding as well as drawing conclusions and insights from our steps so far.

To begin with, let's get the short summary of our results using the summary() command:


summary(reg1)

Your output should look like this:
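
In addition to reading the printed summary, you can pull the key numbers out of the fitted model programmatically, which is handy if you want to reuse them later. A minimal sketch using base R accessors:

coef(reg1)                  # intercept and slope estimates
summary(reg1)$coefficients  # estimates, standard errors, t-values and p-values
summary(reg1)$r.squared     # multiple R-squared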

Let's dive into regression analysis!



Part 5. Interpreting linear regression coefficients in R

From the screenshot of the output above, what we will focus on first is our coefficients (betas).

"Beta 0" or our intercept has a value of -87.52, which in simple words means that if other variables have a value of zero, Y will be equal to -87.52.

In our specific case, if the height of a person is 0, then their weight should be -87.52 lbs. Sounds quite absurd, right?

But hold on for a moment! How many times have you seen a person who is 0 inches tall? So this result is not really applicable to a real-life situation (specifically at 0 inches).

To show that the model itself still works, think about our dataset and the sample of observations we were working with.

We see that the minimum height of a person recorded in our table is 58 inches!

So even though the estimate of the "beta 0" coefficient seems to make no sense on its own, by examining "beta 1" and testing the model with some numbers, we will see that our regression actually produces accurate predictions.

Now, let's take a look at our "beta 1" coefficient. It has a value of 3.45. It is a coefficient on our X variable, which is height. This means that for every 1 inch increase in height, the person's weight should go up by 3.45 lbs.

Okay, this sounds more reasonable!

Let's test the prediction with one of our observations. For example, take the observation in row 8. Here, height = 65 inches and weight = 135 lbs.

Our regression with coefficients now looks like: Y = -87.52 + 3.45*(X)

Plug in the numbers to get: Y = -87.52 + 3.45*(65) = 136.73

Let's recap: the actual weight is 135 lbs and our regression prediction is 136.73 lbs. How do you think we did? To me it sounds like a great result and certainly an accurate estimate!
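
Instead of plugging the numbers in by hand, you can let R do the same calculation with predict(); the result should match the manual computation above (roughly 136.7 lbs for a height of 65 inches):

# predicted weight for a person who is 65 inches tall
predict(reg1, newdata = data.frame(height = 65))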



Part 6. Significance of coefficients in R

I could address this topic in the "Coefficients" section, but since we are doing everything step by step and I try to make it intuitive and easy to understand, I decided to write about it separately.

So what is the significance of coefficients in linear regression?

Let's start from the beginning 🙂
The p-value for the coefficient tests the null hypothesis (H0) that the coefficient is equal to zero.

Generally we use a 95% confidence level (a 5% significance level) for hypothesis testing, which means that if the p-value is less than 0.05 (1 - 0.95), we can reject the null hypothesis and state that the coefficient on the independent variable (X) is associated with a change in the dependent variable (Y).

Let's take a look at our output table. Specifically the very last column which is "Pr(>|t|)". Here are our p-values for the intercept and "beta 1".

Also notice the stars (***) next to them. These stars are shown for convenience to indicate the level of statistical significance.

The row right under the coefficients table "Signif. codes: ..." indicates how many stars represent which level of significance.

In our case both coefficients have a very low p-value (close to zero), so we can state that our coefficients are statistically significant even at the 0.1% significance level (the "***" code), and we reject the null hypothesis that these variables have no impact on the prediction of the dependent variable.
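
If you would rather work with the p-values as numbers instead of reading them off the printed table, you can extract the last column of the coefficient matrix:

# p-values for the intercept and the height coefficient
summary(reg1)$coefficients[, "Pr(>|t|)"]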



Part 7. Goodness of Fit in R

So what is the goodness of fit you may ask?

Goodness of fit of a model basically shows how well it fits the observations from the dataset.

The statistical measure of the goodness of fit is called R-squared.

R-squared is always between 0% and 100% and measures how close the observations from the dataset are to the fitted regression line.

Formula: R-squared = Explained Variation/Total Variation

After looking at the formula, the intuition is clear: it shows how much of the total variation in the dependent variable is explained by the model, on a scale of 0% to 100%.
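
You can also verify this formula directly from our fitted model: compute the unexplained (residual) variation and the total variation yourself and check that the result matches the R-squared reported by summary(). A minimal sketch:

rss <- sum(residuals(reg1)^2)                        # unexplained (residual) variation
tss <- sum((mydata$weight - mean(mydata$weight))^2)  # total variation
1 - rss / tss                                        # should match summary(reg1)$r.squared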

Generally, the higher the R-squared, the better. But, as with everything, it is not always the case, and there are multiple other factors that need to be considered in a detailed regression analysis. I will have a separate article on it. For now, let's assume we have the general case.

If you go back to our output table and find the second-to-last row, you will see that it says "Multiple R-squared: 0.991".

This is what we were looking for! Our R-squared is 0.991 or 99.1%.

In other words, our model explains 99.1% of the variation in weight.

I think so far we are doing a good job 🙂

The last thing I would like to mention with regard to goodness of fit is plotting the regression line together with our observations.

Now we will plot our observations and add the regression line to it.
You can do it using the following code:


plot(mydata$height, mydata$weight)  # scatter plot of the observations
abline(reg1)                        # add the fitted regression line

If you are working in RStudio (which I highly recommend), the plot will appear in the bottom right corner and should look like this:

In conclusion, this allows us to visualize the observations and how close they are to our regression line. As mentioned above, our R-squared is 99.1%, and now we can visually connect this high number with how tightly the observations cluster around the fitted line.
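
If you would like a slightly more polished version of the same plot, base R lets you add axis labels, a title, and a coloured line; here is a minimal variant (purely cosmetic, the fitted line is identical):

plot(mydata$height, mydata$weight,
     xlab = "Height (inches)", ylab = "Weight (lbs)",
     main = "Weight vs. height with fitted regression line")
abline(reg1, col = "red", lwd = 2)  # same regression line, drawn in red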



This concludes our article on linear regression in R. You can learn more about regressions and statistical analysis in the Statistics in R section.