Linear Regression

In everyday life, many quantities are related to each other. For example, the price of fruit is related to its weight, a person's weight may be related to their height, and a temperature in degrees Fahrenheit has an equivalent in degrees Celsius.

Now, if you have two sets of numbers that might be related, how can you find the equation that relates them? If you suspect that the relationship is linear, you can use linear regression.

In this article, you will understand what linear regression is, what the model for linear regression is, what the equation for linear regression is, and what assumptions need to be considered.

Introduction to Linear Regression

Recall that the equation of a straight line is given by \[y=a+bx,\] where \(b\) is called the slope of the line and \(a\) is called the \(y\)-intercept (the value of \(y\) where the line crosses the \(y\)-axis).

As mentioned above, some quantities are related to others in a linear way. Take the price of mangos as an example. In the United States, the price per kilogram of mango is approximately \(\$1.80\), so the price for \(2\) kilograms would be \(\$3.60\). Thus, the relationship between the price and the weight of the mango is given by the equation \[y=1.80x,\]

where \(x\) is the number of kilograms (the independent variable) and \(y\) is the price (the dependent variable).

Now, suppose you are interested in people's addiction to cell phones. In your afternoon arts class, you asked \(5\) people their age and how many text messages they sent over the duration of the class, and you got the following information.

| Age | Number of text messages sent |
|---|---|
| \(17\) | \(35\) |
| \(18\) | \(27\) |
| \(20\) | \(29\) |
| \(22\) | \(23\) |
| \(27\) | \(18\) |

Table 1. Ages and number of text messages data.

Is there a relationship between a person's age and the number of text messages they send? Is the relationship linear? How can you find the equation that relates them?

Definition of Linear Regression

Before defining what linear regression is, let's look at the following scatter plot showing the distribution of the data obtained in the text message example.

Figure 1. Scatter plot of the relationship between age and text messages sent.

At a glance, you can see that you can draw several lines that can approximate the behavior of the points.

Linear regression is a statistical technique that consists of finding the best straight line that describes the relationship between a dependent variable and one or more independent variables.

The most commonly used model is the so-called least squares regression line.

With linear regression, you can predict values you do not know from the behavior of the data you obtained by sampling.

Correlation coefficient

One way to know if two sets of data are linearly related is to look at the scatter plot. Another is to calculate the Pearson correlation coefficient, often simply called the correlation coefficient.

Let \(x\) and \(y\) be the independent and dependent variable, respectively. Let \(\mu_{x}\) and \(s_x\) be the mean and the standard deviation of the sample of \(x\), and let \(\mu_{y}\) and \(s_y\) be the mean and the standard deviation of the sample of \(y\).

If the sample has size \(n\), the correlation coefficient is calculated by:

\[r=\frac{\sum z_xz_y}{n-1},\]

where

\[z_x=\frac{x-\mu_{x}}{s_x}\,\text{ and }\, z_y=\frac{y-\mu_{y}}{s_y}\]

To calculate the correlation coefficient for the text messages example, let \(x\) be the variable denoting the age (the independent variable), and \(y\) the number of text messages (the dependent variable).

Then \(\mu_{x}=20.8\) and \(s_x=3.96\) are the sample mean and standard deviation for the age, and \(\mu_{y}=26.4\) and \(s_y=6.39\) are the sample mean and standard deviation for the number of text messages.

Thus, the correlation coefficient is:

\[\begin{align} r&=\frac{(-0.96)(1.35)+(-0.7)(0.09)+(-0.20)(0.41)+(0.3)(-0.53)+(1.56)(-1.31)}{4} \\&=-0.91\end{align}\]
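The calculation above can be reproduced with a short script. This is only a minimal sketch using Python's standard library; the function name `correlation` and the variable names are illustrative choices, not part of the article.

```python
from statistics import mean, stdev

# Data from Table 1.
ages = [17, 18, 20, 22, 27]
texts = [35, 27, 29, 23, 18]

def correlation(xs, ys):
    """Pearson correlation: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    z_products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(z_products) / (len(xs) - 1)

r = correlation(ages, texts)
print(round(r, 2))  # -0.91
```

Working with unrounded \(z\)-scores gives the same value as the hand calculation once the result is rounded to two decimal places.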

Review our articles on sample mean, standard deviation and \(z\)-score to remember these topics!

Properties of the correlation coefficient

1. The correlation coefficient takes values from \(-1\) to \(1\).

2. If \(|r|=1\) then the relationship between the variable \(x\) and \(y\) is completely linear.

3. If \(r=0\), then there is no linear relationship between \(x\) and \(y\).

4. If \(r>0\), then when \(x\) increases, \(y\) tends to increase and when \(x\) decreases, \(y\) tends to decrease (also called positive correlation).

5. If \(r<0\), then when \(x\) increases, \(y\) tends to decrease and when \(x\) decreases, \(y\) tends to increase (also called negative correlation).
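These properties can be checked on small made-up data sets. The sketch below is an illustration only: the three data sets are hypothetical, and `correlation` is a helper name chosen here. Note the last case, where \(y\) clearly depends on \(x\) but not linearly, so \(r\) is (numerically) zero.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson correlation: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical data sets, chosen only to illustrate the properties.
xs = [-2, -1, 0, 1, 2]
rising = [2 * x + 1 for x in xs]    # exact line with positive slope
falling = [-3 * x + 4 for x in xs]  # exact line with negative slope
parabola = [x ** 2 for x in xs]     # related to x, but not linearly

r_rising = correlation(xs, rising)      # approximately 1
r_falling = correlation(xs, falling)    # approximately -1
r_parabola = correlation(xs, parabola)  # approximately 0

print(r_rising, r_falling, r_parabola)
```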

❗❗ Correlation does not imply causation.

Assumptions for Linear Regression

To apply linear regression, you first have to check the following conditions:

1. Quantitative variable condition: Correlation only applies if both variables are quantitative.

2. Straight enough condition: Look at the scatter plot and make sure your data has an approximately linear relationship. Correlation only measures the strength of a linear association.

3. Outlier condition: Outliers can ruin the correlation. When outliers are present, it is best to calculate one correlation including the outliers and another excluding the outliers.

Model of Linear Regression

The regression line is not perfect. It does not pass through all the points, some points will be above and some will be below, but it is the best in the sense that the sum of squares of the residuals (see the Residuals article for more information) is the smallest possible.
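To see what "smallest sum of squares of the residuals" means in practice, the sketch below scores two candidate lines for the text messages data. Both candidates are hypothetical lines picked by eye for illustration, and the function name `sum_squared_residuals` is an assumption of this sketch. Neither candidate is the regression line itself; the point is only that the criterion lets you rank lines.

```python
# Data from Table 1.
ages = [17, 18, 20, 22, 27]
texts = [35, 27, 29, 23, 18]

def sum_squared_residuals(a, b, xs, ys):
    """Sum of squared residuals for the candidate line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Candidate 1 (hypothetical): a horizontal line at the mean number of messages.
flat = sum_squared_residuals(26.4, 0.0, ages, texts)

# Candidate 2 (hypothetical): the line through the youngest and oldest person.
slope = (18 - 35) / (27 - 17)
intercept = 35 - slope * 17
endpoints = sum_squared_residuals(intercept, slope, ages, texts)

# The candidate with the smaller value follows the points more closely.
print(round(flat, 2), round(endpoints, 2))
```

The least squares regression line is the line for which this quantity is as small as it can possibly be.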

The calculations for finding the regression line are often tedious and time-consuming, which is why statistical software and graphing calculators are commonly used to do them.

Equation of Linear Regression

The line of best fit is called the least squares regression line.

The equation of the least squares regression line is given by:

\[\hat{y}=a+bx,\] where \[b=\frac{\sum(x-\mu_{x})(y-\mu_{y})}{\sum(x-\mu_{x})^2}\,\text{ and }\,a=\mu_{y}-b\mu_{x}\]

In the previous equation, the value \(\hat{y}\) is the predicted value of \(y\) that results from substituting a particular value \(x\) into the equation. Since \(\hat{y}\) is only a prediction, the difference between the actual value of \(y\) and the value \(\hat{y}\) is called the residual and is given by:

\[\varepsilon=y-\hat{y}\]

For the text messages example, the equation for the least squares regression line is given by:

\[\hat{y}=-1.47x+57.1\]
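The slope and intercept above follow directly from the formulas for \(b\) and \(a\). As a minimal sketch (the function name `least_squares_line` is an illustrative choice, not something defined in this article), they can be computed like this:

```python
from statistics import mean

# Data from Table 1.
ages = [17, 18, 20, 22, 27]
texts = [35, 27, 29, 23, 18]

def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y_hat = a + b*x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx    # slope
    a = my - b * mx  # intercept
    return a, b

a, b = least_squares_line(ages, texts)
print(round(b, 2), round(a, 1))  # -1.47 57.1
```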

Figure 2. Scatter plot with the line of best fit to the data.

Using this equation, you can predict how many text messages a \(25\)-year-old would send:

\[\hat{y}=-1.47(25)+57.1=20.35\]

that is, a \(25\)-year-old would be predicted to send about \(20\) text messages.

The linear regression line should only be used to predict values that are within the range of the \(x\) values in the sample. Otherwise, in the example of text messages, you could erroneously conclude that a \(1\)-year-old sends about \(56\) text messages!
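One way to respect this restriction in practice is to refuse predictions outside the sampled range. The sketch below is only one possible design: the function `predict` and its parameters are hypothetical names, the coefficients are the rounded values from the equation above, and the range \(17\) to \(27\) comes from the ages in Table 1.

```python
def predict(x, a, b, x_min, x_max):
    """Predict y_hat = a + b*x, but only inside the sampled range of x."""
    if not x_min <= x <= x_max:
        raise ValueError(
            f"x={x} is outside the sampled range [{x_min}, {x_max}]; "
            "the regression line should not be used to extrapolate."
        )
    return a + b * x

# Rounded coefficients from the text messages example.
A, B = 57.1, -1.47

print(round(predict(25, A, B, 17, 27), 2))  # 20.35

try:
    predict(1, A, B, 17, 27)  # a 1-year-old is far outside the sample
except ValueError as error:
    print(error)
```

Raising an error is a deliberately strict choice; returning the value together with a warning would be a reasonable alternative.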

Example of Linear Regression

Let's look at an example where outliers can change the regression line.

The following scatter plots show the grades obtained by \(20\) students in a calculus exam and the hours of study dedicated.

In the first image, the linear regression was done with all the data, while in the second image the outliers were omitted, that is, the student who studied \(15\) minutes and scored \(95\) and the student who studied \(6\) hours and scored \(60\) were omitted.

Figure 3. Scatter plot of grades earned versus hours of study, with the regression line \(\hat{y}=3.49x+66.1\) calculated using all the data.

Figure 4. Scatter plot of grades earned versus hours of study, with the regression line \(\hat{y}=7.82x+51.9\) calculated without considering the outliers.

Which line best fits the data?

Solution:

Note that in the first image, because of the outliers, many data points are far away from the regression line. In the second image, where the outliers are not considered, the data points are closer to the regression line.

Therefore, the second line is a better fit to the data.
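The effect of outliers on the fitted line can be reproduced on a small scale. The data in the sketch below are hypothetical and are NOT the \(20\) students from the figures; only the two outliers mirror the ones described above (\(15\) minutes with a score of \(95\), and \(6\) hours with a score of \(60\)). The helper names are illustrative as well.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y_hat = a + b*x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

def fit(points):
    """Fit the least squares line to a list of (x, y) pairs."""
    xs, ys = zip(*points)
    return least_squares_line(xs, ys)

# Hypothetical (hours studied, grade) pairs, invented for this illustration.
typical = [(1, 55), (2, 62), (3, 70), (4, 78), (5, 85)]
# Two outliers mirroring the ones described in the example.
outliers = [(0.25, 95), (6, 60)]

a_all, b_all = fit(typical + outliers)
a_clean, b_clean = fit(typical)

print(f"all data:         slope={b_all:.2f}, intercept={a_all:.2f}")
print(f"outliers omitted: slope={b_clean:.2f}, intercept={a_clean:.2f}")
```

With so few points, two outliers change the slope far more than they would in a sample of \(20\) students, so the sketch exaggerates the effect seen in the figures.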

Multiple Linear Regression

Multiple linear regression is used to estimate the relationship between one dependent variable and two or more independent variables.

For example, it is known that the cost of a house depends on its size, but it can also depend on the square meters of construction and the age of the property. In this case, the dependent variable is the cost of the house, while the independent variables are the size, the square meters of construction and the age of the property.

For these cases, you can also apply regression, but because the procedure is similar, this topic will not be covered in this article.
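For orientation only, the usual form of the model with \(k\) independent variables is sketched below. The symbols \(x_1,\dots,x_k\) and \(b_1,\dots,b_k\) are notation introduced here for illustration, extending the \(\hat{y}=a+bx\) notation of this article.

\[\hat{y}=a+b_1x_1+b_2x_2+\cdots+b_kx_k\]

In the house example, \(k=3\): \(x_1\) could stand for the size, \(x_2\) for the square meters of construction and \(x_3\) for the age of the property, with \(\hat{y}\) the predicted cost. As in the single-variable case, the coefficients are chosen so that the sum of the squares of the residuals \(y-\hat{y}\) is as small as possible.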

Linear Regression - Key takeaways

  • Linear regression allows you to predict data you don't know from the behavior of data you do know.
  • The line of best fit is the least squares regression line.
  • The least squares regression line is given by \[\hat{y}=a+bx,\] where \[b=\frac{\sum(x-\mu_{x})(y-\mu_{y})}{\sum(x-\mu_{x})^2}\,\text{ and }\,a=\mu_{y}-b\mu_{x}\]
  • The correlation coefficient measures how linear the relationship between two variables is.
  • The correlation coefficient is given by \[r=\frac{\sum z_xz_y}{n-1},\] where \[z_x=\frac{x-\mu_{x}}{s_x}\,\text{ and }\, z_y=\frac{y-\mu_{y}}{s_y}\]

Frequently Asked Questions about Linear Regression

Linear regression is a statistical technique that consists of finding the best straight line that describes the relationship between a dependent variable and one or more independent variables.

The regression line is the line that best describes the linear behavior between two variables and allows you to make predictions from it.

No, linear regression is a regression algorithm.

The residual \(\varepsilon\) is the difference between the actual value \(y\) and the value \(\hat{y}\) predicted by the regression line, that is, \(\varepsilon=y-\hat{y}\).

When the scatter plot of your data has a linear behavior, you can use linear regression.

Final Linear Regression Quiz

Question

The method of analysis used to determine the strength of the relationship between two variables, i.e., the independent and dependent variables, is called?

Answer

Bivariate Regression Analysis

Question

The best diagram to use in regression analysis to visually examine data and determine whether two variables are linearly related is a ...

Answer

Scatter plot

Question

Speed and distance covered can be said to have a ...

Answer

Linear relationship

Question

The line that best fits the trend when a linear relationship is observed between two variables in regression analysis is called?

Answer

Regression line or line of best fit

Question

A situation where a scatter plot shows a pattern in which one variable tends to increase as the other decreases is described as ...

Answer

Negative association

Question

... describes a situation where a scatter plot shows no pattern between the variables being compared or measured.

Answer

Zero association

Question

The words ..., ... and ... are used to describe the strength of the association in a scatter plot.

Answer

Strong, moderate and weak

Question

Finding the relationship between a dependent variable and multiple independent variables using a linear model is called?

Answer

Multiple linear regression

Question

Two types of linear regression are:

Answer

Simple linear regression and multiple linear regression

Question

A data point that lies at an abnormal distance from the regression line or line of best fit in a scatter plot is called?

Answer

Outlier

Question

The commonly used regression types are:

Answer

Linear regression and logistic regression

Question

... uses only one \(x\) (independent variable) to make a prediction of the dependent variable.

Answer

Simple linear regression

Question

The regression line formula is given as:

Answer

\(\hat{y}=a+bx\).

Question

An outlier that affects the slope of the regression line to a very large extent is known as ...

Answer

Influential point.

Question

The correlation coefficient is determined using ...

Answer

Regression Analysis.

Question

If the correlation coefficient is \(r>0\), what happens with \(y\) when \(x\) increases?

Answer

\(y\) tends to increase.

Question

What are the conditions you have to check before using regression?

Answer

The quantitative variable condition, the straight enough condition and the outlier condition.

Question

What does the quantitative variable condition say?

Answer

Both variables have to be quantitative.

Question

What does the straight enough condition say?

Answer

The relationship between your data has to be approximately linear.

Question

What does the outlier condition say?

Answer

Be careful with outliers. It is a good idea to present a correlation including outliers and one excluding outliers.

Question

How do you calculate the residuals?

Answer

\(\varepsilon=y-\hat{y}\).

Question

How do you calculate the correlation coefficient?

Answer

\(r=\frac{\sum z_xz_y}{n-1}\).

Question

If the correlation coefficient is \(r<0\), what happens with \(y\) when \(x\) increases?

Answer

\(y\) tends to decrease.

Question

If the correlation coefficient is \(r=0\), what happens with \(y\) when \(x\) increases?

Answer

There is no linear relationship between \(x\) and \(y\), so \(y\) shows no linear tendency to increase or decrease.

Question

What is the best regression line?

Answer

The least squares regression line.

Question

What is the formula for \(a\) and \(b\) in the least squares regression line \(\hat{y}=a+bx\)?

Answer

\(b=\frac{\sum(x-\mu_{x})(y-\mu_{y})}{\sum(x-\mu_{x})^2}\) and \(a=\mu_{y}-b\mu_{x}\).

Question

When is multiple linear regression used?

Answer

When there are two or more independent variables.
