Chapter 10 Correlation and Regression
Learning Outcome
Construct linear regression models and compute correlation coefficients from datasets.
This chapter continues to build on the analysis of paired sample data. Here, we present methods for determining whether there is a correlation, or association, between two variables. For linear correlations, we identify the equation of the straight line that best fits the data, which can then be used to predict the value of one variable given the value of the other.
10.1 Correlation
A correlation exists between two variables when the values of one variable are somehow associated with the values of the other variable.
A linear correlation exists between two variables when there is a correlation and the plotted points of paired data result in a pattern that can be approximated by a straight line.
Bivariate Data - Scatterplot
- Scatterplots exhibit the relationship between two numeric variables.
- They are used for detecting patterns, trends, and relationships.
Bivariate Data - Positive Association
- The unit of observation used in the plot is ‘county’.
The plot shows a positive association between education and income.
Linear Association between Variables
Linear Association: the data points in the scatterplot are clustered around a straight line.
Positive (Linear) Association: above average values of one variable tend to go with above average values of the other; scatterplot slopes up.
Negative (Linear) Association: above average values of one variable tend to go with below average values of the other; scatterplot slopes down.
No (Linear) Association: Scatterplot shows no direction
10.2 Correlation Coefficient
Correlation coefficient r, where -1 \le r \le +1:
- measures the strength of the linear correlation (association) between the paired quantitative x values and y values in a sample; in other words, how tightly the points are clustered about a straight line.
- in order to draw formal inferences about a linear correlation between two random variables, the pairs of (x,y) data must have a bivariate normal distribution.
10.2.1 Calculating Correlation Coefficient
| x | y | z_x | z_y | z_x ⋅ z_y |
|---|---|-----|-----|-----------|
| 1 | 2 | −1.414 | −0.777 | 1.098 |
| 2 | 3 | −0.707 | −0.291 | 0.206 |
| 3 | 1 | 0 | −1.263 | 0 |
| 4 | 6 | 0.707 | 1.166 | 0.824 |
| 5 | 6 | 1.414 | 1.166 | 1.648 |

\bar x = 3, \space s_x = 1.58 \qquad \bar y = 3.6, \space s_y = 2.30

r = \dfrac{1}{n} \sum z_x \cdot z_y = 0.755
Formula for Calculating Correlation Coefficient
If the data are (x_i, y_i), \space 1 \le i \le n, then
\begin{align} \text{population: } \rho &= \dfrac{1}{n} \sum_{i=1}^{n} \left( \dfrac{x_i - \mu_x}{\sigma_x} \right) \left( \dfrac{y_i - \mu_y}{\sigma_y} \right) \\ \text{sample: } r &= \dfrac{1}{n-1} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar x}{s_x} \right) \left( \dfrac{y_i - \bar y}{s_y} \right) \end{align}
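The hand calculation in 10.2.1 can be checked with base R's `cor()`. Note that the population (1/n) and sample (1/(n−1)) versions above always give the same value of r, since both reduce to the same ratio of sums; this is a quick sketch using the data from the table above.

```r
# Verify the hand calculation from Section 10.2.1
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 1, 6, 6)

cor(x, y)   # approximately 0.755, matching the value computed above
```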
Properties of Correlation Coefficient
- r is a pure number with no units
- −1≤r≤+1
- Adding a constant to one of the variables does not affect r
- Multiplying one of the variables by a positive constant does not affect r
- Multiplying one of the variables by a negative constant switches the sign of r but does not affect the absolute value of r
- r measures the strength of a linear relationship. It is not designed to measure the strength of a non-linear relationship.
- r is very sensitive to outliers.
Correlation coefficient measures linear association
If two variables have a non-zero correlation, then they are related to each other in some way, but that does not mean that one causes the other.
Two variables may appear to be strongly associated and yet have r close to 0; this happens when the relationship is clearly nonlinear. r measures linear association, so don't use it when the scatter diagram is nonlinear.
10.2.2 Hypothesis Testing for Correlation Coefficient
Hypotheses: If conducting a formal hypothesis test to determine whether there is a significant linear correlation between two variables, use the following null and alternative hypotheses, where ρ represents the linear correlation coefficient of the population:
\begin{align} \text{Null Hypothesis } H_0 &: \rho = 0 \text{ (no correlation)} \\ \text{Alt. Hypothesis } H_a &: \rho \ne 0 \text{ (correlation)} \\ \text{Test statistic: } U &= \dfrac{r}{\sqrt{\dfrac{1-r^2}{n-2}}} \end{align}
where r is the correlation coefficient. When H0 is true, U has a Student’s t-distribution with n−2 degrees of freedom.
Then calculate the p-value and compare it with the significance level (e.g., 5%) to decide whether or not to reject the null hypothesis.
Example:
\begin{align} H_0 &: \rho = 0 \\ H_a &: \rho \ne 0 \end{align}
From the data, r = 0.828 and n = 6, so df = n - 2 = 4.
\begin{align} t &= \dfrac{r}{\sqrt{\dfrac{1-r^2}{n-2}}} = \dfrac{0.828}{\sqrt{\dfrac{1-0.828^2}{6-2}}} = 2.955 \\ p\text{-value} &= 0.04176 < 0.05 \end{align}
Hence, we reject the null hypothesis. The data suggest that there is a linear association between the dinner bill and the tip.
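The raw bill and tip values are not listed here, but the test can be reproduced from the reported r and n with base R; `pt()` supplies the Student's t tail probability. (With the raw data, `cor.test(bill, tip)` would report the same t statistic and p-value; `bill` and `tip` are hypothetical vector names.)

```r
# Reproduce the correlation test from the reported summary values
r <- 0.828
n <- 6

t_stat <- r / sqrt((1 - r^2) / (n - 2))
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)   # two-sided p-value

t_stat   # approximately 2.95
p_val    # approximately 0.042, which is less than 0.05
```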
10.3 Estimation from One Variable
When we have information available from only one variable, the mean of the variable is the best estimate of an unknown data point in that variable.
Estimate the height of one of these people: Heights (inches) ~ N(67,3)
Let’s say the estimate is c.
estimation error = actual height - c
The “best” c is the one that produces the smallest root mean squared (r.m.s.) error.
The r.m.s. of the errors will be smallest if c = μ.
least-squares estimate = μ = 67 and r.m.s. error = σ = 3
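A small simulation sketch of this idea, assuming heights drawn from the N(67, 3) distribution above (the simulated sample is illustrative, not real data):

```r
# The sample mean minimizes root mean squared estimation error
set.seed(1)
heights <- rnorm(10000, mean = 67, sd = 3)   # simulated heights (assumption)

rms_error <- function(c) sqrt(mean((heights - c)^2))

sapply(c(60, 65, 67, 70), rms_error)   # smallest error occurs near c = 67
rms_error(mean(heights))               # approximately sd(heights), i.e. about 3
```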
10.4 Regression Model: Estimation from Two Variables
We estimate the value of one variable (y) from another variable (x). Assume both variables are approximately normally distributed.
Regression is the method for finding the equation of the straight line that best fits the points in a scatterplot of paired sample data. The best-fitting straight line is called the regression line, and its equation is called the regression equation. We can use the regression equation to predict the value of one variable (y), given a specific value of the other variable (x).
10.4.1 A Linear Probabilistic Model
Regression Equation:
We allow for variation in individual responses (y), while associating the mean linearly with the predictor x. The model we fit is as follows:
\begin{align} E(y|x) &= \beta_0 + \beta_1 x \\ \text{individual response: } y &= \underbrace{\beta_0 + \beta_1 x}_{\text{systematic}} + \underbrace{\epsilon}_{\text{random}} \end{align}
where,
x: explanatory, predictor, or independent variable
y: response or dependent variable
β0 and β1 are unknown parameters, and ε is the random error component corresponding to the response whose distribution we assume is N(0,σ). We assume the error terms are independent from one another. Under this model, we say,
y|x \sim N(\beta_0 + \beta_1 x, \space \sigma)
One limitation of linear regression is that we must restrict our interpretation of the model to the range of values of the predictor variables that we observe in our data. We cannot assume this linear relation continues outside the range of our sample data.
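To make the model concrete, here is a minimal simulation sketch; the parameter values beta0 = 35, beta1 = 2, and sigma = 15 are illustrative assumptions, not estimates from any data set.

```r
# Simulate responses under the linear probabilistic model y | x ~ N(beta0 + beta1 * x, sigma)
set.seed(42)
beta0 <- 35; beta1 <- 2; sigma <- 15    # assumed parameter values

x <- runif(100, min = 60, max = 75)           # predictor values
y <- beta0 + beta1 * x +                      # systematic component
     rnorm(100, mean = 0, sd = sigma)         # random error component

head(cbind(x, y))
```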
Requirements
- E(y|x)=β0+β1x is a linear function.
- For each value of x, the corresponding values of y have a normal distribution.
- For the different values of x, the distributions of the corresponding y-values all have the same variance.
Note: Outliers can have strong effect on the regression equation.
10.4.2 Least Squares Estimation of β0 and β1
We now have the problem of using sample data to compute estimates of the parameters β0 and β1. First, we take a sample of n subjects, observing values y of the response variable and x of the predictor variable. We would like to choose, as estimates for β0 and β1, the values b0 and b1 that ‘best fit’ the sample data. The fitted equation is given by:
\hat y = b_0 + b_1 x
We can choose the estimates b0 and b1 to be the values that minimize the distances of the data points to the fitted line. Now, for each observed response yi, with a corresponding predictor variable xi, we obtain a fitted value ˆyi=b0+b1xi. So, we would like to minimize the sum of the squared distances of each observed response to its fitted value. That is, we want to minimize the error sum of squares, SSE, where
SSE = \sum_{i=1}^{n} (y_i - \hat y_i)^2 = \sum_{i=1}^{n} \left( y_i - (b_0 + b_1 x_i) \right)^2
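In R, `lm()` computes the estimates b0 and b1 that minimize SSE. A sketch using the small data set from Section 10.2.1:

```r
# Least-squares fit for the data from Section 10.2.1
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 1, 6, 6)

fit <- lm(y ~ x)
coef(fit)                      # b0 = 0.3, b1 = 1.1

sum(resid(fit)^2)              # SSE of the least-squares line: 9.1
sum((y - (0.3 + 1.2 * x))^2)   # any other line, e.g. slope 1.2, has a larger SSE (9.65)
```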
Derivation of Bivariate Regression Model
From bivariate scatter plot in standard units:
\begin{align} z_y &= \rho \cdot z_x \\ \dfrac{y - \mu_y}{\sigma_y} &= \rho \cdot \dfrac{x - \mu_x}{\sigma_x} \\ y - \mu_y &= \rho \cdot \dfrac{\sigma_y}{\sigma_x} (x - \mu_x) \\ y &= \rho \cdot \dfrac{\sigma_y}{\sigma_x} (x - \mu_x) + \mu_y \\ y &= \left( \mu_y - \rho \cdot \dfrac{\sigma_y}{\sigma_x} \mu_x \right) + \left( \rho \cdot \dfrac{\sigma_y}{\sigma_x} \right) \cdot x \\ y &= \beta_0 + \beta_1 \cdot x \end{align}
\text{Where, } \begin{cases} \text{slope } (\beta_1) = \rho \cdot \dfrac{\sigma_y}{\sigma_x} \\ \text{intercept } (\beta_0) = \mu_y - \beta_1 \cdot \mu_x \end{cases}
When x = \mu_x, \space y = \mu_y.
\bbox[yellow,5px] { \color {black} {\implies \text {The regression line passes through the points of averages } (\mu_x, \mu_y).} }
Example: Finding the Equation
Find the equation of the regression line for estimating weight based on height.
Height (inches) (x): mean = 67 sd = 3
Weight (lb) (y): mean = 174 sd = 21
r = 0.304
\begin{align} \text{slope } (b_1) & = r. \dfrac {s_y}{s_x} \\ \text{intercept } (b_0) & = \bar y - b_1.\bar x \\ \end{align}
slope (b1) = 2.07 lb per inch
intercept (b0) = 35 lb
Regression Equation: Est. weight = 35 + 2.07 × (height)
A person who is 60 inches tall is estimated to weigh about 159 lb.
10.4.3 Interpretation of Regression Equation
Intercept
Mathematically, the intercept is described as the mean response (Y) value when all predictor variables (X) are set to zero. Sometimes a zero setting for the predictor variable(s) is nonsensical, which makes the intercept noninterpretable.
For example, in the following equation: \hat {Weight} = 35 + 2 \times Height
Height = 0 is nonsensical; therefore, the model intercept has no interpretation.
Why is it still crucial to include the intercept in the model?
The constant in the regression model guarantees that the residuals have a mean of zero, which is a key assumption in regression analysis. If we don’t include the constant, the regression line is forced to go through the origin, which means the predicted response must equal zero when all of the predictors equal zero. If the fitted line doesn’t naturally go through the origin, the regression coefficients and predictions will be biased. The constant ensures that the residuals don’t have an overall positive or negative bias, as the simulated example below illustrates.
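A simulated illustration of this point (the data below are hypothetical): fitting the same data with and without an intercept shows that only the model with a constant term forces the residual mean to zero.

```r
# Why the intercept matters: residual mean with and without a constant term
set.seed(7)
x <- runif(50, min = 0, max = 10)
y <- 5 + 2 * x + rnorm(50, sd = 2)   # true line has a non-zero intercept

with_intercept <- lm(y ~ x)
no_intercept   <- lm(y ~ x - 1)      # regression forced through the origin

mean(resid(with_intercept))   # essentially 0 (up to floating-point error)
mean(resid(no_intercept))     # non-zero: the residuals carry an overall bias
```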
Slope
The slope of the regression line measures how much the value of Y changes for every unit of change in X.
For example, in the following equation: \hat {Weight} = 35 + 2 \times Height
The slope is 2 \text { lb per inch} - meaning that if a group of people is one inch taller than another group, the former group will on average be 2 lb heavier than the latter.
Remember, the slope should NOT be interpreted as: if one person gets taller by 1 inch, he/she will put on 2 lb of weight.
10.4.4 Residuals
Which line to use?
Objectively, we want a line that produces the least estimation error.
\text {Data } (y) = \text {Fit } (\hat y) + \text {Residual } (e)
Residuals are the leftover variation in the data after accounting for the model fit. They are the differences between the observed and expected values.
The residual of the i^{th} observation (x_i, y_i) is the difference between the observed response (y_i) and its predicted value based on model fit (\hat y_i): e_i = y_i - \hat y_i
10.4.5 Least Squares Line
In the scatter plot, residual (in other words, estimation error) is shown as the vertical distance between the observed point and the line. If an observation is above the line, then its residual is positive. Observations below the line have negative residuals. One goal in picking the right linear model is for these residuals to be as small as possible.
Common practice is to choose the line that minimizes the sum of squared residuals.
\text{sum of squared residuals} = e_1^2 + e_2^2 + ... + e_n^2
There is only one line that minimizes the sum of squared residuals. It is called the least squares line. Mathematically, it can be shown that the regression line is the least squares line.
r.m.s error of regression
root mean squared (r.m.s) error of regression = r.m.s of residuals =
\bbox[yellow,5px] { \color{black}{\sqrt {1-\rho^2}.\sigma_y} }
- \rho = 1 \text { or } -1: \space Scatter is a perfect straight line; r.m.s error of regression = 0
- \rho = 0: \space No linear association; r.m.s error of regression = \sigma_y
- All other \rho: \space Regression is not perfect, but better than using the average; r.m.s error of regression \lt \sigma_y
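The boxed identity can be checked numerically; using the small data set from Section 10.2.1, the r.m.s. of the residuals from `lm()` matches \sqrt{1-r^2} times the population SD of y.

```r
# Check: r.m.s. error of regression equals sqrt(1 - r^2) * sigma_y
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 1, 6, 6)

fit <- lm(y ~ x)
sqrt(mean(resid(fit)^2))                 # r.m.s. of residuals, approximately 1.35

r       <- cor(x, y)
sigma_y <- sqrt(mean((y - mean(y))^2))   # population SD of y
sqrt(1 - r^2) * sigma_y                  # same value
```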
Conditions for the Least Squares Line
Residual Plot
A residual plot is a scatterplot of the points (x, \space y - \hat y). The plot helps to examine whether the assumptions of the linear regression model are satisfied. One of the most important assumptions of linear regression is that the residuals are independent, normally distributed, and uncorrelated with the explanatory variable (x). If these assumptions are satisfied, then the residual plot should show a random and nearly symmetrical distribution of the residuals about the line y = 0. A good residual plot should show no pattern.
Another assumption, known as the constant error variance, is tested by examining the plot of residual versus fit (\hat y). If this assumption is satisfied, the plot should show constant variability of the residuals about the line y = 0. The spread of the residuals is roughly equal at each level of the fitted values.
These assumptions are violated in data with correlated observations such as time series data (e.g. daily stock price). Linear regression is not appropriate to analyze data with such underlying structures.
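A minimal base-graphics sketch of the two diagnostic plots described above, assuming a fitted model `fit <- lm(y ~ x)` is already available:

```r
# Residuals vs. explanatory variable: should look random around the line y = 0
plot(x, resid(fit), xlab = "x", ylab = "residual")
abline(h = 0, lty = 2)

# Residuals vs. fitted values: spread should be roughly constant across fitted values
plot(fitted(fit), resid(fit), xlab = "fitted value", ylab = "residual")
abline(h = 0, lty = 2)
```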
10.4.6 Coefficient of Determination
\text{total variation = explained variation + unexplained variation} \\ \sum(y - \bar y)^2 = \sum(\hat y - \bar y)^2 + \sum(y - \hat y)^2
R^2 is the proportion of the variance in the dependent variable y that is explained by the linear relationship between x and y.
R^2 = \frac{\text{explained variation}}{\text{total variation}}
R^2 = 58\% suggests that 58\% of the variability in y can be explained by the variability in x.
- R^2 \space provides a measure of how useful the regression line is as a prediction tool.
- If R^2 \space is close to 1, then the regression line is useful.
- If R^2 \space is close to 0, then the regression line is not useful.
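For simple linear regression, R^2 can be read from the model summary and equals the square of the correlation coefficient; continuing with the data set from Section 10.2.1:

```r
# R^2 from the fitted model and its relation to r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 1, 6, 6)
fit <- lm(y ~ x)

summary(fit)$r.squared                                  # approximately 0.571
cor(x, y)^2                                             # same value: R^2 = r^2
sum((fitted(fit) - mean(y))^2) / sum((y - mean(y))^2)   # explained / total variation
```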
10.4.7 Constructing Confidence Intervals for the Slope
The slope b_1 of the least-squares regression line is a point estimate of the population slope \beta_1. When the assumptions of the linear model are satisfied, we can construct a confidence interval for \beta_1. We need a point estimate, a standard error, and the critical value.
To compute the standard error of b_1, we first compute residual standard deviation, denoted as s_e, which measures the spread of the points on the scatterplot around the least-squares regression line. The formula for the residual standard deviation is given by,
s_e = \sqrt{\dfrac{\sum (y - \hat y)^2}{n-2}}
Now, the standard error of b_1 can be found using the following formula,
s_b = \dfrac{s_e}{\sqrt{\sum(x - \bar x)^2}}
Example:
regression model: \hat y = 356.739 + 5.639 x
\begin{align} s_e &= \sqrt{\dfrac{\sum (y - \hat y)^2}{n-2}} = \sqrt{\dfrac{3398.6787}{18-2}} = 14.574547 \\ s_b &= \dfrac{s_e}{\sqrt{\sum(x - \bar x)^2}} = \dfrac{14.574547}{\sqrt{1415.7452}} = 0.387349 \end{align}
Under the assumption of the linear model, \dfrac{b_1 - \beta_1}{s_b} has a student’s t distribution with n-2 degrees of freedom. Therefore, the critical value of a 100(1-\alpha)\% confidence interval is the value t_{\alpha/2} for which the area under the t curve with n-2 degrees of freedom between -t_{\alpha/2} and t_{\alpha/2} is 1-\alpha.
The margin of error for a level 100(1-\alpha)\% confidence interval is:
ME = t_{\alpha/2} \cdot s_b
So, a level 100(1-\alpha)\% confidence interval for \beta_1 is b_1 \pm t_{\alpha/2} \cdot s_b \\ = 5.639 \pm (2.12)(0.387349) \\ = [4.818, 6.460]
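The interval can be reproduced from the reported values with base R; `qt()` supplies the critical value. (With the raw data, `confint(fit)` would give the interval directly; `fit` is a hypothetical fitted-model object.)

```r
# 95% confidence interval for the slope from the reported b1 and s_b
b1  <- 5.639
s_b <- 0.387349
n   <- 18

t_crit <- qt(0.975, df = n - 2)    # approximately 2.12
b1 + c(-1, 1) * t_crit * s_b       # approximately [4.818, 6.460]
```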
10.4.8 Test Hypotheses about the Slope
We can use the values of b_1 and s_b to test hypotheses about the population slope \beta_1.
If \beta_1 = 0, then there is no linear relationship between the explanatory variable x and the outcome variable y. For this reason, the null hypothesis is most often tested as:
H_0 : \beta_1 = 0 \\ H_a : \beta_1 \ne 0
If this null hypothesis is rejected, we conclude that there is a linear relationship between x and y, and that the explanatory variable x is useful in predicting the outcome variable y.
Because the quantity \dfrac{b_1 - \beta_1}{s_b} has a Student’s t distribution with n-2 degrees of freedom, we can construct the test statistic for testing H_0: \beta_1 = 0 by setting \beta_1 = 0.
The test statistic, t = \dfrac{b_1}{s_b}.
Example:
Perform a test of H_0: \beta_1 = 0 \\ H_a : \beta_1 \gt 0 on the slope of the following regression model: \hat y = 356.739 + 5.639 x
b_1 = 5.639 \\ s_b = 0.387 \\ \\ t = \dfrac{b_1}{s_b} = \dfrac{5.639}{0.387} = 14.56 \\ df = 16
The estimated p-value = 5.973 \times 10^{-11}, which is less than 0.05. Therefore, we reject H_0 and conclude that there is a statistically significant linear relationship between y and x.
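The test statistic and one-sided p-value can be reproduced from the reported slope and standard error using `pt()`:

```r
# One-sided test of H0: beta1 = 0 versus Ha: beta1 > 0
b1  <- 5.639
s_b <- 0.387

t_stat <- b1 / s_b                          # approximately 14.56
pt(t_stat, df = 16, lower.tail = FALSE)     # approximately 6e-11, far below 0.05
```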