Linear Regression

If this page doesn’t load properly, click here

Introduction

This tutorial shows step by step process to conduct a bi-variate linear regression analysis for an introductory statistics class.

Dataset

The “Prestige (available from {car} package in R)” dataset is used in this analysis, which was obtained from Fox, J. and Weisberg, S. (2011) An R Companion to Applied Regression, Second Edition, Sage. The Prestige data set has 102 rows and 6 variables. The observations are occupations.

Variable Definitons

This data frame contains the following columns:
education: Average education of occupational incumbents, years, in 1971.
income: Average income of incumbents, dollars, in 1971.
women: Percentage of incumbents who are women.
prestige: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.
census: Canadian Census occupational code.
type: Type of occupation. bc - Blue Collar; prof - Professional, Managerial, and Technical; wc - White Collar.

First 10 rows of the dataset

##                     education income women prestige census type
## gov.administrators      13.11  12351 11.16     68.8   1113 prof
## general.managers        12.26  25879  4.02     69.1   1130 prof
## accountants             12.77   9271 15.70     63.4   1171 prof
## purchasing.officers     11.42   8865  9.11     56.8   1175 prof
## chemists                14.62   8403 11.68     73.5   2111 prof
## physicists              15.64  11030  5.13     77.6   2113 prof
## biologists              15.09   8258 25.65     72.6   2133 prof
## architects              15.44  14163  2.69     78.1   2141 prof
## civil.engineers         14.52  11377  1.03     73.1   2143 prof
## mining.engineers        14.64  11023  0.94     68.8   2153 prof

Examine the data before fitting models

##    education          income          women           prestige    
##  Min.   : 6.380   Min.   :  611   Min.   : 0.000   Min.   :14.80  
##  1st Qu.: 8.445   1st Qu.: 4106   1st Qu.: 3.592   1st Qu.:35.23  
##  Median :10.540   Median : 5930   Median :13.600   Median :43.60  
##  Mean   :10.738   Mean   : 6798   Mean   :28.979   Mean   :46.83  
##  3rd Qu.:12.648   3rd Qu.: 8187   3rd Qu.:52.203   3rd Qu.:59.27  
##  Max.   :15.970   Max.   :25879   Max.   :97.510   Max.   :87.20  
##      census       type   
##  Min.   :1113   bc  :44  
##  1st Qu.:3120   prof:31  
##  Median :5135   wc  :23  
##  Mean   :5402   NA's: 4  
##  3rd Qu.:8312            
##  Max.   :9517

Notice that the ‘type’ variable has 4 missing values.

Correlation Matrix

##            prestige education    income
## prestige  1.0000000 0.8501769 0.7149057
## education 0.8501769 1.0000000 0.5775802
## income    0.7149057 0.5775802 1.0000000

Let’s focus on three variables of our interest - ‘prestige’, ‘education’, and ‘income’. While we observe that prestige is positively correlated with both education and income, education appears to be correlated more strongly than income with prestige.

Plot the data before fitting models

Plot the data to look for outliers, non-linear relationships etc.

As expected, the relationship between education and prestige appears to be more linear than that between income and prestige.