Module 08: Regression

Intended Learning Outcomes

By the end of this session, you should be able to:

  • Load and check a dataset

  • Perform exploratory data analysis

  • Determine when (and which) data transformations are appropriate

  • Interpret regression results from transformed data

  • Compare and evaluate regression models

Scenario

Data on abalone (Haliotis spp.) were collected as part of a research study.

Abalone Background

Abalone (via Spanish abulón) is a common name for any of a group of small to very large sea snails, marine gastropod molluscs in the family Haliotidae. The shells of abalones have a low, open spiral structure, and are characterized by several open respiratory pores in a row near the shell’s outer edge. The thick inner layer of the shell is composed of nacre (mother-of-pearl), which in many species is highly iridescent, giving rise to a range of strong, changeable colors, which make the shells attractive to humans as decorative objects, jewelry, and as a source of colorful mother-of-pearl. The flesh of abalones is widely considered to be a desirable food, and is consumed raw or cooked by a variety of cultures.

Wikipedia

Live Abalone by Tyler Bell | Flickr | https://flic.kr/p/ahw9aM

Live Abalone by Tyler Bell | Flickr | https://flic.kr/p/ahw9aM

Inside of the Abalone Shell by Rojer | Flickr | https://flic.kr/p/7WkcDe

Inside of the Abalone Shell by Rojer | Flickr | https://flic.kr/p/7WkcDe

Abalone Fountain 2 by Dave Herrmann | Flickr | https://flic.kr/p/duVk5T

Abalone Fountain 2 by Dave Herrmann | Flickr | https://flic.kr/p/duVk5T

Agenda

  • Install and Load the ‘AppliedPredictiveModeling’ library
  • Load the abalone data set: ‘data(abalone)’
  • Determine a linear model that predicts Shucked Weight from Diameter.
  • Create a scatter plot that features a line of best fit using the information from your model.
  • Summarize your findings in a single sentence describing how Diameter predicts Shucked Weight.
  • Bonus goal: Create and plot an optimized multivariate linear model that predicts shucked weight.

Data

Exploration

#load the abalone data set
#https://rpubs.com/AlistairGJ/Abalone
data(abalone)

#check the data
head(abalone)
#look at the structure of the data
str(abalone)
## 'data.frame':    4177 obs. of  9 variables:
##  $ Type         : Factor w/ 3 levels "F","I","M": 3 3 1 3 2 2 1 1 3 1 ...
##  $ LongestShell : num  0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
##  $ Diameter     : num  0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
##  $ Height       : num  0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
##  $ WholeWeight  : num  0.514 0.226 0.677 0.516 0.205 ...
##  $ ShuckedWeight: num  0.2245 0.0995 0.2565 0.2155 0.0895 ...
##  $ VisceraWeight: num  0.101 0.0485 0.1415 0.114 0.0395 ...
##  $ ShellWeight  : num  0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
##  $ Rings        : int  15 7 9 10 7 8 20 16 9 19 ...
#store data as a tibble (shorter name also)
#this is more for personal preference
df<-as_tibble(abalone)

#check the data/result
head(df)
#filter by female abalone and drop the Type column 
#df<-filter(df, Type %in% c("F"))

#we don't need this anymore because all data represent female abalone
#df<-df[,-1]

Data Correlations

We can quickly investigate correlations with GGally. From the pairs plot, we notice:

  • some of the relationships appear to be strongly non-linear.

  • some relationships are linear

  • Male (blue) and Female (red) are distributed similarly. Infants (green) are distributed differently vs adults.

  • Some correlations are very high (e.g. LongestShell and Diameter ~ 0.987)

Here is a ‘pairs plot’ using the ggpairs() function.

#explore variable relationships
#this will probably take some time to calculate
#skip Rings since we just care about Age now
#ggpairs(data=abalone, columns=c(2:8,10), title="abalone data")
ggpairs(df, 
        aes(colour = Type, alpha = 0.8), 
        title="Pairs plot") + 
  theme_minimal(base_size = 12) #font size

Here is another ‘pairs plot’ using a subset of variables.

df_subset<-df[,c(1,2,3,4,6)] #subset with specific columns
ggpairs(df_subset, 
        aes(colour = Type, alpha = 0.8), 
        title="Pairs plot") + 
  theme_minimal(base_size = 16) #font size

Single-variable model

Note: “shucked” translates as “descascarado” (e.g. abulón descascarado)

For example, it may seem reasonable that the variable “ShuckedWeight” seems to relate to the variable “Diameter”. We can therefore create a null hypothesis that states the linear relationship of the two variables. We can investigate the strength of this relationship and determine its significance.

\(H_0: Shucked Weight = Intercept + Slope(Diameter) + Error\)

Create a Linear Model

Use lm() to create a linear model. Note the Y~X format used to indicate which variable is predicting which other variable.

#create the linear model
fit<-lm(df$ShuckedWeight~df$Diameter, data=df)

Summary

The summary indicates a decent \(R^2\) value and a significant p-value. From the numbers, Diameter is a decent predictor of Shucked Weight. Is this acceptable?

summary(fit)
## 
## Call:
## lm(formula = df$ShuckedWeight ~ df$Diameter, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.23847 -0.06604 -0.01760  0.04500  0.68491 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.455447   0.006535  -69.69   <2e-16 ***
## df$Diameter  1.997675   0.015568  128.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09984 on 4175 degrees of freedom
## Multiple R-squared:  0.7977, Adjusted R-squared:  0.7977 
## F-statistic: 1.647e+04 on 1 and 4175 DF,  p-value: < 2.2e-16

Initial Plot

It’s a good idea to create a plot even if it is very simple. We will let X = Diameter (predictor) and Y = ShuckedWeight (predicted). We can see that maybe it is not such a good idea to assume this is a linear relationship. We will need to do something about this. What can we do?

plot(df$Diameter,df$ShuckedWeight,xlab="Diameter (mm)", ylab="Shucked Weight (g)")

###Follow-up Plot We can re-plot the data with the line of best fit suggested from the linear model. These data appear to be related in a non-linear way so we can’t rely on a simple linear regression (yet).

plot(df$Diameter,df$ShuckedWeight,xlab="Diameter (mm)", ylab="Shucked Weight (g)")
abline(fit, col="#008EED", lwd=5)

Data Transformation

Nonlinear relationships are very common and one simple solution is to log-transform the data. This fundamentally changes how the results must be interpreted however. You might want to transform one or more variables in your model.

Interpretation Rules (Log)

  • Only independent/predictor variable(s) (X) is/are log-transformed: Interpret the (coefficient/100) as the units of change in Y when X changes by one percent. (e.g. Coefficient = 53; “When X increases by 1%, Y changes by 0.53 units.”)

  • Only the dependent/response variable (Y) is log-transformed: Interpret the coefficient as the percent change in Y when X changes by one unit. (e.g. Coefficient = 1.9; “When X increases by 1 unit, Y changes by 1.9%.”)

  • Both dependent/response variable and independent/predictor variable(s) are log-transformed: Interpret the coefficient as the percent change in Y for every 1% increase in X. (e.g. Coefficient = 3.7; “When X increases by 1%, Y changes by 3.7%.”)

Log Effects

We can take the log of two variables and add the resulting columns to the data frame. Then we can plot the untransformed and transformed variables to see the effects. Notice how log transformations linearize the data.

#log transformation of ShuckedWeight
df$lsw<-log(df$ShuckedWeight)

#log transformation of Diameter
df$ld<-log(df$Diameter)

#create a 2x2 grid of plots
par(mfrow=c(2,2))
plot(df$Diameter, df$ShuckedWeight,xlab="", ylab="")
mtext("Diameter (mm)",      side=1, line=3, col="black", cex=1.5)
mtext("Shucked Weight (g)", side=2, line=2, col="black", cex=1.5)

plot(df$ld, df$ShuckedWeight,xlab="", ylab="")
mtext("log (Diameter (mm))", side=1, line=3, col="red", cex=1.5)
mtext("Shucked Weight (g)", side=2, line=2, col="black", cex=1.5)

plot(df$Diameter, df$lsw,xlab="", ylab="")
mtext("Diameter (mm)",           side=1, line=3, col="black", cex=1.5)
mtext("log (Shucked Weight (g))", side=2, line=2, col="red", cex=1.5)

plot(df$ld, df$lsw,xlab="", ylab="")
mtext("log (Diameter (mm))",      side=1, line=3, col="red", cex=1.5)
mtext("log (Shucked Weight (g))", side=2, line=2, col="red", cex=1.5)

A (better) Linear Model: Cube Root

Another option here is the cube root. In this case, we can transform the dependent variable and achieve similar results to the log transformation. The data are more normally distributed with this transformation.

#cube root transformation on the response variable
#Interpret as: a one unit increase in the cube root of X is related to Y.
#An increase in diameter from 2 to 8 would be associated with the same change in satisfaction as an increase from 8 to 27 or 27 to 64.
plot((df$Diameter),(df$ShuckedWeight^(1/3)))

hist(df$ShuckedWeight^(1/3))

Model Fitting

Use lm() to create a linear model. Note the Y~X format used to indicate which variable is predicting which other variable. We can fit both tranformed models to compare.

#summarize the linear model using log-transformed data
fit_log<-lm(lsw~ld,data=df)
summary(fit_log)
## 
## Call:
## lm(formula = lsw ~ ld, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.76395 -0.13156 -0.01231  0.11617  2.45553 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.38801    0.01093   127.0   <2e-16 ***
## ld           2.86907    0.01117   256.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2101 on 4175 degrees of freedom
## Multiple R-squared:  0.9404, Adjusted R-squared:  0.9404 
## F-statistic: 6.593e+04 on 1 and 4175 DF,  p-value: < 2.2e-16
#create the linear model using cubic root (cr)
fit_cr<-lm(ShuckedWeight^(1/3)~Diameter, data=df)
summary(fit_cr)
## 
## Call:
## lm(formula = ShuckedWeight^(1/3) ~ Diameter, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14820 -0.02860 -0.00414  0.02412  0.43964 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.025056   0.003028   8.276   <2e-16 ***
## Diameter    1.591929   0.007213 220.716   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04625 on 4175 degrees of freedom
## Multiple R-squared:  0.9211, Adjusted R-squared:  0.921 
## F-statistic: 4.872e+04 on 1 and 4175 DF,  p-value: < 2.2e-16

Model Assumptions

We can check some basic assumptions about the residuals using some functions from library(olsrr):

  • The error has a normal distribution (normality assumption).

  • The errors have mean zero.

  • The errors have same but unknown variance (homoscedasticity assumption).

  • The error are independent of each other (independent errors assumption).

NOTE: The following plots use the cubic root model (fit_cr).

#The residuals spread randomly around the 0 line indicating that the relationship is linear.
#The residuals form an approximate horizontal band around the 0 line indicating homogeneity of error variance.
#No one residual is visibly away from the random pattern of the residuals indicating that there are no outliers.

#look for a straight line
ols_plot_resid_qq(fit_cr)

# evenly scattered points around y = 0
ols_plot_resid_fit(fit_cr)

# approximately normal distribution 
ols_plot_resid_hist(fit_cr)

A (better) Summary

How did the \(R^2\) value change following the transformation? Was this an improvement? Our output equation can be created from the summary information.

\(Shucked Weight = InterceptValue + (CoefficientValue * Diameter) + \varepsilon\)

Note: \(\varepsilon \sim N(0, Residual Standard Error^2)\).

#print the model, intercept, and coefficient
fit_cr
## 
## Call:
## lm(formula = ShuckedWeight^(1/3) ~ Diameter, data = df)
## 
## Coefficients:
## (Intercept)     Diameter  
##     0.02506      1.59193
#view only the intercept and slope/coefficient
coef(fit_cr)
## (Intercept)    Diameter 
##  0.02505567  1.59192948
#store the intercept and slope for the linear model
#intercept <- coef(fit_cr)[1]
#slope     <- coef(fit_cr)[2]

#print the summary
summary(fit_cr)
## 
## Call:
## lm(formula = ShuckedWeight^(1/3) ~ Diameter, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14820 -0.02860 -0.00414  0.02412  0.43964 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.025056   0.003028   8.276   <2e-16 ***
## Diameter    1.591929   0.007213 220.716   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04625 on 4175 degrees of freedom
## Multiple R-squared:  0.9211, Adjusted R-squared:  0.921 
## F-statistic: 4.872e+04 on 1 and 4175 DF,  p-value: < 2.2e-16

Follow-up Plot

We can re-plot the data with the line of best fit suggested from the linear model. This relationship is the same as before, but through transformation we can interpret it more accurately: A one unit increase in X results in a cube root unit increase of Y.

plot(df$Diameter,df$ShuckedWeight^(1/3),xlab="", ylab="", pch=19, col="#00000011")
mtext("Diameter (mm)",      side=1, line=3, col="black", cex=1.5)
mtext("Cube Root Shucked Weight (g)", side=2, line=2, col="black", cex=1.5)
abline(fit_cr, col="#008EED", lwd=4)
#curve(0.02506+(1.59193*x), add = TRUE, col="red", lwd=1) #same as abline output line
text(0.1, 1.1, expression(Y == 0.02506+1.59193*x), cex = 1.5, col="#008EED", adj=0)

We can plot the same information in the original data set (back-transforming). Note the equation in the plot.

plot(df$Diameter,df$ShuckedWeight,xlab="", ylab="", pch=19, col="#00000011")
mtext("Diameter (mm)",      side=1, line=3, col="black", cex=1.5)
mtext("Shucked Weight (g)", side=2, line=2, col="black", cex=1.5)
curve(0.02506+(1.59193*x)^3, add = TRUE, col="#008EED", lwd=4)
text(0.1, 1.4, expression(Y == 0.02506+(1.59193*x)^3), cex = 1.5, col="#008EED", adj=0)

Multiple Linear Regression

Multiple regression models include more than one predictor for a single predicted variable. Multivariate regression involves multiple independent and multiple dependent variables. For our example, we will only investigate multiple regression.

Simple Linear Regression

\(Y = b_0 + b_1X\)

Multiple Linear Regression

\(Y = b_0 + b_1X_1 + b_2X_2 + b_3X_3\)

Model Formation

Guiding Questions:

  • How do we create a linear model that includes multiple predictors?

  • How do we decide which variables are useful?

Model Evaluation

Guiding Questions:

  • How do we compare models?

  • Are more variables helpful?

Model Selection

Guiding Questions:

  • How do we select the ‘best’ model?

  • How do we defend our decision?

Models with Multiple Predictors

We can add variables to our original model and test them to see if the addition was useful.

Fit1 is a model containing the original cube-root transformation of Shucked Weight and Diameter.

fit1 <- lm(ShuckedWeight^(1/3)~Diameter,data=df)
summary(fit1)
## 
## Call:
## lm(formula = ShuckedWeight^(1/3) ~ Diameter, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14820 -0.02860 -0.00414  0.02412  0.43964 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.025056   0.003028   8.276   <2e-16 ***
## Diameter    1.591929   0.007213 220.716   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04625 on 4175 degrees of freedom
## Multiple R-squared:  0.9211, Adjusted R-squared:  0.921 
## F-statistic: 4.872e+04 on 1 and 4175 DF,  p-value: < 2.2e-16
#extract only the r-squared value if needed
summary(fit1)$r.squared #or use adj.r.squared
## [1] 0.9210632

Fit2 contains Fit1 + LongestShell

fit2 <- lm(ShuckedWeight^(1/3)~Diameter+LongestShell,data=df)
summary(fit2)
## 
## Call:
## lm(formula = ShuckedWeight^(1/3) ~ Diameter + LongestShell, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.13519 -0.02658 -0.00367  0.02234  0.44587 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.009158   0.003057  -2.996  0.00275 ** 
## Diameter      0.485106   0.041042  11.820  < 2e-16 ***
## LongestShell  0.926857   0.033915  27.329  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04261 on 4174 degrees of freedom
## Multiple R-squared:  0.933,  Adjusted R-squared:  0.933 
## F-statistic: 2.908e+04 on 2 and 4174 DF,  p-value: < 2.2e-16
#extract only the r-squared value if needed
summary(fit2)$r.squared #or use adj.r.squared
## [1] 0.9330438

Fit3 contains Fit2+ Height

fit3 <- lm(ShuckedWeight^(1/3)~Diameter+LongestShell+Height,data=df)
summary(fit3)
## 
## Call:
## lm(formula = ShuckedWeight^(1/3) ~ Diameter + LongestShell + 
##     Height, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.13718 -0.02603 -0.00356  0.02268  0.44535 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.007696   0.003037  -2.534   0.0113 *  
## Diameter      0.420354   0.041445  10.142   <2e-16 ***
## LongestShell  0.911585   0.033691  27.058   <2e-16 ***
## Height        0.236187   0.028354   8.330   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04226 on 4173 degrees of freedom
## Multiple R-squared:  0.9341, Adjusted R-squared:  0.9341 
## F-statistic: 1.973e+04 on 3 and 4173 DF,  p-value: < 2.2e-16
#extract only the r-squared value if needed
summary(fit3)$r.squared #or use adj.r.squared
## [1] 0.9341389

ANOVA

Compare models with ANOVA

anova(fit1, fit2)
anova(fit2, fit3)

The models can be neatly summarized using stargazer().

#summarize models with stargazer
#Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
#R package version 5.2.2. https://CRAN.R-project.org/package=stargazer 
library(stargazer)
stargazer(fit1, fit2, fit3,type="text",no.space=TRUE,omit.stat=c("f"))
## 
## =========================================================================
##                                      Dependent variable:                 
##                     -----------------------------------------------------
##                                         ShuckedWeight                    
##                            (1)               (2)               (3)       
## -------------------------------------------------------------------------
## Diameter                1.592***          0.485***          0.420***     
##                          (0.007)           (0.041)           (0.041)     
## LongestShell                              0.927***          0.912***     
##                                            (0.034)           (0.034)     
## Height                                                      0.236***     
##                                                              (0.028)     
## Constant                0.025***          -0.009***         -0.008**     
##                          (0.003)           (0.003)           (0.003)     
## -------------------------------------------------------------------------
## Observations              4,177             4,177             4,177      
## R2                        0.921             0.933             0.934      
## Adjusted R2               0.921             0.933             0.934      
## Residual Std. Error 0.046 (df = 4175) 0.043 (df = 4174) 0.042 (df = 4173)
## =========================================================================
## Note:                                         *p<0.1; **p<0.05; ***p<0.01

Intended Learning Outcomes

By the end of this session, you should be able to:

  • Load and check a dataset

  • Perform exploratory data analysis

  • Determine when (and which) data transformations are appropriate

  • Interpret regression results from transformed data

  • Compare and evaluate regression models