class: center, middle, inverse, title-slide

# STA 506 2.0 Linear Regression Analysis

## Simple Linear Regression (cont.)

### Dr Thiyanga S. Talagala

---

# Steps

1. Fit a model.

2. Visualize the fitted model.

3. Check model adequacy.

4. Interpret the coefficients.

5. Make predictions using the fitted model.

---

# Fitted model


```r
library(alr3) # to load the dataset
```

```
Loading required package: car
```

```
Loading required package: carData
```

```r
model1 <- lm(Dheight ~ Mheight, data=heights)
model1
```

```

Call:
lm(formula = Dheight ~ Mheight, data = heights)

Coefficients:
(Intercept)      Mheight  
    29.9174       0.5417  
```

---

# Model summary


```r
summary(model1)
```

```

Call:
lm(formula = Dheight ~ Mheight, data = heights)

Residuals:
   Min     1Q Median     3Q    Max 
-7.397 -1.529  0.036  1.492  9.053 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 29.91744    1.62247   18.44   <2e-16 ***
Mheight      0.54175    0.02596   20.87   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.266 on 1373 degrees of freedom
Multiple R-squared:  0.2408,  Adjusted R-squared:  0.2402 
F-statistic: 435.5 on 1 and 1373 DF,  p-value: < 2.2e-16
```

---

## Interesting questions come to mind

1. How well does this equation fit the data?

2. Is the model likely to be useful as a predictor?

3. Are any of the basic assumptions violated, and if so, how serious is this?

> All of these questions must be investigated before using the model.

> Residuals play a key role in answering these questions.

---

![](recap2.png)

---

![](S3pn32.PNG)

---

## Model assumptions

1) The mean of the response, `\(E(Y_i)\)`, at each value of the predictor, `\(x_i\)`, is a **L**inear function of `\(x_i\)`.

---

## Model assumptions

2) The error term `\(\epsilon\)` has zero mean.

3) At each value of the predictor, `\(x\)`, the errors have equal, constant variance `\(\sigma^2\)`.

![](erro.png)

source: http://reliawiki.org/index.php/Simple_Linear_Regression_Analysis

---

## Model assumptions

4) The errors are uncorrelated.

5) At each value of the predictor, `\(x\)`, the errors are normally distributed.

> Taken together, assumptions 4 and 5 imply that the errors are independent random variables.

> Assumption 5 is required for parametric statistical inference (hypothesis testing, interval estimation).

---

An alternative way to describe all four assumptions (2-5) is that the errors, `\(\epsilon_i\)`, are independent normal random variables with mean zero and constant variance, `\(\sigma^2\)`.

---

## Diagnosing violations of the assumptions

Diagnostic methods are primarily based on the model residuals.

## Residuals

`$$e_i = \text{Observed value} - \text{Fitted value}$$`

`$$e_i = y_i - \hat{y}_i$$`

- Deviation between the observed value (true value) and the fitted value.
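
---

### Residuals in R

A minimal sketch (not part of the worked example that follows): fitted values `\(\hat{y}_i\)` and residuals `\(e_i\)` can be extracted directly from the fitted `lm` object using the base R functions `fitted()` and `resid()`.


```r
y_hat <- fitted(model1) # fitted values, one per observation
e <- resid(model1)      # residuals e_i = y_i - fitted value

head(e)  # first few residuals
mean(e)  # least-squares residuals (with an intercept) average to essentially zero

# Check the definition e_i = y_i - fitted value against the data
all.equal(unname(e), heights$Dheight - unname(y_hat))
```

The next slides build the same quantities step by step by hand, which makes the definition explicit.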
---


```r
df <- alr3::heights
df$fitted <- 30.7 + (0.52*df$Mheight)
head(df, 10)
```

```
   Mheight Dheight fitted
1     59.7    55.1 61.744
2     58.2    56.5 60.964
3     60.6    56.0 62.212
4     60.7    56.8 62.264
5     61.8    56.0 62.836
6     55.5    57.9 59.560
7     55.4    57.1 59.508
8     56.8    57.6 60.236
9     57.5    57.2 60.600
10    57.3    57.1 60.496
```

First fitted value: 30.7 + (0.52 * 59.7) = 61.744

---


```r
df <- alr3::heights
df$fitted <- 30.7 + (0.52*df$Mheight)
df$residuals <- df$Dheight - df$fitted
head(df, 10)
```

```
   Mheight Dheight fitted residuals
1     59.7    55.1 61.744    -6.644
2     58.2    56.5 60.964    -4.464
3     60.6    56.0 62.212    -6.212
4     60.7    56.8 62.264    -5.464
5     61.8    56.0 62.836    -6.836
6     55.5    57.9 59.560    -1.660
7     55.4    57.1 59.508    -2.408
8     56.8    57.6 60.236    -2.636
9     57.5    57.2 60.600    -3.400
10    57.3    57.1 60.496    -3.396
```

First fitted value: 30.7 + (0.52 * 59.7) = 61.744

First residual value: 55.1 - 61.744 = -6.644

- It is convenient to think of residuals as the realized or observed values of the model errors.

- Residuals have zero mean.

- Residuals are **not** independent.

---

### Observation-level statistics: `augment()`


```r
library(broom)
library(tidyverse)
model1_fitresid <- augment(model1)
model1_fitresid
```

```
# A tibble: 1,375 × 8
   Dheight Mheight .fitted .resid     .hat .sigma .cooksd .std.resid
     <dbl>   <dbl>   <dbl>  <dbl>    <dbl>  <dbl>   <dbl>      <dbl>
 1    55.1    59.7    62.3  -7.16 0.00172    2.26 0.00862     -3.16 
 2    56.5    58.2    61.4  -4.95 0.00310    2.26 0.00743     -2.19 
 3    56      60.6    62.7  -6.75 0.00118    2.26 0.00523     -2.98 
 4    56.8    60.7    62.8  -6.00 0.00113    2.26 0.00397     -2.65 
 5    56      61.8    63.4  -7.40 0.000783   2.26 0.00418     -3.27 
 6    57.9    55.5    60.0  -2.08 0.00707    2.27 0.00303     -0.923
 7    57.1    55.4    59.9  -2.83 0.00725    2.27 0.00574     -1.25 
 8    57.6    56.8    60.7  -3.09 0.00492    2.27 0.00461     -1.37 
 9    57.2    57.5    61.1  -3.87 0.00395    2.26 0.00579     -1.71 
10    57.1    57.3    61.0  -3.86 0.00421    2.26 0.00616     -1.71 
# … with 1,365 more rows
```

---

class: duke-orange, center, middle

# Residual analysis

---

background-image: url('errors.PNG')
background-position: right
background-size: contain

## Plot of residuals vs fitted values

.pull-left[

This is useful for detecting several common types of model inadequacies.

]

---

### Our example

1) The relationship between the response `\(Y\)` and the regressors is linear, at least approximately. (Residuals vs Fitted / Residuals vs X - checking both is optional in simple linear regression)

2) The error term `\(\epsilon\)` has zero mean. (Residuals vs Fitted)

3) The error term `\(\epsilon\)` has constant variance `\(\sigma^2\)`. (Residuals vs Fitted)

.pull-left[

Residuals vs Fitted

![](regression3_files/figure-html/unnamed-chunk-6-1.png)<!-- -->

]

.pull-right[

Residuals vs X

![](regression3_files/figure-html/unnamed-chunk-7-1.png)<!-- -->

]

---

### Note

In simple linear regression, it is not necessary to plot residuals against both the fitted values and the regressor variable: the fitted values are a linear function of the regressor, so the two plots differ only in the scale of the horizontal axis (the abscissa).

---

## 4) The errors are uncorrelated.

- Often, we can conclude that this assumption is sufficiently met based on a description of the data and how they were collected.

> Use a random sample to ensure independence of observations.

- If the time sequence in which the data were collected is known, plot the residuals in time sequence.

---

## 5) The errors are normally distributed.
.pull-left[


```r
ggplot(model1_fitresid, aes(x=.resid)) +
  geom_histogram(colour="white") +
  ggtitle("Distribution of Residuals")
```

![](regression3_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

]

.pull-right[


```r
ggplot(model1_fitresid, aes(sample=.resid)) +
  stat_qq() + stat_qq_line() +
  labs(x="Theoretical Quantiles", y="Sample Quantiles")
```

![](regression3_files/figure-html/unnamed-chunk-9-1.png)<!-- -->

]


```r
shapiro.test(model1_fitresid$.resid)
```

```

	Shapiro-Wilk normality test

data:  model1_fitresid$.resid
W = 0.99859, p-value = 0.3334
```

---

## 5) The errors are normally distributed (cont.)

`\(H_0:\)` Errors are normally distributed.

`\(H_1:\)` Errors are not normally distributed.

Since the p-value (0.3334) is greater than 0.05, we do not reject `\(H_0\)`: there is no evidence against the normality assumption.

---

class: duke-orange, center, middle

# Coefficient of Determination

---

.pull-right[

![](variation.PNG)

]

.pull-left[

> Residuals measure the variability in the **response variable** not explained by the regression model.

]

---

background-image: url('weight2.png')
background-position: center
background-size: contain

---

## Coefficient of Determination

`$$R^2 = \frac{SS_M}{SS_T} = 1-\frac{SS_{R}}{SS_T}$$`

`\(SS_T\)` - Total variation: a measure of the variability in `\(y\)` without considering the effect of the regressor variable `\(x\)`; it measures the variation of the `\(y\)` values around their mean.

`\(SS_M\)` - Explained variation: the variation in `\(y\)` attributable to the relationship between `\(x\)` and `\(y\)`.

`\(SS_{R}\)` (also written `\(SS_{E}\)`) - Unexplained variation: a measure of the variability in `\(y\)` remaining after `\(x\)` has been considered, i.e. variation attributable to factors other than the relationship between `\(x\)` and `\(y\)`.

`\(R^2\)` - Proportion of the variation in `\(Y\)` explained by its relationship with `\(x\)`.

---

## Coefficient of Determination

`$$0 \leq R^2 \leq 1$$`

Values of `\(R^2\)` that are close to 1 imply that most of the variability in `\(Y\)` is explained by the regression model.

`\(R^2\)` should be interpreted with caution. (We will talk more about this in multiple linear regression analysis.)

---

### Our example


```r
summary(model1)
```

```

Call:
lm(formula = Dheight ~ Mheight, data = heights)

Residuals:
   Min     1Q Median     3Q    Max 
-7.397 -1.529  0.036  1.492  9.053 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 29.91744    1.62247   18.44   <2e-16 ***
Mheight      0.54175    0.02596   20.87   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.266 on 1373 degrees of freedom
Multiple R-squared:  0.2408,  Adjusted R-squared:  0.2402 
F-statistic: 435.5 on 1 and 1373 DF,  p-value: < 2.2e-16
```

`\(24\%\)` of the variability in daughters' heights is accounted for by the regression model.

---

# `\(R^2\)` = 24.08%

- Maybe you have one or more omitted variables. It is important to consider other factors that might influence the daughter's height:

  - Father's height

  - Physical activities performed by the daughter

  - Food and nutrition, etc.

- Maybe the functional form of the regression model is incorrect (so you may have to add quadratic or cubic terms); a transformation can also be an alternative (if appropriate).

- Maybe it is the effect of a group of outliers (not just one).

---

- A large `\(R^2\)` does not necessarily imply that the regression model will be an accurate predictor.

- `\(R^2\)` does not measure the appropriateness of the linear model.

- `\(R^2\)` will often be large even though `\(Y\)` and `\(X\)` are nonlinearly related (see the simulated illustration on the next slide).
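
---

### Illustration: a large `\(R^2\)` from a nonlinear relationship

A small simulation sketch (the data here are made up for illustration; they are not from the `heights` example). The true relationship between `y` and `x` is quadratic, yet a straight-line fit still yields a large `\(R^2\)`.


```r
set.seed(506)                      # for reproducibility
x <- seq(1, 10, length.out = 100)  # regressor values
y <- x^2 + rnorm(100, sd = 4)      # response generated from a quadratic relationship

fit <- lm(y ~ x)        # fit a straight line to the curved data
summary(fit)$r.squared  # R^2 is large even though the true relationship is not linear

plot(fit, which = 1)    # the residuals-vs-fitted plot reveals the curvature
```

A large `\(R^2\)` alone therefore cannot confirm that the linear form is appropriate; the residual plots are still needed.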
---

## Relationship between `\(r\)` and `\(R^2\)`


```r
cor(heights$Mheight, heights$Dheight)
```

```
[1] 0.4907094
```

```r
cor(heights$Mheight, heights$Dheight)^2
```

```
[1] 0.2407957
```

In simple linear regression, `\(R^2\)` equals the square of the correlation coefficient `\(r\)` between `\(x\)` and `\(y\)`: here `\(0.4907^2 \approx 0.2408\)`, matching the `Multiple R-squared` reported by `summary(model1)`.

---

## Is correlation enough?

- Correlation is only a measure of association and is of little use in prediction.

- Regression analysis is useful in developing a functional relationship between variables, which can be used for prediction and making inferences.

---

## Next Lecture

> More work - Simple Linear Regression, Hypothesis testing, Predictions

---

class: center, middle

Acknowledgement

Introduction to Linear Regression Analysis, Douglas C. Montgomery, Elizabeth A. Peck, G. Geoffrey Vining

All rights reserved by [Dr. Thiyanga S. Talagala](https://thiyanga.netlify.app/)