class: center, middle, inverse, title-slide

# STA 506 2.0 Linear Regression Analysis

## Interpretation of Regression Outputs

### Dr Thiyanga S. Talagala

---

## Recapping what we know

**True linear regression model / Population regression equation**

$$\mu_{Y|X=x} = \beta_0 + \beta_1x $$

- how the mean of `\(Y\)` changes for given values of `\(X\)`.

- We can also write the equation in terms of the observed values of `\(Y\)`, rather than the mean. Because the individual observations for any given value of `\(X\)` vary randomly about the mean of `\(Y|X\)`, we need to account for this random variation, or error, in the regression equation.

Sample regression line

`$$Y_i = \beta_0 + \beta_1x_i + \epsilon_i, \text{ i = 1, 2, 3...n}$$`

> Please write the assumptions of `\(\epsilon_i\)`

**Fitted regression line**

`$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1x$$`

---

# Revisiting the fitted regression model

## Fitted model


```r
library(alr3) # to load the dataset
```

```
Loading required package: car
```

```
Loading required package: carData
```

```r
model1 <- lm(Dheight ~ Mheight, data=heights)
model1
```

```

Call:
lm(formula = Dheight ~ Mheight, data = heights)

Coefficients:
(Intercept)      Mheight  
    29.9174       0.5417  
```

---

## Model summary


```r
summary(model1)
```

```

Call:
lm(formula = Dheight ~ Mheight, data = heights)

Residuals:
   Min     1Q Median     3Q    Max 
-7.397 -1.529  0.036  1.492  9.053 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 29.91744    1.62247   18.44   <2e-16 ***
Mheight      0.54175    0.02596   20.87   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.266 on 1373 degrees of freedom
Multiple R-squared:  0.2408,  Adjusted R-squared:  0.2402 
F-statistic: 435.5 on 1 and 1373 DF,  p-value: < 2.2e-16
```

---

# `\(R^2\)` = 24.08%

- Maybe you have one or more omitted variables. It is important to consider other factors that might influence the daughter's height:

  - Father's height

  - Physical activities performed by the daughter

  - Food nutrition, etc.

- Maybe the functional form of the regression model is incorrect, so you may need to add quadratic or cubic terms. Alternatively, a transformation may help (if appropriate).

- Maybe it is the effect of a group of outliers (not just a single one).

---

- A large `\(R^2\)` does not necessarily imply that the regression model will be an accurate predictor.

- `\(R^2\)` does not measure the appropriateness of the linear model.

- `\(R^2\)` will often be large even though `\(Y\)` and `\(X\)` are nonlinearly related.

---

![Scatterplot of price vs carat in diamonds (Pearson's correlation coefficient = 0.783)](regression4_files/figure-html/unnamed-chunk-3-1.png)

```
[1] 0.7834839
```

---


```
Call:
lm(formula = carat ~ price, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.91911 -0.66197 -0.07846  0.82734  2.97025 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.74996    0.23234   3.228   0.0017 ** 
price        0.18736    0.01501  12.481   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.137 on 98 degrees of freedom
Multiple R-squared:  0.6138,  Adjusted R-squared:  0.6099 
F-statistic: 155.8 on 1 and 98 DF,  p-value: < 2.2e-16
```

- `\(R^2\)` is large even though the linear approximation is poor.
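In simple linear regression, `Multiple R-squared` is simply the square of Pearson's correlation coefficient between the two variables. A quick check using the correlation reported above (a minimal sketch; `df` is taken to be the diamonds sample used in this example):


```r
r <- 0.7834839   # Pearson's correlation between price and carat (reported above)
r^2              # ~ 0.6138, matching the Multiple R-squared in the output
# equivalently: cor(df$price, df$carat)^2
```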
> In general a good policy is to observe the scatterplot of your data.

---
class: duke-orange, center, middle

![](crystal.jpg)

Y - Crystal size

X - Number of hours it takes the crystal to grow

---

## Data set


```r
crystaldata <- data.frame(
  Size = c(0.08, 0.07, 1.12, 2.01, 4.43, 4.98, 4.92, 7.18, 5.57, 8.40, 8.81,
           10.81, 11.61, 10.12, 13.12, 15.04, 0.075, 0.075, 1.13, 2.02, 4.45,
           4.99, 4.95, 7.28, 5.67, 8.40, 8.51, 10.82, 11.62, 10.13, 13.22, 15.03),
  Time = c(2, 2, 4, 4, 6, 8, 10, 12, 14, 16, 18, 20, 20, 24, 26, 28,
           2, 2, 4, 4, 6, 8, 10, 12, 14, 16, 18, 20, 20, 24, 26, 28))
crystaldata
```

```
     Size Time
1   0.080    2
2   0.070    2
3   1.120    4
4   2.010    4
5   4.430    6
6   4.980    8
7   4.920   10
8   7.180   12
9   5.570   14
10  8.400   16
11  8.810   18
12 10.810   20
13 11.610   20
14 10.120   24
15 13.120   26
16 15.040   28
17  0.075    2
18  0.075    2
19  1.130    4
20  2.020    4
21  4.450    6
22  4.990    8
23  4.950   10
24  7.280   12
25  5.670   14
26  8.400   16
27  8.510   18
28 10.820   20
29 11.620   20
30 10.130   24
31 13.220   26
32 15.030   28
```

---

## Scatterplot


```r
library(ggplot2)
ggplot(crystaldata, aes(x=Time, y = Size)) + geom_point()
```

![](regression4_files/figure-html/unnamed-chunk-6-1.png)<!-- -->


```r
cor(crystaldata$Time, crystaldata$Size)
```

```
[1] 0.9741803
```

---

## Fit a simple linear regression model


```r
mod1 <- lm(Size ~ Time, data=crystaldata)
mod1
```

```

Call:
lm(formula = Size ~ Time, data = crystaldata)

Coefficients:
(Intercept)         Time  
    -0.1982       0.5210  
```

---

## Visualise the model


```r
ggplot(crystaldata, aes(y=Size, x=Time))+geom_point(alpha=0.5) +
  geom_abline(intercept = -0.21, slope = 0.52, colour="forestgreen", lwd=2) +
  theme(aspect.ratio = 1)
```

![](regression4_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

---

## Questions

1. How well does this equation fit the data?

2. Are any of the basic assumptions violated?

---


```r
summary(mod1)
```

```

Call:
lm(formula = Size ~ Time, data = crystaldata)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1855 -0.7643  0.0313  0.7405  1.5223 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.19821    0.34838  -0.569    0.574    
Time         0.52099    0.02204  23.634   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.05 on 30 degrees of freedom
Multiple R-squared:  0.949,  Adjusted R-squared:  0.9473 
F-statistic: 558.5 on 1 and 30 DF,  p-value: < 2.2e-16
```

---

## Coefficient of determination

```

Call:
lm(formula = Size ~ Time, data = crystaldata)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1855 -0.7643  0.0313  0.7405  1.5223 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.19821    0.34838  -0.569    0.574    
Time         0.52099    0.02204  23.634   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.05 on 30 degrees of freedom
Multiple R-squared:  0.949,  Adjusted R-squared:  0.9473 
F-statistic: 558.5 on 1 and 30 DF,  p-value: < 2.2e-16
```

- 94.9% of the variability in size is accounted for by the regression model.

or

- 94.9% of the variation in size is explained by the number of hours it takes the crystal to grow.

---

## Are any of the basic assumptions violated?
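Before checking each assumption in turn, note that base R can produce the standard residual diagnostics in a single call (a minimal sketch; the following slides build the same plots step by step with **broom** and **ggplot2**):


```r
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid
plot(mod1)            # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))  # reset the plotting layout
```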

```r
mod1
```

```

Call:
lm(formula = Size ~ Time, data = crystaldata)

Coefficients:
(Intercept)         Time  
    -0.1982       0.5210  
```

---

## Compute Residuals and Fitted values


```r
library(broom)
```

```
Warning: package 'broom' was built under R version 4.1.2
```

```r
mod1_fitresid <- augment(mod1)
mod1_fitresid
```

```
# A tibble: 32 × 8
    Size  Time .fitted  .resid   .hat .sigma  .cooksd .std.resid
   <dbl> <dbl>   <dbl>   <dbl>  <dbl>  <dbl>    <dbl>      <dbl>
 1  0.08     2   0.844 -0.764  0.0883   1.06 0.0281      -0.762 
 2  0.07     2   0.844 -0.774  0.0883   1.06 0.0289      -0.772 
 3  1.12     4   1.89  -0.766  0.0700   1.06 0.0215      -0.756 
 4  2.01     4   1.89   0.124  0.0700   1.07 0.000567     0.123 
 5  4.43     6   2.93   1.50   0.0552   1.03 0.0634       1.47  
 6  4.98     8   3.97   1.01   0.0440   1.05 0.0223       0.984 
 7  4.92    10   5.01  -0.0917 0.0363   1.07 0.000149    -0.0890
 8  7.18    12   6.05   1.13   0.0321   1.05 0.0197       1.09  
 9  5.57    14   7.10  -1.53   0.0314   1.03 0.0354      -1.48  
10  8.4     16   8.14   0.262  0.0343   1.07 0.00115      0.254 
# … with 22 more rows
```

---

### 1) The relationship between the response `\(Y\)` and the regressors is linear, at least approximately.

Plot of residuals vs fitted values and plot of residuals vs predictor

.pull-left[

Residuals vs Fitted

![](regression4_files/figure-html/unnamed-chunk-13-1.png)<!-- -->

]

.pull-right[

Residuals vs Time

![](regression4_files/figure-html/unnamed-chunk-14-1.png)<!-- -->

]

---

## 2) The error term `\(\epsilon\)` has zero mean.

## 3) The error term `\(\epsilon\)` has constant variance `\(\sigma^2\)`.

Residuals vs Fitted

![](regression4_files/figure-html/unnamed-chunk-15-1.png)<!-- -->

---

## 4) The errors are uncorrelated.

---

## 5) The errors are normally distributed.

.pull-left[


```r
ggplot(mod1_fitresid, aes(x=.resid))+
  geom_histogram(colour="white")+ggtitle("Distribution of Residuals")
```

![](regression4_files/figure-html/unnamed-chunk-16-1.png)<!-- -->

]

.pull-right[


```r
ggplot(mod1_fitresid, aes(sample=.resid))+
  stat_qq() + stat_qq_line()+labs(x="Theoretical Quantiles", y="Sample Quantiles")
```

![](regression4_files/figure-html/unnamed-chunk-17-1.png)<!-- -->

]


```r
shapiro.test(mod1_fitresid$.resid)
```

```

	Shapiro-Wilk normality test

data:  mod1_fitresid$.resid
W = 0.95397, p-value = 0.1867
```

---

## Interpretations

$$\mu_{Y|X=x} = \beta_0 + \beta_1x $$

**Slope:** Increase in the mean response for every one-unit increase in the explanatory variable.

**Intercept:** Mean response when the explanatory variable equals 0.


```r
mod1
```

```

Call:
lm(formula = Size ~ Time, data = crystaldata)

Coefficients:
(Intercept)         Time  
    -0.1982       0.5210  
```

.content-box-yellow[Interpret slope and intercept]

---

## Nonsensical intercept interpretation

- Sometimes it doesn't make sense to interpret the intercept: when?

  - `\(X\)` can never realistically take the value zero, so the intercept is meaningless.

  - The intercept is negative even though the response can take only positive values.

- The intercept helps the line fit the data as closely as possible.

- It is fine to have a regression model with a nonsensical intercept if it helps the model give better overall prediction accuracy.

---
class: duke-orange, center, middle

## Hypothesis testing on the slope and intercept

---

## Hypothesis testing - Slope

`\(H_0: \beta_1=0\)`

`\(H_1: \beta_1 \neq 0\)`


```r
summary(mod1)
```

```

Call:
lm(formula = Size ~ Time, data = crystaldata)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1855 -0.7643  0.0313  0.7405  1.5223 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.19821    0.34838  -0.569    0.574    
Time         0.52099    0.02204  23.634   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.05 on 30 degrees of freedom
Multiple R-squared:  0.949,  Adjusted R-squared:  0.9473 
F-statistic: 558.5 on 1 and 30 DF,  p-value: < 2.2e-16
```
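The `t value` and p-value in the `Time` row follow directly from the estimate and its standard error. A minimal check using the numbers reported above:


```r
t_stat <- 0.52099 / 0.02204                        # estimate / standard error
t_stat                                             # ~ 23.6, matching the reported t value
2 * pt(abs(t_stat), df = 30, lower.tail = FALSE)   # two-sided p-value with n - 2 = 30 df
```

Since the p-value is far below 0.05, `\(H_0: \beta_1=0\)` is rejected.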

---

## Situations where `\(H_0: \beta_1=0\)` is not rejected.

In-class

---

## Situations where `\(H_0: \beta_1=0\)` is rejected.

In-class

---

## Confidence Interval on `\(\beta_0\)` and `\(\beta_1\)`

- Confidence interval for `\(\beta_0\)`

`$$[\hat{\beta_0} - t_{\alpha/2, n-2}se(\hat{\beta_0}), \hat{\beta_0} + t_{\alpha/2, n-2}se(\hat{\beta_0})]$$`

- Confidence interval for `\(\beta_1\)`

`$$[\hat{\beta_1} - t_{\alpha/2, n-2}se(\hat{\beta_1}), \hat{\beta_1} + t_{\alpha/2, n-2}se(\hat{\beta_1})]$$`

---

## t-distribution

`\(t_{0.025, 30}\)`


```r
qt(0.975, df=30)
```

```
[1] 2.042272
```

![](regression4_files/figure-html/unnamed-chunk-22-1.png)<!-- -->


```r
qt(0.025, df=30)
```

```
[1] -2.042272
```

---


```r
summary(mod1)
```

```

Call:
lm(formula = Size ~ Time, data = crystaldata)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1855 -0.7643  0.0313  0.7405  1.5223 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.19821    0.34838  -0.569    0.574    
Time         0.52099    0.02204  23.634   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.05 on 30 degrees of freedom
Multiple R-squared:  0.949,  Adjusted R-squared:  0.9473 
F-statistic: 558.5 on 1 and 30 DF,  p-value: < 2.2e-16
```

.content-box-yellow[Compute confidence intervals on slope and intercept]

---

## Confidence Interval on `\(\beta_0\)` and `\(\beta_1\)` - Interpretation

- Confidence interval for `\(\beta_0\)`

`$$[\hat{\beta_0} - t_{\alpha/2, n-2}se(\hat{\beta_0}), \hat{\beta_0} + t_{\alpha/2, n-2}se(\hat{\beta_0})]$$`

- Confidence interval for `\(\beta_1\)`

`$$[\hat{\beta_1} - t_{\alpha/2, n-2}se(\hat{\beta_1}), \hat{\beta_1} + t_{\alpha/2, n-2}se(\hat{\beta_1})]$$`

Frequency interpretation: If we were to take repeated samples of the same size at the same `\(X\)` levels and construct, for example, `\(95\%\)` confidence intervals on the slope for each sample, then `\(95\%\)` of those intervals will contain the true value `\(\beta_1\)`. We therefore have considerable *confidence* that the observed interval, here (0.476, 0.566), covers the true `\(\beta_1\)`. The measure of confidence is 0.95; here, 0.95 is called the confidence coefficient. In general, the larger the confidence coefficient `\((1-\alpha)\)` is, the wider the confidence interval.

.content-box-green[page 375, Mood and Graybill, Introduction to the Theory of Statistics]

---

## Analysis of variance - ANOVA

- We may also use ANOVA to test the significance of the regression.

- The idea behind ANOVA is that the variation in the response variable (the total variation - **SST**) is partitioned into two parts:

> i. how much of the variation is explained by the regression model (labelled with the regressor name, e.g. `Time`, in the R output) - **SSM (Model sum of squares)**

> ii. how much of the variation is left unexplained by the regression model (called the "Residual" or "Error") - **SSE (Residual or error sum of squares)**.

`$$\sum_{i=1}^n(y_i - \bar{y})^2=\sum_{i=1}^n(\hat{Y}_i - \bar{y})^2 + \sum_{i=1}^n(y_i - \hat{Y_i})^2$$`

`$$SST = SSM + SSE$$`

Degrees of freedom decomposition

`$$df_T = df_M + df_E$$`

`$$n-1 = 1 + (n-2)$$`

---

## ANOVA - F test

`$$F = \frac{SSM/df_M}{SSE/df_E} = \frac{MS_M}{MS_E}$$`

`$$E(MS_E) = \sigma^2$$`

---

## ANOVA - Hypothesis

`$$H_0: \beta_1=0$$`

`$$H_1: \beta_1 \neq 0$$`

- In simple linear regression, the t-test for the slope is equivalent to the F-test (see the check below).

- The real usefulness of the analysis of variance is in multiple regression models.
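Because the two tests are equivalent here, the F statistic is simply the square of the t statistic for the slope. A quick check with the values reported in `summary(mod1)`:


```r
t_slope <- 23.634   # t value for Time from summary(mod1)
t_slope^2           # ~ 558.6, matching the reported F-statistic up to rounding
```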

```r
anova(mod1)
```

```
Analysis of Variance Table

Response: Size
          Df Sum Sq Mean Sq F value    Pr(>F)    
Time       1 615.46  615.46  558.55 < 2.2e-16 ***
Residuals 30  33.06    1.10                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---


```r
summary(mod1)
```

```

Call:
lm(formula = Size ~ Time, data = crystaldata)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1855 -0.7643  0.0313  0.7405  1.5223 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.19821    0.34838  -0.569    0.574    
Time         0.52099    0.02204  23.634   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.05 on 30 degrees of freedom
Multiple R-squared:  0.949,  Adjusted R-squared:  0.9473 
F-statistic: 558.5 on 1 and 30 DF,  p-value: < 2.2e-16
```

---

### Regression through the origin (No intercept)

- Appropriate in analysing data from chemical and other manufacturing processes.

- Do not misuse the no-intercept model, for example when the data lie in a region of `\(x\)`-space remote from the origin.

- A scatterplot sometimes helps in deciding whether or not to fit the no-intercept model.

How to select the model: fit both models (with and without the intercept) and choose between them based on the quality of the fit.


```r
mod2 <- lm(Size ~ Time -1, data=crystaldata)
mod2
```

```

Call:
lm(formula = Size ~ Time - 1, data = crystaldata)

Coefficients:
  Time  
0.5104  
```

---


```r
summary(mod2)
```

```

Call:
lm(formula = Size ~ Time - 1, data = crystaldata)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.12895 -0.92631 -0.04059  0.78642  1.41254 

Coefficients:
     Estimate Std. Error t value Pr(>|t|)    
Time  0.51037    0.01161   43.95   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.038 on 31 degrees of freedom
Multiple R-squared:  0.9842,  Adjusted R-squared:  0.9837 
F-statistic:  1931 on 1 and 31 DF,  p-value: < 2.2e-16
```

---

.pull-left[

With intercept


```r
ggplot(crystaldata, aes(y=Size, x=Time))+geom_point(alpha=0.5) +
  geom_abline(intercept = -0.21, slope = 0.52, colour="forestgreen", lwd=2) +
  theme(aspect.ratio = 1)
```

![](regression4_files/figure-html/unnamed-chunk-29-1.png)<!-- -->

]

.pull-right[

Without intercept


```r
ggplot(crystaldata, aes(y=Size, x=Time))+geom_point(alpha=0.5) +
  geom_abline(intercept = 0, slope = 0.51, colour="red", lwd=2) +
  theme(aspect.ratio = 1)
```

![](regression4_files/figure-html/unnamed-chunk-30-1.png)<!-- -->

]

Perform the residual analysis on the new model before using it.

---
class: duke-orange, center, middle

## Prediction of new observations

---


```r
test <- data.frame(Time = c(2, 2, 6, 6, 8, 20, 26))
test
```

```
  Time
1    2
2    2
3    6
4    6
5    8
6   20
7   26
```

```r
predict(mod2, newdata=test, interval="prediction")
```

```
        fit        lwr       upr
1  1.020746 -1.0972070  3.138699
2  1.020746 -1.0972070  3.138699
3  3.062237  0.9400507  5.184424
4  3.062237  0.9400507  5.184424
5  4.082983  1.9570988  6.208867
6 10.207457  8.0376929 12.377222
7 13.269695 11.0645387 15.474851
```

---

## Next Lecture

> Multiple linear regression

---
class: center, middle

Acknowledgement

Introduction to Linear Regression Analysis, Douglas C. Montgomery, Elizabeth A. Peck, G. Geoffrey Vining

All rights reserved by [Dr. Thiyanga S. Talagala](https://thiyanga.netlify.app/)