Example 1: The Effects of Sources of Pressure on Stress

Read the data set

ex1 <- read.table("lecture2ex1.csv", sep=",", header=TRUE)

Find the descriptive statistics of the variables in the data set

library(psych)
## Warning: package 'psych' was built under R version 4.2.3
describe(ex1)
##           vars   n  mean    sd median trimmed   mad   min   max range  skew
## homepress    1 100 50.25 10.76  51.00   50.76 11.12 18.00 76.00 58.00 -0.44
## workpress    2 100 51.03 10.63  51.00   51.34  9.64 22.00 73.00 51.00 -0.34
## stress       3 100 52.10 11.33  53.92   52.12 12.91 22.81 78.09 55.27 -0.07
##           kurtosis   se
## homepress     0.15 1.08
## workpress     0.02 1.06
## stress       -0.60 1.13

Test whether each correlation in a correlation matrix is significantly different from 0.

corr.test(ex1)
## Call:corr.test(x = ex1)
## Correlation matrix 
##           homepress workpress stress
## homepress      1.00      0.55   0.57
## workpress      0.55      1.00   0.48
## stress         0.57      0.48   1.00
## Sample Size 
## [1] 100
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##           homepress workpress stress
## homepress         0         0      0
## workpress         0         0      0
## stress            0         0      0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option
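
As the note in the output above says, the confidence intervals for each correlation can be printed by turning off the short form of the output:

print(corr.test(ex1), short=FALSE)  # prints confidence intervals for each correlation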

Run a multiple regression where stress is the outcome and homepress and workpress are the predictors. Save the result as out1 and use summary to view the output.

out1 <- lm(stress ~ homepress + workpress, data=ex1)
summary(out1)
## 
## Call:
## lm(formula = stress ~ homepress + workpress, data = ex1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.2903  -5.4976   0.2526   5.7299  23.3002 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16.0114     5.0209   3.189  0.00192 ** 
## homepress     0.4588     0.1026   4.473 2.09e-05 ***
## workpress     0.2553     0.1038   2.461  0.01564 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.142 on 97 degrees of freedom
## Multiple R-squared:  0.3626, Adjusted R-squared:  0.3495 
## F-statistic:  27.6 on 2 and 97 DF,  p-value: 3.254e-10
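
As a quick illustration of how the unstandardized coefficients are used, predict computes a fitted stress score for any combination of predictor values; the values of 50 below are hypothetical and chosen only for illustration.

predict(out1, newdata=data.frame(homepress=50, workpress=50))  # predicted stress score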

Find the standardized regression coefficients and their significance tests (\(\ne 0\)) using the betaDelta package.

library(betaDelta)
## Warning: package 'betaDelta' was built under R version 4.2.3
BetaDelta(out1)
## Call:
## BetaDelta(object = out1)
## 
## Standardized regression slopes with MVN standard errors:
##              est     se      t df      p   0.05%    0.5%   2.5%  97.5%  99.5%
## homepress 0.4355 0.0898 4.8509 97 0.0000  0.1308  0.1996 0.2573 0.6137 0.6714
## workpress 0.2396 0.0944 2.5388 97 0.0127 -0.0807 -0.0084 0.0523 0.4268 0.4875
##           99.95%
## homepress 0.7402
## workpress 0.5598
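
The point estimates above can be cross-checked by refitting the model on z-scored variables: the slopes of this model equal the est column above, although the standard errors from this shortcut are not the delta-method standard errors that BetaDelta reports.

out1z <- lm(scale(stress) ~ scale(homepress) + scale(workpress), data=ex1)
coef(out1z)  # slopes match the standardized coefficients; intercept is 0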

Find the variance partitioning with the anova function. Be aware that the resulting sums of squares are Type 1 (sequential): each term is adjusted only for the variables listed on the lines above it.

anova(out1)
## Analysis of Variance Table
## 
## Response: stress
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## homepress  1 4106.5  4106.5 49.1369 3.187e-10 ***
## workpress  1  506.0   506.0  6.0547   0.01564 *  
## Residuals 97 8106.6    83.6                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
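
Because Type 1 sums of squares depend on the order of the terms, reversing the predictors changes the partitioning. The following line illustrates this: with workpress entered first, the variance shared by the two predictors is credited to workpress instead of homepress.

anova(lm(stress ~ workpress + homepress, data=ex1))  # same model, different Type 1 partitioning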

Example 2: Factors influencing cuteness.

Read the data set

ex2 <- read.table("lecture2ex2.csv", sep=",", header=TRUE)

Find the descriptive statistics of the variables in the data set

describe(ex2)
##       vars  n  mean    sd median trimmed   mad   min   max range  skew kurtosis
## face     1 50 51.64 15.26  51.50   51.02 15.57 21.00 89.00 68.00  0.27    -0.51
## shape    2 50 50.96 16.75  49.00   50.75 15.57 13.00 99.00 86.00  0.25     0.09
## habit    3 50 48.14 15.27  47.00   47.62 14.08 16.00 90.00 74.00  0.31     0.06
## close    4 50 50.40 16.40  49.50   50.15 17.79 19.00 88.00 69.00  0.14    -0.67
## cute     5 50 52.85 18.50  54.54   53.57 19.76  6.34 88.84 82.51 -0.37    -0.33
##         se
## face  2.16
## shape 2.37
## habit 2.16
## close 2.32
## cute  2.62

Test whether each correlation in a correlation matrix is significantly different from 0.

corr.test(ex2)
## Call:corr.test(x = ex2)
## Correlation matrix 
##        face shape habit close cute
## face   1.00  0.45 -0.02  0.33 0.65
## shape  0.45  1.00 -0.19  0.13 0.31
## habit -0.02 -0.19  1.00  0.54 0.12
## close  0.33  0.13  0.54  1.00 0.23
## cute   0.65  0.31  0.12  0.23 1.00
## Sample Size 
## [1] 50
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##       face shape habit close cute
## face  0.00  0.01  1.00  0.13 0.00
## shape 0.00  0.00  0.73  1.00 0.18
## habit 0.90  0.18  0.00  0.00 1.00
## close 0.02  0.37  0.00  0.00 0.54
## cute  0.00  0.03  0.42  0.11 0.00
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

Run multiple regression, save the output, and run the summary function.

out2 <- lm(cute ~ face + shape + habit + close, data=ex2)
summary(out2)
## 
## Call:
## lm(formula = cute ~ face + shape + habit + close, data = ex2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.372 -11.143   1.349   8.074  30.234 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.36760   10.90784   0.217    0.829    
## face         0.80841    0.15839   5.104  6.5e-06 ***
## shape        0.06468    0.13999   0.462    0.646    
## habit        0.23651    0.16807   1.407    0.166    
## close       -0.11788    0.16246  -0.726    0.472    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.27 on 45 degrees of freedom
## Multiple R-squared:  0.4534, Adjusted R-squared:  0.4048 
## F-statistic: 9.332 on 4 and 45 DF,  p-value: 1.402e-05

Find the standardized regression coefficients and their significance tests (\(\ne 0\)) using the betaDelta package.

BetaDelta(out2)
## Call:
## BetaDelta(object = out2)
## 
## Standardized regression slopes with MVN standard errors:
##           est     se       t df      p   0.05%    0.5%    2.5%  97.5%  99.5%
## face   0.6672 0.1036  6.4408 45 0.0000  0.3025  0.3886  0.4586 0.8758 0.9458
## shape  0.0586 0.1202  0.4872 45 0.6285 -0.3646 -0.2647 -0.1835 0.3007 0.3819
## habit  0.1952 0.1313  1.4868 45 0.1440 -0.2670 -0.1579 -0.0692 0.4597 0.5484
## close -0.1045 0.1365 -0.7655 45 0.4480 -0.5851 -0.4717 -0.3795 0.1705 0.2627
##       99.95%
## face  1.0319
## shape 0.4817
## habit 0.6575
## close 0.3761

Find the variance partitioning with the anova function. Be aware that the resulting sums of squares are Type 1 (sequential): each term is adjusted only for the variables listed on the lines above it.

anova(out2)
## Analysis of Variance Table
## 
## Response: cute
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## face       1 7190.4  7190.4 35.3173 3.796e-07 ***
## shape      1    3.6     3.6  0.0177    0.8947    
## habit      1  298.8   298.8  1.4678    0.2320    
## close      1  107.2   107.2  0.5264    0.4719    
## Residuals 45 9161.8   203.6                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
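
If order-independent tests are preferred, Type 2 sums of squares adjust each predictor for all of the others. A minimal sketch, assuming the car package (not used elsewhere in this handout) is installed:

library(car)
Anova(out2, type=2)  # Type 2 sums of squares: each term adjusted for all others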

Center the predictors as follows:

I() is used to apply a mathematical operation to a predictor before it enters the multiple regression.

out2a <- lm(cute ~ I(face - 50) + I(shape - mean(shape)) + I(habit - 50) + I(close - 50), data=ex2)
summary(out2a)
## 
## Call:
## lm(formula = cute ~ I(face - 50) + I(shape - mean(shape)) + I(habit - 
##     50) + I(close - 50), data = ex2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.372 -11.143   1.349   8.074  30.234 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            52.01609    2.05732  25.283  < 2e-16 ***
## I(face - 50)            0.80841    0.15839   5.104  6.5e-06 ***
## I(shape - mean(shape))  0.06468    0.13999   0.462    0.646    
## I(habit - 50)           0.23651    0.16807   1.407    0.166    
## I(close - 50)          -0.11788    0.16246  -0.726    0.472    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.27 on 45 degrees of freedom
## Multiple R-squared:  0.4534, Adjusted R-squared:  0.4048 
## F-statistic: 9.332 on 4 and 45 DF,  p-value: 1.402e-05

The intercept changes but the slopes do not. The standardized coefficients and the anova table are unchanged as well.

BetaDelta(out2a)
## Call:
## BetaDelta(object = out2a)
## 
## Standardized regression slopes with MVN standard errors:
##                            est     se       t df      p   0.05%    0.5%    2.5%
## I(face - 50)            0.6672 0.1036  6.4408 45 0.0000  0.3025  0.3886  0.4586
## I(shape - mean(shape))  0.0586 0.1202  0.4872 45 0.6285 -0.3646 -0.2647 -0.1835
## I(habit - 50)           0.1952 0.1313  1.4868 45 0.1440 -0.2670 -0.1579 -0.0692
## I(close - 50)          -0.1045 0.1365 -0.7655 45 0.4480 -0.5851 -0.4717 -0.3795
##                         97.5%  99.5% 99.95%
## I(face - 50)           0.8758 0.9458 1.0319
## I(shape - mean(shape)) 0.3007 0.3819 0.4817
## I(habit - 50)          0.4597 0.5484 0.6575
## I(close - 50)          0.1705 0.2627 0.3761
anova(out2a)
## Analysis of Variance Table
## 
## Response: cute
##                        Df Sum Sq Mean Sq F value    Pr(>F)    
## I(face - 50)            1 7190.4  7190.4 35.3173 3.796e-07 ***
## I(shape - mean(shape))  1    3.6     3.6  0.0177    0.8947    
## I(habit - 50)           1  298.8   298.8  1.4678    0.2320    
## I(close - 50)           1  107.2   107.2  0.5264    0.4719    
## Residuals              45 9161.8   203.6                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
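
Mean-centering can also be written with scale(x, scale=FALSE), which subtracts the mean of x without dividing by its standard deviation. The following sketch is equivalent to the I(shape - mean(shape)) term above and yields the same slopes.

out2b <- lm(cute ~ I(face - 50) + scale(shape, scale=FALSE) + I(habit - 50) + I(close - 50), data=ex2)
coef(out2b)  # identical slopes to out2a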

Example 3: Factors influencing spicy food consumption.

Read and view the data set.

ex3 <- read.table("lecture2ex3.csv", sep=",", header=TRUE)
head(ex3, 15)
##    id chilimg region sex age
## 1   1    4.08      1   1  23
## 2   2   11.15      3   2  43
## 3   3    7.75      2   2  36
## 4   4    0.00      1   1  39
## 5   5   20.09      4   2  24
## 6   6    6.78      2   2  30
## 7   7    2.36      2   1  59
## 8   8    6.18      2   1  45
## 9   9    1.81      2   2  43
## 10 10    0.00      1   1  52
## 11 11    7.47      3   1  21
## 12 12    0.05      2   2  56
## 13 13    1.24      2   1  18
## 14 14    0.00      2   2  50
## 15 15    6.89      1   2  20

Change sex to a factor variable where the first group is the reference group, which is female here. When a variable is stored as a factor, R knows that each number represents a different category (rather than a low or high value, as in a numeric variable).

ex3[,"sex"] <- factor(ex3[,"sex"], labels=c("female", "male"))

Change region to a factor variable where the first group is the reference group, which is north here.

ex3[,"region"] <- factor(ex3[,"region"], labels=c("north", "center", "northeast", "south"))

Change the reference group to south by the relevel function and view the data set again.

ex3[,"region"] <- relevel(ex3[,"region"], ref="south")
head(ex3, 15)
##    id chilimg    region    sex age
## 1   1    4.08     north female  23
## 2   2   11.15 northeast   male  43
## 3   3    7.75    center   male  36
## 4   4    0.00     north female  39
## 5   5   20.09     south   male  24
## 6   6    6.78    center   male  30
## 7   7    2.36    center female  59
## 8   8    6.18    center female  45
## 9   9    1.81    center   male  43
## 10 10    0.00     north female  52
## 11 11    7.47 northeast female  21
## 12 12    0.05    center   male  56
## 13 13    1.24    center female  18
## 14 14    0.00    center   male  50
## 15 15    6.89     north   male  20

Find the descriptive statistics of the variables in the data set. Note that describe converts factors to their underlying integer codes (and flags them with *), so the statistics for sex and region are not meaningful.

describe(ex3)

Find the frequency tables of both categorical variables.

with(ex3, table(sex))
## sex
## female   male 
##    200    200
with(ex3, table(region))
## region
##     south     north    center northeast 
##       100       100       100       100
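
The two categorical variables can also be cross-tabulated to check whether the design is balanced across cells:

with(ex3, table(sex, region))  # joint frequency table of sex by region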

Test the correlation between the continuous variables (chilimg and age), excluding the categorical variables.

corr.test(ex3[,c("chilimg", "age")])
## Call:corr.test(x = ex3[, c("chilimg", "age")])
## Correlation matrix 
##         chilimg   age
## chilimg    1.00 -0.09
## age       -0.09  1.00
## Sample Size 
## [1] 400
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##         chilimg  age
## chilimg    0.00 0.08
## age        0.08 0.00
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

Run a multiple regression where age is centered at 40. sex and region are converted to dummy variables automatically because both were previously set as factors.

out3 <- lm(chilimg ~ I(age - 40) + sex + region, data=ex3)
summary(out3)
## 
## Call:
## lm(formula = chilimg ~ I(age - 40) + sex + region, data = ex3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2202 -2.5846 -0.0098  2.0427 10.2763 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      13.67700    0.36355  37.621   <2e-16 ***
## I(age - 40)      -0.03047    0.01239  -2.459   0.0144 *  
## sexmale           0.49914    0.32511   1.535   0.1255    
## regionnorth     -10.89126    0.45981 -23.686   <2e-16 ***
## regioncenter     -9.89767    0.46027 -21.504   <2e-16 ***
## regionnortheast  -6.13874    0.45978 -13.351   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.251 on 394 degrees of freedom
## Multiple R-squared:  0.6402, Adjusted R-squared:  0.6356 
## F-statistic: 140.2 on 5 and 394 DF,  p-value: < 2.2e-16
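
To see exactly which dummy variables R creates, inspect the contrasts of the factor or the model matrix itself. Because south is the reference group, it is coded 0 on all three dummy variables.

contrasts(ex3[,"region"])  # the dummy coding for region, with south as the reference
head(model.matrix(out3))   # the design matrix actually used by lm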

Find the variance partitioning with the anova function. Be aware that the resulting sums of squares are Type 1 (sequential): each term is adjusted only for the variables listed on the lines above it.

anova(out3)
## Analysis of Variance Table
## 
## Response: chilimg
##              Df Sum Sq Mean Sq  F value    Pr(>F)    
## I(age - 40)   1   86.2   86.22   8.1574  0.004516 ** 
## sex           1   25.0   25.01   2.3661  0.124802    
## region        3 7297.6 2432.52 230.1526 < 2.2e-16 ***
## Residuals   394 4164.2   10.57                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To test each set of predictors, the multiple regression models with and without that set can be compared with the anova function. First, check whether region is significant.

out3noregion <- lm(chilimg ~ I(age - 40) + sex, data=ex3)
summary(out3noregion)
## 
## Call:
## lm(formula = chilimg ~ I(age - 40) + sex, data = ex3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2233 -4.3438 -0.8815  3.7121 16.9294 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.94329    0.38001  18.271   <2e-16 ***
## I(age - 40) -0.03545    0.02044  -1.735   0.0836 .  
## sexmale      0.50009    0.53733   0.931   0.3526    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.373 on 397 degrees of freedom
## Multiple R-squared:  0.009611,   Adjusted R-squared:  0.004621 
## F-statistic: 1.926 on 2 and 397 DF,  p-value: 0.1471
anova(out3noregion, out3)
## Analysis of Variance Table
## 
## Model 1: chilimg ~ I(age - 40) + sex
## Model 2: chilimg ~ I(age - 40) + sex + region
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1    397 11461.8                                  
## 2    394  4164.2  3    7297.6 230.15 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The anova function is necessary for testing region because region consists of several dummy variables. sex and age can be tested by model comparison too, but their p values are the same as those of the t tests in the summary of the multiple regression with all predictors.

out3nosex <- lm(chilimg ~ I(age - 40) + region, data=ex3)
anova(out3nosex, out3)
## Analysis of Variance Table
## 
## Model 1: chilimg ~ I(age - 40) + region
## Model 2: chilimg ~ I(age - 40) + sex + region
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    395 4189.2                           
## 2    394 4164.2  1    24.913 2.3571 0.1255
out3noage <- lm(chilimg ~ sex + region, data=ex3)
anova(out3noage, out3)
## Analysis of Variance Table
## 
## Model 1: chilimg ~ sex + region
## Model 2: chilimg ~ I(age - 40) + sex + region
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    395 4228.2                              
## 2    394 4164.2  1    63.907 6.0465 0.01436 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
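
The same comparisons can be obtained in one call with drop1, which refits the model without each term in turn and reports the corresponding F test:

drop1(out3, test="F")  # F test for dropping each predictor from the full model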

It is possible to compute standardized regression coefficients for dummy variables. However, to compare whether age, sex, or region has the most predictive power for the dependent variable, standardized coefficients cannot be used directly, because region is represented by three dummy variables whereas each of the other variables is a single predictor.

BetaDelta(out3)
## Call:
## BetaDelta(object = out3)
## 
## Standardized regression slopes with MVN standard errors:
##                     est     se        t  df      p   0.05%    0.5%    2.5%
## I(age - 40)     -0.0745 0.0301  -2.4725 394 0.0138 -0.1744 -0.1525 -0.1337
## sexmale          0.0464 0.0300   1.5457 394 0.1230 -0.0531 -0.0313 -0.0126
## regionnorth     -0.8768 0.0389 -22.5656 394 0.0000 -1.0056 -0.9773 -0.9532
## regioncenter    -0.7968 0.0401 -19.8661 394 0.0000 -0.9298 -0.9006 -0.8756
## regionnortheast -0.4942 0.0390 -12.6839 394 0.0000 -0.6234 -0.5950 -0.5708
##                   97.5%   99.5%  99.95%
## I(age - 40)     -0.0153  0.0035  0.0254
## sexmale          0.1054  0.1241  0.1459
## regionnorth     -0.8004 -0.7762 -0.7480
## regioncenter    -0.7179 -0.6930 -0.6638
## regionnortheast -0.4176 -0.3933 -0.3650

To compare all predictors, \(\Delta R^2\) shows how much unique variance in the dependent variable each variable explains. summary(lmoutput)$r.squared is used to extract \(R^2\): the summary method applied to an lm output returns a list, and one element of that list is called r.squared, which can be extracted with the $ operator.

summary(out3)$r.squared - summary(out3noage)$r.squared
## [1] 0.005522048
summary(out3)$r.squared - summary(out3nosex)$r.squared
## [1] 0.002152659
summary(out3)$r.squared - summary(out3noregion)$r.squared
## [1] 0.6305658
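
The three differences can also be collected in one step; a minimal sketch using the model objects created above:

r2full <- summary(out3)$r.squared
sapply(list(age=out3noage, sex=out3nosex, region=out3noregion),
       function(m) r2full - summary(m)$r.squared)  # Delta R-squared for each predictor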