Example 1: Interviewees’ Rating Scores within Interviewers

In this lecture, the steps for multiple imputation using the mice package are explained. First, read the data set used to investigate the interaction between interviewers' and interviewees' sex on interview ratings. In these data, some values are missing.

dat1 <- read.table("lecture14ex1.csv", sep=",", header=TRUE, na.strings="999999")

To show the basics of the mice package, only three variables and one level-2 ID variable will be used.

dat1a <- dat1[,c("erid", "score", "eesex", "ersex")]

Next, check the missing patterns in the dataset.

library(mice)
library(miceadds)
md.pattern(dat1a)

##      erid eesex ersex score     
## 7907    1     1     1     1    0
## 2093    1     1     1     0    1
##         0     0     0  2093 2093
Step 1: Transform Variable to Reduce Computational Burden

Most variables should be on a standardized scale (or have a mean and standard deviation close to those of standard scores). If a variable takes large values (e.g., 1,000), the product of two such values can be much larger (e.g., 1,000,000). If a variable takes values with many decimals (e.g., 0.001), the product can be even smaller (e.g., 0.000001). In this example, both sex variables are dummy variables, so their original scale is retained. However, score is mean-centered by subtracting its mean. Note that the subtracted mean will be added back after the data are imputed.
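For example, multiplying values on these scales quickly produces extreme numbers (a small arithmetic illustration):

1000 * 1000    # products of large values grow to 1e+06
0.001 * 0.001  # products of small values shrink to 1e-06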

m_score1a <- mean(dat1a$score, na.rm=TRUE)
dat1a$score <- dat1a$score - m_score1a
Step 2: Make Imputation Method Vector and Prediction Matrix

First, make the blueprints of the imputation method vector and predictor matrix.

pred1a <- make.predictorMatrix(dat1a)
pred1a
##       erid score eesex ersex
## erid     0     1     1     1
## score    1     0     1     1
## eesex    1     1     0     1
## ersex    1     1     1     0
meth1a <- make.method(dat1a)
meth1a
##  erid score eesex ersex 
##    "" "pmm"    ""    ""

Next, the prediction matrix is adjusted. Only score has missing data, so the rows for the other variables (erid, eesex, and ersex) are set to 0 because their missing data do not need to be predicted. The values put in the score row are as follows:

  • erid is -2 because this variable is the level-2 ID variable.
  • score is 0 because a variable cannot predict its own missing data.
  • eesex is 4. This is the main variable hypothesized to have a random effect on score. Thus, the original variable (with fixed and random effects) and its group means (with a fixed effect) are used to predict score.
  • ersex is 1. This variable is at Level 2 and should have only a fixed effect on score, so 1 is used.

Note that the codes in the prediction matrix have the following meanings:

  • -2 marks the level-2 (cluster) ID variable
  • 0 means the variable is not used as a predictor
  • 1 means a fixed effect only
  • 2 means fixed and random effects
  • 3 means a fixed effect of the original variable plus a fixed effect of its group means
  • 4 means fixed and random effects of the original variable plus a fixed effect of its group means

pred1a[,] <- 0
pred1a["score",] <- c(-2, 0, 4, 1)

Next, the imputation method vector is adjusted. erid, eesex, and ersex do not have missing data, so "" is used for them. For score, 2l.pan is used because score is a continuous variable and is expected to have equal residual variances across level-2 units. The available options are as follows:

  • "" means to not predict missing values in this variable
  • 2l.pan is used when the target variable is in level 1, is continuous, and is assumed to have equal residual variances across level-2 units
  • 2l.norm is used when the target variable is in level 1, is continuous, and is assumed to have unequal residual variances across level-2 units
  • 2l.bin is used when the target variable is level-1 binary variable
  • 2lonly.norm is used when the target variable is level-2 continuous variable
  • 2lonly.pmm is used when the target variable is level-2 binary (or continuous) variable
  • 2l.groupmean is used to calculate group means (like the ave function). In prediction matrix, -2 is used for the ID variable and 2 is used for the target variable to find group means. Note that the miceadds package is needed to use 2l.groupmean.
meth1a["score"] <- "2l.pan"
Step 3: Imputation

The mice function is used to impute the data into multiple data sets. The first argument is the target data. pred is the prediction matrix. meth is the imputation method vector. m is the number of imputed data sets. maxit is the number of iterations (or loops) used to impute the data. seed is the random seed, which users can set to reproduce the same results in later analyses.

imp1 <- mice(dat1a, pred=pred1a, meth=meth1a, m=10, maxit=10, seed=123321)
Step 4: Check Convergence

Check whether the means and standard deviations of the imputed variables have converged. Convergence means that the means and standard deviations from all imputed data sets are distributed around the same area. If the distributions of the means and standard deviations in later iterations show a stable amount of variability (neither increasing nor decreasing), convergence can be assumed.

plot(imp1, c("score"))

From this plot, it is hard to assume convergence. Thus, m and maxit are increased.

imp12 <- mice(dat1a, pred=pred1a, meth=meth1a, m=30, maxit=20, seed=123321)

Check the convergence again. The convergence can be assumed.

plot(imp12, c("score"))

Step 5: Transform to the original scale

The subtracted mean from Step 1 is added back to the imputed data. First, use the complete function to stack all imputed data sets as new rows in the long format. Then, the mean is added back. Finally, the long-format data are transformed back into a mice output object with the as.mids function.

impdat1a <- complete(imp12, "long", include = TRUE)
impdat1a$score <- impdat1a$score + m_score1a
imp1a <- as.mids(impdat1a)
Step 6: Analyze Each Imputed Data

The with function analyzes each imputed data set in its first argument with the model specified in its second argument. This example compares models with and without the random slope of eesex.

library(lme4)
fit1 <- with(imp1a, lmer(score ~ eesex + ersex + (1|erid), REML=FALSE))
fit2 <- with(imp1a, lmer(score ~ eesex + ersex + (1 + eesex|erid), REML=FALSE))
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00270948 (tol = 0.002, component 1)
Step 7: Pool Analysis Results

The analysis results from all imputed data sets are saved in fit1 and fit2. Those results must be pooled to obtain single parameter estimates and standard errors. The pool function is used to pool the fixed effects. However, when the lme4 package is used for the analysis, the broom.mixed package must be loaded for the pool function to run appropriately.

library(broom.mixed)
out1 <- pool(fit1)
summary(out1)
##          term    estimate  std.error   statistic        df   p.value
## 1 (Intercept) 72.74481011 0.20080224 362.2709110 6578.6691 0.0000000
## 2       eesex -0.04883249 0.09974143  -0.4895909  530.5782 0.6246258
## 3       ersex  0.43230753 0.27559283   1.5686458 7623.1215 0.1167720
out2 <- pool(fit2)
summary(out2)
##          term    estimate std.error   statistic        df   p.value
## 1 (Intercept) 72.78826618 0.1994367 364.9692121 5718.6305 0.0000000
## 2       eesex -0.04883249 0.1063862  -0.4590114  675.1393 0.6463737
## 3       ersex  0.34539539 0.2761916   1.2505644 6347.6615 0.2111395

To pool the random effects, I wrote a function that extracts the covariance matrix of the random effects from each analysis result and averages them across imputed data sets. Users must put the "mimlmtools.R" file in the working directory and run the source function to read the code inside the file. The ranefMI function is used to find the covariance matrix of the random effects.

source("mimlmtools.R")
ranefMI(fit1)
##             (Intercept)
## (Intercept)    16.51926
## attr(,"stddev")
## (Intercept) 
##    4.064389 
## attr(,"correlation")
##             (Intercept)
## (Intercept)           1
ranefMI(fit2)
##             (Intercept)    eesex
## (Intercept)   15.895677 0.283141
## eesex          0.283141 1.540886
## attr(,"stddev")
## (Intercept)       eesex 
##    3.986938    1.241324 
## attr(,"correlation")
##             (Intercept)     eesex
## (Intercept)   1.0000000 0.0572108
## eesex         0.0572108 1.0000000

The sigmaMI function is used to find the level-1 residual variance.

sigmaMI(fit1)
## [1] 19.25922
sigmaMI(fit2)
## [1] 18.8312

The anovaMI function is used to pool the likelihood ratio tests from each imputed data set.

anovaMI(fit1, fit2)
##       F     df1     df2     p.F 
##   4.530   2.000 705.487   0.011

The r2mlmMI function is used to pool R-squared results.

r2mlmMI(fit1)
##           f           v           m          fv         fvm 
## 0.001349975 0.000000000 0.461079304 0.001349975 0.462429279

Example 2: Interviewees’ Rating Scores within Interviewers (with More Complex Missing Data Patterns)

In this example, four variables and one level-2 ID variable from the interview data set will be used. All four variables have missing data.

dat1b <- dat1[,c("erid", "score", "eesalary", "eeworkexp", "erexp")]

Next, check the missing patterns in the data set.

library(mice)
md.pattern(dat1b)

##      erid eeworkexp eesalary erexp score     
## 4030    1         1        1     1     1    0
## 1083    1         1        1     1     0    1
## 1049    1         1        1     0     1    1
## 257     1         1        1     0     0    2
## 1007    1         1        0     1     1    1
## 266     1         1        0     1     0    2
## 271     1         1        0     0     1    2
## 60      1         1        0     0     0    3
## 954     1         0        1     1     1    1
## 272     1         0        1     1     0    2
## 278     1         0        1     0     1    2
## 67      1         0        1     0     0    3
## 248     1         0        0     1     1    2
## 70      1         0        0     1     0    3
## 70      1         0        0     0     1    3
## 18      1         0        0     0     0    4
##         0      1977     2010  2070  2093 8150
Step 1: Transform Variable to Reduce Computational Burden

All target variables are standardized by subtracting their means and dividing by their standard deviations. For the level-1 variables, the mean and sd functions are applied to the data set directly.

m_score1b <- mean(dat1b$score, na.rm=TRUE)
m_eesalary1b <- mean(dat1b$eesalary, na.rm=TRUE)
m_eeworkexp1b <- mean(dat1b$eeworkexp, na.rm=TRUE)
sd_score1b <- sd(dat1b$score, na.rm=TRUE)
sd_eesalary1b <- sd(dat1b$eesalary, na.rm=TRUE)
sd_eeworkexp1b <- sd(dat1b$eeworkexp, na.rm=TRUE)

For level-2 variables, a new data set whose rows represent level-2 units is created first, and then the means and standard deviations are calculated.

dat1bg <- dat1b[!duplicated(dat1b$erid),]
m_erexp1b <- mean(dat1bg$erexp, na.rm=TRUE)
sd_erexp1b <- sd(dat1bg$erexp, na.rm=TRUE)

Then, all variables are transformed by the calculated means and standard deviations.

dat1b$score <- (dat1b$score - m_score1b) / sd_score1b
dat1b$eesalary <- (dat1b$eesalary - m_eesalary1b) / sd_eesalary1b
dat1b$eeworkexp <- (dat1b$eeworkexp - m_eeworkexp1b) / sd_eeworkexp1b
dat1b$erexp <- (dat1b$erexp - m_erexp1b) / sd_erexp1b
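
As a quick sanity check (a sketch; the same idea applies to the other transformed variables), the observed values of a standardized variable should now have a mean near 0 and a standard deviation of about 1:

round(c(mean = mean(dat1b$score, na.rm = TRUE), sd = sd(dat1b$score, na.rm = TRUE)), 3)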
Step 2: Group Mean Centering

Because the relationships between variables in this data set are assumed to involve cross-level interactions, the group means and group-mean-centered variables are created. However, because these variables have missing data, the group-mean calculation needs to be part of the imputation process. That is, the missing data are imputed and then the group means and the deviations from the group means are created. To create the slots for imputation, these group means and deviations are named and attached to the data set. All values in the created variables are NA.

dat1b$avescore <- NA
dat1b$aveeesalary <- NA
dat1b$aveeeworkexp <- NA

dat1b$diffscore <- NA
dat1b$diffeesalary <- NA
dat1b$diffeeworkexp <- NA
Step 3: Interaction

In this example, assume that interviewees' work experience (eeworkexp) and interviewers' interview experience (erexp) have an interaction effect on the other variables. We need to account for this interaction in the imputation model. However, work experience is separated into the group mean (aveeeworkexp) and the deviation from the group mean (diffeeworkexp), so two interaction terms are created. As with group mean centering, the interactions are created from variables with missing values. Thus, the missing values must be imputed before the interactions are created within the imputation model. To create the slots for imputation, these interactions are named and attached to the data set. All values in the created variables are NA.

dat1b$int1 <- NA # diffeeworkexp * erexp
dat1b$int2 <- NA # aveeeworkexp * erexp
Step 4: Make Imputation Method Vector and Prediction Matrix

First, make the blueprints of the imputation method vector and predictor matrix.

meth1b <- make.method(dat1b)
meth1b
##          erid         score      eesalary     eeworkexp         erexp 
##            ""         "pmm"         "pmm"         "pmm"         "pmm" 
##      avescore   aveeesalary  aveeeworkexp     diffscore  diffeesalary 
##      "logreg"      "logreg"      "logreg"      "logreg"      "logreg" 
## diffeeworkexp          int1          int2 
##      "logreg"      "logreg"      "logreg"
pred1b <- make.predictorMatrix(dat1b)
pred1b
##               erid score eesalary eeworkexp erexp avescore aveeesalary
## erid             0     1        1         1     1        1           1
## score            1     0        1         1     1        1           1
## eesalary         1     1        0         1     1        1           1
## eeworkexp        1     1        1         0     1        1           1
## erexp            1     1        1         1     0        1           1
## avescore         1     1        1         1     1        0           1
## aveeesalary      1     1        1         1     1        1           0
## aveeeworkexp     1     1        1         1     1        1           1
## diffscore        1     1        1         1     1        1           1
## diffeesalary     1     1        1         1     1        1           1
## diffeeworkexp    1     1        1         1     1        1           1
## int1             1     1        1         1     1        1           1
## int2             1     1        1         1     1        1           1
##               aveeeworkexp diffscore diffeesalary diffeeworkexp int1 int2
## erid                     1         1            1             1    1    1
## score                    1         1            1             1    1    1
## eesalary                 1         1            1             1    1    1
## eeworkexp                1         1            1             1    1    1
## erexp                    1         1            1             1    1    1
## avescore                 1         1            1             1    1    1
## aveeesalary              1         1            1             1    1    1
## aveeeworkexp             0         1            1             1    1    1
## diffscore                1         0            1             1    1    1
## diffeesalary             1         1            0             1    1    1
## diffeeworkexp            1         1            1             0    1    1
## int1                     1         1            1             1    0    1
## int2                     1         1            1             1    1    0

The imputation method vector is set as follows:

  • score, eesalary, and eeworkexp are imputed by 2l.pan because they are level-1 continuous variables.
  • erexp is imputed by 2lonly.norm because it is a level-2 continuous variable.
  • avescore, aveeesalary, and aveeeworkexp are the group means calculated from score, eesalary, and eeworkexp, respectively. 2l.groupmean is used.
  • diffscore, diffeesalary, and diffeeworkexp are deviation scores from the group means. They are simply the differences between the original variables and the group means. ~I() tells mice to derive the variable by a simple calculation on other variables; the formula is put inside ~I().
  • int1 and int2 are interactions between variables in the data set. Again, ~I() is used with the calculation inside it.
meth1b[c("score", "eesalary", "eeworkexp")] <- "2l.pan"
meth1b["erexp"] <- "2lonly.norm"
meth1b[c("avescore", "aveeesalary", "aveeeworkexp")] <- "2l.groupmean"
meth1b["diffscore"] <- "~I(score - avescore)"
meth1b["diffeesalary"] <- "~I(eesalary - aveeesalary)"
meth1b["diffeeworkexp"] <- "~I(eeworkexp - aveeeworkexp)"
meth1b["int1"] <- "~I(diffeeworkexp*erexp)"
meth1b["int2"] <- "~I(aveeeworkexp*erexp)"

Next, the prediction matrix is adjusted. A row is left entirely 0 if its imputation method is "" or an ~I() formula. Thus, only the following rows are adjusted: score, eesalary, eeworkexp, erexp, avescore, aveeesalary, and aveeeworkexp.

First, set all members in the prediction matrix as 0.

pred1b[,] <- 0

Second, in these adjusted rows, set the column of erid, which is level-2 ID, as -2.

pred1b[c("score", "eesalary", "eeworkexp", "erexp"), "erid"] <- -2
pred1b[c("avescore", "aveeesalary", "aveeeworkexp"), "erid"] <- -2

Third, let's change the score row. The level-2 predictors (excluding avescore) enter with fixed effects only (set as 1). The level-1 deviation scores from the group means (excluding diffscore) enter with fixed and random effects (set as 2). The interactions, which are not calculated from score, enter with fixed effects only (set as 1).

pred1b["score", c("aveeesalary", "aveeeworkexp", "erexp")] <- 1
pred1b["score", c("diffeesalary", "diffeeworkexp")] <- 2
pred1b["score", c("int1", "int2")] <- 1

Fourth, let's change the eesalary row. The level-2 predictors (excluding aveeesalary) enter with fixed effects only (set as 1). The level-1 deviation scores from the group means (excluding diffeesalary) enter with fixed and random effects (set as 2). The interactions, which are not calculated from eesalary, enter with fixed effects only (set as 1).

pred1b["eesalary", c("avescore", "aveeeworkexp", "erexp")] <- 1
pred1b["eesalary", c("diffscore", "diffeeworkexp")] <- 2
pred1b["eesalary", c("int1", "int2")] <- 1

Fifth, let's change the eeworkexp row. The level-2 predictors (excluding aveeeworkexp) enter with fixed effects only (set as 1). The level-1 deviation scores from the group means (excluding diffeeworkexp) enter with fixed and random effects (set as 2). The interactions are not used because they are calculated from eeworkexp.

pred1b["eeworkexp", c("avescore", "aveeesalary", "erexp")] <- 1
pred1b["eeworkexp", c("diffscore", "diffeesalary")] <- 2

Sixth, let’s change the erexp row. Because this variable is in Level 2, only other level-2 predictors can predict its missing data.

pred1b["erexp", c("avescore", "aveeesalary", "aveeeworkexp")] <- 1

Finally, the group means are calculated from their original variables. In the prediction matrix, set the column of the corresponding original variable to 2.

pred1b["avescore", "score"] <- 2
pred1b["aveeesalary", "eesalary"] <- 2
pred1b["aveeeworkexp", "eeworkexp"] <- 2
Step 5: Determine the Sequence of Imputation

When a variable is imputed, the variables derived from it must be recalculated right after that imputation. For example, if eeworkexp is imputed, its group means (aveeeworkexp), deviations from the group means (diffeeworkexp), and interactions (int1 and int2) must be updated right after it. Note that int1 and int2 appear twice in the sequence because they depend on both erexp and eeworkexp, so they are updated after each of these variables is imputed. Thus, the sequence of imputation is as follows:

visit1b <- c("score", "avescore", "diffscore", 
             "eesalary", "aveeesalary", "diffeesalary", 
             "erexp", "int1", "int2",
             "eeworkexp", "aveeeworkexp", "diffeeworkexp",
             "int1", "int2")
Step 6: Imputation

The mice function is used to impute the data into multiple data sets. The additional arguments compared with the previous example are visit (the vector determining the imputation sequence) and allow.na, which is set to TRUE to allow variables whose values are all NA.

imp1b <- mice(dat1b, pred=pred1b, meth=meth1b, m=10, maxit=10, visit=visit1b, seed=123321, allow.na = TRUE)
Step 7: Check Convergence

Check whether the means and standard deviations of the imputed variables have converged. Note that when there are many imputed variables, users may focus on the variables that are important in the main analysis and have high proportions of missing data.

plot(imp1b, c("score", "eesalary"))

plot(imp1b, c("eeworkexp", "erexp"))

From these plots, it is hard to assume convergence. Thus, m and maxit are increased.

imp1b2 <- mice(dat1b, pred=pred1b, meth=meth1b, m=30, maxit=20, 
              visit=visit1b, seed=123321, allow.na = TRUE)

Check the convergence again. The convergence can be assumed.

plot(imp1b2, c("score", "eesalary"))

plot(imp1b2, c("eeworkexp", "erexp"))

Step 8: Transform to the original scale

The means and standard deviations from Step 1 are used to transform the imputed variables back to the original scales.

impdat1b <- complete(imp1b2, "long", include = TRUE)
impdat1b$score <- sd_score1b*impdat1b$score + m_score1b
impdat1b$eesalary <- sd_eesalary1b*impdat1b$eesalary + m_eesalary1b
impdat1b$eeworkexp <- sd_eeworkexp1b*impdat1b$eeworkexp + m_eeworkexp1b
impdat1b$erexp <- sd_erexp1b*impdat1b$erexp + m_erexp1b
imp1b2a <- as.mids(impdat1b)
Step 9: Analyze Each Imputed Data

The with function analyzes each imputed data set in its first argument with the model specified in its second argument. This example compares models with and without the fixed effect of the interaction eeworkexp*erexp.

library(lme4)
fit1b1 <- with(imp1b2a, lmer(score ~ eesalary + eeworkexp 
                            + erexp + (1|erid), REML=FALSE))
fit1b2 <- with(imp1b2a, lmer(score ~ eesalary + eeworkexp 
                            + erexp + eeworkexp*erexp + (1|erid), REML=FALSE))
Step 10: Pool Analysis Results

First, pool the fixed effects.

out1b1 <- pool(fit1b1)
summary(out1b1)
##          term    estimate  std.error   statistic       df      p.value
## 1 (Intercept) 72.01478732 0.20785624 346.4643936 383.4840 0.000000e+00
## 2    eesalary -0.04047357 0.07017071  -0.5767872 152.1339 5.649357e-01
## 3   eeworkexp  0.26110492 0.01558478  16.7538450 122.3739 1.813885e-33
## 4       erexp -0.25988753 0.03902953  -6.6587411 544.0240 6.765469e-11
out1b2 <- pool(fit1b2)
summary(out1b2)
##              term      estimate   std.error   statistic       df      p.value
## 1     (Intercept) 71.9980069007 0.220014545 327.2420332 295.4005 0.000000e+00
## 2        eesalary -0.0404137470 0.070141526  -0.5761743 152.4444 5.653470e-01
## 3       eeworkexp  0.2637330423 0.018420710  14.3172031 118.5766 1.300435e-27
## 4           erexp -0.2537930506 0.045438762  -5.5853866 296.7702 5.272596e-08
## 5 eeworkexp:erexp -0.0009653401 0.003162478  -0.3052480 200.4964 7.604939e-01

Next, pool the covariance between random effects.

source("mimlmtools.R")
ranefMI(fit1b1)
##             (Intercept)
## (Intercept)    15.51966
## attr(,"stddev")
## (Intercept) 
##    3.939499 
## attr(,"correlation")
##             (Intercept)
## (Intercept)           1
ranefMI(fit1b2)
##             (Intercept)
## (Intercept)    15.52003
## attr(,"stddev")
## (Intercept) 
##    3.939547 
## attr(,"correlation")
##             (Intercept)
## (Intercept)           1

The sigmaMI function is used to find the level-1 residual variance.

sigmaMI(fit1b1)
## [1] 17.83708
sigmaMI(fit1b2)
## [1] 17.83566

The anovaMI function is used to pool the likelihood ratio tests from each imputed data set.

anovaMI(fit1b1, fit1b2)
##        F      df1      df2      p.F 
##    0.160    1.000 1596.384    0.689

Example 3: Predicting Students’ Self-Efficacy from Math Achievement

Read the data set of students nested within schools. Self-efficacy is predicted by math achievement.

dat2 <- read.table("lecture14ex2.csv", sep=",", header=TRUE, na.strings="999999")

Next, check the missing patterns in the dataset.

md.pattern(dat2)

##      id schoolid private efficacy ach    
## 1166  1        1       1        1   1   0
## 305   1        1       1        1   0   1
## 291   1        1       1        0   1   1
## 91    1        1       1        0   0   2
##       0        0       0      382 396 778
Step 1: Transform Variable to Reduce Computational Burden

All target variables are standardized by subtracting their means and dividing by their standard deviations. Note that all transformed variables are at the student level.

m_efficacy2 <- mean(dat2$efficacy, na.rm=TRUE)
m_ach2 <- mean(dat2$ach, na.rm=TRUE)
sd_efficacy2 <- sd(dat2$efficacy, na.rm=TRUE)
sd_ach2 <- sd(dat2$ach, na.rm=TRUE)
dat2$efficacy <- (dat2$efficacy - m_efficacy2) / sd_efficacy2
dat2$ach <- (dat2$ach - m_ach2) / sd_ach2
Step 2: Group Mean Centering

Create the blank group means and deviations from group means for imputation.

dat2$aveach <- NA
dat2$aveefficacy <- NA
dat2$diffach <- NA
dat2$diffefficacy <- NA
Step 3: Interaction

Assume that the research hypothesis involves a cross-level interaction between the school means and the deviations from the school means of math achievement in predicting self-efficacy. The imputation model must account for this cross-level interaction in both directions: ach predicting efficacy and efficacy predicting ach. Create the blank interaction variables for imputation.

dat2$intach <- NA # aveach*diffach
dat2$intefficacy <- NA # aveefficacy*diffefficacy
Step 4: Make Imputation Method Vector and Prediction Matrix

First, make the blueprints of the imputation method vector and predictor matrix.

meth2 <- make.method(dat2)
meth2
##           id     schoolid     efficacy          ach      private       aveach 
##           ""           ""        "pmm"        "pmm"           ""     "logreg" 
##  aveefficacy      diffach diffefficacy       intach  intefficacy 
##     "logreg"     "logreg"     "logreg"     "logreg"     "logreg"
pred2 <- make.predictorMatrix(dat2)
pred2
##              id schoolid efficacy ach private aveach aveefficacy diffach
## id            0        1        1   1       1      1           1       1
## schoolid      1        0        1   1       1      1           1       1
## efficacy      1        1        0   1       1      1           1       1
## ach           1        1        1   0       1      1           1       1
## private       1        1        1   1       0      1           1       1
## aveach        1        1        1   1       1      0           1       1
## aveefficacy   1        1        1   1       1      1           0       1
## diffach       1        1        1   1       1      1           1       0
## diffefficacy  1        1        1   1       1      1           1       1
## intach        1        1        1   1       1      1           1       1
## intefficacy   1        1        1   1       1      1           1       1
##              diffefficacy intach intefficacy
## id                      1      1           1
## schoolid                1      1           1
## efficacy                1      1           1
## ach                     1      1           1
## private                 1      1           1
## aveach                  1      1           1
## aveefficacy             1      1           1
## diffach                 1      1           1
## diffefficacy            0      1           1
## intach                  1      0           1
## intefficacy             1      1           0

The imputation method vector is set as follows:

  • efficacy and ach are imputed by 2l.pan because they are level-1 continuous variables.
  • aveefficacy and aveach are the group means calculated from efficacy and ach, respectively. 2l.groupmean is used.
  • diffach and diffefficacy are deviation scores from the group means. ~I() is used to calculate them.
  • intach and intefficacy are interactions between variables in the data set. Again, ~I() is used with the calculation inside it.
meth2[c("efficacy", "ach")] <- "2l.pan"
meth2[c("aveefficacy", "aveach")] <- "2l.groupmean"
meth2["diffach"] <- "~I(ach - aveach)"
meth2["diffefficacy"] <- "~I(efficacy - aveefficacy)"
meth2["intach"] <- "~I(aveach*diffach)"
meth2["intefficacy"] <- "~I(aveefficacy*diffefficacy)"

Next, the prediction matrix is adjusted. A row is left entirely 0 if its imputation method is "" or an ~I() formula. Thus, only the following rows are adjusted: ach, efficacy, aveach, and aveefficacy.

First, set all members in the prediction matrix as 0.

pred2[,] <- 0

Second, in these adjusted rows, set the column of schoolid, which is the level-2 ID, as -2.

pred2[c("ach", "efficacy", "aveach", "aveefficacy"), "schoolid"] <- -2

Third, let's change the efficacy row. The level-2 predictors (excluding aveefficacy) enter with fixed effects only (set as 1). The level-1 deviation scores from the group means (excluding diffefficacy) enter with fixed and random effects (set as 2). The interactions, which are not calculated from efficacy, enter with fixed effects only (set as 1).

pred2["efficacy", c("aveach", "diffach", "intach", "private")] <- c(1, 2, 1, 1)

Fourth, let's change the ach row. The level-2 predictors (excluding aveach) enter with fixed effects only (set as 1). The level-1 deviation scores from the group means (excluding diffach) enter with fixed and random effects (set as 2). The interactions, which are not calculated from ach, enter with fixed effects only (set as 1).

pred2["ach", c("aveefficacy", "diffefficacy", "intefficacy", "private")] <- c(1, 2, 1, 1)

Finally, the group means are calculated from their original variables. In the prediction matrix, set the column of the corresponding original variable to 2.

pred2["aveach", "ach"] <- 2
pred2["aveefficacy", "efficacy"] <- 2
Step 5: Determine the Sequence of Imputation

When a variable is imputed, the variables derived from it must be recalculated right after that imputation. The sequence of imputation is as follows:

visit2 <- c("efficacy", "aveefficacy", "diffefficacy", "intefficacy", 
            "ach", "aveach", "diffach", "intach")
Step 6: Imputation

Use the mice function to impute the data into multiple data sets.

imp2 <- mice(dat2, pred=pred2, meth=meth2, m=30, maxit=30, 
             visit=visit2, seed=123321, allow.na = TRUE)
Step 7: Check Convergence

Check whether the means and standard deviations of the imputed variables are converged.

plot(imp2, c("ach", "efficacy"))

Step 8: Transform to the original scale

The means and standard deviations from Step 1 are used to transform the imputed variables back to the original scales.

impdat2 <- complete(imp2, "long", include = TRUE)
impdat2$efficacy <- sd_efficacy2*impdat2$efficacy + m_efficacy2
impdat2$ach <- sd_ach2*impdat2$ach + m_ach2
imp22 <- as.mids(impdat2)
Step 9: Analyze Each Imputed Data

The with function analyzes each imputed data set in its first argument with the model specified in its second argument. This example compares models with and without the fixed effect of the interaction diffach*aveach.

library(lme4)
fit21 <- with(imp22, lmer(efficacy ~ diffach + aveach 
                   + private + (1 + diffach|schoolid), 
                   REML=FALSE, control = lmerControl(optimizer ="Nelder_Mead")))
fit22 <- with(imp22, lmer(efficacy ~ diffach*aveach
                   + private + (1 + diffach|schoolid), 
                   REML=FALSE, control = lmerControl(optimizer ="Nelder_Mead")))
Step 10: Pool Analysis Results

First, pool the fixed effects.

out21 <- pool(fit21)
summary(out21)
##          term   estimate std.error   statistic       df      p.value
## 1 (Intercept) 37.0845396 0.3175583 116.7802586 1745.489 0.0000000000
## 2     diffach  1.2384357 0.3587338   3.4522411 1782.665 0.0005689315
## 3      aveach  1.1878294 0.3197328   3.7150688 1368.165 0.0002113291
## 4     private  0.1485249 0.4598419   0.3229912 1734.684 0.7467408364
out22 <- pool(fit22)
summary(out22)
##             term   estimate std.error   statistic       df      p.value
## 1    (Intercept) 37.1113988 0.3168150 117.1390032 1741.740 0.000000e+00
## 2        diffach  1.0030075 0.3027603   3.3128763 1750.987 9.422329e-04
## 3         aveach  0.9677560 0.3164757   3.0579154 1733.227 2.262922e-03
## 4        private  0.1495255 0.4599023   0.3251244 1733.376 7.451262e-01
## 5 diffach:aveach  1.9065289 0.4112595   4.6358291 1535.080 3.856417e-06

Next, pool the covariance between random effects.

source("mimlmtools.R")
ranefMI(fit21)
##             (Intercept)    diffach
## (Intercept)   1.6460188 -0.7297848
## diffach      -0.7297848  6.0979224
## attr(,"stddev")
## (Intercept)     diffach 
##    1.282973    2.469397 
## attr(,"correlation")
##             (Intercept)    diffach
## (Intercept)   1.0000000 -0.2303491
## diffach      -0.2303491  1.0000000
ranefMI(fit22)
##             (Intercept)    diffach
## (Intercept)   1.6185364 -0.5002983
## diffach      -0.5002983  4.1289792
## attr(,"stddev")
## (Intercept)     diffach 
##    1.272217    2.031989 
## attr(,"correlation")
##             (Intercept)    diffach
## (Intercept)   1.0000000 -0.1935292
## diffach      -0.1935292  1.0000000

The sigmaMI function is used to find the level-1 residual variance.

sigmaMI(fit21)
## [1] 4.063061
sigmaMI(fit22)
## [1] 4.063007

The anovaMI function is used to pool the likelihood ratio tests from each imputed data set.

anovaMI(fit21, fit22)
##          F        df1        df2        p.F 
##      9.177      1.000 129843.243      0.002

Example 4: Changes in Sense of Belonging in Undergraduates

Read the data set of years nested within undergraduate students. Sense of belonging is the dependent variable. The missing patterns reflect that many students dropped out of the program.

dat3 <- read.table("lecture14ex3.csv", sep=",", header=TRUE, na.strings="999999")

Next, check the missing patterns in the dataset.

md.pattern(dat3)

##      rowid pid time cohort female act ext mem stress     
## 2466     1   1    1      1      1   1   1   1      1    0
## 934      1   1    1      1      1   1   1   0      0    2
##          0   0    0      0      0   0   0 934    934 1868

Check the number of missing values at each time point.

aggregate(stress ~ time, data = dat3, FUN = function(x) sum(is.na(x)), na.action = NULL)
##   time stress
## 1    1      0
## 2    2    189
## 3    3    332
## 4    4    413
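
The same counts can be expressed as proportions of missing values per time point; a minimal sketch that reuses the aggregate call with mean in place of sum:

aggregate(stress ~ time, data = dat3, FUN = function(x) mean(is.na(x)), na.action = NULL)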
Step 1: Transform Variable to Reduce Computational Burden

All target variables are standardized by subtracting their means and dividing by their standard deviations. Let's start with the level-1 (time-varying) variables.

m_mem3 <- mean(dat3$mem, na.rm=TRUE)
m_stress3 <- mean(dat3$stress, na.rm=TRUE)
sd_mem3 <- sd(dat3$mem, na.rm=TRUE)
sd_stress3 <- sd(dat3$stress, na.rm=TRUE)
dat3$mem <- (dat3$mem - m_mem3) / sd_mem3
dat3$stress <- (dat3$stress - m_stress3) / sd_stress3

For the level-2 (time-invariant) variables, a new data set whose rows represent individuals is created before finding the means and standard deviations.

dat3g <- dat3[dat3$time == 1,]
m_act3 <- mean(dat3g$act, na.rm=TRUE)
m_ext3 <- mean(dat3g$ext, na.rm=TRUE)
sd_act3 <- sd(dat3g$act, na.rm=TRUE)
sd_ext3 <- sd(dat3g$ext, na.rm=TRUE)
dat3$act <- (dat3$act - m_act3) / sd_act3
dat3$ext <- (dat3$ext - m_ext3) / sd_ext3
Step 2: Centering

First, the time variable is centered at the first time point.

dat3$time <- dat3$time - 1

Create the blank students’ means and deviations from students’ means for imputation.

dat3$avemem <- NA
dat3$avestress <- NA
dat3$diffmem <- NA
dat3$diffstress <- NA
Step 3: Interaction

The interaction terms in this example are based on time multiplied by other variables, which can be separated into two sets. First, time is multiplied by variables that do not have missing values. These interactions do not need to be imputed because all of their values are observed. The interaction variables are simply calculated and attached to the data set.

dat3$intact <- dat3$time * dat3$act
dat3$intext <- dat3$time * dat3$ext
dat3$intcohort <- dat3$time * dat3$cohort
dat3$intfemale <- dat3$time * dat3$female

Second, time is multiplied by variables that have missing values. In this case, create the blank interaction variables for imputation.

dat3$intmem1 <- NA # time*avemem
dat3$intmem2 <- NA # time*diffmem
dat3$intstress1 <- NA # time*avestress
dat3$intstress2 <- NA # time*diffstress
Step 4: Make Imputation Method Vector and Prediction Matrix

First, make the blueprints of the imputation method vector and predictor matrix.

pred3 <- make.predictorMatrix(dat3)
pred3
##            rowid pid time mem stress cohort female act ext avemem avestress
## rowid          0   1    1   1      1      1      1   1   1      1         1
## pid            1   0    1   1      1      1      1   1   1      1         1
## time           1   1    0   1      1      1      1   1   1      1         1
## mem            1   1    1   0      1      1      1   1   1      1         1
## stress         1   1    1   1      0      1      1   1   1      1         1
## cohort         1   1    1   1      1      0      1   1   1      1         1
## female         1   1    1   1      1      1      0   1   1      1         1
## act            1   1    1   1      1      1      1   0   1      1         1
## ext            1   1    1   1      1      1      1   1   0      1         1
## avemem         1   1    1   1      1      1      1   1   1      0         1
## avestress      1   1    1   1      1      1      1   1   1      1         0
## diffmem        1   1    1   1      1      1      1   1   1      1         1
## diffstress     1   1    1   1      1      1      1   1   1      1         1
## intact         1   1    1   1      1      1      1   1   1      1         1
## intext         1   1    1   1      1      1      1   1   1      1         1
## intcohort      1   1    1   1      1      1      1   1   1      1         1
## intfemale      1   1    1   1      1      1      1   1   1      1         1
## intmem1        1   1    1   1      1      1      1   1   1      1         1
## intmem2        1   1    1   1      1      1      1   1   1      1         1
## intstress1     1   1    1   1      1      1      1   1   1      1         1
## intstress2     1   1    1   1      1      1      1   1   1      1         1
##            diffmem diffstress intact intext intcohort intfemale intmem1 intmem2
## rowid            1          1      1      1         1         1       1       1
## pid              1          1      1      1         1         1       1       1
## time             1          1      1      1         1         1       1       1
## mem              1          1      1      1         1         1       1       1
## stress           1          1      1      1         1         1       1       1
## cohort           1          1      1      1         1         1       1       1
## female           1          1      1      1         1         1       1       1
## act              1          1      1      1         1         1       1       1
## ext              1          1      1      1         1         1       1       1
## avemem           1          1      1      1         1         1       1       1
## avestress        1          1      1      1         1         1       1       1
## diffmem          0          1      1      1         1         1       1       1
## diffstress       1          0      1      1         1         1       1       1
## intact           1          1      0      1         1         1       1       1
## intext           1          1      1      0         1         1       1       1
## intcohort        1          1      1      1         0         1       1       1
## intfemale        1          1      1      1         1         0       1       1
## intmem1          1          1      1      1         1         1       0       1
## intmem2          1          1      1      1         1         1       1       0
## intstress1       1          1      1      1         1         1       1       1
## intstress2       1          1      1      1         1         1       1       1
##            intstress1 intstress2
## rowid               1          1
## pid                 1          1
## time                1          1
## mem                 1          1
## stress              1          1
## cohort              1          1
## female              1          1
## act                 1          1
## ext                 1          1
## avemem              1          1
## avestress           1          1
## diffmem             1          1
## diffstress          1          1
## intact              1          1
## intext              1          1
## intcohort           1          1
## intfemale           1          1
## intmem1             1          1
## intmem2             1          1
## intstress1          0          1
## intstress2          1          0
meth3 <- make.method(dat3)
meth3
##      rowid        pid       time        mem     stress     cohort     female 
##         ""         ""         ""      "pmm"      "pmm"         ""         "" 
##        act        ext     avemem  avestress    diffmem diffstress     intact 
##         ""         ""   "logreg"   "logreg"   "logreg"   "logreg"         "" 
##     intext  intcohort  intfemale    intmem1    intmem2 intstress1 intstress2 
##         ""         ""         ""   "logreg"   "logreg"   "logreg"   "logreg"

The imputation method vector is set as follows:

  • mem and stress are imputed by 2l.pan because they are level-1 continuous variables.
  • avemem and avestress are the group means calculated from mem and stress, respectively. 2l.groupmean is used.
  • diffmem and diffstress are deviation scores from the group means. ~I() is used to calculate them.
  • intmem1, intmem2, intstress1, and intstress2 are the interactions between variables in the data set. Again, ~I() is used with the calculation inside it.
meth3[c("mem", "stress")] <- "2l.pan"
meth3[c("avemem", "avestress")] <- "2l.groupmean"
meth3["diffmem"] <- "~I(mem - avemem)"
meth3["diffstress"] <- "~I(stress - avestress)"
meth3["intmem1"] <- "~I(time*avemem)"
meth3["intmem2"] <- "~I(time*diffmem)"
meth3["intstress1"] <- "~I(time*avestress)"
meth3["intstress2"] <- "~I(time*diffstress)"

Next, the prediction matrix is adjusted. A row is left entirely 0 if its imputation method is "" or an ~I() formula. Thus, only the following rows are adjusted: mem, stress, avemem, and avestress.

First, set all members in the prediction matrix as 0.

pred3[,] <- 0

Second, in these adjusted rows, set the column of pid, which is the level-2 ID, as -2.

pred3[c("mem", "stress", "avemem", "avestress"), "pid"] <- -2

Third, let's change the mem row. time and the level-1 deviation scores from the group means (excluding diffmem) enter with fixed and random effects (set as 2). The level-2 predictors (excluding avemem) enter with fixed effects only (set as 1). The interactions, which are not calculated from mem, enter with fixed effects only (set as 1).

pred3["mem", c("time", "diffstress")] <- 2
pred3["mem", c("cohort", "female", "act", "ext", "avestress")] <- 1
pred3["mem", c("intstress1", "intstress2", "intact", "intext", 
               "intcohort", "intfemale")] <- 1

Fourth, let's change the stress row. time and the level-1 deviation scores from the group means (excluding diffstress) enter with fixed and random effects (set as 2). The level-2 predictors (excluding avestress) enter with fixed effects only (set as 1). The interactions, which are not calculated from stress, enter with fixed effects only (set as 1).

pred3["stress", c("time", "diffmem")] <- 2
pred3["stress", c("cohort", "female", "act", "ext", "avemem")] <- 1
pred3["stress", c("intmem1", "intmem2", "intact", "intext", 
                  "intcohort", "intfemale")] <- 1

Finally, the group means are calculated from their original variables. In the prediction matrix, set the column of the corresponding original variable to 2.

pred3["avemem", "mem"] <- 2
pred3["avestress", "stress"] <- 2
Step 5: Determine the Sequence of Imputation

When a variable is imputed, the variables derived from it must be recalculated right after that imputation. The sequence of imputation is as follows:

visit3 <- c("mem", "avemem", "diffmem", "intmem1", "intmem2", 
            "stress", "avestress", "diffstress", "intstress1", "intstress2")
Step 6: Imputation

Use the mice function to impute the data into multiple data sets.

imp3 <- mice(dat3, pred=pred3, meth=meth3, m=30, maxit=20, 
             visit=visit3, seed=123321, allow.na = TRUE)
Step 7: Check Convergence

Check whether the means and standard deviations of the imputed variables are converged.

plot(imp3, c("mem", "stress"))

Step 8: Transform to the original scale

The means and standard deviations from Step 1 are used to transform the imputed variables back to the original scales.

impdat3 <- complete(imp3, "long", include = TRUE)
impdat3$mem <- sd_mem3*impdat3$mem + m_mem3
impdat3$stress <- sd_stress3*impdat3$stress + m_stress3
impdat3$act <- sd_act3*impdat3$act + m_act3
impdat3$ext <- sd_ext3*impdat3$ext + m_ext3
imp32 <- as.mids(impdat3)
Step 9: Analyze Each Imputed Data

The with function analyzes each imputed data set in its first argument with the model specified in its second argument. This example compares models with and without the random slope of time.

library(lme4)
fit31 <- with(imp32, lmer(mem ~ time + (1|pid), REML=FALSE, 
                          control = lmerControl(optimizer ="Nelder_Mead")))
fit32 <- with(imp32, lmer(mem ~ time + (1 + time|pid), REML=FALSE, 
                          control = lmerControl(optimizer ="Nelder_Mead")))
Step 10: Pool Analysis Results

First, pool the fixed effects.

out31 <- pool(fit31)
summary(out31)
##          term  estimate std.error statistic        df      p.value
## 1 (Intercept) 61.996159 0.6246470  99.24990 3283.5109 0.000000e+00
## 2        time -1.938696 0.2242104  -8.64677  109.8626 4.864041e-14
out32 <- pool(fit32)
summary(out32)
##          term  estimate std.error  statistic        df      p.value
## 1 (Intercept) 61.996159 0.5500611 112.707769 3227.3325 0.000000e+00
## 2        time -1.938696 0.2849086  -6.804622  270.8376 6.488912e-11

Next, pool the covariance between random effects.

source("mimlmtools.R")
ranefMI(fit31)
##             (Intercept)
## (Intercept)    252.0579
## attr(,"stddev")
## (Intercept) 
##    15.87633 
## attr(,"correlation")
##             (Intercept)
## (Intercept)           1
ranefMI(fit32)
##             (Intercept)      time
## (Intercept)   223.55141 -14.57612
## time          -14.57612  39.40087
## attr(,"stddev")
## (Intercept)        time 
##   14.951636    6.277011 
## attr(,"correlation")
##             (Intercept)       time
## (Intercept)   1.0000000 -0.1553103
## time         -0.1553103  1.0000000

The sigmaMI function is used to find the level-1 residual variance.

sigmaMI(fit31)
## [1] 107.4716
sigmaMI(fit32)
## [1] 41.80344

The anovaMI function is used to pool the likelihood ratio tests from each imputed data set.

anovaMI(fit31, fit32)
##       F     df1     df2     p.F 
## 202.312   2.000 476.362   0.000

Example 5: Changes in Sense of Belonging in Undergraduates (Imputation with Wide-Format Data)

Let's read the data set again to remove the additional variables created in the previous example.

dat3 <- read.table("lecture14ex3.csv", sep=",", header=TRUE, na.strings="999999")

Remove the level-1 ID variable.

dat3 <- dat3[,-1]

Check the descriptive statistics, including the number of observed values for each variable.

library(psych)
describe(dat3)
##        vars    n   mean     sd median trimmed    mad min max range  skew
## pid       1 3400 425.50 245.41  425.5  425.50 315.05   1 850   849  0.00
## time      2 3400   2.50   1.12    2.5    2.50   1.48   1   4     3  0.00
## mem       3 2466  59.61  18.34   59.0   59.61  19.27   3 113   110  0.00
## stress    4 2466  50.45  13.57   50.0   50.39  13.34   3  99    96  0.05
## cohort    5 3400   0.50   0.50    1.0    0.50   0.00   0   1     1  0.00
## female    6 3400   0.50   0.50    0.0    0.49   0.00   0   1     1  0.02
## act       7 3400  50.22  13.12   50.5   50.29  14.08   9  87    78 -0.06
## ext       8 3400  49.53   9.73   49.0   49.47   8.90  23  82    59  0.10
##        kurtosis   se
## pid       -1.20 4.21
## time      -1.36 0.02
## mem       -0.29 0.37
## stress     0.18 0.27
## cohort    -2.00 0.01
## female    -2.00 0.01
## act       -0.46 0.23
## ext        0.00 0.17
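
The proportion of missing values per variable can also be computed directly; a minimal sketch:

round(colMeans(is.na(dat3)), 3)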
Step 1: Transform Data from Long to Wide Format

Change the data from the long format to the wide format with the reshape function. In the long format, rows represent level-1 units. In contrast, in the wide format, rows represent level-2 units. A level-1 variable measured at different time points appears as multiple variables in the wide format. For example, stress is a level-1 variable in the long format and this data set has four time points; in the wide format, each row will have stress.1, stress.2, stress.3, and stress.4. The arguments of the reshape function are as follows:

  • data is the target data set in the long format
  • idvar is the level-2 ID variable
  • timevar is the time variable
  • v.names is a vector of time-varying variables
  • direction is the target data set format, which is wide
dat3w <- reshape(data=dat3, idvar="pid", timevar="time", v.names=c("mem", "stress"), direction="wide")

Next, the missing patterns can be checked from the wide format.

md.pattern(dat3w)

##     pid cohort female act ext mem.1 stress.1 mem.2 stress.2 mem.3 stress.3
## 437   1      1      1   1   1     1        1     1        1     1        1
## 81    1      1      1   1   1     1        1     1        1     1        1
## 143   1      1      1   1   1     1        1     1        1     0        0
## 189   1      1      1   1   1     1        1     0        0     0        0
##       0      0      0   0   0     0        0   189      189   332      332
##     mem.4 stress.4     
## 437     1        1    0
## 81      0        0    2
## 143     0        0    4
## 189     0        0    6
##       413      413 1868

Note that transformation to standard scores can be done here but is not necessary. Imputation with a multilevel data structure is computationally intensive, so the transformation helps reduce computation time. However, imputation with a single-level data structure is much quicker, so the transformation is not needed.

Furthermore, group mean centering is not necessary. In the wide format, the group means correspond to the average level of the time-varying variables within a row, and the deviations from the group means correspond to the variation across the time-specific values within that row. Because a time-varying variable at all time points predicts the other variables simultaneously, the group means and the deviations from the group means are already accounted for. Interactions between time and other variables are not needed either, because each time-varying variable is separated into different variables for different time points: the predictions for a variable at Time 1 and the same variable at Time 2 are based on different equations. Therefore, the interaction with time has already been accounted for.

Step 2: Make Imputation Method Vector and Prediction Matrix

First, make the blueprints of the imputation method vector and predictor matrix.

pred3w <- make.predictorMatrix(dat3w)
pred3w
##          pid cohort female act ext mem.1 stress.1 mem.2 stress.2 mem.3 stress.3
## pid        0      1      1   1   1     1        1     1        1     1        1
## cohort     1      0      1   1   1     1        1     1        1     1        1
## female     1      1      0   1   1     1        1     1        1     1        1
## act        1      1      1   0   1     1        1     1        1     1        1
## ext        1      1      1   1   0     1        1     1        1     1        1
## mem.1      1      1      1   1   1     0        1     1        1     1        1
## stress.1   1      1      1   1   1     1        0     1        1     1        1
## mem.2      1      1      1   1   1     1        1     0        1     1        1
## stress.2   1      1      1   1   1     1        1     1        0     1        1
## mem.3      1      1      1   1   1     1        1     1        1     0        1
## stress.3   1      1      1   1   1     1        1     1        1     1        0
## mem.4      1      1      1   1   1     1        1     1        1     1        1
## stress.4   1      1      1   1   1     1        1     1        1     1        1
##          mem.4 stress.4
## pid          1        1
## cohort       1        1
## female       1        1
## act          1        1
## ext          1        1
## mem.1        1        1
## stress.1     1        1
## mem.2        1        1
## stress.2     1        1
## mem.3        1        1
## stress.3     1        1
## mem.4        0        1
## stress.4     1        0
meth3w <- make.method(dat3w)
meth3w
##      pid   cohort   female      act      ext    mem.1 stress.1    mem.2 
##       ""       ""       ""       ""       ""       ""       ""    "pmm" 
## stress.2    mem.3 stress.3    mem.4 stress.4 
##    "pmm"    "pmm"    "pmm"    "pmm"    "pmm"

In the imputation method vector, all variables with missing values are set to pmm, which is predictive mean matching. This method is suitable for both binary and continuous variables.

meth3w[c("mem.2", "mem.3", "mem.4", "stress.2", "stress.3", "stress.4")] <- "pmm"

Next, the prediction matrix is adjusted. All variables with missing values are predicted by all other variables except the ID variable and themselves. Note that imputation in the wide format is preferable to imputation in the long format because missing values of a time-varying variable at one time point can be predicted by the same variable at other time points. Thus, if possible, imputation in the wide format is recommended.

pred3w[,] <- 0
pred3w[c("mem.2", "mem.3", "mem.4", "stress.2", "stress.3", "stress.4"), ] <- 1
pred3w[, "pid"] <- 0
diag(pred3w) <- 0
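
One adjusted row can be inspected to verify the setup (a quick check; output not shown): the mem.2 row should contain 1 everywhere except the pid column and its own column.

pred3w["mem.2", ]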

Note that no extra variables (e.g., group means or interactions) are created, so the sequence of imputation does not need to be adjusted.

Step 3: Imputation

Use the mice function to impute the data into multiple data sets.

imp3w <- mice(dat3w, pred=pred3w, meth=meth3w, m=30, maxit=20, seed=123321)
Step 4: Check Convergence

Check whether the means and standard deviations of the imputed variables are converged.

plot(imp3w, c("mem.2", "mem.3"))

plot(imp3w, c("mem.4", "stress.2"))

plot(imp3w, c("stress.3", "stress.4"))

Step 5: Transform Data from Wide to Long Format

First, change the result of the imputation to stacked data set.

impdat3w <- complete(imp3w, "long", include = TRUE)

Next, put the names of the same time-varying variable at different time points into one vector. Then, combine these vectors in a list.

mem <- c("mem.1", "mem.2", "mem.3", "mem.4")
stress <- c("stress.1", "stress.2", "stress.3", "stress.4")
timevarying <- list(mem, stress)

Next, create a variable representing the row ID in the stacked data set. When changing to the long format, this ID variable will be used like a level-2 ID but takes unique values across imputed data sets. If the stacked wide data set has 26350 rows, the stacked long data set will have \(26350 \times 4 = 105400\) rows. This new variable is used only in the reshaping process.

impdat3w$id2 <- 1:nrow(impdat3w)
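
A quick check of the row count (a sketch, assuming 850 students, m = 30 imputations, and the original data copy):

nrow(impdat3w)  # 850 x (30 + 1) = 26350 rows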

Next, the reshape function is used again to change from the wide format to the long format. The arguments are as follows:

  • data is the target data set in the wide format
  • idvar is the level-2 ID variable. In this case, it is simply a variable that has a different unique value in every row. The original level-2 ID variable cannot be used because its values are duplicated across imputed data sets.
  • times is the set of values of the time variable used in the long format. Users could simply use 1, 2, 3, and 4 to represent the school years. In this example, 0, 1, 2, and 3 are used so that time is centered at the first year.
  • timevar is the name of the time variable. time is used here.
  • varying is the list of time varying variables created above
  • v.names is the name of each time-varying variable
  • direction is the target data set format, which is long
dat3l <- reshape(data=impdat3w, idvar="id2", times=0:3, timevar="time", varying = timevarying, v.names=c("mem", "stress"), direction="long")

dat3l is in the long format. pid is the level-2 ID variable in each imputed data set. However, there is no level-1 ID variable in each imputed data set. Use ave with seq_along to create the level-1 ID variable.

dat3l$.id <- ave(dat3l$id2, dat3l$.imp, FUN=seq_along)

Finally, change to the mice result format for further analyses.

imp3w2 <- as.mids(dat3l)
Step 6: Analyze Each Imputed Data

The with function analyzes each imputed data set in its first argument with the model specified in its second argument. This example compares models with and without the random slope of time.

library(lme4)
fit3w1 <- with(imp3w2, lmer(mem ~ time + (1|pid), REML=FALSE, 
                          control = lmerControl(optimizer ="Nelder_Mead")))
fit3w2 <- with(imp3w2, lmer(mem ~ time + (1 + time|pid), REML=FALSE, 
                          control = lmerControl(optimizer ="Nelder_Mead")))
Step 7: Pool Analysis Results

First, pool the fixed effects.

out3w1 <- pool(fit3w1)
summary(out3w1)
##          term  estimate std.error statistic       df     p.value
## 1 (Intercept) 61.531745 0.6430733 95.683871 584.0731 0.00000e+00
## 2        time -1.329333 0.3212821 -4.137589  77.8088 8.80872e-05
out3w2 <- pool(fit3w2)
summary(out3w2)
##          term  estimate std.error  statistic       df      p.value
## 1 (Intercept) 61.531745 0.6130082 100.376716 498.6143 0.0000000000
## 2        time -1.329333 0.3510202  -3.787056 109.9214 0.0002489821

Next, pool the covariance between random effects.

source("mimlmtools.R")
ranefMI(fit3w1)
##             (Intercept)
## (Intercept)    157.1224
## attr(,"stddev")
## (Intercept) 
##    12.53485 
## attr(,"correlation")
##             (Intercept)
## (Intercept)           1
ranefMI(fit3w2)
##             (Intercept)      time
## (Intercept)   154.76228 -14.79118
## time          -14.79118  25.49105
## attr(,"stddev")
## (Intercept)        time 
##   12.440349    5.048866 
## attr(,"correlation")
##             (Intercept)       time
## (Intercept)   1.0000000 -0.2354921
## time         -0.2354921  1.0000000

The sigmaMI function is used to find the level-1 residual variance.

sigmaMI(fit3w1)
## [1] 178.5253
sigmaMI(fit3w2)
## [1] 136.0403

The anovaMI function is used to pool the likelihood ratio tests from each imputed data set.

anovaMI(fit3w1, fit3w2)
##      F    df1    df2    p.F 
## 13.457  2.000 54.541  0.000