In this lecture, the steps for multiple imputation by the
mice
package will be explained. First, read the dataset
investigating the interaction between interviewer’s and interviewee’s
sex on interview ratings. In this data, some data points are
missing.
dat1 <-read.table("lecture14ex1.csv", sep=",", header=TRUE, na.strings="999999")
To show the basic of the mice
package, only three
variables and one level-2 ID will be used.
dat1a <- dat1[,c("erid", "score", "eesex", "ersex")]
Next, check the missing patterns in the dataset.
library(mice)
library(miceadds)
md.pattern(dat1a)
## erid eesex ersex score
## 7907 1 1 1 1 0
## 2093 1 1 1 0 1
## 0 0 0 2093 2093
Most variables should be on a standard score (or the mean and standard deviation close to standard score). If a variable is in a big number (e.g., 1,000), the multiplication between two numbers can lead to even bigger number (e.g., 1,000,000). If a variable has many decimals (e.g., 0.001), the multiplication can lead to even smaller decimals (e.g., 0.000001). In this case, both sex variables are in dummy variables so the original scale is retained. However, the score is subtracted by its mean. Note that the subtracted mean will be added after the data are imputed.
m_score1a <- mean(dat1a$score, na.rm=TRUE)
dat1a$score <- dat1a$score - m_score1a
First, make the blueprints of the imputation method vector and predictor matrix.
pred1a <- make.predictorMatrix(dat1a)
pred1a
## erid score eesex ersex
## erid 0 1 1 1
## score 1 0 1 1
## eesex 1 1 0 1
## ersex 1 1 1 0
meth1a <- make.method(dat1a)
meth1a
## erid score eesex ersex
## "" "pmm" "" ""
Next, the prediction matrix is adjusted. Only score
has
missing data. Thus, the rows involving other variables
(erid
, eesex
, and ersex
) are 0
because no missing data prediction is involved. The values put in the
score
row is as follows:
erid
is -2 because this variable is the level-2 ID
variablescore
is 0 because the missing data cannot be predicted
by the same variable.eesex
is 4. This variable is the main variable hoping
to have random effects on score
. Thus, the group means and
the original scale with random effect of eesex
is used to
predict score.
ersex
is 1. This variable is in Level 2. It should have
only fixed effects on score
so 1 is used.Note that -2 means level-2 variable, 0 means no prediction, 1 means only fixed effects, 2 means fixed effects and random effect, 3 means fixed effect with original variable and group means, 4, means the fixed and random effects from the original variable and the fixed effect of group means.
pred1a[,] <- 0
pred1a["score",] <- c(-2, 0, 4, 1)
Next, the imputation vector is adjusted. erid
,
eesex
, and ersex
do not have missing data so
""
is used in the vector. For score
,
2l.pan
is used because score is a continuous variable and
expect to have equal variances across level-2 units. Other options are
as follows:
""
means to not predict missing values in this
variable2l.pan
is used when the target variable is in level 1,
is continuous, and is assumed to have equal residual variances across
level-2 units2l.norm
is used when the target variable is in level 1,
is continuous, and is assumed to have unequal residual variances across
level-2 units2l.bin
is used when the target variable is level-1
binary variable2lonly.norm
is used when the target variable is level-2
continuous variable2lonly.pmm
is used when the target variable is level-2
binary (or continuous) variable2l.groupmean
is used to calculate group means (like the
ave
function). In prediction matrix, -2 is used for the ID
variable and 2 is used for the target variable to find group means. Note
that the miceadds
package is needed to use
2l.groupmean
.meth1a["score"] <- "2l.pan"
The mice function is used to impute data to multiple sets. The first
argument is the target data. pred
is for prediction matrix.
meth
is for imputation method vector. m
is the
number of imputed data sets. maxit
is the number of
iterations (or loops) to impute the data. seed
is the
random seed number which users can set to get the same results in later
analyses.
imp1 <- mice(dat1a, pred=pred1a, meth=meth1a, m=10, maxit=10, seed=123321)
Check whether the means and standard deviations of the imputed variables are converged. The convergence means that means and standard deviations from all imputed data sets are distributed around the same areas. If the distributions of means and standard deviations in later iterations have the same amount of variability (not increasing or not decreasing), the convergence can be assumed.
plot(imp1, c("score"))
From this plot, it is hard to assume convergence. Thus,
m
and maxit
are increased.
imp12 <- mice(dat1a, pred=pred1a, meth=meth1a, m=30, maxit=20, seed=123321)
Check the convergence again. The convergence can be assumed.
plot(imp12, c("score"))
The subtracted means from Step 1 are added back to the imputed data.
First, make the imputed data set in the long format where all imputed
data sets are stacked as new rows by the complete
function.
Then, the mean is added. Finally, the data in long format is transformed
into the mice package output by the as.mids
function.
impdat1a <- complete(imp12, "long", include = TRUE)
impdat1a$score <- impdat1a$score + m_score1a
imp1a <- as.mids(impdat1a)
The with function means that analyze each data set shown in the first
argument by the second argument. This example will compare the random
slopes of eesex
.
library(lme4)
fit1 <- with(imp1a, lmer(score ~ eesex + ersex + (1|erid), REML=FALSE))
fit2 <- with(imp1a, lmer(score ~ eesex + ersex + (1 + eesex|erid), REML=FALSE))
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00270948 (tol = 0.002, component 1)
The analysis results from all imputed data sets are saved in
fit1
and fit2
. Those results must be pooled to
get single parameter estimates and standard errors. The
pool
function is used to pool fixed effects. However, when
the lme4
package is used for data analysis, the
broom.mixed
package must be called to make the
pool
function run appropriately.
library(broom.mixed)
out1 <- pool(fit1)
summary(out1)
## term estimate std.error statistic df p.value
## 1 (Intercept) 72.74481011 0.20080224 362.2709110 6578.6691 0.0000000
## 2 eesex -0.04883249 0.09974143 -0.4895909 530.5782 0.6246258
## 3 ersex 0.43230753 0.27559283 1.5686458 7623.1215 0.1167720
out2 <- pool(fit2)
summary(out2)
## term estimate std.error statistic df p.value
## 1 (Intercept) 72.78826618 0.1994367 364.9692121 5718.6305 0.0000000
## 2 eesex -0.04883249 0.1063862 -0.4590114 675.1393 0.6463737
## 3 ersex 0.34539539 0.2761916 1.2505644 6347.6615 0.2111395
To pool the random effect, I wrote a function to extract the
covariance of random effects from each analysis result and average them.
Users must put the "mimlmtools.R"
file in the working
directory and run the source
function to read the code
inside the file. The ranefMI
function is used to find the
covariance of the random effects.
source("mimlmtools.R")
ranefMI(fit1)
## (Intercept)
## (Intercept) 16.51926
## attr(,"stddev")
## (Intercept)
## 4.064389
## attr(,"correlation")
## (Intercept)
## (Intercept) 1
ranefMI(fit2)
## (Intercept) eesex
## (Intercept) 15.895677 0.283141
## eesex 0.283141 1.540886
## attr(,"stddev")
## (Intercept) eesex
## 3.986938 1.241324
## attr(,"correlation")
## (Intercept) eesex
## (Intercept) 1.0000000 0.0572108
## eesex 0.0572108 1.0000000
The sigmaMI
function is used to find the level-1
residual variance.
sigmaMI(fit1)
## [1] 19.25922
sigmaMI(fit2)
## [1] 18.8312
The anovaMI
function is used to pool the likelihood
ratio tests from each imputed data set.
anovaMI(fit1, fit2)
## F df1 df2 p.F
## 4.530 2.000 705.487 0.011
The r2mlmMI
function is used to pool R-squared
results.
r2mlmMI(fit1)
## f v m fv fvm
## 0.001349975 0.000000000 0.461079304 0.001349975 0.462429279
In this example, four variables and one level-2 ID will be used from the interview data set. All four variables have missing data.
dat1b <- dat1[,c("erid", "score", "eesalary", "eeworkexp", "erexp")]
Next, check the missing patterns in the data set.
library(mice)
md.pattern(dat1b)
## erid eeworkexp eesalary erexp score
## 4030 1 1 1 1 1 0
## 1083 1 1 1 1 0 1
## 1049 1 1 1 0 1 1
## 257 1 1 1 0 0 2
## 1007 1 1 0 1 1 1
## 266 1 1 0 1 0 2
## 271 1 1 0 0 1 2
## 60 1 1 0 0 0 3
## 954 1 0 1 1 1 1
## 272 1 0 1 1 0 2
## 278 1 0 1 0 1 2
## 67 1 0 1 0 0 3
## 248 1 0 0 1 1 2
## 70 1 0 0 1 0 3
## 70 1 0 0 0 1 3
## 18 1 0 0 0 0 4
## 0 1977 2010 2070 2093 8150
All target variables are subtracted by means and divided by standard
deviations. The mean
and sd
functions are used
on the data set directly for level-1 variables.
m_score1b <- mean(dat1b$score, na.rm=TRUE)
m_eesalary1b <- mean(dat1b$eesalary, na.rm=TRUE)
m_eeworkexp1b <- mean(dat1b$eeworkexp, na.rm=TRUE)
sd_score1b <- sd(dat1b$score, na.rm=TRUE)
sd_eesalary1b <- sd(dat1b$eesalary, na.rm=TRUE)
sd_eeworkexp1b <- sd(dat1b$eeworkexp, na.rm=TRUE)
For level-2 variables, the new data set which rows represent level-2 units are created and then the means and standard deviations are calculated.
dat1bg <- dat1b[!duplicated(dat1b$erid),]
m_erexp1b <- mean(dat1bg$erexp, na.rm=TRUE)
sd_erexp1b <- sd(dat1bg$erexp, na.rm=TRUE)
Then, all variables are transformed by the calculated means and standard deviations.
dat1b$score <- (dat1b$score - m_score1b) / sd_score1b
dat1b$eesalary <- (dat1b$eesalary - m_eesalary1b) / sd_eesalary1b
dat1b$eeworkexp <- (dat1b$eeworkexp - m_eeworkexp1b) / sd_eeworkexp1b
dat1b$erexp <- (dat1b$erexp - m_erexp1b) / sd_erexp1b
Because the relationship between variables in this target data set is
assumed to have cross-level interactions, the group means and
group-mean-centered variables are created. However, these variables have
missing data, the group means calculation need to be in the process of
missing imputation. That is, the missing data are imputed and then the
group means and the deviations from group means are created. To create
the slots for imputation, these group means and deviations are named and
attached to the data set. All members in the created variables are
NA
.
dat1b$avescore <- NA
dat1b$aveeesalary <- NA
dat1b$aveeeworkexp <- NA
dat1b$diffscore <- NA
dat1b$diffeesalary <- NA
dat1b$diffeeworkexp <- NA
In this example, assume that interviewers’ work experience
(eeworkexp
) and interviewer’s interview experience
(erexp
) have the interaction effect on other variables. We
need to account for this interaction in the imputation model. However,
the work experience is separated into the group mean
(aveeeworkexp
) and the deviation from group mean
(diffeeworkexp
). Then, two interactions are created. Like
group mean centering, the interaction is created from variables with
missing values. Thus, the missing values must be imputed before the
interaction is created in the imputation model. To create the slots for
imputation, these interactions are named and attached to the data set.
All members in the created variables are NA
.
dat1b$int1 <- NA # diffeeworkexp * erexp
dat1b$int2 <- NA # aveeeworkexp * erexp
First, make the blueprints of the imputation method vector and predictor matrix.
meth1b <- make.method(dat1b)
meth1b
## erid score eesalary eeworkexp erexp
## "" "pmm" "pmm" "pmm" "pmm"
## avescore aveeesalary aveeeworkexp diffscore diffeesalary
## "logreg" "logreg" "logreg" "logreg" "logreg"
## diffeeworkexp int1 int2
## "logreg" "logreg" "logreg"
pred1b <- make.predictorMatrix(dat1b)
pred1b
## erid score eesalary eeworkexp erexp avescore aveeesalary
## erid 0 1 1 1 1 1 1
## score 1 0 1 1 1 1 1
## eesalary 1 1 0 1 1 1 1
## eeworkexp 1 1 1 0 1 1 1
## erexp 1 1 1 1 0 1 1
## avescore 1 1 1 1 1 0 1
## aveeesalary 1 1 1 1 1 1 0
## aveeeworkexp 1 1 1 1 1 1 1
## diffscore 1 1 1 1 1 1 1
## diffeesalary 1 1 1 1 1 1 1
## diffeeworkexp 1 1 1 1 1 1 1
## int1 1 1 1 1 1 1 1
## int2 1 1 1 1 1 1 1
## aveeeworkexp diffscore diffeesalary diffeeworkexp int1 int2
## erid 1 1 1 1 1 1
## score 1 1 1 1 1 1
## eesalary 1 1 1 1 1 1
## eeworkexp 1 1 1 1 1 1
## erexp 1 1 1 1 1 1
## avescore 1 1 1 1 1 1
## aveeesalary 1 1 1 1 1 1
## aveeeworkexp 0 1 1 1 1 1
## diffscore 1 0 1 1 1 1
## diffeesalary 1 1 0 1 1 1
## diffeeworkexp 1 1 1 0 1 1
## int1 1 1 1 1 0 1
## int2 1 1 1 1 1 0
The imputation method vector is set as follows:
score
, eesalary
, eeworkexp
is
predicted by 2l.pan
because they are level-1 continuous
variables.erexp
is predicted by 2lonly.norm
because
they are level-2 continuous variable.avescore
, aveeesalary
,
aveeeworkexp
are the group means calculated from
score
, eesalary
, and eeworkexp
respectively. The 2l.groupmean
is used.diffscore
, diffeesalary
, and
diffeeworkexp
are all deviation scores from the group
means. They are simply the differences between the original variables
and the group means. ~I()
is used to tell mice
that the simple calculation between variables is used. Then, put the
formula inside ~I()
int1
and int2
are the interactions between
variables in the data set. Again, ~I()
is used and the
calculation is put inside it.meth1b[c("score", "eesalary", "eeworkexp")] <- "2l.pan"
meth1b["erexp"] <- "2lonly.norm"
meth1b[c("avescore", "aveeesalary", "aveeeworkexp")] <- "2l.groupmean"
meth1b["diffscore"] <- "~I(score - avescore)"
meth1b["diffeesalary"] <- "~I(eesalary - aveeesalary)"
meth1b["diffeeworkexp"] <- "~I(eeworkexp - aveeeworkexp)"
meth1b["int1"] <- "~I(diffeeworkexp*erexp)"
meth1b["int2"] <- "~I(aveeeworkexp*erexp)"
Next, the prediction matrix is adjusted. All elements in a row are 0
if the imputation method matrix is ""
or ~I()
.
Thus, these following rows are adjusted: score
,
eesalary
, eeworkexp
, erexp
,
avescore
, aveeesalary
, and
aveeeworkexp
.
First, set all members in the prediction matrix as 0.
pred1b[,] <- 0
Second, in these adjusted rows, set the column of erid
,
which is level-2 ID, as -2.
pred1b[c("score", "eesalary", "eeworkexp", "erexp"), "erid"] <- -2
pred1b[c("avescore", "aveeesalary", "aveeeworkexp"), "erid"] <- -2
Third, let’s change the score
row. The level-2
predictors (excluding avescore
) are predicted using only
fixed effects (set as 1). The level-1 deviation scores from group means
(excluding diffscore
) are predicted with fixed and random
effects (set as 2). The interactions, which are not calculated from
score
, are predicted with only fixed effects (set as
1).
pred1b["score", c("aveeesalary", "aveeeworkexp", "erexp")] <- 1
pred1b["score", c("diffeesalary", "diffeeworkexp")] <- 2
pred1b["score", c("int1", "int2")] <- 1
Fourth, let’s change the eesalary
row. The level-2
predictors (excluding aveeesalary
) are predicted using only
fixed effects (set as 1). The level-1 deviation scores from group means
(excluding diffeesalary
) are predicted with fixed and
random effects (set as 2). The interactions, which are not calculated
from eesalary
, are predicted with only fixed effects (set
as 1).
pred1b["eesalary", c("avescore", "aveeeworkexp", "erexp")] <- 1
pred1b["eesalary", c("diffscore", "diffeeworkexp")] <- 2
pred1b["eesalary", c("int1", "int2")] <- 1
Fifth, let’s change the eeworkexp
row. The level-2
predictors (excluding aveeeworkexp
) are predicted using
only fixed effects (set as 1). The level-1 deviation scores from group
means (excluding diffeeworkexp
) are predicted with fixed
and random effects (set as 2). The interactions are not used because the
interactions are calculated from eeworkexp.
pred1b["eeworkexp", c("avescore", "aveeesalary", "erexp")] <- 1
pred1b["eeworkexp", c("diffscore", "diffeesalary")] <- 2
Sixth, let’s change the erexp
row. Because this variable
is in Level 2, only other level-2 predictors can predict its missing
data.
pred1b["erexp", c("avescore", "aveeesalary", "aveeeworkexp")] <- 1
Finally, group means are calculated from their original scales. In the prediction matrix, set the corresponding original scales as 2.
pred1b["avescore", "score"] <- 2
pred1b["aveeesalary", "eesalary"] <- 2
pred1b["aveeeworkexp", "eeworkexp"] <- 2
When a variable is imputed, the variables based on the calculation of
the imputed variable must be calculated right after the imputation. For
example, if the original scale of eeworkexp
is imputed, the
group means (aveeeworkexp
), deviations from group means
(diffeeworkexp
), and interactions (int1
and
int2
) must be imputed right after it. Thus, the sequence of
imputation is as follows:
visit1b <- c("score", "avescore", "diffscore",
"eesalary", "aveeesalary", "diffeesalary",
"erexp", "int1", "int2",
"eeworkexp", "aveeeworkexp", "diffeeworkexp",
"int1", "int2")
The mice
function is used to impute data to multiple
sets. The additional arguments from the previous example are
visit
(the vector determining the imputation sequence) and
allow.na
, set as TRUE
to allow variables that
all members are NA
.
imp1b <- mice(dat1b, pred=pred1b, meth=meth1b, m=10, maxit=10, visit=visit1b, seed=123321, allow.na = TRUE)
Check whether the means and standard deviations of the imputed variables are converged. Note that if there are many imputed variables, users may focus on the variables that are important in your main analysis and have high propotions of missing data.
plot(imp1b, c("score", "eesalary"))
plot(imp1b, c("eeworkexp", "erexp"))
From this plot, it is hard to assume convergence. Thus,
m
and maxit
are increased.
imp1b2 <- mice(dat1b, pred=pred1b, meth=meth1b, m=30, maxit=20,
visit=visit1b, seed=123321, allow.na = TRUE)
Check the convergence again. The convergence can be assumed.
plot(imp1b2, c("score", "eesalary"))
plot(imp1b2, c("eeworkexp", "erexp"))
The means and standard deviations from Step 1 are used to transform the imputed variables back to the original scales.
impdat1b <- complete(imp1b2, "long", include = TRUE)
impdat1b$score <- sd_score1b*impdat1b$score + m_score1b
impdat1b$eesalary <- sd_eesalary1b*impdat1b$eesalary + m_eesalary1b
impdat1b$eeworkexp <- sd_eeworkexp1b*impdat1b$eeworkexp + m_eeworkexp1b
impdat1b$erexp <- sd_erexp1b*impdat1b$erexp + m_erexp1b
imp1b2a <- as.mids(impdat1b)
The with
function means that analyze each data set shown
in the first argument by the second argument. This example will compare
the fixed effect of the interaction, eeworkexp*erexp
.
library(lme4)
fit1b1 <- with(imp1b2a, lmer(score ~ eesalary + eeworkexp
+ erexp + (1|erid), REML=FALSE))
fit1b2 <- with(imp1b2a, lmer(score ~ eesalary + eeworkexp
+ erexp + eeworkexp*erexp + (1|erid), REML=FALSE))
First, pool the fixed effects.
out1b1 <- pool(fit1b1)
summary(out1b1)
## term estimate std.error statistic df p.value
## 1 (Intercept) 72.01478732 0.20785624 346.4643936 383.4840 0.000000e+00
## 2 eesalary -0.04047357 0.07017071 -0.5767872 152.1339 5.649357e-01
## 3 eeworkexp 0.26110492 0.01558478 16.7538450 122.3739 1.813885e-33
## 4 erexp -0.25988753 0.03902953 -6.6587411 544.0240 6.765469e-11
out1b2 <- pool(fit1b2)
summary(out1b2)
## term estimate std.error statistic df p.value
## 1 (Intercept) 71.9980069007 0.220014545 327.2420332 295.4005 0.000000e+00
## 2 eesalary -0.0404137470 0.070141526 -0.5761743 152.4444 5.653470e-01
## 3 eeworkexp 0.2637330423 0.018420710 14.3172031 118.5766 1.300435e-27
## 4 erexp -0.2537930506 0.045438762 -5.5853866 296.7702 5.272596e-08
## 5 eeworkexp:erexp -0.0009653401 0.003162478 -0.3052480 200.4964 7.604939e-01
Next, pool the covariance between random effects.
source("mimlmtools.R")
ranefMI(fit1b1)
## (Intercept)
## (Intercept) 15.51966
## attr(,"stddev")
## (Intercept)
## 3.939499
## attr(,"correlation")
## (Intercept)
## (Intercept) 1
ranefMI(fit1b2)
## (Intercept)
## (Intercept) 15.52003
## attr(,"stddev")
## (Intercept)
## 3.939547
## attr(,"correlation")
## (Intercept)
## (Intercept) 1
The sigmaMI
function is used to find the level-1
residual variance.
sigmaMI(fit1b1)
## [1] 17.83708
sigmaMI(fit1b2)
## [1] 17.83566
The anovaMI
function is used to pool the likelihood
ratio tests from each imputed data set.
anovaMI(fit1b1, fit1b2)
## F df1 df2 p.F
## 0.160 1.000 1596.384 0.689
Read the data set of students nested in schools. The self-efficacy is predicted by math achievement.
dat2 <- read.table("lecture14ex2.csv", sep=",", header=TRUE, na.strings="999999")
Next, check the missing patterns in the dataset.
md.pattern(dat2)
## id schoolid private efficacy ach
## 1166 1 1 1 1 1 0
## 305 1 1 1 1 0 1
## 291 1 1 1 0 1 1
## 91 1 1 1 0 0 2
## 0 0 0 382 396 778
All target variables are subtracted by means and divided by standard deviations. Note that all transformed variables are in the student level.
m_efficacy2 <- mean(dat2$efficacy, na.rm=TRUE)
m_ach2 <- mean(dat2$ach, na.rm=TRUE)
sd_efficacy2 <- sd(dat2$efficacy, na.rm=TRUE)
sd_ach2 <- sd(dat2$ach, na.rm=TRUE)
dat2$efficacy <- (dat2$efficacy - m_efficacy2) / sd_efficacy2
dat2$ach <- (dat2$ach - m_ach2) / sd_ach2
Create the blank group means and deviations from group means for imputation.
dat2$aveach <- NA
dat2$aveefficacy <- NA
dat2$diffach <- NA
dat2$diffefficacy <- NA
Assume that the research hypothesis expected to have the cross-level
interaction effect of school means and deviation from school means of
math achievement on self-efficacy. The imputation model must address
that cross-level interaction by putting the cross-level interaction
effect in both direction: ach
on efficacy
and
efficacy
on ach
. Create the blank interaction
variables for imputation.
dat2$intach <- NA # aveach*diffach
dat2$intefficacy <- NA # aveefficacy*diffefficacy
First, make the blueprints of the imputation method vector and predictor matrix.
meth2 <- make.method(dat2)
meth2
## id schoolid efficacy ach private aveach
## "" "" "pmm" "pmm" "" "logreg"
## aveefficacy diffach diffefficacy intach intefficacy
## "logreg" "logreg" "logreg" "logreg" "logreg"
pred2 <- make.predictorMatrix(dat2)
pred2
## id schoolid efficacy ach private aveach aveefficacy diffach
## id 0 1 1 1 1 1 1 1
## schoolid 1 0 1 1 1 1 1 1
## efficacy 1 1 0 1 1 1 1 1
## ach 1 1 1 0 1 1 1 1
## private 1 1 1 1 0 1 1 1
## aveach 1 1 1 1 1 0 1 1
## aveefficacy 1 1 1 1 1 1 0 1
## diffach 1 1 1 1 1 1 1 0
## diffefficacy 1 1 1 1 1 1 1 1
## intach 1 1 1 1 1 1 1 1
## intefficacy 1 1 1 1 1 1 1 1
## diffefficacy intach intefficacy
## id 1 1 1
## schoolid 1 1 1
## efficacy 1 1 1
## ach 1 1 1
## private 1 1 1
## aveach 1 1 1
## aveefficacy 1 1 1
## diffach 1 1 1
## diffefficacy 0 1 1
## intach 1 0 1
## intefficacy 1 1 0
The imputation method vector is set as follows:
efficacy
and ach
are predicted by
2l.pan
because they are level-1 continuous variables.aveefficacy
and aveach
are the group means
calculated from efficacy
and ach
respectively.
The 2l.groupmean
is used.diffach
and diffefficacy
are all deviation
scores from the group means. Use ~I()
to calculate
them.intach
and intefficacy
are the
interactions between variables in the data set. Again, ~I()
is used and the calculation is put inside it.meth2[c("efficacy", "ach")] <- "2l.pan"
meth2[c("aveefficacy", "aveach")] <- "2l.groupmean"
meth2["diffach"] <- "~I(ach - aveach)"
meth2["diffefficacy"] <- "~I(efficacy - aveefficacy)"
meth2["intach"] <- "~I(aveach*diffach)"
meth2["intefficacy"] <- "~I(aveefficacy*diffefficacy)"
Next, the prediction matrix is adjusted. All elements in a row are 0
if the imputation method matrix is ""
or ~I()
.
Thus, these following rows are adjusted: ach
,
efficacy
, aveach
, and
aveefficacy
.
First, set all members in the prediction matrix as 0.
pred1b[,] <- 0
Second, in these adjusted rows, set the column of erid
,
which is level-2 ID, as -2.
pred2[c("ach", "efficacy", "aveach", "aveefficacy"), "schoolid"] <- -2
Third, let’s change the efficacy
row. The level-2
predictors (excluding aveefficacy
) are predicted using only
fixed effects (set as 1). The level-1 deviation scores from group means
(excluding diffefficacy
) are predicted with fixed and
random effects (set as 2). The interactions, which are not calculated
from efficacy
, are predicted with only fixed effects (set
as 1).
pred2["efficacy", c("aveach", "diffach", "intach", "private")] <- c(1, 2, 1, 1)
Fourth, let’s change the ach
row. The level-2 predictors
(excluding aveach
) are predicted using only fixed effects
(set as 1). The level-1 deviation scores from group means (excluding
diffach
) are predicted with fixed and random effects (set
as 2). The interactions, which are not calculated from ach
,
are predicted with only fixed effects (set as 1).
pred2["ach", c("aveefficacy", "diffefficacy", "intefficacy", "private")] <- c(1, 2, 1, 1)
Finally, group means are calculated from their original scales. In the prediction matrix, set the corresponding original scales as 2.
pred2["aveach", "ach"] <- 2
pred2["aveefficacy", "efficacy"] <- 2
When a variable is imputed, the variables based on the calculation of the imputed variable must be calculated right after the imputation. The sequence of imputation is as follows:
visit2 <- c("efficacy", "aveefficacy", "diffefficacy", "intefficacy",
"ach", "aveach", "diffach", "intach")
Use the mice
function to impute data to multiple
sets.
imp2 <- mice(dat2, pred=pred2, meth=meth2, m=30, maxit=30,
visit=visit2, seed=123321, allow.na = TRUE)
Check whether the means and standard deviations of the imputed variables are converged.
plot(imp2, c("ach", "efficacy"))
The means and standard deviations from Step 1 are used to transform the imputed variables back to the original scales.
impdat2 <- complete(imp2, "long", include = TRUE)
impdat2$efficacy <- sd_efficacy2*impdat2$efficacy + m_efficacy2
impdat2$ach <- sd_ach2*impdat2$ach + m_ach2
imp22 <- as.mids(impdat2)
The with
function means that analyze each data set shown
in the first argument by the second argument. This example will compare
the fixed effect of the interaction, diffach*aveach
.
library(lme4)
fit21 <- with(imp22, lmer(efficacy ~ diffach + aveach
+ private + (1 + diffach|schoolid),
REML=FALSE, control = lmerControl(optimizer ="Nelder_Mead")))
fit22 <- with(imp22, lmer(efficacy ~ diffach*aveach
+ private + (1 + diffach|schoolid),
REML=FALSE, control = lmerControl(optimizer ="Nelder_Mead")))
First, pool the fixed effects.
out21 <- pool(fit21)
summary(out21)
## term estimate std.error statistic df p.value
## 1 (Intercept) 37.0845396 0.3175583 116.7802586 1745.489 0.0000000000
## 2 diffach 1.2384357 0.3587338 3.4522411 1782.665 0.0005689315
## 3 aveach 1.1878294 0.3197328 3.7150688 1368.165 0.0002113291
## 4 private 0.1485249 0.4598419 0.3229912 1734.684 0.7467408364
out22 <- pool(fit22)
summary(out22)
## term estimate std.error statistic df p.value
## 1 (Intercept) 37.1113988 0.3168150 117.1390032 1741.740 0.000000e+00
## 2 diffach 1.0030075 0.3027603 3.3128763 1750.987 9.422329e-04
## 3 aveach 0.9677560 0.3164757 3.0579154 1733.227 2.262922e-03
## 4 private 0.1495255 0.4599023 0.3251244 1733.376 7.451262e-01
## 5 diffach:aveach 1.9065289 0.4112595 4.6358291 1535.080 3.856417e-06
Next, pool the covariance between random effects.
source("mimlmtools.R")
ranefMI(fit21)
## (Intercept) diffach
## (Intercept) 1.6460188 -0.7297848
## diffach -0.7297848 6.0979224
## attr(,"stddev")
## (Intercept) diffach
## 1.282973 2.469397
## attr(,"correlation")
## (Intercept) diffach
## (Intercept) 1.0000000 -0.2303491
## diffach -0.2303491 1.0000000
ranefMI(fit22)
## (Intercept) diffach
## (Intercept) 1.6185364 -0.5002983
## diffach -0.5002983 4.1289792
## attr(,"stddev")
## (Intercept) diffach
## 1.272217 2.031989
## attr(,"correlation")
## (Intercept) diffach
## (Intercept) 1.0000000 -0.1935292
## diffach -0.1935292 1.0000000
The sigmaMI
function is used to find the level-1
residual variance.
sigmaMI(fit21)
## [1] 4.063061
sigmaMI(fit22)
## [1] 4.063007
The anovaMI
function is used to pool the likelihood
ratio tests from each imputed data set.
anovaMI(fit21, fit22)
## F df1 df2 p.F
## 9.177 1.000 129843.243 0.002
Read the data set of years nested in undergraduate students. Sense of belonging is the dependent variable. The missing patterns reflect many students dropped out from the program.
dat3 <- read.table("lecture14ex3.csv", sep=",", header=TRUE, na.strings="999999")
Next, check the missing patterns in the dataset.
md.pattern(dat3)
## rowid pid time cohort female act ext mem stress
## 2466 1 1 1 1 1 1 1 1 1 0
## 934 1 1 1 1 1 1 1 0 0 2
## 0 0 0 0 0 0 0 934 934 1868
Check the proportion of missing at each time point.
aggregate(stress ~ time, data = dat3, FUN = function(x) sum(is.na(x)), na.action = NULL)
## time stress
## 1 1 0
## 2 2 189
## 3 3 332
## 4 4 413
All target variables are subtracted by means and divided by standard deviations. Let’s start with level-1 (time-varying) variables.
m_mem3 <- mean(dat3$mem, na.rm=TRUE)
m_stress3 <- mean(dat3$stress, na.rm=TRUE)
sd_mem3 <- sd(dat3$mem, na.rm=TRUE)
sd_stress3 <- sd(dat3$stress, na.rm=TRUE)
dat3$mem <- (dat3$mem - m_mem3) / sd_mem3
dat3$stress <- (dat3$stress - m_stress3) / sd_stress3
For level-2 (time-invariant) variables, new data which rows represent individuals are created before finding means and standard deviations.
dat3g <- dat3[dat3$time == 1,]
m_act3 <- mean(dat3g$act, na.rm=TRUE)
m_ext3 <- mean(dat3g$ext, na.rm=TRUE)
sd_act3 <- sd(dat3g$act, na.rm=TRUE)
sd_ext3 <- sd(dat3g$ext, na.rm=TRUE)
dat3$act <- (dat3$act - m_act3) / sd_act3
dat3$ext <- (dat3$ext - m_ext3) / sd_ext3
First, the time variable is centered at the first time point.
dat3$time <- dat3$time - 1
Create the blank students’ means and deviations from students’ means for imputation.
dat3$avemem <- NA
dat3$avestress <- NA
dat3$diffmem <- NA
dat3$diffstress <- NA
The interaction terms in this example are based on time multipled
with other variables, which can be separated into two sets. First,
time
is multipled with other variables that do not have
missing values. These interactions do not need to be imputed because all
values are observed. The interaction variables are simply calculated and
attached into the data sets.
dat3$intact <- dat3$time * dat3$act
dat3$intext <- dat3$time * dat3$ext
dat3$intcohort <- dat3$time * dat3$cohort
dat3$intfemale <- dat3$time * dat3$female
Second, time
is multipled with variables with missing
values. In this case, create the blank interaction variables for
imputation.
dat3$intmem1 <- NA # time*avemem
dat3$intmem2 <- NA # time*diffmem
dat3$intstress1 <- NA # time*avestress
dat3$intstress2 <- NA # time*diffstress
First, make the blueprints of the imputation method vector and predictor matrix.
pred3 <- make.predictorMatrix(dat3)
pred3
## rowid pid time mem stress cohort female act ext avemem avestress
## rowid 0 1 1 1 1 1 1 1 1 1 1
## pid 1 0 1 1 1 1 1 1 1 1 1
## time 1 1 0 1 1 1 1 1 1 1 1
## mem 1 1 1 0 1 1 1 1 1 1 1
## stress 1 1 1 1 0 1 1 1 1 1 1
## cohort 1 1 1 1 1 0 1 1 1 1 1
## female 1 1 1 1 1 1 0 1 1 1 1
## act 1 1 1 1 1 1 1 0 1 1 1
## ext 1 1 1 1 1 1 1 1 0 1 1
## avemem 1 1 1 1 1 1 1 1 1 0 1
## avestress 1 1 1 1 1 1 1 1 1 1 0
## diffmem 1 1 1 1 1 1 1 1 1 1 1
## diffstress 1 1 1 1 1 1 1 1 1 1 1
## intact 1 1 1 1 1 1 1 1 1 1 1
## intext 1 1 1 1 1 1 1 1 1 1 1
## intcohort 1 1 1 1 1 1 1 1 1 1 1
## intfemale 1 1 1 1 1 1 1 1 1 1 1
## intmem1 1 1 1 1 1 1 1 1 1 1 1
## intmem2 1 1 1 1 1 1 1 1 1 1 1
## intstress1 1 1 1 1 1 1 1 1 1 1 1
## intstress2 1 1 1 1 1 1 1 1 1 1 1
## diffmem diffstress intact intext intcohort intfemale intmem1 intmem2
## rowid 1 1 1 1 1 1 1 1
## pid 1 1 1 1 1 1 1 1
## time 1 1 1 1 1 1 1 1
## mem 1 1 1 1 1 1 1 1
## stress 1 1 1 1 1 1 1 1
## cohort 1 1 1 1 1 1 1 1
## female 1 1 1 1 1 1 1 1
## act 1 1 1 1 1 1 1 1
## ext 1 1 1 1 1 1 1 1
## avemem 1 1 1 1 1 1 1 1
## avestress 1 1 1 1 1 1 1 1
## diffmem 0 1 1 1 1 1 1 1
## diffstress 1 0 1 1 1 1 1 1
## intact 1 1 0 1 1 1 1 1
## intext 1 1 1 0 1 1 1 1
## intcohort 1 1 1 1 0 1 1 1
## intfemale 1 1 1 1 1 0 1 1
## intmem1 1 1 1 1 1 1 0 1
## intmem2 1 1 1 1 1 1 1 0
## intstress1 1 1 1 1 1 1 1 1
## intstress2 1 1 1 1 1 1 1 1
## intstress1 intstress2
## rowid 1 1
## pid 1 1
## time 1 1
## mem 1 1
## stress 1 1
## cohort 1 1
## female 1 1
## act 1 1
## ext 1 1
## avemem 1 1
## avestress 1 1
## diffmem 1 1
## diffstress 1 1
## intact 1 1
## intext 1 1
## intcohort 1 1
## intfemale 1 1
## intmem1 1 1
## intmem2 1 1
## intstress1 0 1
## intstress2 1 0
meth3 <- make.method(dat3)
meth3
## rowid pid time mem stress cohort female
## "" "" "" "pmm" "pmm" "" ""
## act ext avemem avestress diffmem diffstress intact
## "" "" "logreg" "logreg" "logreg" "logreg" ""
## intext intcohort intfemale intmem1 intmem2 intstress1 intstress2
## "" "" "" "logreg" "logreg" "logreg" "logreg"
The imputation method vector is set as follows:
mem
and stress
are predicted by
2l.pan
because they are level-1 continuous variables.avemem
and avestress
are the group means
calculated from mem
and stress
respectively.
The 2l.groupmean
is used.diffmem
and diffstress
are all deviation
scores from the group means. Use ~I()
to calculate
them.intmem1
, intmem2
, intstress1
,
and intstress2
are the interactions between variables in
the data set. Again, ~I()
is used and the calculation is
put inside it.meth3[c("mem", "stress")] <- "2l.pan"
meth3[c("avemem", "avestress")] <- "2l.groupmean"
meth3["diffmem"] <- "~I(mem - avemem)"
meth3["diffstress"] <- "~I(stress - avestress)"
meth3["intmem1"] <- "~I(time*avemem)"
meth3["intmem2"] <- "~I(time*diffmem)"
meth3["intstress1"] <- "~I(time*avestress)"
meth3["intstress2"] <- "~I(time*diffstress)"
Next, the prediction matrix is adjusted. All elements in a row are 0
if the imputation method matrix is ""
or ~I()
.
Thus, these following rows are adjusted: mem
,
stress
, avemem
, and
avestress
.
First, set all members in the prediction matrix as 0.
pred3[,] <- 0
Second, in these adjusted rows, set the column of erid
,
which is level-2 ID, as -2.
pred3[c("mem", "stress", "avemem", "avestress"), "pid"] <- -2
Third, let’s change the mem
row. The level-2 predictors
(excluding avemem
) are predicted using only fixed effects
(set as 1). The level-1 deviation scores from group means (excluding
diffmem
) are predicted with fixed and random effects (set
as 2). The interactions, which are not calculated from mem
,
are predicted with only fixed effects (set as 1).
pred3["mem", c("time", "diffstress")] <- 2
pred3["mem", c("cohort", "female", "act", "ext", "avestress")] <- 1
pred3["mem", c("intstress1", "intstress2", "intact", "intext",
"intcohort", "intfemale")] <- 1
Fourth, let’s change the stress
row. The level-2
predictors (excluding avestress
) are predicted using only
fixed effects (set as 1). The level-1 deviation scores from group means
(excluding diffstress
) are predicted with fixed and random
effects (set as 2). The interactions, which are not calculated from
stress
, are predicted with only fixed effects (set as
1).
pred3["stress", c("time", "diffmem")] <- 2
pred3["stress", c("cohort", "female", "act", "ext", "avemem")] <- 1
pred3["stress", c("intmem1", "intmem2", "intact", "intext",
"intcohort", "intfemale")] <- 1
Finally, group means are calculated from their original scales. In the prediction matrix, set the corresponding original scales as 2.
pred3["avemem", "mem"] <- 2
pred3["avestress", "stress"] <- 2
When a variable is imputed, the variables based on the calculation of the imputed variable must be calculated right after the imputation. The sequence of imputation is as follows:
visit3 <- c("mem", "avemem", "diffmem", "intmem1", "intmem2",
"stress", "avestress", "diffstress", "intstress1", "intstress2")
Use the mice
function to impute data to multiple
sets.
imp3 <- mice(dat3, pred=pred3, meth=meth3, m=30, maxit=20,
visit=visit3, seed=123321, allow.na = TRUE)
Check whether the means and standard deviations of the imputed variables are converged.
plot(imp3, c("mem", "stress"))
The means and standard deviations from Step 1 are used to transform the imputed variables back to the original scales.
impdat3 <- complete(imp3, "long", include = TRUE)
impdat3$mem <- sd_mem3*impdat3$mem + m_mem3
impdat3$stress <- sd_stress3*impdat3$stress + m_stress3
impdat3$act <- sd_act3*impdat3$act + m_act3
impdat3$ext <- sd_ext3*impdat3$ext + m_ext3
imp32 <- as.mids(impdat3)
The with
function means that analyze each data set shown
in the first argument by the second argument. This example will compare
the random effect of time
.
library(lme4)
fit31 <- with(imp32, lmer(mem ~ time + (1|pid), REML=FALSE,
control = lmerControl(optimizer ="Nelder_Mead")))
fit32 <- with(imp32, lmer(mem ~ time + (1 + time|pid), REML=FALSE,
control = lmerControl(optimizer ="Nelder_Mead")))
First, pool the fixed effects.
out31 <- pool(fit31)
summary(out31)
## term estimate std.error statistic df p.value
## 1 (Intercept) 61.996159 0.6246470 99.24990 3283.5109 0.000000e+00
## 2 time -1.938696 0.2242104 -8.64677 109.8626 4.864041e-14
out32 <- pool(fit32)
summary(out32)
## term estimate std.error statistic df p.value
## 1 (Intercept) 61.996159 0.5500611 112.707769 3227.3325 0.000000e+00
## 2 time -1.938696 0.2849086 -6.804622 270.8376 6.488912e-11
Next, pool the covariance between random effects.
source("mimlmtools.R")
ranefMI(fit31)
## (Intercept)
## (Intercept) 252.0579
## attr(,"stddev")
## (Intercept)
## 15.87633
## attr(,"correlation")
## (Intercept)
## (Intercept) 1
ranefMI(fit32)
## (Intercept) time
## (Intercept) 223.55141 -14.57612
## time -14.57612 39.40087
## attr(,"stddev")
## (Intercept) time
## 14.951636 6.277011
## attr(,"correlation")
## (Intercept) time
## (Intercept) 1.0000000 -0.1553103
## time -0.1553103 1.0000000
The sigmaMI
function is used to find the level-1
residual variance.
sigmaMI(fit31)
## [1] 107.4716
sigmaMI(fit32)
## [1] 41.80344
The anovaMI
function is used to pool the likelihood
ratio tests from each imputed data set.
anovaMI(fit31, fit32)
## F df1 df2 p.F
## 202.312 2.000 476.362 0.000
Let’s read the data set again to remove additional variables in the previous example.
dat3 <- read.table("lecture14ex3.csv", sep=",", header=TRUE, na.strings="999999")
Remove level-1 ID variable.
dat3 <- dat3[,-1]
Check the proportion of missing at each time point.
library(psych)
describe(dat3)
## vars n mean sd median trimmed mad min max range skew
## pid 1 3400 425.50 245.41 425.5 425.50 315.05 1 850 849 0.00
## time 2 3400 2.50 1.12 2.5 2.50 1.48 1 4 3 0.00
## mem 3 2466 59.61 18.34 59.0 59.61 19.27 3 113 110 0.00
## stress 4 2466 50.45 13.57 50.0 50.39 13.34 3 99 96 0.05
## cohort 5 3400 0.50 0.50 1.0 0.50 0.00 0 1 1 0.00
## female 6 3400 0.50 0.50 0.0 0.49 0.00 0 1 1 0.02
## act 7 3400 50.22 13.12 50.5 50.29 14.08 9 87 78 -0.06
## ext 8 3400 49.53 9.73 49.0 49.47 8.90 23 82 59 0.10
## kurtosis se
## pid -1.20 4.21
## time -1.36 0.02
## mem -0.29 0.37
## stress 0.18 0.27
## cohort -2.00 0.01
## female -2.00 0.01
## act -0.46 0.23
## ext 0.00 0.17
Change data from long to wide format by the reshape
function. In the long format, rows represent level-1 ID. In contrast, in
the wide format, rows represent level-2 ID. A level-1 variable in
different time points will be shown as multiple variables in the wide
format. For example, stress
is the level-1 variable in the
long format and this data has four time points. In the wide format, each
row will have stress1
, stress2
,
stress3
, and stress4
. The arguments of the
reshape
function are as follows:
data
is the target data set in the long formatidvar
is the level-2 ID variabletimevar
is the time variablev.names
is a vector of time-varying variablesdirection
is the target data set format, which is
widedat3w <- reshape(data=dat3, idvar="pid", timevar="time", v.names=c("mem", "stress"), direction="wide")
Next, the missing patterns can be checked from the wide format.
md.pattern(dat3w)
## pid cohort female act ext mem.1 stress.1 mem.2 stress.2 mem.3 stress.3
## 437 1 1 1 1 1 1 1 1 1 1 1
## 81 1 1 1 1 1 1 1 1 1 1 1
## 143 1 1 1 1 1 1 1 1 1 0 0
## 189 1 1 1 1 1 1 1 0 0 0 0
## 0 0 0 0 0 0 0 189 189 332 332
## mem.4 stress.4
## 437 1 1 0
## 81 0 0 2
## 143 0 0 4
## 189 0 0 6
## 413 413 1868
Note that the transformation to the standard scale can be done but not necessary. The imputation in multilevel data structure is computational intensive so the transformation helps reduce computation time. However, the imputation in single level data structure is much quicker. The transformation is not necessary.
Further, the group mean centering is not necessary. In the wide format, the group means are shown in the average level of time-varying variables and the deviation from group means are shown in the variation across the values in time varying variables in each row. The time-varying variable at all time points will predict other variables simultaneously so the group means and their deviation from group means have been accounted. The interaction of time and other variables are not needed as well because all time-varying variables are separated to different variables for different time points. The predictions in a variable in Time 1 and the same variable in Time 2 are based on different equations. Therefore, the interaction with time has been accounted already.
First, make the blueprints of the imputation method vector and predictor matrix.
pred3w <- make.predictorMatrix(dat3w)
pred3w
## pid cohort female act ext mem.1 stress.1 mem.2 stress.2 mem.3 stress.3
## pid 0 1 1 1 1 1 1 1 1 1 1
## cohort 1 0 1 1 1 1 1 1 1 1 1
## female 1 1 0 1 1 1 1 1 1 1 1
## act 1 1 1 0 1 1 1 1 1 1 1
## ext 1 1 1 1 0 1 1 1 1 1 1
## mem.1 1 1 1 1 1 0 1 1 1 1 1
## stress.1 1 1 1 1 1 1 0 1 1 1 1
## mem.2 1 1 1 1 1 1 1 0 1 1 1
## stress.2 1 1 1 1 1 1 1 1 0 1 1
## mem.3 1 1 1 1 1 1 1 1 1 0 1
## stress.3 1 1 1 1 1 1 1 1 1 1 0
## mem.4 1 1 1 1 1 1 1 1 1 1 1
## stress.4 1 1 1 1 1 1 1 1 1 1 1
## mem.4 stress.4
## pid 1 1
## cohort 1 1
## female 1 1
## act 1 1
## ext 1 1
## mem.1 1 1
## stress.1 1 1
## mem.2 1 1
## stress.2 1 1
## mem.3 1 1
## stress.3 1 1
## mem.4 0 1
## stress.4 1 0
meth3w <- make.method(dat3w)
meth3w
## pid cohort female act ext mem.1 stress.1 mem.2
## "" "" "" "" "" "" "" "pmm"
## stress.2 mem.3 stress.3 mem.4 stress.4
## "pmm" "pmm" "pmm" "pmm" "pmm"
In the imputation method vector, all variables with missing values
are set as pmm
which is predictive mean matching. This is
suitable for both binary and continuous variables.
meth3w[c("mem.2", "mem.3", "mem.4", "stress.2", "stress.3", "stress.4")] <- "pmm"
Next, the prediction matrix is adjusted. All variables with missing values are predicted by other variables except ID and themselves. Note that the imputation in wide format is better than long format because the missing values in time-varying variables in one time point can be predicted by the same variable in other time points. Thus, if possible, the imputation in the wide format is recommended.
pred3w[,] <- 0
pred3w[c("mem.2", "mem.3", "mem.4", "stress.2", "stress.3", "stress.4"), ] <- 1
pred3w[, "pid"] <- 0
diag(pred3w) <- 0
Note that extra variables (e.g., group means or interactions) are not created. The sequence of imputation will be not adjusted.
Use the mice
function to impute data to multiple
sets.
imp3w <- mice(dat3w, pred=pred3w, meth=meth3w, m=30, maxit=20, seed=123321)
Check whether the means and standard deviations of the imputed variables are converged.
plot(imp3w, c("mem.2", "mem.3"))
plot(imp3w, c("mem.4", "stress.2"))
plot(imp3w, c("stress.3", "stress.4"))
First, change the result of the imputation to stacked data set.
impdat3w <- complete(imp3w, "long", include = TRUE)
Next, list the same time-varying variables with different time points in the same vector. Then, combine them in a list.
mem <- c("mem.1", "mem.2", "mem.3", "mem.4")
stress <- c("stress.1", "stress.2", "stress.3", "stress.4")
timevarying <- list(mem, stress)
Next, create a variable representing the ID of rows in the stack data set. When changing to the long format, this ID variable will be used similar to level 2 IDs but taking unique values across imputed data set. If the stacked wide data set has 26350 rows, the stacked long will have \(26350 \times 4 = 105400\) rows. This new variable will be only used in the reshaping process.
impdat3w$id2 <- 1:nrow(impdat3w)
Next, the reshape function is used again to change from the wide format to the long format. The arguments are as follows:
data
is the target data set in the wide formatidvar
is the level-2 ID variable. In this case, it is
simply the variable that have different unique values across rows. The
original level-2 variable cannot be used because it is duplicated across
imputed data set.times
is the values of time used in the long data
format. Users may simply use 1, 2, 3, and 4 to represent the school
years. In this example, 0, 1, 2, and 3 is used as the centered time at
the beginning year.timevar
is the name of the time variable.
time
is used here.varying
is the list of time varying variables created
abovev.names
is the name of each time-varying variabledirection
is the target data set format, which is
longdat3l <- reshape(data=impdat3w, idvar="id2", times=0:3, timevar="time", varying = timevarying, v.names=c("mem", "stress"), direction="long")
dat3l
is in the long format. pid
would be
the level-2 ID variable in each imputed data set. However, there is no
level-1 ID variable in each imputed data set. Use ave
with
seq_along
to create level-1 ID variable.
dat3l$.id <- ave(dat3l$id2, dat3l$.imp, FUN=seq_along)
Finally, change to the mice
result format for further
analyses.
imp3w2 <- as.mids(dat3l)
The with
function means that analyze each data set shown
in the first argument by the second argument. This example will compare
the random effect of time
.
library(lme4)
fit3w1 <- with(imp3w2, lmer(mem ~ time + (1|pid), REML=FALSE,
control = lmerControl(optimizer ="Nelder_Mead")))
fit3w2 <- with(imp3w2, lmer(mem ~ time + (1 + time|pid), REML=FALSE,
control = lmerControl(optimizer ="Nelder_Mead")))
First, pool the fixed effects.
out3w1 <- pool(fit3w1)
summary(out3w1)
## term estimate std.error statistic df p.value
## 1 (Intercept) 61.531745 0.6430733 95.683871 584.0731 0.00000e+00
## 2 time -1.329333 0.3212821 -4.137589 77.8088 8.80872e-05
out3w2 <- pool(fit3w2)
summary(out3w2)
## term estimate std.error statistic df p.value
## 1 (Intercept) 61.531745 0.6130082 100.376716 498.6143 0.0000000000
## 2 time -1.329333 0.3510202 -3.787056 109.9214 0.0002489821
Next, pool the covariance between random effects.
source("mimlmtools.R")
ranefMI(fit3w1)
## (Intercept)
## (Intercept) 157.1224
## attr(,"stddev")
## (Intercept)
## 12.53485
## attr(,"correlation")
## (Intercept)
## (Intercept) 1
ranefMI(fit3w2)
## (Intercept) time
## (Intercept) 154.76228 -14.79118
## time -14.79118 25.49105
## attr(,"stddev")
## (Intercept) time
## 12.440349 5.048866
## attr(,"correlation")
## (Intercept) time
## (Intercept) 1.0000000 -0.2354921
## time -0.2354921 1.0000000
The sigmaMI
function is used to find the level-1
residual variance.
sigmaMI(fit3w1)
## [1] 178.5253
sigmaMI(fit3w2)
## [1] 136.0403
The anovaMI
function is used to pool the likelihood
ratio tests from each imputed data set.
anovaMI(fit3w1, fit3w2)
## F df1 df2 p.F
## 13.457 2.000 54.541 0.000