--- title: "Intro to Multilevel Model: Lecture 14 Example Code" author: "Sunthud Pornprasertmanit" output: html_document date: "`r format(Sys.time(), '%d/%m/%Y')`" --- ```{css style settings, echo = FALSE} blockquote { padding: 10px 20px; margin: 0 0 20px; font-size: 14px; border-left: 5px solid #eee; font-family: "Courier New"; } ``` ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ```{r xx, include=FALSE} library(lme4) library(psych) library(interactions) library(MASS) library(r2mlm) library(mice) library(miceadds) library(broom.mixed) out <- readr::read_rds("lecture14saveresults.rds") imp1 <- out[[1]] imp12 <- out[[2]] imp1b <- out[[3]] imp1b2 <- out[[4]] imp2 <- out[[5]] imp3 <- out[[6]] imp3w <- out[[7]] ``` ## Example 1: Interviewees' Rating Scores within Interviewers In this lecture, the steps for multiple imputation by the `mice` package will be explained. First, read the dataset investigating the interaction between interviewer's and interviewee's sex on interview ratings. In this data, some data points are missing. ```{r readdata1} dat1 <-read.table("lecture14ex1.csv", sep=",", header=TRUE, na.strings="999999") ``` To show the basic of the `mice` package, only three variables and one level-2 ID will be used. ```{r readdata12} dat1a <- dat1[,c("erid", "score", "eesex", "ersex")] ``` Next, check the missing patterns in the dataset. ```{r missingpattern1} library(mice) library(miceadds) md.pattern(dat1a) ``` ##### Step 1: Transform Variable to Reduce Computational Burden Most variables should be on a standard score (or the mean and standard deviation close to standard score). If a variable is in a big number (e.g., 1,000), the multiplication between two numbers can lead to even bigger number (e.g., 1,000,000). If a variable has many decimals (e.g., 0.001), the multiplication can lead to even smaller decimals (e.g., 0.000001). In this case, both sex variables are in dummy variables so the original scale is retained. However, the score is subtracted by its mean. Note that the subtracted mean will be added after the data are imputed. ```{r firststep1} m_score1a <- mean(dat1a$score, na.rm=TRUE) dat1a$score <- dat1a$score - m_score1a ``` ##### Step 2: Make Imputation Method Vector and Prediction Matrix First, make the blueprints of the imputation method vector and predictor matrix. ```{r secondstep1} pred1a <- make.predictorMatrix(dat1a) pred1a meth1a <- make.method(dat1a) meth1a ``` Next, the prediction matrix is adjusted. Only `score` has missing data. Thus, the rows involving other variables (`erid`, `eesex`, and `ersex`) are 0 because no missing data prediction is involved. The values put in the `score` row is as follows: - `erid` is -2 because this variable is the level-2 ID variable - `score` is 0 because the missing data cannot be predicted by the same variable. - `eesex` is 4. This variable is the main variable hoping to have random effects on `score`. Thus, the group means and the original scale with random effect of `eesex` is used to predict `score.` - `ersex` is 1. This variable is in Level 2. It should have only fixed effects on `score` so 1 is used. Note that -2 means level-2 variable, 0 means no prediction, 1 means only fixed effects, 2 means fixed effects and random effect, 3 means fixed effect with original variable and group means, 4, means the fixed and random effects from the original variable and the fixed effect of group means. ```{r secondstep12} pred1a[,] <- 0 pred1a["score",] <- c(-2, 0, 4, 1) ``` Next, the imputation vector is adjusted. `erid`, `eesex`, and `ersex` do not have missing data so `""` is used in the vector. For `score`, `2l.pan` is used because score is a continuous variable and expect to have equal variances across level-2 units. Other options are as follows: - `""` means to not predict missing values in this variable - `2l.pan` is used when the target variable is in level 1, is continuous, and is assumed to have equal residual variances across level-2 units - `2l.norm` is used when the target variable is in level 1, is continuous, and is assumed to have unequal residual variances across level-2 units - `2l.bin` is used when the target variable is level-1 binary variable - `2lonly.norm` is used when the target variable is level-2 continuous variable - `2lonly.pmm` is used when the target variable is level-2 binary (or continuous) variable - `2l.groupmean` is used to calculate group means (like the `ave` function). In prediction matrix, -2 is used for the ID variable and 2 is used for the target variable to find group means. Note that the `miceadds` package is needed to use `2l.groupmean`. ```{r secondstep13} meth1a["score"] <- "2l.pan" ``` ##### Step 3: Imputation The mice function is used to impute data to multiple sets. The first argument is the target data. `pred` is for prediction matrix. `meth` is for imputation method vector. `m` is the number of imputed data sets. `maxit` is the number of iterations (or loops) to impute the data. `seed` is the random seed number which users can set to get the same results in later analyses. ```{r thirdstep1, eval=FALSE} imp1 <- mice(dat1a, pred=pred1a, meth=meth1a, m=10, maxit=10, seed=123321) ``` ##### Step 4: Check Convergence Check whether the means and standard deviations of the imputed variables are converged. The convergence means that means and standard deviations from all imputed data sets are distributed around the same areas. If the distributions of means and standard deviations in later iterations have the same amount of variability (not increasing or not decreasing), the convergence can be assumed. ```{r fourthstep1} plot(imp1, c("score")) ``` From this plot, it is hard to assume convergence. Thus, `m` and `maxit` are increased. ```{r thirdstep12, eval=FALSE} imp12 <- mice(dat1a, pred=pred1a, meth=meth1a, m=30, maxit=20, seed=123321) ``` Check the convergence again. The convergence can be assumed. ```{r fourthstep12} plot(imp12, c("score")) ``` ##### Step 5: Transform to the original scale The subtracted means from Step 1 are added back to the imputed data. First, make the imputed data set in the long format where all imputed data sets are stacked as new rows by the `complete` function. Then, the mean is added. Finally, the data in long format is transformed into the mice package output by the `as.mids` function. ```{r fifthstep1} impdat1a <- complete(imp12, "long", include = TRUE) impdat1a$score <- impdat1a$score + m_score1a imp1a <- as.mids(impdat1a) ``` ##### Step 6: Analyze Each Imputed Data The with function means that analyze each data set shown in the first argument by the second argument. This example will compare the random slopes of `eesex`. ```{r sixthstep1} library(lme4) fit1 <- with(imp1a, lmer(score ~ eesex + ersex + (1|erid), REML=FALSE)) fit2 <- with(imp1a, lmer(score ~ eesex + ersex + (1 + eesex|erid), REML=FALSE)) ``` ##### Step 7: Pool Analysis Results The analysis results from all imputed data sets are saved in `fit1` and `fit2`. Those results must be pooled to get single parameter estimates and standard errors. The `pool` function is used to pool fixed effects. However, when the `lme4` package is used for data analysis, the `broom.mixed` package must be called to make the `pool` function run appropriately. ```{r seventhstep1} library(broom.mixed) out1 <- pool(fit1) summary(out1) out2 <- pool(fit2) summary(out2) ``` To pool the random effect, I wrote a function to extract the covariance of random effects from each analysis result and average them. Users must put the `"mimlmtools.R"` file in the working directory and run the `source` function to read the code inside the file. The `ranefMI` function is used to find the covariance of the random effects. ```{r seventhstep12} source("mimlmtools.R") ranefMI(fit1) ranefMI(fit2) ``` The `sigmaMI` function is used to find the level-1 residual variance. ```{r seventhstep13} sigmaMI(fit1) sigmaMI(fit2) ``` The `anovaMI` function is used to pool the likelihood ratio tests from each imputed data set. ```{r seventhstep14} anovaMI(fit1, fit2) ``` The `r2mlmMI` function is used to pool R-squared results. ```{r seventhstep15} r2mlmMI(fit1) ``` ## Example 2: Interviewees' Rating Scores within Interviewers (with More Complex Missing Data Patterns) In this example, four variables and one level-2 ID will be used from the interview data set. All four variables have missing data. ```{r readdata2} dat1b <- dat1[,c("erid", "score", "eesalary", "eeworkexp", "erexp")] ``` Next, check the missing patterns in the data set. ```{r missingpattern2} library(mice) md.pattern(dat1b) ``` ##### Step 1: Transform Variable to Reduce Computational Burden All target variables are subtracted by means and divided by standard deviations. The `mean` and `sd` functions are used on the data set directly for level-1 variables. ```{r firststep2} m_score1b <- mean(dat1b$score, na.rm=TRUE) m_eesalary1b <- mean(dat1b$eesalary, na.rm=TRUE) m_eeworkexp1b <- mean(dat1b$eeworkexp, na.rm=TRUE) sd_score1b <- sd(dat1b$score, na.rm=TRUE) sd_eesalary1b <- sd(dat1b$eesalary, na.rm=TRUE) sd_eeworkexp1b <- sd(dat1b$eeworkexp, na.rm=TRUE) ``` For level-2 variables, the new data set which rows represent level-2 units are created and then the means and standard deviations are calculated. ```{r firststep22} dat1bg <- dat1b[!duplicated(dat1b$erid),] m_erexp1b <- mean(dat1bg$erexp, na.rm=TRUE) sd_erexp1b <- sd(dat1bg$erexp, na.rm=TRUE) ``` Then, all variables are transformed by the calculated means and standard deviations. ```{r firststep23} dat1b$score <- (dat1b$score - m_score1b) / sd_score1b dat1b$eesalary <- (dat1b$eesalary - m_eesalary1b) / sd_eesalary1b dat1b$eeworkexp <- (dat1b$eeworkexp - m_eeworkexp1b) / sd_eeworkexp1b dat1b$erexp <- (dat1b$erexp - m_erexp1b) / sd_erexp1b ``` ##### Step 2: Group Mean Centering Because the relationship between variables in this target data set is assumed to have cross-level interactions, the group means and group-mean-centered variables are created. However, these variables have missing data, the group means calculation need to be in the process of missing imputation. That is, the missing data are imputed and then the group means and the deviations from group means are created. To create the slots for imputation, these group means and deviations are named and attached to the data set. All members in the created variables are `NA`. ```{r secondstep2} dat1b$avescore <- NA dat1b$aveeesalary <- NA dat1b$aveeeworkexp <- NA dat1b$diffscore <- NA dat1b$diffeesalary <- NA dat1b$diffeeworkexp <- NA ``` ##### Step 3: Interaction In this example, assume that interviewers' work experience (`eeworkexp`) and interviewer's interview experience (`erexp`) have the interaction effect on other variables. We need to account for this interaction in the imputation model. However, the work experience is separated into the group mean (`aveeeworkexp`) and the deviation from group mean (`diffeeworkexp`). Then, two interactions are created. Like group mean centering, the interaction is created from variables with missing values. Thus, the missing values must be imputed before the interaction is created in the imputation model. To create the slots for imputation, these interactions are named and attached to the data set. All members in the created variables are `NA`. ```{r thirdstep2} dat1b$int1 <- NA # diffeeworkexp * erexp dat1b$int2 <- NA # aveeeworkexp * erexp ``` ##### Step 4: Make Imputation Method Vector and Prediction Matrix First, make the blueprints of the imputation method vector and predictor matrix. ```{r fourthstep2} meth1b <- make.method(dat1b) meth1b pred1b <- make.predictorMatrix(dat1b) pred1b ``` The imputation method vector is set as follows: - `score`, `eesalary`, `eeworkexp` is predicted by `2l.pan` because they are level-1 continuous variables. - `erexp` is predicted by `2lonly.norm` because they are level-2 continuous variable. - `avescore`, `aveeesalary`, `aveeeworkexp` are the group means calculated from `score`, `eesalary`, and `eeworkexp` respectively. The `2l.groupmean` is used. - `diffscore`, `diffeesalary`, and `diffeeworkexp` are all deviation scores from the group means. They are simply the differences between the original variables and the group means. `~I()` is used to tell `mice` that the simple calculation between variables is used. Then, put the formula inside `~I()` - `int1` and `int2` are the interactions between variables in the data set. Again, `~I()` is used and the calculation is put inside it. ```{r fourthstep22} meth1b[c("score", "eesalary", "eeworkexp")] <- "2l.pan" meth1b["erexp"] <- "2lonly.norm" meth1b[c("avescore", "aveeesalary", "aveeeworkexp")] <- "2l.groupmean" meth1b["diffscore"] <- "~I(score - avescore)" meth1b["diffeesalary"] <- "~I(eesalary - aveeesalary)" meth1b["diffeeworkexp"] <- "~I(eeworkexp - aveeeworkexp)" meth1b["int1"] <- "~I(diffeeworkexp*erexp)" meth1b["int2"] <- "~I(aveeeworkexp*erexp)" ``` Next, the prediction matrix is adjusted. All elements in a row are 0 if the imputation method matrix is `""` or `~I()`. Thus, these following rows are adjusted: `score`, `eesalary`, `eeworkexp`, `erexp`, `avescore`, `aveeesalary`, and `aveeeworkexp`. First, set all members in the prediction matrix as 0. ```{r fourthstep23} pred1b[,] <- 0 ``` Second, in these adjusted rows, set the column of `erid`, which is level-2 ID, as -2. ```{r fourthstep24} pred1b[c("score", "eesalary", "eeworkexp", "erexp"), "erid"] <- -2 pred1b[c("avescore", "aveeesalary", "aveeeworkexp"), "erid"] <- -2 ``` Third, let's change the `score` row. The level-2 predictors (excluding `avescore`) are predicted using only fixed effects (set as 1). The level-1 deviation scores from group means (excluding `diffscore`) are predicted with fixed and random effects (set as 2). The interactions, which are not calculated from `score`, are predicted with only fixed effects (set as 1). ```{r fourthstep25} pred1b["score", c("aveeesalary", "aveeeworkexp", "erexp")] <- 1 pred1b["score", c("diffeesalary", "diffeeworkexp")] <- 2 pred1b["score", c("int1", "int2")] <- 1 ``` Fourth, let's change the `eesalary` row. The level-2 predictors (excluding `aveeesalary`) are predicted using only fixed effects (set as 1). The level-1 deviation scores from group means (excluding `diffeesalary`) are predicted with fixed and random effects (set as 2). The interactions, which are not calculated from `eesalary`, are predicted with only fixed effects (set as 1). ```{r fourthstep26} pred1b["eesalary", c("avescore", "aveeeworkexp", "erexp")] <- 1 pred1b["eesalary", c("diffscore", "diffeeworkexp")] <- 2 pred1b["eesalary", c("int1", "int2")] <- 1 ``` Fifth, let's change the `eeworkexp` row. The level-2 predictors (excluding `aveeeworkexp`) are predicted using only fixed effects (set as 1). The level-1 deviation scores from group means (excluding `diffeeworkexp`) are predicted with fixed and random effects (set as 2). The interactions are not used because the interactions are calculated from `eeworkexp.` ```{r fourthstep27} pred1b["eeworkexp", c("avescore", "aveeesalary", "erexp")] <- 1 pred1b["eeworkexp", c("diffscore", "diffeesalary")] <- 2 ``` Sixth, let's change the `erexp` row. Because this variable is in Level 2, only other level-2 predictors can predict its missing data. ```{r fourthstep28} pred1b["erexp", c("avescore", "aveeesalary", "aveeeworkexp")] <- 1 ``` Finally, group means are calculated from their original scales. In the prediction matrix, set the corresponding original scales as 2. ```{r fourthstep29} pred1b["avescore", "score"] <- 2 pred1b["aveeesalary", "eesalary"] <- 2 pred1b["aveeeworkexp", "eeworkexp"] <- 2 ``` ##### Step 5: Determine the Sequence of Imputation When a variable is imputed, the variables based on the calculation of the imputed variable must be calculated right after the imputation. For example, if the original scale of `eeworkexp` is imputed, the group means (`aveeeworkexp`), deviations from group means (`diffeeworkexp`), and interactions (`int1` and `int2`) must be imputed right after it. Thus, the sequence of imputation is as follows: ```{r fifthstep2} visit1b <- c("score", "avescore", "diffscore", "eesalary", "aveeesalary", "diffeesalary", "erexp", "int1", "int2", "eeworkexp", "aveeeworkexp", "diffeeworkexp", "int1", "int2") ``` ##### Step 6: Imputation The `mice` function is used to impute data to multiple sets. The additional arguments from the previous example are `visit` (the vector determining the imputation sequence) and `allow.na`, set as `TRUE` to allow variables that all members are `NA`. ```{r sixthstep2, eval=FALSE} imp1b <- mice(dat1b, pred=pred1b, meth=meth1b, m=10, maxit=10, visit=visit1b, seed=123321, allow.na = TRUE) ``` ##### Step 7: Check Convergence Check whether the means and standard deviations of the imputed variables are converged. Note that if there are many imputed variables, users may focus on the variables that are important in your main analysis and have high propotions of missing data. ```{r seventhstep2} plot(imp1b, c("score", "eesalary")) plot(imp1b, c("eeworkexp", "erexp")) ``` From this plot, it is hard to assume convergence. Thus, `m` and `maxit` are increased. ```{r sixthstep22, eval=FALSE} imp1b2 <- mice(dat1b, pred=pred1b, meth=meth1b, m=30, maxit=20, visit=visit1b, seed=123321, allow.na = TRUE) ``` Check the convergence again. The convergence can be assumed. ```{r seventhstep22} plot(imp1b2, c("score", "eesalary")) plot(imp1b2, c("eeworkexp", "erexp")) ``` ##### Step 8: Transform to the original scale The means and standard deviations from Step 1 are used to transform the imputed variables back to the original scales. ```{r eigthstep2} impdat1b <- complete(imp1b2, "long", include = TRUE) impdat1b$score <- sd_score1b*impdat1b$score + m_score1b impdat1b$eesalary <- sd_eesalary1b*impdat1b$eesalary + m_eesalary1b impdat1b$eeworkexp <- sd_eeworkexp1b*impdat1b$eeworkexp + m_eeworkexp1b impdat1b$erexp <- sd_erexp1b*impdat1b$erexp + m_erexp1b imp1b2a <- as.mids(impdat1b) ``` ##### Step 9: Analyze Each Imputed Data The `with` function means that analyze each data set shown in the first argument by the second argument. This example will compare the fixed effect of the interaction, `eeworkexp*erexp`. ```{r ninthstep2} library(lme4) fit1b1 <- with(imp1b2a, lmer(score ~ eesalary + eeworkexp + erexp + (1|erid), REML=FALSE)) fit1b2 <- with(imp1b2a, lmer(score ~ eesalary + eeworkexp + erexp + eeworkexp*erexp + (1|erid), REML=FALSE)) ``` ##### Step 10: Pool Analysis Results First, pool the fixed effects. ```{r tenthstep2} out1b1 <- pool(fit1b1) summary(out1b1) out1b2 <- pool(fit1b2) summary(out1b2) ``` Next, pool the covariance between random effects. ```{r tenthstep22} source("mimlmtools.R") ranefMI(fit1b1) ranefMI(fit1b2) ``` The `sigmaMI` function is used to find the level-1 residual variance. ```{r tenthstep23} sigmaMI(fit1b1) sigmaMI(fit1b2) ``` The `anovaMI` function is used to pool the likelihood ratio tests from each imputed data set. ```{r tenthstep24} anovaMI(fit1b1, fit1b2) ``` ## Example 3: Predicting Students' Self-Efficacy from Math Achievement Read the data set of students nested in schools. The self-efficacy is predicted by math achievement. ```{r readdata3} dat2 <- read.table("lecture14ex2.csv", sep=",", header=TRUE, na.strings="999999") ``` Next, check the missing patterns in the dataset. ```{r missingpattern3} md.pattern(dat2) ``` ##### Step 1: Transform Variable to Reduce Computational Burden All target variables are subtracted by means and divided by standard deviations. Note that all transformed variables are in the student level. ```{r firststep3} m_efficacy2 <- mean(dat2$efficacy, na.rm=TRUE) m_ach2 <- mean(dat2$ach, na.rm=TRUE) sd_efficacy2 <- sd(dat2$efficacy, na.rm=TRUE) sd_ach2 <- sd(dat2$ach, na.rm=TRUE) dat2$efficacy <- (dat2$efficacy - m_efficacy2) / sd_efficacy2 dat2$ach <- (dat2$ach - m_ach2) / sd_ach2 ``` ##### Step 2: Group Mean Centering Create the blank group means and deviations from group means for imputation. ```{r secondstep3} dat2$aveach <- NA dat2$aveefficacy <- NA dat2$diffach <- NA dat2$diffefficacy <- NA ``` ##### Step 3: Interaction Assume that the research hypothesis expected to have the cross-level interaction effect of school means and deviation from school means of math achievement on self-efficacy. The imputation model must address that cross-level interaction by putting the cross-level interaction effect in both direction: `ach` on `efficacy` and `efficacy` on `ach`. Create the blank interaction variables for imputation. ```{r thirdstep3} dat2$intach <- NA # aveach*diffach dat2$intefficacy <- NA # aveefficacy*diffefficacy ``` ##### Step 4: Make Imputation Method Vector and Prediction Matrix First, make the blueprints of the imputation method vector and predictor matrix. ```{r fourthstep3} meth2 <- make.method(dat2) meth2 pred2 <- make.predictorMatrix(dat2) pred2 ``` The imputation method vector is set as follows: - `efficacy` and `ach` are predicted by `2l.pan` because they are level-1 continuous variables. - `aveefficacy` and `aveach` are the group means calculated from `efficacy` and `ach` respectively. The `2l.groupmean` is used. - `diffach` and `diffefficacy` are all deviation scores from the group means. Use `~I()` to calculate them. - `intach` and `intefficacy` are the interactions between variables in the data set. Again, `~I()` is used and the calculation is put inside it. ```{r fourthstep32} meth2[c("efficacy", "ach")] <- "2l.pan" meth2[c("aveefficacy", "aveach")] <- "2l.groupmean" meth2["diffach"] <- "~I(ach - aveach)" meth2["diffefficacy"] <- "~I(efficacy - aveefficacy)" meth2["intach"] <- "~I(aveach*diffach)" meth2["intefficacy"] <- "~I(aveefficacy*diffefficacy)" ``` Next, the prediction matrix is adjusted. All elements in a row are 0 if the imputation method matrix is `""` or `~I()`. Thus, these following rows are adjusted: `ach`, `efficacy`, `aveach`, and `aveefficacy`. First, set all members in the prediction matrix as 0. ```{r fourthstep33} pred1b[,] <- 0 ``` Second, in these adjusted rows, set the column of `erid`, which is level-2 ID, as -2. ```{r fourthstep34} pred2[c("ach", "efficacy", "aveach", "aveefficacy"), "schoolid"] <- -2 ``` Third, let's change the `efficacy` row. The level-2 predictors (excluding `aveefficacy`) are predicted using only fixed effects (set as 1). The level-1 deviation scores from group means (excluding `diffefficacy`) are predicted with fixed and random effects (set as 2). The interactions, which are not calculated from `efficacy`, are predicted with only fixed effects (set as 1). ```{r fourthstep35} pred2["efficacy", c("aveach", "diffach", "intach", "private")] <- c(1, 2, 1, 1) ``` Fourth, let's change the `ach` row. The level-2 predictors (excluding `aveach`) are predicted using only fixed effects (set as 1). The level-1 deviation scores from group means (excluding `diffach`) are predicted with fixed and random effects (set as 2). The interactions, which are not calculated from `ach`, are predicted with only fixed effects (set as 1). ```{r fourthstep36} pred2["ach", c("aveefficacy", "diffefficacy", "intefficacy", "private")] <- c(1, 2, 1, 1) ``` Finally, group means are calculated from their original scales. In the prediction matrix, set the corresponding original scales as 2. ```{r fourthstep37} pred2["aveach", "ach"] <- 2 pred2["aveefficacy", "efficacy"] <- 2 ``` ##### Step 5: Determine the Sequence of Imputation When a variable is imputed, the variables based on the calculation of the imputed variable must be calculated right after the imputation. The sequence of imputation is as follows: ```{r fifthstep3} visit2 <- c("efficacy", "aveefficacy", "diffefficacy", "intefficacy", "ach", "aveach", "diffach", "intach") ``` ##### Step 6: Imputation Use the `mice` function to impute data to multiple sets. ```{r sixthstep3, eval=FALSE} imp2 <- mice(dat2, pred=pred2, meth=meth2, m=30, maxit=30, visit=visit2, seed=123321, allow.na = TRUE) ``` ##### Step 7: Check Convergence Check whether the means and standard deviations of the imputed variables are converged. ```{r seventhstep3} plot(imp2, c("ach", "efficacy")) ``` ##### Step 8: Transform to the original scale The means and standard deviations from Step 1 are used to transform the imputed variables back to the original scales. ```{r eigthstep3} impdat2 <- complete(imp2, "long", include = TRUE) impdat2$efficacy <- sd_efficacy2*impdat2$efficacy + m_efficacy2 impdat2$ach <- sd_ach2*impdat2$ach + m_ach2 imp22 <- as.mids(impdat2) ``` ##### Step 9: Analyze Each Imputed Data The `with` function means that analyze each data set shown in the first argument by the second argument. This example will compare the fixed effect of the interaction, `diffach*aveach`. ```{r ninthstep3} library(lme4) fit21 <- with(imp22, lmer(efficacy ~ diffach + aveach + private + (1 + diffach|schoolid), REML=FALSE, control = lmerControl(optimizer ="Nelder_Mead"))) fit22 <- with(imp22, lmer(efficacy ~ diffach*aveach + private + (1 + diffach|schoolid), REML=FALSE, control = lmerControl(optimizer ="Nelder_Mead"))) ``` ##### Step 10: Pool Analysis Results First, pool the fixed effects. ```{r tenthstep3} out21 <- pool(fit21) summary(out21) out22 <- pool(fit22) summary(out22) ``` Next, pool the covariance between random effects. ```{r tenthstep32} source("mimlmtools.R") ranefMI(fit21) ranefMI(fit22) ``` The `sigmaMI` function is used to find the level-1 residual variance. ```{r tenthstep33} sigmaMI(fit21) sigmaMI(fit22) ``` The `anovaMI` function is used to pool the likelihood ratio tests from each imputed data set. ```{r tenthstep34} anovaMI(fit21, fit22) ``` ## Example 4: Changes in Sense of Belonging in Undergraduates Read the data set of years nested in undergraduate students. Sense of belonging is the dependent variable. The missing patterns reflect many students dropped out from the program. ```{r readdata4} dat3 <- read.table("lecture14ex3.csv", sep=",", header=TRUE, na.strings="999999") ``` Next, check the missing patterns in the dataset. ```{r missingpattern4} md.pattern(dat3) ``` Check the proportion of missing at each time point. ```{r timemissingcalc4} aggregate(stress ~ time, data = dat3, FUN = function(x) sum(is.na(x)), na.action = NULL) ``` ##### Step 1: Transform Variable to Reduce Computational Burden All target variables are subtracted by means and divided by standard deviations. Let's start with level-1 (time-varying) variables. ```{r firststep4} m_mem3 <- mean(dat3$mem, na.rm=TRUE) m_stress3 <- mean(dat3$stress, na.rm=TRUE) sd_mem3 <- sd(dat3$mem, na.rm=TRUE) sd_stress3 <- sd(dat3$stress, na.rm=TRUE) dat3$mem <- (dat3$mem - m_mem3) / sd_mem3 dat3$stress <- (dat3$stress - m_stress3) / sd_stress3 ``` For level-2 (time-invariant) variables, new data which rows represent individuals are created before finding means and standard deviations. ```{r firststep42} dat3g <- dat3[dat3$time == 1,] m_act3 <- mean(dat3g$act, na.rm=TRUE) m_ext3 <- mean(dat3g$ext, na.rm=TRUE) sd_act3 <- sd(dat3g$act, na.rm=TRUE) sd_ext3 <- sd(dat3g$ext, na.rm=TRUE) dat3$act <- (dat3$act - m_act3) / sd_act3 dat3$ext <- (dat3$ext - m_ext3) / sd_ext3 ``` ##### Step 2: Centering First, the time variable is centered at the first time point. ```{r secondstep4} dat3$time <- dat3$time - 1 ``` Create the blank students' means and deviations from students' means for imputation. ```{r secondstep42} dat3$avemem <- NA dat3$avestress <- NA dat3$diffmem <- NA dat3$diffstress <- NA ``` ##### Step 3: Interaction The interaction terms in this example are based on time multipled with other variables, which can be separated into two sets. First, `time` is multipled with other variables that do not have missing values. These interactions do not need to be imputed because all values are observed. The interaction variables are simply calculated and attached into the data sets. ```{r thirdstep4} dat3$intact <- dat3$time * dat3$act dat3$intext <- dat3$time * dat3$ext dat3$intcohort <- dat3$time * dat3$cohort dat3$intfemale <- dat3$time * dat3$female ``` Second, `time` is multipled with variables with missing values. In this case, create the blank interaction variables for imputation. ```{r thirdstep42} dat3$intmem1 <- NA # time*avemem dat3$intmem2 <- NA # time*diffmem dat3$intstress1 <- NA # time*avestress dat3$intstress2 <- NA # time*diffstress ``` ##### Step 4: Make Imputation Method Vector and Prediction Matrix First, make the blueprints of the imputation method vector and predictor matrix. ```{r fourthstep4} pred3 <- make.predictorMatrix(dat3) pred3 meth3 <- make.method(dat3) meth3 ``` The imputation method vector is set as follows: - `mem` and `stress` are predicted by `2l.pan` because they are level-1 continuous variables. - `avemem` and `avestress` are the group means calculated from `mem` and `stress` respectively. The `2l.groupmean` is used. - `diffmem` and `diffstress` are all deviation scores from the group means. Use `~I()` to calculate them. - `intmem1`, `intmem2`, `intstress1`, and `intstress2` are the interactions between variables in the data set. Again, `~I()` is used and the calculation is put inside it. ```{r fourthstep42} meth3[c("mem", "stress")] <- "2l.pan" meth3[c("avemem", "avestress")] <- "2l.groupmean" meth3["diffmem"] <- "~I(mem - avemem)" meth3["diffstress"] <- "~I(stress - avestress)" meth3["intmem1"] <- "~I(time*avemem)" meth3["intmem2"] <- "~I(time*diffmem)" meth3["intstress1"] <- "~I(time*avestress)" meth3["intstress2"] <- "~I(time*diffstress)" ``` Next, the prediction matrix is adjusted. All elements in a row are 0 if the imputation method matrix is `""` or `~I()`. Thus, these following rows are adjusted: `mem`, `stress`, `avemem`, and `avestress`. First, set all members in the prediction matrix as 0. ```{r fourthstep43} pred3[,] <- 0 ``` Second, in these adjusted rows, set the column of `erid`, which is level-2 ID, as -2. ```{r fourthstep44} pred3[c("mem", "stress", "avemem", "avestress"), "pid"] <- -2 ``` Third, let's change the `mem` row. The level-2 predictors (excluding `avemem`) are predicted using only fixed effects (set as 1). The level-1 deviation scores from group means (excluding `diffmem`) are predicted with fixed and random effects (set as 2). The interactions, which are not calculated from `mem`, are predicted with only fixed effects (set as 1). ```{r fourthstep45} pred3["mem", c("time", "diffstress")] <- 2 pred3["mem", c("cohort", "female", "act", "ext", "avestress")] <- 1 pred3["mem", c("intstress1", "intstress2", "intact", "intext", "intcohort", "intfemale")] <- 1 ``` Fourth, let's change the `stress` row. The level-2 predictors (excluding `avestress`) are predicted using only fixed effects (set as 1). The level-1 deviation scores from group means (excluding `diffstress`) are predicted with fixed and random effects (set as 2). The interactions, which are not calculated from `stress`, are predicted with only fixed effects (set as 1). ```{r fourthstep46} pred3["stress", c("time", "diffmem")] <- 2 pred3["stress", c("cohort", "female", "act", "ext", "avemem")] <- 1 pred3["stress", c("intmem1", "intmem2", "intact", "intext", "intcohort", "intfemale")] <- 1 ``` Finally, group means are calculated from their original scales. In the prediction matrix, set the corresponding original scales as 2. ```{r fourthstep47} pred3["avemem", "mem"] <- 2 pred3["avestress", "stress"] <- 2 ``` ##### Step 5: Determine the Sequence of Imputation When a variable is imputed, the variables based on the calculation of the imputed variable must be calculated right after the imputation. The sequence of imputation is as follows: ```{r fifthstep4} visit3 <- c("mem", "avemem", "diffmem", "intmem1", "intmem2", "stress", "avestress", "diffstress", "intstress1", "intstress2") ``` ##### Step 6: Imputation Use the `mice` function to impute data to multiple sets. ```{r sixthstep4, eval=FALSE} imp3 <- mice(dat3, pred=pred3, meth=meth3, m=30, maxit=20, visit=visit3, seed=123321, allow.na = TRUE) ``` ##### Step 7: Check Convergence Check whether the means and standard deviations of the imputed variables are converged. ```{r seventhstep4} plot(imp3, c("mem", "stress")) ``` ##### Step 8: Transform to the original scale The means and standard deviations from Step 1 are used to transform the imputed variables back to the original scales. ```{r eigthstep4} impdat3 <- complete(imp3, "long", include = TRUE) impdat3$mem <- sd_mem3*impdat3$mem + m_mem3 impdat3$stress <- sd_stress3*impdat3$stress + m_stress3 impdat3$act <- sd_act3*impdat3$act + m_act3 impdat3$ext <- sd_ext3*impdat3$ext + m_ext3 imp32 <- as.mids(impdat3) ``` ##### Step 9: Analyze Each Imputed Data The `with` function means that analyze each data set shown in the first argument by the second argument. This example will compare the random effect of `time`. ```{r ninthstep4} library(lme4) fit31 <- with(imp32, lmer(mem ~ time + (1|pid), REML=FALSE, control = lmerControl(optimizer ="Nelder_Mead"))) fit32 <- with(imp32, lmer(mem ~ time + (1 + time|pid), REML=FALSE, control = lmerControl(optimizer ="Nelder_Mead"))) ``` ##### Step 10: Pool Analysis Results First, pool the fixed effects. ```{r tenthstep4} out31 <- pool(fit31) summary(out31) out32 <- pool(fit32) summary(out32) ``` Next, pool the covariance between random effects. ```{r tenthstep42} source("mimlmtools.R") ranefMI(fit31) ranefMI(fit32) ``` The `sigmaMI` function is used to find the level-1 residual variance. ```{r tenthstep43} sigmaMI(fit31) sigmaMI(fit32) ``` The `anovaMI` function is used to pool the likelihood ratio tests from each imputed data set. ```{r tenthstep44} anovaMI(fit31, fit32) ``` ## Example 5: Changes in Sense of Belonging in Undergraduates (Imptation with Wide-Format Data) Let's read the data set again to remove additional variables in the previous example. ```{r readdata5} dat3 <- read.table("lecture14ex3.csv", sep=",", header=TRUE, na.strings="999999") ``` Remove level-1 ID variable. ```{r removeid5} dat3 <- dat3[,-1] ``` Check the proportion of missing at each time point. ```{r describe5} library(psych) describe(dat3) ``` ##### Step 1: Transform Data from Long to Wide Format Change data from long to wide format by the `reshape` function. In the long format, rows represent level-1 ID. In contrast, in the wide format, rows represent level-2 ID. A level-1 variable in different time points will be shown as multiple variables in the wide format. For example, `stress` is the level-1 variable in the long format and this data has four time points. In the wide format, each row will have `stress1`, `stress2`, `stress3`, and `stress4`. The arguments of the `reshape` function are as follows: - `data` is the target data set in the long format - `idvar` is the level-2 ID variable - `timevar` is the time variable - `v.names` is a vector of time-varying variables - `direction` is the target data set format, which is wide ```{r firststep5} dat3w <- reshape(data=dat3, idvar="pid", timevar="time", v.names=c("mem", "stress"), direction="wide") ``` Next, the missing patterns can be checked from the wide format. ```{r firststep52} md.pattern(dat3w) ``` Note that the transformation to the standard scale can be done but not necessary. The imputation in multilevel data structure is computational intensive so the transformation helps reduce computation time. However, the imputation in single level data structure is much quicker. The transformation is not necessary. Further, the group mean centering is not necessary. In the wide format, the group means are shown in the average level of time-varying variables and the deviation from group means are shown in the variation across the values in time varying variables in each row. The time-varying variable at all time points will predict other variables simultaneously so the group means and their deviation from group means have been accounted. The interaction of time and other variables are not needed as well because all time-varying variables are separated to different variables for different time points. The predictions in a variable in Time 1 and the same variable in Time 2 are based on different equations. Therefore, the interaction with time has been accounted already. ##### Step 2: Make Imputation Method Vector and Prediction Matrix First, make the blueprints of the imputation method vector and predictor matrix. ```{r secondstep5} pred3w <- make.predictorMatrix(dat3w) pred3w meth3w <- make.method(dat3w) meth3w ``` In the imputation method vector, all variables with missing values are set as `pmm` which is predictive mean matching. This is suitable for both binary and continuous variables. ```{r secondstep52} meth3w[c("mem.2", "mem.3", "mem.4", "stress.2", "stress.3", "stress.4")] <- "pmm" ``` Next, the prediction matrix is adjusted. All variables with missing values are predicted by other variables except ID and themselves. Note that the imputation in wide format is better than long format because the missing values in time-varying variables in one time point can be predicted by the same variable in other time points. Thus, if possible, the imputation in the wide format is recommended. ```{r secondstep53} pred3w[,] <- 0 pred3w[c("mem.2", "mem.3", "mem.4", "stress.2", "stress.3", "stress.4"), ] <- 1 pred3w[, "pid"] <- 0 diag(pred3w) <- 0 ``` Note that extra variables (e.g., group means or interactions) are not created. The sequence of imputation will be not adjusted. ##### Step 3: Imputation Use the `mice` function to impute data to multiple sets. ```{r thirdstep5, eval=FALSE} imp3w <- mice(dat3w, pred=pred3w, meth=meth3w, m=30, maxit=20, seed=123321) ``` ##### Step 4: Check Convergence Check whether the means and standard deviations of the imputed variables are converged. ```{r fourthstep5} plot(imp3w, c("mem.2", "mem.3")) plot(imp3w, c("mem.4", "stress.2")) plot(imp3w, c("stress.3", "stress.4")) ``` ##### Step 5: Transform Data from Wide to Long Format First, change the result of the imputation to stacked data set. ```{r fifthstep5} impdat3w <- complete(imp3w, "long", include = TRUE) ``` Next, list the same time-varying variables with different time points in the same vector. Then, combine them in a list. ```{r fifthstep52} mem <- c("mem.1", "mem.2", "mem.3", "mem.4") stress <- c("stress.1", "stress.2", "stress.3", "stress.4") timevarying <- list(mem, stress) ``` Next, create a variable representing the ID of rows in the stack data set. When changing to the long format, this ID variable will be used similar to level 2 IDs but taking unique values across imputed data set. If the stacked wide data set has 26350 rows, the stacked long will have $26350 \times 4 = 105400$ rows. This new variable will be only used in the reshaping process. ```{r fifthstep53} impdat3w$id2 <- 1:nrow(impdat3w) ``` Next, the reshape function is used again to change from the wide format to the long format. The arguments are as follows: - `data` is the target data set in the wide format - `idvar` is the level-2 ID variable. In this case, it is simply the variable that have different unique values across rows. The original level-2 variable cannot be used because it is duplicated across imputed data set. - `times` is the values of time used in the long data format. Users may simply use 1, 2, 3, and 4 to represent the school years. In this example, 0, 1, 2, and 3 is used as the centered time at the beginning year. - `timevar` is the name of the time variable. `time` is used here. - `varying` is the list of time varying variables created above - `v.names` is the name of each time-varying variable - `direction` is the target data set format, which is long ```{r fifthstep54} dat3l <- reshape(data=impdat3w, idvar="id2", times=0:3, timevar="time", varying = timevarying, v.names=c("mem", "stress"), direction="long") ``` `dat3l` is in the long format. `pid` would be the level-2 ID variable in each imputed data set. However, there is no level-1 ID variable in each imputed data set. Use `ave` with `seq_along` to create level-1 ID variable. ```{r fifthstep55} dat3l$.id <- ave(dat3l$id2, dat3l$.imp, FUN=seq_along) ``` Finally, change to the `mice` result format for further analyses. ```{r fifthstep56} imp3w2 <- as.mids(dat3l) ``` ##### Step 6: Analyze Each Imputed Data The `with` function means that analyze each data set shown in the first argument by the second argument. This example will compare the random effect of `time`. ```{r sixthstep5} library(lme4) fit3w1 <- with(imp3w2, lmer(mem ~ time + (1|pid), REML=FALSE, control = lmerControl(optimizer ="Nelder_Mead"))) fit3w2 <- with(imp3w2, lmer(mem ~ time + (1 + time|pid), REML=FALSE, control = lmerControl(optimizer ="Nelder_Mead"))) ``` ##### Step 7: Pool Analysis Results First, pool the fixed effects. ```{r seventhstep5} out3w1 <- pool(fit3w1) summary(out3w1) out3w2 <- pool(fit3w2) summary(out3w2) ``` Next, pool the covariance between random effects. ```{r seventhstep52} source("mimlmtools.R") ranefMI(fit3w1) ranefMI(fit3w2) ``` The `sigmaMI` function is used to find the level-1 residual variance. ```{r seventhstep53} sigmaMI(fit3w1) sigmaMI(fit3w2) ``` The `anovaMI` function is used to pool the likelihood ratio tests from each imputed data set. ```{r seventhstep54} anovaMI(fit3w1, fit3w2) ```