Estimating statistical power for within-cluster randomized studies
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. We provide a systematic review of generalized estimating equations (GEE), including basic concepts as well as several recent developments motivated by practical challenges in real applications. A brief summary and discussion of potential research interests regarding GEE are provided at the end.
One longitudinal data example can be taken from a study of orthodontic measurements on children including 11 girls and 16 boys.
The response is the measurement of the distance in millimeters from the center of the pituitary to the pterygomaxillary fissure, which is repeatedly measured at ages 8, 10, 12, and 14 years. The primary goal is to investigate whether there exists significant gender difference in dental growth measures and the temporal trend as age increases [ 4 ].
There are two types of approaches, mixed-effect models and GEE [ 6, 7 ], which are traditional and widely used in practice. Of note, these two methods differ in model fitting depending on the study objectives. In particular, the mixed-effect model is an individual-level approach that adopts random effects to capture the correlation between observations of the same subject [ 7 ]. On the other hand, GEE is a population-level approach based on a quasilikelihood function and provides population-averaged estimates of the parameters [ 8 ].
In this paper, we focus on the latter to provide a review and recent developments of GEE. As is well known, GEE has several defining features [ 9–11 ]. The variance-covariance matrix of responses is treated as a nuisance parameter in GEE, and thus model fitting turns out to be easier than for mixed-effect models [ 12 ].
In particular, if the overall treatment effect is of primary interest, GEE is preferred. GEE relaxes the distribution assumption and only requires the correct specification of marginal mean and variance as well as the link function which connects the covariates of interest and marginal means. In addition, the estimation of the correlation coefficients using the moment-based approach is not efficient; thus the correlation matrix may not be a positive definite matrix in certain cases.
Also, Liang and Zeger did not incorporate the constraints on the range of correlation which was restricted by the marginal means because the estimation of the correlation coefficients was simply based on Pearson residuals [ 6 ].
Chaganty and Joe discussed this issue for dependent Bernoulli random variables [ 13 ], and later Sabo and Chaganty gave further explanation [ 14 ]. For example, Sutradhar and Das pointed out that, under misspecification, the correlation coefficient estimates may not converge to the true values [ 15 ].
Furthermore, for discrete random vectors, the correlation matrix was usually complicated, and it was not easy to attain multivariate distributions with specified correlation structures. These limitations have led researchers to work actively in this area to develop novel methodologies. Wang and Carey proposed to estimate the correlation coefficients by differentiating the Cholesky decomposition of the working correlation matrix [ 18 ].
Also, Qu and Lindsay proposed similar Gaussian or quadratic estimating equations [ 19 ]. In particular, for binary longitudinal data, the estimation of the correlation coefficients was proposed based on conditional residuals [ 20–22 ]. Nevertheless, in this paper, the above issues are not discussed in great depth, and we assume that, under mild regularity conditions, the parameter estimates and the within-subject correlation coefficient estimates are consistent.
Thus, this paper focuses mainly on three specific topics, namely model selection, power analysis, and the issue of informative cluster size, and the recent developments are reviewed in the following sections.
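Let $Y_i = (Y_{i1}, \ldots, Y_{in_i})'$ denote the response vector for the $i$th subject ($i = 1, \ldots, K$) with the mean vector noted by $\mu_i = (\mu_{i1}, \ldots, \mu_{in_i})'$, where $\mu_{ij}$ is the corresponding $j$th mean. The marginal model specifies that a relationship between $\mu_{ij}$ and the covariates $X_{ij}$ is written as follows: $g(\mu_{ij}) = X_{ij}'\beta$, where $g$ is a known link function and $\beta$ is the vector of regression coefficients. The conditional variance of $Y_{ij}$ given $X_{ij}$ is specified as $\mathrm{Var}(Y_{ij} \mid X_{ij}) = \nu(\mu_{ij})\phi$, where $\nu$ is a known variance function of $\mu_{ij}$ and $\phi$ is a scale parameter which may need to be estimated. Mostly, $\nu$ and $\phi$ depend on the distributions of outcomes. For instance, if $Y_{ij}$ is continuous, $\nu(\mu_{ij})$ is specified as 1, and $\phi$ represents the error variance; if $Y_{ij}$ is count, $\nu(\mu_{ij}) = \mu_{ij}$, and $\phi$ is equal to 1.
Note that the iterative algorithm is applied for estimating the correlation parameters $\alpha$ using the Pearson residuals $e_{ij} = (y_{ij} - \mu_{ij})/\sqrt{\nu(\mu_{ij})}$ calculated from the current value of $\beta$. Also, the scale parameter $\phi$ can be estimated by $\hat{\phi} = \frac{1}{N - p}\sum_{i=1}^{K}\sum_{j=1}^{n_i} e_{ij}^2$, where $N = \sum_{i=1}^{K} n_i$ is the total number of observations and $p$ is the covariate dimensionality. Under mild regularity conditions, $\hat{\beta}$ is asymptotically normally distributed with mean equal to the true value of $\beta$ and a covariance matrix estimated based on the sandwich estimator $V_{LZ} = \left(\sum_{i=1}^{K} D_i' V_i^{-1} D_i\right)^{-1} M_{LZ} \left(\sum_{i=1}^{K} D_i' V_i^{-1} D_i\right)^{-1}$ with $M_{LZ} = \sum_{i=1}^{K} D_i' V_i^{-1} \mathrm{Cov}(Y_i) V_i^{-1} D_i$ by replacing $\alpha$, $\beta$, and $\phi$ with their consistent estimates, where $D_i = \partial\mu_i/\partial\beta'$, $V_i$ is the working variance-covariance matrix of $Y_i$, and $\mathrm{Cov}(Y_i) = r_i r_i'$ with $r_i = Y_i - \mu_i$ is an estimator of the variance-covariance matrix of $Y_i$ [ 6, 23 ].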
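To make the difference between the sandwich and model-based variance estimators concrete, the following is a minimal sketch for the simplest possible case: an intercept-only marginal model with identity link, $\nu(\mu) = 1$, and an independence working correlation. These simplifying assumptions, the function name `gee_intercept_only`, and the toy data are illustrative and not taken from the reviewed literature. In this special case the GEE estimate is the overall mean, the model-based variance is $\hat{\phi}/N$, and the sandwich variance is $\sum_i (\sum_j r_{ij})^2 / N^2$.

```python
from statistics import mean


def gee_intercept_only(clusters):
    """Intercept-only GEE sketch: identity link, nu(mu) = 1,
    independence working correlation (illustrative assumptions)."""
    all_obs = [y for cluster in clusters for y in cluster]
    n_total = len(all_obs)  # N, total number of observations
    p = 1                   # one regression parameter (the intercept)

    # With D_i a vector of ones and V_i = phi * I, the estimating
    # equation sum_i D_i' V_i^{-1} (Y_i - mu_i) = 0 gives the overall mean.
    beta_hat = mean(all_obs)

    # Residuals r_ij = y_ij - mu_ij (equal to Pearson residuals here).
    resid = [[y - beta_hat for y in cluster] for cluster in clusters]

    # Scale parameter: sum of squared Pearson residuals over N - p.
    phi_hat = sum(r * r for c in resid for r in c) / (n_total - p)

    # Model-based variance: (sum_i D_i' V_i^{-1} D_i)^{-1} = phi / N.
    model_based = phi_hat / n_total

    # Sandwich variance: phi cancels, leaving sum_i (sum_j r_ij)^2 / N^2.
    sandwich = sum(sum(c) ** 2 for c in resid) / n_total ** 2

    return beta_hat, model_based, sandwich
```

With positively correlated observations within clusters, the sandwich variance from this sketch is typically larger than the model-based one, which reflects the within-cluster correlation that the independence working structure ignores.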
Note that if the working variance-covariance matrix $V_i$ is correctly specified, then the sandwich estimator $V_{LZ}$ reduces to $\left(\sum_{i=1}^{K} D_i' V_i^{-1} D_i\right)^{-1}$, which is often referred to as the model-based variance estimator [ 24 ]. Thus, a Wald $Z$-test can be performed based on the asymptotic normal distribution of the test statistic.
In this section, we discuss the model selection criteria available for GEE. There are several reasons why model selection for GEE models is important and necessary: GEE has gained increasing attention in biomedical studies, which may include a large group of predictors [ 25–28 ].
Therefore, how to select the intrasubject correlation matrix plays a vital role in GEE with improved finite-sample performance; the variance function is another potential factor affecting the goodness-of-fit of GEE [ 25, 30 ].
A correctly specified variance function can assist in the selection of covariates and an appropriate correlation structure [ 31, 32 ]. The statistic is defined by where and respectively. Thus, and can all be used for correlation structure selection. In their work, they also showed that criterion held better performance than via simulation. Because GEE is not likelihood-based, the criterion is called the quasi-likelihood under the independence model criterion (QIC) [ 40 ].
One limitation of this criterion is that it cannot penalize overparameterization; thus it does not perform well when comparing two correlation structures with quite different numbers of correlation parameters. Another attractive criterion is the extended quasilikelihood information criterion (EQIC) proposed by Wang and Hin [ 25 ], which uses the extended quasilikelihood (EQL) defined by Nelder and Pregibon based on the deviance function under the independent correlation structure [ 44 ]. EQIC is then defined from the EQL, with some adjustments applied by adding a small constant.
Besides those criteria mentioned above, Cantoni et al. Overall, the model selection of GEE is nontrivial, where the best selection criterion is still being pursued [ 56 ], and the recent work by Wang et al.
It is well known that the calculation of sample size and power is necessary and important for planning a clinical trial, and it has been well studied for independent observations [ 1 ].
For example, in a study with one parameter of interest $\beta$, the hypothesis of interest can be formulated as $H_0: \beta = 0$ versus $H_1: \beta = b \neq 0$, where $b$ is the expected value. Thus, based on a two-sided $Z$-test with type I error $\alpha$, the power denoted by $1 - \gamma$ can be obtained by $1 - \gamma = \Phi\left(-z_{1-\alpha/2} + \sqrt{n}\,|b|/\sqrt{\nu_R}\right)$ (16), where $\Phi$ is the standard normal distribution function, $z_{1-\alpha/2}$ is its $(1-\alpha/2)$ quantile, $n$ is the sample size, and $\nu_R$ is the robust variance estimator corresponding to $\beta$ in the sandwich variance estimate of $\sqrt{n}\hat{\beta}$.
Accordingly, the sample size is given by $n = \nu_R \left(z_{1-\alpha/2} + z_{1-\gamma}\right)^2 / b^2$ (17). For correlated continuous data, the calculation is straightforward using (16); however, for correlated binary data, more work is needed [ 60 ]. Pan provided explicit formulas for $\nu_R$ under various situations, and the detailed formulas of $\nu_R$ under several important special cases are given in [ 61 ]. These formulas can be directly used in practice and cover most situations encountered in clinical trials [ 61 ].
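As a hedged illustration of the Wald-test power and sample size calculation described above, the sketch below assumes the standard normal-approximation form, power $= \Phi(-z_{1-\alpha/2} + \sqrt{n}\,|b|/\sqrt{\nu_R})$ and $n = \nu_R (z_{1-\alpha/2} + z_{1-\gamma})^2 / b^2$, with the variance term `nu_r` supplied by the user. The function names and the numeric inputs are hypothetical; the explicit formulas for the variance term under specific correlation structures are not reproduced here.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def wald_power(n, b, nu_r, alpha=0.05):
    """Approximate power of a two-sided Wald Z-test with n subjects,
    effect size b, and robust variance term nu_r (assumed known)."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(-z_crit + abs(b) * sqrt(n) / sqrt(nu_r))


def wald_sample_size(b, nu_r, alpha=0.05, power=0.80):
    """Smallest integer n reaching the target power under the same
    normal approximation (rounded up)."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(nu_r * (z_crit + z_power) ** 2 / b ** 2)
```

For instance, `wald_sample_size(b=0.5, nu_r=4.0)` gives the number of subjects needed for 80% power at a two-sided 5% level under these assumed inputs, and `wald_power` can then be used to check the power achieved at that sample size.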
Note that, under certain conditions, Liu and Liang provided a sample size formula that differs from (17). The difference is due to the test methods: the Wald $Z$-test used by Pan [ 61 ] and the score test applied by Liu and Liang [ 58 ]. Note that, in some cases, the score test may be preferred [ 62 ]. On the other hand, there are several concerns [ 68 ]. First, we focus here on calculating the number of subjects (clusters), assuming the number of observations per cluster is known; however, by the power formula (16), the variance term depends on the cluster size, and thus increasing the cluster size can also improve power, although it turns out to be less effective than increasing the number of clusters [ 69 ].
For example, a literature review of published cluster randomized trials (CRTs) showed that the median number of clusters is 21 [ 70 ]. In such situations, a power formula adjusted for small samples in GEE is necessary, which has drawn attention from researchers recently [ 71–75 ]. The application of GEE to clustered data with informative cluster size is another special topic [ 76 ]. Taking a periodontal disease study as an example, the number of teeth for each patient may be related to the overall oral health of the individual; in other words, the worse the oral health, the fewer teeth the patient has. Thus, cluster size may influence the distribution of the oral outcomes, which is called informative cluster size [ 45, 77 ].
Such issues commonly occur in biomedical studies. Note that if the maximum cluster size exists and is known, then this can be treated as an informative missing data problem, which can be solved via the weighted estimating equations proposed by Robins et al.
The basic idea is that, for each of the resampled replicate data sets generated by a Monte Carlo method (where the number of replicates is a large number), i. The details are shown as follows: Alternatively, the approach considered by Williamson et al. The estimating equation is where is defined the same as above, but what is different is that the subscript ranges from 1 to, not restricted by the index. Note that as, converges to its expected estimating function and is asymptotically equivalent to.
This method was also explored or extended for correlated data with nonignorable cluster size by Benhin et al. Furthermore, a more efficient method called modified WCR (MWCR) was proposed by Chiang and Lee, in which a number of subjects equal to the minimum cluster size is randomly sampled from each cluster, and GEE models for balanced data are then applied for estimation by incorporating the intracluster correlation [ 84 ].
In addition, Wang et al. Examples include health studies of subjects from multiple hospitals or families. Comparing GEE, WCR, and CWGEE through extensive simulation studies and a real data application to a periodontal disease study, the author recommended CWGEE because its performance was comparable with that of WCR, with well-preserved coverage rates and desirable power properties, without requiring intensive Monte Carlo computation, whereas GEE models led to invalid inference due to biased parameter estimates [ 45 ].
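The resampling idea behind WCR can be sketched for the simplest target, a marginal mean, under the assumption that each replicate draws exactly one observation per cluster and that the replicate estimates are averaged. This is a simplified illustration only: the function names are hypothetical, no covariates are modeled, and the variance estimation step of WCR is omitted. For comparison, the mean of the cluster means is also computed; it weights each cluster equally regardless of its size, which is the quantity the resampled average approaches as the number of replicates grows.

```python
import random
from statistics import mean


def wcr_mean(clusters, n_resamples=2000, seed=0):
    """Average, over Monte Carlo replicates, of the mean of one
    randomly chosen observation per cluster (point estimate only)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # One observation per cluster, so the replicate data are independent.
        one_per_cluster = [rng.choice(cluster) for cluster in clusters]
        estimates.append(mean(one_per_cluster))
    return mean(estimates)


def cluster_weighted_mean(clusters):
    """Mean of cluster means: every cluster gets equal weight."""
    return mean([mean(cluster) for cluster in clusters])


def pooled_mean(clusters):
    """Naive mean over all observations: larger clusters dominate."""
    return mean([y for cluster in clusters for y in cluster])
```

When cluster size is related to the outcome, `pooled_mean` and `cluster_weighted_mean` can differ noticeably, while `wcr_mean` tracks the cluster-weighted value at the cost of repeated resampling.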
In addition, for observed-cluster inference, Seaman et al. More work can be found in [ 87–90 ], among others. Two types of outcomes are considered, continuous and count responses. The models for data generation are as follows: The covariates are i. For each scenario, we generate the data based on the underlying true correlation structures as independent, exchangeable, and autoregressive with 0.
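For readers less familiar with the three correlation structures named above, the following sketch builds each one as a plain nested list for a subject with `n` repeated measurements. The function names and the example correlation values are illustrative assumptions and are not the settings used in the simulation.

```python
def independence_corr(n):
    """Independence: ones on the diagonal, zeros elsewhere."""
    return [[1.0 if j == k else 0.0 for k in range(n)] for j in range(n)]


def exchangeable_corr(n, rho):
    """Exchangeable: every pair of measurements shares correlation rho."""
    return [[1.0 if j == k else rho for k in range(n)] for j in range(n)]


def ar1_corr(n, rho):
    """AR-1: correlation decays as rho ** |j - k| with the time lag."""
    return [[rho ** abs(j - k) for k in range(n)] for j in range(n)]
```

The independence structure has no correlation parameter, while the exchangeable and AR-1 structures each have one, which is why criteria that do not penalize the number of correlation parameters can still compare the latter two fairly.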
The partial simulation results are provided in Tables 2, 3, and 4, where the results of CIC are not shown because they are the same as those of QIC. Based on the results, RJ does not perform well for the scenarios with either continuous or binary outcomes, while RJ1 and RJ2 have comparable performances and can identify the true underlying correlation structure in most scenarios, with better performance under large sample size.
QIC is not satisfactory when the true correlation structure is independent but has advantageous performance for the scenarios with the true correlation structure as exchangeable or AR-1. On the other hand, SC and GP do not perform well for longitudinal data with normal responses, but the performance is slightly improved for longitudinal data with binary outcomes. The simulation studies are conducted to provide numerical comparisons among five types of model selection criteria [ 91, 92 ].
Novel methodologies are still needed and are being developed, owing to the increasing usage and potential theoretical constraints of GEE as well as new challenges emerging from practical applications in clinical trials or biomedical studies. Although GEE has attractive features, flexible application, and easy implementation in software, its application in practice should be cautious, depending on the context of the study design or data structure and the goals of research interest.
The author declares that there is no conflict of interests regarding the publication of this paper. The content is solely the responsibility of the author and does not represent the views of the NIH.
Simulation for longitudinal data with independent correlation matrix.
Simulation for longitudinal data with exchangeable correlation matrix.
Simulation for longitudinal data with AR-1 correlation matrix.
Jang, Working correlation selection in generalized estimating equations [Dissertation], University of Iowa.