Close
About
FAQ
Home
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Multilevel Bayesian latent class growth mixture model for longitudinal zero-inflated Poisson data
(USC Thesis Other)
Multilevel Bayesian latent class growth mixture model for longitudinal zero-inflated Poisson data
PDF
Download
Share
Open document
Flip pages
Copy asset link
Request this asset
Transcript (if available)
Content
1
Multilevel Bayesian Latent Class Growth Mixture model for
longitudinal Zero-inflated Poisson data
By
Kejia Wang
A Dissertation presented to the
Faculty of the USC Graduate School
University of Southern California
In partial fulfillment of the requirement
of the degree of Doctor of Philosophy in Biostatistics
May, 2017
2
Acknowledgements
First of all, I would like to sincerely thank my mentor, Dr. Kiros Berhane. This dissertation
would not be possible without his strong support over the past two to three years. Due to his
enthusiasm and advice, I was introduced into Bayesian analysis and multilevel modeling. I truly
appreciate his patience, great guidance.
I would also like to thank my committee, Dr. Juan Pablo Lewinger, Dr. David Conti, Dr. Chih-
Ping Chou and Dr. Donald Lloyd. Their great suggestions and comments have been very helpful
for me to improve my dissertation. I would also like to thank USC TCORS team, it was a great
research experience where I was introduced and gain knowledge on Latent Class analysis.
I’m also very grateful to all my supportive friends, and my family in China. Because of them, I
stay motivated and persistent to achieve my research goal, and never felt thousands of miles
away from home.
3
Table of Contents
Acknowledgements ....................................................................................................................2
List of Figures ............................................................................................................................4
List of Tables .............................................................................................................................5
Chapter 1 Introduction ...............................................................................................................6
1.1 Overview .............................................................................................................................6
1.2 Multi-level LCA ...............................................................................................................12
1.3 Research Motivation .........................................................................................................14
1.4 The Children’s Health Study ............................................................................................15
1.5 The Add Health study .......................................................................................................16
Chapter 2 Current parameter estimation methods and software ..............................................19
2.1 Zero-inflated Poisson LCA ...............................................................................................19
2.2 Parameter Estimation in the frequentist approach ............................................................21
2.3 Currently available software .............................................................................................23
2.4 Limitations of current approaches ....................................................................................28
2.5 Fully Bayesian Approach .................................................................................................29
2.5.1 Gibbs Sampling .......................................................................................................32
2.5.2 Metropolis-Hastings random walk algorithm .........................................................33
2.5.3 Proposed Bayesian ZIP LCGM model ....................................................................34
2.5.4 Label-switching issue ..............................................................................................37
2.5.5 Model fit assessment and comparison ....................................................................39
Chapter 3 Proposed Bayesian LCGM model and Results .......................................................42
3.1 Simulation Studies ............................................................................................................42
3.1.1 The Simulation Process for single-level ZIP LCGM ..............................................42
3.1.2 Simulation procedure for parameter estimation of single-level ZIP LCGM ..........44
3.1.3 Comparison of Gaussian LCGM model using WinBUGS and R ...........................46
3.1.4 Multilevel extension ................................................................................................48
3.2 Simulation results .............................................................................................................50
3.2.1 Simulation results of Gaussian LCGM ...................................................................50
3.2.2 Simulation results for single level ZIP LCGM .......................................................53
3.2.3 Multi-level extension of LCGM .............................................................................58
3.2.3 Label-switching evaluation .....................................................................................61
3.3 Application to Add Health data (Real Data Analysis) ......................................................62
3.4 Application of multilevel LCGM on MACC data (Real Data Analysis) .........................70
Chapter 4 Discussion and Future research ...............................................................................78
Reference .................................................................................................................................82
4
List of Figures
Figure 1 Relationship between group membership and covariates and trajectory............................ 25
Figure 2 Trace plots for posterior class probability estimates for Gaussian LCGM ........................ 51
Figure 3 Trace plots of posterior mean estimates for Gaussian LCGM ......................................... 52
Figure 4 Cross tabulation of true and estimated class membership ...................................................... 56
Figure 5 Estimates for class-specific parameter ϒk for one-level ZIP LCGM (simulation) .......... 56
Figure 6 Estimates for class-specific βk for one-level ZIP LCGM (simulation) ............................... 57
Figure 7 Trace plot for extra zero probability parameter for one-level ZIP LCGM (simulation) 58
Figure 8 Trace plot for Level 2 covariate parameter for multilevel ZIP LCGM (simulation) ...... 59
Figure 9 Trace plot for extra zero prob parameter for multilevel ZIP LCGM (simulation) .......... 60
Figure 10 Trace plot of poisson mean parameters for multilevel ZIP LCGM (simulation) .......... 62
Figure 11 Averaged smoking trajectories for Add Health data by latent classes (3-class) ............ 65
Figure 12 Averaged smoking trajectories for Add Health data by latent classes (2-class) ............ 66
Figure 13 Averaged smoking trajectories for Add Health data by latent classes (4-class) ............ 69
Figure 14 Trace plot of Poisson mean parameter estimates of multilevel ZIP LCGM (MACC) . 75
Figure 15 Trace plot of extra zero probability parameter for multilevel ZIP LCGM (MACC) ... 76
Figure 16 Averaged smoking trajectories for MACC data by posterior latent classes .................... 76
5
List of Tables
Table 1 Wave and number recruited, age and year of recruitment, and grade until high school
graduation, and age of new interview data collections in CHS data ..................................................... 16
Table 2 Descriptive statistics for the Add Health smoking study .......................................................... 18
Table 3 Descriptive statistics for covariates predicting latent class membership ............................. 18
Table 4 The parameter estimates for Gaussian mean (and SD) for Gaussian LCGM ..................... 50
Table 5 The parameter estimates predicting class membership (and SD) for Gaussian LCGM .. 50
Table 6 Comparison of Gaussian LCGM model using R, proc traj and Mplus................................. 53
Table 7 Extra-zero probabilities for each class and each time points ................................................... 53
Table 8 Estimated ϒk (and SD) using Metropolis random walk for class 2 and 3 .......................... 54
Table 9 Estimated βk (and SD) using Metropolis random walk for all 3 classes .............................. 54
Table 10 cross tabulation of true and estimated class membership....................................................... 55
Table 11 The comparison of classification for data that has larger Poisson mean difference ....... 56
Table 12 Extra zero prob estimates (and SD) for ZIP LCGM (simulation) ....................................... 57
Table 13 Extra zero prob estimates (and SD) for multilevel ZIP LCGM (simulation) ................... 60
Table 14 Level 2 covariate estimates (and SD) for multilevel ZIP LCGM (simulation) ............... 60
Table 15 Poisson mean estimates (and SD) for multilevel ZIP LCGM (simulation) ...................... 60
Table 16 Class prob estimates (and SD) for multilevel ZIP LCGM (simulation) ........................ 61
Table 17 Distribution of smoking for all three latent classes (Add Health) ....................................... 63
Table 18 Estimates and 95% HPD interval for Poisson mean parameter (Add Health) ................. 63
Table 19 Estimates, 95% HPD interval, Odds Ratios and 95% CI for OR (Add Health) .............. 63
Table 20 Estimates and 95% HPD interval for extra zero probability parameter (Add Health) .. 63
Table 21 Comparisons of parameter estimates for class probability (proc traj vs Mplus) ............. 65
Table 22 Parameter estimates for 2-class LCGM (Add Health data) ................................................... 66
Table 23 Parameter estimates for 4-class LCGM (Add Health data) ................................................... 67
Table 24 DIC statistics for 2-4 class ZIP LCGM model (Add health data)........................................ 69
Table 25 The summary statistics of MACC smoking data (including missing) ................................ 72
Table 26 The summary statistics of MACC smoking data (without missing) ................................... 73
Table 27 Descriptive statistics for covariates predicting latent class membership ........................... 73
Table 28 Parameter estimates of multilevel ZIP LCGM model (MACC data) ................................. 73
Table 29 Estimated OR and 95% CI of multilevel ZIP LCGM model (MACC data) ..................... 74
6
Chapter 1 Introduction
1.1 Overview
Research studies based on longitudinal designs involve the collection and analysis of data over
time. The longitudinal study design has become very popular as a useful source of information to
describe changing patterns of individual behavior over time. Models to analyze data from such
studies could be generally referred to as growth or trajectory models. The identification of
distinct subgroups (latent classes) within these trajectory models has become an important
methodologic branch with substantive implications for biomedical research. Latent class analysis
(LCA), introduced in 1968 by Lazaesfel and Henry is a popular statistical approach for modeling
potential heterogeneity in data by identifying unobserved class membership among subjects
based on categorical, continuous, or count observed variables. The identification of distinct
growth trajectories has enjoyed wide application in the medical, social and behavioral sciences
over the past few decades.
The substantive motivation for this dissertation arises from the desire to have efficient and
appropriate modeling techniques in tobacco use trajectories in children for count data of cigarette
use. In this setting, the zero-inflated Poisson model (Lambert, 1992) is the natural distribution of
choice in order to accommodate a significantly large proportion of never smokers. Tobacco use
is the leading preventable cause of death in the United States (USDHHS, 2014). Adolescence is
the developmental period during which smoking is most commonly initiated, and the majority of
adult smokers report initiating smoking during adolescence (USDHHS, 1994). As with most
forms of substance use, smoking rises to peak prevalence during the age period of 18-25 years,
but unlike other forms of substance use, its initiation declines after the mid-20s (Chassin et al.,
7
1996, Chen & Kandel, 1995). In 2014, the overall prevalence of smoking among high school
students was about 9.2% for cigarette smoking and 24.6% for any tobacco products (CDC,
2014). In addition to the well-researched research about the prevalence of adolescent cigarette
smoking, it is also important to understand how smoking behavior develops during this critical
period. Longitudinal studies have been widely used for modeling the developing patterns of
smoking behavior during adolescence. In general, smoking history within an individual can be
characterized as including distinct phases, like early experimentation, progression to regular use,
development of dependence, and potential cessation, which are frequently followed by relapse
(Heron et.al (2011)).
In addition to the general pattern of individual smoking history, researchers are becoming more
and more interested in the typological representation of patterns in smoking trajectory. The
identification of distinct subgroups or sub-classes of individual longitudinal development
trajectories is often of interest as it provides a way to understand behaviors for researchers in
diverse fields including social, educational, psychological, medical and other public health
sciences. This usually requires access to extensive datasets consisting of longitudinal, repeated
measures of variables, and in many cases including clustered data structure. Analysis of such
complex data necessitates using longitudinal latent variable modeling techniques. In recent
decades, several studies have modeled longitudinal smoking data using polynomial growth
models (Brook, Zhang, Brook, & Finch, 2010; Simons-Morton, 2007; Windle & Windle, 2001),
or in combination with a mixture component, for example, such as Growth Mixture models
(Brook et al., 2010; Colder, Flay, Segawa, & Hedeker, 2008; Nilam Ram & Kevin J. Grimm,
8
2009, Shiyko MP, Li Y, Rindskopf (2012)), and Latent Class models (Bernard et al., 2013; Janet
et al., 2004).
Latent class analysis (LCA) identifies unobservable subgroups within a population and is a
subset of Structural Equation Models (SEM), and is also a special case of Finite Mixture Models,
which is a type of latent variable model because the distribution is expressed as a mixture of
finite number of homogeneous latent categories. LCA is a technique where subtypes are
identified and created using a set of categorical or continuous observed indicators. These
subtypes are called latent classes. The classes are latent in that the subtypes are not directly
observed, but are inferred from multiple observed indicators. In contrast to factor analysis, LCA
provides classification of individuals. A latent categorical variable that takes at least 2 values, is
usually used to represent the group membership. The latent classification of group membership
has a variety of interpretations under a wide range of applications. For instance, in medical
diagnosis, it can be used to classify patients with or without a certain disease when an accurate
diagnosis is unavailable; in behavioral and health science, the latent groups could involve
different behavioral patterns (e.g. drinkers and abstainers); and LCM has also been applied to
identify phenotypes or genetic susceptibility for diseases based on clinical and biological data
(Keel et al., 2004; Muth´en and Muth´en, 2000; Reboussin et al., 2006; Rindskopf and
Rindskopf, 1986; Wenzel, 2012).
The traditional LCA approaches are often used for modeling cross-sectional data. If there is an
interest in identifying distinct growth trajectory groups for repeated data, conventional random
effects growth modeling could be used to evaluate the unobserved heterogeneity of subjects via
9
random effects (Laird & Ware (1982)), i.e. continuous latent variables. LCA models a statistical
distribution by a mixture of distributions. We may hypothesize that the data are from different
types of individuals, and class memberships are then identified based on similarities in response
patterns. A probability density function of a mixture model is usually presented by a combination
of K component pdfs,
(1)
where y is a dependent variable with conditional density p, y can be either discrete or continuous
variable, ak is the prior probability of component k, or the probability of belonging to class k, and
it can be considered as weights for each component and they give the probability of an
underlying categorical latent variable Ci that takes a value of k (k=2, …, K). k is the component
specific parameter vector for the density function of p, and K is the total number of classes.
pk(y| k) is the probability of y given membership in group k. We can understand a mixture model
as the weighted average of the random variable y generated from K distinct distributions, where
ak is the weight that represents the proportion of observations of one of the k groups (Leisch,
2004b).
In longitudinal data analysis, conventional growth modeling approaches assume that the growth
trajectories of individuals can be adequately described using a single estimate of growth
parameters (Raudenbush & Bryk, 2002). On the other hand, the Growth Mixture Modeling
(GMM) approach presented by Muthen & Shedden in 1999 relaxes this assumption and allows
for differences in growth parameters across unobserved subpopulations. In GMM, the
p(y | ) a
k
p
k
(y |
k
)
k 1
K
a
k
0, a
k
1
k 1
K
10
unobserved heterogeneity can be captured also by categorical latent variables (latent classes) that
allow for different groups of individual growth trajectories to vary around different means.
Latent Class Growth Mixture (LCGM) models, which form a special type of GMM for analyzing
longitudinal data, have been increasingly recognized for their usefulness in classifying
homogeneous subpopulations with a larger heterogeneous population and for the identification of
meaningful classes of individuals. In LCGM, we assume the variance and covariate estimates for
the growth factors within each class to be fixed as zeros, which means that no heterogeneity is
allowed for the growth curves given class membership. Similar to conventional cross-sectional
LCA, the LCGM can also be used to model categorical, continuous, or count data or the
combination of them in longitudinal data structure.
Modeling count outcome data is often a topic of major interest in social, behavioral science and
medical studies. For example, in studies of smoking trajectories, count variables such as the
number of cigarettes smoked by an individual during a week or a day are frequently used as the
outcome measurements for smoking trajectory. In smoking data, the phenomenon that more
zeros are observed than would be expected under the Poisson distribution are also frequently
encountered since we usually have the majority of subjects as “never smokers”. Thus, the
classical Poisson regression model for count data is often of limited use in such cases. The Zero-
inflated Poisson (ZIP) distribution, which is a mixture of Poisson distribution and the excess
zeros, has been widely used to deal with these situations, where the observed random event
contains excess zero counts. For example, the number of days of absence of high school in
days/week follows a Zero-inflated Poisson distribution; the number of cigarette smoked in a
week among adolescents also follows a Zero-inflated Poisson distribution. A popular approach
11
to modeling the excess of zeros is to apply ZIP regression model, as discussed by Lambert
(1992).
The ZIP distribution is a mixture of a Poisson distribution and a degenerate distribution at zero.
Two types of zeros are thought to exist in the Zero-inflated data: “structural zeros” (or true zeros)
from a non-susceptible group (i.e., those that do not have the attribute or experience of interest,
such as nonsmokers) and “random zeros” (or false zeros) for those from a susceptible group
(e.g., those who smoke but may falsely indicate a count of zero). Thus, in ZIP modeling, there
are two parameters to be estimated, the mean count for the entire sample and the proportion of
extra zeros in the data, which denotes the probability of
being in a non-susceptible group. A latent class model on zero-inflated count responses actually
consists of two kinds of unobserved information. First, there is the latent categorical variable Ci ,
which follows a multinomial distribution. This typically divides a population into different latent
subgroups. Within each subgroup, there is another latent variable Bit that indicates the zeros are
coming from structural zero process or a count process.
In terms of modeling longitudinal data trajectories, the latent class variable Ci summarizes
different developmental trajectories. Therefore, for each subject, their latent class memberships
are constrained to be the same over time. However, over time, one’s response can change from a
structural zero process to a count process. For example, a subject from the first latent class can
change from being a non-smoker at the beginning of the study to be a regular smoker at the
follow-up time periods. For longitudinal data, the mixture distributions used to characterize ZIP
and the need for another mixture from LCA makes the likelihood of ZIP LCGM very
12
complicated and difficult for parameter estimation. First The traditional frequentist approaches,
like maximum likelihood estimation approach are often used to fit this model; however, several
issues arise in estimating ZIP LCGM using the maximum likelihood approach since global
maxima can be very challenging or often impossible to obtain in this situation, convergence may
not be easily achieved, and there could be non-identifiable estimates or classes. In currently
available software that are used to fit ZIP LCGM using the frequentist maximum likelihood
approach, such as “MPlus” or PROC TRAJ in SAS, convergence is always a problematic issue
that so far has not been adequately resolved.
These challenges suggest the need to develop an approach that can correctly identify latent
longitudinal trajectory groups for Zero-inflated Poisson data and provide relatively accurate
parameter estimates. As an important extension of the above discussed models, the research
being conducted in this dissertation develops a Latent Class Growth Mixture (LCGM) model
using a fully Bayesian approach for identification of distinct smoking trajectories where the
observed count data follows a Zero-Inflated Poisson distribution, and then further extend this
model to fit multi-level data.
1.2 Multi-level LCA
An important limitation of the Latent Class models developed so far is they assume that study
subjects are independent. However, this assumption is not satisfied in many social and
behavioral research studies. For example, study subjects are not typically independent from each
other when they are nested in schools; and schools in turn nested in communities. Thus, an
alternative multilevel approach is required to model these data. In a standard Latent Class model,
13
it is assumed that the model parameters are the same for all individuals (level 1 units). The basic
idea of a multilevel Latent Class model is that some of the model parameters are allowed to be
different across groups, clusters, or level 2 units (Vermunt, 2003). For example, the Multilevel
LCA accounts for the nested structure of data by allowing variation across multi-level units of
each latent class indicator (Henry & Muthen, 2010). In this way, it is possible to evaluate how
the Level specific (e.g., Level 2) units influence the higher Level (e.g., Level 1) indicators that
determine the latent class membership. For example, in some communities, there could be a
larger probability of belonging to a heavy smoker class than in other communities. This multi-
level latent class model is similar to a mixed-effects regression model for categorical outcomes,
as mentioned by Vermunt (2008) and Muthen & Asparouhov (2009). However, the dependent
variable in multilevel LCA is latent rather than observed. Given the interest in identifying latent
classes in a variety of research fields where the data are structured in a multilevel way, there is
no doubt that the application of Multi-level LCA will be popular and useful for the foreseeable
future.
In addition to modeling the nested structure of multilevel data correctly, a multilevel latent
class analysis can also allow researchers to evaluate many interesting research questions. In any
single level latent class analysis, individual (Level 1) covariates may be included in
the model. These covariates predict the probability that an individual will be classified into a
certain Level 1 latent group (e.g., a certain smoking phenotype). However, multilevel latent class
analysis extends the simple assessment of an individual level covariate by permitting the
simultaneous assessment of contextual (Level 2) predictors. This feature allows for the
possibility that individuals with the same Level 1 covariate values may differ in their probability
14
of belonging to a certain latent class (e.g., smoking typology) due to contextual or environmental
differences in their community. For example, an individual living in a community with a high
density of poverty may be more likely to be classified in the heavy smoking latent class than an
individual living in a community with a low density of poverty (Henry K.L. & Muthen, 2010).
This research extends the single-level LCGM model to a multilevel LCGM model that makes it
possible to modify the independence assumption.
For parameter estimation in multi-level models, it is well-known that Bayesian models are
conceptually easier than those based on the frequentist approach, as highlighted by many
researchers. For example, Gelman and Hill (2007) and Raudenbush and Bryk (2002) conclude
that there might not be enough information to estimate variance parameters precisely by using
the frequentist method when the multilevel structure is complicated or when the number of
higher-level units is small. In these settings, there are distinct advantages in using a fully
Bayesian approach.
1.3 Research Motivation
Because methods that use currently popular maximum likelihood based or other frequentist
estimation approaches to fit ZIP LCGM (e.g., those implemented in Mplus or SAS proc traj)
suffer from convergence problems and due to other limitations (see Chapter 2), there is a need
for better modeling approaches. Besides, as the final objective of any multi-level designed study
is to fully characterize the multi-level data structure as well, this further increases the complexity
of the estimation procedure making the frequentist maximum likelihood approach unsuitable due
to its many limitations. The Bayesian approach offers a viable alternative as it has already been
15
successfully used for parameter estimation in the multi-level setting. Since neither Mplus nor
proc traj in SAS has applied the Bayesian ZIP LCGM to model even the single level LCGM on
ZIP count data so far, it will be interesting and useful to explore a fully Bayesian approach to fit
this model.
The purpose of this dissertation is to develop a Bayesian LCGM modeling approach for ZIP data
using Markov Chain Monte Carlo (MCMC) Metropolis random walk approach to improve the
convergence relative to currently used approaches. This will be further extended to fit
longitudinal data with multilevel structure.
1.4 The Children’s Health Study
The work in this dissertation is inspired by an ongoing longitudinal epidemiological study, the
Southern California Children’s Health Study (CHS), aiming to use LCGM to characterize
adolescent cigarette smoking trajectories based on initiation and progression. This would help to
better understand the determinants of early life smoking trajectories and their relationship to
tobacco use and addiction in adult life. Initial attempts tried to classify each individual
empirically into one of several unique trajectory phenotypes (e.g. experimenters, heavy users
with early and late age of initiation) using PROC TRAJ in SAS.
The CHS is one of the largest and most detailed longitudinal studies with initial focus on the
long-term effects of air pollution on respiratory health of children. It involves five multi-ethnic
cohorts comprising more than 12,000 children from 16 Southern California communities in 3
waves of collection (the data of cohort A, B and C were collected at the same time) between ages
16
of 5 and 16 years at recruitment (Table 1), with diverse socio-economic, built-environment and
demographic profiles.
Table 1 Wave and number recruited, age and year of recruitment, and grade until high school graduation, and age of new
interview data collections in CHS data
Briefly, children were recruited from public school classrooms between 1993 and 2003 in
community schools and followed through high school graduation (Gilliland et al., 2006; Berhane
et al., 2009). The rates of smoking at follow-up through high school (age 18) decreased with
each wave from the highest of 43% in the earliest cohort to 10% in the latest cohort. Smoking
information is collected through a yearly update questionnaire (completed by a parent until a
child reaches age 10, and afterwards by the child). Starting in 4
th
grade, children completed staff
administered questionnaires. Each year, data on the age at which participant recorded first
smoking, the number of cigarettes and cigars smoked in the past 24 hours, week, month, year and
life time and the date of last cigarette smoked were collected. In previous analysis conducted
using proc traj in SAS and in this dissertation, smoking trajectories that are used in LCGM are
based on the number of cigarettes smoked in the past week in CHS data.
1.5 The Add Health study
To model and characterize the adolescent cigarette smoking trajectories based on initiation and
progression and classify each individual empirically into one of several unique trajectory
17
phenotypes, we fit our proposed Bayesian LCGM ZIP model on data collected from the Add
Health study (The National Longitudinal Study of Adolescent Health). Add Health is a
longitudinal, nationally representative and school based study of U.S. adolescents in grades 7
through 12. The primary aim of the Add Health study is to measure the major social factors that
affect the health of adolescents, with special emphasis on the effects of multiple contexts of
adolescent life. Different than many other large-scale demographic studies, the Add Health study
is based on a clustered design, in part to facilitate the collection of social network data, which
links between individuals and the social structure that they are embedded in (Harris, 2013). Four
waves of in-home interviews were conducted from 1994 to 2008. The first wave in-home
interviews were conducted on participants aged 11-21 years. Further waves were collected in
1996, 2001-2002, and 2007-2008 when the sample was aged 24-33 years. Participants were
asked to report the average number of cigarettes smoked per day in the past 30 days. Although,
the percentage of never smokers decreased as age increased, there were still about 64%-77% of
zero counts in these four waves of data. The percentage of zero counts are much higher than the
zeros that can be handled using the Poisson distribution. Table 2 and Table 3 show the
descriptive statistics for the Add Health smoking study (N=2923).
We select gender, peer smoking and household smoking as covariates in our proposed model to
predict latent class membership. Those covariates were shown as significant risk factors of
adolescent cigarette smoking in previous research (Brook, et.al, 2005; Kaufman & Augustson,
2008).
18
Table 2 Descriptive statistics for the Add Health smoking study
Year
Age(years) Cigarettes
Observed zeros (%)
Mean (SD) Mean (SD)
1994-1995 15.50 (1.56) 1.42 (4.25) 77.36
1996 16.41 (1.57) 1.98 (5.21) 68.94
2001-2002 21.86 (1.59) 3.76 (7.79) 66.75
2007-2008 28.36 (1.59) 3.65 (7.29) 63.83
Table 3 Descriptive statistics for covariates predicting latent class membership
Covariates N (%)
Gender Male 1292 (44.20)
Female 1631 (55.80)
Peer Smoking 0 peer smokers 1725 (59.01)
1 peer smokers 560 (19.16)
2 peer smokers 314 (10.74)
3 peer smokers 324 (11.08)
Household smoking Yes 1626 (55.63)
No 1297 (44.37)
19
Chapter 2 Current parameter estimation methods and software
In the CHS, the prevalence of cigarette smoking before 10
th
grade is very low. There are more
than 90% of observed zeros in smoking data from 8
th
to 10
th
grade. Similarly, in Add Health
data, 63-77% of the smoking data were zeros for all 4 waves, which indicates that a large
segment of the population consists of “never smokers” - and hence there are much more zeros
than would be expected under the Poisson distribution. To identify distinct patterns of
developmental trajectories of the number of cigarettes smoked per week for this type of data, the
zero-inflated Poisson (ZIP) model is an appropriate approach to handle the extra zeros, which
assigns an extra zero probability for the additional zeros beyond the Poisson distribution
(Lambert, 1992).
2.1 Zero-inflated Poisson LCA
In the simple setting where no time stable covariates are involved, we can obtain the extra zero
probability by subtracting the zeros under Poisson distribution from the total percentage of zeros
if the mean count 0 is given,
,
where p0 is the total percentage of zeros in the data. However, in LCA models where the mean
count for each class is unknown, and also the extra zero probability can vary over time, the extra
zero probability can be predicted from time-varying covariates using the logit link, and the mean
count for each class can be estimated by predictions from time stable covariates using the log
link. In addition, in previous research involving the ZIP LCGM, time stable covariates were
included to obtain the class membership (Jones & Nagin, 2001; Henry K.L. & Muthen, 2010).
(p
0
e
0
)/(1 e
0
)
20
Consider an observed sample Yij= y11, ….ynt of n ×t observations for the ZIP model, where each
response for individual i observed at time j is denoted by yij. The probability mass function of the
longitudinal ZIP model given a specific class k can be expressed as follows:
𝑃 ( 𝑌𝑖𝑗 = 𝑦 𝑖𝑗
|𝐶 𝑖 = 𝑘 )= {
( 1 − 𝑝 𝑖𝑗
)+ 𝑝 𝑖𝑗
𝑒 −𝜆 𝑖𝑗
𝑝 𝑖𝑗
𝜆 𝑖𝑗
𝑦 𝑖𝑗
𝑒 −𝜆 𝑖𝑗
𝑦 𝑖𝑗
!
(2)
Where 𝑝 𝑖𝑗
represents P(yij>0), and λij denotes the mean count for subject i at time j given class k.
For LCA models, Y follows a mixture distribution, which takes the form of a mixture density as
follows:
𝑃 ( 𝑌𝑖𝑗 = 𝑦 𝑖𝑗
)= ∑ 𝜋 𝑖𝑘
𝐾 𝑘 =1
𝑃 ( 𝑌𝑖𝑗 = 𝑦 𝑖𝑗
|𝐶 𝑖 = 𝑘 ) (3)
Where 𝐶 𝑖 ~ 𝑀𝑢𝑙𝑡𝑖𝑛𝑜𝑚 ( 𝜋 𝑖 1
, … , 𝜋 𝑖𝑘
) , 𝜋 𝑖 1
, … , 𝜋 𝑖𝑘
are the probabilities associated with belonging
to a certain latent class. The classification of each trajectory group is assumed to be affected by
the time stable covariates Z, which allows the time-varying covariates X to be associated with Y
directly.
The 𝜋 𝑖𝑘
are estimated using multinomial logistic regression by,
𝜋 𝑖𝑘
=
𝑒 𝑧 𝑖 𝑇 𝛶 𝑘 ∑ 𝑒 𝑧 𝑖 𝑇 𝛶 ℎ
𝐾 ℎ=1
, with 𝛶 1
= 0 for identifiability. (4)
In this setting, the extra zero probability can be obtained by regressing time related variables X
with Y and by using,
Logit( ρ
𝑖𝑗𝑘 )= log (
ρ
𝑖𝑗𝑘 1−ρ
𝑖𝑗𝑘 )= 𝑋 𝑖𝑗
𝑇 𝛼 𝑘 , (5)
Where 𝜌 𝑖𝑗𝑘 denotes the probability of extra zeros in the data, whereas 𝜌 𝑖𝑗𝑘 = 1 − 𝑝 𝑖𝑗𝑘 .
The estimated mean for each class can be obtained by using log link with X, with the random
intercept bik for each subject i in each class k,
log( 𝜆 𝑖𝑗𝑘 )= 𝑋 𝑖𝑗
𝑇 𝛽 𝑘 + 𝑏 𝑖𝑘
(6)
21
We can obtain the likelihood of observing the data trajectory yij given group membership k from,
𝐿 ( 𝜆 𝑖𝑗𝑘 , 𝑝 𝑖𝑗𝑘 |𝑌 𝑖𝑗 ,
𝐶 𝑖 = 𝑘 )= ∏ 𝑃 ( 𝑌𝑖𝑗 = 𝑦 𝑖𝑗
|𝐶 𝑖 = 𝑘 )=
𝑦 𝑖𝑗
∏ [( 1 − 𝑝 𝑖𝑗𝑘 )+ 𝑝 𝑖𝑗𝑘 𝑒 −𝜆 𝑖𝑗𝑘 ] ∏ 𝑝 𝑖𝑗𝑘 𝑒 −𝜆 𝑖𝑗𝑘 𝜆 𝑖𝑗𝑘 𝑦 𝑖𝑗
𝑦 𝑖𝑗
!
𝑦 𝑖𝑗
>0 𝑦 𝑖𝑗
=0
(7)
2.2 Parameter Estimation in the frequentist approach
Earlier attempts in fitting LCA models were carried out under the frequentist paradigm. The
most commonly used method for estimating the parameter vector k is the maximum likelihood
approach via an iterative EM algorithm (Dempster et al., 1977). If we consider Yi as the
observed data, and Ci as the unobserved membership for each observation, combined with (7),
the log-likelihood of a sample of N subjects with t time points for ZIP data is given by
𝐿𝑜𝑔𝐿
= log (∏ ∑[𝑎 𝑘 ∏ 𝑝 ( 𝑦 𝑖𝑗
|𝐶 𝑖 , 𝜆 𝑖𝑗𝑘 , 𝑝 𝑖 𝑗𝑘
)
𝐽 𝑗 =1
]
𝐾 𝑘 =1
𝑁 𝑖 =1
) = ∑ log ( ∑[𝑎 𝑘 ∏ 𝑝 ( 𝑦 𝑖𝑗
|𝐶 𝑖 , Ѳ
𝑖𝑗𝑘 )
𝐽 𝑗 =1
]
𝐾 𝑘 =1
𝑁 𝑖 =1
= ∑ log ( ∑ [𝑎 𝑘 { ∏ [( 1 − 𝑝 𝑖𝑗𝑘 )+ 𝑝 𝑖𝑗𝑘 𝑒 −𝜆 𝑖𝑗𝑘 ] ∏ ( 𝑝 𝑖𝑗𝑘 𝑒 −𝜆 𝑖𝑗𝑘 𝜆 𝑖𝑗𝑘 𝑦 𝑖𝑗
𝑦 𝑖𝑗
!
) }
𝑦 𝑖𝑗
>0 𝑦 𝑖𝑗
=0
]
𝐾 𝑘 =1
𝑁 𝑖 =1
where ak=P(Ci=k). To reduce the difficulty of optimization with the log of sum term, a K-vector
of indicator variables zik is used so that zik =1 for subject i in class k. If yi=k, the log-likelihood
becomes
𝐿𝑜𝑔𝐿 = log (∏ ∏[𝑎 𝑘 ∏ 𝑝 ( 𝑦 𝑖𝑗
|𝐶 𝑖 , Ѳ
𝑖𝑗𝑘 )
𝐽 𝑗 =1
]
𝑧 𝑖𝑘
𝐾 𝑘 =1
𝑁 𝑖 =1
)
= ∑ ∑ 𝑧 𝑖𝑘
𝐾 𝑘 =1
𝑁 𝑖 =1
log( 𝑎 𝑘 )+ ∑ ∑ 𝑧 𝑖𝑘
𝐾 𝑘 =1
𝑁 𝑖 =1
∑ log ( 𝑝 ( 𝑦 𝑖𝑗
|𝐶 𝑖 , Ѳ
𝑖𝑗𝑘 ) )
𝐽 𝑗 =1
(8)
22
= ∑ ∑ 𝑧 𝑖 𝑘 𝐾 𝑘 =1
𝑁 𝑖 =1
log( 𝑎 𝑘 )+ ∑ ∑ 𝑧 𝑖𝑘
𝐾 𝑘 =1
𝑁 𝑖 =1
∑ log ( 𝑝 ( 𝑦 𝑖𝑗
|𝐶 𝑖 , Ѳ
𝑖𝑗𝑘 ) )
𝐽 𝑗 =1
= 𝐿𝐿
𝑃 + 𝐿𝐿
𝐷 where
LLp=∑ ∑ 𝑧 𝑖𝑘
𝐾 𝑘 =1
𝑁 𝑖 =1
log( 𝑎 𝑘 ) and LLD=∑ ∑ 𝑧 𝑖𝑘
𝐾 𝑘 =1
𝑁 𝑖 =1
∑ log ( 𝑝 ( 𝑦 𝑖𝑗
|𝐶 𝑖 , Ѳ
𝑖𝑗 𝑘 ) )
𝐽 𝑗 =1
.The EM
algorithm maximizes the log-likelihood iteratively by estimating zi by its expectation under the
updated ak and 𝑝 ( 𝑦 𝑖𝑗
|𝐶 𝑖 , 𝜆 𝑖𝑗𝑘 , 𝑝 𝑖𝑗𝑘 ) . At each iteration, ak and 𝑝 ( 𝑦 𝑖𝑗
|𝐶 𝑖 , 𝜆 𝑖𝑗𝑘 , 𝑝 𝑖𝑗𝑘 ) can be obtained
by maximizing LLp and LLD separately.
Based on (7) and (8), the complete data log likelihood for ZIP LCGA can be expressed as:
𝐿𝑜𝑔𝐿 = ∑ ∑ 𝑧 𝑖𝑘
log( 𝑎 𝑘 )+ ∑ ∑ 𝑧 𝑖𝑘
∑ log ( 𝑝 ( 𝑦 𝑖𝑗
|𝐶 𝑖 , 𝑝 𝑖𝑗𝑘 , 𝜆 𝑖𝑗𝑘 ) )
𝐽 𝑗 =1
𝐾 𝑘 =1
𝑁 𝑖 =1
𝐾 𝑘 =1
𝑁 𝑖 =1
=∑ ∑ 𝑧 𝑖𝑘
log( 𝑎 𝑘 )+ ∑ ∑ 𝑧 𝑖𝑘
∑ log ( ∏ [( 1 − 𝑝 𝑖𝑗𝑘 )+
𝑦 𝑖𝑗
=0
𝐽 𝑗 =1
𝐾 𝑘 =1
𝑁 𝑖 =1
𝐾 𝑘 =1
𝑁 𝑖 =1
𝑝 𝑖𝑗𝑘 𝑒 −𝜆 𝑖𝑗𝑘 ] ∏ [𝑝 𝑖𝑗𝑘 𝑒 −𝜆 𝑖𝑗𝑘 𝜆 𝑖𝑗𝑘 𝑦 𝑖𝑗
𝑦 𝑖𝑗
!
]
𝑦 𝑖𝑗
>0
) (9)
The following steps are required in each of the m+1iterations:
E-Step:
M-Step:
To find ak by maximizing LLP, we can obtain
(10)
By maximizing LLD and LLP separately, we can determine k based on the specific distribution
of p.
E(z
i
|a
k
(m)
,y
i
,
k
(m)
) w
k
a
k
(m)
p(y
i
|x
i
,
k
(m)
)
a
t
(m)
p(y
i
|x
i
,
t
(m)
)
t 1
K
a
k
(m 1)
w
i
(m)
i 1
N
N
23
The posterior probability for observing yij given class membership is,
E(z
ij
k |y
ij
,
k
(m)
,a
k
(m)
) w
k
(m)
a
k
(m)
p(y
ij
|x
i
,
k
(m)
)
j 1
J
a
k
(m)
p(y
ij
|x
i
,
k
(m)
)
j 1
J
k 1
K
where ak is the weight for each class, which is assigned to be 1/3 initially, assuming equal
assignment of subjects into different latent classes, for simplicity purposes, and is updated by
averaging the posterior probability at each iteration as described in (10).
Despite the popularity and relatively light computation burden of the maximum likelihood EM
algorithm, care must be taken to ensure that the process is not trapped at a local maximum. The
ZIP distribution, which contains two parameters (Poisson mean and extra zero probability)
makes it more difficult to obtain a closed form solution for both parameters by directly using the
EM algorithm. In fact, even when a closed form solution for both parameters can be obtained,
the zero-inflated models could have multiple modes (Angers & Biswas, 2003). Additional
difficulties could arise due to the mixture of zeros from both extra zeros and the zeros under
Poisson distribution. Therefore, the classical statistical inference procedures such as maximum
likelihood estimation for making inference about unknown parameters and large sample
approximations for estimating a confidence interval are not always suitable for a ZIP data.
2.3 Currently available software
As introduced in section 2.2, the Latent class growth model (LCGM) is designed to identify
groups of individuals that depict similar progressions of some behavior or outcome over age or
time, to calibrate variability about the average developmental tendencies, and to explain the
24
variability in terms of covariates of interest (Nagin, 1999). Based on the assumption that all
individual growth trajectories within a class are homogeneous, the intercept and slope are fixed
to be equal across individuals within a trajectory of a latent class. The framework of LCGM has
been developed by Nagin and colleagues, and is implemented in the SAS procedure PROC
TRAJ in 2001.
PROC TRAJ
Jones, Nagin and Roeder (2001) described a SAS procedure for estimating group-based
trajectory models. In this SAS procedure, known as “PROC TRAJ”, the maximum likelihood
method is used to estimate the model parameter parameters; and the maximization is performed
using a general quasi-Newton procedure (Dennis, Gay, and Welsch 1981; Dennis and Mei 1979).
To derive the likelihood, let y=(y1, …, yT) denote the longitudinal sequence of an individual’s
measurements over T periods of time. Then, we have
𝑓 ( 𝑦 )= ∑ 𝑃 ( 𝐶 = 𝑘 ) 𝑃 ( 𝑌 = 𝑦 |𝐶 = 𝑘 )= ∑ 𝑝 𝑘 𝑓 ( 𝑦 , 𝜆 𝑘 )
𝐾 𝑘 =1
𝐾 𝑘 =1
(11)
Here pk represents the probability of belonging to class k with corresponding parameter(s) k.
The longitudinal nature of the data is handled by letting the parameter(s) k depends on time.
According to Jones, Nagin and Roeder (2001), the time-stable covariates (risk factors such as
gender, race/ethnicity) are incorporated into the model by assuming that they influence the
probability of group membership. However, it is also assumed that given group membership, the
risk factors are independent of the data Y, while the time-dependent covariates can directly affect
the observed trajectory Y, as shown in Figure 1:
25
Figure 1 Relationship between group membership and covariates and trajectory
Thus equation (11) becomes
(12)
where the effect of time-stable covariates on group membership is modeled as a generalized logit
function ( 1 and 1 are taken to be zero for identifiability),
For the ZIP LCGM model implemented in PROC TRAJ, the Poisson mean for each latent class
and the extra zero probabilities for each individual of each class at each time point could be
modeled, for example, as a polynomial function of time varying covariate using a log link, and a
polynomial function of time variable, say cubic, using a logit link as shown below,
log( 𝜆 𝑘𝑡
)= 𝛽 0
+ 𝛽 1
𝑎𝑔𝑒 + 𝛽 2
𝑎𝑔𝑒 2
+ 𝛽 3
𝑎𝑔𝑒 3
logit( 𝑝 𝑖𝑡𝑘 )= 𝛼 0
+ 𝛼 1
𝑎𝑔𝑒 + 𝛼 2
𝑎𝑔𝑒 2
+ 𝛼 3
𝑎 𝑔𝑒
3
To fit LCGM with PROC TRAJ, researchers need to specify the number of distinct latent
classes that best fits the data. It is usually helpful to have a prior knowledge about the number of
f (y
i
|z
i
,w
i
) P(C
i
k |Z
i
z
i
)P(Y
i
y
i
|C
i
k,W
i
w
i
)
k 1
K
P(C
i
k |Z
i
z
i
)
e
(
k
k
'
z
i
)
e
(
l
l
'
z
l
)
l 1
K
26
trajectories in the particular area of study. The fit of the models are evaluated using the value of
Bayesian Information Criterion (BIC) that is obtained for each model, which is a fit index used
to compare competing models that include different numbers of trajectories of various shapes
(e.g, linear vs quadratic). The posterior probabilities are calculated to estimate the probability of
each latent class and then are used to assign each individual’s membership (Nagin, 2005;
Andruff et al, 2009).
There are several limitations associated with this approach: the quasi-Newton technique used to
estimate parameters in this approach is sensitive to the pre-specified starting values. It is often
the case that a local minimum is found, and thus this approach does not guarantee a fully
converged result. In addition, identifiability of model parameters has been found to be a problem
in analysis of real data. For example, in previous analysis of CHS data, applying the cubic
polynomial to the growth curves of the first two classes and then the quadratic polynomial to the
third class (i.e., order 332) can make no difference to the results than applying the quadratic
polynomial to the growth curves of the first class and cubic polynomial to the other two classes
(i.e., order 233). The above limitation becomes even worse once more than one time stable
covariates are included in the ZIP LCGM model.
Mplus
Another popular statistical software that is used to fit LCGM in social and behavioral science
applications is Mplus, which was developed by Muthen & Muthen, and first released in 1998. In
Mplus, the initial step prior to specifying latent classes is to specify a single-class growth model.
There are several major differences between Mplus and PROC TRAJ in terms of conducting
27
latent variable analysis: Mplus uses the EM-algorithm to find maximum-likelihood estimates,
rather than a quasi-Newton method for estimating the parameters for a latent class model, which
yields more stable results and better convergence. Besides, Mplus can handle the case when we
do not assume homogeneity given membership, and it is a reasonable (and perhaps necessary)
tool to identify the differences between individuals within the same class in many research
studies;
To model ZIP LCGM, Mplus uses a two-part model to handle both random zeros under the
Poisson distribution and the structural zeros. The two-part growth mixture modeling approach
decomposes the distribution of data into two parts- one part that determines whether the response
is zero and the other part that determines the actual level if nonzero responses occur (Kim &
Muthen, 2009; Olsen & Shafer, 2001).
Mplus also allows for the analysis of multilevel data, and missing data are allowed for
independent variables and are handled under the Missing At Random (MAR) mechanism using
model-based imputation in Mplus. On the other hand, missing observations are simply deleted in
PROC TRAJ, allowing only a complete data analysis. At last, Mplus allows for both Bayesian
and frequentist inference for parameters. Bayesian analysis uses the Markov Chain Monte Carlo
(MCMC) algorithm. In Mplus, the degree to which the latent classes are clearly distinguishable
by the data and the model can also be assessed using the estimated posterior probabilities in (5).
A summary statistic of the accuracy of classification is given by the entropy measure
(Ramaswamy et.al, 1993),
, (13)
E
K
1
( p
ik
lnp
ik
)
k
i
nlnK
28
Where denotes the estimated conditional probability for individual i in class k.
Entropy values range from zero to one, with entropy values close to one indicating clear
classifications and hence better fit (Muthen, 1998-2004).
The Mplus software provides better identifiability properties compared to PROC TRAJ,
however, convergence is still an issue in Mplus as well.
2.4 Limitations of current approaches
Currently, there are many issues and controversies related to the existing Latent Class analysis
tools (Jung & Wickrama (2008)); namely, (1) The determination of latent trajectory classes; (2)
Which model fit index to use; and (3) The convergence problem.
For the first problem, we are concerned with the question of whether latent classes truly exist.
Bauer and Curran (2003,a, b) mentioned that the existence of multiple classes may simply be due
to skewed or non-normally distributed data. The second problem relates to determining how
many classes there are, assuming there are multiple classes. Smallest Bayesian information
criteria (BIC) value and a significant Lo, Mendell and Robin (Lo, Mendell & Rubin, 2001)
Likelihood Ratio test are commonly used to determining the number of classes in a growth
mixture model. Recently, the bootstrap likelihood ratio test and sample-size adjusted BIC also
have proven to be better indicators of classes across all of the models being considered. The third
problem of non-convergence and being trapped in local solutions without reaching the global
extrema is frequently encountered (Hipp & Bauer, 2006). It is extremely difficult to model a
sample distribution that consists of a mixture of different sub-distributions, i.e., a finite mixture
p
ik
29
model. These lead to frequently encountered convergence issues due to local minima/maxima
and singularities. The problem with the EM-algorithm is that it cannot distinguish between a
global maximum and a local maximum. Thus, the algorithm is likely to terminate as long as it
reaches some maximum.
2.5 Fully Bayesian Approach
Recent advances in Bayesian estimation methodologies allow for fitting LCGM models also
within a Bayesian framework, as mentioned in Eilliott et.al (2005), Asparouhov and Muthen
(2010), and Gillian et.al (2012). Currently, there is little research available on the Bayesian
analysis of zero-inflated latent class models. The only known work so far is reported by Neelon
et al. (2011). They built a Bayesian two-part latent class model to analyze the effect of a health
care parity policy on mental health use and expenditures. There was a large proportion of
subjects who did not use any mental health service that were involved in their data. They used a
binomial component to model the observed zeros and a lognormal component was used to model
the right skewed nonzero values. Three classes of subjects were identified as low spenders,
moderate spenders, and high spenders and they also found that the parity policy had an impact
only on moderate spenders. So far, none of the current research to our knowledge have applied
Bayesian LCGM on ZIP multi-level data, where structural zeros exist that cannot be handled
using the Poisson distribution and conceptually more complicated than individual-level LCGM
models, and at the same time also account for multilevel nested data structure.
Building on the previous research on Bayesian LCGM on Lognormal data by Neelon and his
colleagues (Neelon et al., 2011), in this dissertation, we are proposing an alternative approach of
30
fitting multi-level ZIP LCGM models using an MCMC Metropolis random walk algorithm, due
to its flexibility in term of handling the situation when the joint posterior distribution of the
parameters does not have a closed form.
Bayesian analysis is based on the concept of the Bayes’ theorem, in which the probability can be
applied to the degree of prior belief or a proposition. According to Bayes’ theorem, which is also
known as Bayes rule, the posterior distribution of parameter θ given the observed data y is,
𝑃 ( 𝜃 |𝑦 )=
𝑃 ( 𝑦 |𝜃 ) 𝑃 ( 𝜃 )
𝑃 ( 𝑦 )
=
𝑃 ( 𝑦 |𝜃 ) 𝑃 ( 𝜃 )
∑ 𝑃 ( 𝑦 |𝜃 ) 𝑓 ( 𝜃 )
.
Here, 𝑃 ( 𝜃 ) is the prior distribution of parameter 𝜃 , 𝑃 ( 𝜃 |𝑦 ) is the posterior probability
distribution or parameter 𝜃 , 𝑃 ( 𝑦 |𝜃 ) is the probability of the data which is also the likelihood, and
because ∑ 𝑃 ( 𝑦 |𝜃 ) 𝑓 ( 𝜃 ) is a normalizing constant, then we have
𝑃 ( 𝜃 |𝑦 )∝ 𝑃 ( 𝑦 |𝜃 ) 𝑃 ( 𝜃 ) (14)
which states that a posterior is proportional to the prior multiplied by the likelihood (Ntzoufras,
2009). Once the posterior distribution of the parameters is obtained, statistical inference can be
performed by calculating the means of the posterior distribution,
𝜃 ̅
= ∫ 𝜃𝑝 ( 𝜃 |𝑦 ) 𝑑𝜃 . (15)
Credible intervals (L, U) are Bayesian analogues of confidence intervals that are used in
frequentist statistics, and are obtained by
1 − 𝛼 ≤ ∫ 𝑝 ( 𝜃 |𝑦 ) 𝑑𝜃 𝑈 𝐿 (16)
The Bayesian statistical inference discussed above is based on the fact that the integration in
equation (15) and (16) can be solved explicitly. However, this is almost impossible to guarantee
when there are multiple unknown parameters to be estimated (Zhang et al., 2007). In this case,
31
we can approximate the integrals via Monte Carlo integration by simulating N values from p(θ),
and then estimate 𝜃 ̅
=
1
𝑁 ∑ 𝑝 ( 𝜃 |𝑦 )
𝑁 𝑖 =1
.
Markov Chain Monte Carlo (MCMC) is a common Bayesian method based on drawing of the
parameter θ from approximate probability distributions by constructing a Markov Chain, and
then correcting those sampling values iteratively until an equilibrium distribution is reached. In
this way, the final stationary distribution that is obtained can approximate the target posterior
distribution p(θ|y). A sequence X1, X2, … of random elements of some set is a Markov chain,
taking values in an arbitrary countable state space and having the property that the conditional
distribution of Xn+1 given the past, X1, …, Xn, only depends on the present state Xn. A Markov
chain has stationary transition probabilities if the conditional distribution of Xn+1 given Xn does
not depend on n (Brooks, 2011). The specification of a Markov Chain model has two
components; namely, the initial distribution and the transitional probabilities. The initial
distribution is the marginal distribution of X1. The transitional probabilities specify the
conditional distribution of Xn+1 given Xn. Assuming the distribution of X1, …¸ Xn-1 is known, the
distribution of X1, …¸ Xn is determined by the usual formula of
Joint=conditional × marginal
MCMC is also used to calculate numerical approximation and to avoid the difficulty of multiple
dimension integration. Many MCMC sampling methods have been proposed and studied, such as
Gibbs sampling, Slice sampling, Metropolis-Hastings sampling, etc. The simulation process
following the scheme of Metropolis et al. (1953) is called the Metropolis algorithm. Hastings
(1970) extended it to more general case for which no requirement is needed for a symmetric
proposal distribution, and that kind of simulation is called Metropolis-Hastings algorithm. A
special case of the Metropolis-Hastings algorithm introduced by Geman & Geman (1984) are
32
called Gibbs Samplers, which update the proposal distribution from a conditional distribution of
the desired equilibrium distribution, and it is always accepted (Polasek, W. (2012)). It is now
recognized that most Bayesian inference could be done by MCMC, while there is very little that
could be done without MCMC.
2.5.1 Gibbs Sampling
There has been extensive research on Bayesian LCA for cross-sectional data. Gibbs sampling is
widely used, and this is because the conditional distributions of parameters are much easier to
obtain than the joint distribution in ZIP LCA models. Gibbs Sampling is an algorithm used to
generate a data point from the conditional distribution of each parameter, conditional on the
current values of other parameters (Geman & Geman, 1984). Let θ=(θ1,…,θn) be n unknown
parameters in the model of interest. At the (i+1)
th
iteration, under current value θ
(i)
=(θ1
(i)
, θ2
(i)
,
…, θn
(i)
), Gibbs Sampling is applicable in situations where n is greater than 1, then θ
(i+1)
=( θ1
(i+1)
,
θ2
(i+1)
, …, θn
(i+1)
) can be updated by sequentially generating
θ1
(i+1)
from
P(θ1|θ2
(i)
, θ3
(i)
, …, θn
(i)
;y),
θ2
(i+1)
from
P(θ2|θ1
(i)
, θ3
(i)
, …, θn
(i)
;y),
…
θn
(i+1)
from
P(θ1|θ1
(i)
, θ2
(i)
, …, θn-1
(i)
;y),
The basic idea in Gibbs Sampling is that, rather than probabilistically picking the parameters in
the next step all at once, we choose them separately one at a time based on the previous choice of
the other n-1 parameters. Note that, we can obtain the distribution that we are sampling from by
using the following conditional probability:
33
𝑃 ( 𝜃 𝑘 |𝜃 1
( 𝑖 +1)
, … , 𝜃 𝑘 −1
( 𝑖 +1)
, 𝜃 𝑘 +1
( 𝑖 )
, … 𝜃 𝑛 ( 𝑖 )
; 𝑦 )=
𝑃 ( 𝜃 1
( 𝑖 +1)
, … , 𝜃 𝑘 −1
( 𝑖 +1)
, 𝜃 𝑘 ( 𝑖 )
, 𝜃 𝑘 +1
( 𝑖 )
, … 𝜃 𝑛 ( 𝑖 )
; 𝑦 )
𝑃 ( 𝜃 1
( 𝑖 +1)
, … , 𝜃 𝑘 −1
( 𝑖 +1)
, 𝜃 𝑘 +1
( 𝑖 )
, … 𝜃 𝑛 ( 𝑖 )
; 𝑦 )
where 𝑧 𝑖𝑘
is an indicator variable that follows the Bernoulli distribution, with p= ak, which means
𝑧 𝑖𝑘
=1 if Ci=k, ak=P(Ci=k).
However, Gibbs sampling requires sampling from a full specification of the conditional
distribution, which may not be easily obtained for complex mixture models, such as ZIP LCA.
Instead, the Metropolis random walk algorithm can be used in this case by sampling the
parameter of interest from a proposal distribution, and in each iteration of the simulation we
determine whether to accept a new sampling parameter value by examining the ratio of
probability of the new sampling value and the previous one in the target distribution density, in
which way we are much more likely to provide better sampling parameter values at each
iteration of the simulation. In the LCGM ZIP model proposed in this approach, we update all the
parameters using a Metropolis random walk algorithm.
2.5.2 Metropolis-Hastings random walk algorithm
The Metropolis algorithm is an adaptation of a random walk with an acceptance/rejection rule to
converge to the specified target distribution. Its updating scheme was first applied in Metropolis
et al. (1953) and proceeds as follows. Given a current value of the d-dimensional Markov chain,
X, a new value X
*
is obtained by proposing a jump Y
*
:= X
*
− X from the pre-specified
Lebesgue density 𝑟 ̃ ( y
∗
; λ)=
1
λ
𝑑 𝑟 (
y
∗
λ
) , with r(y) = r(−y) for all y. Here λ > 0 governs the overall
size of the proposed jump and plays a crucial role in determining the efficiency of any algorithm.
The proposal is then accepted or rejected according to acceptance probability 𝛼 ( 𝑥 , 𝑦 ∗
)=
34
min ( 1,
𝜋 ( 𝑥 +𝑦 ∗
)
𝜋 ( 𝑥 )
) . If the proposed value is accepted it becomes the next current value, otherwise
the current value is left unchanged.
The detailed steps in the Metropolis random walk algorithm are shown as follows.
1. Draw a starting point θ0, for which p(θ0|y)>0, from a proposal distribution p0(θ), where
the starting distribution may be based on an approximation of the true distribution of θ.
2. For iterations t=1,2,…;
(a) Sample a proposal θ
*
from a proposal distribution at time t, Q(θ
*
| θ
t-1
). In Metropolis
algorithm, the proposal distribution must be symmetric, satisfying q(θ
a
| θ
b
)= q(θ
b
|
θ
a
), but symmetry of q is not required for the Metropolis-Hastings algorithm.
(b) Calculate the ratio of the densities,
𝑟 =
𝑝 ( θ
∗
|y) q( θ
𝑡 −1
| θ
∗
)
𝑝 ( θ
𝑡 −1
|y) q( θ
𝑡 −1
| θ
∗
)
=
𝑝 ( θ
∗
|y)
𝑝 ( θ
𝑡 −1
|y)
(17)
(c) Set θ
𝑡
= {
θ
∗
𝑤𝑖𝑡 ℎ 𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦 𝑜𝑓 min ( 𝑟 , 1)
θ
𝑡 −1
𝑜𝑡 ℎ 𝑒𝑟𝑤 𝑖𝑠𝑒
2.5.3 Proposed Bayesian ZIP LCGM model
Based on the figure of assumptions (Figure 1.) in LCGM models by Jones & Nagins (2001),
the class membership is only associated with time stable covariates such as gender,
race/ethnicity, etc. Therefore, in the LCGM ZIP model, the probability of each class for each
individual can be predicted directly from time stable covariates by fitting a multinomial
regression with a vector of class probability,
𝜋 𝑖𝑘
=
𝑒 𝑧 𝑖 𝑇 𝛶 𝑘 ∑ 𝑒 𝑧 𝑖 𝑇 𝛶 𝑘 𝐾 ℎ=1
( 18)
35
and assigning the class membership for each individual by sampling 𝜋 𝑖𝑘
from the categorical
distribution Cat(𝜋 𝑖𝑘
). Based on the assumption that the trajectory is only related to time varying
covariates (such as age) conditional on class membership, and assuming that the trajectory of y
given class membership follows a ZIP distribution, the time variable can be used to predict the
two parameters in the ZIP distribution: The Poisson mean and the extra zero probability. The
Poisson mean for each class and each individual at each time point can be predicted using a log
link with time varying covariates and a random intercept represents the random variable of each
individual to the Poisson mean in each class,
log( 𝜆 𝑖𝑡𝑘 )= 𝑥 𝑖𝑡
𝑇 𝛽 𝑘 + 𝑏 𝑖𝑘
(19)
The extra zero probability for each class at each time point can be obtained by fitting a logistic
regression along with time varying covariates,
𝑙𝑜 𝑔𝑖𝑡 ( 𝑝 𝑡𝑘
)= log (
𝑝 𝑡𝑘
1−𝑝 𝑡𝑘
)= 𝑥 𝑡𝑘
𝑇 𝛼 𝑘 ( 20)
as no random effect is added for the extra zero probability, we assume it is the same for all
subjects in each class at a same time point.
Based on (18)-(20), the parameter vector that we are interested in estimating in the proposed
model is Ѳ={αk, βk, ϒk, bi}, each of them corresponding to the parameters that need to be
estimated for the extra zero probability, the Poisson mean and the class probability for each
class, as well as the random effect for each individual, accordingly. The posterior distribution of
Ѳ, 𝑃 ( 𝑌 |Ѳ)∝ 𝑃 ( Ѳ) 𝑃 ( Ѳ|𝑌 ) , where 𝑃 ( Ѳ) are the prior distributions for Ѳ, which is specified in the
simulation procedure in Chapter 3, 𝑃 ( 𝑌 |Ѳ) is the likelihood that can be expressed as follows:
36
𝑃 ( 𝑌 |Ѳ)= ∏ ∑[𝑃 ( 𝐶 𝑖 = 𝑘 ) { ∏[𝑝 ( 𝑌 𝑖𝑡
|𝐶 𝑖 = 𝑘 ) ]
𝑇 𝐾 𝑘 =1
𝑁 𝑖 =1
= ∏ ∑[𝜋 𝑖𝑘
{∏ [𝑝 𝑖𝑗𝑘 + ( 1 − 𝑝 𝑖𝑗𝑘 ) 𝑒 −𝜆 𝑖𝑗𝑘 ] ∏ [( 1 − 𝑝 𝑖𝑗𝑘 )
𝑒 −𝜆 𝑖𝑗𝑘 𝜆 𝑖𝑗𝑘 𝑦 𝑖𝑗
𝑦 𝑖𝑗
!
]
𝑦 𝑖𝑗
>0 𝑦 𝑖𝑗
=0
𝐾 𝑘 =1
𝑁 𝑖 =1
= ∏ ∑[
𝑒 𝑧 𝑖 𝑇 𝛶 𝑘 ∑ 𝑒 𝑧 𝑖 𝑇 𝛶 𝑘 𝐾 ℎ=1
{∏ [
1
1 + 𝑒 −( 𝑥 𝑖 𝑇 𝛼 𝑘 )
+
1
𝑒 𝑒 𝑥 𝑖 𝑇 𝛽 𝑘 +𝑏 𝑖𝑘
( 𝑒 𝑥 𝑖 𝑇 𝛼 𝑘 + 1)
]
𝑦 𝑖𝑡
=0
𝐾 𝑘 =1
𝑁 𝑖 =1
∏
𝑒 ( 𝑥 𝑖 𝑇 𝛽 𝑘 +𝑏 𝑖𝑘
) 𝑦 𝑖𝑡
𝑦 𝑖𝑡
!𝑒 𝑒 𝑥 𝑖 𝑇 𝛽 𝑘 +𝑏 𝑖𝑘
( 𝑒 𝑥 𝑖 𝑇 𝛼 𝑘 +1)
𝑦 𝑖𝑡
>0
(21)
The joint posterior distribution of the parameters from our proposed model does not have
a closed form, and hence simulation based MCMC methods are used to obtain estimates of
unknown parameters. MCMC methods are particularly powerful in dealing with high
dimensional and complex data. Because there are no closed forms available for the full
conditional posterior distributions of 𝛼 𝑘 , 𝛽 𝑘 , and ϒ
𝑘 , it is very difficult to directly draw samples
from those full conditional distributions. Instead, we use a Metropolis algorithm to draw samples
for these three parameters from proposal distributions. As a result, for the ZIP LCGM, we can
use the following algorithm to generate samples from the above full conditional distributions to
update 𝛼 𝑘 , 𝛽 𝑘 , and ϒ
𝑘 .
(1) Assign initial values to 𝛼 𝑘 , 𝛽 𝑘 , for k = 1, . . . , K, to ϒ
𝑘 for k = 2, . . . ,K, and to class
membership indicator 𝐶 𝑖 ;
(2) Update ϒ
𝑘 for k = 2, . . . , K, using Metropolis random walk algorithm;
(3) sample 𝐶 𝑖 from its full conditional multinomial distribution based on posterior probability;
(4) Update 𝛼 𝑘 , 𝛽 𝑘 for k = 1, . . . , K, using Metropolis random walk algorithm.
37
More related details are described in chapter 3.
2.5.4 Label-switching issue
The label-switching problem is a well-known and fundamental problem in Bayesian estimation
of mixture models. The label-switching problem can occur when both the prior distribution of
the model and the likelihood are invariant to permutations of the parameters. Conceptually, this
means that the labels of the estimated parameters might not match the data generating labels for
simulated data. Let
) ,..., (
1 K
denote a permutation of class-specific parameters and weights
(class probabilities in LCGM), so that ) ,..., ( ), ,..., (
1 1 K K
w w w
. If an exchangeable
prior distribution was placed upon the parameters of a mixture model, then the likelihood is
invariant to the permutation of the label of parameters (Jasra, Holmes and Stephen, 2005), which
means ) | , ( ) | , ( x w L x w L . For example, a 3-class LCGM can have 3! permutations of the
classes for parameters, as they can be labeled {1,2,3}, {1,3,2}, {2,3,1}, {2,1,3}, {3,2,1}, or
{3,1,2} without changing the model fit. During one run of an MCMC sampler, the order of
parameters for different classes can change multiple times between iterations. Label-switching
significantly increases the effort required to produce a satisfactory Bayesian analysis of the data,
but it is also well known that the presence of the Label-switching phenomenon in the MCMC
sample serves as a necessary condition for the convergence of the chain to the target distribution.
To obtain meaningful interpretations of the class-specific parameter estimates, it is necessary to
account for those changes (Stephens, 2000; Sperrin, N., Jaki, T., Wit, E (2010)) and the Label-
switching problem needs to be addressed.
38
Papastamoulis (2015) recently introduced the label.switching R package, currently available
from the Comprehensive R Archive Network, which implements the eight most commonly used
algorithms that are used to deal with Label-switching. These include ordering constraints, the
Kullback-Leibler based algorithm (Stephens, 2000), the pivotal reordering algorithm (Martin,
Mengersen and Robert 2005), the default and iterative versions of the Equivalence Classes
Representatives (ECR) algorithm (Papastamoulis and Iliopoulos, 2010; Rodriguez and Walker
2014; Paoastamoulis and Iliopoulos 2013), the probabilistic relabeling algorithm (Sperrin, Jaki,
and Wit, 2010) and the data-based algorithm (Rodrigues and Walker, 2014). According to
Papastamoulis (2015), if the MCMC sampler converged to the target distribution, the simulated
parameters should be switched to one among the K! symmetric areas of the posterior
distribution, where K is the number of latent classes. Label-switching can therefore be
determined by examining whether the trace plot for class-specific parameter switched a multiple
of K! different patterns (K! switching of colors) (Papastamoulis (2015)).
The simplest approach to solve the Label-switching problem is to impose an Artificial
Identifiability Constraint to the MCMC sample, which is called the AIC algorithm. In that way,
the simulated MCMC parameters are permuted according to the order of a specific parameter
(Papastamoulis (2015)). The detailed procedure for ordering constraints is to first choose a
specific parameter type ξs, where s=1,…, J, and J is the number of parameters to be estimated.
For iteration t=1, …, m, find the permutation τ
(t)
, such that 𝜉 𝜏 1
𝑗 ( 𝑡 )
< ⋯ < 𝜉 𝜏 𝐾 𝑗 ( 𝑡 )
.
To examine whether the estimated Poisson mean also suffers from the Label-switching problem,
we plot the estimated values of the Poisson mean parameter 𝛽 𝑘 for each class k. If the results of
39
𝛽 𝑘 after burn-in were clearly separated from each other, we would conclude there is no sign of a
label-switching problem. If the 𝛽 𝑘 for different classes in their corresponding trace plot were
switching between different latent groups for all the latent classes, we would conclude that a
label-switching issue exists and the label.switching package developed by Papastamoulis will be
used to correct the switched labels of parameter estimates.
2.5.5 Model fit assessment and comparison
In the Bayesian modeling framework, there are several approaches to assess the model fit and
determine the best fitting model in a potentially large class of candidates, such as Bayes factors
(Jeffreys, 1935) and deviance information criterion (DIC), introduced by Spiegelhalter et.al
(2002), which calculates the posterior mean of the deviance. The former approach is
computationally complex and sensitive to prior specifications. In this research, we use the widely
used DIC for comparing models with different classes and determine the appropriate number of
latent classes.
The posterior mean of the deviance is defined by -2*log(likelihood), where the likelihood is
defined by p(y|ϴ), y includes all the stochastic nodes given values, and ϴ contains the stochastic
nodes upon which the distribution of y depends.
DIC can be obtained from the following equation:
𝐷𝐼𝐶 = 𝐷 ( 𝛳 )
̅ ̅ ̅ ̅ ̅ ̅ ̅
+ 𝑝 𝐷 = 𝐸 [𝐷 ( 𝛳 ) |𝑦 ] + ( 𝐸 [𝐷 ( 𝛳 ) |𝑦 ] − 𝐷 ( 𝐸 [𝛳 |𝑦 ])= 2𝐷 ( 𝛳 )
̅ ̅ ̅ ̅ ̅ ̅ ̅
− 𝐷 ( 𝛳 ̃
)=
−4𝐸 ( [log( 𝑓 ( 𝑦 |𝛳 ) |𝑦 ])+ 2log ( 𝑓 ( 𝑦 |𝛳 ̃
) ) (22)
where 𝐷 ( 𝛳 )
̅ ̅ ̅ ̅ ̅ ̅ ̅
is the posterior mean of deviance and it provide the information on how much
discrepancy exists between the model and the data, 𝑝 𝐷 = 𝐷 ( 𝛳 )
̅ ̅ ̅ ̅ ̅ ̅ ̅
− 𝐷 ( 𝛳 ̃
) , where 𝛳 ̃
is an estimate
40
of parameters depending on the distributional form of y. 𝑝 𝐷 evaluates the difference between the
posterior mean of the deviance and the deviance evaluated at the posterior mean of the
parameter. The posterior mean 𝛳 ̅
is often used for 𝛳 ̃
. 𝐷 ( 𝛳 ̃
) is the point estimate of deviance
obtained by substituting in the posterior means 𝛳 ̅
. Similar to AIC and BIC, the smaller DIC
indicates better model choice.
To evaluate the variability of the posterior estimates of the parameter of interest, Bayesian
credible interval or Highest Posterior Density (HPD) interval are calculated based on the quantile
of posterior distribution of the parameters. If 𝛳 𝐿 ∗
is the α/2 posterior quantile for θ, and 𝛳 𝑈 ∗
is the
1 − α/2 posterior quantile for θ, then (𝛳 𝐿 ∗
, 𝛳 𝑈 ∗
) is a 100(1 − α) % credible interval for θ. The
equal-tail credible interval approach is ideal when the posterior distribution is symmetric. When
the posterior distribution is not symmetric, HPD interval has more advantage for selecting the
shortest possible interval enclosing (1-α) % of the posterior mass. The procedures use the Chen-
Shao algorithm (Chen and Shao; 1999; Chen, Shao, and Ibrahim, 2000) to estimate an empirical
HPD interval of 𝛳 𝑖 :
1. Sort { 𝛳 𝑖 ′
} to obtain the ordered values:
𝛳 𝑖 ( 1)
≤ 𝛳 𝑖 ( 2)
≤ … ≤ 𝛳 𝑖 ( 𝑛 )
2. Compute the 100 (1-α) % credible intervals:
𝑅 𝑗 ( 𝑛 )= ( 𝛳 𝑖 ( 𝑗 ) ,
𝛳 𝑖 ( 𝑗 +[( 1−𝛼 ) 𝑛 ])
)
𝑓𝑜𝑟 𝑗 = 1, 2, … , 𝑛 − [( 1 − 𝛼 ) 𝑛 ]
41
3. The 100 (1-α) % HPD interval is the one with the smallest interval width among all
credible intervals.
Trace plots of the posterior distribution are commonly used for MCMC convergence diagnosis.
Trace plots of MCMC samples versus the simulation iteration index is useful in diagnosis of
convergence and mixing of a chain. Relative constant means and variances of trace plots indicate
the chain has converged to its target distribution (i.e., the posterior distribution).
42
Chapter 3 Proposed Bayesian LCGM model and Results
3.1 Simulation Studies
The proposed Bayesian LCGM ZIP is first assessed using simulated data generated using the R
programming langauge, and will then be used to analyze the Add Health smoking data.
Simulation studies for multi-level Bayesian LCGM ZIP model will also be conducted to assess
the proposed model fit under nested data structure. The entire simulation process including
generating simulated dataset and then the Bayesian MCMC simulation are conducted based on R
programming as well. The Bayesian Metropolis random walk approach is used in this simulation
study in order to update all the parameters (Ѳ={αk, βk, ϒk, bi}) in the proposed model. In addition
to the parameters to be estimated, another major output in this simulation is the latent class
membership for each individual, which takes the value of 1 to K. The estimated posterior class
membership based on the proposed approach is then compared with the simulated true class to
evaluate the accuracy of classification based on this proposed model.
3.1.1 The Simulation Process for single-level ZIP LCGM
To generate simulated data that are as close as possible to the real CHS data, and to start with
simplified estimation procedure, we first simulate three classes of data (K=3), with a total sample
size of 1000 and include only 3 time points in total (t=3). Based on observations from the
smoking trend in the CHS data or in most related studies as well, the never smokers should have
the lowest mean for the number of cigarettes smoked in a week, while the third (i.e., highest)
class usually representing the heavy smokers should have the highest mean count. The scenario
is opposite for the extra zero probability in each class. Since the never smokers have the most
43
structural zeros, we expect the extra zero probability for the first class be the highest, and the
third class to have the lowest, i.e.,
𝑝 1𝑡 > 𝑝 2𝑡 > 𝑝 3𝑡 , 𝜆 1𝑡 < 𝜆 2𝑡 < 𝜆 3𝑡 (23)
Besides, we need to build growth curves, which means that the mean count of weekly cigarette
smoked, should be increasing over time. And also given each time point t=1,2,3, the third class is
much more likely to have highest Poisson mean and smallest extra-zero probability, so that we
set the following constraints to our simulated data
𝑝 𝑘 1
> 𝑝 𝑘 2
> 𝑝 𝑘 3
, 𝜆 𝑘 1
< 𝜆 𝑘 2
< 𝜆 𝑘 3
(24)
Besides, in order to simulate the smoking trajectories that are as similar as possible to that in the
real CHS data, where most subjects are never smokers or light smokers, we enforce the
proportion of the class 1 to be the highest, and class 3 to be the lowest, which means
𝑃 ( 𝐶 𝑖 = 1| ϒ
1
)> 𝑃 ( 𝐶 𝑖 = 2| ϒ
2
)> 𝑃 ( 𝐶 𝑖 = 2| ϒ
3
) (25)
To simulate the datasets, we first generate the time stable covariate gender using the Binomial
random variable with the probability of being a male set to be 0.55. And then we simulate three
latent classes by fitting a multinomial logistic regression as specified in (18), with given initial
values of coefficient ϒk. There are N=1000 subjects and two sets of coefficients for ϒk in this
simulation: one for class two and another for class three (Υ1=0 for identifiability). After we
obtain the true class via simulation, we can predict the Poisson mean
ikt
for each subject in
each class at each time point and the extra zero probability
kt
p
for each class at each time point
using (19) and (20). Both αk and βk are 2 by 3 matrices, including coefficients for predicting
kt
p
and
ikt
for each class with intercept and slope correspondingly. Each subject in each class has
44
an intercept, represented by the random intercept bi for the Poisson mean, and these are sampled
from the normal distribution with mean 0 and variable 0.05, 0.04 and 0.03 for classes 1, 2 and 3
respectively. Based on the constraint specified in (23) - (25), and the class assignment for each
individual, as well as the ZIP parameters for each class, we generate our data by randomly
drawing samples
) , ( ~ ,...,
11 kt ikt Nt
p ZIP y y
.
3.1.2 Simulation procedure for parameter estimation of single-level ZIP LCGM
The detailed steps for updating each parameter using the proposed Bayesian approach can be
described as shown below. Note that after each step is finished, the parameter estimates or the
latent classes obtained in this iteration will be used in the next step to update the next parameter,
and the whole procedure is repeated for 40,000 iterations of simulation:
Step 1: Update ϒk using Metropolis random walk
The full conditional for ϒk (k=2,…,K) is given by
𝜋 ( Υ
𝑘 |. )∝ ∏ [𝑃 ( 𝐶 𝑖 = 𝑘 |Υ
𝑘 ) ]
𝐼 ( 𝐶 𝑖 =𝑘 )
𝜋 ( Υ
𝑘 )= ∏ (
𝑒 𝑧 𝑖 𝑇 𝛶 𝑘 ∑ 𝑒 𝑧 𝑖 𝑇 𝛶 ℎ
𝐾 ℎ=1
) 𝑁 ( Υ
𝑘 ; 0, ( 9/4) 𝐼 𝑟 )
𝑁 𝑖 :𝐶 𝑖 =𝑘 𝑁 𝑖 =1
(26)
Where N(ϒk;.) is a 2-dimensional normal density evaluated at ϒk. This full conditional
distribution does not have a closed form solution, so we update ϒk using a Metropolis random
walk algorithm based on a multivariate t3(sgTk) proposal distribution, centered at the previous
sampling value of ϒk. The adaption proposal developed by Harrio et al. (2005) is used to improve
mixing, which apply the empirical covariance from an extended burn-in period to tune Tk to
emulate the true posterior covariance. Sg is a scaler factor for the covariance to help in obtaining
an optimal acceptance rate of about 20%. For choosing Sg, less informative priors are typically
45
chosen, such as Sg=10. In addition to (9/4) Ir , different variance of prior distribution have also
been experimented, e.g. 0.1, and 0.01.
Step 2: Update Ci:
In order to update the class membership Ci for subject i=1, …, N, we draw Ci from its full
conditional distribution,
𝜋 ( 𝐶 𝑖 |. )= 𝑃 ( 𝐶 𝑖 = 𝑘 |. )= 𝐶𝑎𝑡 ( 𝜋 𝑖𝑘
) ,
where 𝜋 𝑖𝑘
=
𝑃 ( 𝐶 𝑖 =𝑘 |Υ
𝑘 ) [∏ 𝑓 ( 𝑦 𝑖𝑡
|𝛼 𝑘 , 𝛽 𝑘 , 𝑏 𝑖 )
𝑡 𝑇 =1
]𝑁 ( 𝑏 𝑖 ;0,𝜎 𝑘 2
)
∑ 𝑃 ( 𝐶 𝑖 =ℎ|Υ
ℎ
) [∏ 𝑓 ( 𝑦 𝑖𝑡
|𝛼 𝑘 , 𝛽 𝑘 , 𝑏 𝑖 )
𝑡 𝑇 =1
]𝑁 ( 𝑏 𝑖 ;0,𝜎 𝑘 2
)
𝐾 ℎ=1
(27)
Since 𝜋 𝑖𝑘
has to sum up to 1 for all 3 classes, we have 𝜋 𝑖 1
=
1
1+𝑒 𝑧 𝑖 𝑇 𝛶 2
+𝑒 𝑧 𝑖 𝑇 𝛶 3
, 𝜋 𝑖 2
=
𝑒 𝑧 𝑖 𝑇 𝛶 2
1+𝑒 𝑧 𝑖 𝑇 𝛶 2
+𝑒 𝑧 𝑖 𝑇 𝛶 3
,
and 𝜋 𝑖 3
=
𝑒 𝑧 𝑖 𝑇 𝛶 3
1+𝑒 𝑧 𝑖 𝑇 𝛶 2
+𝑒 𝑧 𝑖 𝑇 𝛶 3
. 𝜎 𝑘 2
is sampled directly from an inverse gamma distribution, in order to
have a relatively small and maintain a less fluctuate variance (<1) for the random effect, we set
the shape parameter of 10, and rate parameter 5 for the inverse gamma distribution.
Step 3: Update αk using Metropolis random walk
The full conditional distribution for αk is
𝜋 ( α
k
|. )∝ 𝑓 ( 𝑦 𝑖 |𝑐 𝑖 = 𝑘 , α
k
, β
k
, 𝑏 𝑖 ) 𝑁 ( α
k
; 0, 𝐼 2
)
For each class k, we update 𝜋 ( 𝛼 𝑘 |. ) use a Metropolis random walk algorithm with a proposal
distribution of multivariate t (1, 𝐼 2
) centered at previous values of α
k
Step 4: Update βk using Metropolis random walk
The full conditional distribution for 𝛽 𝑘 is
𝜋 ( β
k
|. )∝ 𝑓 ( 𝑦 𝑖 |𝑐 𝑖 = 𝑘 , α
k
, β
k
, 𝑏 𝑖 ) 𝑁 ( β
k
; 0, 𝐼 2
)
46
For each class k, we update 𝜋 ( β
k
|. ) use a Metropolis random walk algorithm with a proposal
distribution of multivariate t (1, 𝐼 2
) centered at previous value of 𝛽 𝑘 .
Step 5: Update bi using Metropolis random walk
The full conditional distribution for 𝑏 𝑖 is
𝜋 ( 𝑏 𝑖 |. )∝ 𝑓 ( 𝑦 𝑖 |𝑐 𝑖 = 𝑘 , α
k
, β
k
, 𝑏 𝑖 ) 𝑁 ( 𝑏 𝑖 ; 0, 𝜎 𝑘 2
)
For each class k, we update 𝜋 ( 𝑏 𝑖 |. ) using a Metropolis random walk algorithm with a proposal
distribution of multivariate t (1, 𝐼 2
) centered at previous value of 𝑏 𝑖
After the posterior distribution for each parameter is obtained, we summarize the mean of the
distribution for 40,000 iterations as the parameter estimate. The trace plot for each parameter will
be provided and discussed in the next section. To start evaluating the model fit using simplified
model, the extra zero probability parameter αk for all latent classes is first fixed to certain values
based on our pre-specified constraints (shown in table 6), then we estimate αk using Metropolis
random walk algorithm as shown in step 3 above.
3.1.3 Comparison of Gaussian LCGM model using WinBUGS and R
The proposed Bayesian LCGM model is first tested on simulated Gaussian longitudinal data to
evaluate the performance of the proposed model on continuous data, as well as comparing the
estimation results from using Win BUGS and using R program.
In Gaussian LCGM, the probability density function given a specific class k can be expressed as
follows:
47
𝑓 ( 𝑌𝑖𝑗 = 𝑦 𝑖𝑗
|𝐶 𝑖 = 𝑘 )=
1
𝜎 𝑘 √2𝜋 𝑒 −( 𝑥 𝑖𝑗
−𝜇 𝑘 )
2
2𝜎 𝑘 2
Y follows a mixture distribution, which takes the form of a mixture density as follows:
𝑓 ( 𝑌𝑖𝑗 = 𝑦 𝑖𝑗
)= ∑ 𝜋 𝑖𝑘
𝐾 𝑘 =1
𝑓 ( 𝑌𝑖𝑗 = 𝑦 𝑖𝑗
|𝐶 𝑖 = 𝑘 )
Where 𝐶 𝑖 ~ 𝑀𝑢𝑙𝑡𝑖𝑛𝑜𝑚 ( 𝜋 𝑖 1
, … , 𝜋 𝑖𝑘
) , 𝜋 𝑖 1
, … , 𝜋 𝑖𝑘
are the probabilities associated with belonging
to a certain latent class. The classification of each trajectory group is assumed to be affected by
the time stable covariates Z, which allows the time-varying covariates X to be associated with Y
directly.
The 𝜋 𝑖𝑘
are estimated using multinomial logistic regression by,
𝜋 𝑖𝑘
=
𝑒 𝑧 𝑖 𝑇 𝛶 𝑘 ∑ 𝑒 𝑧 𝑖 𝑇 𝛶 ℎ
𝐾 ℎ=1
, with 𝛶 1
= 0 for identifiability.
The estimated mean for each class can be obtained by using log link with X,
log( 𝜆 𝑖𝑗𝑘 )= 𝑋 𝑖𝑗
𝑇 𝛽 𝑘
We can obtain the likelihood of observing the data trajectory yij given group membership k from,
𝐿 ( 𝜆 𝑖𝑗𝑘 , 𝑝 𝑖𝑗𝑘 |𝑌 𝑖𝑗 ,
𝐶 𝑖 = 𝑘 )= ∏ 𝑃 ( 𝑌𝑖𝑗 = 𝑦 𝑖𝑗
|𝐶 𝑖 = 𝑘 )=
𝑦 𝑖𝑗
∏
1
𝜎 𝑘 √2𝜋 𝑒 −( 𝑥 𝑖𝑗
−𝜇 𝑘 )
2
2𝜎 𝑘 2
𝑦 𝑖𝑗
The metropolis steps for updating each unknown parameter are similar for Gaussian LCGM
(Step 1 &2), except for updating βk, sampling from full conditional of normal distribution is
allowed (shown in Step 3 below) and Gamma prior distribution is selected to update precision
parameter in Gaussian distribution (shown in Step 4 below)
Step 3: Update βk for Gaussian LCGM
The full conditional distribution for 𝛽 𝑘 is
𝜋 ( β
k
|. )= 𝜋 ( β
k
|𝑦 𝑘 , 𝜏 𝑘 2
)= 𝑁 ( 𝜂 𝛽 𝑘 , 𝑉 𝛽 𝑘 ) ,
48
Where 𝑉 𝛽 𝑘 = [𝛴 𝛽 −1
+ 𝜏 𝑘 −2
( 𝑋 𝑘 𝑇 𝑋 𝑘 ) ]
−1
and 𝜂 𝛽 𝑘 = 𝑉 𝛽 𝑘 { 𝛴 𝛽 −1
𝜇 𝛽 + 𝜏 𝑘 −2
( 𝑋 𝑘 𝑇 ∗ 𝑦 𝑘 ) }
For each class k, we update 𝜋 ( β
k
|. ) by sampling from its full conditional distribution.
Step 4: Update τk for Gaussian LCGM
Assuming a Ga(λ, ξ) prior for 𝜏 𝑘 −2
, 𝜏 𝑘 −2
is drawn from its full conditional distribution:
𝜋 ( 𝜏 𝑘 −2
|. )= 𝜋 ( 𝜏 𝑘 −2
|β
k
, 𝑦 𝑘 ) =Ga(λ +
M
𝑘 2
, ξ +
1
2
∗ [𝑦 𝑘 − 𝑋 𝑘 𝑇 β
k
] ∗ [𝑦 𝑘 − 𝑋 𝑘 𝑇 β
k
]), where M
𝑘 is the
number of subjects in each class k.
3.1.4 Multilevel extension
In the CHS and Add Health study, the participants are sampled in clusters, such as communities,
cohorts etc., and potentially inducing dependence among observations from the same clusters.
This nested data structure requires multilevel techniques in order to avoid inefficient estimation
and misleading inferences. In the next step for this research, the simulations will be conducted
based on the multi-level setting, e.g., two-level data, one for individual level, another one for
community level. In the multilevel data structure, we assume that the probability of class
membership is a random variable that varies across the level 2 units. Therefore, the class
parameter ϒk can be further decomposed to be predicted by the second-level covariates with
random intercepts. Similar Metropolis random walk algorithm as (30) will be applied to conduct
the multilevel model, as shown below:
Level 1: logit(𝑃 ( 𝐶 𝑖 = 𝑘 | ϒ
𝑘 ) )= 𝐺 𝑚 ( 𝑘 )
+ 𝑧 𝑖 𝑇 ϒ
𝑘 =𝐺 𝑚 ( 𝑘 )
+ ϒ
𝑘 2
𝑍 1𝑖 + ϒ
𝑘 3
𝑍 2𝑖 + ϒ
𝑘 4
𝑍 3𝑖 , (28)
49
where the 𝑍 1𝑖 , 𝑍 2𝑖 , 𝑍 3𝑖 are corresponding to covariate gender, peer smoking and home smoking,
k=1,…,K-1. For identifiability purposes, the class 1 coefficients 𝐺 𝑚 ( 1)
, ϒ
12
, ϒ
13
, ϒ
14
are set to zero.
Level 2: For a specific level 2 unit m, 𝐺 𝑚 ( 𝑘 )
= 𝜂 0
( 𝑘 )
+ 𝜂 1
( 𝑘 )
𝑠 𝑚 + 𝑈 𝑚 ( 𝑘 )
, where sm denotes the
community level covariate, and k=2, …,K , Um denotes the random deviation from the average
of the specific group m, and m represents the levels of community, which are 1,2,…,10. The
variance of Um will be estimated.
Combined: 𝑃 ( 𝐶 𝑖 = 𝑘 | ϒ
𝑘 , 𝜂 𝑘 )=
𝑒 𝜂 0
( 𝑘 )
+𝜂 1
( 𝑘 )
𝑠 𝑚 +𝑈 𝑚 ( 𝑘 )
+𝑧 𝑖 𝑇 ϒ
𝑘 ∑ 𝑒 𝜂 0
( 𝑘 )
+𝜂 1
( 𝑘 )
𝑠 𝑚 +𝑈 𝑚 ( 𝑘 )
+𝑧 𝑖 𝑇 ϒ
𝑘 𝐾 ℎ=1
, (29)
where m denotes the level 2 unit, Um denotes the random effect for level 2 unit. Thus the
posterior distribution of ϒk in multilevel model becomes,
𝜋 ( Υ
𝑘 |. )∝ ∏ (
𝑒 𝜂 0
( 𝑘 )
+𝜂 1
( 𝑘 )
𝑠 𝑚 +𝑈 𝑚 ( 𝑘 )
+𝑧 𝑖 𝑇 ϒ
𝑘 ∑ 𝑒 𝜂 0
( 𝑘 )
+𝜂 1
( 𝑘 )
𝑠 𝑚 +𝑈 𝑚 ( 𝑘 )
+𝑧 𝑖 𝑇 ϒ
𝑘 𝐾 ℎ=1
)𝑁 ( Υ
𝑘 ; 0,0.1𝐼 )
𝑁 𝑖 :𝐶 𝑖 =𝑘 𝑁 ( 𝑈 𝑚 ; 0, 0.1𝐼 ) (30)
The same normal prior distribution can be used for ϒk and for the random effect of level 2.
Similar multivariate t3(sgTk) proposal distribution can be used to sample ϒk.
By replacing 𝑃 ( 𝐶 𝑖 = 𝑘 | ϒ
𝑘 , 𝜂 𝑘 ) into (21), total likelihood can be obtained from the following
𝑃 ( 𝑌 |Ѳ)= ∏ ∑[(
𝑒 𝜂 0
( 𝑘 )
+𝜂 1
( 𝑘 )
𝑠 𝑚 +𝑈 𝑚 ( 𝑘 )
+𝑧 𝑖 𝑇 ϒ
𝑘 ∑ 𝑒 𝜂 0
( 𝑘 )
+𝜂 1
( 𝑘 )
𝑠 𝑚 +𝑈 𝑚 ( 𝑘 )
+𝑧 𝑖 𝑇 ϒ
𝑘 𝐾 ℎ=1
){∏ [
1
1 + 𝑒 −( 𝑥 𝑖 𝑇 𝛼 𝑘 )
+
1
𝑒 𝑒 𝑥 𝑖 𝑇 𝛽 𝑘 +𝑏 𝑖𝑘
( 𝑒 𝑥 𝑖 𝑇 𝛼 𝑘 + 1)
]
𝑦 𝑖𝑡
=0
𝐾 𝑘 =1
𝑁 𝑖 =1
∏
𝑒 ( 𝑥 𝑖 𝑇 𝛽 𝑘 +𝑏 𝑖𝑘
) 𝑦 𝑖𝑡
𝑦 𝑖𝑡
!𝑒 𝑒 𝑥 𝑖 𝑇 𝛽 𝑘 +𝑏 𝑖𝑘
( 𝑒 𝑥 𝑖 𝑇 𝛼 𝑘 +1)
𝑦 𝑖𝑡
>0
}
50
3.2 Simulation results
To evaluate whether our proposed Bayesian LCGM model can perform well in a relatively
simple case, for example when dealing with continuous normally distributed outcome, we first
start by conducting simulation studies of Bayesian Gaussian LCGM using the Metropolis
random walk algorithm. Then, we will proceed with a simple case of ZIP LCGM model where
the extra zero probability parameters are fixed. After demonstrating that the proposed model
works well in these two less complex cases, the extra zero probability parameter will be
estimated using Metropolis random walk algorithm as described in step 3 in chapter 3.1.2.
3.2.1 Simulation results of Gaussian LCGM
We simulated a total of 1000 subjects each with observations at 3 time points. The parameter
estimates for Gaussian LCGM using WinBUGS and R based on 100 simulated datasets with
20,000 iterations and 10,000 burn-in are shown in Tables 4-5 below, where both approaches
yield equivalently satisfying results in terms of the accuracy of estimation and convergence of
posterior trace plots.
Table 4 The parameter estimates for Gaussian mean (and SD) for Gaussian LCGM
βk C1_intcep C2_intcep C3_intcep C1_slope C2_slope C3_slope
True 0.25 0.15 0.95 0.85 1.95 1.75
Estimated using
R2WinBUGS
0.23 (0.02) 0.16 (0.01) 0.93 (0.02) 0.87(0.01) 1.89(0.04) 1.78 (0.02)
Estimated using R 0.24 (0.01) 0.16 (0.01) 0.98(0.02) 0.84(0.01) 1.94(0.01) 1.74 (0.01)
Table 5 The parameter estimates predicting class membership (and SD) for Gaussian LCGM
ϒk C2_intcep C2_slope C3_intcep C3_slope
True -0.35 0.25 -0.45 0.30
Estimated using
R2WinBUGS
-0.37 (0.02) 0.27 (0.02) -0.46 (0.01) 0.30 (0.01)
Estimated using R -0.33 (0.01) 0.23 (0.02) -0.47 (0.01) 0.28 (0.02)
51
Figure 2 Trace plots for posterior class probability estimates for Gaussian LCGM
52
As indicated by the estimation results and trace plots, the proposed model as implemented in R
yields equivalent results to the one implemented in WinBUGS, and the estimators from both
approaches converged well. Besides, both approaches are efficient in terms of computational
time. We can conclude that in the simple Gaussian LCGM setting, our proposed model works at
least as well as the one using the Gibbs Sampling approach in WinBUGS.
Since multinomial logistic regression is also used in Mplus and proc traj to predict class
membership probability parameters, the parameter estimates for class probability using our
proposed approach and that in Mplus and proc traj are expected to be similar. Thus, we also fit
Figure 3 Trace plots of posterior mean estimates for Gaussian LCGM
53
the 3-class Gaussian LCGM model on both proc traj and Mplus, the parameter estimates are
shown to be similar to each other regardless of which approach was used (Table 12), indicating
the our proposed Bayesian approach is comparable with the frequentist approach in terms of
estimating the latent class probabilities.
Table 6 Comparison of Gaussian LCGM model using R, proc traj and Mplus
ϒk C2_intcep C2_slope C3_intcep C3_slope
True -0.35 0.25 -0.45 0.30
Estimated using R -0.33 0.23 -0.47 0.28
Estimated using proc traj -0.36 0.30 -0.50 0.29
Estimated using Mplus -0.37 0.28 -0.50 0.32
3.2.2 Simulation results for single level ZIP LCGM
In order to better understand the relationship between parameters with respect to estimation, we
first simplify the estimation procedure by excluding the parameters α
k
, which are used to predict
the extra-zero probability for each class. Instead, we fix the extra-zero probabilities according to
the constraints in (27)-(29). Hence, they are specified as follows:
Table 7 Extra-zero probabilities for each class and each time points
Time 1 Time 2 Time 3
Class 1 0.7866 0.6874 0.5875
Class 2 0.3396 0.2817 0.1993
Class 3 0.2821 0.1577 0.0807
The above extra-zero probabilities are selected based both on the constraints in (27)-(29), as well
as in order to ensure that the mean count for each class will be 0.8, 2.85 and 5.5. The mean count
for each class is chosen arbitrarily but following the condition of an increasing trend.
54
In this particular model, in order to avoid having extreme large Poisson mean values and then
highly likely zero likelihood that was likely caused by large 𝛽 k
and random effect bi, and
extreme probabilities of class membership that are caused by large 𝛶 k
, 𝛽 k
are sampled from 0 to
1 for slopes and -1 to 0 for intercept, 𝛶 k
are sampled from 0 to1 for slopes and -1 to 0 for
intercept, bi are sampled from the normal distribution.
The estimated Poisson mean parameter 𝛽 k
and the class probability parameter 𝛶 k
based on
40,000 iterations of MCMC simulation using Metropolis random walk algorithm are shown in
Tables 3-4 below. The first row of results denotes the starting value when simulating the dataset,
which are the true values, the second row presents the estimated results using σ
k
2
= 0.1 for
random intercept and sg=10 in the proposal distribution for ϒk:
Table 8 Estimated ϒk (and SD) using Metropolis random walk for class 2 and 3
𝜰 𝒌 Class 2 Class 3
Intercept Slope Intercept Slope
True -0.35 0.25 -0.45 0.15
Estimated -0.42 (0.03) 0.24 (0.01) -0.50 (0.02) 0.16 (0.01)
Table 9 Estimated βk (and SD) using Metropolis random walk for all 3 classes
𝜷 𝒌 Class 1 Class 2 Class 3
Intercept Slope Intercept Slope Intercept Slope
True -0.65 0.15 -0.25 0.45 -0.25 0.60
Estimated -0.62 (0.02) 0.19 (0.01) -0.28 (0.03) 0.46 (0.01) -0.27 (0.01) 0.61 (0.01)
55
Most of our parameter estimates are quite close to the true values, indicating our estimates have
quite small bias, and also have small standard deviation, reflecting that the results of our model
are stable across different simulations. The trace plot of ϒ for class 2 and 3 and the trace plot for
β for all classes for the 40,000 iterations are shown in Figures 2 and 3; These plots indicate that
our estimates parameters using MCMC Metropolis random walk algorithm have good
convergence.
To examine how well the proposed model classifies the latent groups, we tabulate the true class
from simulation and the estimated latent class in Table 9, the correct classification can be
identified from checking the diagonal of Table 9, and from the bar plot comparing the true class
and estimated class for all individuals, the proposed model can correctly classify 90% of the
individual to their corresponding true latent group. We have also tried to evaluate if the
difference of Poisson mean in each class affects the accuracy of classification of our model. We
have simulated another set of datasets with mean count of 0.8, 10 and 29.5 for latent classes 1, 2
and 3 respectively. The resulting cross tabulation of true and estimated class membership is
shown in Table 10. Clearly, we can obtain a higher accuracy of classification of about 93% using
data with larger mean difference in each class.
Table 10 cross tabulation of true and estimated class membership
True Class 1 True Class 2 True Class 3
Estimated Class 1 392 18 1
Estimated Class 2 54 305 4
Estimated Class 3 1 18 207
56
Figure 4 Cross tabulation of true and estimated class membership
Table 11 The comparison of classification for data that has larger Poisson mean difference
True Class 1 True Class 2 True Class 3
Estimated Class 1 403 10 1
Estimated Class 2 42 300 1
Estimated Class 3 1 16 226
Figure 5 Estimates for class-specific parameter ϒk for one-level ZIP LCGM (simulation)
0
50
100
150
200
250
300
350
400
450
500
1 2 3
True Class Estimated Class
57
Figure 6 Estimates for class-specific βk for one-level ZIP LCGM (simulation)
We have shown that our proposed ZIP LCGM model performs well when fixing the extra zero
probability parameters αk for all classes. Next, we proceed to develop another simulation study
that includes αk as the parameter of interest and is estimated using Metropolis random walk
procedure described in step 3 in chapter 3.1.2.
Table 12 Extra zero prob estimates (and SD) for ZIP LCGM (simulation)
αk Class 1 Class 2 Class 3
Intercept Slope Intercept Slope Intercept Slope
TRUE 0.55 -0.4 0.35 -0.35 0.15 -0.45
Estimated 0.51 (0.02) -0.42 (0.02) 0.33 (0.02) -0.42 (0.03) 0.12 (0.01) -0.48 (0.02)
58
Figure 7 Trace plot for extra zero probability parameter for one-level ZIP LCGM (simulation)
The estimated αk for all latent classes are relatively accurate with reasonable 95% HPD interval,
the corresponding posterior trace plots also indicate good convergence to the stationary
distribution.
3.2.3 Multi-level extension of LCGM
To evaluate the performance of our proposed model on multilevel longitudinal data structure, we
conducted a simulation study that includes the individual level data nested within communities.
The simulated datasets have a sample size of 2000, each with four repeated observations, and
individuals are randomly assigned to 10 different communities. The binomial and Poisson
components contained class-specific intercept and linear age. To simulate data as close to Add
Health data as possible, Age at four time points are simulated as Age0 ~N(15.6, 1.6), Age1 =Age0
59
+N(0.91, 0.14), Age2 = Age1 + N(5.45, 0.22), Age3 = Age2+ N(6.51, 0.3), to reflect baseline age
and time intervals among these four occasions. A standardized age variable with four time
points is used to predict the Poisson mean parameter βk and extra zero probability parameter αk.
Three other level 1 covariates (gender, peer smoking and home smoking) are also generated to be
associated with class membership probabilities. We generate level 2 covariate tobacco growing
state (Yes or No) to be related to class membership probabilities. A level 2 random effect
variable U that follows a normal distribution is also generated to reflect the variability between
communities. We then fitted a three-class multilevel ZIP LCGM model to the simulated data.
Figure 8 Trace plot for Level 2 covariate parameter for multilevel ZIP LCGM (simulation)
60
Figure 9 Trace plot for extra zero prob parameter for multilevel ZIP LCGM (simulation)
Table 13 Extra zero prob estimates (and SD) for multilevel ZIP LCGM (simulation)
αk Class 1 Class 2 Class 3
Intercept Slope Intercept Slope Intercept Slope
TRUE 0.95 -0.50 0.60 -0.35 0.15 -0.45
Estimated 0.94 (0.01) -0.53 (0.01) 0.59 (0.01) -0.42 (0.03) 0.12 (0.01) -0.48 (0.02)
Table 14 Level 2 covariate estimates (and SD) for multilevel ZIP LCGM (simulation)
ηk (level 2 covariate) Class 2 Class 3
Intercept Slope Intercept Slope
True -1.20 0.40 -1.50 0.90
Estimated -1.19 (0.01) 0.42 (0.01) -1.52 (0.02) 0.91 (0.01)
Table 15 Poisson mean estimates (and SD) for multilevel ZIP LCGM (simulation)
βk Class 1 Class 2 Class 3
Intercept Slope Intercept Slope Intercept Slope
TRUE -0.90 0.20 -0.80 0.50 -0.60 0.8
Estimated -0.97 (0.01) 0.24 (0.01) -0.83 (0.02) 0.52 (0.03) -0.64 (0.01) 0.82 (0.02)
61
Table 16 Class prob estimates (and SD) for multilevel ZIP LCGM (simulation)
γk Class 2 Class 3
Male Home_smk Peer_smk Male Home_smk Peer_smk
TRUE 0.10 0.90 0.15 -0.30 0.59 0.43
Estimated 0.12 (0.01) 0.93 (0.02) 0.17 (0.01) -0.27 (0.01) 0.61 (0.02) 0.44 (0.01)
In this multilevel ZIP LCGM, we consider the probability of class membership to be a random
variable that is also varying across the level 2 covariate, which could reflect whether the
community is belonging to a tobacco growing state, for example. The proposed multilevel ZIP
LCGM model has shown good performance in terms of accuracy and convergence of the
estimates.
3.2.3 Label-switching evaluation
We carefully examined the MCMC output for both single and multi-level ZIP LCGM, single
level Gaussian LCGM and multilevel ZIP LCGM, however, we found no evidence of label
62
switching problem in our models. There is no need to further apply label.switching package to
make correction at this stage.
Figure 10 Trace plot of poisson mean parameters for multilevel ZIP LCGM (simulation)
The parameters for Multilevel ZIP LCGM model are shown in Figure. 10, the values for
parameters from three classes generated after burn-in are quite separated and there is no sign of
values jumping from one class to another. It is possible that the inclusion of multiple class
membership covariates, which are significant predictors of class membership can help to correct
the potential label switching issue.
3.3 Application to Add Health data (Real Data Analysis)
Compared to CHS, the Add Health data is collected from the longitudinal, nationally
representative and school based study with larger total sample size (N=2923), as well as a much
smaller proportion of missing observations, the total number of subjects that have complete four
waves of information on the number of cigarette smoked per day is 2909. We evaluate our
proposed individual-level LCGM model by applying the Add Health data with N=2909 subjects
in total that have complete 4 observations.
63
We calculate the mean age for each of the four waves are 15.54, 16.45, 21.90 and 28.41. The
average daily cigarette smoking for each latent groups are shown in the Table 17 below.
Table 17 Distribution of smoking for all three latent classes (Add Health)
Mean
age
Class 1 Mean
cigarettes/day
Class 2 Mean
cigarettes/day
Class 3 Mean
cigarettes/day
Wave I 15.54 0.06 1.19 5.91
Wave II 16.45 0.12 1.84 7.4
Wave III 21.9 0.05 3.92 14.22
Wave IV 28.41 0.09 4.21 13.3
Table 18 Estimates and 95% HPD interval for Poisson mean parameter (Add Health)
βk Class 1 Class 2 Class 3
Intercept Slope Intercept Slope Intercept Slope
Estimated -2.11 0.1 1.58 0.07 2.68 0.13
95% HPD
Interval
(-2.18, -2.04) (-0.05, 0.27) (1.56, 1.60) (-0.01, 0.15) (2.67, 2.70) (0.09, 0.18)
Table 19 Estimates, 95% HPD interval, Odds Ratios and 95% CI for OR (Add Health)
ϒk Class 2 Class 3
Intercept Male Home_smk Peer_smk Intercept Male Home_smk Peer_smk
Estimated -1.4 0.22 0.24 0.58 -2.91 0.65 1.11 0.92
95% HPD
Interval
(-1.53, -
1.25)
(0.05,
0.37)
(0.07, 0.43) (0.48, 0.68)
(-3.00, -
2.78)
(0.44,
0.83)
(0.92, 1.35) (0.82, 1.01)
Est. OR 0.25 1.25 1.27 1.79 0.05 1.92 3.03 2.51
95% CI for
OR
(0.22,
3.49)
(1.05,
1.45)
(1.07, 1.54) (1.62, 1.97) (0.04, 0.06)
(1.55,
2.29)
(2.51, 3.86) (2.27, 2.75)
Table 20 Estimates and 95% HPD interval for extra zero probability parameter (Add Health)
αk Class 1 Class 2 Class 3
Intercept Slope Intercept Slope Intercept Slope
Estimated -0.28 0.59 -0.24 0.57 -0.31 0.67
95% HPD
interval
(-0.36, -0.18) (0.49, 0.66) (-0.38, -0.11) (0.46, 0.69) (-0.48, -0.15) (0.52, 0.82)
64
Gender, home smoking and peer smoking are confirmed as risk factors to predict the latent class
probability in Add Health data by using proc traj in SAS, and are further confirmed using our
Bayesian approach. In this application to our LCGM model, the coefficient ϒk for predicting
latent class probability can be interpreted as odds ratio of belong to each of the latent groups
compared to the first group (light smoking group) after taking exponential. From Table 17, we
can conclude that compared to female subjects, male subjects are 25% more likely to be
classified into the moderate smoker group than the light smoker group, and they are 92% more
likely to be classified in to the heavy smoker group than light smoker group. The 95%
confidence intervals for the odds ratio transferred from 95% HPD interval do not contain the null
value 1, which verifies gender, home smoking and peer smoking are significant risk factors that
are associated with latent class probability using our model. We also plot the mean smoking
trajectories for each latent class separately by averaging the number of cigarette smoked per day
for each latent class at the mean age of each of the four waves. This is an important output from
our proposed ZIP LCGM model that provide information of different smoking phenotypes of the
Add Health population, and also estimates the corresponding probabilities for each latent classes
to inform the composition of this smoking population.
65
Figure 11 Averaged smoking trajectories for Add Health data by latent classes (3-class)
In Figure 11, the green, blue and red color represents the overall average smoking level
trajectories for light smokers, moderate smokers and heavy smokers. The posterior probability of
class membership is indicated below the chart. 57% of the Add Health population are classified
into never/light smokers, and 24% of them are classified as moderate smokers that is
characterized by a low initial level of cigarette smoking, a relatively low level of smoking until
late 20s, and 19% are classified as heavy smokers that is characterized by a relatively high initial
level of smoking and then a decreasing trend on the level of smoking from adolescence to
adulthood.
We have also compared our parameter estimates for predicting class probabilities with the
parameter estimates using proc traj and Mplus on Add Health data (see Table 19). Our model is
shown to be comparable to proc traj and Mplus again, specifically in predicting the class
membership.
Table 21 Comparisons of parameter estimates for class probability (proc traj vs Mplus)
Class Parameters Proposed model Proc traj Mplus
66
2
ϒ 21 (Intercept) -1.4 -1.39 -1.36
ϒ 22 (male) 0.22 0.21 0.22
ϒ 23 (peer_smk) 0.58 0.69 0.59
ϒ 24 (home_smk) 0.24 0.25 0.25
3
ϒ 31 (Intercept) -2.91 -2.85 -3.01
ϒ 32 (male) 0.65 0.70 0.73
ϒ 33 (peer_smk) 1.11 0.95 1.13
ϒ 34 (home_smk) 0.92 1.04 0.99
To determine the appropriate number of latent class for Add Health data using our approach, we
also fit 2-class and 4-class LCGM model to the Add Health data to compare the results with 3-
class model (see Fig 12-13 and Table 19-20).
Figure 12 Averaged smoking trajectories for Add Health data by latent classes (2-class)
Table 22 Parameter estimates for 2-class LCGM (Add Health data)
Class (%) Model component Parameter Posterior Mean 95% HPD interval
67
1 (67%) Binomial α 11 (Intercept) 1.20 (1.15, 1.25)
α 12 (Slope) -0.25 (-0.37, -0.12)
Poisson β 11 (Intercept) 1.18 (1.16, 1.22)
β 12 (Slope) 0.28 (0.19, 0.37)
2 (33%) Binomial α 21 (Intercept) -0.34 (-0.41, -0.28)
α 22 (Slope) -0.20 (-0.36, -0.05)
Poisson β 21 (Intercept) 2.61 (2.60, 2.62)
β 22 (Slope) 0.19 (0.15, 0.24)
Class membership ϒ 21 (Intercept) -2.85 (-3.06, -2.68)
ϒ 22 (male) 0.88 (0.73, 1.05)
ϒ 23 (peer_smk) 0.72 (0.64, 0.82)
ϒ 24 (home_smk) 1.2 (1.01, 1.40)
Table 23 Parameter estimates for 4-class LCGM (Add Health data)
Class (%) Model component Parameter Posterior Mean 95% HPD interval
1 (54%) Binomial α 11 (Intercept) 0.97 (0.80, 1.13)
α 12 (Slope) -0.02 (-0.21, 0.16)
Poisson β 11 (Intercept) -1.38 (-1.52, -1.24)
β 12 (Slope) 0.07 (-0.11, 0.24)
2 (20%) Binomial α 21 (Intercept) -0.28 (-0.36, -0.20)
α 22 (Slope) -0.12 (-0.28, 0.05)
Poisson β 21 (Intercept) 1.12 (1.08, 1.15)
β 22 (Slope) 0.22 (0.12, 0.31)
Class membership ϒ 21 (Intercept) -1.49 (-1.64, -1.35)
ϒ 22 (male) 0.07 (-0.07, 0.26)
68
ϒ 23 (peer_smk) 0.64 (0.55, 0.73)
ϒ 24 (home_smk) 0.06 (-0.11, 0.22)
3 (15%) Binomial α 31 (Intercept) -0.56 (-0.65, -0.48)
α 32 (Slope) -0.25 (-0.43, -0.08)
Poisson β 31 (Intercept) 2.21 (2.20, 2.23)
β 32 (Slope) 0.18 (0.12, 0.24)
Class membership ϒ 31 (Intercept) -2.29 (-2.51, -2.10)
ϒ 32 (male) 0.51 (0.28, 0.70)
ϒ 33 (peer_smk) 0.75 (0.64, 0.85)
ϒ 34 (home_smk) 0.53 (0.30, 0.76)
4 (11%) Binomial α 41 (Intercept) -0.57 (-0.67, -0.47)
α 42 (Slope) -0.24 (-0.42, -0.07)
Poisson β 41 (Intercept) 2.85 (2.84, 2.87)
β 42 (Slope) 0.14 (0.08, 0.19)
Class membership ϒ 41 (Intercept) -3.64 (-4.04, -3.36)
ϒ 42 (male) 0.76 (0.55, 1.00)
ϒ 43 (peer_smk) 0.93 (0.80, 1.06)
ϒ 44 (home_smk) 1.42 (1.11, 1.75)
69
Figure 13 Averaged smoking trajectories for Add Health data by latent classes (4-class)
Table 24 DIC statistics for 2-4 class ZIP LCGM model (Add health data)
ZIP LCGM model 𝑫 ( 𝜭 )
̅ ̅ ̅ ̅̅ ̅̅
PD DIC
2-class 39484.08 -1750.38 37733.70
3-class 37374.89 -2612.80 34762.09
4-class 36301.18 -2347.38 33953.80
Obviously, the 4-class solution gives us more information about the types of smoking behaviors,
especially in heavy smoking group. We have also computed the DIC statistics (Table 24) to
evaluate the model fit and complexity of 2-4 class ZIP LCGM model. Noticing that the 4-class
model has the smallest DIC among the three models, the DIC statistics change the most from 2-
class model to 3-class model, but decrease slightly from 3-class model to 4-class model,
indicating little improvement of model fit from 3-class to 4-class models. we have also fit a 5-
class ZIP LCGM model, but there is not much change in DIC, and the 5-class model is poorly
70
identified and failed to converged, the 4-class solution is therefore preferred for fitting this ZIP
LCGM on Add Health data.
3.4 Application of multilevel LCGM on MACC data (Real Data Analysis)
Due to the limited access of multilevel Add Health data, we applied the multilevel LCGM model
on the data from the Minnesota Adolescent Community Cohort (MACC) Study (Forster et.al,
2011), which is a population-based, longitudinal study that enrolled 3636 youth from Minnesota
and 605 youth from comparison states age 12 to 16 years (baseline age) in 2000-2001.
Participants were surveyed by telephone semi-annually about their tobacco-related attitudes and
behaviors until 2013. The primary aim of the MACC study was to evaluate local tobacco control
programs in Minnesota, and the secondary aim was to better define the patterns of development
of tobacco use in young adults and to evaluate the effects of tobacco control programs on these
patterns.
The MACC Study employs a multilevel design, including a cohort sequential design of five ages
at the individual level where individuals are surveyed every six months. These individuals are
nested in GPUs (geo-political units) and there are longitudinal
observations at the GPU level every six months as well. Finally, individuals and GPUs were
randomly assigned a specific month within six-month for each observation; thus each month
constitutes an observation in a time series design for the state of Minnesota.
A combination of probability and quota sampling methods (to assure equal age distribution) was
used to establish the cohort. These participants included 12 of each age from 12 to 16 years old
from each of the 60 Minnesota GPUs. Recruitment was conducted by telephone every 6 months
71
by Clearwater Research, Inc., using modified random digit dial (MOD1) sampling to identify
households with at least one teenager in the target age range within the target GPU. Within
households, respondents were selected at random from among age quota cells that were still open
for that GPU. The multilevel design of the MACC Study provides the power to examine the
effects of community-level programs, policies and other contextual factors and changes in these
factors over time on patterns of smoking uptake. Designing sampling units (GPUs) can facilitate
standardize the size of the population of each unit to reflect the variability in smoking by
geographical area, and to take into account the units of program delivery and policy coverage. In
addition, the MACC Study provides insight into patterns of initiation, progression to heavier
smoking and smoking cessation in young adults. The major topics covered by the survey are
cigarette use, nicotine dependence, alcohol use and dependence, cigarette access, quitting
smoking, use of other tobacco products and marijuana, parent smoking habits, tobacco
marketing, emotions and stress perceptions, and perceptions and opinions of smoke-free laws,
tobacco companies, and tobacco age restrictions.
Since no level 2 covariate are available in MACC data, we only include the random effect Um
(m=65) for 65 GPUs in our proposed multilevel model (30). The random effect Um is assumed to
follow a normal distribution with mean 0 and standard deviation of 0.1. Although each
participant in MACC study are followed every 6 months for about 10 years in total (at most 20
time points), there are large proportion of missing for our smoking variable (the number of
cigarette smoked on the days you smoke). In order to maximize the use of complete
observations, we apply our multilevel model on the MACC data restricted to the sample
(N=2873) that has 5 age waves from 15 years to 19 years, as well as they need to have at least 3
72
complete observations (only up to 2 missing values per individual allows), this will guarantee we
have sufficient amount of subjects with complete smoking information for each of the 65 GPUs.
For the subjects missing the number of cigarette smoked on the days when they smoked, but
have reported never smokers, their missing values are replaced by 0. The missing observations
for smokers are replaced using the mean value of all the complete observations for that
individual (Table 25). The distribution of the smoking data with and without missing (see Table
25 and Table 26) are relatively similar in term of the mean number of cigarette smoked per day.
The proportion of observed zeros increased due to large percentage of never smokers in this
population, most of the missing values are filled with zeros. Gender is not a significant predictor
for class membership in MACC data using proc traj (p>0.14), therefore is not included in the
model.
Table 25 The summary statistics of MACC smoking data (including missing)
Age N Cigarettes/day (Mean/SD) Observed zeros (%) % Missing
15 2134 0.23 (1.35) 71.71 19.49
16 2636 0.42 (1.99) 77.49 8.28
17 2675 0.80 (2.87) 70.74 6.92
18 2599 1.28 (3.68) 61.93 9.57
19 2425 1.61 (4.04) 52.99 15.62
73
Table 26 The summary statistics of MACC smoking data (without missing)
Age N Cigarettes/day (Mean/SD) Observed zeros (%)
15 2873 0.57 (2.39) 82.36
16 2873 0.69 (2.67) 80.24
17 2873 0.88 (2.95) 74.01
18 2873 1.22 (3.56) 68.30
19 2873 1.43 (3.80) 65.27
Table 27 Descriptive statistics for covariates predicting latent class membership
Covariates N (%)
Peer Smoking 0 peer smokers 381 (59.01)
1 peer smokers 587 (19.16)
2 peer smokers 699 (10.74)
3+ peer smokers 1206 (11.08)
Household smoking Yes 1656 (57.64)
No 1217 (42.36)
Table 28 Parameter estimates of multilevel ZIP LCGM model (MACC data)
Class (%) Model component Parameter Posterior Mean 95% HPD interval
1 (67%) Binomial α 11 (Intercept) 1.05 (0.87, 1.25)
α 12 (Slope) -0.06 (-0.15, 0.01)
Poisson β 11 (Intercept) -1.41 (-1.57, -1.24)
β 12 (Slope) -0.12 (-0.18, -0.05)
2 (25%) Binomial α 21 (Intercept) -0.39 (-0.56, -0.21)
α 22 (Slope) -0.87 (-0.98, -0.76)
Poisson β 21 (Intercept) -0.56 (-0.65, -0.49)
β 22 (Slope) 0.24 (0.21, 0.26)
74
Class membership ϒ 21 (Intercept) -2.21 (-2.38, -2.05)
ϒ 23 (peer_smk) 0.48 (0.42, 0.54)
ϒ 24 (home_smk) 0.29 (0.13, 0.48)
3(7%) Binomial α 21 (Intercept) -0.17 (-0.33, -0.01)
α 22 (Slope) -0.40 (-0.47, -0.34)
Poisson β 21 (Intercept) 1.84 (1.78, 1.88)
β 22 (Slope) 0.16 (0.14, 0.17)
Class membership ϒ 21 (Intercept) -3.80 (-4.12, -3.51)
ϒ 23 (peer_smk) 0.61 (0.51, 0.69)
ϒ 24 (home_smk) 0.43 (0.18, 0.70)
Table 29 Estimated OR and 95% CI of multilevel ZIP LCGM model (MACC data)
ϒ k Class 2 Class 3
Intercept Home_smk Peer_smk Intercept Home_smk Peer_smk
Estimated -2.21 0.48 0.29 -3.80 0.61 0.43
95% HPD
Interval
(-2.38, -2.05) (0.42, 0.54) (0.13, 0.48) (-4.12, -3.51) (0.51, 0.69) (0.18, 0.70)
Est. OR 0.11 1.62 1.34 0.02 1.84 1.54
95% CI for OR (0.09, 0.13) (1.52, 1.72) (1.14, 1.62) (0.01, 0.03) (1.66, 1.99) (1.20, 2.01)
75
Figure 14 Trace plot of Poisson mean parameter estimates of multilevel ZIP LCGM (MACC)
76
Figure 15 Trace plot of extra zero probability parameter for multilevel ZIP LCGM (MACC)
Figure 16 Averaged smoking trajectories for MACC data by posterior latent classes
77
The parameter estimates in Table 25 and the trace plots in Figure 14-15 indicates relatively good
fit of our proposed model on MACC data. Figure 16 shows the averaged cigarette smoking
trajectory by posterior latent class assignments for MACC data using a 3-class multilevel
Bayesian LCGM model. 67% of the MACC population are classified into never/light smokers,
and 26% of them are classified as moderate smokers that is characterized by a low initial level of
cigarette smoking, a relatively low level of smoking until late 20s, and only a very small
proportion about 7% of them are classified as heavy smokers that is characterized by a relatively
high initial level of smoking and then a increasing trend on the level of smoking through age 19.
Similar to the individual level LCGM model, the coefficient parameter ϒk for predicting latent
class probability for the multilevel LCGM model can also be interpreted as odds ratio of belong
to each of the latent groups compared to the first group (light smoking group) after taking
exponential. The 95% confidence interval for peer smoking and home smoking does not include
the null value, which confirmed they are significant predictors for class membership probabilities
in MACC data.
78
Chapter 4 Discussion and Future research
In this research, we proposed a Bayesian ZIP LCGM model for analyzing longitudinal cigarette
smoking trajectories where there are excess zeros in the data, and also extend it to fit multilevel
data. The goals of this research are not only to characterize individual smoking trajectories and
classify them into different latent groups, but also we expect our proposed model can overcome
the limitations of current frequentist approaches, e.g. maximum likelihood approach that are
implemented in SAS proc traj and MPlus, including the difficulty in convergence, local maxima
etc., as well as the limitation of applying two-part ZIP model via MPlus on smoking trajectory
data, where all subjects with zero counts are forced to be classified into never smoking group.
We demonstrate our proposed model in modeling cigarette smoking trajectories behavior from
early adolescence to adulthood, by conducting simulation studies and application to Add Health
data. We are able to identify three distinct groups of smoking behavior: light smokers, moderate
smokers and heavy smokers. Different smoking behaviors are different not only by the level of
cigarettes smoked, but also by the risk factors related to onset, escalation and leveling off on
smoking. Our proposed Bayesian ZIP LCGM model has several advantages. First, this model is
flexible in modeling both the unobserved time stable heterogeneity by identifying distinct latent
class for each individual, and the time-varying heterogeneity by obtaining distinct zero-inflated
cigarette smoking trajectories. There are other approaches available that can be used to cluster
individuals into different groups, like calculating Euclidean distance using K-means algorithm
(Sarafis et al. (2002)). However, the purpose of this proposed latent class model is to classify the
entire individual trajectory into different groups, not only group similar observations at certain
79
time points, which is fundamentally different than cluster analysis. Similarly, due to our goal of
classifying each individual into one time-stable latent group, the Latent Transitional model
(Lanza and Linda, 2008) which studies the transition of class membership is not appropriate to
meet our research purpose. Second, this model can also allow individual risk factors to be
included in the model and influence the latent class membership. Third, the joint estimation of
the class membership and risk factors is superior to the traditional two-part approach which does
not take into account the uncertainty of the class membership. Fourth, the proposed model allows
for examination of class membership probabilities and latent class trajectories by covariate
profile, for example, we are able to plot the averaged smoking trajectory by gender, ethnicity and
household smoking. In addition, our proposed model extends the individual level model to
multilevel model, which takes into account for the dependence between subjects within the same
level 2 unit via random effects and allows the class membership probability to be varying across
level 2 covariates. The multilevel model has also been evaluated on MACC multilevel data, the
trace plots for all parameter estimates indicate good convergence. Although there are no level 2
covariates included in MACC data, our proposed model can easily incorporate higher level
covariates, as shown in our simulation studies for multilevel ZIP LCGM. The process of
selecting the appropriate number of latent classes to fit Add Health data has also been included in
our proposed model. DIC statistics are calculated to compare the model fit and complexity of 2-
class, 3-class and 4-class model. Our proposed model has also shown compatibility with Mplus
and proc traj in terms of predicting the class membership probabilities for each latent class. Our
parameter estimates are shown to be similar to that from proc traj and Mplus. But our approach is
superior to the Mplus 2-part model specifically in fitting smoking trajectory Zero-inflated
Poisson data. In Mplus implemented models, the individuals with zero count in cigarette
80
smoking are directly forced into the never smoking group, but it is common in longitudinal
smoking data that there may be smoking cessation going on at any time, thus the smoking
behaviors for a smoker can stop and relapse frequently at any time points - a fact that is carefully
taken care of in our model (as we do not force them into the “never smoking” group) but not in
Mplus. Our model has better convergence properties compared to those based on the LCGM
estimated using maximum likelihood approach, in terms of the convergence of our approach is
less sensitive to starting values, and can still maintain good convergence even including 3
covariates, and allows categorical covariates with more than 2 categories (e.g. peer smoking),
which cannot be implemented in proc traj.
One limitation with our data application is that the Add Health data are collected using a cohort
sequential design. The baseline age range from 13-21 years and only four measurements are
provided for each subject with different time intervals. Although there is overlap in ages between
different cohorts, each cohort contributes to a different segment of the overall curve. It is
possible that a trajectory for the whole age range is biased due to the small number of
measurements for each subject. For future analysis, the baseline age (i.e. the cohort effect) could
be included in the model by either affecting the class membership probability or as a random
effect. In addition, we only include the linear time effect to model the smoking trajectory in our
current approach. In future research, a nonlinear (e.g., quadratic or other parametric or non-
parametric functions) time effect can also be included and evaluate the fit of the model in
relation to linearity.
81
It worth mentioning that with increasing the complexity of current model, e.g. adding more
levels to the current multi-level model structure, may result in slower MCMC estimation and
produce chains that exhibit poor mixing. Alternative Bayesian approach that implemented in
STAN (Sorensen & Vasishth, 2015), which could potentially be used to fit LCGM should also be
fully explored in the near future.
Besides, our current proposed Bayesian approach can only deal the data with complete
observations for each subjects. In our application to MACC data, we replace the missing data
using the mean observations for each participant in order to obtain sufficiently complete
observations for each level 2 units, this could potentially introduce bias. In the future, more
research should be done to explore the missing data imputation approach specifically for
longitudinal data. Another future plan is to test our proposed model under abnormal smoking
trajectories. It is common in real life settings that people cease smoking or resume smoking at
random time points.
82
Reference
Abbas Moghimbeigi , Mohammed Reza Eshraghian , Kazem Mohammad &
Brian Mcardle (2008) Multilevel zero-inflated negative binomial regression modeling for
overdispersed count data with extra zeros, Journal of Applied Statistics, 35:10, 1193-1202
Andruff et al. (2005). Latent Class Growth Modeling: A Tutorial. Tutorials in Quantitative
Methods for Psychology; Vol,5 (1), p.11-24
Angers J.F & Biswas A. (2003). A Bayesian analysis of zero-inflated generalized Poisson model.
Computational Statistics & Data Analysis, 42:37-46.
Bauer, D. J., & Curran, P. J. (2003a). Distributional assumptions of growth mixture models:
Implications for over-extraction of latent trajectory classes. Psychological Methods, 8, 338–363.
Bauer, D. J., & Curran, P. J. (2003b). Over-extraction of latent trajectory classes: Much ado
about nothing? Reply to Rindskopf (2003), Muthén (2003), and Cudeck and Henly (2003).
Psychological Methods, 8, 384–393.
Berhane K, Zhang Y, Linn W, et al. The effect of ambient air pollution on exhaled nitric oxide
(FeNO) in southern California schoolchildren. Am J Respir Crit Care Med 2009; 179:A4732.
Brook, J. S., Morojele, N. K., Brook, D. W., & Rosen, Z. (2005). Predictors of Cigarette Use
Among South African Adolescents. International Journal of Behavioral Medicine, 12(4), 207–
217.
Brook DW, Zhang C, Brook JS, Finch SJ (2010). Trajectories of cigarette smoking from
adolescence to young adulthood as predictors of obesity in the mid-30s. Nicotine & Tobacco
Research ;12:263-270.
Centers for Disease Control and Prevention (2015). Tobacco Use Among Middle and High
School Students—United States, 2011–2014. Morbidity and Mortality Weekly Report,
2015;64(14):381–5
Chassin, L., C. C. Presson, J. S. Rose, and S. J. Sherman. 1996. The natural history of cigarette
smoking from adolescence to adulthood: Demographic predictors of continuity and change.
Health Psychology 15 (6): 478–84.
Chen, K., & D. B. Kandel (1995). The natural history of drug use from adolescence to the mid-
thirties in a general population sample. American Journal of Public Health 85 (1): 41–47.
Chen, M. H. and Shao, Q. M. (1999), “Monte Carlo Estimation of Bayesian Credible and HPD
Intervals,” Journal of Computational and Graphical Statistics, 8, 69–92.
Chen, M. H., Shao, Q. M., and Ibrahim, J. G. (2000), Monte Carlo Methods in Bayesian
Computation, New York: Springer-Verlag
83
Colder CR, Flay BR, Segawa E, Hedeker D (2008). Trajectories of smoking among freshmen
college students with prior smoking history and risk for future smoking: Data from the
University Project Tobacco Etiology Research Network (UpTERN) study Addiction;103:1534-
1543.
Dempster A, Laird N, Rubin D (1977). Maximum Likelihood from Incomplete Data via the EM-
Alogrithm. Journal of the Royal Statistical Society, B, 39, 1–38.
Dennis, J. E., D. M. Gay, and R. E. Welsch. (1981). An Adaptive Nonlinear Least-Squares
Algorithm. ACM Transactions on Mathematical Software 7:348-83.
Dennis, J. E. and H. H. W. Mei. (1979). Two New Unconstrained Optimization Algorithms
Which Use Function and Gradient Values. Journal of Optimization Theory and Applications
28:453-83.
Forster et.al (2011). The Minnesota Adolescent Community Cohort Study: Design and Baseline
Results. Prev Sci, 12(2): 201-210
Geman S. , Geman D. (1984). Stochastic Relaxation, Gibbs Distributions, and the Bayesian
Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6 (6):
721–741.
Gelman, A., Hill, J. (2007). Data analysis using regression and multi-level/hierarchical models.
New York: Cambridge University.
Gilliland FD, Islam T, Berhane K, et al. (2006). Regular smoking and asthma incidence in
adolescents. Am J Respir Crit Care Med; 174:1094-100
Henry K.L. & Muthen B.O (2010). Multilevel Latent Class Analysis: An application of
adolescent smoking typologies with individual and contextual predictors. Structural Equation
Modeling, 17:193–215
Hipp, J. R., & Bauer, D. J. (2006). Local solutions in the estimation of growth mixture models.
Psychological Methods, 11, 36–53.
Hastings (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their
Applications". Biometrika 57 (1): 97–109
Heron et al. (2011). Characterizing Patterns of Smoking Initiation in Adolescence: Comparison
of Methods for Dealing With Missing Data. Nicotine & Tobacco Research, Volumn 13 (12):
1266-1275
Janet A. et al. (2004). Identifying and characterizing adolescent smoking trajectories. Cancer
Epidemiology Biomarkers prev 13; 2023
Jasra A., Holmes C., Stephens D. (2005). “Markov chain Monte Carlo Methods and the Label
Switching Problem in Bayesian Mixture Modeling.” Statistical Science, 20, 50-67
84
Jones, B. L., D. Nagin, and K. Roeder. (2001). A SAS Procedure Based on Mixture Models for
Estimating Developmental Trajectories. Sociological Methods & Research 29:374-93.
Jeffreys, H (1935). “Some Tests of Significance, Treated by the Theory of Probability”.
Proceedings of the Cambridge Philosophy Society, 31, 203-222
Jung and Wickrama (2008), An Introduction to Latent Class Growth Analysis and Growth
Mixture Modeling. Social and Personality Psychology Compass 2/1: 302-317
Kim Y, Muthén BO. Two-Part Factor Mixture Modeling: Application to an Aggressive Behavior
Measurement Instrument. Structural equation modeling: a multidisciplinary journal.
2009;16(4):602-624
Kaufman, A. R., & Augustson, E. M. (2008). Predictors of regular cigarette smoking among
adolescent females: Does body image matter? Nicotine & Tobacco Research: Official Journal of
the Society for Research on Nicotine and Tobacco, 10(8), 1301–1309.
Keel, P. K., Fichter, M., Quadflieg, N., Bulik, C. M., Baxter, M. G., Thornton, L., Halmi,
K. A., Kaplan, A. S., Strober, M., Woodside, D. B., et al. (2004). Application of a latent
class analysis to empirically define eating disorder phenotypes. Archives of General
Psychiatry 61, 192–200
Lambert, D. (1992), Zero-Inflated Poisson Regression Models with an Application to Defects in
Manufacturing, Technometrics, 34, 1–14
Lanza, S.T., and Linda M. C.(2008). A new SAS procedure for latent transition analysis:
transitions in dating and sexual risk behavior.Developmental psychology 44.2: 446.
Larid NM & Ware JH (1982). Random-effects models for longitudinal data. Biometrics, 38(4),
963-74
Lee AH, Wang K, Scott JA, Yau KK, McLachlan GJ (2006). Multi-level zero-inflated poisson
regression modelling of correlated count data with excess zeros, Stat Methods Med
Res;15(1):47-61
Leisch F (2004b). FlexMix: A General Framework for Finite Mixture Models and Latent Class
Regression in R. Journal of Statistical Software, 11(8).
Lo, Y., Mendell, N. R., & Rubin, D. B. (2001). Testing the number of components in a normal
mixture. Biometrika, 88, 767–778.
Metropolis et al. (1953). "Equations of State Calculations by Fast Computing
Machines". Journal of Chemical Physics 21 (6): 1087–1092.
Muthe
́ n B. O., & Asparouhov, T. (2009). Multilevel regression mixture analysis. Journal of the
Royal Statistical Society, Series A, 172, 639–657.
Muthe
́ n, B.O. (1998-2004). Mplus Technical Appendices. Los Angeles, CA: Muthe
́ n & Muthe
́ n
85
Nagin, D.S. (1999). Analyzing Developmental Trajectories: A Semiparameteri, Group-Based
Approach. Psychological Methods, Col 4, No.2, 139-157
Nagin, D.S (2005). Group-based modeling of development. Cambrige, MA: Harvard University
Press.
Neelon B, O'Malley AJ, and Normand S-L (2011). A Bayesian two-part latent class model for
longitudinal medical expenditure data: assessing the impact of mental health and substance abuse
parity. Biometrics, 67, 280-289
Ntzoufras I. (2009). Bayesian Modeling Using WingBUGS. A John Wiley & Sons, INC.
Olsen, M. K, & Schafer, J., L. (2001). A two-part random e®ects model for semicontinu- ous
longitudinal data. Journal of the American Statistical Association, 96, 730-745
Orlando M, Tucker JS, Ellickson PL, Klein DJ (2004). Developmental trajectories of cigarette
smoking and their correlates from early adolescence to young adulthood. Journal of Consulting
and Clinical Psychology;72: 400-410. doi:10.1037/0022-006X.72.3.400
Papastamoulis P. (2015). label.switching: An R package for dealing with the Label Switching
problem in MCMC outputs. Journal of Statistical Software
Polasek, W. (2012), Handbook of Markov Chain Monte Carlo edited by Steve Brooks, Andrew
Gelman, Galin Jones, Xiao-Li Meng. International Statistical Review, 80: 184–185
Ram N, Grimm KJ (2009). Growth Mixture Modeling: A Method for Identifying Differences in
Longitudinal Change Among Unobserved Groups. International journal of behavioral
development; 33(6):565-576.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical Linear Models: Applications and Data
Analysis Methods (2nd ed.). Thousand Oaks, CA: Sage Publications
Reboussin, B. A., Song, E.-Y., Shrestha, A., Lohman, K. K., and Wolfson, M. (2006). A
latent class analysis of underage problem drinking: Evidence from a community sample
of 16–20 year olds. Drug and Alcohol Dependence 83, 199–209.
Rindskopf, D. and Rindskopf, W. (1986). The value of latent class analysis in medical diagnosis.
Statistics in Medicine 5, 21–27
Sorensen, T. and Vasishth, S. (2015). Bayesian linear mixed models using Stan: A tutorial for
psychologists, linguists, and cognitive scientists, submitted to Psychological Methods (Special
Issue on Bayesian Data Analysis)
Sarafis, A.M.S. Zalzala, P.W. Trinder (2002). A genetic rule-based data clustering
toolkit", Evolutionary Computation. CEC '02. Proceedings of the 2002 Congress on, vol. 2, pp.
1238-1243 vol.2.
86
Simons-Morton B (2007). Social influences on adolescent substance use. American Journal of
Health Behavior;31:672-684.
Shiyko MP, Li Y, Rindskopf (2012). Poisson Growth Mixture Modeling of Intensive
Longitudinal Data: An Application to Smoking Cessation Behavior. Structural equation
modeling : a multidisciplinary journal;19(1):65-85.
Sperrin, N., Jaki, T., Wit, E (2010). Probabilistic relabeling strategies for the label switching
problem in Bayesian mixture models. Statistics and Computing 20: 357-366
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and Van Der Linde, A. (2002). Bayesian measures
of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 64, 583–639
Stephens, M. (2000), Dealing with label switching in mixture models. Journal of the Royal
Statistical Society, Series B 62: 795-809
USDHHS (1994). Tobacco and the clinician: interventions for medical and dental practice. NIH
Publ 94-3693. Bethesda (MD): NIH
U.S. Department of Health and Human Services (2014). The Health Consequences of Smoking:
50 Years of Progress. A Report of the Surgeon General. Atlanta, GA: U.S. Department of Health
and Human Services, Centers for Disease Control and Prevention, National Center for Chronic
Disease Prevention and Health Promotion, Office on Smoking and Health
Vermunt, J.K.(2003) Multilevel latent class models. Sociological Methodology, 33, 213-239.
Vermunt, J. K. (2008). Latent class and finite mixture models for multilevel data sets. Statistical
Methods in Medical Research, 17(1), 33–51.
Windle M & Windle RC (2001). Depressive symptoms and cigarette smoking among middle
adolescents: Prospective associations and intrapersonal and interpersonal influences. Journal of
Consulting and Clinical Psychology;69:215-226.
Yang, Si (2015). A BAYESIAN ZERO-INFLATED GENERALIZED GROWTH MIXTURE
MODEL FOR ADOLESCENT HEALTH RISK BEHAVIORS. Open Access Master's Theses.
Paper 551.
Zhang et al. (2007). Bayesian Analysis of Longitudinal Data using growth curve models.
International Journal of Behavioral Development, 31(4), 374-383
Asset Metadata
Creator
Wang, Kejia (author)
Core Title
Multilevel Bayesian latent class growth mixture model for longitudinal zero-inflated Poisson data
Contributor
Electronically uploaded by the author
(provenance)
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Biostatistics
Degree Conferral Date
2017-05
Publication Date
02/09/2018
Defense Date
11/29/2016
Publisher
Los Angeles, California
(original),
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
Bayesian,longitudinal,multilevel,OAI-PMH Harvest,zero-inflated Poisson
Format
theses
(aat)
Language
English
Advisor
Berhane, Kiros (
committee chair
), Chou, Chih-Ping (
committee member
), Conti, David (
committee member
), Lewinger, Juan Pablo (
committee member
), Lloyd, Donald (
committee member
)
Creator Email
kejiawan@usc.edu,stephaniewangkejia@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC11258356
Unique identifier
UC11258356
Identifier
etd-WangKejia-5020.pdf (filename)
Legacy Identifier
etd-WangKejia-5020
Dmrecord
331783
Document Type
Dissertation
Format
theses (aat)
Rights
Wang, Kejia
Internet Media Type
application/pdf
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
uscdl@usc.edu
Abstract (if available)
Abstract
There has been a growing interest among researchers in the use of Latent Class Growth mixture modeling techniques for applications in the field of medicine, developmental psychology, social and behavioral science. Latent Class Growth Mixture (LCGM) model has been increasingly recognized for its effectiveness to identify distinct subgroups of trajectories based on longitudinal data. Current LCGM methods (e.g., using Mplus and SAS Proc Traj) for Zero‐inflated count data, have problems with convergence issue, non‐identification, local maxima etc. This research focuses on developing multilevel LCGM for Zero‐inflated Poisson longitudinal data using fully Bayesian approach. The goal is to model longitudinal growth curves on counts of rare events and to classify individuals into distinct latent groups that represent the unique phenotype of each individual. Bayesian MCMC Metropolis random walk algorithm is used to determine latent class membership for each individual and estimate the latent variables parameters in this proposed model. The methodological details are illustrated through simulation studies and are applied to both individual‐level data and multi‐level data on cigarette smoking from the National Longitudinal Study of Adolescent to Adult Health.
Tags
Bayesian
longitudinal
multilevel
zero-inflated Poisson
Linked assets
University of Southern California Dissertations and Theses