SMALL AREA CANCER INCIDENCE MAPPING USING HIERARCHICAL BAYESIAN
METHODS
by
Xinyang Dai
A Thesis Presented to the
FACULTY OF THE USC KECK SCHOOL OF MEDICINE
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(BIOSTATISTICS)
May 2024
Copyright 2024 Xinyang Dai
Table of Contents
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Method
2.1 Data
2.2 Statistical Analysis
2.2.1 Hierarchical Bayesian Models
2.2.2 Sampling from Distribution
2.2.3 Convergence Diagnosis for HMC sampler
2.2.4 Cross-validation between models
Chapter 3: Results
3.1 Males with All Cancers
3.2 African-American males with prostate cancer
Chapter 4: Discussion and Future Work
Bibliography
Appendices
List of Tables
2.1 Summary table of Census tract-based cancer data for males in Los Angeles from 2010-2019
4.1 Summary table of fitting statistics of Poisson lognormal CAR model in males with cancers
4.2 Summary table of fitting statistics of zero-inflated Poisson CAR model in males with cancers
4.3 Summary table of fitting statistics of negative binomial CAR model in males with cancers
4.4 Summary table of fitting statistics of BYM model in males with cancers
4.5 Summary table of fitting statistics of BYM model in African-American males with prostate cancer
List of Figures
2.1 Total cancer cases for males in LA county from year 2010-2019
2.2 Histogram of cases in all sites in males in LAC from 2010-2019
2.3 Prostate cancer cases in African-American males in LAC from 2010-2019
2.4 Histogram of Census tract-based cases in prostate for African-American males in LAC from 2010-2019
3.1 Standardized morbidity rate prediction for males with tumors in LAC
3.2 Incidence rate prediction for males with tumors aged 35-49 in LAC (per million)
3.3 Standardized morbidity rate prediction for African-American males with prostate tumor in LAC
3.4 Incidence rate prediction for African-American males with prostate tumor aged 35-49 in LAC (per million)
4.1 MCMC trace plots of Poisson lognormal CAR model for males with cancers
4.2 MCMC trace plots of Poisson lognormal CAR model for males with cancers
4.3 MCMC trace plots of negative binomial CAR model for males with cancers
4.4 MCMC trace plots of BYM model for males with cancers
4.5 MCMC trace plots of BYM model for African-American males with prostate cancer
4.6 MCMC trace plots of divergence for African-American males with prostate cancer
Abstract
Accounting for spatial patterns improves the credibility of predicted cancer morbidity rates because it incorporates information from neighboring areas. Bayesian methods are powerful techniques for quantifying the uncertainty present in real-world data. Los Angeles County has a large and highly diverse population, which makes it possible to explore spatial patterns in small areas stratified simultaneously by gender, race, age and cancer site. In this study, multiple hierarchical Bayesian models were fitted in R with the support of the Stan language, with some modifications to adapt the models to our data. We considered four models: the Poisson lognormal conditional autoregressive (CAR) model, the zero-inflated Poisson CAR model, the negative binomial CAR model and the Besag York Mollié (BYM) model. We selected the negative binomial CAR model for predicting overall incidence rates in males because it optimized the leave-one-out cross-validation information criterion (Looic). However, only the BYM model converged among the four when African-American males were the target population and prostate cancer was the outcome of interest. The results suggest that several cities, including Los Angeles, Monterey Park, Alhambra, San Gabriel, El Monte, Rosemead and Montebello, have lower incidence rates for males aged 35 to 49, so spatial patterns need to be taken into account when predicting cancer rates.
Chapter 1
Introduction
Cancer is a crucial component in assessing global public health. Predicting cancer rates is valuable for informing policy and reducing social costs [1]. Cancer is the second leading cause of death in the US, and approximately 193,880 new cancer cases and 59,930 cancer deaths are predicted in California in 2024 [2]. Incidence is one of the most commonly used measures for evaluating the occurrence of disease and can be presented directly through mapping. Disease mapping is feasible with data at the country, state and census tract levels [3]. Healthcare professionals can predict cancer risk more accurately and in a more targeted way when the area of interest is smaller, as with census tract-based data, which supports early interventions in specific areas with high predicted incidence rates [4].
Multiple studies have shown that hierarchical Bayesian approaches, which combine prior information with probabilistic models, are powerful tools in the spatial epidemiology of cancer [4, 5, 6, 7]. However, the performance of a hierarchical Bayesian approach depends heavily on the model design and the software implementation. Researchers must choose models for incidence rates carefully based on the characteristics of the specific data, since population structures and cancer case counts vary across regions. Information disclosure is also limited, especially when investigators focus on a particular trait such as ethnicity in census tract-based data. Stan is an effective tool for computing the posterior distributions of statistical parameters through Hamiltonian Monte Carlo (HMC) sampling [6, 8].
In this study, we used cancer registry data and GIS data to predict age-specific cancer incidence rates in Los Angeles County at the census tract level with the Poisson lognormal conditional autoregressive (CAR) model, the zero-inflated Poisson CAR model, the negative binomial CAR model and the Besag York Mollié (BYM) model. Chapter 2 describes the data, followed by an introduction to the competing models, the implementation algorithm and the evaluation of model fit. Sex is an important classification in cancer analysis [9], and males are the target population at the current stage. In Chapter 3, we present the analysis results for two populations: males with all cancers and African-American males with prostate cancer. We predicted age-specific incidence rates overall as well as rates for a particular racial group and tumor site, considering the spatial autocorrelation not only between census tracts within LA County but also with the neighboring census tracts bordering LA County. The four models were compared after fitting. Chapter 4 concludes our analysis by summarizing the findings, discussing limitations, and suggesting potential extensions for future work.
Chapter 2
Method
2.1 Data
The cancer data in our analysis originate from the California Cancer Registry, California's population-based cancer registry [10]. The variables in the analysis include population counts, case counts and cancer sites from 2010-2019, as well as basic categorical variables including age, gender and ethnicity for each census tract in Los Angeles County (LAC). All variables are aggregated by 18 age groups (0-4, 5-9, ..., 80-84, 85+) at the level of the 2010 Census tracts. Since cancer cases are rare, the data managers combined the counts over the ten years from 2010 to 2019 instead of using annual data. To map predicted morbidity rates by Census tract, GIS data are required to provide the Census tract boundaries in LAC. Shapefiles were obtained from the LA County Government website and clipped to the extent used in the analysis.
On average there were 81.1 cancer cases in males per Census tract (median 72, range 0 to 317), and the male population per tract ranged from 0 to 110407 (mean 21019, median 20368) in LA County between 2010 and 2019. Prostate cancer accounts for a large share of these cancers, with a mean of 20.3 and a maximum of 102 prostate tumor cases per tract in males (Table 2.1). The observed cancer counts by Census tract in LA County for men from 2010-2019 are shown in Figure 2.1. As Figure 2.2 shows, the observed male case counts for all sites, by tract and age group, appeared to follow a Poisson distribution; the histogram includes cases from the neighboring tracts outside LA County that were used in modelling.
Table 2.1: Summary table of Census tract-based cancer data for males in Los Angeles from 2010-2019
Race              Variable                   Mean    Median  Min  Max
Overall           Cancer cases in all sites  81.1    72      0    317
Overall           Population                 21019   20368   0    110407
Overall           Prostate cancer cases      20.3    17      0    102
African American  Cancer cases in all sites  7.6     3       0    150
African American  Population                 869.4   440     0    36100
African American  Prostate cancer cases      2.6     1       0    46
For African-American males, there was a mean of 7.6 cases of all cancers per Census tract in LA County between 2010 and 2019, with a maximum of 150. The population per Census tract can be as large as 36100, and the prostate cancer counts have a maximum of 46, close to one third of the maximum total cancer count (Table 2.1). The map of observed prostate cancer cases in African-American males provides an overview of their distribution across LA County (Figure 2.3), and Figure 2.4 indicates that a Poisson distribution is a reasonable assumption for this population.
2.2 Statistical Analysis
2.2.1 Hierarchical Bayesian Models
Figure 2.1: Total cancer cases for males in LA county from year 2010-2019
Figure 2.2: Histogram of cases in all sites in males in LAC from 2010-2019
Four hierarchical Bayesian models were fitted to predict standardized morbidity rates (SMR) and incidence rates for both target populations described in the Data section. Our base model is the Poisson lognormal conditional autoregressive (CAR) model [11, 12]. However, we made some changes to the original model to fit our datasets. Age group was added as a categorical variable, so the observed and expected case counts are matrices instead of vectors. Let $Y_{ij}$ be the observed cancer case count in tract $i$ for age group $j$. We modeled the outcome through
$$Y_{ij} \mid \phi_i \sim \mathrm{Poisson}\left(E_{ij} e^{\phi_i}\right), \tag{2.1}$$
where $E_{ij}$ is the expected number of cancer cases in tract $i$ for age group $j$ and $\phi_i$ represents the spatial random effect of tract $i$. The expected number $E_{ij}$ was calculated as the population of age group $j$ in census tract $i$ multiplied by the overall incidence rate for age group $j$. Assume there are $N$ census tracts and denote by $\phi = (\phi_1, \ldots, \phi_N)'$ the spatial random effects across all census tracts. We assigned a CAR prior for $\phi$:
$$\phi \sim \mathrm{Normal}\left(0, [\tau (D - \alpha W)]^{-1}\right). \tag{2.2}$$
In (2.2), the precision parameter $\tau$ was assigned the prior $\tau \sim \mathrm{Cauchy}(0, 0.5)$, and the prior for the hyper-parameter $\alpha$ was set to $\mathrm{Unif}(0, 1)$. The parameter $\alpha$ controls spatial dependency, where 0 implies no spatial correlation across census tracts. The quantity $e^{\phi_i}$ denotes the SMR of tract $i$. $D$ is an $N \times N$ diagonal matrix whose diagonal elements are the numbers of neighbors of each census tract. $W$ is the binary weight matrix whose entry $w_{ik}$ equals 1 when tracts $i$ and $k$ are neighbors and 0 otherwise [11]. In this analysis, we assume $W$ is a rook adjacency matrix, meaning that two tracts are neighbors only if they share a common boundary. We fitted the model in Stan, a probabilistic programming language for statistical inference [13]. Since the model involves expensive operations such as the large-scale matrix inverse in (2.2), we adapted the efficient Stan implementation of the CAR model introduced by Max Joseph [8], which reduces computation time by using a sparse representation of the matrix $W$.
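The structure of the CAR precision matrix $\tau (D - \alpha W)$ in (2.2) can be illustrated with a small sketch. The four-tract layout, the function name `car_precision` and the values of `tau` and `alpha` below are hypothetical and chosen only for illustration; the models in this thesis were fitted in Stan, not with this code. The edge list plays the role of the sparse representation of $W$.

```python
def car_precision(n_tracts, edges, tau, alpha):
    """Build W, the neighbor counts (diagonal of D), and Q = tau * (D - alpha * W)."""
    # W is symmetric and binary: W[i][k] = 1 when tracts i and k share a boundary.
    W = [[0] * n_tracts for _ in range(n_tracts)]
    for i, k in edges:
        W[i][k] = 1
        W[k][i] = 1
    # The diagonal of D holds the number of neighbors of each tract.
    degree = [sum(row) for row in W]
    Q = [
        [tau * ((degree[i] if i == k else 0) - alpha * W[i][k]) for k in range(n_tracts)]
        for i in range(n_tracts)
    ]
    return W, degree, Q


# Hypothetical layout: four tracts in a row, so the neighbor pairs are 0-1, 1-2, 2-3.
edges = [(0, 1), (1, 2), (2, 3)]
W, degree, Q = car_precision(4, edges, tau=2.0, alpha=0.5)

for row in Q:
    print(row)
```

Only the neighbor pairs need to be stored, which is why a sparse representation pays off when there are thousands of tracts and each tract has only a handful of neighbors.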
Figure 2.3: Prostate cancer cases in African-American males in LAC from 2010-2019
The Poisson lognormal CAR model introduced above serves as a base model that can be extended to several variants. An immediate extension replaces the Poisson distribution in (2.1) with a negative binomial distribution, which is well suited to over-dispersed data. The negative binomial CAR model therefore allows cancer counts to be over-dispersed and adds a parameter $\varphi$ that measures the dispersion around the mean predicted count $E_{ij} e^{\phi_i}$ [15]. Another extension we consider is the zero-inflated Poisson lognormal CAR model, which adds a separate likelihood component for case counts of zero. It is used because the case count variable has many zeros. In this model, a count is an excess zero with probability $\theta$; with probability $1 - \theta$, the count follows the model in (2.1) [14]. All other settings are the same as in the Poisson lognormal CAR model.
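The two extensions can be written out as log probability mass functions. The sketch below is illustrative only: the function names are hypothetical, the negative binomial is assumed to use the mean and dispersion form (variance $\mu + \mu^2/\varphi$), and the values of `E_ij`, `phi_i`, `theta` and `dispersion` are made up. The source does not state which negative binomial parameterization was used in Stan.

```python
import math


def poisson_logpmf(y, mu):
    """Log probability of count y under Poisson(mu)."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)


def zip_logpmf(y, mu, theta):
    """Zero-inflated Poisson: excess zero with probability theta, else Poisson(mu)."""
    if y == 0:
        # A zero can come from the inflation component or from the Poisson itself.
        return math.log(theta + (1 - theta) * math.exp(-mu))
    return math.log(1 - theta) + poisson_logpmf(y, mu)


def neg_binomial_logpmf(y, mu, dispersion):
    """Negative binomial with mean mu and variance mu + mu^2 / dispersion (assumed form)."""
    return (
        math.lgamma(y + dispersion) - math.lgamma(dispersion) - math.lgamma(y + 1)
        + dispersion * math.log(dispersion / (mu + dispersion))
        + y * math.log(mu / (mu + dispersion))
    )


# Hypothetical cell: expected count E_ij and spatial effect phi_i give the mean in (2.1).
E_ij, phi_i = 2.5, 0.2
mu = E_ij * math.exp(phi_i)

p_zero_poisson = math.exp(poisson_logpmf(0, mu))
p_zero_zip = math.exp(zip_logpmf(0, mu, theta=0.3))
p_zero_nb = math.exp(neg_binomial_logpmf(0, mu, dispersion=1.5))
print(p_zero_poisson, p_zero_zip, p_zero_nb)
```

Both extensions put more probability on zero counts than the plain Poisson model with the same mean, which is the behavior the many-zeros data call for.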
Figure 2.4: Histogram of Census tract-based cases in prostate for African-American males in LAC from 2010-2019
Besag York Mollié (BYM) model: The BYM model is also based on (2.1), but an intrinsic CAR (ICAR) model replaces the CAR model for the spatial component [5]. The ICAR model assumes that spatial dependency between neighboring Census tracts is always present. Based on our data structure, the random effect $\phi_i$ in (2.1) is replaced by the sum of a spatial ICAR effect $\phi_i$ and a non-spatial random effect $\theta_i$:
$$Y_{ij} \mid \phi_i \sim \mathrm{Poisson}\left(E_{ij} e^{\phi_i + \theta_i}\right). \tag{2.3}$$
Note that $\phi_i$ and $\theta_i$ here are defined differently from the parameters with the same symbols in the previous models. The ICAR model is equivalent to the CAR model in (2.2) if and only if $\alpha = 1$. In addition, we constrain the sum of the $\phi_i$ to zero through the soft constraint $\mathrm{mean}(\phi) \sim \mathrm{Normal}(0, 0.001)$. In Stan, we used $\frac{1}{\sqrt{\tau_\phi}} \phi_i$ in place of $\phi_i$ and $\frac{1}{\sqrt{\tau_\theta}} \theta_i$ in place of $\theta_i$, so that the variance components scale the two random effects [16]. Bayesian methods require prior distributions. Here, $\theta$ follows the standard normal distribution, and the precision parameters follow gamma distributions: $\tau_\phi \sim \mathrm{Gamma}(1, 1)$, $\tau_\theta \sim \mathrm{Gamma}(3, 2)$. This model also handles prediction for islands (Census tracts without any neighbors), which remain after tracts with zero or unavailable population are removed. For all such islands, the BYM model excludes the spatial effect ($\phi$) because islands have no spatial neighbors. All models were written in the Stan language, which makes the subsequent sampling convenient.
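A minimal sketch of how the BYM pieces fit together is given below. It assumes the pairwise-difference form of the ICAR log density used in the cited Stan case study [16], and the function names, tract layout and parameter values are hypothetical. It is not the Stan program used in this study.

```python
import math


def icar_log_density(phi, edges):
    """Unnormalized ICAR log density: -0.5 * sum of squared neighbor differences."""
    return -0.5 * sum((phi[i] - phi[k]) ** 2 for i, k in edges)


def bym_log_risk(phi, theta, tau_phi, tau_theta, islands):
    """Log relative risk per tract: scaled spatial effect plus scaled non-spatial effect.

    Islands have no neighbors, so their spatial effect is set to zero.
    """
    sigma_phi = 1 / math.sqrt(tau_phi)
    sigma_theta = 1 / math.sqrt(tau_theta)
    log_risk = []
    for i in range(len(theta)):
        spatial = 0.0 if i in islands else sigma_phi * phi[i]
        log_risk.append(spatial + sigma_theta * theta[i])
    return log_risk


# Hypothetical layout: tracts 0-1-2 are connected, tract 3 is an island.
edges = [(0, 1), (1, 2)]
phi = [0.4, -0.1, -0.3, 0.0]    # spatial effects, summing to zero
theta = [0.2, 0.0, -0.2, 0.5]   # non-spatial effects

log_density = icar_log_density(phi, edges)
log_risk = bym_log_risk(phi, theta, tau_phi=4.0, tau_theta=1.0, islands={3})
smr = [math.exp(v) for v in log_risk]
print(log_density, smr)
```

Handling islands inside the same function mirrors the idea of predicting all suitable tracts in one model: the island keeps its non-spatial effect and simply contributes nothing to the ICAR term.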
2.2.2 Sampling from Distribution
Markov chain Monte Carlo (MCMC) draws samples from the posterior distribution of the parameters under each model, and Stan uses Hamiltonian Monte Carlo (HMC) for this purpose [17]. The posterior draws for Bayesian inference were obtained through R, with 6000 draws in total across four chains and the spatial parameters $\phi_i$ initialized at zero. The posterior mean and 90% credible interval of each parameter for every Census tract were recorded, from which descriptive statistics can be computed.
Predicted SMR and incidence rates were estimated from the posterior draws in R. For the BYM model, the posterior of $e^{\phi_i + \theta_i}$ ($e^{\theta_i}$ for islands) was extracted as a matrix with 6000 rows (draws) and one column per Census tract (2382 for males with all types of cancer and 1674 for African-American males with prostate cancer). The mean of $e^{\phi_i + \theta_i}$ ($e^{\theta_i}$ for islands) for each tract is the column mean of this matrix and was taken as the predicted SMR under the BYM model; the corresponding 90% credible interval was obtained from the 5th and 95th percentiles. For the other three models, the posterior of $e^{\phi_i}$ was used to derive the predicted SMR and its 90% credible interval by the same procedure.
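The column-wise summary described above can be sketched as follows. The draws matrix here is a small deterministic toy (101 draws, 2 tracts) rather than real posterior output, and the helper names are hypothetical. The percentile rule uses linear interpolation, which is the default rule of R's `quantile` function.

```python
def percentile(sorted_vals, p):
    """Percentile by linear interpolation between order statistics."""
    pos = (len(sorted_vals) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def summarize_smr(draws):
    """draws: one row per posterior draw, one column per Census tract."""
    n_tracts = len(draws[0])
    summary = []
    for t in range(n_tracts):
        col = sorted(row[t] for row in draws)
        summary.append({
            "mean": sum(col) / len(col),   # predicted SMR
            "q05": percentile(col, 5),     # lower bound of the 90% credible interval
            "q95": percentile(col, 95),    # upper bound of the 90% credible interval
        })
    return summary


# Toy draws: tract 0 ranges evenly from 1.00 to 2.00, tract 1 is constant at 0.5.
draws = [[1.0 + 0.01 * d, 0.5] for d in range(101)]
summary = summarize_smr(draws)
print(summary[0])
```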
The predicted age-specific incidence rates for the negative binomial CAR model were generated from the probability matrix of the negative binomial distribution, given the product of $E_{ij}$ and $e^{\phi_i}$ as well as the corresponding counts. This likelihood matrix has a number of rows equal to 6000 times the number of age groups and a column length equal to the number of age group counts. The row means of the matrix are therefore the predicted case counts by Census tract and age group. The incidence rates are the ratios of the predicted case counts to the matching age-specific populations. For the other three models, the negative binomial distribution is replaced with the Poisson distribution. In addition, $e^{\phi_i}$ is replaced with $e^{\phi_i + \theta_i}$ ($e^{\theta_i}$ for islands) in the BYM model when estimating incidence rates.
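The arithmetic linking expected counts, SMR and incidence rates can be sketched with a toy example. This is a simplification: it uses a single SMR value per tract in place of the full posterior procedure, it assumes that the overall rate for an age group is the total cases divided by the total population across tracts, and all numbers and function names are hypothetical.

```python
def expected_counts(population, cases):
    """E_ij = population of age group j in tract i times the overall rate of age group j."""
    n_tracts, n_ages = len(population), len(population[0])
    overall_rate = []
    for j in range(n_ages):
        pop_j = sum(population[i][j] for i in range(n_tracts))
        cases_j = sum(cases[i][j] for i in range(n_tracts))
        overall_rate.append(cases_j / pop_j if pop_j > 0 else 0.0)
    return [[population[i][j] * overall_rate[j] for j in range(n_ages)]
            for i in range(n_tracts)]


def incidence_per_million(pred_cases, population):
    """Predicted cases divided by population; cells with zero population are skipped."""
    return [[1e6 * c / p if p > 0 else None for c, p in zip(case_row, pop_row)]
            for case_row, pop_row in zip(pred_cases, population)]


# Toy data: 2 tracts by 2 age groups; tract 1 has no population in age group 1.
population = [[1000, 2000], [3000, 0]]
cases = [[1, 10], [7, 0]]
smr = [1.5, 0.5]  # stand-in for the posterior mean of exp(phi_i) per tract

E = expected_counts(population, cases)
pred_cases = [[smr[i] * e for e in row] for i, row in enumerate(E)]
rates = incidence_per_million(pred_cases, population)
print(E, rates)
```

Returning `None` for cells with zero population mirrors the choice of skipping those tract and age group combinations rather than dividing by zero.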
2.2.3 Convergence Diagnosis for HMC sampler
Testing the convergence of sampling is necessary, since it reflects the stability of HMC sampling from the hierarchical Bayesian models. Trace plots are the most direct way to get an overview of sampling behavior, and summary statistics such as rhat and effective sample size inform the final decision. rhat compares the variation between and within sampling chains; values close to 1 indicate good performance. Effective sample size also reflects the degree of convergence, and we required a minimum value of 300 in our study. This step was performed once after the HMC draws were obtained, and rates were not calculated if the model failed to converge.
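The idea behind rhat can be illustrated with the classic between-chain and within-chain variance comparison. This is a simplified sketch on simulated chains; recent Stan interfaces report a refined split, rank-normalized variant, so the numbers from this function will not match Stan output exactly. The chain length of 1500 assumes that the 6000 draws are split evenly over four chains.

```python
import math
import random
import statistics


def rhat(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    var_hat = (n - 1) / n * within + between / n
    return math.sqrt(var_hat / within)


rng = random.Random(2024)
# Four simulated chains of 1500 draws from the same distribution.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1500)] for _ in range(4)]
# The same chains, but with the last one stuck around a different value.
stuck = mixed[:3] + [[x + 3.0 for x in mixed[3]]]

rhat_mixed = rhat(mixed)
rhat_stuck = rhat(stuck)
print(round(rhat_mixed, 4), round(rhat_stuck, 4))
```

When all chains explore the same distribution the statistic is close to 1, and it grows well above 1 when one chain settles somewhere else.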
2.2.4 Cross-validation between models
To compare the four models, the loo package in R was applied to the log likelihood of the observed counts for each Census tract and each age group. Based on the diagnostic output in R, and given that the four models have complex parameter settings, the elpd estimate based on Pareto smoothed importance sampling (PSIS) is more suitable than WAIC. This process requires the pointwise log likelihood to be generated in Stan, from which the information criterion (Looic) is computed in R [18].
Chapter 3
Results
The data described in the previous chapter form the basis of the analysis. Data of the same kind from neighboring Census tracts in counties bordering LAC improve prediction accuracy, since LAC is not an island and has many neighbors. However, the areas outside LAC are not shown in the maps; these data were used only to build the spatial models. Initially, 2405 Census tracts were used to build models (2342 tracts in LAC and 63 tracts in neighboring counties) after removing some islands, but 22 of the 2405 tracts were then deleted because they had zero population. Since population is recorded by both age group and Census tract, some Census tracts have zero population for several age groups; these tract and age group combinations were skipped in the Stan program. Some additional data handling depended on the target population and the statistical model.
For the age-specific incidence rates in males for all cancer sites, all 18 age groups were included in the models since each age group has non-zero population. Rates for a specific race, sex and cancer site are often of greater interest. We selected African-American males who developed prostate cancer as the second target population. In this case, 1674 valid Census tracts (1638 tracts in LAC and 36 tracts in surrounding counties) were included in the models. In addition, we eliminated the first seven age groups (ages 0-34) because the age-specific values were all zero in all 1674 tracts.
3.1 Males with All Cancers
The convergence of the Poisson lognormal CAR model for males with all cancers was assessed with both statistics and visualization. The rhat estimates range from 0.9995 to 1.0052, close to 1, indicating that the model converged well. The bulk and tail effective sample sizes are both larger than 900, which is enough to trust the posterior estimates (Table 4.1). Trace plots of the parameters phi ($e^{\phi_i}$) for selected Census tracts, tau_c ($\tau$) and lp__ (log density) show stable fluctuation across iterations (Figure 4.1). We therefore conclude that this model passed the convergence diagnosis.
The convergence of the zero-inflated Poisson CAR model for males with all cancers was also assessed. The rhat estimates range from 0.9995 to 1.0051, very close to 1. The bulk and tail effective sample sizes are both larger than 900, indicating that the posterior estimates are reliable for further analysis (Table 4.2). Trace plots of the parameters show stable fluctuation across iterations (Figure 4.2). We therefore infer that this model passed the convergence diagnosis.
The convergence of the negative binomial CAR model for males with all cancers was also assessed. The rhat estimates range from 0.9995 to 1.0050, with all values around 1. The effective sample size has a minimum of 488.2, suggesting that the posterior estimates are informative (Table 4.3). Trace plots of the parameters show stable fluctuation across iterations (Figure 4.3). Hence, the convergence of this model was confirmed.
The convergence of the BYM model for males with all cancers was also assessed. The rhat estimates range from 0.999 to 1.005. The smallest effective sample size is 963, suggesting that the posterior information is sufficient for this model (Table 4.4). Trace plots of the extracted parameters show relatively stable variation across iterations (Figure 4.4). Therefore, the BYM model converged and can be used to predict overall cancer rates in males.
We compared the four models and obtained Looic values of 155808.3 for the Poisson lognormal CAR model, 155809.8 for the zero-inflated Poisson CAR model, 153916.3 for the negative binomial CAR model and 156443.6 for the BYM model. Consequently, the negative binomial CAR model, which has the lowest Looic, fits the data best when the target population is males of all ethnicities and the outcome of interest is the overall cancer morbidity rate. Figure 3.1 shows the predicted standardized morbidity rate for males developing cancer across LA County by Census tract according to the negative binomial CAR model, and the predicted incidence rates for males aged 35 to 49 are displayed in Figure 3.2.
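The model selection step amounts to ranking the models by Looic, where Looic equals $-2$ times the estimated expected log pointwise predictive density (elpd). The sketch below uses the four Looic values reported above; the dictionary keys are just labels. The loo package also reports standard errors for such differences, which are not given here and so are not part of this sketch.

```python
# Looic values reported in the text for males with all cancers.
looic = {
    "Poisson lognormal CAR": 155808.3,
    "Zero-inflated Poisson CAR": 155809.8,
    "Negative binomial CAR": 153916.3,
    "BYM": 156443.6,
}

# Looic = -2 * elpd_loo, so a lower Looic means a higher elpd.
elpd = {model: -value / 2 for model, value in looic.items()}

best = min(looic, key=looic.get)
difference = {model: round(value - looic[best], 1) for model, value in looic.items()}

for model in sorted(looic, key=looic.get):
    print(f"{model}: Looic={looic[model]}, difference from best={difference[model]}")
```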
Figure 3.1: Standardized morbidity rate prediction for males with tumors in LAC
Figure 3.2: Incidence rate prediction for males with tumors aged 35-49 in LAC(per million)
3.2 African-American males with prostate cancer
The BYM model was applied to predict the SMR and incidence rate of prostate cancer for African-American males. Convergence of this model can be assessed through summary statistics of the parameter estimates (Table 4.5) and trace plots of the MCMC chains (Figure 4.5). The rhat values range from 0.9995 to 1.0198 with a median of around 1, suggesting that the model converged well. The bulk and tail effective sample sizes are both larger than 300, meaning that the posterior estimates are trustworthy. The trace plots show a stable pattern across iterations. We therefore conclude that this model passes the convergence diagnosis. We also tried to fit the other three models to this population, but none of them converged (Figure 4.6).
Figure 3.3 presents the predicted mean SMR across the applicable Census tracts in LA County. The closer the color of a Census tract is to red, the more the observed cases exceed the expected cases. Incidence rates can also be mapped for each of the 11 age groups, and the spatial patterns of incidence rates are not similar across age groups. Age groups can also be combined, and the predicted incidence rates for ages 35-49 are shown in Figure 3.4.
Figure 3.3: Standardized morbidity rate prediction for African-American males with prostate tumor in LAC
Figure 3.4: Incidence rate prediction for African-American males with prostate tumor aged 35-49 in
LAC(per million)
Chapter 4
Discussion and Future Work
Among males aged 35-49, several cities such as Los Angeles, Monterey Park, Alhambra, San Gabriel, El Monte, Rosemead and Montebello tend to have fewer people developing cancer (Figure 3.2). Interestingly, the predicted rates become larger when we narrow the age range to single groups such as 35-39, 40-44 and 45-49. This is due to the rare cancer cases, and even zero cases, in some age groups within a Census tract. The same phenomenon was more pronounced when we focused on a specific race, African American. The spatial pattern in Figure 3.4 was not very clear even though three age groups were combined.
Cancer cases are rare relative to the population, especially at the Census tract level, and they become much rarer when the target population is restricted to African-American males across 18 age groups. The large number of zero values in the dataset made model construction challenging, and the population of a Census tract can be 0 for some age groups even though its total population is not 0. To make the predictions more reasonable, observations with zero age-specific population were excluded when building models so that the predictions were not influenced by these invalid records.
Los Angeles County has some natural islands for which no population statistics are released, and these can simply be ignored in this study. However, some inland Census tracts were left without neighbors after Census tracts without population were discarded, which turned them into artificial islands in the models. For these isolated areas, we used a separate loop in Stan that assigns zero to their spatial parameters when fitting the BYM model, so that cancer rates for all suitable Census tracts, including artificial islands, are predicted in one model. This treatment of islands can be generalized to other populations and improves the precision of modeling. Furthermore, we found that the three models other than the BYM model did not converge when we focused on prostate cancer in African-American males.
There are also some limitations in the study. The results apply only to the specified populations, meaning that other combinations of race, gender and tumor type may not be fitted well by these four models because of the variability of real data. Further work could explore morbidity prediction for various ethnicities and cancer sites in either women or men. In addition, our models do not incorporate any spatial covariates related to cancer risk, such as diet, tobacco use and family history. Further research could include more relevant cancer risk factors when predicting incidence rates, since additional information might lead to stronger statistical estimates with less uncertainty.
Bibliography
[1] Yu, B. Predicting county-level cancer incidence rates and counts in the USA. Stat Med. 32.(22)
(May 2013), 3911–3925. doi: 10.1002/sim.5833.
[2] Dizon, D.S. and Kamal, A.H. Cancer statistics 2024: All hands on deck. CA Cancer J Clin 74 (2024),
8–9. doi: 10.3322/caac.21824.
[3] Ramamurthy, Poornima, Sharma, Dileep, Adeoye, John, Choi, Siu-Wai, Thomson, Peter, et al.
Bayesian Disease Mapping to Identify High-Risk Population for Oral Cancer: A Retrospective
Spatiotemporal Analysis. International Journal of Dentistry 2023 (2023).
[4] Abdulrahman, Mohammed A, Bhattacharjee, Madhuchhanda, and Schmid, Volker. Bayesian Spatial
analysis for breast and prostate cancer incidence in Sudan based on 2009-2013 national registry data.
[5] Simkin, J., Dummer, T.J.B., Erickson, A.C., Otterstatter, M.C., Woods, R.R., and Ogilvie, G. Small
area disease mapping of cancer incidence in British Columbia using Bayesian spatial models and
the smallareamapp R Package. Frontiers in Oncology 12 (Oct. 19, 2022), 833265. doi:
10.3389/fonc.2022.833265.
[6] Morris, Mitzi, Wheeler-Martin, Katherine, Simpson, Dan, Mooney, Stephen J., Gelman, Andrew,
and DiMaggio, Charles. Bayesian hierarchical spatial models: Implementing the Besag York Mollié
model in stan. Spatial and Spatio-temporal Epidemiology 31 (2019), 100301. issn: 1877-5845. doi:
10.1016/j.sste.2019.100301.
[7] Jainsankar, R and Ranjani, M. Spatial disease mapping using the Poisson-Gamma model. Journal of
Future Sustainability 4.(2) (2024), 101–106.
[8] Joseph, Max. Exact sparse CAR models in Stan. 2016. url:
https://mc-stan.org/users/documentation/case-studies (visited on 03/01/2024).
[9] Kim, Hae-In, Lim, Hyesol, and Moon, Aree. Sex differences in cancer: epidemiology, genetics and
therapy. Biomolecules & therapeutics 26.(4) (2018), 335.
[10] California Cancer Registry. url:
https://www.cdph.ca.gov/Programs/CCDPHP/DCDIC/CDSRB/Pages/California-Cancer-Registry.aspx
(visited on 03/17/2024).
[11] Banerjee, S., Carlin, B.P., and Gelfand, A.E. Hierarchical Modeling and Analysis for
Spatial Data (1st ed.). Chapman and Hall/CRC, 2003.
[12] Tesema, Getayeneh Antehunegn, Tessema, Zemenu Tadesse, Heritier, Stephane, Stirling, Rob G,
and Earnest, Arul. A systematic review of joint spatial and spatiotemporal models in Health
Research. International Journal of Environmental Research and Public Health 20.(7) (2023), 5295.
[13] Carpenter, Bob, Gelman, Andrew, Hoffman, Matthew D, Lee, Daniel, Goodrich, Ben,
Betancourt, Michael, Brubaker, Marcus A, Guo, Jiqiang, Li, Peter, and Riddell, Allen. Stan: A
probabilistic programming language. Journal of statistical software 76 (2017).
[14] Lambert, D. Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing. 1992.
url: https://mc-stan.org/docs/stan-users-guide/zero-inflated.html (visited on 03/01/2024).
[15] S.D.Team. 13.1 Negative Binomial Distribution | STAN Functions reference.
https://mc-stan.org/docs/2_19/functions-reference/negative-binomial-distribution.html.
[16] Morris, Mitzi. Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data. Aug. 2019.
url: https://mc-stan.org/users/documentation/case-studies/icar_stan.html.
[17] MCMC sampling. https://mc-stan.org/docs/reference-manual/mcmc.html#hamiltonian-monte-carlo.
[18] Vehtari, A., Gelman, A., and Gabry, J. Practical Bayesian model evaluation using leave-one-out
cross-validation and WAIC. Statistics and Computing 27 (2017), 1413–1432. doi:
10.1007/s11222-016-9696-4.
Appendices
Table 4.1: Summary table of fitting statistics of Poisson lognormal CAR model in males with cancers
Statistic  Min       1st Qu.   Median    Mean      3rd Qu.   Max
rhat       0.9995    1.0002    1.0005    1.0006    1.0010    1.0052
essbulk    962.5     6200.5    6730.6    6587.5    7251.4    9578.7
esstail    1909      4020      4223      4208      4427      5524
Figure 4.1: MCMC trace plots of Poisson lognormal CAR model for males with cancers
Table 4.2: Summary table of fitting statistics of zero-inflated Poisson CAR model in males with cancers
Statistic  Min       1st Qu.   Median    Mean      3rd Qu.   Max
rhat       0.9995    1.0002    1.0006    1.0007    1.0011    1.0051
essbulk    953.1     11024     11962.8   11435     12865     16151.7
esstail    2113      4221      4472      4430      4683      5817
Figure 4.2: MCMC trace plots of zero-inflated Poisson CAR model for males with cancers
Table 4.3: Summary table of fitting statistics of negative binomial CAR model in males with cancers
Statistic  Min       1st Qu.   Median    Mean      3rd Qu.   Max
rhat       0.9995    1.0001    1.0004    1.0005    1.0008    1.0050
essbulk    488.2     8848.2    9602.7    9299.3    10274.5   14014.5
esstail    1021      4610      4833      4803      5032      5875
Table 4.4: Summary table of fitting statistics of BYM model in males with cancers
Statistic  Min       1st Qu.   Median    Mean      3rd Qu.   Max
rhat       0.999     1.000     1.001     1.001     1.001     1.005
essbulk    963       6200      6731      6588      7251      9579
esstail    1909      4020      4223      4208      4427      5524
Figure 4.3: MCMC trace plots of negative binomial CAR model for males with cancers
Figure 4.4: MCMC trace plots of BYM model for males with cancers
Table 4.5: Summary table of fitting statistics of BYM model in African-American males with prostate cancer
Statistic  Min       1st Qu.   Median    Mean      3rd Qu.   Max
rhat       0.9995    1.0004    1.0010    1.0018    1.0021    1.0198
essbulk    303.4     1564.0    4216.1    4841.3    7987.1    13767.5
esstail    776.4     3111.4    4059.5    3698.3    4385.8    5276.4
Figure 4.5: MCMC trace plots of BYM model for African-American males with prostate cancer
Figure 4.6: MCMC trace plots of divergence for African-American males with prostate cancer
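The worst-case values in Tables 4.1 to 4.5 (maximum rhat, minimum essbulk, minimum esstail) can be screened against cutoffs. The following is a minimal Python sketch of such a screen, not the procedure used in the thesis. The cutoffs `RHAT_MAX = 1.01` and `ESS_MIN = 400` are assumed rules of thumb that do not come from the thesis, and the helper name `screen` is hypothetical.

```python
# Worst-case values copied from Tables 4.1-4.5:
# (maximum rhat, minimum essbulk, minimum esstail) for each fitted model.
summaries = {
    "Poisson lognormal CAR, males": (1.0052, 962.5, 1909.0),
    "Zero-inflated Poisson CAR, males": (1.0051, 953.1, 2113.0),
    "Negative binomial CAR, males": (1.0050, 488.2, 1021.0),
    "BYM, males": (1.005, 963.0, 1909.0),
    "BYM, African-American males, prostate": (1.0198, 303.4, 776.4),
}

# Assumed cutoffs (common rules of thumb, NOT taken from the thesis).
RHAT_MAX = 1.01
ESS_MIN = 400.0


def screen(rhat_max, ess_bulk_min, ess_tail_min):
    """Return the names of the diagnostics that miss the assumed cutoffs."""
    flags = []
    if rhat_max > RHAT_MAX:
        flags.append("rhat")
    if ess_bulk_min < ESS_MIN:
        flags.append("essbulk")
    if ess_tail_min < ESS_MIN:
        flags.append("esstail")
    return flags


results = {name: screen(*values) for name, values in summaries.items()}

for name, flags in results.items():
    status = "ok" if not flags else "check " + ", ".join(flags)
    print(f"{name}: {status}")
```

Under these assumed cutoffs, only the BYM model for African-American males with prostate cancer is flagged, on rhat and essbulk. Different cutoffs would change which models are flagged, so the output is a prompt for inspecting the trace plots rather than a verdict on convergence.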
Abstract
Accounting for spatial patterns, which brings in information from neighboring areas, improves the credibility of predicted cancer morbidity rates. Bayesian methods are powerful techniques for quantifying the uncertainty present in real-world data. Los Angeles County has a large and highly diverse population, which makes it possible to explore small-area spatial patterns stratified simultaneously by gender, race, age, and cancer site. In this study, multiple hierarchical Bayesian models were fitted in R with Stan, with some modifications to adapt the models to our data. We considered four models: the Poisson lognormal conditional autoregressive (CAR) model, the zero-inflated Poisson CAR model, the negative binomial CAR model, and the Besag York Mollié (BYM) model. The negative binomial CAR model was selected for predicting overall incidence rates in males because it optimized the leave-one-out cross-validation information criterion (LOOIC). However, of the four models, only the BYM model worked when African-American males were the target population and prostate cancer was the outcome of interest. The results suggest that several cities, including Los Angeles, Monterey Park, Alhambra, San Gabriel, El Monte, Rosemead, and Montebello, have lower cancer incidence rates for males aged 35 to 49, so spatial patterns need to be taken into account when predicting cancer rates.
Asset Metadata
Creator
Dai, Xinyang
(author)
Core Title
Small area cancer incidence mapping using hierarchical Bayesian methods
School
Keck School of Medicine
Degree
Master of Science
Degree Program
Biostatistics
Degree Conferral Date
2024-05
Publication Date
04/08/2024
Defense Date
04/05/2024
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
Bayesian model,cancer incidence rates,disease mapping,hierarchical methods,OAI-PMH Harvest
Format
theses
(aat)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Zhang, Lu (committee chair), Cockburn, Myles (committee member), Eckel, Sandrah (committee member)
Creator Email
daixy336@gmail.com,xinyangd@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113871201
Unique identifier
UC113871201
Identifier
etd-DaiXinyang-12770.pdf (filename)
Legacy Identifier
etd-DaiXinyang-12770
Document Type
Thesis
Rights
Dai, Xinyang
Internet Media Type
application/pdf
Type
texts
Source
20240408-usctheses-batch-1136 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu