Machine Learning-Based Breast Cancer Survival Prediction
by
Qi Nie
A Thesis Presented to the
FACULTY OF THE USC KECK SCHOOL OF MEDICINE
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
BIOSTATISTICS
May 2023
Copyright 2023 Qi Nie
Table of Contents
List of Tables
List of Figures
Abstract
Introduction
Chapter 1: Data Description
1.1 Case Identification
1.2 Data Cleaning and Preprocessing
1.2.1 Apriori Algorithm
1.2.2 Model Training
Chapter 2: Methods
2.1 Algorithms
2.1.1 Apriori
2.1.2 LASSO
2.1.3 Random Forest
2.1.4 Boruta
2.1.5 Comparison between LASSO and Random Forest
2.2 Statistical Analysis
Chapter 3: Results
3.1 Apriori
3.2 Descriptive Statistics
3.3 LASSO-based Logistic Regression
3.4 Boruta
3.5 Random Forest
Chapter 4: Conclusion and Discussion
4.1 Apriori
4.2 Prediction Model
References
List of Tables
Table 1: Comparison of LASSO and Random Forest
Table 2: Results of Apriori algorithm related to race/ethnicity
Table 3: Results of Apriori algorithm related to insurance type
Table 4: Descriptive table of continuous variables for training model
Table 5: Descriptive table of categorical variables for training model
Table 6: LASSO-based logistic regression model
Table 7: Testing confusion matrix for LASSO logistic regression
Table 8: Testing confusion matrix for Random Forest Classification
List of Figures
Figure 1: Flowchart of Apriori Algorithm
Figure 2: Example of Apriori Algorithm
Figure 3: Linear vs. Nonlinear
Figure 4: A brief version of all rules found by the Apriori Algorithm
Figure 5: Coefficients of variables vs. log lambda in LASSO logistic regression
Figure 6: ROC of LASSO logistic regression
Figure 7: Variable importance results produced by Boruta
Figure 8: ROC of Random Forest classification
Abstract
Machine learning (ML) has increasingly been applied in medicine. To discover
potential associations between social determinants of health (SDOH), clinical factors
of patients, and survival in a population-based cohort of women diagnosed with
breast cancer between 2010 and 2013 in California, the Apriori algorithm was applied to
data from the California Cancer Registry (CCR) linked to neighborhood-level data from
the American Community Survey (ACS), the CalEnviroScreen, and the Healthy Places
Index (HPI). To build prediction models of four-year breast cancer survival, the
Boruta algorithm was used to select essential features; least absolute
shrinkage and selection operator (LASSO)-based logistic regression and a random forest
using the features determined by the Boruta algorithm were then compared using the area
under the curve (AUC). The final model selected was LASSO-based logistic regression, with
an AUC of 0.8453.
Introduction
Breast cancer has become the most commonly diagnosed cancer worldwide
since 2022. One in eight cancer patients was diagnosed with breast cancer [1],
and roughly 290 thousand new cases were diagnosed among women in the US in 2022. [2]
Various machine learning models have been used to diagnose breast cancer at
earlier stages so that patients can receive timely and effective treatment. Juan
Gutiérrez‑Cárdenas and Zenghui Wang built a breast cancer classification model based
on 866 subjects (452 records for the training set and 413 for the test set) with genomic
sequence features. [3] Since 1997, many machine learning-based models have been
built to predict breast cancer recurrence based on patient characteristics, such as age,
race, and marital status, and tumor characteristics, such as tumor size and expression of
estrogen receptor (ER) and progesterone receptor (PR). [4] Most previous models
included clinical information, but few included demographic or social
determinants data. In addition to clinical and biochemical data, Ferroni et al. added
demographic data to a machine-learning model built on 454 women in total (318 for the
training set and 136 for the test set) with breast cancer in Italy. [5]
All the models described above are supervised models built on fewer than 1,000
samples in total. This thesis has two goals. The first is data mining using the Apriori
algorithm. The aim is to discover potential patterns between social determinants of
health (SDOH), clinical factors of patients, and survival in a population-based cohort of women
diagnosed with breast cancer between 2010 and 2013 in California, based
on 274,625 observations of 49 variables. The potential influence of SDOH on breast
cancer survival may help inform policies for minorities in California.
The second goal is to build a prediction model of 4-year breast cancer survival using
SDOH and clinical factors with appropriate performance. Supervised least absolute
shrinkage and selection operator (LASSO)-based logistic regression and Random
Forest models were built using 46,792 observations of 30 variables after data cleaning
and preprocessing to predict four-year breast cancer survival.
Chapter 1: Data Description
1.1 Case Identification
Breast cancer data were obtained from the California Cancer Registry (CCR), and
neighborhood-level data from the American Community Survey (ACS), the
CalEnviroScreen, and the Healthy Places Index (HPI). Records of women aged 18 and
older diagnosed with invasive (stage I-IV) breast cancer in California between January
1st, 2010, and December 31st, 2013, were included in the original dataset for data
mining and model training. SEER site recode 26000 was used to identify primary
malignant breast cancer cases. [26]
1.2 Data Cleaning and Preprocessing
1.2.1 Apriori Algorithm
Variables with more than 100,000 missing values (NAs), roughly 36 percent of the
274,623 observations in the original dataset, were removed. All variables used in this algorithm
were categorical; numerical data were converted to categorical data based on quantiles.
A total of 274,623 observations and 49 variables were used in the Apriori algorithm. The
minimum length of an itemset was set to 2; support was set to 0.002 to capture
patterns in minority groups; confidence was set to 0.8 (the default); and lift was
set to 3 instead of the default 1, both to capture subsets of variables with stronger
relationships and to limit the number of association rules in the result, since lift = 1
produced thousands of rules.
1.2.2 Model Training
Since model training requires a complete dataset, records were excluded for any
of the following: missing or unknown race or ethnicity; race or ethnicity other than
non-Hispanic White, non-Hispanic Black, Asian, or Hispanic; unknown cancer stage (stage I-IV); missing
pathological tumor data (hormone receptor or HER2 status); unknown immigrant status;
missing or unknown tumor size; missing Charlson comorbidity index; unknown receipt of
treatment or time to first treatment; and non-linked neighborhood-level data. The total
number of observations used for the prediction model was 46,792 with 30 variables (12
continuous variables and 18 categorical variables, summarized in table 4 and table 5).
The split ratio was set to 0.7; hence, there were 32,754 observations in the training set
and 14,038 observations in the test set. Ten-fold cross-validation was used in training
the LASSO-based logistic regression model.
Chapter 2: Methods
2.1 Algorithms
This section briefly introduces the algorithms used for data mining and for model
training. For data mining, the Apriori algorithm was used to discover the collections of
features that are strongly associated with 4-year breast cancer survival status. For
model training, the Boruta algorithm was used for feature selection, and
the features selected by Boruta were then used to train a Random Forest model to
predict the breast cancer survival of patients in the test set. A LASSO-based logistic
regression model was also trained for prediction purposes.
2.1.1 Apriori
Figure 1: Flowchart of Apriori Algorithm (itemset → generate k-itemset candidates → check support and prune → frequent itemsets; check whether the set of frequent itemsets is empty: if yes, stop; if no, generate the next candidates)
The Apriori algorithm is a commonly used unsupervised algorithm for association
rule mining. Its main purpose is to generate frequent itemsets, each of which is a
subset of all features. The flowchart in figure 1 shows how it works. It starts by
generating the frequent itemsets of length k that reach the minimal support, where k is
the minimal number of items in a set, chosen subjectively. Candidate itemsets of
length k+1 are then generated from the frequent itemsets of length k, and any candidate
that contains an infrequent subset is pruned. The database is then scanned to count
the support of each remaining candidate, and candidates that do not reach the minimal
support are eliminated. These steps are repeated until no new frequent itemset is
produced. The rules derived from the frequent itemsets are finally filtered by the
confidence and lift requirements. [14]
Figure 2: Example of Apriori Algorithm (lattice of all itemsets containing the target E, from {A,E}, {B,E}, {C,E}, {D,E} up to {A,B,C,D,E}; once an itemset is found to be infrequent, its supersets are pruned)
This algorithm is easier to understand through an example of transactions.
Suppose there are five goods A, B, C, D, E in a supermarket. The manager wants to
discover which goods frequently appear in the same basket as the good E (the target
variable). A simple way is to enumerate all the subsets of those five goods containing E and
then calculate the frequency of each. The problem is that n goods have $2^n - 1$ non-empty
subsets (31 in this example, of which $2^{n-1} = 16$ contain E), and this number increases
exponentially as n increases. Apriori is
more efficient. Its basic idea is that if an itemset is infrequent, then all of its supersets
(extended sets containing the original set) are infrequent as well. Figure 2 shows the process.
If the itemset {A,E} is found to be infrequent, then all the supersets {A,B,E}, {A,C,E},
{A,D,E}, {A,B,C,E}, {A,B,D,E}, {A,C,D,E}, {A,B,C,D,E} are considered infrequent
and are pruned.
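The level-wise search with superset pruning described above can be sketched in a few lines of Python. This is a minimal illustration, not the `arules` implementation used in the thesis; the function name `apriori_frequent_itemsets`, the toy `baskets`, and the support threshold of 0.5 are all assumptions made for the example.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support, min_len=1):
    """Return {itemset: support} for every frequent itemset (illustrative sketch)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = sorted({item for t in transactions for item in t})
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 1
    while current:
        # Join step: combine frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k + 1}
        # Prune step: a candidate with any infrequent k-subset cannot be frequent.
        candidates = [c for c in candidates
                      if all(frozenset(sub) in current for sub in combinations(c, k))]
        # Count step: keep only candidates that reach the minimal support.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return {s: v for s, v in frequent.items() if len(s) >= min_len}

# Hypothetical baskets over the goods A-E from the example above.
baskets = [{"B", "C", "E"}, {"B", "E"}, {"C", "E"}, {"A", "D"}, {"B", "C", "E"}, {"A", "B"}]
found = apriori_frequent_itemsets(baskets, min_support=0.5, min_len=2)
for itemset, s in sorted(found.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), s)
```

In this toy data {B,C} is infrequent, so the candidate {B,C,E} is pruned without its support ever being counted, which is the saving the paragraph above describes.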
The main parameters of this algorithm are support, confidence, and lift.
Support is the frequency with which an itemset appears in the full dataset.
Confidence is the conditional probability that itemset B is chosen given that
itemset A has already been chosen. Lift measures the strength of association
between A and B: it is the factor by which choosing itemset A increases the
probability of choosing itemset B. If lift equals one, itemsets A and B are independent.
The following formulas define the support, confidence, and lift of itemsets A and B. [6]
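$$\mathrm{Support}(A) = \frac{\text{number of records containing itemset } A}{\text{total number of records in the entire dataset}}$$
$$\mathrm{Confidence}(A \rightarrow B) = \frac{\mathrm{Support}(A \cap B)}{\mathrm{Support}(A)}$$
$$\mathrm{Lift}(A \rightarrow B) = \frac{\mathrm{Support}(A \cap B)}{\mathrm{Support}(A) \cdot \mathrm{Support}(B)}$$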
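As a small worked example of these three measures, the sketch below computes them on a hypothetical list of baskets. The helper names and the data are illustrative assumptions. Note that the joint occurrence written A∩B in the formulas corresponds, in code, to records that contain all items of A and all items of B.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Estimated P(rhs | lhs): joint support divided by support of the left side."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

def lift(transactions, lhs, rhs):
    """Confidence divided by the baseline support of the right side."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)

# Hypothetical data: 8 baskets, E is the target item.
baskets = [{"A", "E"}] * 3 + [{"A"}] + [{"E"}] + [{"B"}] * 3

print(support(baskets, {"A"}))            # 4 of 8 baskets contain A
print(confidence(baskets, {"A"}, {"E"}))  # 3 of the 4 baskets with A also contain E
print(lift(baskets, {"A"}, {"E"}))        # 0.75 / 0.5
```

A lift above one, as here, means E is more frequent among baskets with A than in the data overall.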
2.1.2 LASSO
If the number of features included is large relative to the number of subjects, a
model can suffer from overfitting [10]. LASSO, first introduced by Tibshirani [23], is a
popular technique to mitigate this problem. It controls the model's
complexity by constraining the size of the model coefficients. Through continuous
shrinkage, a subset of predictor features is selected automatically during
training.
Given observations $x_{ij}$ and corresponding responses $y_i$ for i = 1, 2, …, N and j = 1, 2, …, p,
where N is the number of observations and p is the number of covariates included in the
model, the LASSO algorithm aims to find $\beta = \{\beta_j\}$ that minimizes
$$\sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$
The first part of the above expression is the residual sum of squares (RSS), and the second
part, $\lambda \sum_{j=1}^{p} |\beta_j|$, is a penalty term whose strength is set by the
hyperparameter $\lambda$. $\lambda$ is chosen with respect to the cross-validated error.
Lambda.min is the $\lambda$ that gives the minimum mean cross-validated error.
Lambda.1se is the largest $\lambda$ in the set satisfying
$$\operatorname{avg}(\mathrm{mcr}) < \min(\operatorname{avg}(\mathrm{mcr})) + \operatorname{se}(\mathrm{mcr}), \qquad \operatorname{se}(\mathrm{mcr}) = \frac{\operatorname{sd}(\mathrm{mcr})}{\sqrt{N_{\mathrm{folds}}}}$$
where mcr is the misclassification error rate for each $\lambda$; $N_{\mathrm{folds}}$ is the number of folds; avg
refers to the average and sd to the standard deviation of mcr across folds. [27] Lambda.min gives
the model with the lowest cross-validated error, and lambda.1se gives a simpler model; for this
thesis, lambda.min is used.
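The selection of the two candidate values of lambda can be illustrated with a short sketch. It is not the `glmnet` code; the function `select_lambdas` and the per-fold error rates are hypothetical, and the threshold is applied with "less than or equal" so that the minimizing lambda always qualifies.

```python
import math

def select_lambdas(lambdas, fold_errors):
    """Pick lambda.min and lambda.1se from per-fold error rates (illustrative sketch).

    fold_errors[i] holds the cross-validated error rate of every fold for lambdas[i].
    """
    means, std_errors = [], []
    for errs in fold_errors:
        n_folds = len(errs)
        mean = sum(errs) / n_folds
        sd = math.sqrt(sum((e - mean) ** 2 for e in errs) / (n_folds - 1))
        means.append(mean)
        std_errors.append(sd / math.sqrt(n_folds))  # standard error of the mean

    best = min(range(len(lambdas)), key=lambda i: means[i])
    threshold = means[best] + std_errors[best]
    # Largest lambda whose mean error stays within one standard error of the minimum.
    lambda_1se = max(lam for lam, m in zip(lambdas, means) if m <= threshold)
    return lambdas[best], lambda_1se

# Hypothetical misclassification rates from 4 folds for three values of lambda.
lambdas = [0.001, 0.01, 0.1]
fold_errors = [
    [0.10, 0.12, 0.08, 0.10],  # mean 0.100
    [0.10, 0.11, 0.10, 0.11],  # mean 0.105, within one standard error of the best
    [0.20, 0.22, 0.18, 0.20],  # mean 0.200, clearly worse
]
lambda_min, lambda_1se = select_lambdas(lambdas, fold_errors)
print(lambda_min, lambda_1se)
```

The larger lambda.1se penalizes more strongly and therefore tends to keep fewer variables, which is why it is described as the simpler model.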
Equivalently, the RSS is minimized subject to the sum of the absolute values of the
coefficients being less than a constant. [23] Because of this constraint, if a variable
included in the model decreases the RSS only negligibly, the shrinkage penalty
outweighs its contribution, its coefficient is set to zero, and the variable is
dropped from the model. Furthermore, among a group of highly correlated features related
to the outcome, LASSO tends to select only one as a representative. As a
result, the model is more stable because highly correlated features are eliminated.
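How the penalty sets a weak coefficient exactly to zero can be seen in a small coordinate-descent sketch for the squared-error objective given above. This is an illustration under stated assumptions: the thesis fits a logistic model with `glmnet`, whereas this sketch uses the plain RSS form, and the function names and toy data are hypothetical.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become exactly 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize sum_i (y_i - sum_j x_ij b_j)^2 + lam * sum_j |b_j| one coefficient at a time."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0
            for i in range(n):
                # Residual of observation i when feature j is left out of the fit.
                fit_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
            norm_sq = sum(X[i][j] ** 2 for i in range(n))
            # Setting the derivative to zero gives a soft-thresholded update at lam / 2.
            beta[j] = soft_threshold(rho, lam / 2) / norm_sq
    return beta

# Toy data: the first feature explains y strongly, the second only weakly.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [2.1, 1.9, -1.9, -2.1]

beta_unpenalized = lasso_coordinate_descent(X, y, lam=0.0)
beta_lasso = lasso_coordinate_descent(X, y, lam=2.0)
print(beta_unpenalized)  # close to [2.0, 0.1]
print(beta_lasso)        # first coefficient shrunk, second exactly 0.0
```

With the penalty switched on, the weak second feature reduces the RSS too little to justify a non-zero coefficient, so it is dropped, while the strong first feature is only shrunk.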
2.1.3 Random Forest
Random Forest is an algorithm based on decision trees. Although an individual
decision tree can serve as a prediction model, it may suffer from high variance. To deal
with this problem, a random forest combines multiple
decision trees to provide a single decision. The trees are grown from different subsets of
the training data created by bootstrap aggregating, which increases the diversity of the trees;
bootstrapping is a resampling technique that randomly samples a dataset with
replacement. Given n independent and identically distributed observations $Z_1, \dots, Z_n$,
each with variance $\sigma^2$, the variance of their mean $\bar{Z}$ is $\sigma^2/n$. Hence averaging over
several decision trees reduces the variance. [11]
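The variance argument can be checked numerically. The simulation below is an illustration only: it draws independent normal values with variance 1 (an assumption; trees in a real forest are correlated, so the reduction in practice is smaller than the ideal factor of 1/n) and compares the variance of a single draw with the variance of the mean of 25 draws.

```python
import random
import statistics

def variance_of_mean(n, sigma=1.0, n_repeats=20000, seed=0):
    """Empirical variance of the mean of n independent N(0, sigma^2) draws."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    means = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n))
        for _ in range(n_repeats)
    ]
    return statistics.pvariance(means)

var_single = variance_of_mean(n=1)     # theory: sigma^2 = 1
var_averaged = variance_of_mean(n=25)  # theory: sigma^2 / 25 = 0.04
print(round(var_single, 3), round(var_averaged, 4))
```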
Mathematically, let the p-dimensional random vector $X = (X_1, X_2, \dots, X_p)$ be the p
variables in the model and Y be the target corresponding to each X. The prediction
function f(x) is the majority vote, that is, the class predicted most frequently by the trees:
$$f(x) = \operatorname*{arg\,max}_{y} \sum_{j=1}^{N} I\big(y = h_j(x)\big)$$
where $h_j(x)$ is the j-th base learner tree, N here denotes the number of trees, and $I(\cdot)$ is the indicator function.
The random forest model is simple and relatively robust to outliers and noise. It can also
provide variable importance, which helps show how much each feature contributes
to predicting the target.
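The voting rule amounts to counting the class labels returned by the individual trees. A minimal sketch follows; the three lambda "trees" are hypothetical stand-ins for fitted decision trees, and the feature names and cutoffs are made up for the example.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def forest_predict(trees, x):
    """Apply every tree h_j to x and take the majority vote over their outputs."""
    return majority_vote([tree(x) for tree in trees])

# Hypothetical stand-ins for fitted trees: each maps a feature dict to a class label.
trees = [
    lambda x: "Die" if x["cstage"] == 4 else "Survive",
    lambda x: "Die" if x["tsize"] >= 3 else "Survive",
    lambda x: "Survive",
]

prediction = forest_predict(trees, {"cstage": 4, "tsize": 3})
print(prediction)  # two of the three trees vote "Die"
```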
2.1.4 Boruta
Although Random Forest can produce variable importance, the threshold for
dropping or retaining features must be chosen subjectively if it is used for
feature selection. The Boruta algorithm solves this problem. It is a wrapper
algorithm built around random forest that uses shuffled features to remove unimportant
features as indicated by statistical tests. [12] Randomness is added by creating shuffled
copies of all features, called shadow features, and merging them with the original
dataset. A new random forest classifier is then trained on the extended dataset,
and variable importance is measured; in this thesis, the mean decrease in accuracy is
used. The maximal importance score among the shadow features is calculated at every iteration.
A real feature is marked “important” in an iteration when its importance score is
greater than the maximal importance score of the shadow features, and “unimportant”
when its score is lower. The number of “important” marks for each feature
over n iterations follows a binomial distribution, and hence a z score can be calculated to
decide which features are unimportant and need to be removed from the model. The
algorithm continues until all features are confirmed or rejected, or until it reaches
the maximum number of iterations.
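The decision step can be sketched as follows. This is a simplified illustration rather than the `Boruta` package: the importance scores are supplied directly instead of coming from a random forest, the z score uses a normal approximation to a binomial with success probability 0.5, and the cutoff of 1.96 and the function name are assumptions. Only the feature name `cstage` is taken from the thesis; the other names and all scores are made up.

```python
import math

def boruta_decisions(real_importance, shadow_importance, z_crit=1.96):
    """Classify features from per-iteration importance scores (illustrative sketch).

    real_importance[t] maps feature name -> importance at iteration t;
    shadow_importance[t] lists the shadow-feature importances at iteration t.
    """
    n_iter = len(real_importance)
    decisions = {}
    for feature in real_importance[0]:
        # A "hit" is an iteration in which the feature beats the best shadow feature.
        hits = sum(
            real_importance[t][feature] > max(shadow_importance[t])
            for t in range(n_iter)
        )
        # Under the null of a useless feature, hits ~ Binomial(n_iter, 0.5).
        z = (hits - 0.5 * n_iter) / math.sqrt(0.25 * n_iter)
        if z > z_crit:
            decisions[feature] = "confirmed"
        elif z < -z_crit:
            decisions[feature] = "rejected"
        else:
            decisions[feature] = "tentative"
    return decisions

# Hypothetical scores over 20 iterations.
real = [{"cstage": 0.9, "weak": 0.3 if t < 11 else 0.1, "noise": 0.01} for t in range(20)]
shadow = [[0.05, 0.2, 0.1] for _ in range(20)]
decisions = boruta_decisions(real, shadow)
print(decisions)
```

A feature that beats the shadows in every iteration is confirmed, one that never does is rejected, and one that wins about half the time stays undecided until more iterations are run.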
2.1.5 Comparison between LASSO and Random Forest
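Model | Similarity | Difference
LASSO | Lower risk of overfitting | Trains quickly; linear data
Random Forest | Lower risk of overfitting | Trains slowly; non-linear data
Table 1: Comparison of LASSO and Random Forest
Figure 3: Linear vs. Nonlinear; (a) linear decision boundary, (b) non-linear decision boundary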
Table 1 shows the similarity and differences between LASSO and Random
Forest. Tuning the degree of shrinkage in LASSO by cross-validation avoids overfitting.
Random Forest is also robust to outliers and lowers the risk of overfitting compared to
a single decision tree. However, LASSO-based logistic regression is a linear model. It
estimates a straight decision line to do the classification, as shown in (a) of figure 3. If the
true decision boundary is non-linear, as in (b) of figure 3, it is hard for a linear model
to classify well. By contrast, Random Forest can deal with non-linear decision boundaries. In terms of
computational efficiency, training LASSO-based logistic regression is much
faster than training Random Forest, in particular on large datasets.
2.2 Statistical Analysis
All statistical and machine learning analyses were conducted in R Statistical
Software (version 4.2.2). The package “arules” (version 1.7-3) was used to generate
tables of frequent itemsets related to survival and their corresponding confidence and lift.
The package “glmnet” (version 4.1-4) was used to build the LASSO-based logistic regression.
Feature selection for the random forest was done using the Boruta algorithm via the
package “Boruta” (version 7.0.0), and the random forest classification model was built with the
package “randomForest” (version 4.7-1.1).
The target outcome for both the Apriori algorithm and model training was the
dichotomous variable 4-year breast cancer-related survival (Yes/No), calculated from
survival in months derived from the SEER survival time program as provided by CCR.
The performance of each model was evaluated using specificity, sensitivity, and accuracy
derived from the confusion matrix, together with the area under the curve (AUC). The formulas for
specificity, sensitivity, and accuracy are shown below, where TP refers to true positives,
TN to true negatives, FP to false positives, FN to false negatives, P to total positives, and N to total negatives:
$$\text{Specificity} = \frac{TN}{TN + FP}$$
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
$$\text{Accuracy} = \frac{TP + TN}{P + N}$$
TP, FN, FP, TN, P, and N are shown in the confusion matrix as follows (rows are true classes, columns are predicted classes):
True \ Predicted | Prediction Positive (PP) | Prediction Negative (PN)
Positive (P) | True Positive (TP) | False Negative (FN)
Negative (N) | False Positive (FP) | True Negative (TN)
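These three measures follow directly from the four cells of the confusion matrix. The sketch below uses hypothetical counts, and the function name is an assumption made for the example.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute sensitivity, specificity, and accuracy from confusion-matrix counts."""
    p = tp + fn  # total true positives in the data (P)
    n = fp + tn  # total true negatives in the data (N)
    return {
        "sensitivity": tp / p,
        "specificity": tn / n,
        "accuracy": (tp + tn) / (p + n),
    }

# Hypothetical counts: 50 positives and 50 negatives.
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
print(metrics)
```

When the positive class is rare, accuracy is dominated by specificity, so a model can reach high accuracy while still missing most positives; sensitivity has to be read alongside it.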
Chapter 3: Results
3.1 Apriori
In the Apriori results, classical tumor and patient characteristics that
predict prognosis, such as the status of the biomarkers estrogen receptor (ER),
progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2), cancer
stage, and receipt of chemotherapy or radiation therapy, appear with high frequency.
Race/ethnicity and insurance type also appear frequently in the
frequent item sets with the outcome of death from breast cancer within 4 years.
Frequent item set | Confidence | Lift
{race_ethn=NH Black, cstage=4} | 0.847 | 3.144
{race_ethn=NH Black, prpos=0} | 0.839 | 3.114
{race_ethn=NH Black, her2pos=0} | 0.822 | 3.051
{race_ethn=NH Black, radiation=None} | 0.813 | 3.019
{race_ethn=Hispanic, prpos=0} | 0.811 | 3.010
Table 2: Results of Apriori algorithm related to race/ethnicity. cstage=4 refers to cancer stage IV based on the SEER-modified American Joint Committee on Cancer (AJCC) staging system (California Cancer Registry 2020); prpos=0
refers to progesterone receptor negative; her2pos=0 refers to human epidermal growth factor receptor 2 status
negative; radiation=None means the patient did not receive radiation therapy. NH: Non-Hispanic.
Figure 4: A brief version of all rules found by the Apriori algorithm; rectangles represent variables, circles represent
rules, and lines linked between rectangles and circles refer to the inclusion of variables in the rule. Darker red refers
to a higher lift of the rule, which means a stronger association with the outcome: 4-year breast cancer death.
Table 2 shows, for rules involving race/ethnicity, the conditional probability of death from
breast cancer within four years (parameter confidence) and the factor by which that
probability exceeds the overall frequency of death in the dataset (parameter lift). Among all 173 rules
satisfying the chosen parameters (shown in figure 4), there are four rules for non-Hispanic
Black patients, one rule for Hispanic patients, and no rules for Asian or non-Hispanic
White patients. Given that the patient is non-Hispanic Black with stage IV cancer,
the probability of death from breast cancer within 4 years is 0.847. The probability
is 0.839 for non-Hispanic Black patients who are PR negative and 0.811 for
Hispanic patients who are PR negative. It is 0.822 for non-Hispanic Black patients
who are HER2 negative and 0.813 for non-Hispanic Black patients who did not receive
radiation therapy.
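Because lift is confidence divided by the support of the outcome, the two columns of Table 2 can be used to back out the outcome frequency that the rules imply. The short check below is only arithmetic on the rounded values in Table 2; the resulting figure is a back-calculation, not a number reported in the thesis.

```python
# (confidence, lift) pairs copied from Table 2.
rules = {
    "NH Black, cstage=4": (0.847, 3.144),
    "NH Black, prpos=0": (0.839, 3.114),
    "NH Black, her2pos=0": (0.822, 3.051),
    "NH Black, radiation=None": (0.813, 3.019),
    "Hispanic, prpos=0": (0.811, 3.010),
}

# lift = confidence / support(outcome)  =>  support(outcome) = confidence / lift
implied_outcome_support = {
    name: confidence / lift for name, (confidence, lift) in rules.items()
}

for name, value in implied_outcome_support.items():
    print(f"{name}: {value:.3f}")
```

All five rules imply the same outcome frequency to within rounding, which is the internal consistency expected when every rule shares the same right-hand side.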
Frequent item set | Confidence | Lift
{insurance=Self/oth/unk, prpos=0} | 0.898 | 3.335
{insurance=Medicaid, prpos=0} | 0.846 | 3.140
{insurance=Medicaid, erpos=0} | 0.843 | 3.131
{insurance=Self/oth/unk, erpos=0} | 0.831 | 3.087
{radiation=None, insurance=Self/oth/unk} | 0.831 | 3.087
{chemo=None, insurance=Self/oth/unk} | 0.825 | 3.062
{cstage=4, insurance=Medicaid} | 0.824 | 3.059
{insurance=Medicaid, triple_neg=0} | 0.821 | 3.049
{insurance=Self/oth/unk, ses_group=1} | 0.815 | 3.025
Table 3: Results of Apriori algorithm related to insurance type. erpos=0 refers to estrogen receptor negative;
chemo=None means the patient did not receive chemotherapy; ses_group=1 refers to the lowest
neighborhood socioeconomic status indicators.
Table 3 shows the insurance-type-related conditional probability of death from
breast cancer within four years. Compared with Commercial and Medicare insurance, the
Self/other/unknown and Medicaid insurance types appeared more often in the frequent item sets
with the outcome of death from breast cancer.
3.2 Descriptive Statistics
variable | min | max | mean | sd
age | 18 | 102 | 61.53 | 13.91
% who receive public assistance | 0 | 48.68 | 5.88 | 6.14
% with less than a high school education | 0 | 79.88 | 16.58 | 14.39
% with low English proficiency | 0 | 80.24 | 9.143 | 9.3
% with income-to-poverty ratio ≥ 2 | 9.54 | 99.82 | 71.67 | 17.45
% who are foreign-born | 0.94 | 90.42 | 26.1 | 14.33
% unemployed | 0 | 22.5 | 5.42 | 2.64
CES pollution burden score | 5.78 | 80.8 | 40.41 | 12.6
Tree canopy | 0.043 | 74.27 | 8.62 | 9.4
% with supermarket access | 0 | 100 | 46.72 | 34.6
EPA walkability score | 2.65 | 19.67 | 12.02 | 3.38
Population density per square mile | 0.5 | 161499.1 | 5539.2 | 8047.7
Table 4: Descriptive table of continuous variables for training model. CES: CalEnviroScreen; EPA: Environmental
Protection Agency. All the neighborhood characteristics in this table were calculated using data from the American
Community Survey and Healthy Places Index. In detail, % with supermarket access refers to the percent of the
population within ½ mile (urban) or 1 mile (rural) of a supermarket. CES pollution burden score reflects exposure to
environmental pollutants.
Table 4 shows the summary statistics of the continuous variables, and table 5 shows
the descriptive statistics of the categorical variables used in training the prediction model.
Demographically, the majority of patients in this study were non-Hispanic White
(52.7%), followed by Hispanic (24.6%), Asian (16.8%), and non-Hispanic Black (5.8%).
Only 15.1% of patients were seen at a California NCI-designated cancer center for their
cancer. For clinical factors, the majority of patients had
early-stage cancer (stage I or II) (83.3%), had CCI = 0 (70.6%), were HER2 negative
(85.1%), did not have triple-negative breast cancer (88.4%), received surgery (95.5%), were HR
positive (83.6%), had no treatment delay (86.2%), and survived 4 years after
diagnosis (90.5%). The frequency increased with SES level. 68.5% of
patients were US-born, 87.3% lived in urban areas, and most had
commercial or Medicare insurance (85.7%).
Variable | Level | Frequency (%)
NCI_flag | No | 39744 (84.9)
NCI_flag | Yes | 7048 (15.1)
SES | 1 | 6307 (13.5)
SES | 2 | 8609 (18.4)
SES | 3 | 9730 (20.8)
SES | 4 | 10814 (23.1)
SES | 5 | 11332 (24.2)
Race | NH White | 24645 (52.7)
Race | Hispanic | 11528 (24.6)
Race | Asian | 7882 (16.8)
Race | NH Black | 2737 (5.8)
Cancer stage | 1 | 22675 (48.5)
Cancer stage | 2 | 16273 (34.8)
Cancer stage | 3 | 5812 (12.4)
Cancer stage | 4 | 2032 (4.3)
Marital status | married | 25374 (54.2)
Marital status | unmarried/other | 21418 (45.8)
Comorbidity | 0 | 33041 (70.6)
Comorbidity | 1 | 8746 (18.7)
Comorbidity | 2 | 5005 (10.7)
Chemotherapy | None | 19486 (41.6)
Chemotherapy | Yes | 27306 (58.4)
Radiation therapy | None | 21945 (46.9)
Radiation therapy | Yes | 24847 (53.1)
HER2 status | Negative | 39826 (85.1)
HER2 status | Positive | 6966 (14.9)
Triple-negative breast cancer | No | 41344 (88.4)
Triple-negative breast cancer | Yes | 5448 (11.6)
Immigrant | US-born | 32071 (68.5)
Immigrant | Immigrant | 14721 (31.5)
Rural | Urban | 40832 (87.3)
Rural | Rural or frontier | 5960 (12.7)
Surgery | None | 2112 (4.5)
Surgery | Lumpectomy | 24048 (51.4)
Surgery | Mastectomy | 20632 (44.1)
Hormone receptor status | Negative | 7661 (16.4)
Hormone receptor status | Positive | 39131 (83.6)
Tumor size | 0 | 112 (0.2)
Tumor size | 1 | 26522 (56.7)
Tumor size | 2 | 15047 (32.2)
Tumor size | 3 | 3151 (6.7)
Tumor size | 4 | 1960 (4.2)
4-year breast cancer survival | Survive | 42356 (90.5)
4-year breast cancer survival | Death | 4436 (9.5)
Treatment delay | No | 40358 (86.2)
Treatment delay | Yes | 6434 (13.8)
Insurance type | Commercial or Medicare | 40079 (85.7)
Insurance type | Medicaid | 4814 (10.3)
Insurance type | Self-pay or other | 1899 (4.1)
Table 5: Descriptive table of categorical variables for training model. NCI_flag indicates whether a patient was seen at
a California NCI-designated cancer center for their cancer, determined by reviewing all abstracts. SES refers to neighborhood
socioeconomic status indicators. Cstage refers to the cancer stage based on SEER. Rural refers to neighborhood
type based on the Medical Service Study Area designation, depending on the 2010 census. Comorbidity is based on
the Charlson comorbidity index (CCI) calculated by CCR. Tumor size was identified based on the TNM category.
3.3 LASSO-based Logistic Regression
variable | coefficient | variable | coefficient
(Intercept) | -5.718007132 | acs_pct_pers_incpov_ge2 | 0.000400028
age | 0.012713517 | acs_pct_foreignborn | -0.008208445
nci_flag | -0.341928904 | acs_pct_unemploy | -0.000852486
ses | -0.049966556 | ces_pollution_burden | 0.003318681
race_cat | -0.075229912 | hpi_treecanopy | -0.003555788
cstage | 1.240613002 | hpi_supermkts | -0.001128309
unmarried | 0.169551401 | hpi_walkability_ct | 0.019848304
comorbidity | 0.171321878 | popdensqmi10 | .
chemotx | -0.13611354 | immigrant | -0.193187576
radiotx | 0.315864051 | rural | -0.07107382
her2pos | -0.060412608 | surg_cat | -0.26928958
triple_neg | 0.85295915 | hrpos | -0.422217781
acs_pct_hh_pub_assist | 0.008374291 | tsize | 0.242942939
acs_pct_lt_hs | -0.004225027 | tx_delay | -0.044833696
acs_pct_eng_lt_well | 0.009249278 | ins_cat2 | 0.141266374
Table 6: LASSO-based logistic regression model
True \ Predicted | Survive | Die
Survive | 12441 | 189
Die | 1034 | 374
Table 7: Testing confusion matrix for LASSO logistic regression; the probability cutoff is 50%.
Figure 5: Coefficients of variables vs. log lambda in LASSO logistic regression
Figure 6: ROC of LASSO logistic regression
Figure 5 shows the coefficient paths of the LASSO-based logistic regression. All the
coefficients eventually shrink to zero as lambda increases. For the best model performance,
lambda.min = 0.000142 (best model) was chosen instead of lambda.1se = 0.00706
(simpler model). The final model using this lambda is shown in table 6, where only
the feature population density per square mile is dropped from the model. Table 7
shows the confusion matrix of the LASSO model on the test set, taking death as the positive
class; from this matrix, the error rate of
the model is 8.71%, accuracy is 91.29%, sensitivity is 26.56%, and specificity is
98.50%. Figure 6 presents the receiver operating characteristic (ROC) curve of the
LASSO model; the AUC, which is a global measure for evaluating classification model
performance [21], is 0.845.
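One way to read the AUC is as the probability that a randomly chosen patient who died receives a higher predicted risk than a randomly chosen survivor. The sketch below computes the AUC directly from that definition on hypothetical labels and scores; it is an illustration, not the routine used to produce figure 6.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for sp in positives:
        for sn in negatives:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted risks: 1 = died within 4 years, 0 = survived.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(auc)  # 5 of the 6 positive-negative pairs are ordered correctly
```

Unlike sensitivity and specificity, this measure does not depend on the 50% probability cutoff, which is why it can be high even when sensitivity at that cutoff is low.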
3.4 Boruta
Figure 7: Variable importance results produced by Boruta
Figure 7 shows the results produced by the Boruta algorithm. The most important
variable is the cancer stage; the two leading factors after the cancer stage are
tumor size and surgery type. All the candidate features included in model training
were confirmed as important for predicting 4-year breast cancer survival.
Therefore, all features were used for the Random Forest prediction model.
3.5 Random Forest
True \ Predicted | Survive | Die
Survive | 12380 | 250
Die | 993 | 415
Table 8: Testing confusion matrix for Random Forest Classification; the probability cutoff is 50%.
Figure 8: ROC of Random Forest classification
Table 8 is the confusion matrix of the Random Forest prediction model on the test set using
the features selected by the Boruta algorithm (all features), again taking death as the
positive class. From this matrix, the error rate
is 8.85%, prediction accuracy is 91.15%, sensitivity is 29.47%, and
specificity is 98.02%. Figure 8 shows the ROC curve of the Random Forest prediction model.
The AUC is 0.831.
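The reported percentages for both models can be reproduced from the counts in Tables 7 and 8. The check below assumes that death is the positive class, which is the reading under which the reported sensitivity and specificity match the tables; the variable names are illustrative.

```python
# Counts from Tables 7 and 8 (rows: true class, columns: predicted class).
confusion = {
    "LASSO": {"tn": 12441, "fp": 189, "fn": 1034, "tp": 374},
    "Random Forest": {"tn": 12380, "fp": 250, "fn": 993, "tp": 415},
}

summary = {}
for model, c in confusion.items():
    total = c["tn"] + c["fp"] + c["fn"] + c["tp"]
    summary[model] = {
        "accuracy": 100 * (c["tp"] + c["tn"]) / total,
        "sensitivity": 100 * c["tp"] / (c["tp"] + c["fn"]),  # deaths correctly predicted
        "specificity": 100 * c["tn"] / (c["tn"] + c["fp"]),  # survivors correctly predicted
    }

for model, m in summary.items():
    print(model, {name: round(value, 2) for name, value in m.items()})
```

Both test sets contain 1,408 deaths among 14,038 patients, so the low sensitivity of each model means that most deaths are missed at the 50% cutoff even though overall accuracy exceeds 91%.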
Chapter 4: Conclusion and Discussion
4.1 Apriori
The results showed that, among the four racial/ethnic groups included in this study in
California, death in the late stages of breast cancer was most frequent among
non-Hispanic Black patients, consistent with the racial disparities in the diagnosis, treatment, and outcomes of
breast cancer reported in prior research. [17-19]
In addition to the target of 4-year breast cancer survival,
race/ethnicity was also set as a target to find potential cluster patterns using the Apriori algorithm.
Interestingly, the preferred choice of hospital differed by
race/ethnicity. Here, preference was considered rather than access to or availability of
hospitals because there were no hospitals with records of only one or two
races/ethnicities. From 2010 to 2013 in California, non-Hispanic Black patients with breast cancer
tended to choose Kaiser Permanente on Cadillac Ave in West Los Angeles (hospital
number 190315), and Asian patients with breast cancer were diagnosed mainly at Chinese
Hospital (hospital number 38715). Garfield MED CTR (hospital number 190315) and
Seton Medical Center (hospital number 410817) were also common choices for Asian
patients with breast cancer. The conditional probability that an Asian patient diagnosed with
breast cancer at Garfield MED CTR survived 4 years was 0.804, which is a
high probability of survival. Hence disparities in hospital preference and
in access to medical care benefits may influence the breast cancer survival
of patients of different races/ethnicities and medical insurance types. Research
by Laguna JR also indicated that advance care planning varied across
races: both non-Hispanic Black and Hispanic patients were less likely to designate a
healthcare proxy. [16]
For insurance type, it is intuitive that self-pay may be inadequate for poor
patients to complete breast cancer treatment. According to a study by Subramanian S et al. [20],
the cost of breast cancer treatment for Medicaid beneficiaries, excluding in situ cancer,
continued to increase and doubled in total within 2 years after diagnosis. They
argued that extended Medicaid coverage should be provided to ensure that patients
receive complete and comprehensive breast cancer treatment. This may also explain
the high frequency of death from breast cancer within
4 years among Medicaid patients in California.
In conclusion, more benefits should be provided to minorities in California to
ensure that each patient can receive more comprehensive treatment for breast
cancer, improving both breast cancer survival and racial equity in California.
4.2 Prediction Model
In practice, training the LASSO logistic regression was faster than training the
Random Forest on this dataset.
From an interpretation perspective, Random Forest based on Boruta is better
because the LASSO algorithm is driven by the cost function. A feature in the LASSO
model is considered essential if the model's performance decreases when the feature is
removed; conversely, removing an “unimportant” feature leaves performance unchanged
or improves it. The problem is that unchanged performance does not mean that the
feature is unrelated to the target variable; it means only that the feature does not help
reduce the cost function. Feature selection with Boruta, by contrast, does not depend
on the cost function; it aims to select the features that are related to the target variable
and help to predict the outcome better. The variable importance it provides is also more
interpretable for understanding the relationships among the variables in the model.
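The point that a dropped feature is not necessarily unrelated to the outcome can be seen in the simplest LASSO setting. As a simplifying assumption (the thesis uses LASSO-penalized logistic regression, which has no closed form), take a linear model with standardized, mutually uncorrelated features and the objective (1/2n)·||y − Xβ||² + λ·||β||₁. Each coefficient is then the soft-thresholded univariate association z_j, so any feature with |z_j| ≤ λ is set exactly to zero even though z_j is not zero. The values of `z` and `lam` below are hypothetical.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator: sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical univariate associations between each feature and the outcome.
z = {"feature_a": 0.80, "feature_b": 0.05, "feature_c": -0.30}
lam = 0.10  # hypothetical penalty strength

# Under the orthonormal-design assumption, these are the LASSO coefficients.
coefs = {name: soft_threshold(value, lam) for name, value in z.items()}
selected = [name for name, beta in coefs.items() if beta != 0.0]
print(selected)
```

Here `feature_b` has a small but nonzero association with the outcome and is still dropped, because its contribution to the fit does not outweigh the penalty.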
However, this thesis aims to find a prediction model with good performance, so
the accuracy of the model matters more than its interpretability. Both models achieve
more than 90% accuracy in predicting 4-year breast cancer survival. The overall
performance of the LASSO-based logistic regression model is better than that of the
Random Forest model, with higher accuracy, specificity, and AUC but lower sensitivity.
Because each model has a high specificity (more than 98%) and a low sensitivity (less
than 30%), the two models perform very well at predicting that a patient will survive
breast cancer for 4 years but poorly at predicting death. Low sensitivity is expected
for a dataset with a low prevalence of positive test results among the candidate
patients [22]. In the dataset used in this thesis, only 9.5% of patients died from breast
cancer within the 4 years of the study. Changing the cutoff from the default 0.5 to 0.75
decreases sensitivity slightly to 93.22% but increases specificity to 50.67%. A cutoff of
0.9 provides 79.10% sensitivity and 74.75% specificity, but such prediction performance
is not high enough to use. Specificity can keep increasing to more than 90% by changing
the cutoff to 0.96, but this sacrifices too much sensitivity (52.02%).
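The trade-off between sensitivity and specificity as the cutoff moves can be sketched in a few lines of Python. This example assumes that death is the positive class and that the model outputs a predicted probability of death. The labels and scores are made-up values for illustration, not the thesis data, and the function name is hypothetical.

```python
def sensitivity_specificity(y_true, scores, cutoff):
    """Return (sensitivity, specificity) when score >= cutoff is called positive."""
    tp = fn = fp = tn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= cutoff
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    # Assumes both classes are present, so neither denominator is zero.
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels (1 = died, 0 = survived) and predicted probabilities of death.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.90, 0.40, 0.20, 0.60, 0.30, 0.10, 0.10, 0.05, 0.20, 0.35]

for cutoff in (0.5, 0.25):
    sens, spec = sensitivity_specificity(y_true, scores, cutoff)
    print(f"cutoff={cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Lowering the cutoff labels more patients as positive, which raises sensitivity and lowers specificity; raising it does the opposite.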
Given its low accuracy in predicting death from breast cancer, the model is
better suited to predicting 4-year survival of breast cancer. When medical resources
are limited, for example when doctors or medical equipment are scarce, resources can
be directed to those patients whom the model predicts will die.
References
[1] H. Sung, J. Ferlay, R.L. Siegel, M. Laversanne, I. Soerjomataram, A. Jemal, et
al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality
worldwide for 36 cancers in 185 countries. CA A Cancer J Clin, 71 (2021), pp. 209-249
[2] Cancer of the breast (female) - cancer stat facts (no date) SEER. Available at:
https://seer.cancer.gov/statfacts/html/breast.html (Accessed: Jan 20, 2023).
[3] Gutiérrez-Cárdenas, J. and Wang, Z. (2021) “Classification of breast cancer and
breast neoplasm scenarios based on machine learning and sequence features from
lncrnas–mirnas-diseases associations,” Interdisciplinary Sciences: Computational Life
Sciences, 13(4), pp. 572–581.
[4] Abreu, P. et al. (2016) “Predicting breast cancer recurrence using Machine Learning
Techniques,” ACM Computing Surveys, 49(3), pp. 1–40.
[5] Ferroni, P. et al. (2019) “Breast cancer prognosis using a machine learning
approach,” Cancers, 11(3), p. 328.
[6] Prof. Snehal Dhane (2022) “Data analysis using data mining techniques as K means
and apriori algorithms,” International Journal of Advanced Research in Science,
Communication and Technology, pp. 628–629.
[7] Bhargava M, Selwal A. Association Rule mining using Apriori Algorithm: A review.
International journal of advanced research in computer science. 2013;4(2).
[8] Bayardo, R.J. (1998) “Efficiently mining long patterns from databases,” ACM
SIGMOD Record, 27(2), pp. 85–93.
[9] Krisnanto, U. et al. (2022) “Utilizing apriori data mining techniques on sales
transactions,” Webology, 19(1), pp. 5581–5590.
[10] Amaador, K. et al. (2021) “Discriminating between Waldenström macroglobulinemia
and marginal zone lymphoma using logistic lasso regression,” Leukemia &
Lymphoma, 63(5), pp. 1070–1079.
[11] Breiman L. Random forests. Machine learning. 2001;45(1):5-32.
doi:10.1023/A:1010933404324
[12] Kursa, M.B. and Rudnicki, W.R. (2010) “Feature selection with the Boruta
package,” Journal of Statistical Software, 36(11).
[13] Months survived based on complete dates SEER. Available at:
http://seer.cancer.gov/survivaltime
[14] Srikant R, Agrawal R. Mining generalized association rules. Future generation
computer systems. 1997;13(2):161-180. doi:10.1016/S0167-739X(97)00019-8
[15] Nobili, S. et al. (2021) “Establishment and characterization of a new spontaneously
immortalized er−/pr−/her2+ human breast cancer cell line, DHSF-BR16,” Scientific
Reports, 11(1).
[16] Laguna JR. Racial/ethnic Variation in Care Preferences and Care Outcomes
Among United States Hospice Enrollees. Los Angeles; 2014.
[17] Daroui, P. et al. (2012) “Utilization of breast conserving therapy in stages 0, I, and II
breast cancer patients in New Jersey,” American Journal of Clinical Oncology, 35(2),
pp. 130–135.
[18] Livaudais, J.C. et al. (2011) “Racial/ethnic differences in initiation of adjuvant
hormonal therapy among women with hormone receptor-positive breast cancer,” Breast
Cancer Research and Treatment, 131(2), pp. 607–617.
[19] Roberts, M.C. et al. (2016) “Racial variation in the uptake of Oncotype DX testing
for early-stage breast cancer,” Journal of Clinical Oncology, 34(2), pp. 130–138.
[20] Subramanian S, Trogdon J, Ekwueme DU, Gardner JG, Whitmire JT, Rao C. Cost
of Breast Cancer Treatment in Medicaid: Implications for State Programs Providing
Coverage for Low-Income Women. Medical care. 2011;49(1):89-95.
doi:10.1097/MLR.0b013e3181f81c32
[21] Hoo ZH, Candlish J, Teare D. What is an ROC curve? Emergency medicine
journal : EMJ. 2017;34(6):357-359. doi:10.1136/emermed-2017-206735
[22] Lütkenhöner B, Basel T. Predictive modeling for diagnostic tests with high
specificity, but low sensitivity: A study of the glycerol test in patients with suspected
Menière’s disease. PloS one. 2013;8(11):e79315-e79315.
doi:10.1371/journal.pone.0079315
[23] Tibshirani R. Regression Shrinkage and Selection via the Lasso. Journal of the
Royal Statistical Society Series B, Methodological. 1996;58(1):267-288.
doi:10.1111/j.2517-6161.1996.tb02080.x
[24] Tibshirani R. Regression shrinkage and selection via the lasso: a retrospective:
Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical
Society Series B, Statistical methodology. 2011;73(3):273-282.
doi:10.1111/j.1467-9868.2011.00771.x
[25] Shi Guoliang, Jing Zhigang, Fan Liwei. Research on the Original Oil Price
Prediction Based on Lasso-Xgboost Combination Method [J]. Industrial Technology &
Economy, 2018, 37(7):31-37.
[26] Site recode ICD-O-3/WHO 2008 - seer data reporting tools (no date) SEER.
Available at: https://seer.cancer.gov/siterecode/icdo3_dwhoheme/
[27] Lee, S.-H. (2021) Lambda.min, lambda.1se and cross validation in Lasso : Binomial
response: R-bloggers, R. Available at:
https://www.r-bloggers.com/2021/10/lambda-min-lambda-1se-and-cross-validation-in-lasso-binomial-response/
Asset Metadata
Creator
Nie, Qi (author)
Core Title
Machine learning-based breast cancer survival prediction
School
Keck School of Medicine
Degree
Master of Science
Degree Program
Biostatistics
Degree Conferral Date
2023-05
Publication Date
05/02/2023
Defense Date
04/27/2023
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
apriori,Boruta,breast cancer,Lasso,machine learning,OAI-PMH Harvest,random forest
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Li, Ming (committee chair), Lewinger, Juan Pablo (committee member), Liu, Lihua (committee member)
Creator Email
nq990124@gmail.com,qinie@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113096945
Unique identifier
UC113096945
Identifier
etd-NieQi-11757.pdf (filename)
Legacy Identifier
etd-NieQi-11757
Document Type
Thesis
Rights
Nie, Qi
Internet Media Type
application/pdf
Type
texts
Source
20230503-usctheses-batch-1035 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu