Increase Colorectal Cancer Prediction Accuracy with the Influence (I)-Score
by
Yinchuan Xu
A Thesis Presented to the
FACULTY OF THE USC DANA AND DAVID DORNSIFE COLLEGE OF
LETTERS, ARTS AND SCIENCES
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(APPLIED MATHEMATICS)
May 2024
Copyright 2024 Yinchuan Xu
Acknowledgments
This thesis and the related research were conducted under the supervision of Professor Fengzhu Sun, Departments of Quantitative and Computational Biology and Mathematics, University of Southern California, during the 2023-2024 academic year.
The author would like to thank Professor Fengzhu Sun, Professor Steven Heilman, and Professor Detlof von Winterfeldt, the members of his thesis committee, for their patient mentoring and advice early in his career in Mathematics and during his job search.
The author would also like to thank all his family members, especially his father Shibing Xu, mother Aihua Gong, and aunt Jianhua Gong, for their continued support, care, and selfless love for him.
The author shares one of his favorite quotes: Self-discipline paves the way to turn your
dreams into reality.
Table of Contents
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter A: Introduction
Chapter B: Data & Materials
  I. Data Resources
  II. Data Description
Chapter C: Methods
  I. Influence Score (I-score)
  II. Backward Dropping Algorithm
  III. Randomly Choose Subsets from the Dataset
  IV. Fair I-Score Model
  V. Select I-score Threshold
  VI. Evaluation of Fair Model
Chapter D: Results
  I. Prediction Performance of 4 Models on the Hannigan Testing Set
  II. Prediction Performance of 4 Models on the Thomas Testing Set
  III. Prediction Performance of 4 Models on the Vogtmann Testing Set
  IV. Prediction Performance of 4 Models on the Zeller Testing Set
Chapter E: Conclusion
Chapter F: Limitations
References
List of Tables
Table 1: Artificial example to explain the selected threshold
Table 2: Prediction Performance of 4 Models on the Hannigan Testing Set
Table 3: Prediction Performance of 4 Models on the Thomas Testing Set
Table 4: Prediction Performance of 4 Models on the Vogtmann Testing Set
Table 5: Prediction Performance of 4 Models on the Zeller Testing Set
List of Figures
Figure 1: The process of finding biased features
Figure 2: The table explaining parameters
Abstract
A recent study proposed an influence score (I-score) based method to improve
prediction model accuracy for skin cancer. The method was demonstrated using a
dataset of skin lesions, showcasing its potential effectiveness. In my study, I explore
the applicability of this method in the context of colorectal cancer (CRC) metagenomic
datasets. I employ a random forest algorithm as my foundational model, incorporating
the I-score method as an enhancement technique. I then evaluate the performance of
this model through various metrics to gauge its success. I show that the I-score method successfully mitigates bias within my original models, leading to more accurate predictions.
Keywords:
Backward Dropping Algorithm, Colorectal Cancer, Eliminate Bias, Fair Model,
Fairness Indicator, I-score, Random Forest
Chapter A: Introduction
Predicting phenotypes based on various features such as gene expression, single
nucleotide polymorphisms, microbial abundance, etc. is an important problem in
current genomic studies. The general approach to solve this problem is to collect a
training sample of individuals with their features and phenotypes, learn a model based
on the training sample, and finally apply the learned model to the testing sample with
their features to predict their phenotype. However, currently available data mostly
concentrate on individuals with Caucasian ancestry. The learned models based on such
data can have decent prediction accuracy when the learned models are applied to
Caucasians. On the other hand, when such learned models are applied to populations
different from Caucasians, the prediction accuracy is markedly decreased. Such a
phenomenon is referred to as bias in the machine learning community [1]. How to reduce bias and make predictions fair is an important problem in the machine learning community.
To solve this problem, Wu et al. developed an influence score (I-score) based method to enhance prediction accuracy [2]. It was initially applied to skin lesion datasets to identify
factors influencing lesion occurrence. In this study, I expanded its use to a pre-processed
colorectal cancer metagenomic dataset previously studied by Gao et al [3]. By applying
the I-score Method to the datasets in Gao et al., I investigate how well this approach
boosts prediction accuracy in a different dataset.
In my research, I employed the random forest algorithm as the base model to
identify factors contributing to colon cancer risk. To improve this basic model, I
adopted the "I-score Method". This method, also known as the "Fair Influence Score
Method," systematically identifies and removes biased feature subsets—those
negatively impacting model predictions—thereby refining the model with subsets that
positively affect predictions.
This study assesses the effectiveness of the I-score method by comparing metrics
like Sensitivity, Specificity, F1-score, and Accuracy between the initial and optimized
models, aiming to highlight the latter's enhanced predictive capabilities. Section B will
introduce the experimental data. Section C gives the I-score method and the entire
experimental process and evaluation. In Section D, I analyze the CRC metagenomic
datasets with the original and fair I-score models to obtain the results, and use statistical indicators to assess whether the predictions are effective. Finally, Sections E and F present the summary and limitations.
Chapter B: Data & Materials
I. Data Resources
The data I used are from selected colorectal cancer (CRC) metagenomic
datasets in Gao et al [3]. Colorectal cancer is a disease in which cells in the colon
or rectum grow out of control. It usually starts as small clumps of cells called
polyps. Over time, some may develop into cancer. With current technology,
doctors are unable to determine the cause of most colorectal cancers. Possible risk factors include older age, enteritis, diabetes, and alcohol consumption [4].
I took the filtered data from Gao et al. [3]. Raw data are publicly available in the European Nucleotide Archive (ENA) database [5]. The authors used Centrifuge to preprocess the data and created six publicly accessible, geographically
diverse CRC metagenomic datasets. Samples from adenoma patients were
excluded from the study and only samples from CRC patients and healthy
controls were used. My model uses three kinds of datasets, one is the training
dataset, one is the validation dataset, and the other is the testing dataset. The
training dataset is used to train the model. The validation dataset is used to
provide an unbiased evaluation of a model fit on the training dataset while
tuning model hyperparameters. The testing dataset is used to provide an
unbiased evaluation of a final model fit on the training dataset. I selected two
sets of data as training and validation datasets, namely Feng's (ERP008729) and
Yu's (PRJEB10878). I take 70% of the data from each of these two datasets as
training datasets and 30% of the data as validation datasets. Feng's data was
collected from Austria, while Yu's data was collected from China. These two datasets come from the West and the East, respectively, and their microbiome compositions are highly distinct. I expect that the distinct microbiome distributions will generate regional bias in the prediction model. To test this hypothesis, I use four datasets from four different regions, from either the East or the West, as testing datasets, and I test them separately. The first one is Hannigan's (PRJNA389927) dataset, which came from the USA and Canada. There are 27 cases
and 28 controls in this dataset. The next one is Thomas (SRP136711) that came
from Italy with 61 cases and 52 controls. The third one is Vogtmann
(PRJEB12449) also from the USA, which consisted of 52 cases versus 52
controls. The last is Zeller (ERP005534) collected from France with 91 cases
and 93 controls.
II. Data Description
The data structures of these six datasets are similar. I will take Feng's dataset
as an example. There are 46 cases and 63 controls in Feng's dataset. Similarly,
there are 74 cases and 54 controls in Yu's dataset. Next, consider the table of filtered data, again taking Feng's as an example. The full table has 109 rows (individuals) and 8383 columns (operational taxonomic units, OTUs). Each cell shows the number of reads mapped to the particular OTU for an individual. Since the numbers of columns in the datasets are different, I take their intersection when setting up the training, validation, and testing sets. For example, Feng's dataset has 8010 columns, and Yu's dataset has 5060 columns. Because the columns do not match across the datasets, I can only keep the intersection of the columns that are present in every dataset.
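The column-intersection step can be written in a few lines of pandas. This is a minimal sketch under the assumption that each dataset has been exported to a CSV file of OTU counts; the file names are placeholders, not the actual files.

```python
import pandas as pd

# Placeholder file names for the preprocessed OTU count tables.
feng = pd.read_csv("feng_otu_counts.csv", index_col=0)
yu = pd.read_csv("yu_otu_counts.csv", index_col=0)
zeller = pd.read_csv("zeller_otu_counts.csv", index_col=0)

# Keep only the OTU columns present in every dataset so that the training,
# validation, and testing sets share a single feature space.
common_cols = feng.columns.intersection(yu.columns).intersection(zeller.columns)
feng, yu, zeller = feng[common_cols], yu[common_cols], zeller[common_cols]
print(f"{len(common_cols)} OTUs shared by all datasets")
```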
Chapter C: Methods
To make model predictions more accurate, it is important to choose the training features carefully. The I-score method helps find specific feature subsets that have a strong influence on the model's accuracy [6]. To achieve this aim, we next introduce the I-score method and the evaluation of predictions in detail.
My study involves comparing two models: a basic model that uses random forests (RF) and an RF model modified by the I-score technique. First, given many randomly chosen feature subsets, the process starts with the Backward Dropping Algorithm (BDA), which refines each subset and sorts the resulting influential feature subsets by I-score from high to low [7]. The BDA is an iterative process that refines a set of features by successively removing the least influential features, based on the I-score, until no further improvement can be made by dropping any feature. Note that the I-score of each influential feature in a subset is defined to be the I-score of that subset. Then, the I-scores of these influential feature subsets are used as candidate thresholds. For each threshold, the influential features whose I-scores equal that threshold are set to 0, RF is implemented with all remaining features, and the AUC value is calculated. After the AUCs for all thresholds have been obtained, the highest AUC is identified. The influential feature subset corresponding to this AUC is the biased feature subset. By excluding this biased subset and applying RF again, we create an optimized model (Figure 1). Finally, we compare the basic and optimized models using statistical measures to see how much the prediction accuracy has improved.
Figure 1: The process of finding biased features
I. Influence Score (I-score)
The Influence Score (I-score) for a feature subset is a numerical value that
measures the differences among the mean responses of subsets of samples defined
by the 0/1 values of the features, as explained below [6]. The larger the value, the greater the impact the subset of features may have on the prediction accuracy. The formula for calculating the I-score of a feature set S consisting of m features is as follows:

$$I(S) = \frac{1}{\sigma^2}\sum_{j=1}^{2^m} n_j^2\left(\bar{Y}_j - \bar{Y}\right)^2,$$
with the parameters explained below.
Figure 2: The table explaining parameters
The I-score is calculated based on the validation dataset consisting of n samples with their features and the corresponding true binary outcome Y. For a specific set of features, for example S = {X_1, X_2, ..., X_m}, with X_k representing the k-th feature, there are a total of m features in this set arranged from 1 to m. For each feature, I define a new 0/1 variable for each individual, as shown in Figure 2. I calculate the median of the numbers of reads of the same feature over all examples. If the read number is greater than or equal to the median, it is recorded as 1; if it is less than the median, it is recorded as 0. That is, Z_k = 1(X_k ≥ m_k), where m_k is the median of all the values of feature X_k and 1(·) is the indicator function. There are at most 2^m possible subsets of samples defined by the values of (Z_1, ..., Z_m); when m = 2, for example, they are defined by (Z_1 = 1, Z_2 = 1), (Z_1 = 1, Z_2 = 0), (Z_1 = 0, Z_2 = 1), and (Z_1 = 0, Z_2 = 0), respectively. The resulting subsets are ordered from 1 to 2^m. Let n_j be the number of samples in the j-th subset and Ȳ_j be the mean response for samples in the j-th subset. In this way, n = Σ_{j=1}^{2^m} n_j. The σ² is defined as σ² = (1/n) Σ_{i=1}^{n} (Y_i − Ȳ)². This is the variance of Y, where Y_i represents the outcome of the i-th example, which can be 0 or 1. Ȳ is the sample mean of all the responses, and Ȳ_j is the mean response of the j-th subset of examples.
II. Backward Dropping Algorithm
The Backward Dropping Algorithm (BDA) is a method of finding influential
feature subsets from all feature subsets [7]. BDA initiates by selecting a random
subset of features from the entire set of features as its starting point. This
subset, denoted S, undergoes evaluation through an iterative process. In each iteration, the algorithm calculates the I-score of S to gauge its influence on the predictive outcome. Subsequently, the algorithm explores the effect of tentatively dropping each feature within S one at a time, creating new subsets and computing their respective I-scores. If any of these new subsets, formed by excluding a feature, demonstrates a higher I-score than the original subset S, the subset is updated to this new, more influential subset. This iterative process of evaluating and potentially dropping features continues until removing additional features yields no further improvement in the I-score, concluding with a subset of the most influential features. The algorithm efficiently narrows down the most influential features by iteratively eliminating those that contribute the least, ensuring the final subset of features has the maximum possible influence on the predictive outcome, based on the I-score criterion. In principle, all candidate feature subsets are passed through the algorithm to obtain influential subsets step by step. The BDA is executed many times, and the resulting subsets are then arranged by I-score from large to small to identify the influential subsets. We need to focus on these influential subsets, because they may have a better or worse impact on model predictions.
Backward Dropping Algorithm (BDA) (Take One Set as an Example)
Input: Validation set with features (X_1, X_2, ..., X_m) and associated true binary outcomes Y
Output: The potential subset of influential features S
while |S| ≥ 1 do
    Calculate the I-score of S:
        I(S) = (1/σ²) Σ_{j=1}^{2^|S|} n_j² (Ȳ_j − Ȳ)²
    Denote S = {X_{k_1}, X_{k_2}, ..., X_{k_|S|}} with k_1 < ··· < k_|S|
    for k = 1, 2, ..., |S| do
        Tentatively drop the k-th feature in S and refer to the new subset as S_k
        Compute the I-score of S_k
    end for
    if max_k I(S_k) > I(S) then
        Update S = S_k with the largest I-score
    else
        End the algorithm
    end if
end while
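The pseudocode above translates into a short Python sketch. This is an illustrative implementation that assumes the i_score helper from the previous sketch is in scope; the variable names are mine.

```python
import pandas as pd

# Assumes the i_score(X_subset, y) function from the previous sketch is defined.

def backward_dropping(X: pd.DataFrame, y, start_subset):
    """Refine one randomly chosen feature subset with the BDA.

    X: validation-set feature table; y: true binary outcomes;
    start_subset: list of column names forming the initial subset S.
    Returns the retained subset and its I-score.
    """
    S = list(start_subset)
    current = i_score(X[S], y)
    while len(S) >= 1:
        # Tentatively drop each feature in S, one at a time.
        candidates = []
        for k in range(len(S)):
            S_k = S[:k] + S[k + 1:]
            if S_k:                                        # the empty subset is not scored
                candidates.append((i_score(X[S_k], y), S_k))
        if candidates and max(candidates, key=lambda c: c[0])[0] > current:
            current, S = max(candidates, key=lambda c: c[0])   # keep the best drop
        else:
            break                                          # no drop improves the I-score
    return S, current
```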
III. Randomly Choose Subsets from the Dataset
The preceding section outlines the application of BDA to a single subset. In
this subsection, the focus shifts towards expanding the approach to get multiple
subsets using BDA, a strategy instrumental in enhancing the robustness and
universality of our predictive model. Adopting a methodical approach, subsets
were selected from the dataset at random, utilizing Python's "random" library
[8]. This randomness in selection plays a pivotal role in the construction of
diverse subsets and is regulated by the “random_state” parameter, set to 42 to
ensure reproducibility.
The procedure involves the "sample" function to randomly select between 1
and 10 features. The process is repeated 1000 times to generate a comprehensive
pool of 1000 distinct subsets. The BDA is then applied to these 1000 subsets. This strategy not only ensures a varied representation of the dataset's features but also maintains a consistent feature representation across all datasets by identifying and including common columns before performing the split [9].
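A minimal sketch of this random subset generation, assuming that a fixed seed of 42 plays the role of the random_state mentioned above; the function name is a placeholder.

```python
import random

def draw_subsets(feature_names, n_subsets=1000, max_size=10, seed=42):
    """Draw n_subsets random feature subsets, each containing 1..max_size features."""
    rng = random.Random(seed)            # fixed seed, mirroring random_state = 42
    names = list(feature_names)
    subsets = []
    for _ in range(n_subsets):
        size = rng.randint(1, max_size)  # between 1 and 10 features
        subsets.append(rng.sample(names, size))
    return subsets

# e.g. candidate_subsets = draw_subsets(common_cols)  # common_cols from the earlier sketch
```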
IV. Fair I-Score Model
The basic model of the fair I-score model is the random forests algorithm,
which is a way to predict response by combining many decision trees. These
trees start with a basic question about whether more classification is needed.
Depending on their answers, data points follow different paths, leading to a final
decision at the end of each path. I want to evaluate if the I-score can indeed be
used to improve the original model. I denote the model that only uses Feng's
dataset as the training and validation set as Model A, and the model that only
uses Yu's dataset as the training and validation set as Model B. Finally, I
designate the model that naively combined Feng and Yu's datasets as the training
and validation set as Model AB, thus obtaining the three most basic models. The
last one is the fair I-score Model, which is essentially the optimized Model AB.
It still utilizes the combined Feng and Yu datasets as training and validation sets
but incorporates the I-score method. I use the remaining four datasets (Hannigan,
Thomas, Vogtmann, and Zeller) as the test datasets of these four models.
We start with an overview of the first three fundamental models, taking Model
A as our primary example. The process begins with segmenting the dataset into
three parts: a training set, a validation set, and a test set. The training set's role
is to build the model, while the validation set assists in the model's training
phase by providing an unbiased performance assessment and aiding in
parameter optimization. The test set is utilized to gauge the model's
effectiveness. For this purpose, Feng's dataset serves as both the training and
validation sets, whereas the datasets from Hannigan, Thomas, Vogtmann, and
Zeller are employed as test sets. Initially, Feng's dataset is divided into training
and validation segments in a 7:3 ratio, in preparation for model training and
validation. I prepare my data by reading CSV files into “Pandas DataFrames”.
To ensure consistency in features across all datasets, I identify and select
common columns. Each dataset's status column is transformed into a binary
form (0/1) using "LabelEncoder" [10]. Following this, the construction of the
model's decision trees begins, leveraging the training data. This stage also
includes parameter tuning with the validation set, facilitated by "Scikit-Learn", a Python library that implements the random forest algorithm [11]. It simplifies the
decision trees' development by selecting random data samples, creating trees
from these samples, and repeating the procedure. The phase of testing then
introduces the previously mentioned four test sets to the developed random
forest model, leading to a voting process. This process averages the outcomes
of the decision trees, selecting the most common prediction as the final outcome.
Finally, I defined a function, “rf_test”, dedicated to evaluating the model's
performance on test datasets, where a confusion matrix plays a crucial role.
A confusion matrix is a way to express how many of a classifier’s predictions
were correct, and when incorrect, where the classifier got confused. In a
confusion matrix, the rows represent the true results, the columns represent the
predicted results, and the values on the diagonal represent the number of times
(or percentage) that the predicted value matches the true value. I use this
confusion matrix to calculate metrics such as Sensitivity, Specificity, Accuracy, F1-Score, and AUC [12].
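As a concrete illustration, the following sketch traces a Model A style workflow as described above: a 7:3 split of Feng's data, LabelEncoder on the status column, a scikit-learn random forest, and an rf_test-style evaluation built around the confusion matrix. The file names, the "status" column name, and the hyperparameters are assumptions of this sketch rather than the author's exact code; the individual metrics derived from the confusion matrix are spelled out in Section VI.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Placeholder file names; "status" is assumed to hold the CRC/control label.
feng = pd.read_csv("feng.csv")
hannigan = pd.read_csv("hannigan.csv")

# Restrict both datasets to their common OTU columns.
features = feng.columns.intersection(hannigan.columns).drop("status")
le = LabelEncoder()
y_feng = le.fit_transform(feng["status"])      # e.g. CRC/control -> 1/0
y_hann = le.transform(hannigan["status"])

# 7:3 training/validation split of Feng's dataset (Model A).
X_train, X_val, y_train, y_val = train_test_split(
    feng[features], y_feng, test_size=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)

def rf_test(model, X, y):
    """Confusion matrix (rows: true, columns: predicted) and AUC on a testing set."""
    cm = confusion_matrix(y, model.predict(X))
    auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
    return cm, auc

print(rf_test(rf, hannigan[features], y_hann))
```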
The other two models, Model B and Model AB, are constructed in a similar
way as above, except that their training and validation sets are Yu’s and
combined Feng and Yu's, respectively. The testing sets remain unchanged and
are still the four datasets mentioned above.
The fair I-score model, the last one, still utilizes the combined Feng and Yu
datasets as the training and validation set, and the four testing datasets remain
unchanged. However, it is structurally different from Model AB. Two important functions are used in the process of training and validating the optimized random forest. The "calculate_i_score" function computes the I-scores, which determine the significance of feature subsets and aid in the identification of influential features. The "fair_train_test" function essentially implements the BDA and finds the selected threshold, which will be discussed in more detail subsequently. It selects the biased feature subset and evaluates models in a loop for a specified number of iterations. In each iteration, it randomly selects a subset of features, calculates the I-score for this subset, and trains a random forest model to evaluate its performance using AUC as the metric. The features leading to the best AUC are then used to retrain and validate the model. The final trained model is the fair I-score model. Finally, the testing part is the same as above: I still use the "rf_test" function to evaluate the optimized model's performance on the four test datasets with statistical metrics.
V. Select I-score Threshold
The selected threshold of the I-score is determined by maximizing the area
under the receiver operating characteristic (ROC) curve (AUC) in the
validation dataset. AUC represents the degree to which the model separates two
classes of data points. It informs how much the model is capable of
distinguishing between classes. The higher the AUC, the better the model is at
predicting the responses. This curve plots the relationship between the true
positive rate (TPR) and false positive rate (FPR). TPR represents the probability
that the positive instances are correctly classified. FPR is defined as the
probability of negative instances being incorrectly classified [13].

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
From the BDA, I sort the feature subsets by their I-scores from high to low. The I-score of each influential feature in a subset is defined as the I-score of this subset. I select each I-score, one by one, as a threshold for verification. Finally, the optimal threshold of the I-score is determined by maximizing the AUC value [14].
Table 1 shows an example. There are a total of 100 features and BDA identified
10 subsets with the corresponding I-scores. In the first round, the I-score of the
first subset is 100, and I regard this value as a threshold. I found that there is only one influential feature, X_1, in the first subset, which means that the I-score of influential feature X_1 is also equal to 100. Then I set X_1 to 0 for all the samples and implemented RF with the remaining 99 features. I then calculate the AUC with the Python function "sklearn.metrics.auc". Finally, I record the AUC corresponding to the first threshold as 0.833. For the second round, the I-score of the second subset is 90, and I regard this value as a threshold. I found that there are two influential features, X_2 and X_3, in the second subset. Then I set X_2 and X_3 to 0 for all the samples and implement RF with the remaining 97 features. I then calculate the AUC corresponding to the second threshold, which equals 0.849. I repeat this procedure until all subsets have been evaluated. Finally, I record 10 different AUCs corresponding to the 10 different subsets. The highest AUC among the subsets signifies the optimal threshold, because, as mentioned above, the higher the AUC, the more accurate the model prediction is. So the subset corresponding to the highest AUC is what I am looking for. I only need to eliminate these influential features (biased features) with I-scores above the threshold in the validation set to get the I-score model; a code sketch of this loop follows Table 1.
Table 1: Artificial example to explain the selected threshold

Subset | Influential Features in Subset | I-score (Threshold) | AUC
1 | {X_1} | 100 | 0.833
2 | {X_2, X_3} | 90 | 0.849
3 | {X_4, X_5, X_6} | 80 | 0.799
… | … | … | …
10 | {X_34, X_55, X_56, X_72, X_99} | 5 | 0.891
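The threshold-selection loop illustrated by Table 1 can be sketched as follows: for each BDA subset, the features in that subset are set to 0, a random forest is retrained, and the subset whose removal yields the highest validation AUC is kept as the biased feature subset. The (feature_list, i_score) input format and the hyperparameters are assumptions of this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def select_threshold(bda_results, X_train, y_train, X_val, y_val):
    """Pick the biased feature subset whose removal maximizes the validation AUC.

    bda_results: list of (feature_list, i_score) pairs returned by the BDA,
    sorted by I-score from high to low.
    """
    best_auc, biased_features = -np.inf, []
    for features, threshold in bda_results:        # each I-score acts as a threshold
        X_tr, X_va = X_train.copy(), X_val.copy()
        X_tr[features] = 0                          # zero out the candidate biased features
        X_va[features] = 0
        rf = RandomForestClassifier(n_estimators=500, random_state=42)
        rf.fit(X_tr, y_train)
        auc = roc_auc_score(y_val, rf.predict_proba(X_va)[:, 1])
        if auc > best_auc:
            best_auc, biased_features = auc, features
    return biased_features, best_auc
```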
VI. Evaluation of Fair Model
In the evaluation part, I introduce some metrics to compare the original model
and the optimized model to test whether the optimized model really achieves
more accurate predictions [15].
Sensitivity or TPR measures the proportion of positive subjects who are
predicted positive. A highly sensitive result means that most of the positive
subjects are predicted positively.
Specificity in machine learning refers to how well a test correctly identifies
negatives out of all the actual negatives. High specificity indicates that the
model is very good at correctly identifying negative cases.
Accuracy is the fraction of correctly predicted individuals among all the
individuals.
F1-Score is the harmonic mean of precision and recall. Precision is the fraction
of true positives among the predicted positives, while recall is the same as
sensitivity. A higher F1-Score indicates better performance.
Sensitivity (detecting actual positive outcomes): TP / (TP + FN)
Specificity (detecting true negative cases): TN / (TN + FP)
Accuracy (fraction of correct predictions): (TP + TN) / (TP + TN + FP + FN)
F1-Score (balance of precision and recall): TP / (TP + (FP + FN)/2)
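These definitions translate directly into a few lines of code. A minimal sketch, assuming the counts come from a 2x2 confusion matrix such as the one returned by the rf_test sketch earlier; the example counts in the comment are made up.

```python
def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and F1-score from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                     # true positive rate (recall)
    specificity = tn / (tn + fp)                     # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = tp / (tp + 0.5 * (fp + fn))                 # equivalent to the harmonic-mean form
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "f1": f1}

# Example with made-up counts (not from the thesis):
# classification_metrics(tp=27, tn=21, fp=7, fn=1)
```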
Chapter D: Results
I. Prediction Performance of 4 Models on the Hannigan Testing Set
Table 2: Prediction Performance of 4 Models on the Hannigan Testing Set

Model | Sensitivity | Specificity | F1-score | Accuracy | AUC
Model A | 0.983 | 0.710 | 0.675 | 0.509 | 0.500
Model B | 0.250 | 0.926 | 0.378 | 0.518 | 0.588
Model AB | 0.786 | 0.740 | 0.587 | 0.436 | 0.430
Fair I-score Model | 0.964 | 0.741 | 0.675 | 0.527 | 0.519
Table 2 shows the performance indicators of the four models on the Hannigan
testing dataset. First, look at the first two rows: Model A and Model B. The metrics of these two models are not directly comparable to those of the optimized model because their training sets are different. I show Model A and Model B's metrics here because they can be compared with the results in Gao et al. [3]. Gao et al. [3] demonstrate realistic applications of merging and integration methods on multiple CRC metagenomic datasets using random forest classifiers in six individual leave-one-dataset-out (LODO) experiments. One of the experiments uses the Feng dataset as the training set and Hannigan as the testing set, which exactly corresponds to my Model A. The average AUC in Gao et al. [3] is 0.55. Another experiment used the Yu dataset as the training set and Hannigan as the testing set, which corresponds to my Model B. The average AUC obtained was 0.59. The two AUCs I obtained using the random forest model are 0.500 and 0.588, respectively. These two sets of values are close, which shows that my basic model based on the random forest algorithm is feasible.
Next, look at the third and fourth rows of this table. The two models are comparable.
One is the model before optimization (Model AB), and the other is the optimized model
(fair I-score model). The sensitivity of Model AB is 0.786, while that of the fair I-score model is 0.964. Next is the specificity: Model AB is 0.740, and the fair I-score model is 0.741. The third observation is the comparison of F1-scores, 0.587 and 0.675, respectively. In addition, the accuracy mentioned in the previous paragraph is also an important criterion: 0.436 compared to 0.527. The last metric, AUC, is 0.430 and 0.519, respectively. From the introduction of these indicators above, we know that the closer these indexes are to 1, the more accurately the model predicts. From the table, Model AB's metrics are consistently lower than those of the fair I-score model; in particular, the sensitivity and AUC of the fair I-score model are much higher than those of Model AB. This demonstrates that the fair I-score model improves the prediction performance of Model AB on the combined Feng and Yu dataset.
II. Prediction Performance of 4 Models on the Thomas Testing Set
Table 3: Prediction Performance of 4 Models on the Thomas Testing Set

Model | Sensitivity | Specificity | F1-score | Accuracy | AUC
Model A | 0.462 | 0.639 | 0.490 | 0.557 | 0.550
Model B | 0.077 | 0.869 | 0.125 | 0.504 | 0.473
Model AB | 0.173 | 0.820 | 0.250 | 0.522 | 0.496
Fair I-score Model | 0.365 | 0.830 | 0.458 | 0.602 | 0.584
Table 3 shows the prediction performance of the four models on the Thomas testing set. Using the same comparison method as above, look first at the first two rows. In the reference paper [3], the AUCs obtained when training on Feng's and Yu's datasets are 0.58 and 0.47, respectively. The AUCs of Model A and Model B that I computed are 0.550 and 0.473, respectively. These are very close to the values in the comparison article [3], which again shows that my original model is accurate. Next, look at the third and fourth rows. Every entry for the fair I-score model is greater than the corresponding entry for Model AB. The fair I-score model is clearly ahead of Model AB, especially in sensitivity (0.173 vs 0.365) and F1-score (0.250 vs 0.458). This again reflects that the fair I-score model improves the prediction performance of Model AB.
III. Prediction Performance of 4 Models on the Vogtmann Testing Set
Table 4: Prediction Performance of 4 Models on the Vogtmann Testing Set

Model | Sensitivity | Specificity | F1-score | Accuracy | AUC
Model A | 0.250 | 0.769 | 0.338 | 0.510 | 0.510
Model B | 0.462 | 0.731 | 0.533 | 0.596 | 0.596
Model AB | 0.519 | 0.692 | 0.568 | 0.606 | 0.605
Fair I-score Model | 0.635 | 0.808 | 0.695 | 0.721 | 0.721
The four models in Table 4 are compared on the Vogtmann testing set. The AUC
values in the first and second rows are 0.510 and 0.596, respectively, while the AUCs obtained using random forest classifiers in the leave-one-dataset-out (LODO) experiments [3] are 0.513 and 0.60, respectively. Although there are slight differences of about 0.003-0.004, the results are still very close, so my model remains accurate on the Vogtmann dataset. The following two rows are the key comparison: the fair I-score model has the best sensitivity (0.635), specificity (0.808), F1-score (0.695), accuracy (0.721), and AUC (0.721). Compared with Model AB, each entry shows an improvement of approximately 13%. Overall, the prediction performance of the fair I-score model is the best compared to the baseline models.
IV. Prediction Performance of 4 Models on the Zeller Testing Set
Table 5: Prediction Performance of 4 Models on the Zeller Testing Set

Model | Sensitivity | Specificity | F1-score | Accuracy | AUC
Model A | 0.409 | 0.802 | 0.510 | 0.603 | 0.605
Model B | 0.559 | 0.703 | 0.605 | 0.630 | 0.631
Model AB | 0.570 | 0.725 | 0.620 | 0.647 | 0.648
Fair I-score Model | 0.699 | 0.802 | 0.739 | 0.750 | 0.752
Table 5 uses the last dataset, Zeller, as the testing set. Using the same method as above, the AUC obtained with Feng as the training set in the reference article [3] is 0.61, and the AUC obtained with Yu as the training set is 0.76. The AUCs I obtained using the same datasets were 0.630 and 0.631, respectively. Comparing the two sets of indices, I find that there is some gap for the model with Yu as the training set. This is because, when selecting the data, I only used the intersection of columns shared by Yu and Zeller to build the model [9]. The reference article [3] may have used all of their data, which resulted in some deviation in AUC. This deviation is
within a reasonable range. Looking at the last two rows, the fair I-score model is compared with Model AB. The sensitivity increased by 0.129, the specificity by 0.077, the F1-score by 0.119, the accuracy by 0.103, and the AUC by 0.104. From these numbers, the optimized model improved almost every entry by about 10 percentage points. This shows that the I-score algorithm is effective.
Chapter E: Conclusion
As machine learning becomes more and more sophisticated, the fairness of its
predictions also comes into focus. Some datasets for machine learning and validation
are biased, which results in biased features in the model. My goal is to use the I-score method to eliminate these biases. I used colorectal cancer datasets to show that the method is feasible. I fed the data into the BDA, which embeds the I-score formula, and repeated the calculation many times to obtain the I-score of each feature subset. I then combined the BDA results with the random forest model to find the influential features. Finally, the selected threshold obtained by maximizing the AUC is used to filter the biased features among all influential features. After eliminating these biased features, the desired fair model is obtained. I then used the evaluation metrics to show that the model I obtained indeed performed better.
Chapter F: Limitations
This method still has some limitations. For example, it can only handle sensitive attributes with two categories (0/1 variables); currently, there is no specific solution for attributes with multiple categories. I am working on this problem and hope to eventually bring the fair I-score method into the model training and validation stages across machine learning applications.
References:
[1] S. Mitchell, E. Potash, S. Barocas, A. D’Amour, and K. Lum, “Algorithmic fairness:
Choices, assumptions, and definitions,” Annual Review of Statistics and Its
Application, vol. 8, pp. 141–163, 2021.
[2] J. Wu, C. Shih, H. Lu, and S. Lo. “Test-Fairness Deep Learning with Influence
Score.” Artificial Intelligence in Medicine. Academia Sinica. 2023.
[3] Y. Gao, F. Sun. “Batch normalization followed by merging is powerful for
phenotype prediction integrating multiple heterogeneous studies”. Plos
Computational Biology. Public Library of Science. Oct. 2023.
[4] Mármol, Inés, et al. “Colorectal Carcinoma: A General Overview and Future
Perspectives in Colorectal Cancer.” MDPI, Multidisciplinary Digital Publishing
Institute, 19 Jan. 2017.
[5] J. Smith, A. Doe. “Colorectal Cancer Metagenomic Datasets”. European
Nucleotide Archive (ENA).
[6] H. Chernoff, S.-H. Lo, and T. Zheng, “Discovering influential variables: a method
of partitions,” The Annals of applied statistics, vol. 3, no. 4, pp. 1335–1369, 2009.
[7] A. Lo, H. Chernoff, T. Zheng, and S.-H. Lo, “Framework for making better
predictions by directly estimating variables’ predictivity,” Proceedings of the
National Academy of Sciences, vol. 113, no. 50, pp. 14277–14282, 2016.
[8] R. Chudoba, et al. “Using Python for Scientific Computing: Efficient and Flexible
Evaluation of the Statistical Characteristics of Functions with Multivariate Random
Inputs.” Computer Physics Communications, North-Holland, 19 Sept. 2012.
[9] H. Wang, S.-H. Lo, T. Zheng, and I. Hu, “Interaction-based feature selection and
classification for high-dimensional biological data,” Bioinformatics, vol. 28, no.
21, pp. 2834–2842, 2012
[10]R. Amanda, E. Negara. “Analysis and Implementation Machine Learning for
YouTube Data Classification by Comparing the Performance of Classification
Algorithms”. Jurnal Online Informatika, 5(1), 61–72. 2020.
[11]A. Kadiyala, A. Kumar. “Applications of Python to evaluate the performance of
bagging methods”. Environmental Progress & Sustainable Energy (Print), 37(5),
1555–1559. 2018.
[12]L. Liang. “Confusion Matrix: Machine Learning”. POGIL Activity Clearinghouse,
3(4). 2022.
[13]J. Huang and C. Ling, "Using AUC and accuracy in evaluating learning
algorithms," in IEEE Transactions on Knowledge and Data Engineering, vol. 17,
no. 3, pp. 299-310, March 2005.
[14]Lobo, J. M., Jiménez‐Valverde, A., & Real, R. AUC: a misleading measure of the
performance of predictive distribution models. Global Ecology and Biogeography,
17(2), 145–151. 2007.
[15]Karlijn J. van Stralen, Vianda S. Stel, Johannes B. Reitsma, Friedo W. Dekker,
Carmine Zoccali, Kitty J. Jager, “Diagnostic methods I: sensitivity, specificity, and
other measures of accuracy”, Kidney International, Volume 75, Issue 12, Pages
1257-1263, 2009.