Close
The page header's logo
About
FAQ
Home
Login
USC Login
Register
0
Selected 
Invert selection
Deselect all
Deselect all
 Click here to refresh results
 Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Identifying and quantifying transcriptional module heterogeneity and genetic co-regulation, with applications in asthma
(USC Thesis Other) 

Identifying and quantifying transcriptional module heterogeneity and genetic co-regulation, with applications in asthma

doctype icon
play button
PDF
 Download
 Share
 Open document
 Flip pages
 More
 Download a page range
 Download transcript
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content Identifying and quantifying transcriptional module heterogeneity and genetic
co-regulation, with applications in asthma
by
Katelyn Queen
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOSTATISTICS)
May 2024
Copyright 2024 Katelyn Queen



I dedicate this dissertation to my husband Michael,
for his never ending support over the past four years.
ii



Acknowledgements
I would like to thank my committee chair, Joshua Millstein, for being a wonderful mentor
through the past four years. I couldn’t have done it without you. I would also like to thank my
committee members Kimberly Siegmund, Nicholas Mancuso, and Mark Selleck for their support
throughout my doctoral studies and for their guidance on completion of projects.
I would like to thank my collaborators at Boston Children’s Hospital, Benjamin Raby and Sung
Chun, at USC, Frank Gilliland and My-Nhi Ngyuen, and at Stanford, Malcolm Barrett for all of
their assistance and thoughtful suggestions for which my dissertation is greater because of.
To Delaney, thank you for supporting me every step of the way. From the first day on the
couches at Lassonde to the day of my defense, there’s no one else I’d rather prove math is false
with.
Finally, I would like to thank my sister, Makenna Queen, for always being a source of inspiration to me and for shoving me along when I thought I couldn’t go any further. I wouldn’t be here
without you.
iii



Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1 Genetic Regulation & Asthma . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 ABRIDGE and CAMP Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Overview of Projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Chapter 2: Software Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1 Super Partition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 modACDC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 OSCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 GCTA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
iv



Chapter 3: ACDC: a general approach for detecting phenotype or exposure associated coexpression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.1 ACDC Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.1 Asthma BRIDGE . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.2 CAMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3 ABRIDGE and CAMP Data Analysis . . . . . . . . . . . . . . . . . . . . 35
3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.1 ACDC in ABRIDGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 ACDC in CAMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 Differential Expression in ABRIDGE . . . . . . . . . . . . . . . . . . . . 41
3.4 Differential Expression in CAMP . . . . . . . . . . . . . . . . . . . . . . 42
3.5 Methods Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Chapter 4: coVarQTLs and co-eQTLs: genetic co-regulation of large-scale gene expression
regimes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.1 Covariance and Co-expression Calculation . . . . . . . . . . . . . . . . . 51
2.2 Covariance and Co-expression Summary Measure . . . . . . . . . . . . . 52
2.3 QTL Analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.4 CAMP and ABRIDGE Data . . . . . . . . . . . . . . . . . . . . . . . . . 54
3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.1 eQTL results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2 Initial CAMP NHW QTL results . . . . . . . . . . . . . . . . . . . . . . . 60
3.3 Non-normality in covariance distributions . . . . . . . . . . . . . . . . . . 61
3.4 CAMP NHW QTL results for capped outliers . . . . . . . . . . . . . . . . 64
3.5 Meta-analysis results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.1 Biological implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Chapter 5: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
1 Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
A Supplemental Figures and Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
B Supplemental Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
B.1 Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
v



List of Tables
2.1 Patient demographics for ABRIDGE and CAMP whole blood gene expression cohort (n = 865) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1 Patient demographics for ABRIDGE and CAMP cohorts . . . . . . . . . . . . . . 36
3.2 CCA Results for gene-gene co-expressions and ACT score components for ABRIDGE
and CAMP cohorts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 Ordinal logistic regression estimates for genes in top five modules from CCA on
ACT scores for ABRIDGE and CAMP cohorts . . . . . . . . . . . . . . . . . . . 42
3.4 Results of coXpress analysis in ABRIDGE whole blood gene expression dataset . . 45
4.1 Patient demographics for discovery and replication cohorts . . . . . . . . . . . . . 56
4.2 CAMP NHW QTL analysis results for various expression sets, including outliers . 60
4.3 CAMP NHW QTL analysis results for various quantile normalized expression sets,
including outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 CAMP NHW QTL analysis results for various expression sets with outliers removed 63
4.5 CAMP NHW QTL analysis results for various expression sets with capped outliers 64
4.6 co-eQTL meta-analysis results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.7 coVarQTL meta-analysis results . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
A.1 Gene module information for QTL analyses . . . . . . . . . . . . . . . . . . . . . 97
vi



List of Figures
2.1 Comparing computation time and number of features in reduced data for Partition
and Super Partition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Time in hours and percent reduction in features after super-partitioning datasets at
varying reduction levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Time in minutes to partition ABRIDGE whole blood gene expression with varying
maximum cluster sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 OSCA percent variance explained in a random phenotype by gene expression sets
in whole blood samples from asthmatic patients in ABRIDGE and CAMP . . . . . 22
2.5 OSCA percent variance explained in a simulated, 10% associated phenotype by
gene expression sets in whole blood samples from asthmatic patients in ABRIDGE
and CAMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 OSCA percent variance explained in ACT score by gene expression sets in whole
blood samples from asthmatic patients in ABRIDGE and CAMP . . . . . . . . . . 24
2.7 GCTA average heritability of PC1 of gene module expression in whole blood samples from asthmatic patients in ABRIDGE and CAMP . . . . . . . . . . . . . . . 26
2.8 GCTA average heritability of PC1 of gene module covariance in whole blood samples from asthmatic patients in ABRIDGE and CAMP . . . . . . . . . . . . . . . 27
3.1 ACT Score Distributions in CAMP and ABRIDGE . . . . . . . . . . . . . . . . . 37
3.2 Top Gene-gene co-expression Measures and 6-month ACT Score Components in
ABRIDGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Top Gene-gene co-expression Measures and 7-day ACT Score Components in CAMP 41
3.4 Unadjusted Gene Expression and 6-month ACT Score in ABRIDGE . . . . . . . . 43
3.5 Unadjusted Gene Expression and 7-day ACT Score in CAMP . . . . . . . . . . . . 44
4.1 Module information for inflammatory gene set at various information loss criterion 57
4.2 Module information for the full transcriptome at various information loss criterion . 58
vii



4.3 Violin plots of gene covariance by genotype for most significant CAMP NHW
coVarQTLs, including outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4 Violin plots of gene module expression and covariance by genotype for most significant CAMP NHW coVarQTLs, including outliers . . . . . . . . . . . . . . . . 62
4.5 Violin plots of mean gene module expression by genotype for most significant
CAMP NHW co-eQTLs, with outliers capped . . . . . . . . . . . . . . . . . . . . 65
4.6 Violin plots of gene module covariance by genotype for significant CAMP NHW
coVarQTLs, with outliers capped . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.7 Violin plots of gene module expression by genotype for significant METAL coeQTL in each cohort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.8 Violin plots of gene module covariance by genotype for significant METAL coVarQTL in each cohort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
A.1 Percent Variance Explained in 6-month ACT Score by ABRIDGE Whole Blood
Gene Expression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
A.2 MV FDR Discovery Plot for ABRIDGE Gene co-expression CCA . . . . . . . . . 95
viii



Abstract
In systems biology, a key aim is to identify regulatory mechanisms in normal biological processes, disease, and disease progression. As a result, the number of statistical methods focused on
gene networks, which form the basis of regulatory mechanisms, have greatly increased in the past
decade. However, existing methods are limited both in their genomic and transcriptomic search
spaces as well as what type of outcomes can be studied. Datasets used in these analyses are often
incredibly high dimensional, which leads to computational resource overload and breakdown of existing methodology not built for this type of data. Additionally, traditional statistical models are not
designed to include multi-type phenotypes or exposures, or correlated genes, and addition of these
adds a large magnitude of complexity to the modelling strategy. Because of these methodological
limitations, questions still remain about the feasibility of large-scale network studies. Across the
three parts of my dissertation, I (1) fortify existing software for dimension reduction in high dimensional data, (2) identify and quantify transcriptional covariance in gene sets that cause variance in
a phenotype or external variable of interest, and (3) test for associations between those covariance
features of gene sets and genetic variation across the full transcriptome and millions of locations in
the genome. Projects (2) and (3) involve novel methodology that is widely applicable across both
disease and data type, providing flexible extensions of existing methods.
This methodology and the coinciding applications to cohorts of asthmatic patients provide
evidence for a new type of biological relationship that has long been posited but not proven and
opens the door for new treatment designs. Of even greater significance is the ability to apply
these novel methods to a wide range of diseases and biological questions, from normal biological
functions such as blood sugar regulation to severe disease states such as cancer metastasis.
ix



Chapter 1
Introduction
Systems biology recognizes that ”every object that biology studies is a system of systems”
(Francois Jacob, 1974) and focuses on complex interactions with a goal to tell the complete story of
biological functions. In contrast to the older reductionist school of thought, the systemic approach
studies processes holistically, recognizing that the total is greater than the sum of the parts. One
such idea is that genes seldom act alone, instead acting together in sequential steps to inform a
biological process that cannot be fully understood by looking at genes individually. These gene
pathways or gene regulatory networks (GRNs) form one of the bases for regulatory mechanisms
of disease.
The most traditional statistical analysis for gene expression data is differential expression analysis, which tests for differences in transcriptional abundance across a trait of interest [1] and dates
back to the 1960s [2]. Conventionally, analyses are done for binary traits, such as case versus control, but methodology has expanded to allow for most data types as well as particularly small and
large sample sizes [3]. Since differential expression methods test for single-gene differences independent of all other genes, differences due to changes in gene-gene relationships are not directly
captured and may be missed [4]. Methodology for gene set analyses was developed as a solution,
in which sets of genes with known associations are tested simultaneously [5]. However, these
methods rely on well-developed gene sets, thus limiting the discovery space to already-described
sets. Methods also exist to infer GRNs from real data by looking at gene co-expression, or the patterns of expression of multiple genes in relation to one-another [6]. One such highly cited method
1



is weighted correlation network analysis (WGCNA), which aims to identify groups of highly correlated genes [7]. By combining methods for identifying co-expression groups and methods for
analyzing differences in expression for such groups, changes in group regulation across conditions
can be identified and inferences about the impact of those changes on disease status can be made
[4]. Currently, differential co-expression methodology is limited in that only binary or categorical
outcomes can be studied.
The link between genetic variation and observed phenotypes is explored through genome-wide
association studies (GWAS). GWAS tests for associations between single base-pair allele frequencies, referred to as single nucleotide polymorphisms (SNPs), and a phenotype, independently by
SNP. Since the first successful GWAS on myocardial infarction was published in 2002 [8], > 6,700
studies have been published, describing > 500,000 associations for > 3,300 traits across millions
of people [9, 10]. Sometimes, these associations provide direct insight into disease biology, however it is more common that GWAS results are used as a foundation for further analysis. Most
commonly, traits are influenced by thousands of causal variants, with the risk of any single variant
being negligent, and individual variants are associated with multiple traits. These facts, combined
with linkage disequilibrium (LD), or dependencies between SNPs in a population, make direct
causal inference more complicated, resulting in difficulty deciphering mechanisms of disease using only GWAS results.
One way to begin disentangling these mechanisms is to study genetic regulation of gene expression. A SNP such that the genotype explains some fraction of the variation in the distribution
of gene expression is said to be an expression quantitative locus (eQTL) for that particular gene.
With only 100 study participants, thousands of eQTLs may be detected, and as of 2020, the GTEx
Consortium found eQTLs for 4,278,636 unique SNPs and 23,268 unique genes across 52 tissue
types [11, 12]. Studies show that most detectable eQTLs occur for SNPs within a two megabase
window of the transcription start site of the gene, denoted as a cis-eQTL [13]. eQTLs with SNPs
outside the two megabase window are referred to as trans-eQTLs, and it is still unclear whether
the disparity in number of detected cis-eQTLs vs trans-eQTLs is due to underlying biology or
2



due to the much larger multiple testing burden incurred for trans studies [14]. Additionally, it is
important to note that since gene expression is a snapshot of the time and place from which the
sample was collected, eQTLs can be tissue and disease-state specific, leading to potential issues
with reproducibility of results [15].
However, as previously mentioned, genes most often operate in pathways which drive various
biological processes, and by looking at genes independent of one another, we lose most if not all
of the information contained in gene-gene relationships. Thus, gene pathways or groups must be
investigated together in order to study underlying regulatory mechanisms of disease. For example,
many papers have been published using canonical correlation analysis to study the relationship
between gene expression and SNPs or copy number variation data [16–18]. These methods do not
look specifically at co-expression and therefore still may not detect some of the true regulatory
relationships. Another class of current methodology uses eQTLs as the starting point to construct
co-expression or covariance networks [19–22], with the aim to better understand the underlying
biological function of eQTLs [23]. In this case, co-expressed pairs or modules of genes will only
be considered if all genes have eQTLs, limiting the discovery space. Another method to do this is
Estimating the Genetic Regulatory Effect on Transcription Factors (EGRET), a Bayesian approach
for identifying individual gene regulatory networks (GRNs) [24]. EGRET combines reference
transcription factor (TF) motifs, eQTLs, SNPs, and gene expression profiles to construct patientspecific GRNs and is a downstream analysis of eQTL. It looks at co-expression relationships between TFs and genes, with the aim of building GRNs for disease-associated genetic variants that
impact gene regulation by disrupting TF binding. Finally, there are various methods for detecting
co-expression QTLs (co-eQTLs), which first define modules of co-expressed genes and then test
for associations between the first principal component (PC) of the module-specific expression matrix and SNPs using the same linear models as eQTL analyses [25–29]. Notably, these co-eQTL
methods primarily use single cell RNA sequencing data for discovery and bulk sequencing or microarray for replication, except for the case of Pergola et. al. from 2017. In this case, the authors
3



used bulk sequencing for discovery and replication to find co-eQTLs in patients with schizophrenia. By using the PC of the expression matrix, these methods capture the variance of joint gene
expression profiles and therefore study genetic regulation of gene modules. However, these methods may not capture genetic regulation of gene co-expression, which could preclude inference of
these relationships as regulatory mechanisms. Note that the above references are, to the best of our
knowledge, a comprehensive list of the published co-eQTL studies as of May 2024.
1 Genetic Regulation & Asthma
As of September 2023, over 27 million people in the United States live with asthma [30], an
incurable disease of the airway, characterized by inflammation and excess mucus production. In
children, asthma is more prevalent in males, and Hispanic children are twice as likely as nonHispanic white children to have asthma [31]. In adults, asthma is more prevalent in women, and
the highest prevalence rates are found in Black and Indigenous populations [30]. Traditionally,
chronic asthma management includes inhaled corticosteroids which aim to reduce long-term airway inflammation, and short-term treatments given to alleviate or prevent asthma attacks are more
often bronchodilators, such as Albuterol, which open inflamed airways to provide immediate relief.
Severity of disease is measured by symptomology, as the goal for chronic asthma management
is optimum symptom control. As such, the Asthma Control Test (ACT) was created in the early
2000s to provide a standardized and patient-centric way to measure severity [32]. Additionally,
lung function tests such as spirometry and peak flow may be used as a quantitative way to measure
disease severity and progression.
The genetic underpinnings of asthma are most commonly ascribed to genes involved in inflammation response, such as genes known for ’molecular/signal transducer activity’ and ‘immune
system process’ in the Gene Ontology database [33], of which many are in the interleukin (IL)
receptors family [34]. Many of these single genes found to be differentially expressed in asthmatic
cases versus controls or when comparing severity of asthma symptomology are also known to be
4



more highly expressed in immune cells such as eosinophils or alveolar macrophages. However,
single gene differential expression analyses do not always capture genes in close proximity to loci
known to be associated with asthma, indicating that again, single gene analyses are not sufficient
to tell the full biological story [34, 35]. As a result, analyses testing for differential co-expression
of gene modules and networks gained popularity, finding ’hub genes’, or genes with strong network connections, as potential players in pathogenesis of asthma [36]. These methods also identify
modules of genes with evidence of differential co-expression in severe asthma or specific asthma
sub-types, illuminating molecular differences that could be targets for future therapeutic agents
[37–40].
As of February 2024, there are almost 200 published asthma GWAS identifying > 3,200 associations between genetic variation and asthma incidence rates [9]. Of these, there are a multitude
of studies that have focused on specialized asthma phenotypes such as age of symptom onset,
exacerbation scores, asthma sub-types, and more. Additionally, there are many published eQTL
studies that focus on asthma, either by studying eQTLs in various asthma-related tissues (lung,
bronchial epithelial biopsy, bronchial alveolar lavage, etc.) or by directly performing analyses on
samples from asthmatic patients. The largest study done on asthmatic patients, completed by Kim
et. al. and published in 2023, discovered > 300,000 cis-eQTL from 2,875 genes in whole blood
samples of Korean asthma patients [41]. Of these, only ≈ 98,000 overlap with known eQTLs
in whole blood samples from healthy individuals. As described prior, much the same as single
gene differential expression analyses, eQTL analyses may lack discriminatory power to describe
the full regulatory mechanisms driving asthma incidence and severity. As such, methods to build
networks of eQTLs or discovery of co-eQTLs are a logical next step. However, none of the few
co-eQTL studies that have been published so far focus on asthma, and to our knowledge, there is
only a singular eQTL co-expression network study for asthma. Hao et. al. constructed a network
of Suppressor Of Cytokine Signaling 3 (SOCS3) pathway genes, indicating this pathway could be
a key molecular driver in asthma [42].
5



In this dissertation I aim to fill the aforementioned gaps by defining new methodology for (1)
module-based differential co-expression for phenotypes or exposures of any data type and (2) discovery of covariance quantitative trait loci (coVarQTLs) across the full genome and transcriptome
to study how gene-gene covariance is genetically regulated. A more detailed outline of the proposed methods is seen below in Chapter 1, section 3. I then apply these methods to various cohorts
of asthmatic children and young adults with the goal of illuminating gene expression and genetic
factors driving asthma symptomology in the relevant populations. A detailed description of the
study cohorts can be found in Chapter 1, section 2.
2 ABRIDGE and CAMP Datasets
The Asthma Biorepository for Integrative Genomic Exploration (ABRIDGE) aimed to bring
together data from over 2,700 participants in ongoing (at the time) asthma studies [43]. Patients
were recruited from six cohorts of the EVE Consortium, a group of 11 academic sites who did
genome-wide association studies of asthma [44], and extensive phenotype and genomic data are
publicly available. Twelve studies have been published using data from ABRIDGE. Data from the
Mexico City Children’s Asthma Study (MCCAS) at Hospital Infantil de Mexico Federico Gomez
(HMEX) and Children’s Health Study (CHS) at the University of Southern California (USC) are
used for the applications here.
The Childhood Asthma Management Program (CAMP) was a randomized, placebo-controlled
clinical trial started in the early 1990s for children with mild to moderate asthma. 1,041 children
were enrolled between 1993 and 1995 at eight clinical centers, and extensive baseline data was
collected and is publicly available (GEO accession number GSE22324) [45]. After an average of
three months of baseline assessment, patients were randomized to budesonide dry-powder inhaler
(200 µg twice daily), nedocromil metered-dose inhaler (8 mg twice daily), or matching placebos
for 4–6 years. Patients were then followed for 13 years to assess continued efficacy of treatment
and asthma progression through puberty and into early adulthood. The main conclusion of the trial
6



was that inhaled corticosteroid use is safe for child at a low-medium dosage, although it comes
with a minor reduction in height. More than 140 papers have been published using this data. For
the applications reported here, we use only baseline data including genotype, whole blood gene
expression, and clinical questionnaires.
Gene expression data is available for the following tissue types in ABRIDGE: HeLa, whole
blood, unstimulated CD4, stimulated CD4, bronchoalveolar cells, bronchial epithelium brush, and
alveolar macrophages. In CAMP, gene expression data is available for the HeLa and whole blood
cells. For both, expression profiles were measured via the Illumina Human HT-12 v4 array. All
gene expression profile data were normalized via a log2-transformation and quantile-normalization
within tissue. Expression profiles showed no major differences by sequencing date or clinical site,
indicating an absence of technical batch effects.
Genotype matrices were profiled on various chips by site for ABRIDGE, and for CAMP were
profiled on either the Illumina HumanHap 550K v3 array or the Illumina Human610-Quad v1.0
DNA Analysis BeadChip. Imputation was computed via the Michigan Imputation Server [46],
and genotypes with an R
2
imputation score < 0.80 or minor allele frequency < 0.10 were removed. Additionally, extensive questionnaire data about symptomology, environmental factors,
family history, and more are available for all patients in both studies.
All genetic data was built under the GRCh37 and hg19 human reference genomes.
3 Overview of Projects
Project One: Develop software necessary for projects two and three.
First, we provide a scalable implementation of the Partition dimension-reduction method to
expand the use case to datasets with large numbers of features. We then implement the GCTA
GREML approach, a mixed-effects heritability model, and the OSCA OREML approach, which
estimates the percent variance explained in an external feature by an -omics profile using a similar
7



mixed-effects model, in R. In doing so, we provide a description of the relationship between information loss and variance explained in various external features and are able to create a heritability
or percent variance explained curve over the full range of information loss criterions (ILCs). We
use this to determine optimal levels of dimension reduction based on an external variable. The
OSCA and GCTA functions are included in an R package with the functions for performing the
differential co-expression analysis outlined in project two.
Project Two: Define a module-based differential co-expression analysis for multivariate and
continuous conditions.
The functions of genes and their associated proteins are related to other genes and proteins
through a series of interactions. The modules of genes found by the Partition method should consist of genes that are functionally related, and therefore we expect the genes inside the modules to
be co-expressed. We developed an approach to summarize this co-expression within each module,
defining co-expression as the arithmetic mean of all gene pair covariances. Canonical correlation analysis (CCA) is then applied to quantify correlations between module co-expression and
patient-level features of any data type. A Wilks-Lambda test is used to determine significance of
those correlations, and we further explore significant associations using Kruskal-Wallis tests for
module-phenotype pairs of interest. We also provide an analytical approach for high-dimensional
modules and a comparison to existing methodology.
Project Three: Develop an association test for module-based co-expression and genetic variation.
Using the framework established in Project Two to define module co-expression, we use traditional eQTL linear models, replacing gene expression with gene module covariance measures, to
test for genetic co-regulation of gene module covariance. As this type of study has never been
done before, we explore various options for determining gene modules, summarizing module
8



covariance, and for managing deviations from model assumptions. We also compare this new
methodology to existing co-eQTL models.
9



Chapter 2
Software Development
1 Super Partition
The algorithm defined below, called super partition, is available in the ’partition’ R package
through the CRAN repository [47]. An application note with some of the writing for this section
is in the process of being published under the article name ”Super Partition: fast, flexible, and
interpretable large-scale data reduction in R”.
1.1 Background
Dimension reduction methods reduce complexity in data with the goal to eliminate noise while
preserving signal. Partition is an agglomerative approach that divides data into sets of related
features, where each set can be reduced to a single feature that captures a user-specified minimum
amount of information. It’s a surjective approach in that each original feature maps to one and only
one new feature, facilitating interpretability. The metric for information capture can be selected
or designed by the user, e.g. mutual information for non-normal distributions or the intraclass
correlation coefficient (ICC) for more normally distributed data. It’s also flexible in how features
are reduced within a set, e.g., mean, first principal component, etc. Under some conditions likely
to be common in real data, reduction with Partition can increase power to detect true associations
10



as compared to principal component analysis (PCA), non-negative matrix factorization (NNMF),
or no reduction [48].
However, the original implementation is limited by memory and run-time, due to the requirement that a dissimilarity matrix be computed that includes all feature pairs. Thus, the maximum
size dataset is restricted to tens of thousands of features. In contrast, the Genie hierarchical clustering algorithm incorporates single linkage [49] and is not subject to these constraints, allowing it to
scale up to over one million features in reasonable run-time [50]. Although Genie was not designed
as a data reduction method, it can be used to form a super-partition in a high-dimensional setting,
allowing Partition to be applied to the components, yielding a computationally tractable reduction
approach. We developed an approximation to Partition using Genie that scales the algorithm to big
data, with run-time benchmarking for datasets up to almost one-half million features.
1.2 Methods
In an initial step, Genie is used to form a super-partition of the data composed of ⌈
N
c
⌉ components or clusters, where N is the number of features in the full dataset, and c is the user-defined
maximum cluster size (default value = 4,000). For initial super-partition clusters with size greater
than c, Genie is applied in an iterative process to constrain cluster sizes to below c. Genie allows
users to supply a threshold between zero and one, which ensures that an economic inequity measure of the clusters, the Gini index, does not supersede the threshold. We choose a threshold of
0.05 to highly penalize the formation of small clusters. However, clusters of size Nk > c, small
clusters may still form if features are highly correlated. Thus, if a large cluster is broken into
smaller parts, and the smallest is of size 50 or less, we reject the Genie partition at this step and
instead use k-means [51] with ⌈
Nk
c
⌉ centroids to reduce the large cluster into more evenly sized
parts. Finally, the Partition algorithm is applied to each super-partition cluster. A consequence of
the super-partition step is that the minimum number of features in the final dataset is ⌈
N
c
⌉ or the
initial number of super-partitions.
11



The computational efficiency of Super Partition was assessed and compared to Partition in
several datasets, including transcriptome-wide microarray data in whole blood from the ABRIDGE
and CAMP asthma cohorts (n = 865, number of features (f) = 19,428) [44, 45], transcriptome-wide
RNA-seq data in colon adenocarcinoma tumor tissue from The Cancer Genome Atlas (TCGA)
(n = 481, f = 60,660) [52], and Illumina 450K DNA methylation (DNAm) array data from the
ABRIDGE cohort. The DNAm data were analyzed separately by chromosomes 1, 2, and 3 (n =
705, f = 98,216) and across the full methylome. Benchmarking was completed using the University
of Southern California Center for Advanced Research Computing (CARC) cluster on a single
Xeon-2640v3 2.60 GHz node with a single CPU. Analyses were conducted using R version 4.2.3.
1.3 Results
In a direct comparison to the base Partition algorithm across 19,428 features in the ABRIDGE
whole blood gene expression dataset, Super Partition dramatically outperforms Partition in computation time (Figure 2.1(a)) for all thresholds. The time differences hits a maximum at an ILC of
0.35 and a minimum at the largest ILC values (ILC ∈ [0.90,1]), where minimal dimension reduction occurs. The number of features in the reduced data for each algorithm are similar for the entire
range of correlation thresholds (Figure 2.1(b)), although Super Partition almost always reduces the
dataset slightly more. This trend is particularly visible for small ILCs (ILC ∈ [0,0.25]).
Figure 2.2(a) shows the time in hours to reduce the three datasets across a range ([0, 1]) of
information loss criteria (ILC)s (the minimum ICC for any accepted cluster). As expected, increasing numbers of features result in longer analysis times. Mid-range ILCs may be computationally
expensive because there is a large number of clusters and large sizes of clusters being formed, requiring more algorithm steps. Figure 2.2(b) shows the percent reduction in the number of features
after using Super Partition across the range of ILCs.
Figure 2.3 shows the time in minutes to Super Partition the ABRIDGE whole blood gene
expression data at ILCs of 0.10 and 0.60 for varying maximum cluster sizes (function default
value = 4,000). Reasonably, as the maximum cluster size increases and thus the complexity of
12



Figure 2.1: Comparing computation time and number of features in reduced data for Partition
and Super Partition
0
40
80
120
0.00 0.25 0.50 0.75 1.00
Minimum Intraclass Correlation Coefficient
Time (hours)
(a) Computation Time
0
5000
10000
15000
20000
0.00 0.25 0.50 0.75 1.00
Minimum Intraclass Correlation Coefficient
Number of Features in Reduced Data
Method
Partition
Super Partition
(b) Number of Features in Reduced Data
Figure 2.1 Plot showing (a) time in hours and (b) number of clusters identified after Partitioning (solid
line) and Super Partitioning (dashed line) data across varying thresholds for minimum intraclass correlation
coefficients in the ABRIDGE whole blood gene expression data (n = 865, number of features = 19,428).
The default Super Partition maximum cluster size of 4,000 was used.
each super-partition, so does the computation time for both ILCs. The number of features in the
reduced data for these cluster sizes range from 949 to 652 or 95.12% reduction in the number of
features to 96.67% for an ILC of 0.10, and 15,342 to 15,396 or 21.03% reduction in features to
20.75% for an ILC of 0.60.
Additionally, the entire ABRIDGE 450k DNA methylation data (n = 705, number of features
= 449,580) was partitioned with an ILC of 0.50, resulting in 62.75% reduction in features. This
analysis took 4374.40 minutes or just under 73 hours to complete and required 45 GB of memory
with a single CPU. We note that allocation of additional CPUs does not decrease computation time
due to the single threaded nature of the algorithm.
1.4 Discussion
Super Partition approximates the Partition method. Information loss constraint and distance
metric flexibility, feature mapping, and all other strengths and options of Partition are preserved in
13



Figure 2.2: Time in hours and percent reduction in features after super-partitioning datasets at
varying reduction levels
0.0
2.5
5.0
7.5
10.0
12.5
0.00 0.25 0.50 0.75 1.00
Minimum Intraclass Correlation Coefficient
Time (hours)
(a) Time
0
25
50
75
100
0.00 0.25 0.50 0.75 1.00
Minimum Intraclass Correlation Coefficient
Percent Reduction in Features
Data
ABRIDGE Gene Expression (f = 19,428)
ABRIDGE DNAm (f = 98,216)
TCGA scRNA−seq (f = 60,660)
(b) Percent Reduction in Features
Figure 2.2 Plot showing (a) time in hours to complete Super Partition algorithm and (b) percent reduction
in features after Super Partitioning data across varying thresholds for minimum intraclass correlation
coefficients. ABRIDGE whole blood gene expression data is shown in green, TCGA scRNA-seq data in
pink, and ABRIDGE whole blood 450k DNA methylation in blue.
14



Figure 2.3: Time in minutes to partition ABRIDGE whole blood gene expression with varying
maximum cluster sizes
21.03%
20.75%
20.76%
95.12%
96.45%
95.58%
% − Percent Reduction
in Features
0
30
60
90
0 1000 2000 3000 4000
Maximum Cluster Size
Time (minutes)
Minimum ICC
0.10
0.60
Figure 2.3 Plot showing time in hours to complete Super Partition algorithm with an information loss
criterion of 0.60 (red) and an information loss criterion of 0.10 (blue) for varying maximum cluster sizes in
the ABRIDGE whole blood gene expression dataset. Percent reduction in features for the reduced data are
displayed for selected maximum cluster sizes (250, 2,500, 4,500).
15



Super Partition, and a real data application shows that Super Partition greatly reduces computation
time when directly compared to the original Partition algorithm.
A limitation of the approach is that some sets of features may not be fully reduced. Feature
reduction is only considered within each super-partition cluster, meaning that even if the information loss for combining super-partitions is below the user-specified information loss threshold,
these clusters will not be combined. However, the proposed solution approximates the original
Partition method. That is, Genie is also an agglomerative algorithm, and it is solely used for an
initial super-partition step, thus differences are likely to be modest.
This process preserves the Partition capability of constraining information loss to a user-specified
threshold on a local level. That is, information loss is constrained with respect to a limited number
of related features within a cluster rather than the entire dataset. In contrast, if the user were to
apply PCA and select the top principal components that explain some proportion of the variation
in the data, information loss would be constrained on a global level.
We showed that maximum cluster size is inversely related to computation time, and as stated,
there is a lower bound on the number of features which is dependent on maximum cluster size.
Across the full range of maximum cluster sizes tested, there is a five-fold and eight-fold increase
in computation times for ILCs of 0.60 and 0.10, respectively. However, the largest change in
percent reduction in features occurs between a maximum cluster size of 250 and 2500, and in
this region, there is a 350% increase in computation time and less than a half-percent increase in
percent reduction in features for an ILC of 0.60. Comparatively, for the analysis using an ILC of
0.10, there is also a 350% increase in computation time and a 27.29% increase in feature reduction.
Therefore, we note the trade-off between dimension reduction and computation time, particularly
for small ILCs. Users should consider analysis priorities when choosing these parameters.
In summary, the Partition package is a fast, flexible way to reduce datasets while preserving
information and interpretability of results, now feasible for datasets with hundreds of thousands of
features.
16



2 modACDC
’modACDC’ is an R software package available from the CRAN repository [53] and is the
accompanying software for the association of covariance for detecting differential co-expression
(ACDC) methodology, outlined in Chapter 3. Below, we detail the modelling strategy of OSCA
OREML and GCTA GREML, two methodologies for aiding in the choice of dimension reductions
thresholds.
2.1 Background
As previously mentioned, Partition requires the user to specify an ILC, with no meaningful default value. While this ability is helpful should the user know an optimal reduction level based
either on the downstream analyses planned or based on their specific data, this poses a challenge when no such information is available. Here, we propose the use of both the Yang Lab’s
omics-data-based complex trait analysis (OSCA) omics restricted maximum likelihood (OREML)
method and the Yang Lab’s genome-wide complex trait analysis (GCTA) genomics restricted
maximum likelihood (GREML) method to select the optimal data reduction level for Partition.
Narrow-sense heritability, denoted by ρ
2
, is the proportion of phenotypic variation explained
by genotype and is based on the idea that phenotypic variance can be divided into two components,
genetic, o, and environmental, e [54]:
ρ
2 = σ
2
o /(σ
2
o +σ
2
e
) (2.1)
To estimate heritability, we need to jointly estimate the effect sizes of all SNPs across the genome,
but this cannot be done without satisfying further assumptions [55]. One solution is to assume
that phenotypic variation is explained by only a few causal SNPs, and therefore, we only include
SNPs in the global estimator that have a significant association with the phenotype in a univariate
linear regression. However, the omnigenic model of complex traits says that phenotypic variation
is explained by weak associations with many SNPs and excluding non-significant SNPs throws
17



away important signals [56]. Therefore, a global estimator of all SNPs must be used. Assuming
that phenotypic variance as it relates to gene expression has the same explanatory components as
it does in GWAS, it is reasonable to say that the same omnigenic model applies, and thus, methods
of heritability estimation will also apply.
2.2 OSCA
2.2.1 Methods
As given in the OREML method [57], assume we have a numeric vector of phenotype values
y[n×1]
, with n the sample size. Additionally, assume a numeric matrix of standardized gene expression values G[n×g] with g the number of genes. A standard multi-level, linear mixed-effects model
for y based on G is
y = Gu+Cβ +ε (2.2)
with Cn×c the matrix of covariates and ε ∼ N(0,Iσ
2
e
) the error term. Here, β represents fixed
effects and u ∼ N(0,Iσ
2
u
) represents random effects. The variance-covariance matrix for y is
Var(y) = V = GG′σ
2
u +Iσ
2
e
= Aσ
2
o +Iσ
2
e
(2.3)
with A = GG′/g defined as the omics-data-based relationship matrix (ORM), analogous to the
genetic relationship matrix (GRM) in traditional heritability methods, and σ
2
o = mσ
2
u
the phenotypic variance captured by all genes. Given this is a standard multi-level model, σ
2
o
and σ
2
e
can be
estimated by restricted maximum likelihood (REML) methods [58], and thus, ρˆ
2
from Equation
2.1 can be calculated. These estimates are all computed by the OREML function in the OSCA
software [57], for which we created an R wrapper function.
We calculate ρˆ
2
for each Partitioned data set across the full range of possible ILCs (ILC ∈
[0,1]) by a user-supplied increment (default value of 0.05) for both the observed phenotype and a
18



permutation of the phenotype. If the data set includes more than 4,000 genes, the Super Partition
method (described in Chapter 2, Section 1) is used. Standard errors for all point estimates are also
calculated. In addition, percent reduction in the number of features as well as information lost,
calculated as
Information Lost = 1−
∑
M
i=1
ICCmi
|mi
|
∑
M
i=1
|mi
|
(2.4)
where M is the total number of modules from Partition for a given ILC, mi
is module i, and |mi
|
gives the size of module i, are returned for each ILC. Options for including adjustment covariates,
choosing an alternate REML algorithm, and altering the maximum number of iterations for the
REML algorithm are also included. Finally, a function to create the graphs shown in the results
section (Figures 2.4, 2.5, and 2.6) is also available in the software package.
Below, we present results for runs of the analysis in the ABRIDGE and CAMP asthmatic,
whole blood gene expression cohort (n = 865, see Table 2.1 for demographics and Section Chapter
1, section 2 for more information on the data set) for two gene sets: genes that are annotated for
inflammatory response in Gene Ontology (g = 623) [33], and all genes (g = 19,428). Both gene
sets are used to estimate percent variance explained for three phenotypes:
1. a randomly generated phenotype ∼ N(0,1),
2. a 10% associated, simulated phenotype, and
3. ACT score.
The ACT score analysis was adjusted for patient age, sex assigned at birth, and self-identified
ethnicity.
The 10% associated, simulated phenotype, y0.10 is generated by the sum of a standardized
linear combination of weights and gene expression plus noise:
y0.10 =
Gα
σGα
+ν, where Gα = ∑
i
αG (2.5)
19



Table 2.1: Patient demographics for ABRIDGE and CAMP whole blood gene expression cohort
(n = 865)
Table 2.1 CAMP = Childhood Asthma Management Program; CHS = Children’s Health Study; MCCAS =
Mexico City Children’s Asthma Study
with α[1×g] ∼ U[0,1] and ν[n×1] ∼ N(0,3). Note that the gene expression data used for generating
the weights is the full, unreduced data.
2.2.2 Results
Figure 2.4 indicates that, as expected, the OSCA OREML percent variance explained estimates
for the randomly generated phenotype are never significantly different than zero for any level of
data reduction in either the inflammatory gene set or the full transcriptome.
When we simulated a phenotype such that 10% of the variation can be explained by the gene
expression data, we see that OSCA OREML captures that 10% variance explained in the inflammatory gene set, as seen in Figure 2.5(a). The percent variance explained estimate is stable at
0.08 or 8% across almost the entire range of reduction. In contrast, in Figure 2.5(b), we see that
while the percent variance explained estimate starts at 0.10 or 10%, it increases until the data is
20



reduced by 50%, hitting a maximum of 0.213 before returning to 0.10. Additionally, the percent
variance explained in a permutation of the phenotype, while less than in the simulated phenotype,
is non-zero for some of the largely reduced datasets. We also see that the OSCA OREML percent
variance explained estimates for the inflammatory gene set are consistently below the estimates for
the full transcriptome, which is consistent with the percent variance explained calculation being an
increasing function in relation to number of features.
In an application to real data, we estimate percent variance explained in ACT score by gene
expression (Figure 2.6) and see the same trends as in the simulation of an associated phenotype.
Estimates are consistently lower in the inflammatory gene set (a) compared to the full transcriptome
(b), and variance explained increases as the dataset is reduced in the full transcriptome, hitting a
maximum of 27.7% at 45% information loss, corresponding to an ILC of 0.30. Additionally,
all variance explained estimates for the permuted ACT score are not significantly different from
zero. We also note that compared to an unadjusted analysis, estimates are drastically lower after
adjusting for patient age, sex assigned at birth, and self-identified ethnicity.
2.3 GCTA
2.3.1 Methods
Assume the same set up as in OREML (Chapter 2, Section 2.2.1), with multi-level, linear fixed
effects model
y = Wu+Cβ +ε (2.6)
where in this case, W is the matrix of standardized genotypes, with a given entry wi j = (xi j −
2pi)/(
p
2pi(1− pi)). Here, xi j is the number of copies of the reference allele for SNP i in person
j, and pi
is the minor allele frequency in the study population. Note that this is an additive model
of autosomal SNPs. ρˆ
2
(Equation 2.1) can then be estimated via REML methods, as previously
mentioned. These estimates are all computed by GCTA GREML functions [59], for which we
created an R wrapper.
21



Figure 2.4: OSCA percent variance explained in a random phenotype by gene expression sets in
whole blood samples from asthmatic patients in ABRIDGE and CAMP
0123 7 13 17 24 31 36 46 54 65 73 83 91 100
0.00
0.25
0.50
0.75
1.00
0 25 50 75 100
Percent Reduction
Information Lost
Percent Variance Explained
Randomly Generated Variable
Observed
Permuted
(a) Inflammatory Genes
01361015 21 27 32 39 45 51 59 67 77 89 96 100100
0.00
0.25
0.50
0.75
1.00
0 25 50 75 100
Percent Reduction
Information Lost
Percent Variance Explained
Randomly Generated Variable
Observed
Permuted
(b) Full Transcriptome
Figure 2.4 Plot showing percent variance explained in a randomly generated phenotype by (a)
inflammatory genes and (b) the full transcriptome in ABRIDGE and CAMP whole blood gene expression
data at varying levels of dataset reduction, calculated for observed phenotype in blue and permuted
phenotype in red. Estimates for observed phenotype measures include standard error bars. An information
loss value of zero represents the unreduced dataset, and an information loss level of 100 represents the data
being reduced to the average expression of all genes. Estimates were calculated with the average
information REML algorithm.
22



Figure 2.5: OSCA percent variance explained in a simulated, 10% associated phenotype by gene
expression sets in whole blood samples from asthmatic patients in ABRIDGE and CAMP
0123 7 13 17 24 31 36 46 54 65 73 83 91 100
0.00
0.25
0.50
0.75
1.00
0 25 50 75 100
Percent Reduction
Information Lost
Percent Variance Explained
10% Associated, Simulated Phenotype
Observed
Permuted
(a) Inflammatory Genes
01361015 21 27 32 39 45 51 59 67 77 89 96 100100
0.00
0.25
0.50
0.75
1.00
0 25 50 75 100
Percent Reduction
Information Lost
Percent Variance Explained
10% Associated, Simulated Phenotype
Observed
Permuted
(b) Full Transcriptome
Figure 2.5 Plot showing percent variance explained in a simulated, 10% associated phenotype by (a)
inflammatory genes and (b) the full transcriptome in ABRIDGE and CAMP whole blood gene expression
data at varying levels of dataset reduction, calculated for observed phenotype in blue and permuted
phenotype in red. Estimates for observed phenotype measures include standard error bars. An information
loss value of zero represents the unreduced dataset, and an information loss level of 100 represents the data
being reduced to the average expression of all genes. Estimates were calculated with the average
information REML algorithm.
23



Figure 2.6: OSCA percent variance explained in ACT score by gene expression sets in whole
blood samples from asthmatic patients in ABRIDGE and CAMP
0123 7 13 16 25 31 38 48 53 62 72 83 92 100
0.00
0.25
0.50
0.75
1.00
0 25 50 75 100
Percent Reduction
Information Lost
Percent Variance Explained
ACT Score
Observed
Permuted
(a) Inflammatory Genes
01361015 21 27 32 39 45 51 59 67 77 89 96 100100
0.00
0.25
0.50
0.75
1.00
0 25 50 75 100
Percent Reduction
Information Lost
Percent Variance Explained
ACT Score
Observed
Permuted
(b) Full Transcriptome
Figure 2.6 Plot showing percent variance explained in 6-month ACT score by (a) inflammatory genes and
(b) the full transcriptome in ABRIDGE and CAMP whole blood gene expression data at varying levels of
dataset reduction, calculated for observed ACT scores in blue and permuted ACT scores in red. Estimates
for observed ACT scores include standard error bars. An information loss value of zero represents the
unreduced dataset, and an information loss level of 100 represents the data being reduced to the average
expression of all genes. Estimates were calculated with the average information REML algorithm and are
adjusted for patient age, sex assigned at birth, and self-identified ethnicity.
24



For each ILC across the full range of possible ILCs (ILC ∈ [0,1]) determined by a user-supplied
increment (default value of 0.05), the data is Partitioned and modules extracted. If the data set
includes more than 4,000 genes, the Super Partition method (described in Chapter 2, Section 1)
is used. For each module M, either the covariance matrix is calculated (definition and calculation
defined in Chapter 3, Section 2.1) or the module-specific expression matrix is extracted, dependent
on which summary type the user supplies (”covariance” or ”coexpression”, respectively). Then,
the first principle component (PC) of the matrix is used as the outcome measure, y, for calculating
the module-specific ρˆ, denoted as ρˆ
2
M. Finally, all ρˆ
2
M are averaged, providing either a covariance
or module-specific expression ρˆ
2
for each ILC value. The standard deviation for the average
estimates is also returned as well. In addition, ρˆ
2
is calculated for a permutation of the first PC
for each matrix if the ”permute” option is selected, and the standard deviation for the average
estimates is returned. Percent reduction in the number of features as well as information lost
(see Equation 2.4) are calculated and returned for each ILC. Options for including adjustment
covariates, choosing an alternate REML algorithm, and altering the maximum number of iterations
for the REML algorithm are also included. Finally, a function to create the graphs shown in the
results section (Figures 2.7 and 2.8) is also available in the software package.
Below, we present results for runs of the analysis in the meta whole blood gene expression and
genotype cohort (n = 496, see Table 4.1 for patient demographics and Chapter 1, Section 2 for
more information on the data sets) for covariance and co-expression of two gene sets: genes that
are annotated for inflammatory response in Gene Ontology (g = 627), and all genes (g = 20,492).
Models are adjusted for patient age, sex assigned at birth, and the first four PCs of the genotype
matrix.
2.3.2 Results
Figure 2.7 shows the average heritability for gene module expression for increasingly reduced
datasets. Heritability estimates decrease as the dataset is reduced, and this decline is more pronounced in the full transcriptome. Estimates for the permuted data are stable across the span of
25



Figure 2.7: GCTA average heritability of PC1 of gene module expression in whole blood
samples from asthmatic patients in ABRIDGE and CAMP
0113 6 11 15 22 30 35 42 51 61 65 73 84 93 100
0.00
0.25
0.50
0.75
1.00
0 25 50 75 100
Percent Reduction
Information Lost
Heritability
Gene Coexpression
Observed
Permuted
(a) Inflammatory Gene Set
024 8 13 18 24 30 36 42 49 56 63 72 81 91 98 100 100
0.00
0.25
0.50
0.75
1.00
0 25 50 75
Percent Reduction
Information Lost
Heritability
Gene Coexpression
Observed
Permuted
(b) Full Transcriptome
Figure 2.7 Plot showing average heritability of PC1 of gene module co-expression for (a) inflammatory
genes and (b) the full transcriptome in ABRIDGE and CAMP whole blood gene expression data at varying
levels of dataset reduction, adjusted for patient age, sex assigned at birth, and the first four PCs of the
genotype matrix. Values are calculated for observed module covariance in blue and permuted gene
co-expression in red. Estimates include standard deviation bars. An information loss value of zero
represents the unreduced dataset, and an information loss level of 100 represents the data being reduced to
the average expression of all genes. Estimates were calculated with the expectation-maximization REML
algorithm.
26



Figure 2.8: GCTA average heritability of PC1 of gene module covariance in whole blood
samples from asthmatic patients in ABRIDGE and CAMP
0113 6 11 15 22 30 35 42 51 61 65 73 84 93 100
0.00
0.25
0.50
0.75
1.00
0 25 50 75 100
Percent Reduction
Information Lost
Heritability
Gene Covariance
Observed
Permuted
(a) Inflammatory Gene Set
024 8 13 18 24 30 36 42 49 56 63 72 81 91 98 100 100
0.00
0.25
0.50
0.75
1.00
0 25 50 75
Percent Reduction
Information Lost
Heritability
Gene Covariance
Observed
Permuted
(b) Full Transcriptome
Figure 2.8 Plot showing average heritability of PC1 of gene module covariance for (a) inflammatory genes
and (b) the full transcriptome in ABRIDGE and CAMP whole blood gene expression data at varying levels
of dataset reduction, adjusted for patient age, sex assigned at birth, and the first four PCs of the genotype
matrix. Values are calculated for observed module covariance in blue and permuted gene co-expression in
red. Estimates include standard deviation bars. An information loss value of zero represents the unreduced
dataset, and an information loss level of 100 represents the data being reduced to the average expression of
all genes. Estimates were calculated with the expectation-maximization REML algorithm.
27



data reduction and are non-zero. Notably, there is only one point across both datasets where there
is not overlap between the permuted and observed heritability estimates, and that is in Figure 2.7(b)
at 10% information loss, or for an ILC of 0.85.
Figure 2.8 shows the average heritability for gene module covariance for the same set of increasingly reduced datasets as the previous figure. In contrast to module expression heritability
estimates, estimates here are much more stable as the data is reduced, besides a quick yet dramatic
decrease in the first 10% of reduction in the inflammatory gene set (a). Again, estimates for the
permuted data are stable across the span of data reduction and are also non-zero. Here, there are
no points for either dataset where the permuted and observed heritability estimates are statistically
significantly different.
2.4 Discussion
Together, the data indicates that using percent variance explained estimates as a supervised
method for choice of dimension reduction in gene expression data is potentially useful. However,
further study is needed to determine the usefulness of heritability estimates of gene co-expression
and covariance for data reduction. Like heritability estimators in GWAS, we acknowledge that
both heritability and percent variance explained estimates are highly dependent on the statistical
model used, particularly on the assumptions surrounding distributions of effect sizes. Both OSCA
OREML and GCTA GREML assume mean zero, normally distributed effect sizes, and therefore,
estimates are most directly comparable to other models that make the same assumptions. Additionally, we note that heritability is a scale-dependent measure, meaning that heritability estimates
change when the range of values of the outcome change, creating concerns about the true interpretability of estimates.
In simulations, OSCA OREML performs as expected. When looking at percent variance explained in a 10% associated, simulated phenotype, we see that the true, 10% association is captured
by OSCA OREML under the model assumptions, and when a random phenotype is simulated, the
percent variance explained estimates are zero.
28



There is a consistent pattern across the two non-null phenotypes tested here in which datasets
can be reduced by roughly 50% before any noticeable loss in percent variance explained. This
indicates that all the percent variance explained until greater than 50% reduction occurs between
modules rather than within them. This phenomenon was observed in simulations for Partition,
leading to increased power due to fewer features explaining more variation in the data [48]. Further
study is needed to see if this is a general pattern across gene expression data.
The results of using GCTA GREML to estimate heritability for gene co-expression and covariance are not as expected. Estimates are quite high for all values and are particularly high at
either end of the ILC range (ILC ∈ [0,0.10]∪[0.90,1]). Additionally, estimates for permuted coexpression and covariance measures are much higher than expected given the broken dependence
between independent and dependent variables. A potential cause of the problem is that gene covariance measures are non-normal. However, since gene co-expression measures are roughly normally
distributed, this cannot account for all concerns with the GCTA GREML analyses. Compared to
other heritability estimators, GCTA GREML estimates are consistently higher, and simulations
also showed instability in GCTA GREML estimates both when model assumptions are and are not
met [60], although the original authors of GCTA refute those claims [61].
A limitation of using GCTA GREML to generate estimate curves along the full possible range
of ILCs is the computation time. While a single call to the GCTA GREML algorithm takes only
seconds, each graphs makes thousands of calls, dependent on the number of features of the original
dataset. The most computationally expensive point on Figure 2.7(b), run on a dataset of just under
20,000 features, took roughly five days to calculate. However, the algorithm is parallelized to help
offset the time cost.
In conclusion, we present R wrappers for OSCA OREML and GCTA GREML methods as well
as functions to look at percent variance and heritability estimates, respectively, across a range of
increasingly-reduced datasets. These methods may be useful in determining optimum levels for
data reduction in transcriptomics and other types of -omics studies.
29



Chapter 3
ACDC: a general approach for detecting phenotype or exposure
associated co-expression
Note: This chapter is published in Frontiers in Medicine, under the same article name [62]. There
have been edits to reflect additional work completed and to create better flow between chapters.
1 Introduction
Differential expression analysis has long been used to test for differences in transcriptional
dependencies across conditions, and may explain phenotypic variation in a population. However, differential expression methods study each gene independent of any other and therefore may
not capture transcriptional differences due to changes in gene-gene relationships. Differential coexpression methods test for differences in gene co-expressions, and thus, such approaches may
illuminate regulatory mechanisms not identified by differential expression analysis alone [63].
Module-based differential co-expression methods incorporate information about gene connectivity, and assume that the genes within a module are correlated in the general population. These
approaches can have good statistical power due to a reduction in ’noise’ [4], or unrelated variation
of individual genes by collapsing related genes into a single feature. Generally, these modulebased methods can be distinguished from one another by, i) whether modules are defined by the
user or the method, ii) if differential co-expression is detected within or between modules, and iii)
30



how many conditions are assessed. Methods may also detect differential co-expression for gene
pairs across the phenotype of interest and then apply post-hoc clustering methods to identify coexpressed modules. One highly-cited method, CoXpress, determines differentially co-expressed
modules given microarray data [64]. By cutting the trees determined by average-linkage hierarchical clustering at a user-defined threshold, genes are split into modules. Then, pairwise correlation
coefficients are used to created a distribution of co-expression for each module under two conditions. If these distributions are statistically significantly different from random in one condition
and not the other, the module is considered differentially co-expressed.
While many methods exists for binary conditions and a few for greater than two, we are unaware of any module-based differential co-expression approaches designed to detect differences
across continuous conditions or multiple types of conditions simultaneously. Here we describe a
novel method, association of covariance for detecting differential co-expression (ACDC), to detect
differential co-expression across multiple binary, ordinal, or continuous phenotypes or exposures.
We report an application to gene expression measured in two independent cohorts of asthmatics
to determine whether genes in inflammatory pathways are co-expressed across levels of asthma
control.
2 Materials and Methods
2.1 ACDC Description
ACDC is designed to detect dependencies between gene-gene co-expression (or connectivity)
and a set of external features that can be either exposures or responses. That is, ACDC is applied to
test for evidence of association between measures of co-expression and measures of external features. Notably, the external features are not constrained to be categorical, the typical requirement
[4], but could be continuous or ordinal.
The concept of co-expression can be used to quantify the dependence between two random
variables and thus to quantify gene-gene co-expression. It is possible for the co-expression of
31



a pair of genes to depend on external features. For example, suppose in a biological pathway,
two genes tend to be co-regulated and thus co-expressed, resulting in positive co-expression. A
perturbation to the pathway could alter that relationship, resulting in a change in co-expression
and thus a change in co-expression. If candidate perturbagens and the expression of genes in the
pathway are measured, ACDC may be applied to detect these types of effects simultaneously for
the multiple genes and perturbagens. Using a similar rationale, ACDC could be applied to detect
downstream results of pathway perturbations if the affected phenotypes are measured.
Suppose all individuals have measurements for all M gene expression features in the set, referred to here as a ’module’, and all P external features. Assume the vector of P external features
are distributed as multivariate normal,
x = (x1, x2,..., xP)
T ∼ N (µx,Σx) (3.1)
with xp representing each external feature. Though we describe x as multivariate normal here, we
can relax this assumption in practice and allow other distributions and variable types, as in a design
matrix.
Suppose the M gene expression features are also distributed as multivariate normal with the
co-expression matrix depending on x,
g = (g1,g2,...,gM)
T ∼ N (µg,Σg|x) (3.2)
where gj denotes the expression of gene j. The co-expression matrix can be represented by,
Σg =












σ1,1 σ1,M
σ2,2 σ2,M
σM,1 σM,2 σM,M












(3.3)
32



The off-diagonal elements of Σg can be considered measures of co-expression, and (for given
values of x) estimated in the conventional way,
σˆ j,k =
1
N −1
Σs

gs, j −g¯j
gs,k −g¯k

. (3.4)
Note that this is essentially an average over individuals. Letting s denote an individual, each
contribution is,
σˆ j,k =

gs, j −g¯j
gs,k −g¯k

(3.5)
These individual components have approximately the same expectation as the the scaled sum,
therefore they can also be described as estimators for σj,k
. We leverage this property to test for
dependencies between the co-expressions and the external features.
We can denote the co-expression profile for a given module as,
C =

σ1,2,...,σj,k
,...σ(M−1),M

;|C| =

M
2

= G. (3.6)
We are interested in dependencies that may exist between the external features, x, and the genepair co-expressions, the off diagonals of Σg. If we have a single external feature or a single pair of
genes, conventional general linear modeling (GLM) approaches could be used to relate x toC. For
multiple gene pairs and external features, CCA can be applied, or sparse CCA for high dimensional
settings. CCA finds min[G,P] linear combinations, a ∈ R
P
,b ∈ R
G, of C and x, respectively, that
maximize the correlation,

a
′
1
,b
′
1

= argmax corr
a
T
1
x,b
T
1C

; ρ1 = corr
a
T
1
x,b
T
1C

, (



for example, for the first pair of canonical variables. Note that CCA can be applied even if G and/or
P is equal to one [65]. Wilks-Lambda can be used to conduct a joint hypothesis test of whether the
correlation coefficients found by CCA are significantly different from zero,
H0 : ρi = 0, for all 1 ≤ i ≤ min[G,P]
HA : ρi ̸= 0, for some 1 ≤ i ≤ min[G,P].
(3.8)
A rejected test implies dependent co-expression, i.e., that there are linear combinations of genegene co-expressions associated with linear combinations of external features.
False discovery rates (FDR) can be computed using the Benjamini-Hochberg (BH) [66] method
when multiple modules are tested and parametric assumptions apply. If severe departures from the
assumed distributions may be present, permutation-based approaches such as the Millstein and
Volfson (MV) FDR [67] method can be used.
The R code for ACDC is publically available at https://github.com/USCbiostats/ACDC,
and is also available from the CRAN repository, https://cran.r-project.org/web/packages/
modACDC/index.html. R version 4.2.0 was used for all analyses.
2.2 Datasets
2.2.1 Asthma BRIDGE
For an overall description of the ABRIDGE study, see Chapter 1.2.
The discovery dataset includes gene expression in whole blood from 245 patients with doctordiagnosed asthma from ABRIDGE (Table 3.1), profiled using the Illumina HumanHT-12 v4 Expression array. Six-month asthma control test (ACT) scores were calculated from questionnaire
responses about wheezing with and without exercise, patient waking due to wheezing, and the
need for Albuterol in the last six months (range: [4,20]), where higher scores indicate suboptimal
control (Figure 3.1).
34



The gene expression profile data were normalized via a log2-transformation and quantilenormalization. Duplicate probes were condensed using the largest median absolute deviation,
leaving only probes with unique targets. The analysis includes 623 probes with targets annotated
for inflammatory response in Gene Ontology.
2.2.2 CAMP
For an overall description of the CAMP study, see Chapter 1.2.
Results from the initial analysis were followed up in an independent dataset that included whole
blood gene expression from 604 asthmatics, primarily young adults who were enrolled in CAMP
as children (Table 3.1), profiled using the HumanRef8 v2 BeadChip array. Seven-day ACT scores
were calculated using baseline questionnaire responses about rescue and preventative bronchodilator use, activity limits, and frequency of waking due to wheezing in the past seven days (range:
[0,28]), where higher scores indicate suboptimal control (Figure 3.1). The same data processing
normalization steps were taken as in Asthma BRIDGE.
2.3 ABRIDGE and CAMP Data Analysis
To identify modules of correlated genes, we applied the Partition data reduction method [48,
68], an agglomerative approach that requires the user to specify an acceptable proportion of information loss when collapsing all features to a single measure such as the mean. Selection of
the information loss threshold was guided by the aim to maximize information explained in the
ACT score while minimizing noise. Further explanation is provided in the supplemental material
[57]. We used an information loss constraint of 0.35 which corresponds to a minimum of 65%
information from the non-reduced data captured by each new feature, as assessed by the intraclass correlation coefficient (ICC). This reduction threshold resulted in roughly 50% reduction in
features when compared to the full dataset (Figure A.1).
In analyses of individual genes within modules of interest, gene-ACT score relationships were
modeled using ordinal logistic regression, adjusting for patient age, race, sex, data collection site,
35



Table 3.1: Patient demographics for ABRIDGE and CAMP cohorts
Table 3.1 CAMP = Childhood Asthma Management Program
36



Figure 3.1: ACT Score Distributions in CAMP and ABRIDGE
Figure 3.1 (a) The distribution of 6-month ACT scores in ABRIDGE Whole Blood gene expression, with
scores being calculated with information about wheezing with and without exercise, patient waking due to
wheezing, and the need for rescue medications in the last six months. (b) The distribution of 7-day ACT
scores in CAMP Whole Blood gene expression, with scores being calculated with information about the
need for rescue and preventative medications, activity limits, and patient waking due to wheezing in the
past seven days.
37



and the first three principal components (PCs), which may capture global dependencies due to
cell-type composition and technical artifacts, of each gene expression data set. Associations were
identified at the 0.05 FDR level.
To clarify the novel attributes of ACDC, a comparative analysis was conducted in the ABRIDGE
cohort using CoXpress. The ACT score was dichotomized at the median value to indicate better
vs worse asthma control to conform to the coXpress requirement of a binary phenotype. The Pearson correlation coefficient was used as the similarity measure, and for module identification the
dendrogram was cut at a height of 0.35 for consistency with the Partition approach.
3 Results
3.1 ACDC in ABRIDGE
ACDC was performed on 65 modules identified by Partition in the ABRIDGE dataset. The
results for the top five modules based on BH FDR can be found in Table 3.2. Evidence suggestive
of differential co-expression as determined by CCA Wilks-Lambda p-value ≤ 0.05 was found for
two modules including genes NOD-like Receptor Family Pyrin Domain Containing 12 (NLRP12),
Meteorin Like, Glial Cell Differentiation Regulator (METRNL), and Ghrelin And Obestatin Prepropeptide (GHRL) in module A (BH FDR = 0.0737), and Adenosine A3 Receptor (ADORA3),
Arachidonate 15-Lipoxygenase (ALOX15), and Indoleamine 2,3-Dioxygenase 1 (IDO1) in module B (BH FDR = 0.1569). We also computed the non-parametric, permutation based FDR estimate Millstein-Volfson (MV) to account for departures from the normality assumption by the ACT
variable, which is ordinal. However, the results of the MV FDR test are in approximate agreement
with the BH FDR results, yielding two modules with evidence of differential co-expression (Figure
A.2), FDR = 0.0554, 95% CI: (0.0054, 0.5742)).
To further explore the relationship between co-expression of genes in modules A and B and
asthma control, Kruskal-Wallis tests were performed to determine whether co-expression measures
for all possible pairs of these genes differ across levels of the ACT score components. Eight of the
38



Table 3.2: CCA Results for gene-gene co-expressions and ACT score components for ABRIDGE
and CAMP cohorts
Table 3.2 The top five modules in ABRIDGE out of the 65 are shown for brevity, and the analysis was
repeated in the CAMP cohort for these five modules.
total 24 tests resulted in p-values less than 0.05, with the top six coming from module B. The most
significant test involved the co-expression of IDO1 and ADORA3 and the frequency of waking
from wheezing in the past six months (p = 0.0021) (Figure 3.2).
3.2 ACDC in CAMP
We performed ACDC using data from CAMP in an attempt to replicate results observed for the
top five modules identified in ABRIDGE. We found evidence of differential co-expression for module B (p = 0.0315) but not module A (p = 0.6823) 3.2. Also, evidence of differential co-expression
was observed for gene pairs in modules C and D, which were not significant in ABRIDGE, Interleukin 5 Receptor Subunit Alpha (IL5RA) and Peripheral Myelin Protein 22 (PMP22) in module
C, and Interleukin 17 Receptor B (IL17RB) and Interleukin 6 (IL6) in module D. Note that these
results have not been adjusted for multiple testing.
Kruskal-Wallis tests were also performed for the same gene-pair co-expressions tested in ABRIDGE.
Of the 24 tests performed, there were three with p-values less than 0.05, all from module B. The
most significant test compared the co-expression of IDO1 and ADORA3 across levels of rescue
bronchodilator use in the past 7 days (p = 0.02) (Figure 3.3). Additionally, we performed KruskalWallis tests for all gene-pair co-expressions and 7-day ACT components for the three modules with
39



Figure 3.2: Top Gene-gene co-expression Measures and 6-month ACT Score Components in
ABRIDGE
Kruskal−Wallis, p = 0.0021
0.015
0.013
0.0087
0.0054
0.043
0
2
4
6
Never At least once Monthly Weekly Almost every night
6 Month Wake from Wheezing
IDO1, ADORA3 Covariance
(a) IDO1, ADORA3 Covariance vs 6 Month Wake from Wheezing
Kruskal−Wallis, p = 0.0044 0.0049
0.017
0.036
−1
0
1
2
3
Never At least once Monthly Weekly Almost every night
6 Month Albuterol Use
ALOX15, ADORA3 Covariance
(b) ALOX15, ADORA3 Covariance vs 6 Month Albuterol Use
Kruskal−Wallis, p = 0.0044
0.013
0.0079
0.034
0.018
0.032
−1
0
1
2
3
4
Never At least once Monthly Weekly Almost every night
6 Month Wheezing with Exercise
ALOX15, ADORA3 Covariance
(c) ALOX15, ADORA3 Covariance vs 6 Month Wheezing with Exercise
Kruskal−Wallis, p = 0.0071
0.0042
0.044
0.00031
−1
0
1
2
3
Never At least once Monthly Weekly Almost every night
6 Month Waking from Wheezing
ALOX15, ADORA3 Covariance
(d) ALOX15, ADORA3 Covariance vs 6 Month Waking from Wheezing
Figure 3.2 Violin plots for the most statistically significant gene-gene co-expression measures (Equation
3.5) and 6-month ACT score components relationships for the ABRIDGE cohort, where each dot
represents values for one patient. Kruskal-Wallis was used to test for global differences, and Wilcoxon rank
sum was used to test for pairwise differences. (a) IDO1 and ADORA3 co-expression in 6-month frequency
of waking from wheezing; (b) ALOX15 and ADORA3 co-expression in 6-month Albuterol use; (c) ALOX15
and ADORA3 co-expression in 6-month frequency of wheezing with exercising; (d) ALOX15 and ADORA3
co-expression in 6-month frequency of waking from wheezing
40



Figure 3.3: Top Gene-gene co-expression Measures and 7-day ACT Score Components in CAMP
Kruskal−Wallis, p = 0.02
0.041
0.023
0.029
0.012
0.034
−1
0
1
2
3
4
0 1 2 3 4 5 6 7
7 Day Rescue Bronchodilator Use
IDO1, ADORA3 Covariance
(a) IDO1, ADORA3 Covariance vs 7 Day Rescue Bronchodilator Use
Kruskal−Wallis, p = 0.024
0.0069
0.043
0.0036
0.044
0.034
0
1
2
3
4
0 1 2 3 4 5 6 7
7 Day Rescue Bronchodilator Use
ALOX15, ADORA3 Covariance
(b) ALOX15, ADORA3 Covariance vs 7 Day Rescue Bronchodilator Use
Kruskal−Wallis, p = 0.028
0.0084
0.005
0.011
−1
0
1
2
3
0 1 2 3 6 7
7 Day Activity Limits
ALOX15, IDO1 Covariance
(c) ALOX15, IDO1 Covariance vs 7 Day Activity Limits
Kruskal−Wallis, p = 0.091
0.019
0.014
−1
0
1
2
3
0 1 2 3 4 5 6 7
7 Day Rescue Bronchodilator Use
ALOX15, IDO1 Covariance
(d) ALOX15, IDO1 Covariance vs 7 Day Rescue Bronchodilator Use
Figure 3.3 Violin plots for the most statistically significant gene-gene co-expression measures (Equation
3.5) and 7-day ACT score components relationships for the CAMP cohort, where each dot represents
values for one patient. Kruskal-Wallis was used to test for global differences, and Wilcoxon rank sum was
used to test for pairwise differences. (a) IDO1 and ADORA3 co-expression in 7-day frequency of rescue
bronchodilator use; (b) ALOX15 and ADORA3 co-expression in 7-day frequency of rescue bronchodilator
use; (c) ALOX15 and IDO1 co-expression in 7-day activity limit
CCA Wilks-Lambda p-values below 0.05. Of the 20 tests performed, the same three pairs from
module B showed evidence of differential co-expression, but no others had p-values less than 0.05.
3.3 Differential Expression in ABRIDGE
Following the differential co-expression analysis, we performed ordinal logistic regression for
each of the 13 genes in the top five modules and found increased risk of suboptimal acute asthma
41



Table 3.3: Ordinal logistic regression estimates for genes in top five modules from CCA on ACT
scores for ABRIDGE and CAMP cohorts
Table 3.3 Models were adjusted for patient age, sex assigned at birth, self-identified ethnicity, data
collection site, and the top three gene expression PCs.
control for all genes in modules B and C, after adjusting for covariates (Table 3.3). Higher expression of ADORA3, ALOX15, and IDO1 was associated with suboptimal 6-month ACT scores
(Figure 3.4).
3.4 Differential Expression in CAMP
Adjusted ordinal logistic regressions were performed for the same 13 genes as the ABRIDGE
cohort (Section 3.3.3). In the CAMP cohort, the regressions also showed highly statistically significant associations for all genes in modules B and C, and non-significant associations for modules A
and D (Table 3.3). Unlike the results from ABRIDGE, a significant protective effect was seen for
NOD-like Receptor Family CARD Domain Containing 3 (NLRC3) (OR: 0.3926, 95% CI: (0.1864,
0.8268)). Associations between these genes and 7-day ACT scores (Figure 3.5) also imply that
increasing gene expression is associated with suboptimal acute asthma control.
42



Figure 3.4: Unadjusted Gene Expression and 6-month ACT Score in ABRIDGE
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
6 7 8 9 10
ADORA3 Expression
6 Month ACT Score
(a) ADORA3 Expression and 6 Month ACT Score
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
7 8 9 10 11
ALOX15 Expression
6 Month ACT Score
(b) ALOX15 Expression and 6 Month ACT Score
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
6 8 10
IDO1 Expression
6 Month ACT Score
(c) IDO1 Expression and 6 Month ACT Score
Figure 3.4 Violin plots for comparing unadjusted (a) ADORA3, (b) ALOX15, and (c) IDO1 expression
across 6-month ACT score levels in the ABRIDGE cohort.
43



Figure 3.5: Unadjusted Gene Expression and 7-day ACT Score in CAMP
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
16
19
25
7 8 9
ADORA3 Expression
7 Day ACT Score
(a) ADORA3 Expression and Day ACT Score
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
16
19
25
7 8 9
ALOX15 Expression
7 Day ACT Score
(b) ALOX15 Expression and Day ACT Score
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
16
19
25
7 8 9 10 11
IDO1 Expression
7 Day ACT Score
(c) IDO1 Expression and Day ACT Score
Figure 3.5 Violin plots for comparing unadjusted (a) ADORA3, (b) ALOX15, and (c) IDO1 expression
across 7-day ACT score levels in the CAMP cohort.
44



Table 3.4: Results of coXpress analysis in ABRIDGE whole blood gene expression dataset
Table 3.4 t1 and t2 are the observed t-statistics in the better and worse control patient subgroups
respectively, prg1
and prg2
are the probability of randomness statistics for the same groups, corrg1
and corrg2
are the mean pairwise correlation coefficients for the genes in the same groups and mean difference is the
mean, pairwise difference between the correlation matrices for the groups.
3.5 Methods Comparison
The five most differentially co-expressed modules identified by the CoXpress analysis can be
seen in Table 3.4. As a rule of thumb for identifying differentially co-expressed modules, the
coXpress authors suggest prg1 ≤ 0.05 and prg2 ≥ 0.05, which implies correlations different than
zero in one of the classes but not the other. None of the ABRIDGE modules met this threshold, and
values of prg1
,prg2 ≤ 0.05 for all of the five top modules indicate that the intra-module correlations
are non-zero for patients with both better and worse asthma control. We note that genes ADORA3
and ALOX15 appear in module 1, the most differentially co-expressed module.
4 Discussion
Here we’ve described a novel approach to differential co-expression analysis that accommodates categorical, ordinal, or continuous exposures or outcomes. We suggest that co-expression
features can be included in a linear modeling framework either as predictors or outcomes. To handle multivariate external features, we introduce ACDC, for either exploratory analyses or formal
hypothesis testing. This strategy contrasts to most existing methods that test for differences in coexpression across a small number of classes. Another key difference is that identified modules can
be small or large, which is not possible in many other methods. For example, DICER only accepts
modules with at least fifteen genes [69]. Although Partition was applied here to identify modules
45



of correlated genes, other methods could be used, such as weighted gene co-expression network
analysis (WGCNA) [7]. Additionally, this framework can be applied to other types of molecular
data, such as proteomics or metabolomics.
Application of the ACDC differential co-expression approach and ordinal logistic regression
analyses identified three genes, ADORA3, ALOX15, and IDO1 whose co-expressions and expression levels were associated with 6-month and 7-day ACT scores in the ABRIDGE and CAMP
cohorts, respectively.
Adenosine is a nucleoside which exhibits increased production during periods of lung inflammation. Mediation is controlled through adenosine receptors like ADORA3. Previously, studies
have shown that while single nucleotide polymorphisms (SNPs) of ADORA3 loci are not associated with asthma [70, 71], ADORA3 expression is associated with immunoglobulin E levels in
whole blood samples of asthmatic patients [72] and is differentially expressed when comparing
patients with severe asthma to controls [73].
ALOX15 has both anti-inflammatory and inflammatory effects depending on its regulation and
has been previously implicated in the development of inflammatory diseases, including asthma. A
few studies have shown that ALOX15 can be found in airway mucosa of asthmatic patients [74,
75], and another study found evidence of differential expression of ALOX15 between controls and
asthmatics [76]. Additionally, one study found that haplotypic genetic variation at the locus for
ALOX15 is associated with asthma [77].
The best understood function of IDO1 is it’s role as an immunoregulator in cancer, inhibiting
the body’s ability to fight diseased cells, but its role in autoimmune responses is less clear. A mouse
study showed that the entire indoleamine family promotes allergic airway inflammation [78], and
a human study found evidence of differential expression of IDO1 between patients with severe
eosinophilic asthma, a more severe subtype of asthma typically found in adults and categorized by
high peripheral blood concentration of eosinophils, and healthy controls [79].
46



Though all three genes have been previously identified as differentially expressed in asthma,
there are varying degrees of understanding as to the biological roles that they play. To our knowledge, there are no studies that identify any of these genes as differentially co-expressed in asthma.
This additional information could help to fill knowledge gaps about how the genes regulate or
co-regulate asthma control.
A limitation of this analysis is the difficulty differentiating cause and effect between gene
expression, acute asthma control, and medication use. Does gene expression affect response to
asthma exacerbations or is it determined primarily by asthma control medications? The directionality of the relationship is particularly muddled by the inclusion of medication use in the calculation
of ACT scores, which is standard practice [32].
The number of co-expression features grows much more quickly than the number of genes in a
module (or other gene set). Thus, for large modules it may be useful to reduce the dimensionality of
the co-expression features or apply a feature selection mechanism in a preliminary step. To do this,
we rank all gene pairs in a module by their correlation and only calculate the co-expression matrix
for the minimum of pairs with correlation above a user-defined threshold or n/2. Additionally, the
ability to adjust for covariates in the CCA step would add to the utility of the approach.
In the comparison analysis using coXpress, an existing and highly-cited module-based differential co-expression method, genes ADORA3 and ALOX15 were identified among the most important, but no modules reached statistical significance. To achieve statistical significance, coXpress
requires that correlations be undetectable in one condition and detectable in the other. KruskalWallis tests of the co-expression matrices showed differences in co-expression across levels of
ACT, indicating that while co-expression is present at all levels of ACT, it is nevertheless different
across levels. This type of relationship cannot be captured by coXpress. Also, to use coXpress, the
ACT score must be dichotomized, which results in information loss.
Further study is needed to understand the larger network that includes ADORA3, ALOX15, and
IDO1. All three are part of the Nakajima Eosinophil pathway, a group of the top 30 eosinophilspecific genes [80]. This pathway is not well studied and while much has been published about the
47



role of eosinophils in asthma, few studies have looked at the role this pathway plays in asthma exacerbations or symptomology. More study is needed to determine what drives the associations with
co-expressions observed here. They could be related to differences in the expression of eosinophil
genes between eosinophilic and non-eosinophilic asthmatics. Alternatively, within eosinophilic
asthmatics, within non-eosinophilic asthmatics or for all subtypes, co-expressions may be associated with symptom control. That is, differences in expression of eosinophil genes within some
of these groups may be associated with symptom control or the associations may be driven by
differences between groups.
In summary, we propose a novel strategy for differential co-expression analysis that is a flexible extension to prior methodology. In applications to ABRIDGE and CAMP cohorts, we find
evidence of both differential co-expression and differential expression across ACT scores for
ADORA3, ALOX15, and IDO1, all genes which have been previously implicated in asthma. These
genes may be involved in the underlying regulatory mechanisms behind acute asthma control,
however, further study is needed.
48



Chapter 4
coVarQTLs and co-eQTLs: genetic co-regulation of large-scale
gene expression regimes
1 Introduction
An expression quantitative trait locus (eQTL) is a single base pair of a chromosome, or locus,
which explains some fraction of variation in a gene expression phenotype. Traditional association
analysis via linear regression is performed on both a genome-wide and a transcriptome-wide scale,
estimating associations between all loci and all gene expression profiles independently [81]. Given
Y, an N ×G gene expression matrix with G representing the number of genes and N the number of
samples, and X, an N × M genotype matrix with M representing the number of single nucleotide
polymorphisms (SNPs), the linear regression takes form
Y = Xβ +ε (4.1)
where ε ∼ N(0,σ
2
)is a Gaussian white-noise term. Here, we estimate β, the M×G matrix of geneSNP associations. While this scale of search can be computationally expensive, new software like
Matrix eQTL reduces the computational time from days to minutes [82]. Computation time can
also be reduced by limiting the search space to only genomic or transcriptomic regions associated
with a disease of interest.
49



Traditional differential expression analyses explore associations between single gene expression profiles and external data, while differential co-expression analyses allow for exploration of
the association between gene correlation structures and external data. Similarly, we posit that a
locus explaining some fraction of variation in a co-expression profile, or the gene-gene covariance
for all gene pairs in a module across all samples, will allow for exploration of how dynamic correlation structures within gene modules are associated with genomic variation when compared to
eQTL analyses. We call this novel QTL a covariance quantitative trait locus (coVarQTL).
For example, Qiu et. al. found that Thymic Stromal Lymphopoietin (TSLP) and transcription
factor Aryl Hydrocarbon Receptor (AHR) are differentially co-expressed in asthma patients who
respond poorly to corticosteroid treatments but are not significantly related in patients who respond
well [83]. Additionally, Li et. al. found that SNP rs3806932 is an eQTL for TSLP in cells
from bronchial alveolar lavages and human bronchial epithelial biopsies [84]. In this case, it is
reasonable to hypothesize that SNP rs3806932 is a coVarQTL for TSLP and AHR in asthmatic
patients.
To our knowledge, there are no existing methods for discovery of coVarQTLs. See Chapter
1 for a more thorough background on existing methodology, including the use of CCA, posteQTL co-expression network construction including EGRET, the methodology most similar to
our proposed method, and co-eQTL methods. We note that our proposed coVarQTL analysis
aims to capture genetic regulation of gene-gene covariance, which is novel compared to all of the
aforementioned methods.
Here we propose an analysis plan for discovery of coVarQTLs using various co-expression
features and compare results with existing co-eQTL methods.
50



2 Methods
Given sets of correlated genes, we can construct a coVarQTL analysis in an analogous way to
an eQTL analysis with the linear regression
YC = Xβ +ε (4.2)
with YC the N × D matrix of co-expression values. Therefore, our association matrix, β, has
dimensions P×D with D being the total number of gene sets, or modules.
Modules of correlated genes are identified by Partition, an agglomerative data reduction method
which performs both feature condensation and extraction based on a user provided information loss
criterion (ILC, ILC ∈ [0,1]), implemented by default as the intraclass correlation coefficient [48]
The ILC is the minimum ICC required for features to be condensed into a module. We complete
the analysis using low (0.15), medium (0.40), and high (0.65) ILC values.
We also explored the full range of possible ILCs using the Genome-wide Complex Trait Analysis (GCTA) Genomics Restricted Maximum Likelihood (GREML) method [59]. GREML estimates the percent of variance for an external variable that can be explained by all SNPs, also
known as the SNP-based heritability. In this analysis, we estimate the SNP-based heritability for
co-expression and covariance of gene modules created by increasingly reduced gene expression
datasets (see Chapter 2.3 for more information). The results of this analysis for the discovery
cohort can be seen in Figures 2.7 and 2.8.
2.1 Covariance and Co-expression Calculation
To calculate gene module covariance, we use the calculation from association of covariance
for detecting differential co-expression method (ACDC), an association test for gene covariance
measures and external variables [85] (see Chapter 3 for more details). Briefly, given gene sets
M = (M1,...,MD), hereafter referred to as ’modules’, with D the total number of modules, we
51



can calculate the module covariance matrices, CM. Assume module Md is comprised of g genes,
∼ N (µg,Σg). Let the co-expression value for gene pair j, k in sample s be
σˆ j,k ≡

gs, j −g¯j
gs,k −g¯k

(4.3)
where gs, j denotes the expression of gene j in sample s and ¯gj
the average expression across all
samples. We note that these components are a method of moments estimator for σj,k
, the offdiagonal elements of the overall covariance matrix Σg. Then, we can denote the co-expression
profile for module Md across all samples as,
C =

σ1,2,...,σj,k
,...σ(g−1),g

;|C| =

g
2

= Gd. (4.4)
These co-expression profiles form the rows of CMd
.
2.2 Covariance and Co-expression Summary Measure
Given the set of CM, YC can be constructed in numerous ways, and we explored three options.
First, the row means of each CMd
are computed by averaging all the covariance measures for
each sample. Thus, each CMd
contributes one column to YC making it N ×D dimensional. Row
means are a computationally efficient way to reduce the transcriptomic search space.
The second option is to use the first PC of each CMd
. Again, in this case, each CMd
contributes one column to YC making it N ×D dimensional. As mentioned later in the discussion, PCs
are computationally inexpensive to compute, reduce the transcriptomic search space, and capture
information about the variance within co-expression profiles.
The third option is to apply the Partition algorithm to each CMd with an information loss criterion of 0.75, meaning that each new feature comprised of gene-pair covariance measures will have
an intra-class correlation coefficient of at least 0.75. Each new feature will contribute a column via
the scaled row means of the combined original features to YC, and any gene pair covariance features that are not collapsed into groups will each contribute a column to YC. In this case, YC will



have N rows and a varying number of columns. Applying Partition to each CMd
preserves a large
magnitude of information from the complete covariance profiles while also reducing the search
space. Varying the information loss criterion chosen for this step will change the balance between
information preservation and dimension reduction. Additionally, since Partition implements an
easily tractable mapping technique, interpretability of the results is not diminished.
To compare approaches, we completed coVarQTL analyses in the full transcriptome for the
three covariance matrix designs described above, with data partitioned at an ILC of 0.40. Using
the PCs of the covariance matrix yielded 523 significant coVarQTLs, determined by BH FDR <
0.05, while the row means of the gene covariance matrix found 476, and the Partitioned gene
covariance matrix found 161. Thus, moving forward, we used the PCs as our summary measure.
To compare to existing similar methods, we also perform a co-eQTL analysis by summarizing
each module-specific expression matrix by its first PC [25, 26].
2.3 QTL Analyses
Once YC is constructed, Matrix eQTL is used to estimate the association matrix β, adjusting
for patient age, sex assigned at birth, and the first four PCs of the genotype matrix. Significance
of results was assessed by BH FDR < 0.05 [86]. Additionally, the MV FDR was estimated. MV
FDR is a non-parametric, permutation based approach that sets the rejection region and estimates
the FDR and a 95% CI after a user-determined number of permutations and allows user to quantify
the uncertainty in an FDR estimate [67]. We choose 25 permutations to calculate the MV FDR
estimate and 95% CI.
Both the inflammatory probe set and the full transcriptome with modules determined by Partition with ILCs of 0.15, 0.40, and 0.65 are used for analysis, for a total of six parameter sets. In
order to compare to existing co-eQTL methodology, a parallel set of analyses is run where each
column of YC is the first PC of a module-specific expression matrix.
The genomic inflation factor, λGC, defined as the ratio of the median of the empirically observed
test statistic distribution for the QTL tests to the expected median, is calculated for each set of
53



results. Ideally, λGC ≤ 1, indicating that there is no systematic bias in the analysis leading to an
inflated number of discoveries. Many genomic inflation measures in Tables 4.2, 4.3, 4.4, and 4.5
are estimates since Matrix eQTL returns p-values as a distribution rather than individual values
when the number of test exceeds roughly 50 million. Therefore, we are only able to determine that
the median p-value falls within a narrow window (for example, median p-value ∈ [0.04,0.05]),
resulting in a small range for the true λGC. The estimate reported in the tables is the maximum
possible value.
The distributions of our constructed covariance measures are non-normal, and thus the linearity assumption of the linear regression model is violated. Various solutions for normalizing the
distribution are explored including quantile normalization [87], removal of outliers, and capping
the outliers at the appropriate boundary. We defined outliers as any data point, x, such that
x > P75 +3 ∗ IQR or x < P25 −3 ∗ IQR (4.5)
with the interquartile range defined as IQR = P75 −P25.
A traditional eQTL analysis is also performed to allow for comparison of results and data
validation. Models are adjusted for patient age, sex assigned at birth, the first three genotype PCs,
and the first five probabilistic estimation of expression residuals (PEER) factors [88], which aim to
capture latent variables that explain variability in gene expression profiles.
2.4 CAMP and ABRIDGE Data
For an overall description of the CAMP study, see Chapter 1, Section 2. Genomic data, gene
expression profiles from whole blood cells, and clinical questionnaire data are used for analysis.
Non-Hispanic white (NHW), asthmatic patients that contributed both genomic chip arrays and
whole blood microarray data were included in the cohort, for a total of 254 patients. Demographics
for this discovery cohort can be seen in Table 4.1.
54



For an overall description of the ABRIDGE study, see Chapter 1, Section 2. Genomic data,
gene expression profiles from whole blood cells, and clinical questionnaire data from the Mexico
City Children’s Asthma Study (MCCAS) at Hospital Infantil de Mexico Federico Gomez (HMEX)
and Children’s Health Study (CHS) at the University of Southern California (USC) are used for
analysis here. All cohorts are comprised of individuals with physician diagnosed asthma who
contributed whole blood gene expression samples and genomic chip arrays. Demographics for the
various cohorts making up the meta, replication cohort can be seen in Table 4.1.
Genotypes were measured using the Illumina HumanHap 550K v3 array. Imputation was completed via the Michigan Imputation Server [46], and subsequently, probes with R
2 < 0.80, minor
allele frequency < 0.10, or located on sex chromosomes were removed. Haplotype blocks were estimated using PLINK v1.9 with 90% D’ CI set to the lowest allowable thresholds of (0.505, 0.835)
and a maximum span of one megabase [89]. This results in 105,041 haplotype blocks containing
4,151,198 SNPs in the CAMP NHW cohort. We filter QTL results such that there is a maximum
of one significant discovery per gene-haplotype block pair per analysis.
Gene expression profiles were measured using the Illumina Human HT-12 v4 array. The data
was normalized via a log2-transformation and quantile-normalization. Duplicate probes were condensed by the largest median absolute deviation, leaving only probes with unique targets. Probes
with targets located on sex chromosomes were also removed, leaving 19,428 probes for analysis
across the full transcriptome. The inflammatory gene set includes 623 probes with targets annotated for inflammatory response in Gene Ontology [33]. Six-month asthma control test (ACT)
scores were calculated from questionnaire responses about wheezing with and without exercise,
patient waking due to wheezing, and the need for Albuterol in the last six months (range: [4,20]),
where higher scores indicate higher frequency of symptoms.
After the initial analysis in the CAMP NHW cohort, results with BH FDR < 0.05 for mean
summarized QTLs and p-value < 10−7
for covariance QTLs are followed up in independent cohorts of patients from MCCAS and USC. Additionally, only modules with ICC > 0.75 ∗ ILC in
each replication cohort are included to ensure sufficient gene correlation in all cohorts. Estimates
55



Table 4.1: Patient demographics for discovery and replication cohorts
Table 4.1 CAMP = Childhood Asthma Management Program; MCCAS = Mexico City Children’s Asthma
Study; USC = University of Southern California; HW = Hispanic White; NHW = Non-Hispanic White
and standard errors from Matrix eQTL for each cohort are input to METAL, an inverse-variance
based method for meta-analysis of genome-wide association summary statistics, to get composite
p-values and to test for heterogeneity of effect sizes between cohorts [90]. From there, BH FDR
estimates are calculated. All analyses are completed using R version 4.2.2.
3 Results
3.1 eQTL results
In the inflammatory probe set, there were 274 eQTLs with BH FDR < 0.05. This set includes 75 unique probes and 274 unique SNPs and had an MV FDR of 5.32 × 10−4
. 259 of the
eQTLs were cis-eQTLs, defined by the gene transcription start site being located within a two
megabase window of the SNP, and 15 were trans-eQTLs, defined as the gene transcription start
site being outside a two megabase window of the SNP. The top three most significant eQTLs are
all cis-eQTLs and included rs10954213 and Interferon Regulatory Factor 5 (IRF5), rs10193543
and Cytochrome P450 Family 26 Subfamily B Member 1 (CYP26B1), both of which are known
56



Figure 4.1: Module information for inflammatory gene set at various information loss criterion
6
1
Total Number of Modules: 7
Total Number of Probes: 15
2
3
0 2 4 6
Count
Module Size
(a) ILC = 0.65, Module Sizes
4
1
1
1
0
2
3
13
0 1 2 3 4
Count
Number of eQTLs per Module
(b) ILC = 0.65, Number of eQTLs per Module
21
13
5
7
1
1
1
1
Total Number of Modules: 50
Total Number of Probes: 176
2
3
4
5
6
8
11
15
0 5 10 15 20
Count
Module Size
(c) ILC = 0.40, Module Sizes
21
6
1
1
3
2
2
1
1
6
4
2 Maximum: 44
0
1
2
3
4
5
6
9
10
11−15
16−25
26+
0 5 10 15 20
Count
Number of eQTLs per Module
(d) ILC = 0.40, Number of eQTLs per Module
18
13
8
5
4
2
2
2
2
5
5
3
Total Number of Modules: 69
Total Number of Probes: 477
Maximum Size: 25
2
3
4
5
6
7
8
9
10
11−15
16−25
26+
0 5 10 15
Count
Module Size
(e) ILC = 0.15, Module Sizes
20
6
4
3
2
2
1
3
3
3
4
5
10
3
Maximum: 72
0
1
2
3
4
5
6
7
9
10
11−15
16−25
26−50
51+
0 5 10 15 20
Count
Number of eQTLs per Module
(f) ILC = 0.15, Number of eQTLs per Module
Figure 4.1 Plots showing number of genes per module (left) and eQTLs for genes in the module (right) for
each of the information loss criterion levels (rows) used to partition the inflammatory gene data set.
eQTLs in whole blood, and rs7121273 and Alkaline Ceramidase 3 (ACER3), which is a known
eQTL in the esophagus/mucosa [11].
In the full transcriptome, there were 7,807 eQTLs with BH FDR < 0.05. This set includes 2,178
unique probes and 7,478 unique SNPs and had an MV FDR of 7.64×10−5
. 6,699 of the eQTLs
were cis-eQTLs, and 1,108 were trans-eQTLs. The top three most significant discoveries were all
cis-eQTLs and included rs4902343 and Churchill Domain Containing 1 (CHURC1), rs12230244
and C-Type Lectin Domain Family 12 Member A (CLEC12A), and rs7309256 and CLEC12A, all
of which are known eQTLs in whole blood [11].
57



Figure 4.2: Module information for the full transcriptome at various information loss criterion
315
125
46
25
26
14
6
6
5
18 Total Number of Modules: 586
Total Number of Probes: 1992
Maximum Size: 25
2
3
4
5
6
7
8
9
10
11+
0 100 200 300
Count
Module Size
(a) ILC = 0.65, Module Sizes
389
54
24
31
15
13
39
18
3 Maximum: 33
0
1
2
3
4
5
6−10
11−24
25+
0 100 200 300 400
Count
Number of eQTLs per Module
(b) ILC = 0.65, Number of eQTLs per Module
484
252
147
86
65
44
39
29
23
107
36
5
3
Total Number of Modules: 1320
Total Number of Probes: 7678
Maximum Size: 120
2
3
4
5
6
7
8
9
10
11−24
25−49
50−99
100+
0 100 200 300 400 500
Count
Module Size
(c) ILC = 0.40, Module Sizes
738
107
39
63
43
37
148
116
23
6 Maximum: 81
0
1
2
3
4
5
6−10
11−24
25−49
50+
0 200 400 600 800
Count
Number of eQTLs per Module
(d) ILC = 0.40, Number of eQTLs per Module
664
400
227
119
95
64
47
34
34
151
67
37
18
4
1
Total Number of Modules: 1962
Total Number of Probes: 18059
Maximum Size: 750
2
3
4
5
6
7
8
9
10
11−24
25−49
50−99
100−149
250−499
750
0 200 400 600
Count
Module Size
(e) ILC = 0.15, Module Sizes
1412
96
48
41
32
32
105
102
50
44 Maximum: 524
0
1
2
3
4
5
6−10
11−24
25−49
50+
0 500 1000 1500
Count
Number of eQTLs per Module
(f) ILC = 0.15, Number of eQTLs per Module
Figure 4.2 Plots showing number of genes per module (left) and eQTLs for genes in the module (right) for
each of the information loss criterion levels (rows) used to partition the full transcriptome probe set.
58



Figure 4.3: Violin plots of gene covariance by genotype for most significant CAMP NHW
coVarQTLs, including outliers Kruskal−Wallis, p = 0.02
0.022
−9
−6
−3
0
3
AA (3) AG (47) GG (204)
rs34419562 Genotype
Module 826 Covariance
(a) Module 826 Covariance by rs34419562 Genotype
Kruskal−Wallis, p = 0.37
−1
0
1
2
AA (3) AG (47) GG (204)
rs34419562 Genotype
Module 306 Covariance
(b) Module 306 Covariance by rs34419562 Genotype
Kruskal−Wallis, p = 0.22
0
1
2
AA (3) AG (47) GG (204)
rs34419562 Genotype
Module 49 Covariance
(c) Module 49 Covariance by rs34419562 Genotype
Figure 4.3 Violin plots for (a) PC1 of module 826 covariance by rs34419562 genotype, (b) PC1 of module
306 covariance by rs34419562 genotype, and (c) PC1 module 826 covariance by rs34419562 genotype, the
top three most statistically significant coVarQTLs. Each dot represents values for one patient.
Kruskal-Wallis was used to test for global differences, and Wilcoxon rank sum was used to test for pairwise
differences.
59



Table 4.2: CAMP NHW QTL analysis results for various expression sets, including outliers
Table 4.2 Probe set refers to the gene expression probes used for analysis, either probes that map to genes
with inflammatory annotations or all probes. The information loss criterion (ILC) is the value used in
Partition to create the gene modules and determines the number of tests. Number of discoveries is
determined by BH FDR < 0.01 for mean summarized and BH FDR < 0.05 for covariance summarized.
MV FDR values are for the corresponding p-value thresholdds.
3.2 Initial CAMP NHW QTL results
co-eQTLs and coVarQTLs analyses were performed on the CAMP NHW dataset for both inflammatory probe set and the full transcriptome partitioned at ILCs of 0.65, 0.40, and 0.15. Results
of these analyses can be seen in Table 4.2. For all parameter sets, we see genomic inflation factors near one, indicating that number of discoveries are not inflated compared to the number of
total tests. Number of discoveries is defined as BH FDR < 0.01 for co-eQTLs and BH FDR <
0.05 for coVarQTLs. This difference in significance thresholds is supported by the difference in
strength of the biological signals between co-expression/covariance and genotypes (see Chapter
2, Section 2.3). While there were minimal discoveries in the inflammatory probe set, discoveries
are abundant in the full transcriptome, with the number of discoveries increasing as ILC decreases
for co-eQTLs and number of discoveries peaking at an ILC of 0.40 in the full transcriptome. Additionally, the non-parametric, permutation based MV FDR was estimated. The set of maximum
discoveries for co-eQTLs (full transcriptome, ILC = 0.15) has an MV FDR of 0.059 (95% CI:
(0.047, 0.074)), while the set of maximum discoveries for coVarQTLs (full transcriptome, ILC
= 0.40) has an MV FDR of 0.958 (95% CI: (0.782, 0.999)). The discrepancy between BH FDR
60



Table 4.3: CAMP NHW QTL analysis results for various quantile normalized expression sets,
including outliers
Table 4.3 Probe set refers to the gene expression probes used for analysis, either probes that map to genes
with inflammatory annotations or all probes. The information loss criterion (ILC) is the value used in
Partition to create the gene modules and determines the number of tests. Number of discoveries is
determined by BH FDR < 0.01 for mean summarized and BH FDR < 0.05 for covariance summarized.
MV FDR values are for the corresponding p-value thresholds.
and MV FDR estimates indicates a departure from parametric assumptions [67], and this was confirmed when looking at violin plots for the top three most significant coVarQTLs. Figure 4.3 shows
very minimal differences in gene module covariance between genotype groups and seems to indicate that differences are driven by a combination of outliers and small minor allele homozygote
groups. Note that a list of the genes comprising the modules depicted in Figure 2.3 can be found
in Appendix Table A.1.
3.3 Non-normality in covariance distributions
Plotting the distribution of the first PC of module covariance for most significant module from
the coVarQTL results (module 826, comprised of seven genes; a list of these genes can be found in
Appendix Table A.1) reveals a long left tail (Figure 4.4(b)). Since this measure is the outcome in a
linear regression model, there is a departure from the linearity assumption. The first PC of module
expression, the outcome for co-eQTL analysis, shows only a slight right skew, indicating minimal
if any assumption violations (Figure 4.4(a)).
61



Figure 4.4: Violin plots of gene module expression and covariance by genotype for most
significant CAMP NHW coVarQTLs, including outliers
0.00
0.25
0.50
0.75
1.00
−1 0 1 2 3
PC1(Module 826 Expression)
Density
(a) Mean − Outliers Included
0
2
4
6
−4 −2 0
PC1(Module 826 Covariance)
Density
(b) Covariance − Outliers Included
0.00
0.25
0.50
0.75
1.00
−1 0 1 2
PC1(Module 826 Expression)
Density
(c) Mean − Quantile Normalized
0
1
2
3
−4 −3 −2 −1 0
PC1(Module 826 Covariance)
Density
(d) Covariance − Quantile Normalized
0.00
0.25
0.50
0.75
1.00
−1 0 1
PC1(Module 826 Expression)
Density
(e) Mean − Outliers Removed
0
2
4
6
8
−0.3 −0.2 −0.1 0.0 0.1 0.2
PC1(Module 826 Covariance)
Density
(f) Covariance − Outliers Removed
0.00
0.25
0.50
0.75
1.00
−1 0 1
PC1(Module 826 Expression)
Density
(g) Mean − Capped Outliers
0
2
4
6
−0.2 0.0 0.2
PC1(Module 826 Covariance)
Density
(h) Covariance − Capped Outliers
Figure 4.4 Density plots for first principal components of gene module 826 mean expression (left) and
covariance (right) for data sets with outliers included (plots a and b), quantile normalized data sets (plots c
and d), data sets with outliers removed (plots e and f), and finally for data sets with outliers capped at either
the 25th percentile minus three times the interquartile range or the 75th percentile plus three times the
interquartile range (plots g and h).
62



Table 4.4: CAMP NHW QTL analysis results for various expression sets with outliers removed
Table 4.4 Probe set refers to the gene expression probes used for analysis, either probes that map to genes
with inflammatory annotations or all probes. The information loss criterion (ILC) is the value used in
Partition to create the gene modules and determines the number of tests. Number of discoveries is
determined by BH FDR < 0.01 for mean summarized and BH FDR < 0.05 for covariance summarized.
MV FDR values are for the corresponding p-value thresholds.
Since the range of possible values includes negative values, a traditional log transformation is
not applicable. Instead, we first tried quantile normalization [87], and the results of the subsequent
QTL analyses can be seen in Table 4.3. The patterns in number of discoveries between co-eQTLs
and coVarQTLs and genomic inflation factor values are the same as in the previous analysis but so
is the inflated MV FDR values for coVarQTLs. Based on this and the distribution of the first PC of
module covariance (Figure 4.4(d)), quantile normalization is not sufficient to deal with deviations
from model assumptions for covariance measures.
Outliers were removed for the next analyses (see Equation 4.5 for definition of outliers), and
the results can be seen in Table 4.4. For the mean summarized, co-eQTL analysis, the number
of discoveries slightly decreased and strength of discoveries slightly increased compared to prior
runs of the analysis, confirming again that outliers are not the primary signal driving the significant
associations for mean expression. However, for the covariance summarized coVarQTL analysis,
there are no discoveries across any of the parameter sets, indicating that the signal from outliers is
the primary driver of associations. We also see in the distribution of the first PC of module covariance in Figure 4.4(f) that while the long left tail has been cut off (range: [-0.35, 0.25] compared to a
63



Table 4.5: CAMP NHW QTL analysis results for various expression sets with capped outliers
Table 4.5 Probe set refers to the gene expression probes used for analysis, either probes that map to genes
with inflammatory annotations or all probes. The information loss criterion (ILC) is the value used in
Partition to create the gene modules and determines the number of tests. Number of discoveries is
determined by BH FDR < 0.01 for mean summarized and BH FDR < 0.05 for covariance summarized.
MV FDR values are for the corresponding p-value thresholds.
range in the raw distribution of [-5.25, 0.25]), the peak density is 33% higher than when including
outliers, confirming that there is significantly less variation in the data due to removing outliers.
3.4 CAMP NHW QTL results for capped outliers
In order to deal with departures from the models assumptions of linear regression without
removing signal-driving variation from the data, the final run of the analysis was completed with
outliers capped at the values described in Equation 4.5. As seen in Figure 4.4(h), the distribution
of PC1 of module covariance with capped outliers has only a small left tail and an increase in the
density at the left capping point.
The results of the QTL analysis for all parameter sets for data with outliers capped can be seen
in Table 4.5. co-eQTL results have again stayed consistent in strength and number of discoveries, finding 91 co-eQTLs with BH FDR < 0.01 in the full transcriptome partitioned at an ILC of
0.40 (MV FDR = 0.043, 95% CI: (0.030, 0.061)). Violin plots for PC1 of module expression by
genotype for the three most significant discoveries can be seen in Figure 4.5. Figures 4.5(a) showing module 769 expression by genotype at rs1131017 and 4.5(b) showing module 348 expression
64



Figure 4.5: Violin plots of mean gene module expression by genotype for most significant
CAMP NHW co-eQTLs, with outliers capped Kruskal−Wallis, p < 2.2e−16
2.1e−05
p < 2.22e−16
p < 2.22e−16
0
2
4
GG (48) CG (129) CC (77)
rs1131017 Genotype
Module 769 Expression
(a) Module 769 Expression by rs1131017 Genotype
Kruskal−Wallis, p < 2.2e−16
6.9e−07
p < 2.22e−16
p < 2.22e−16
0
1
2
GG (48) CG (129) CC (77)
rs1131017 Genotype
Module 348 Expression
(b) Module 348 Expression by rs1131017 Genotype
Kruskal−Wallis, p < 2.2e−16
2.4e−05
3.4e−09
6.6e−12
−1
0
1
2
3
TT (16) CT (90) CC (148)
rs1314656593 Genotype
Module 433 Expression
(c) Module 433 Expression by rs1314656593 Genotype
Figure 4.5 Violin plots for (a) PC1 of module 769 expression by rs1131017 genotype, (b) PC1 of module
348 expression by rs1131017 genotype, and (c) PC1 module 433 expression by rs1314656593 genotype,
the top three most statistically significant co-eQTLs. Each dot represents values for one patient.
Kruskal-Wallis was used to test for global differences, and Wilcoxon rank sum was used to test for pairwise
differences.
65



Figure 4.6: Violin plots of gene module covariance by genotype for significant CAMP NHW
coVarQTLs, with outliers capped
Kruskal−Wallis, p = 1.3e−06
0.006
2.8e−06
−7.5
−5.0
−2.5
0.0
2.5
5.0
GG (10) AG (57) AA (187)
rs34815776 Genotype
Module 1029 Covariance
(a) Module 1029 Covariance by rs34815776 Genotype
Kruskal−Wallis, p = 5.1e−08
0.00013
5.9e−06
0.00011
0.0
0.2
0.4
AA (8) AG (83) GG (163)
rs9978034 Genotype
Module 287 Covariance
(b) Module 287 Covariance by rs9978034 Genotype
Figure 4.6 Violin plots for (a) PC1 of module 1029 covariance by rs34815776 genotype, (b) PC1 of
module 287 covariance by rs9978034 genotype, the two coVarQTLs with BH FDR < 0.05. Each dot
represents values for one patient. Kruskal-Wallis was used to test for global differences, and Wilcoxon rank
sum was used to test for pairwise differences.
66



for the same genotype exhibit decreasing trends with regard to number of copies of the major allele, while Figure 4.5(c) showing module 433 expression by genotype at rs1314656593 indicates
increasing module expression by number of copies of the major allele.
Most of the coVarQTL signal from the initial analysis including outliers is not discoverable
after capping the outliers. However, there are two significant discoveries from the full transcriptome partitioned at an ILC of 0.40 which have an MV FDR of 0.140 (95% CI: (0.011, 1)). This
is substantially lower than previous discovery sets, indicating that we have successfully handled
departures from parametric assumptions. Violin plots for PC1 of module covariance by genotype
for the two discoveries can be seen in Figure 4.6, with (a) showing module 1029 covariance by
rs34815776 genotype and (b) showing module 287 covariance by rs9978034 genotype. While the
number of minor allele homozygotes is still small for both genotypes, there are significant differences between the major allele homozygotes and heterozygotes (Wilcoxon rank sum = 2.8x10−6
and 1.1x10−4
, respectively), which were not present in initial results.
3.5 Meta-analysis results
co-eQTL analyses for a discovery set from the CAMP NHW cohort were completed in each
of the three replication cohorts, MCCAS (n = 135), USC HW (n = 71), and USC NHW (n = 36).
Module-SNP pairs with BH FDR < 0.05 in the full transcriptome partitioned with an ILC of 0.40
were included in the discovery set, for a total of 184 pairs containing 132 unique modules and 179
unique SNPs. The 20 most significant results from the METAL meta-analysis can be seen in Table
4.6. The only result with BH FDR ̸= 1 is between module 279, comprised of 2-Hydroxyacyl-CoA
Lyase 1 (HACL1) and Propionyl-CoA Carboxylase Subunit Beta (PCCB), and rs10896851 (BH
FDR = 0.362). This was the 60th most significant module-SNP pair in the discovery cohort (BH
FDR = 0.583). We note that the χ
2
test for heterogeneity has a p-value of 0.005, indicating that
effect sizes are not consistent across cohorts. This can be seen in the violin plots of mean HACL1
and PCCB expression by rs10896851 genotype across the discovery (Figure 4.7(a)) and replication
cohorts (Figures 4.7(b), (c) and (d)). For the CAMP NHW and MCCAS cohorts, mean module
67



Table 4.6: co-eQTL meta-analysis results
Table 4.6 Top 20 results from METAL inverse-variance based meta-analysis, with composite effects,
standard errors, and p-values. BH FDR values are based on composite p-values. Direction indicates the
sign of cohort-specific effects in the MCCAS, USC HW, and USC NHW cohorts, respectively. The final
column gives the p-value for a χ
2
test for heterogeneity, with p < 0.05 indicating differences in effect sizes
between cohorts.
68



Figure 4.7: Violin plots of gene module expression by genotype for significant METAL co-eQTL
in each cohort
Kruskal−Wallis, p = 0.66
−1.0
−0.5
0.0
0.5
1.0
TT (10) GT (70) GG (174)
rs10896851 Genotype
HACL1, PCCB Covariance
(a) CAMP NHW (n = 254)
Kruskal−Wallis, p = 0.00045
0.012
5.4e−05
−0.5
0.0
0.5
1.0
TT (27) GT (65) GG (43)
rs10896851 Genotype
HACL1, PCCB Covariance
(b) MCCAS (n = 135)
Kruskal−Wallis, p = 0.53
−0.5
0.0
0.5
1.0
TT (9) GT (33) GG (29)
rs10896851 Genotype
HACL1, PCCB Covariance
(c) USC HW (n = 71)
Kruskal−Wallis, p = 0.38
−0.4
0.0
0.4
0.8
TT (2) GT (15) GG (19)
rs10896851 Genotype
HACL1, PCCB Covariance
(d) USC NHW (n = 36)
Figure 4.7 Violin plots for PC1 of HACL1 and PCCB (module 279) expression by rs10896851 genotype,
the only statistically significant co-eQTL in the meta-analysis, in the CAMP NHW (plot a), MCCAS (plot
b), USC HW (plot c), and USC NHW (plot d) cohorts. Each dot represents values for one patient.
Kruskal-Wallis was used to test for global differences, and Wilcoxon rank sum was used to test for pairwise
differences.
69



Table 4.7: coVarQTL meta-analysis results
Table 4.7 Top 20 results from METAL inverse-variance based meta-analysis, with composite effects,
standard errors, and p-values. BH FDR values are based on composite p-values. Direction indicates the
sign of cohort-specific effects in the MCCAS, USC HW, and USC NHW cohorts, respectively. The final
column gives the p-value for a χ
2
test for heterogeneity, with p < 0.05 indicating differences in effect sizes
between cohorts.
70



Figure 4.8: Violin plots of gene module covariance by genotype for significant METAL
coVarQTL in each cohort
Kruskal−Wallis, p = 0.54
−0.1
0.0
0.1
GG (15) GT (92) TT (147)
rs894204 Genotype
HYLS1, GPATCH4, and EEF1AKMT1 Covariance
(a) CAMP NHW (n = 254)
Kruskal−Wallis, p = 0.058
0.04
0.017
−0.1
0.0
0.1
0.2
GG (19) GT (59) TT (57)
rs894204 Genotype
HYLS1, GPATCH4, and EEF1AKMT1 Covariance
(b) MCCAS (n = 135)
Kruskal−Wallis, p = 0.15
−0.1
0.0
0.1
GG (8) GT (41) TT (22)
rs894204 Genotype
HYLS1, GPATCH4, and EEF1AKMT1 Covariance
(c) USC HW (n = 71)
Kruskal−Wallis, p = 0.8
−0.1
0.0
0.1
GG (3) GT (16) TT (17)
rs894204 Genotype
HYLS1, GPATCH4, and EEF1AKMT1 Covariance
(d) USC NHW (n = 36)
Figure 4.8 Violin plots for PC1 of HYLS1, GPATCH4, and EEF1AKMT1 (module 383) covariance by
rs894204 genotype, the only statistically significant coVarQTL in the meta-analysis, in the CAMP NHW
(plot a), MCCAS (plot b), USC HW (plot c), and USC NHW (plot d) cohorts. Each dot represents values
for one patient. Kruskal-Wallis was used to test for global differences, and Wilcoxon rank sum was used to
test for pairwise differences.
71



expression increases with number of copies of the major allele whereas in both USC cohorts, there
is no clear relationship between expression and genotype.
The same analysis was performed for coVarQTLs. Module-SNP pairs with a p-value < 10−7
in
the full transcriptome partitioned with an ILC of 0.40 were included in the discovery set, for a total
of 155 pairs containing 100 unique modules and 155 unique SNPs. A different number of SNPs
were available in each cohort (4,236,483 in CAMP NHW, 4,748,890 in MCCAS, 4,904,665 in
USC NHW, and 4,199,403 in USC HW) because of the differing ancestry-based reference panels
used in the imputation process, and due to limited sample sizes, we restrict inclusion in the final
meta-analysis to only module-SNP pairs available in all cohorts. This leaves only 37 results for
meta-analysis. We note that varying number of SNPs did not reduce the discovery set in the coeQTL analysis. The 20 most significant results from the METAL meta-analysis for coVarQTLs can
be seen in Table 4.7. The only result with BH FDR ̸= 1 is between module 383, comprised of Centriolar And Ciliogenesis Associated HYLS1 (HYSL1), G-Patch Domain Containing 4 (GPATCH4),
and EEF1A Lysine Methyltransferase 1 (EEF1AKMT1), and rs894204 (BH FDR = 0.025, heterogeneity χ
2 p-value = 0.2125). This pair was the fifth most significant coVarQTL in the discovery
set (BH FDR = 0.010). Figure 4.8 shows violin plots of HYSL1, GPATCH4, and EEF1AKMT1
covariance by rs894204 genotype across the discovery ((a)) and replication cohorts ((b), (c) and
(d)). We note that the only significant pairwise differences between cohorts appear in the MCCAS
cohort.
4 Discussion
4.1 Biological implications
Application of the coVarQTL framework to the CAMP NHW cohort after capping outliers
identified two gene module-SNP pairs with BH FDR < 0.05: module 1029 (size = 20) and
rs34815776 and module 287 (size = 3) and rs9978034.
72



rs34815776, or chromosomal location 3:1006653, has not been implicated in any published
studies and is not a known eQTL for any gene within the GTex portal [11]. Module 1029 contains 20 genes, for which a complete list can be found in Appendix Table A.1. A statistical overrepresentation test for gene ontology molecular function pathways using PANTHER found that
two genes, CD14 and C5AR1, are part of the opsonin receptor activity pathway (expected count
= 0, BH FDR = 0.0453) [91]. Opsonins act as a bridge between a pathogen and macrophage, tagging the pathogen for phagocytosis [92], and studies have indicated that other genes in the opsonin
pathway may play a role in asthma symptomology based on expression in airway epithelial cells
[93, 94].
rs9978034, or chromosomal location 21:41808907, is a known eQTL for gene PR/SET Domain 15 (PRDM15) across many tissues, including whole blood [11]. Module 287 includes genes
Alpha-1,2-mannosyltransferase ALG9 (ALG9), Mitochondrial Ribosomal Protein L49 (MRPL49),
and Coiled-Coil Domain Containing 86 (CCDC86). ALG9 is a protein-coding gene with a gene
ontology annotation for glycosyltransferase activity and is implicated in diseases of glycosylation.
While the gene itself is not known to be associated with asthma, the locus nearest to the gene
transcription start site, rs659529, is known to be associated with asthma and other allergic diseases
[95]. MRPL49 is a protein-coding gene involved in mitochondrial translation and metabolism of
proteins which has been shown to be important in development of various cancers [96, 97] as
well as Parkinson’s disease [98] but not yet in asthma. CCDC86 is a protein-coding gene with
a gene ontology annotation for RNA binding that is associated with epilepsy. CCDC86 has also
been implicated as a potential target for various cancers [99, 100] and schizophrenia [101]. To our
knowledge, nothing has yet been published about the interactions between these three genes and
the role they may play in asthma or any other phenotype.
A meta-analysis with the coVarQTL framework using the CAMP NHW cohort for discovery
and three, independent ABRIDGE cohorts for replication identified a single gene module-SNP
pair with BH FDR < 0.05: module 383 (size = 3) and rs894204. The same set up was used for a
meta-analysis of co-eQTLs, which found no statistically significant associations.
73



rs894204, or chromosomal location 4:179125944, has not been implicated in any published
studies and is not a known eQTL for any gene within the GTex portal [11]. Module 383 includes genes HYLS1, GPATCH4, and EEF1AKMT1. HYLS1 is a protein-coding gene for which
a missense mutation is known to cause hydrolethalus syndrome, a recessive and lethal fetal malformation [102]. Additionally, newer research implicates HYLS1 as a cell-specific marker for
deuterosomal cells, a subtype of epithelial cells which decrease in concentration during airway
inflammation events [103]. GPATCH4 is a protein-coding gene with gene ontology annotations for
nucleic acid binding and RNA binding [33]. In asthma, GPATCH4 has been shown to be differentially expressed when comparing asthmatic patients with both atopic and non-atopic controls [104].
Finally, EEF1AKMT1 is a protein-coding gene involved in protein methylation and metabolism of
proteins, and it has not yet been indicated in asthma. To our knowledge, nothing has yet been
published about the effect of these three genes together, in asthma or any other phenotype.
Given how little information is known about many of the SNPs, that the majority of the genes
individually have never been implicated in asthma, and that none of the gene modules have been
discovered prior, further study is necessary to understand the implications of our results. Similar
to the current theory that trans-eQTL signal capture mediation of gene expression through transcription factors nearby the locus [105, 106], we posit that coVarQTLs capture mediation between
transcription factors nearby the locus of interest and the gene covariance network. This is supported
by that fact that none of the genes in the relevant modules fall within a two megabase window of
the corresponding SNP for the three coVarQTLs discovered. One such possible explanation involves the long non-coding RNA (lncRNA) sequence AC087430.1, which is 1,500 base pairs from
rs34815776. AC087430.1 is already known to be part of many eQTLs in testis [11]. Given that
recent studies suggest the importance of lncRNA in gene regulation [107, 108], it is possible that
rs34815776 as a coVarQTL for module 1029 captures the regulatory effect of AC087430.1 on the
genes in the module.
We note that our top two most significant co-eQTL results in the CAMP NHW cohort include
SNP rs1131017 at chromosomal location 12:56042145. This SNP has been implicated in another
74



co-eQTL study by Li et. al. which found that rs1131017 is a co-eQTL for 371 gene pairs, all
of which include Ribosomal Protein S26 (RPS26) [26]. We note that rs1131017 the transcription
start site of RPS26 are less than 1,000 base pairs apart, making the captured signal a cis-co-eQTL
mechanism. rs1131017 is a well documented eQTL for RPS26 in many tissues, and Li et. al. say
that their results point to a possible regulatory mechanism of rs1131017 in autoimmune diseases.
Although none of the genes in the modules for which we discovered rs1131017 contain RPS26
or its relevant pairs, the replication of the SNP is an important indication of the reproducibility of
co-eQTL studies. Additionally, since our co-eQTL discoveries do not contain any genes on the
same chromosome as rs1131017, our results could capture mediation of RPS26 on the relevant
gene modules.
4.2 Methodology
Here we propose a novel analysis for identifying coVarQTLs, or genomic loci that explain
a portion of variance in gene covariance profiles. This method is the first of its kind in that it
identifies relationships between gene module covariance and SNPs without requiring a prior eQTL
analysis. We also present results using existing co-eQTL methodology for comparison.
We choose to identify modules of correlated genes via Partition, however, there are many other
options for module identification. One such option is to reduce the transcriptomic search space
using a priori information. One could first identify TFs, proteins which control the transcription
rate of a gene, known to be associated with the disease of interest and use databases such as
TRANSFAC to identify downstream genes [109]. Then, each gene set would consist of a TF and a
single downstream gene. In a similar fashion, genes of interest may be identified and then any gene
or TF known to be correlated would be included in the gene set. Finally, known gene networks
could be used to identify gene sets.
By first identifying known genes or TFs of interest, the search spaced would be reduced and so
would the computational overhead. This would also provide a validity check for the methodology
since expected results would be known. However, this use of existing information eliminates
75



potential for novel discoveries outside of the known space. Using Partition to identify modules
is uniquely helpful in that it not only identifies modules of correlated genes for further analysis
but also by only using the features condensed into modules, feature selection and data reduction
are also being performed simultaneously. However, any other clustering or dimension reduction
algorithm that identifies groups of correlated genes could be used instead. Future directions include
comparison of discoveries found using gene modules identified by Partition to discoveries found
using gene modules identified by other similar algorithms and gene modules identified using a
priori information.
Computing the overall co-expression matrix YC as the first PC of each module-specific coexpression matrix CMd
is computationally efficient both in that PCs are quick to compute and that
in it reduces the transcriptomic search space. Additionally, the first PC captures the variance in
co-expression profiles rather than just the mean and compared to other summary measures, finds
the most discoveries. We note that the summary comparison was done in the dataset with outliers
included, however, we do not expect a significant change in results between methods after dealing
with departures from model assumptions.
We use the existing MatrixEQTL software to test for gene module covariance-genotype associations, which is computationally advantageous when compared to other software for similar sized
analyses. Additionally, MatrixEQTL has the added benefits of including adjustment covariates and
flexibility in modelling genotype as additive or categorical.
Previous results of GCTA GREML analysis shows that genetic variation explains a percentage
gene module covariance variation (Figure 2.8), providing evidence for biological signal of genetic
co-regulation of genes and the existence of coVarQTLs as mechanisms of such co-regulation. Our
results suggest that we are under powered in our analysis, both in ability to detect genetic regulation
of module expression through the covariance and through the mean. The number of tests run for
each parameter set incur a large multiple testing burden, and thus, a correspondingly large sample
size would be required to adequately power the analysis. Future directions include expansion to
larger, publicly available data sets.
76



In summary, coVarQTLs provides a new way to identify relationships between SNPs and gene
co-expression profiles. The methodology provides flexibility by allowing users to pre-define gene
sets based on a priori information about disease or biological processes or to use a data-driven
approach to determine the gene sets.
77



Chapter 5
Conclusion
In conclusion, throughout this dissertation I:
1. Developed and implemented Super Partition, a scalable approximation of the original Partition algorithm. In a dataset with 20,000 features, Super Partition produces very similar
reduced datasets to and is 100 times faster for mid-range reduction than Partition. Additionally, I applied Super Partition to datasets with numbers of features up to 450,000 without
issue. An application note describing this method was is in the process of being published,
and the method is publicly available in the ’partition’ R package from the CRAN repository
[47].
2. Created a novel differential gene module co-expression hypothesis test, ACDC, which aims
to provide information about how changing gene-gene relationships are related to exposures
and outcomes of interest. Published in Frontiers in Medicine in May 2023, this method is
the first of its kind to allow for continuous and multi-type outcomes and exposures, and in an
application to ABRIDGE and CAMP gene expression data, I found evidence that differential expression and co-expression of genes ADORA3, ALOX15, and IDO1 is associated with
asthma symptomology [62]. Given the flexibility of CCA, both in directionality of relationships and in types of data allowable, the potential applications for this framework expand far
beyond the one provided.
78



3. Released the ’modACDC’ R package which includes a range of functions for users to implement the aforementioned ACDC methodology [53]. Also included are wrapper functions for
the OSCA OREML and GCTA GREML methods from the Yang lab as well as methods for
using estimates provided by OREML and GREML to inform selection of dimension reduction levels for various -omics datasets. As of May 2024, the package has more than 65,000
downloads from the CRAN repository.
4. Developed a novel covariance quantitative trait loci analysis to detect genetic co-regulation
of gene expression regimes. In an application to real data, I discovered significant coVarQTLs that warrant further study to understand their potential mechanistic role in asthma
symptomology. I also compared coVarQTL methodology to existing co-eQTL methodology
and provided one of the first examples of co-eQTL discovery in microarray data.
Taken together, these methods form a toolkit for studying regulatory mechanisms of disease,
from genotypes to gene expression profiles to phenotypes. Perhaps the most important feature is
the flexibility of the methodological frameworks created, which allow for adaptations to fit a wide
range of applications to inform our knowledge of disease.
1 Future directions
Latent variables are increasingly being used as adjustment covariates for eQTL analyses with
the goal of correcting for unobserved confounders to increase statistical power. One such variable,
probabilistic estimation of expression residuals (PEER) factors, claims to estimate ”hidden determinants of gene expression” [88] and increase the rate of discovery, and it is now common practice
to include some number of PEER factors as adjustment covariates in a standard eQTL analysis.
In Chapter 4, we include five PEER factors as adjustment covariates in the standard eQTL analysis as inclusion led to an increase in the number of discoveries. However, the addition of PEER
factors into coVarQTL and co-eQTL models resulted in minimal discoveries compared to models
79



without. We therefore hypothesize that PEER factors capture the same signals as gene module expression and covariance and therefore should not be included in co-eQTL and coVarQTL models.
Future work to understand the relationship between PEER factors and gene module information
could provide an alternate way to summarize this information and lead to more evidence of genetic
co-regulation of gene networks.
Single cell RNA sequencing (scRNA-seq) data, which profiles the gene expression in each
individual cell in a sample, has risen in popularity over the past decade. Compared to more traditional bulk RNA-seq or microarray data, scRNA-seq is more granular, allowing for the study of
phenotypic heterogeneity. It is well-documented that phenotypic heterogeneity, meaning the existence of varying phenotypes in a population of genetically identical cells, exists [110], and that this
property is an important aspect of disease symptomology [111]. For example, understanding how
an inflammatory event such as chronic wheezing leads to heterogeneity in various types of airway
cells may elucidate mechanisms driving asthma attacks. Given this, analyses involving scRNA-seq
data have also risen in popularity, necessitating adaption of existing methodologies and creation of
new methodologies for handling the substructure of this data.
Previously, I identified two cell sub-types in an analysis of scRNA-seq of sputum macrophages
using principal component, clustering, and differential expression analyses. Expression of multiple
marker genes [112] aligned with k-means clustering in the PC1 and PC2 plane cleanly identified
interstitial and alveolar macrophage subgroups in a 2:1 ratio. We note that analyses did not identify
cell clusters based on person. Alveolar macrophages live in the alveoli and airway, and are the
first line of defense against the outside environment such as dust or pathogens [113]. In contrast,
interstitial macrophages are found in the interstitium. Their role in immunity is less clear, although
it is thought that they play both a role as a gatekeeper of the airway [112] and perform regulatory
functions [114, 115]. During lung homeostasis, alveolar macrophages are much more abundant
that interstitial macrophages, indicating that our samples were taken from patients undergoing
inflammatory events in the airway and analysis of these samples could provide insight into the
mechanisms of these events.
80



Given the differences in functionality and gene expression between cell types and sub-types,
application of differential co-expression methods to these various cell sub-types could provide
further information into regulatory mechanisms of various diseases, and new methodology for differential co-expression of independent cell types from scRNA-seq data is just beginning to emerge
[116–119]. To appropriately extend either ACDC or co-eQTLs and coVarQTL methodology for
scRNA-seq data sets, additional modelling terms to account for the hierarchical structure would
need to be added.
81



References
1. Bloom, J. S., Khan, Z., Kruglyak, L., Singh, M. & Caudy, A. A. Measuring differential gene
expression by short read sequencing: quantitative comparison to 2-channel gene expression
microarrays. BMC Genomics 10, 221. doi:10.1186/1471-2164-10-221 (2009).
2. Schwartz, D. GENETIC STUDIES ON MUTANT ENZYMES IN MAIZE. III. CONTROL
OF GENE ACTION IN THE SYNTHESIS OF pH 7.5 ESTERASE. Genetics 47, 1609–
1615 (1962).
3. Soneson, C. & Delorenzi, M. A comparison of methods for differential expression analysis
of RNA-seq data. BMC Bioinformatics 14, 91. doi:10.1186/1471-2105-14-91 (2013).
4. Bhuva, D. D., Cursons, J., Smyth, G. K. & Davis, M. J. Differential co-expression-based
detection of conditional relationships in transcriptional data: comparative analysis and application to breast cancer. Genome Biology 20, 236. doi:10.1186/s13059-019-1851-8
(2019).
5. Mathur, R., Rotroff, D., Ma, J., Shojaie, A. & Motsinger-Reif, A. Gene set analysis methods:
a systematic comparison. BioData Mining 11, 8. doi:10 . 1186 / s13040 - 018 - 0166 - 8
(2018).
6. Montenegro, J. D. Plant Bioinformatics, Methods and Protocols. Methods in Molecular Biology 2443, 387–404. doi:10.1007/978-1-0716-2067-0\_19 (2022).
7. Langfelder, P. & Horvath, S. WGCNA: an R package for weighted correlation network
analysis. BMC Bioinformatics 9, 559. doi:10.1186/1471-2105-9-559 (2008).
8. Ozaki, K., Ohnishi, Y., Iida, A., Sekine, A., Yamada, R., Tsunoda, T., et al. Functional SNPs
in the lymphotoxin-gene that are associated with susceptibility to myocardial infarction.
Nature Genetics 32, 650–654. doi:10.1038/ng1047 (2002).
9. Sollis, E., Mosaku, A., Abid, A., Buniello, A., Cerezo, M., Gil, L., et al. The NHGRIEBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Research 51,
D977–D985. doi:10.1093/nar/gkac1010 (2022).
82



10. Uffelmann, E., Huang, Q. Q., Munung, N. S., de Vries, J., Okada, Y., Martin, A. R., et al.
Genome-wide association studies. Nature Reviews Methods Primers 1, 59. doi:10.1038/
s43586-021-00056-9 (2021).
11. Lonsdale, J., Thomas, J., Salvatore, M., Phillips, R., Lo, E., Shad, S., et al. The GenotypeTissue Expression (GTEx) project. Nature Genetics 45, 580–585. doi:10.1038/ng.2653
(2013).
12. Consortium, T. G., Aguet, F., Anand, S., Ardlie, K. G., Gabriel, S., Getz, G. A., et al. The
GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369,
1318–1330. doi:10.1126/science.aaz1776 (2020).
13. Goring, H. H. H., Curran, J. E., Johnson, M. P., Dyer, T. D., Charlesworth, J., Cole, S. A., ¨
et al. Discovery of expression QTLs using large-scale transcriptional profiling in human
lymphocytes. Nature Genetics 39, 1208–1216. doi:10.1038/ng2119 (2007).
14. Wray, G. A. The evolutionary significance of cis-regulatory mutations. Nature Reviews Genetics 8, 206–216. doi:10.1038/nrg2063 (2007).
15. Arvanitis, M., Tayeb, K., Strober, B. J. & Battle, A. Redefining tissue specificity of genetic
regulation of gene expression in the presence of allelic heterogeneity. The American Journal
of Human Genetics 109, 223–239. doi:10.1016/j.ajhg.2022.01.002 (2022).
16. Witten, D. M., Tibshirani, R. & Hastie, T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10,
515–534. doi:10.1093/biostatistics/kxp008 (2009).
17. Naylor, M. G., Lin, X., Weiss, S. T., Raby, B. A. & Lange, C. Using Canonical Correlation
Analysis to Discover Genetic Regulatory Variants. PLoS ONE 5, e10395. doi:10.1371/
journal.pone.0010395 (2010).
18. Lin, D., Zhang, J., Li, J., Calhoun, V. D., Deng, H.-W. & Wang, Y.-P. Group sparse canonical
correlation analysis for genomic data integration. BMC Bioinformatics 14, 245. doi:10 .
1186/1471-2105-14-245 (2013).
19. Zhang, L. & Kim, S. Learning Gene Networks under SNP Perturbations Using eQTL Datasets.
PLoS Computational Biology 10, e1003420. doi:10 . 1371 / journal . pcbi . 1003420
(2014).
20. Yuan, H., Li, Z., Tang, N. L. & Deng, M. A network based covariance test for detecting
multivariate eQTL in saccharomyces cerevisiae. BMC Systems Biology 10, S8. doi:10 .
1186/s12918-015-0245-0 (2016).
83



21. Millstein, J., Zhang, B., Zhu, J. & Schadt, E. E. Disentangling molecular relationships with
a causal inference test. BMC Genetics 10, 23. doi:10.1186/1471-2156-10-23 (2009).
22. Taurisano, P., Pergola, G., Monda, A., Antonucci, L. A., Carlo, P. D., Piarulli, F., et al. The
interaction between cannabis use and a CB1-related polygenic co-expression index modulates dorsolateral prefrontal activity during working memory processing. Brain Imaging
and Behavior 15, 288–299 (2021).
23. Villa-Vialaneix, N., Liaubet, L., Laurent, T., Cherel, P., Gamot, A. & SanCristobal, M.
The Structure of a Gene Co-Expression Network Reveals Biological Functions Underlying eQTLs. PLoS ONE 8, e60045 (2013).
24. Weighill, D., Guebila, M. B., Glass, K., Quackenbush, J. & Platig, J. Predicting genotypespecific gene regulatory networks. Genome Research 32, 524–533. doi:10 . 1101 / gr .
275107.120 (2022).
25. Kolberg, L., Kerimov, N., Peterson, H. & Alasoo, K. Co-expression analysis reveals interpretable gene modules controlled by trans-acting genetic variants. eLife 9, e58705. doi:10.
7554/elife.58705 (2020).
26. Li, S., Schmid, K. T., Vries, D. H. d., Korshevniuk, M., Losert, C., Oelen, R., et al. Identification of genetic variants that impact gene co-expression relationships using large-scale
single-cell data. Genome Biology 24, 80. doi:10.1186/s13059-023-02897-x (2023).
27. Oelen, R., Vries, D. H. d., Brugge, H., Gordon, M. G., Vochteloo, M., consortium single-cell
eQTLGen, s.-c. e., et al. Single-cell RNA-sequencing of peripheral blood mononuclear cells
reveals widespread, context-specific gene expression regulation upon pathogenic exposure.
Nature Communications 13, 3267 (2022).
28. Pergola, G., Carlo, P. D., D’Ambrosio, E., Gelao, B., Fazio, L., Papalino, M., et al. DRD2
co-expression network and a related polygenic index predict imaging, behavioral and clinical phenotypes linked to schizophrenia. Translational Psychiatry 7, e1006–e1006 (2017).
29. Pergola, G., Carlo, P. D., Jaffe, A. E., Papalino, M., Chen, Q., Hyde, T. M., et al. Prefrontal
Coexpression of Schizophrenia Risk Genes Is Associated With Treatment Response in Patients. Biological Psychiatry 86, 45–55 (2019).
30. NHIS Adult Summary Health Statistics 2023.
31. NHIS Child Summary Health Statistics 2023.
84



32. Nathan, R. A., Sorkness, C. A., Kosinski, M., Schatz, M., Li, J. T., Marcus, P., et al. Development of the asthma control test A survey for assessing asthma control. Journal of Allergy
and Clinical Immunology 113, 59–65. doi:10.1016/j.jaci.2003.09.008 (2004).
33. Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., et al. Gene
Ontology: tool for the unification of biology. Nature Genetics 25, 25–29 (2000).
34. Melen, E. & Pershagen, G. Pathophysiology of asthma: lessons from genetic research with ´
particular focus on severe asthma. Journal of Internal Medicine 272, 108–120 (2012).
35. Zhang, Y., Moffatt, M. F. & Cookson, W. O. Genetic and genomic approaches to asthma.
Current Opinion in Pulmonary Medicine 18, 6–13 (2012).
36. He, L.-L., Xu, F., Zhan, X.-Q., Chen, Z.-H. & Shen, H.-H. Identification of critical genes associated with the development of asthma by co-expression modules construction. Molecular
Immunology 123, 18–25 (2020).
37. Chen, G., Chen, D., Feng, Y., Wu, W., Gao, J., Chang, C., et al. Identification of Key Signaling Pathways and Genes in Eosinophilic Asthma and Neutrophilic Asthma by Weighted
Gene Co-Expression Network Analysis. Frontiers in Molecular Biosciences 9, 805570 (2022).
38. Cao, Y., Wu, Y., Lin, L., Yang, L., Peng, X. & Chen, L. Identifying key genes and functionally enriched pathways in Th2-high asthma by weighted gene co-expression network
analysis. BMC Medical Genomics 15, 110 (2022).
39. Zhang, Z., Wang, J. & Chen, O. Identification of biomarkers and pathogenesis in severe
asthma by coexpression network analysis. BMC Medical Genomics 14, 51 (2021).
40. Kozlik-Siwiec, P., Buregwa-Czuma, S., Zawlik, I., Dziedzina, S., Myszka, A., Zuk-Kuwik,
J., et al. Co-Expression Analysis of Airway Epithelial Transcriptome in Asthma Patients
with Eosinophilic vs. Non-Eosinophilic Airway Infiltration. International Journal of Molecular Sciences 24, 3789 (2023).
41. Kim, D. J., Lim, J. E., Jung, H.-U., Chung, J. Y., Baek, E. J., Jung, H., et al. Identification
of asthma-related genes using asthmatic blood eQTLs of Korean patients. BMC Medical
Genomics 16, 259 (2023).
42. Hao, K., Bosse, Y., Nickle, D. C., Par ´ e, P. D., Postma, D. S., Laviolette, M., ´ et al. Lung
eQTLs to Help Reveal the Molecular Underpinnings of Asthma. PLoS Genetics 8, e1003029
(2012).
85



43. Croteau-Chonka, D. C., Qiu, W., Martinez, F. D., Strunk, R. C., Lemanske, R. F., Liu, A. H.,
et al. Gene Expression Profiling in Blood Provides Reproducible Molecular Insights into
Asthma Control. American Journal of Respiratory and Critical Care Medicine 195, 179–
188. doi:10.1164/rccm.201601-0107oc (2016).
44. Torgerson, D. G., Ampleford, E. J., Chiu, G. Y., Gauderman, W. J., Gignoux, C. R., Graves,
P. E., et al. Meta-analysis of genome-wide association studies of asthma in ethnically diverse
North American populations. Nature Genetics 43, 887–892. doi:10.1038/ng.888 (2011).
45. Covar, R. A., Fuhlbrigge, A. L., Williams, P., Kelly, H. W. & Group, t. C. A. M. P. R. The
Childhood Asthma Management Program (CAMP): contributions to the understanding of
therapy and the natural history of childhood asthma. Current Respiratory Care Reports 1,
243–250. doi:10.1007/s13665-012-0026-9 (2012).
46. Das, S., Forer, L., Schonherr, S., Sidore, C., Locke, A. E., Kwong, A., ¨ et al. Next-generation
genotype imputation service and methods. Nature Genetics 48, 1284–1287. doi:10.1038/
ng.3656 (2016).
47. Malcom Barrett Katelyn Queen, J. M. partition: Agglomerative Partitioning Framework for
Dimension Reduction R package version 0.2.0 (2024).
48. Millstein, J., Battaglin, F., Barrett, M., Cao, S., Zhang, W., Stintzing, S., et al. Partition:
a surjective mapping approach for dimensionality reduction. Bioinformatics 36, 676–681.
doi:10.1093/bioinformatics/btz661 (2019).
49. Gagolewski, M., Bartoszuk, M. & Cena, A. Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm. Information Sciences 363, 8–23. doi:10.1016/j.ins.2016.
05.003 (2016).
50. Gagolewski, M. genieclust: Fast and robust hierarchical clustering. SoftwareX 15, 100722
(2021).
51. MacQueen, J. et al. Some methods for classification and analysis of multivariate observations in Proceedings of the fifth Berkeley symposium on mathematical statistics and probability 1 (1967), 281–297.
52. Muzny, D. M., Bainbridge, M. N., Chang, K., Dinh, H. H., Drummond, J. A., Fowler, G.,
et al. Comprehensive molecular characterization of human colon and rectal cancer. Nature
487, 330–337. doi:10.1038/nature11252 (2012).
53. Katelyn Queen, J. M. modACDC: Association of Covariance for Detecting Differential CoExpression R package version 2.0.1 (2024).
86



54. Visscher, P. M., Hill, W. G. & Wray, N. R. Heritability in the genomics era — concepts and
misconceptions. Nature Reviews Genetics 9, 255–266. doi:10.1038/nrg2322 (2008).
55. Balding, D. J., Bishop, M. & Cannings, C. Handbook of statistical genetics (John Wiley &
Sons, 2008).
56. Boyle, E. A., Li, Y. I. & Pritchard, J. K. An Expanded View of Complex Traits: From
Polygenic to Omnigenic. Cell 169, 1177–1186. doi:10 . 1016 / j . cell . 2017 . 05 . 038
(2017).
57. Zhang, F., Chen, W., Zhu, Z., Zhang, Q., Nabais, M. F., Qi, T., et al. OSCA: a tool for omicdata-based complex trait analysis. Genome Biology 20, 107. doi:10.1186/s13059-019-
1718-z (2019).
58. PATTERSON, H. D. & THOMPSON, R. Recovery of inter-block information when block
sizes are unequal. Biometrika 58, 545–554. doi:10.1093/biomet/58.3.545 (1971).
59. Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: A Tool for Genome-wide
Complex Trait Analysis. The American Journal of Human Genetics 88, 76–82. doi:10 .
1016/j.ajhg.2010.11.011 (2011).
60. Kumar, S. K., Feldman, M. W., Rehkopf, D. H. & Tuljapurkar, S. Limitations of GCTA
as a solution to the missing heritability problem. Proceedings of the National Academy of
Sciences 113, E61–E70 (2016).
61. Yang, J., Lee, S. H., Wray, N. R., Goddard, M. E. & Visscher, P. M. GCTA-GREML
accounts for linkage disequilibrium when estimating genetic variance from genome-wide
SNPs. Proceedings of the National Academy of Sciences 113, E4579–E4580 (2016).
62. Queen, K., Nguyen, M.-N., Gilliland, F. D., Chun, S., Raby, B. A. & Millstein, J. ACDC:
a general approach for detecting phenotype or exposure associated co-expression. Frontiers
in Medicine 10, 1118824. doi:10.3389/fmed.2023.1118824 (2023).
63. Chowdhury, H. A., Bhattacharyya, D. K. & Kalita, J. K. (Differential) Co-Expression Analysis of Gene Expression: A Survey of Best Practices. IEEE/ACM Transactions on Computational Biology and Bioinformatics 17, 1154–1173. doi:10.1109/TCBB.2019.2893170
(2020).
64. Watson, M. CoXpress: differential co-expression in gene expression data. BMC Bioinformatics 7, 509. doi:10.1186/1471-2105-7-509 (2006).
87



65. Widmann, M. One-Dimensional CCA and SVD, and Their Relationship to Regression
Maps. Journal of Climate 18, 2785–2792. doi:10.1175/jcli3424.1 (2005).
66. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful
approach to multiple testing. Journal of the Royal statistical society: series B (Methodological) 57, 289–300 (1995).
67. Millstein, J. & Volfson, D. Computationally efficient permutation-based confidence interval
estimation for tail-area FDR. Frontiers in Genetics 4, 179. doi:10 . 3389 / fgene . 2013 .
00179 (2013).
68. Barrett, M. & Millstein, J. partition: A fast and flexible framework for data reduction in R.
Journal of open source software 5 (2020).
69. Amar, D., Safer, H. & Shamir, R. Dissection of Regulatory Networks that Are Altered in
Disease via Differential Co-expression. PLoS Computational Biology 9, e1002955. doi:10.
1371/journal.pcbi.1002955 (2013).
70. Blumenthal, M., Miller, M., Reilly, C., Oetting, W., Brott, M., Willaert, R., et al. Analysis of ADORA3 as a possible candidate gene for asthma. Journal of Allergy and Clinical
Immunology 115, S219. doi:10.1016/j.jaci.2004.12.883 (2005).
71. Kim, S.-H., Kim, Y.-K., Park, H.-W., Kim, S.-H., Kim, S.-H., Ye, Y.-M., et al. Adenosine
deaminase and adenosine receptor polymorphisms in aspirin-intolerant asthma. Respiratory
Medicine 103, 356–363. doi:10.1016/j.rmed.2008.10.008 (2009).
72. Saferali, A., Yun, J. H., Lee, S., Chase, R. P., Bowler, R. P., Castaldi, P. J., et al. Transcriptomic Signature of Asthma–Chronic Obstructive Pulmonary Disease Overlap in Whole
Blood. American Journal of Respiratory Cell and Molecular Biology 64, 268–271. doi:10.
1165/rcmb.2020-0382le (2021).
73. Sanchez-Ovando, S., Simpson, J. L., Barker, D., Baines, K. J. & Wark, P. A. B. Transcrip- ´
tomics of biopsies identifies novel genes and pathways linked to neutrophilic inflammation
in severe asthma. Clinical & Experimental Allergy 51, 1279–1294. doi:10 . 1111 / cea .
13986 (2021).
74. Bradding, P., Redington, A. E., Djukanovic, R., Conrad, D. J. & Holgate, S. T. 15-Lipoxygenase
Immunoreactivity in Normal and in Asthmatic Airways. American Journal of Respiratory
and Critical Care Medicine 151, 1201–1204. doi:10.1164/ajrccm/151.4.1201 (1995).
88



75. Profita, M., Sala, A., Riccobono, L., Paterno, A., Mirabella, A., Bonanno, A., ` et al. 15-
Lipoxygenase expression and 15(S)-hydroxyeicoisatetraenoic acid release and reincorporation in induced sputum of asthmatic subjects. Journal of Allergy and Clinical Immunology
105, 711–716. doi:10.1067/mai.2000.105122 (2000).
76. Laprise, C., Sladek, R., Ponton, A., Bernier, M.-C., Hudson, T. J. & Laviolette, M. Functional classes of bronchial mucosa genes that are differentially expressed in asthma. BMC
Genomics 5, 21. doi:10.1186/1471-2164-5-21 (2004).
77. Permaul, P., Raby, B., Levy, B. & Israel, E. ALOX-15 Haplotypes and Association with
Asthma. B30. GENETICS OF AIRWAY DISEASES I, A2746. doi:10 . 1164 / ajrccm -
conference.2009.179.1\_meetingabstracts.a2746 (2009).
78. Xu, H., Oriss, T. B., Fei, M., Henry, A. C., Melgert, B. N., Chen, L., et al. Indoleamine
2,3-dioxygenase in lung dendritic cells promotes Th2 responses and allergic inflammation. Proceedings of the National Academy of Sciences, 6690–6695. doi:10.1073/pnas.
0708809105 (2008).
79. Sanchez-Ovando, S., Baines, K., Barker, D., Wark, P. & Simpson, J. L. Endobronchial ´
biopsy gene expression between different severe asthma inflammatory phenotypes. European Respiratory Journal 54. doi:10.1183/13993003.congress-2019.PA5207 (2019).
80. Nakajima, T., Matsumoto, K., Suto, H., Tanaka, K., Ebisawa, M., Tomita, H., et al. Gene expression screening of human mast cells and eosinophils using high-density oligonucleotide
probe arrays: abundant expression of major basic protein in mast cells. Blood 98, 1127–
1134. doi:10.1182/blood.v98.4.1127 (2001).
81. Chen, J. & Nodzak, C. eQTL Analysis, Methods and Protocols. Methods in Molecular Biology 2082, 87–104. doi:10.1007/978-1-0716-0026-9\_7 (2019).
82. Shabalin, A. A. Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics 28, 1353–1358. doi:10.1093/bioinformatics/bts163 (2012).
83. Qiu, W., Guo, F., Glass, K., Yuan, G. C., Quackenbush, J., Zhou, X., et al. Differential
connectivity of gene regulatory networks distinguishes corticosteroid response in asthma.
Journal of Allergy and Clinical Immunology 141, 1250–1258. doi:10 . 1016 / j . jaci .
2017.05.052 (2018).
84. Li, X., Hastie, A. T., Hawkins, G. A., Moore, W. C., Ampleford, E. J., Milosevic, J., et al.
eQTL of bronchial epithelial cells and bronchial alveolar lavage deciphers GWAS-identified
asthma genes. Allergy 70, 1309–1318. doi:10.1111/all.12683 (2015).
89



85. Queen, K., Nguyen, M.-N., Gilliland, F. D., Chun, S., Raby, B. A. & Millstein, J. ACDC:
a general approach for detecting phenotype or exposure associated co-expression. Frontiers
in Medicine 10. doi:10.3389/fmed.2023.1118824 (2023).
86. Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological) 57, 289–300. doi:10.1111/j.2517-6161.1995.tb02031.x (1995).
87. Bolstad, B. preprocessCore: A collection of pre-processing functions R package version
1.64.0 (2023).
88. Stegle, O., Parts, L., Piipari, M., Winn, J. & Durbin, R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression
analyses. Nature Protocols 7, 500–507. doi:10.1038/nprot.2011.457 (2012).
89. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M., Bender, D., et al. PLINK:
a tool set for whole-genome association and population-based linkage analyses. American
Journal of Human Genetics 81, 559–575. doi:10.1086/519795 (2007).
90. Willer, C. J., Li, Y. & Abecasis, G. R. METAL: fast and efficient meta-analysis of genomewide
association scans. Bioinformatics 26, 2190–2191. doi:10.1093/bioinformatics/btq340
(2010).
91. Mi, H., Muruganujan, A. & Thomas, P. D. PANTHER in 2013: modeling the evolution of
gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids
Research 41, D377–D386 (2013).
92. Islam, S. u. Infectious Diseases, 1–24 (2023).
93. Smole, U., Kratzer, B. & Pickl, W. F. Soluble pattern recognition molecules: Guardians and
regulators of homeostasis at airway mucosal surfaces. European Journal of Immunology 50,
624–642 (2020).
94. Khan, M. A., Assiri, A. M. & Broering, D. C. Complement mediators: key regulators of
airway tissue remodeling in asthma. Journal of Translational Medicine 13, 272 (2015).
95. Zhu, Z., Lee, P. H., Chaffin, M. D., Chung, W., Loh, P.-R., Lu, Q., et al. A genome-wide
cross-trait analysis from UK Biobank highlights the shared genetic architecture of asthma
and allergic diseases. Nature Genetics 50, 857–864 (2018).
96. Gao, H.-Y., Wang, W., Luo, X.-G., Jiang, Y.-F., He, X., Xu, P., et al. Screening of prognostic
risk microRNAs for acute myeloid leukemia. Hematology 23, 747–755 (2018).
90



97. Elmusrati, A. A., Pilborough, A. E., Khurram, S. A. & Lambert, D. W. Cancer-associated fibroblasts promote bone invasion in oral squamous cell carcinoma. British Journal of Cancer
117, 867–875 (2017).
98. Lualdi, M., Ronci, M., Zilocchi, M., Corno, F., Turilli, E. S., Sponchiado, M., et al. Exploring the Mitochondrial Degradome by the TAILS Proteomics Approach in a Cellular Model
of Parkinson’s Disease. Frontiers in Aging Neuroscience 11, 195 (2019).
99. Emadali, A., Rousseaux, S., Bruder-Costa, J., Rome, C., Duley, S., Hamaidia, S., et al.
Identification of a novel BET bromodomain inhibitor-sensitive, gene regulatory circuit that
controls Rituximab response and tumour growth in aggressive lymphoid cancers. EMBO
Molecular Medicine 5, 1180–1195 (2013).
100. Wang, Z., Zhou, T., Chen, X., Zhu, X., Liao, B., Liu, J., et al. CCDC86 promotes the aggressive behavior of nasopharyngeal carcinoma by positively regulating EGFR and activating
the PI3K/Akt signaling. Neoplasma 70, 761–776 (2024).
101. Chervenkov, T., Shishkov, R. & Tonchev, A. B. Expression and differential response to
haloperidol treatment of Cyclon/CCDC86 mRNA in schizophrenia patients. Neurochemistry International 62, 870–872 (2013).
102. Mee, L., Honkala, H., Kopra, O., Vesa, J., Finnila, S., Visap ¨ a¨a, I., ¨ et al. Hydrolethalus syndrome is caused by a missense mutation in a novel gene HYLS1. Human Molecular Genetics 14, 1475–1488 (2005).
103. Cho, H.-J., Chung, Y. W., Moon, S., Seo, J. H., Kang, M., Nam, J. S., et al. IL-4 drastically
decreases deuterosomal and multiciliated cells via alteration in progenitor cell differentiation. Allergy 78, 1866–1877 (2023).
104. Yick, C. Y., Zwinderman, A. H., Kunst, P. W., Grunberg, K., Mauad, T., Chowdhury, S., ¨ et
al. Gene expression profiling of laser microdissected airway smooth muscle tissue in asthma
and atopy. Allergy 69, 1233–1240 (2014).
105. Bryois, J., Buil, A., Evans, D. M., Kemp, J. P., Montgomery, S. B., Conrad, D. F., et al.
Cis and Trans Effects of Human Genomic Variants on Gene Expression. PLoS Genetics 10,
e1004461 (2014).
106. Pierce, B. L., Tong, L., Chen, L. S., Rahaman, R., Argos, M., Jasmine, F., et al. Mediation Analysis Demonstrates That Trans-eQTLs Are Often Explained by Cis-Mediation: A
Genome-Wide Analysis among 1,800 South Asians. PLoS Genetics 10, e1004818 (2014).
91



107. Statello, L., Guo, C.-J., Chen, L.-L. & Huarte, M. Gene regulation by long non-coding
RNAs and its biological functions. Nature Reviews Molecular Cell Biology 22, 96–118
(2021).
108. Ransohoff, J. D., Wei, Y. & Khavari, P. A. The functions and unique features of long intergenic non-coding RNA. Nature Reviews Molecular Cell Biology 19, 143–157 (2018).
109. Wingender, E. The TRANSFAC project as an example of framework technology that supports the analysis of genomic regulation. Briefings in Bioinformatics 9, 326–332. doi:10.
1093/bib/bbn016 (2008).
110. Altschuler, S. J. & Wu, L. F. Cellular Heterogeneity: Do Differences Make a Difference?
Cell 141, 559–563. doi:10.1016/j.cell.2010.04.033 (2010).
111. Bianco, P. & Robey, P. G. Handbook of Stem Cells. Part Four: Mesoderm, 415–424. doi:10.
1016/b978-012436643-5/50129-2 (2004).
112. Aegerter, H., Lambrecht, B. N. & Jakubzick, C. V. Biology of lung macrophages in health
and disease. Immunity 55, 1564–1580 (2022).
113. Shi, T., Denney, L., An, H., Ho, L.-P. & Zheng, Y. Alveolar and lung interstitial macrophages:
Definitions, functions, and roles in lung fibrosis. Journal of Leukocyte Biology 110, 107–
114 (2021).
114. Schyns, J., Bureau, F. & Marichal, T. Lung Interstitial Macrophages: Past, Present, and
Future. Journal of Immunology Research 2018, 5160794 (2018).
115. Liegeois, M., Legrand, C., Desmet, C. J., Marichal, T. & Bureau, F. The interstitial macrophage:
A long-neglected piece in the puzzle of lung immunity. Cellular Immunology 330, 91–96
(2018).
116. Su, C., Xu, Z., Shan, X., Cai, B., Zhao, H. & Zhang, J. Cell-type-specific co-expression
inference from single cell RNA-sequencing data. Nature Communications 14, 4846 (2023).
117. Harris, B. D., Crow, M., Fischer, S. & Gillis, J. Single-cell co-expression analysis reveals
that transcriptional modules are shared across cell types in the brain. Cell Systems 12, 748–
756.e3 (2021).
118. Wang, X., Choi, D. & Roeder, K. Constructing local cell-specific networks from single-cell
data. Proceedings of the National Academy of Sciences 118, e2113178118 (2021).
92



119. Algabri, Y. A., Li, L. & Liu, Z.-P. scGENA: A Single-Cell Gene Coexpression Network
Analysis Framework for Clustering Cell Types and Revealing Biological Mechanisms. Bioengineering 9, 353 (2022).
93



Appendices
A Supplemental Figures and Tables
Figure A.1: Percent Variance Explained in 6-month ACT Score by ABRIDGE Whole Blood
Gene Expression
Figure A.1: Plot showing percent variance explained in 6-month ACT score by ABRIDGE whole blood
gene expression data at varying level of dataset reduction, calculated for observed phenotypes in blue and
permuted phenotypes in red. An information loss value of 0 represents the unreduced dataset, and an information loss level of 100 represents the data being reduced to the average expression of all genes
94



Figure A.2: MV FDR Discovery Plot for ABRIDGE Gene co-expression CCA
Figure A.2 MV FDR discovery plot for CCA between ABRIDGE gene-gene co-expression measures and
ACT score components. After 250 permutations, likelihood ratio test p-values were computed for both the
permuted and non-permuted data across a series of possible detection threshold. The numbers above the
FDR estimate line denote the number of discoveries at each threshold, and the green shaded area is the 95%
CI for the FDR estimate.
B Supplemental Information
B.1 Chapter 3
OmicS-data-based Complex trait Analysis (OSCA) is a suite of C++ functions that provides
an estimate of the percent of variance in an external phenotype that can be explained by an omics
profile, akin to heritability estimates in GWAS. Here, we use OSCA’s Omics Restricted Maximum
Likelihood (OREML) method to determine the percent variance explained in 6-month ACT scores
by increasingly reduced gene expression data. These results can be seen in Figure A.1 and Table
4.
95



For the ACDC analysis, we aimed to choose a dataset that balances within and between-module
variance and reduces noise and thus, chose an ILC of 0.35, corresponding to 53% reduction in
features. This dataset explains 12.2% of the variance in 6-month ACT score by between-module
relationships. Therefore, the 10% percent variance explained that seems to be loss by reducing the
data 53% is contained within the modules created at this level of reduction. Additionally, we see
that the 72% reduced dataset explains roughly the same variance in ACT score, however by the
nature of Partition, these modules are larger and thus, noisier.
96



Table A.1: Gene module information for QTL analyses
Table A.1 Genes in referenced modules for full transcriptome partitioned at an ILC of 0.40. Gene symbols
are in HUGO format.
97 
Abstract (if available)
Abstract In systems biology, a key aim is to identify regulatory mechanisms in normal biological pro- cesses, disease, and disease progression. As a result, the number of statistical methods focused on gene networks, which form the basis of regulatory mechanisms, have greatly increased in the past decade. However, existing methods are limited both in their genomic and transcriptomic search spaces as well as what type of outcomes can be studied. Datasets used in these analyses are often incredibly high dimensional, which leads to computational resource overload and breakdown of ex- isting methodology not built for this type of data. Additionally, traditional statistical models are not designed to include multi-type phenotypes or exposures, or correlated genes, and addition of these adds a large magnitude of complexity to the modelling strategy. Because of these methodological limitations, questions still remain about the feasibility of large-scale network studies. Across the three parts of my dissertation, I (1) fortify existing software for dimension reduction in high dimen- sional data, (2) identify and quantify transcriptional covariance in gene sets that cause variance in a phenotype or external variable of interest, and (3) test for associations between those covariance features of gene sets and genetic variation across the full transcriptome and millions of locations in the genome. Projects (2) and (3) involve novel methodology that is widely applicable across both disease and data type, providing flexible extensions of existing methods.
This methodology and the coinciding applications to cohorts of asthmatic patients provide evidence for a new type of biological relationship that has long been posited but not proven and opens the door for new treatment designs. Of even greater significance is the ability to apply these novel methods to a wide range of diseases and biological questions, from normal biological functions such as blood sugar regulation to severe disease states such as cancer metastasis. 
Linked assets
University of Southern California Dissertations and Theses
doctype icon
University of Southern California Dissertations and Theses 
Action button
Conceptually similar
Best practice development for RNA-Seq analysis of complex disorders, with applications in schizophrenia
PDF
Best practice development for RNA-Seq analysis of complex disorders, with applications in schizophrenia 
Scalable latent factor models for inferring genetic regulatory networks
PDF
Scalable latent factor models for inferring genetic regulatory networks 
twas_sim, a Python-based tool for simulation and power analysis of transcriptome-wide association analysis
PDF
twas_sim, a Python-based tool for simulation and power analysis of transcriptome-wide association analysis 
Detecting joint interactions between sets of variables in the context of studies with a dichotomous phenotype, with applications to asthma susceptibility involving epigenetics and epistasis
PDF
Detecting joint interactions between sets of variables in the context of studies with a dichotomous phenotype, with applications to asthma susceptibility involving epigenetics and epistasis 
Identification of differentially connected gene expression subnetworks in asthma symptom
PDF
Identification of differentially connected gene expression subnetworks in asthma symptom 
Genome-wide characterization of the regulatory relationships of cell type-specific enhancer-gene links
PDF
Genome-wide characterization of the regulatory relationships of cell type-specific enhancer-gene links 
Integrative analysis of multi-view data with applications in epidemiology
PDF
Integrative analysis of multi-view data with applications in epidemiology 
Gene expression and angiogenesis pathway across DNA methylation subtypes in colon adenocarcinoma
PDF
Gene expression and angiogenesis pathway across DNA methylation subtypes in colon adenocarcinoma 
The influence of DNA repair genes and prenatal tobacco exposure on childhood acute lymphoblastic leukemia risk: a gene-environment interaction study
PDF
The influence of DNA repair genes and prenatal tobacco exposure on childhood acute lymphoblastic leukemia risk: a gene-environment interaction study 
Statistical analysis of high-throughput genomic data
PDF
Statistical analysis of high-throughput genomic data 
Identification and characterization of cancer-associated enhancers
PDF
Identification and characterization of cancer-associated enhancers 
Finding signals in Infinium DNA methylation data
PDF
Finding signals in Infinium DNA methylation data 
Computational analysis of genome architecture
PDF
Computational analysis of genome architecture 
Variants in MTNR1B and CDKAL1 contributes independent additive effects to GDM-related traits in Mexican Americans
PDF
Variants in MTNR1B and CDKAL1 contributes independent additive effects to GDM-related traits in Mexican Americans 
DNA methylation and gene expression profiles in Vidaza treated cultured cancer cells
PDF
DNA methylation and gene expression profiles in Vidaza treated cultured cancer cells 
Identifying prognostic gene mutations in colorectal cancer with random forest survival analysis
PDF
Identifying prognostic gene mutations in colorectal cancer with random forest survival analysis 
Using average pairwise distance in a correlation analysis
PDF
Using average pairwise distance in a correlation analysis 
DNA methylation of NOS genes and carotid intima-media thickness in children
PDF
DNA methylation of NOS genes and carotid intima-media thickness in children 
Decoding the neurological and genetic underpinnings of chronic pain
PDF
Decoding the neurological and genetic underpinnings of chronic pain 
An analysis of conservation of methylation
PDF
An analysis of conservation of methylation 
Action button
Asset Metadata
Creator Queen, Katelyn Jennae (author) 
Core Title Identifying and quantifying transcriptional module heterogeneity and genetic co-regulation, with applications in asthma 
School Keck School of Medicine 
Degree Doctor of Philosophy 
Degree Program Biostatistics 
Degree Conferral Date 2024-05 
Publication Date 05/17/2024 
Defense Date 04/26/2024 
Publisher Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital) 
Tag asthma,dimension reduction,DNA,gene expression,Genetics,OAI-PMH Harvest,QTLs 
Format theses (aat) 
Language English
Contributor Electronically uploaded by the author (provenance) 
Advisor Millstein, Joshua (committee chair), Mancuso, Nicholas (committee member), Selleck, Mark (committee member), Siegmund, Kimberly (committee member) 
Creator Email katelynqueen98@gmail.com,kjqueen@usc.edu 
Permanent Link (DOI) https://doi.org/10.25549/usctheses-oUC113939971 
Unique identifier UC113939971 
Identifier etd-QueenKatel-12934.pdf (filename) 
Legacy Identifier etd-QueenKatel-12934 
Document Type Dissertation 
Format theses (aat) 
Rights Queen, Katelyn Jennae 
Internet Media Type application/pdf 
Type texts
Source 20240517-usctheses-batch-1153 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection) 
Access Conditions The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law.  Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright.  It is the author, as rights holder, who must provide use permission if such use is covered by copyright. 
Repository Name University of Southern California Digital Library
Repository Location USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email cisadmin@lib.usc.edu
Tags
dimension reduction
DNA
gene expression
QTLs