SEQUENTIAL ANALYSIS OF LARGE SCALE DATA SCREENING PROBLEM by Tao Feng A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (BIOSTATISTICS) May 2015 Copyright 2015 Tao Feng Dedication I would like to thank my parents, my parents in law, my wife, and my adorable son for their continuous support. The road to the PhD degree is filled with all kinds of challenges. With their great help and encouragement, I can face and deal with all those challenges with confidence. Without their help, I am not sure if I can accomplish my PhD dream. I am so lucky that I have all of them to accompany me for my pursuing of my PhD. I sincerely express my gratitude for all of them. Thank you all. ii Acknowledgments absor I would like to thank my mentors Dr. Wendy Mack and Dr. Wenguang Sun for their excellent guidance and directions for my PhD work. They are very patient and super knowledgeable. Anytime when I have technical questions that seem to be very difficult to resolve, it is them who encourage me and help me figure out the solutions to those questions. Without their mentoring, I do not think that I can accomplish such great PhD work. I really appreciate all their help. In addition, I would like to thank my committee member Dr. Paul Marjoram, Dr. KimberlySiegmund, andDr. JoshuaMillsteinfortheirgreattechnicalsupport and suggestions for my PhD work. I am so lucky to have them as my committee member. Thank you all. Also, I would like to thank Dr. Roberta Brinton for giving me a chance to participate in her P3 projects. It was such a great hands-on experience working on her projects. I am really happy to be one of co-authors for her publications. Finally, many thanks go to Mary Trujillo, Sherri Fagan, George Martinez and all other administrators of our department. They are the best administrators and have given me very efficient help. iii Contents Dedication ii Acknowledgments iii List of Tables vi List of Figures vii Abstract viii 1 Introduction 1 2 Literature review 4 2.1 Current HTS methods . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.1.1 Statistical methods of primary screen . . . . . . . . . . . . . 5 2.1.2 Statistical methods of confirmatory screen . . . . . . . . . . 9 2.1.3 Major existing statistical issues . . . . . . . . . . . . . . . . 10 2.2 A review of multiple testing procedures . . . . . . . . . . . . . . . . 11 2.2.1 Family wise error rate (FWER) . . . . . . . . . . . . . . . . 11 2.2.2 False discovery rate (FDR) . . . . . . . . . . . . . . . . . . . 13 3 The general framework of a two stage design 17 4 Two point normal mixture model 19 4.1 Model set up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.2 Oracle procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.3 Data adaptive procedure . . . . . . . . . . . . . . . . . . . . . . . . 21 4.3.1 Prior distributions . . . . . . . . . . . . . . . . . . . . . . . 21 4.3.2 Some notations . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.3.3 Posterior distributions . . . . . . . . . . . . . . . . . . . . . 24 4.3.4 MCMC algorithm . . . . . . . . . . . . . . . . . . . . . . . . 25 4.3.5 Data adaptive procedure . . . . . . . . . . . . . . . . . . . . 26 4.4 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 iv 4.4.1 Numerical setting 1 . . . . . . . . . . . . . . . . . . . . . . . 28 4.4.2 Numerical setting 2 . . . . . . . . . . . . . . . . . . . . . . . 
31 4.4.3 Numerical setting 3 . . . . . . . . . . . . . . . . . . . . . . . 31 4.5 Application to real HTS data . . . . . . . . . . . . . . . . . . . . . 33 5 K point normal mixture model 35 5.1 Model set up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 5.2 Estimating the value k and associated parameters . . . . . . . . . . 36 5.2.1 Prior distributions . . . . . . . . . . . . . . . . . . . . . . . 37 5.2.2 Posterior distributions . . . . . . . . . . . . . . . . . . . . . 38 5.2.3 Reversible jumping mechanism . . . . . . . . . . . . . . . . 38 5.3 Oracle procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 5.4 Data adaptive procedure . . . . . . . . . . . . . . . . . . . . . . . . 43 5.5 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 6 Conclusions and future work 47 Reference List 51 v List of Tables 2.1 OUTCOMES OF MULTIPLE TESTING PROCEDURES . . . . . 11 4.1 BUDGET CONSTRAINT FOR THE SIMULATIONS OF THE TWO POINT NORMAL MIXTURE MODEL . . . . . . . . . . . . 28 4.2 STAGEPARAMETERSFORTHESIMULATIONSOFTHETWO POINT NORMAL MIXTURE MODEL . . . . . . . . . . . . . . . 29 5.1 BUDGET CONSTRAINT FOR THE SIMULATIONS OF THE K POINT NORMAL MIXTURE MODEL . . . . . . . . . . . . . . . 45 5.2 STAGE PARAMETERS FOR THE SIMULATIONS OF THE K POINT NORMAL MIXTURE MODEL . . . . . . . . . . . . . . . 45 5.3 PROPORTION PARAMETERS FOR THE SIMULATIONS FOR THE K POINT NORMAL MIXTURE MODEL . . . . . . . . . . . 45 vi List of Figures 1.1 FLOW CHART OF HTS PROCESS . . . . . . . . . . . . . . . . . 2 2.1 AN EXAMPLE OF 96 WELL PLATE LAYOUT . . . . . . . . . . 6 4.1 ORACLE PROCEDURE FOR TWO POINT MODEL . . . . . . . 22 4.2 DATA ADAPTIVE PROCEDURE FOR TWO POINT MODEL . . 27 4.3 HISTOGRAM OF THE SIMULATED DATA AT STAGE II (π 1 = 0.1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.4 SIMULATION 1: No. OF TRUE POSITIVES VS. π 1 . . . . . . . . 30 4.5 SIMULATION 1: FDR vs. π 1 . . . . . . . . . . . . . . . . . . . . . 30 4.6 SIMULATION 2: No. OF TRUE POSITIVES VS. u 11 . . . . . . . 32 4.7 SIMULATION 2: FDR vs. u 11 . . . . . . . . . . . . . . . . . . . . . 32 5.1 THREE POINT NORMAL MIXTURE . . . . . . . . . . . . . . . . 35 5.2 ORACLE PROCEDURE FOR K POINT MODEL . . . . . . . . . 42 5.3 DATA ADAPTIVE PROCEDURE FOR K POINT MODEL . . . . 44 5.4 No. OF TRUE POSITIVES AND FDR VS. NULL PROPORTION 46 vii Abstract High-throughput screening (HTS) is a large-scale hierarchical process which contains 5 stages. The purpose of HTS is to identify biologically active compounds and generate drug candidates for clinical trials. However, conventional statistical methods of HTS are not well developed and they have three major problems which are: (1) ignoring multiple comparison problems; (2) very low signal to noise ratio; (3) not cost-effective. This dissertation develops new methods for resolving the 3 problems in an integrated way. The proposed methods are applied in both the two point normal mixture model and the k point normal mixture model. The advantages of the proposed methods are illustrated via both simulations and real studies. It is shown that the proposed designs highly improve the statistical power and are very cost-effective. viii Chapter 1 Introduction In both pharmaceutical industries and academic institutions, high throughput screening (HTS) has played a fundamental role with the aim to select biologically active agents or effective drug candidates from a large number of compounds (Malo et al. (2006)). 
During the last 20 years, HTS has become crucial in emerging and fast-developing fields such as small molecule screening in stem cell biology (Xu et al. (2008)) and interference RNA (RNAi) screening in molecular biology (Moffat & Sabatini (2006))(Echeverri & Perrimon (2006)). However, statistical methods that can be applied in HTS are not well developed. This is likely one of the reasons that HTS activities have not been so efficient in screening successful or correct agents/drug candidates/small molecules from a potential panel (Dove (2003)). HTS is a large-scale and multi-stage approach (Figure 1.1). Selecting a target is the first step. The target is some biological molecule that a compound can act on. There are about 500 targets used by pharmaceutical companies. The most com- monly used targets are cell membrane receptors (45%), mostly G-protein coupled receptors, and enzymes (28%). After a target is identified, an assay for measuring the reaction between compounds and target is developed and optimized accord- ingly. Most assays are either fluorescence or absorbance based. With the advances in robot-based technology, assay development is directed towards miniaturization. Therefore screening many thousands of compounds in a short period is possible (Macarron (2006)). Once the assay is established, a “compound library” contain- ing hundreds or thousands of compounds is available for a primary screen, or by 1 Figure 1.1: FLOW CHART OF HTS PROCESS Target Identification Assay Development Primary Screen Stage I Hits Confirmatory Screen Stage II Confirmed Hits Hits Follow-up Leads the definition of this dissertation, stage I. Conventionally at this stage, a single measurement (no replicates) of each compound is obtained. The “hits” identified in primary screening are confirmed at the step of confirmatory screening, or by the definition of this dissertation, stage II . Typically at stage II, at least duplicate replicates are performed for each compound. “Confirmed hits” are followed by structure-activity-relations (SAR), scaffold clustering, or some other techniques. At this step, chemists play a key role and they visualize “confirmed hits” by struc- tural scaffolds to identify chemical series that can be synthetically optimized. If passing that step, they are called “leads” which are potential drug candidates for clinical trials. As noted, advances in biotechnology with robot-based automation techniques along with organizational strategies to improve research efficiency have both increasedtheutilizationandperformanceofHTSresearchfordrugscreening. How- ever, surprisingly, the new drug approval rate has decreased dramatically since 2 1998 even though total research-and-development spending is increasing (Dove (2003)). In other words, funds are not spent efficiently. This dissertation is there- fore focused toward development of statistical methods for the design and analysis of HTS studies that will optimize the efficiency for such research. Review of current statistical methods applied in HTS in the Chapter 2 litera- ture review shows that there are three major problems in current statistical HTS methods: (i)multiplecomparisonproblem; (ii)lowsignaltonoiseratio(SNR);and (iii) lack of attention to cost effectiveness. This dissertation will propose solutions that address these three problems in an integrated way. Thedissertationisorganizedasfollows. 
Chapter2comprisesaliteraturereview of current statistical methods applied in primary and confirmatory screening, as well as a review of procedures for multiple hypothesis tesing. Chapter 3 presents the proposed general strategy and operational process to solve the three HTS problems. Startingfromasimplecaseassumingatwopointnormalmixturemodel, two computational algorithms in different scenarios for this simple yet practical case are proposed in Chapter 4. In chapter 4, simulation results and application to real HTS dataset are also shown. The more general case, k point normal mixture model was explored in chapter 5. Last, conclusions and future work are presented in chapter 6. 3 Chapter 2 Literature review 2.1 Current HTS methods Statistical analysis plays an important role in each step of HTS (Figure 1.1). At the assay development step, signal window (SW), assay variability ratio (AVR) and Z’-factor aresome parametersthat needto bewell optimized(N. etal. (2013)). SW is a measure of the data range of an HTS assay. SW can be calculated by two equations SW = [|mean(C pos )−mean(C neg )|− 3(std(C pos ) +std(C neg ))]/std(C pos ), (2.1) or SW = [|mean(C pos )−mean(C neg )|− 3(std(C pos ) +std(C neg ))]/std(C neg ), (2.2) where C pos and C neg represent positive control and negative control, respectively (N. et al. (2013)). Positive controls and negative controls should give the positive signal and negative signal accordingly, so they are used to make sure the assay itself works. Different from SW, AVR represents the data variability of an assay. It is defined by: AVR = 3(std(C pos ) +std(C neg ))/|mean(C pos ) +mean(C neg )|. (2.3) 4 Derived from the AVR, the Z’-factor is defined as (1-AVR) (N. et al. (2013)). From the data analysis perspective, statistical methods have been developed to clean and normalize raw data to reduce assay variability and systematic errors. Some of these methods are control based, such as percent of control (PC) and normalized percent inhibition (NPI). Some are not control based, such as Z score, robust Z score, B score, or BZ score (Brideau et al. (2003)). While these methods are very important for a successful HTS, they are not the focus of this dissertation. In the section below, the statistical methods applied in primary and confirmatory screening steps are reviewed. 2.1.1 Statistical methods of primary screen For HTS assays, an example layout for a 96 well plate is given as Figure 2.1. There is also another 384 well plate that can be used in HTS. In one HTS project, hundreds or thousands of 96/384 well plates are run simultaneously. None of the current methods use replicates in primary screening, i.e., a single measurement is used for all individual compounds. Eyeball method: This method plots raw or preprocessed measurements against compound identity (Malo et al. (2006)). In other words, the y-axis corre- sponds to compound measurements, and the x-axis is compound id. By looking at the graph, any outstanding compound will be selected as a “hit”. This method may be good at identifying super-active compounds. However, this method is obviously very subjective and does not have any statistical basis. Top percentage method: The compounds that produce the highest percent- age of measured activity, for example the top 1 %, are selected as “hits”. Without any prior knowledge of the number of “true” active compounds, this method is also very subjective, and has little rational statistical support (Malo et al. (2006)). 
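As a small illustration of the assay-quality calculations in equations (2.1)-(2.3), the sketch below evaluates SW, AVR, and the Z'-factor for hypothetical positive- and negative-control readings. The control values are invented for illustration, and the AVR denominator is taken here as the absolute difference of the control means, the usual convention for the Z'-factor; nothing in this sketch is taken from the dissertation's own code.

```r
# Assay-quality metrics of equations (2.1)-(2.3) for one plate.
# 'pos' and 'neg' are hypothetical raw readings from positive and
# negative control wells (illustrative values only).
pos <- c(95, 102, 98, 101, 97, 99)
neg <- c(10, 12, 9, 11, 10, 13)

signal_window <- function(pos, neg, ref = c("pos", "neg")) {
  ref <- match.arg(ref)
  denom <- if (ref == "pos") sd(pos) else sd(neg)   # eq. (2.1) vs. eq. (2.2)
  (abs(mean(pos) - mean(neg)) - 3 * (sd(pos) + sd(neg))) / denom
}

# AVR and Z'-factor; the denominator below uses the absolute difference
# of the control means, the usual convention.
avr <- 3 * (sd(pos) + sd(neg)) / abs(mean(pos) - mean(neg))
z_prime <- 1 - avr

c(SW = signal_window(pos, neg, "neg"), AVR = avr, Zprime = z_prime)
```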
5 h g f e d c b a 1 2 3 4 5 6 7 8 9 10 11 12 Layout Compound 1−80 Negative control Positive control Figure 2.1: AN EXAMPLE OF 96 WELL PLATE LAYOUT Percent inhibition cut-off: In this method, activity of the individual com- pound is normalized by formula: NPI = (mean(C high )−S i )/(mean(C high )−mean(C low )), (2.4) where C high and C low represent high control and low control respectively (N. et al. (2013)). Both high control and low control are positive controls, where high control should give a high positive signal and low control should give a low positive signal. S i denotes the measurement of the individual compound. The “hits” are selected by an arbitrary cut off. While it was claimed (N. et al. (2013)) that this method is preferred for some small molecule screens with strong controls, there is no statistical justification for this assertion. 6 Z-score method: This method has some statistical support and assumes that most of the compounds are inactive and measurements are normally distributed (N. et al. (2013)) (Malo et al. (2006)). Using the mean and standard deviation (std) calculated across all data points, a z score is calculated for each compound by z i = (x i −mean)/std. (2.5) The cutoff to identify a “hit” is 3 or -3. This method is very sensitive to outliers. A false positive error rate of 0.00135 was reported (Zhang et al. (2006)). This number is area under standard normal curve when z is larger than 3 or less than -3 (1-Φ(3)). Robust Z-score method: To address the sensitivity of the conventional z score method to outliers, a robust z score method was developed (N. et al. (2013)). This method replaces the mean and std in the z score method with the median and median absolute deviation (mad). Therefore it is not sensitive to outliers. The equation for mad computation is mad = 1.4826median(|x i −median|). (2.6) The formula for the robust z score is z i = (x i −median)/mad. (2.7) The cutoff is defined as 3 or -3. This method has lower false non-discovery rates than the Z-score method (Chung et al. (2008)). Quartile-based method: Both z-score methods above are not robust to the violation of symmetry. Following Tukey’s idea of the widely-used boxplot, 7 a quartile-based method was proposed (Zhang et al. (2006)) to address this prob- lem. From the measurements, the first quartile (Q1), the median (Q2), and the third quartile (Q3) were calculated. Then, interquartile range (IQR) can be cal- culated as Q3-Q1. The lower boundary for “hit” selection is set up at the smallest observed measurement greater than Q1-1.7239 IQR; the upper boundary for “hit” selection is the biggest observed measurement smaller than Q3+1.7239 IQR. It was shown that this method is more powerful than the two z-score methods when the “true hits” have weak/moderate effect (Zhang et al. (2006)). SSMD or robust SSMD: Strictly standardized mean difference (SSMD) requires use of a negative control group (Zhang (2007)). SSMD (β) represents the ratio of the mean to standard deviation of the random variable representing the difference between 2 populations. When the mean and standard deviation are replaced with median and mad, it is called robust SSMD. The larger the absolute value of SSMD, the larger the magnitude of difference between the two popula- tions. Onepopulationisthetestedcompound, andtheotheristhenegativecontrol group. 
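The three rules above (Z-score, robust Z-score, and the quartile-based method) reduce to a few lines of code. The sketch below applies them to a hypothetical single-replicate screen; the simulated measurements are illustrative only, and the SSMD rule, whose formulas follow, is computed analogously from the negative-control summaries.

```r
# Hit selection for a hypothetical single-replicate primary screen.
set.seed(1)
x <- c(rnorm(480, 0, 1), rnorm(20, 3, 1))   # mostly inactive compounds

## Z-score method, equation (2.5); hits at z > 3 or z < -3
z <- (x - mean(x)) / sd(x)
hits_z <- which(abs(z) > 3)

## Robust Z-score method, equations (2.6)-(2.7); same +/- 3 cutoff
mad_x <- 1.4826 * median(abs(x - median(x)))
robust_z <- (x - median(x)) / mad_x
hits_rz <- which(abs(robust_z) > 3)

## Quartile-based method: flag points beyond Q1 - 1.7239*IQR or Q3 + 1.7239*IQR
q <- quantile(x, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
hits_quart <- which(x < q[[1]] - 1.7239 * iqr | x > q[[2]] + 1.7239 * iqr)

lengths(list(z = hits_z, robust_z = hits_rz, quartile = hits_quart))
```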
When the two populations are independent, SSMD takes the following β = (μ 1 −μ 2 )/ q σ 2 1 +σ 2 2 , (2.8) where μ 1 and σ 1 are the mean and standard deviation of the test compound; μ 1 andσ 2 are the mean and standard deviation of the negative control group. In the primary screening, β can be estimated by ˆ β i = (x i − ¯ x 2 )/ q (2/K)(n 2 − 1)s 2 2 , (2.9) 8 and the variance of β can be estimated by ˆ σ 2 ≈ (1/2)(1 + 1/n 2 ), (2.10) whereK≈n 2 − 2.5,x i is the observed measurement of a compound, ¯ x 2 is the mean of the negative control group, s 2 is the standard deviation of the negative control group, and n 2 is the sample size of the negative control group. Inaddition, someZα isdefinedsuchthatPr(Z≤Zα)=1-α, whereZ isstandard normal variable. The compound is selected as a positive compound if ˆ β i ≥c−Zαˆ σ. α is defined as the false negative level and can be selected as various numbers. C can also take various values. For example, when c=3, α=0.025, and the sample size of negative control group is 100, any number ˆ β i ≥ 1.6084 indicates a positive compound. 2.1.2 Statistical methods of confirmatory screen At this step, “hits” selected from the primary screening are further tested using more accurate and more expensive experiments. Conventionally, at least two repli- cates are performed at this step (N. et al. (2013)). Then the sample variability of each compound can be estimated. Two methods are currently used. T-test method: A t-statistic derived p-value is used to check if a compound is a “confirmed hit” or not, using the negative control group as the comparison group. Extended from this t-test method, there are some variants such as “paired t-test” (Zhang (2010)) or “randomized variance model (RVM)” (Wright & Simon (2003))(Malo et al. (2010)). 9 SSMD or robust SSMD: Since replicates are performed for every “hit”, SSMD (β) is estimated by a different formula at the confirmatory screening step (N. et al. (2013)). The formula is ˆ β i = ( ¯ x 1 − ¯ x 2 )/ q (2/K)[(n 1 − 1)s 2 1 + (n 2 − 1)s 2 2 ], (2.11) and the variance of β can be estimated by ˆ σ 2 ≈ (1/2)(1/n 1 + 1/n 2 ), (2.12) where K≈n 1 +n 2 − 3.5, ¯ x 1 is the mean of the “hit”, s 1 is the standard deviation of the “hit”, n 1 is the sample size of the “hit”. ¯ x 2 is the mean of the negative control group, s 2 is the standard deviation of the negative control group, n 2 is the sample size of the negative control group. The same decision rule described in the primary screening is followed. 2.1.3 Major existing statistical issues There are three major issues in these statistical methods that are currently employed in primary and confirmatory screens. First, the current methods ignore the multiple comparison problem. In the confirmatory screen step, decisions are madesimultaneouslyforhundreds/thousandsof“hits”identifiedinprimaryscreen- ing, thus suffering an inflation of the type I error. All of the methods above control type I error at an individual “hit” level resulting in inflated overall type I error and leading to many false positive “confirmed hits”. At this step, controlling type I error is very critical because cost of the “Hits Follow-up” is very high. Too many false findings are a waste of funding. Second, the signal to noise ratio (SNR) is very low in the primary screen. As mentioned above, no replicates are used with 10 a very low SNR. A low SNR can yield a very high missed discovery rate. Finally, the current designs are not cost-effective. 
The current methods do not take into consideration optimal budget allocation between primary and confirmatory screen, so the funding is not utilized efficiently. All of these issues result in the inefficiency of HTS (Dove (2003)). 2.2 A review of multiple testing procedures Outcomes of multiple testing procedures are outlined in Table 2.1. S is the number ofnon-rejections;Risthenumberofrejections;N 00 isthenumberoftruenegatives; N 01 is the number of false negatives; N 10 is the number of false positives; N 11 is the number of true positives. Table 2.1: OUTCOMES OF MULTIPLE TESTING PROCEDURES Claimed non-significant Claimed significant Total Null N 00 N 10 m 0 Non-null N 01 N 11 m 1 Total S R m 2.2.1 Family wise error rate (FWER) If testingm hypotheses and usingα for each individual test, a per-comparison error rate (PCER) procedure is performed that results in inflated overall type I error. Per-comparison procedures do not control the overall type I error Shaffer (1995). If considering all hypotheses in a multiple testing setting as one family, the family wise error rate (FWER) (Shaffer (1995)) is defined as FWER =P (N 10 ≥ 1). (2.13) 11 Using PCER procedure (Shaffer (1995)), FWER = 1− (1−α) m . (2.14) This is the method currently applied in HTS. Use of this method yields many false positive findings. Instead of testing each individual test at α, the overall type I error can be controlled by letting FWER≤α. A number of FWER controlling procedures have been proposed. One class is single-step procedures, which eval- uate each hypothesis using a common critical value that is independent of other test statistics. The Bonferroni method is perhaps the most such common single- step procedure. Other ones include the Sidak procedure and the minP procedure (Westfall & (1993)) (Dudoit et al. (2002)) (Shaffer (1995)). As a matter of fact, single-step procedures are very conservative (Holm (1979)) (Hochberg (1988)). So, step-wise procedures have been proposed to improve this problem. Of step-wise procedures, there are two types. One is the step-down procedures, such as Holm procedure (Holm (1979)) and the other is the step-up procedures, such as Simes-Hochberg procedure (Hochberg (1988)). Holm procedure: Letp 1 ≤···≤p m be the orderedp values andH 1 ,··· ,H m be the corresponding hypotheses (Holm (1979)). The procedure goes as follows: if p 1 ≥α/m, then accept all hypotheses and stop. Otherwise reject H 1 and test the remaining m-1 hypotheses at level α/(m-1). If p 2 ≥α/(m-1), then accept the remaining hypotheses and stop. Otherwise reject H 2 and test the remaining m-2 hypotheses at level α/(m-2). Keep going like this. The Holm procedure starts with the most significant hypothesis, so it is called step-down procedure. 12 Simes-Hochberg procedure: This procedure starts with the least significant hypothesis (Hochberg (1988)), so it is called step-up procedure. Let k=max{i : p (i) ≤α /(m-i+1)}, then we reject all H (i) ,i≤k. IfFWERproceduresareappliedinverylarge-scalestudieswherewehavemany non-null hypotheses, these procedures are too conservative (Benjamini & Hochberg (1995)) and inflate N 01 . These procedures are not sufficiently powerful to pick up positive signals. 2.2.2 False discovery rate (FDR) To address the inflated false negative rates occuring with FWER procedures, Benjamini and Hochberg (Benjamini & Hochberg (1995)) proposed to control the false discovery rate (FDR). The equation to calculate FDR is FDR =E( N 10 R |R> 0)P (R> 0) =E( N 10 R∨ 1 ). 
(2.15) When there are no rejections (R=0), FDR is zero. There are some other simi- lar definitions, such as positive false discovery rate (pFDR) (Storey (2002)) and marginal false discovery rate (mFDR) (Genovese & Wasserman (2002)) (Sun & Cai (2007)). It was shown that pEDR, mFDR, and FDR are asymptotically equivalent under certain conditions (Storey (2003)) (Genovese & Wasserman (a)). P-value based procedures: The seminal work of Benjamini and Hochberg in 1995 proposed a procedure to control the FDR (Benjamini & Hochberg (1995)). Letp 1 ,...,p m be the orderedp-values and the corresponding hypotheses are H 1 ,..., H m . If k = max{i :p (i) ≤iα/m}, 13 then reject all H (i) , i≤k. This procedure controls FDR at α level. In 2000, Ben- jamini and Hochberg (Benjamini & Hochberg (2000)) noticed that their original work did not take into consideration information of the sample, for example, the proportion of non-nulls, π. Therefore, their original step up procedure is conser- vative and controls FDR at (1-π)α. A number of procedures have been proposed to include the estimated proportion information of non-nulls, namely BH adap- tive π-value procedure (Benjamini & Hochberg (2000)), Storey’s q-value proce- dure (Storey (2002)), and GW Oracle and plug-in π-value procedure (Genovese & Wasserman (a)). Local FDR based procedures: All of the methods above use individual test-derived p-values, and only use the proportion of non-nulls from the sample. However, it is possible to improve these in terms of the best procedure to con- trol FDR. First, in addition to the proportion of non-nulls, is there any other information that can be employed? Second, are all valid FDR methods above the optimal ones as well? In the multiple hypothesis testing field, an FDR procedure is said to be valid if it controls the FDR at a prespecified level α and optimal if it has the smallest FNR among all valid FDR procedures at level α (Sun & Cai (2007)). Integrating the concepts of multiple testing and weighted classification problems/compound decision problems (Robbins (1951)) and the concept of local false discovery rate (Lfdr) (Efron et al. (2001)) (Efron (2004)), Sun and Cai (Sun & Cai (2007)) developed an Oracle and adaptive compound decision rule for large scale multiple testing. This method finds the statistic, Lfdr, that uses distribu- tion information in addition to π-value and is optimal compared to all p-value based methods. Lfdr procedure is employed in this dissertation. There are two reasons. First, as mentioned above, Lfdr procedure gives the least number of false negatives than the standard FDR methods while having the same FDR value as 14 the standard FDR methods. In addition to that, Lfdr can be computed directly using the samples generated from the computational algorithms developed in this dissertation. In this method, a two component random mixture model that is widely used in multiple testing problems (Efron et al. (2001)) is used. Suppose there are observed values x=(x 1 ,··· ,x m ). Let θ=(θ 1 ,...,θ m ) be the unobserved status for each measurement. Based on x, the purpose is to make inference for each θ, and the decision is denoted by δ=(δ 1 ,...,δ m ). In terms of inference, there are 4 possible outcomes: (1){θ i =0, δ i =0} (true negative); (2) {θ i =0, δ i =1} (false positive); (3) {θ i =1, δ i =0} (false negative); (4) {θ i =1, δ i =1} (true positive). From the simula- tion point of view, let θ 1 ,...,θ m be independent Bernoulli(π) variables (0,1). 
Then x i can be generated by x i |θ i ∼ (1−θ i )f 0 +θ i f 1 ; (2.17) or, marginally, x i ∼ (1−π)f 0 +πf 1 . (2.18) f 0 and f 1 are assumed to be continuous and positive on the real line, π is the proportion of non-nulls. In this method, the statistic used is Lfdr. The equation to calculate Lfdr is Lfdr i = (1−π)f 0 (x i )/f(x i ). (2.19) Where x i is the individual observed value, and f(x i )=(1-π)f 0 (x i )+πf 1 (x i ). In reality, estimatingπ is very challenging. A number of methods have been proposed (Cai et al. (2007)) (Meinshausen & Rice (2006)) (Genovese & Wasserman (b)) (Jin & Cai (2007)). The cutoff point to decide if δ i takes 0 or 1 is defined by k = max{i : 1 i i X j=1 Lfdr (j) ≤α}. (2.20) 15 Then reject all H (i) , i = 1,··· ,k. In other words, δ i =1 for i = 1,··· ,k, δ i =0 otherwise. Sun and Cai demonstrated that this method is optimal to other FDR methods. 16 Chapter 3 The general framework of a two stage design As mentioned above, current HTS procedures have 3 major problems: current methodsdonot controlthe overalltypeIerrorrateat theconfirmatoryscreenstep; HTS operators do not perform replicates at the primary screen step, so the SNR is very low; the current design does not take into consideration budget allocation. This dissertation proposes to address the three problems in an integrated way. For the multiple comparison problem, the best procedure, the local false discovery rate (Lfdr) based procedure, is applied. This procedure is optimal compared to other modern multiple testing procedures. For the low SNR issue, the optimal number of replicates at both primary and confirmatory stages are proposed by a computational algorithm. For cost effectiveness, a budget constraint is used to optimally allocate funding between primary and confirmatory stages. The motivation of this dissertation is to maximize the statistical power of HTS for a given budget, i.e., this dissertation finds the most efficient HTS design that identifies the largest number of true “confirmed hits” given the fixed funding. From the operational perspective of HTS biologists, this design provides HTS operators with the optimal number of replicates that should be performed at primary and confirmatory screen steps. From the analysis perspective of HTS biologists, this design provides the optimal false discovery rate that should be utilized at the primary screen and confirmatory screening. 17 The design is formulated as a constrained optimization procedure. Specifically, it is a two stage dynamic computation algorithm, where stage I is equivalent to the primary screen, and stage II is equivalent to the confirmatory screen (Figure 1.1). Let total budget =B, stage I budget =B 1 , and stage II budget =B 2 . At stage I, let m 1 = number of library compounds to be screened, r 1 = number of replicates per compound (same r 1 for every compound), c 1 = cost per replicate, q 1 = false discovery rate,|A 1 | = number of “hits” that are screened positive in stage I and proceed to stage II. At stage II, let r 2 = number of replicates per “hit” (same r 2 for every “hit”), c 2 = cost per replicate, q 2 = 0.05, and| A 2 | = number of final screened “confirmed hits”. So, dynamically, there are the following equations: B =B 1 +B 2 ; (3.1) B 1 =c 1 r 1 m 1 ; (3.2) B 2 =c 2 r 2 |A 1 |. (3.3) The dynamic computation process goes as follows: for each combination of specific values of r 1 and q 1 , a decision can be made at stage I, and the value of |A 1 | is decided as well. 
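As a concrete illustration of this stage I decision, the Lfdr rule of Section 2.2.2 can be applied directly to the replicate-averaged measurements. The sketch below implements equations (2.19) and (2.20); every parameter value (r1, q1, the mixture means, and the non-null proportion) is an illustrative assumption, not a recommended setting.

```r
# Stage I decision: Lfdr statistics (eq. 2.19) and the adaptive cutoff (eq. 2.20).
lfdr_decision <- function(x, pi1, mu0, sd0, mu1, sd1, q) {
  f0 <- dnorm(x, mu0, sd0)
  f1 <- dnorm(x, mu1, sd1)
  lfdr <- (1 - pi1) * f0 / ((1 - pi1) * f0 + pi1 * f1)            # eq. (2.19)
  ord <- order(lfdr)                                              # rank Lfdr ascending
  k <- max(c(0, which(cumsum(lfdr[ord]) / seq_along(ord) <= q)))  # eq. (2.20)
  reject <- rep(FALSE, length(x))
  if (k > 0) reject[ord[seq_len(k)]] <- TRUE
  reject
}

# With r1 replicates the component standard deviations shrink by sqrt(r1).
r1 <- 4; pi1 <- 0.1; q1 <- 0.2
set.seed(2)
theta <- rbinom(500, 1, pi1)                          # true states of 500 compounds
x1 <- rnorm(500, mean = ifelse(theta == 1, 2.5, 0), sd = 1 / sqrt(r1))
delta1 <- lfdr_decision(x1, pi1, 0, 1 / sqrt(r1), 2.5, 1 / sqrt(r1), q1)
sum(delta1)   # number of compounds passed to stage II
```

Here `sum(delta1)` plays the role of |A1| in equation (3.3).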
Then, by equations of (3.1), (3.2), and (3.3), B 1 ,B 2 , and r 2 can be calculated. So a decision at stage II can be made if q 2 is kept fixed at 0.05 . With different combinations of r 1 and q 1 , there are different outputs|A 2 |. The aim is to develop the optimal combination ofr 1 andq 1 in order to identify the maximum number of true “confirmed hits” (max(| A 2 |)) of two stage HTS. This is also a sequential decision problem. Finally, the optimal value of r 1 and q 1 can be reported. 18 Chapter 4 Two point normal mixture model The compound library can be very diverse and complicated (Gillet (2008)) (KÃijmmel & Parker (2011))(Entzeroth et al. (2001)). Yet the assumption that the data follow a two point normal mixture model is very reasonable. Let x ji represent observed measurement from the ith compound at stage j. Let θ ji be a Bernoulli variable (π) denoting the true state of a compound. 4.1 Model set up At stage I, the simulated data is x 1 = (x 11 ,...,x 1m1 ). It can be generated by the random mixture model x 1i |θ 1i ∼ (1−θ 1i )f 10 +θ 1i f 11 ; (4.1) or, marginally, x 1i ∼ (1−π 1 )f 10 +π 1 f 11 . (4.2) Here,f 10 andf 11 are assumed to be normal distribution density functions. Letμ 10 andσ 10 be the mean and standard deviation of a null case;μ 11 andσ 11 be the mean and standard deviation of a non-null case;π 1 is the proportion of the non-null case. With r 1 replicates per compound, f 10 is a normal distribution density function with meanμ 10 and standard deviationσ 10 / √ r 1 . f 11 is a normal distribution density 19 functionwithmeanμ 11 andstandarddeviationσ 11 / √ r 1 . AfterstageI,foraspecific combination of r 1 and q 1 , decisions are δ 1 =(δ 11 ,...,δ 1m1 )∈ I ={0, 1} m 1 . At stage II, the simulated data are: x 2 = (x 21 ,...,x 2m2 ) (here, m 2 =| A 1 |). It can be generated by x 2i |θ 2i ∼ (1−θ 2i )f 20 +θ 2i f 21 ; (4.3) or, marginally, x 2i ∼ (1−π 2 )f 20 +π 2 f 21 . (4.4) Here,f 20 andf 21 are also assumed to be normal distribution density functions. Let μ 20 and σ 20 be the mean and standard deviation of false positive “hits”; μ 21 and σ 21 be the mean and standard deviation of true positive “hits”; π 2 is the proportion of the true positive “hits”. Withr 2 replicates per “hit”,f 20 is a normal distribution density function with mean μ 20 and standard deviation σ 20 / √ r 2 , f 21 is a normal distribution density function with mean μ 21 and standard deviation σ 21 / √ r 2 . After stage II, decisons areδ 2 = (δ 21 ,...,δ 2m2 )∈ I ={0, 1} m 2 . 4.2 Oracle procedure This procedure assumes that there is an oracle that knows μ 10 , σ 10 , μ 11 , σ 11 , μ 20 , σ 20 , μ 21 , σ 21 and π 1 . Then the parameter information can be used to sim- ulate the data. Specifically, at stage I, the oracle gives all parameters needed. Then following the random mixture model (4.1), simulated measurements can be generated for a specific value of r 1 . Using Lfdr multiple testing procedure and a specific value of q 1 , the decisions can be made and|A 1 | is determined. At stage II, following the random mixture model (4.3) and information of θ 1i from stage I, simulated measurements can be generated. Again, by using the Lfdr procedure 20 and letting q 2 =0.05, decisions can be made and| A 2 | is determined. Different combinations of r 1 and q 1 are tested in order to reach the maximum| A 2 |. The overall computational algorithm is depicted in Figure 4.1 below. 
4.3 Data adaptive procedure For the two point normal mixture model mentioned above, if there is not such an oracle, then parameter values are not known. This dissertation proposes a data adaptive procedure. In this procedure, a Markov chain Monte Carlo (MCMC) Gibbs sampler (Chen et al. (2000)) is employed. If there are preliminary data, MCMC can give estimated μ 10 , σ 1 2 0 , μ 11 , σ 1 2 1 , π 1 , μ 20 , σ 2 2 0 , μ 21 , σ 2 2 1 . Before the detailed MCMC procedure is presented, it is necessary to introduce some back- ground on Bayesian statistics . 4.3.1 Prior distributions For the random mixture model (2.17), assume thatf 1 ∼N(μ 1 ,σ 1 2 )andf 2 ∼N(μ 2 , σ 2 2 ). There are n 1 θ i =0; n 2 θ i =1 such that n 1 +n 2 =m. (4.5) Since conjugate prior distributions have the practical advantage due to compu- tational convenience, and easy interpretation, conjugate prior distributions are assumed for all parameters. Specifically, assume μ 1 and μ 2 follow a normal distri- bution;σ 1 andσ 2 follow a scale inverseχ 2 distribution with degree of freedomν 01 , 21 Figure 4.1: ORACLE PROCEDURE FOR TWO POINT MODEL Oracle: prior knowledge Stage I μ 10 ,σ 1 2 0 ,μ 11 ,σ 1 2 1 ,π 1 Simulation with r 1 Simulated data: x 11 ,...,x 1m1 r 1 replicates q 1 Lfdrs Decisions: δ 11 ,...,δ 1m1 |A 1 | Oracle: prior knowledge Stage II μ 20 ,σ 2 2 0 ,μ 21 ,σ 2 2 1 Simulation with r 2 : r 2 calculated by equa- tion (3.1), (3.2) and (3.3) Simulated data: x 21 ,...,x 2m2 r 2 replicates q 2 =0.05 Lfdrs Decisions: δ 21 ,...,δ 2m2 |A 2 | |r 1 |,|q 1 |, and|r 2 | 22 ν 02 and scale parameterσ 01 ,σ 02 ; the proportion of nulls follows a beta distribution with parameter α and β. Notationally, μ 1 ∼N(μ 01 ,τ 0 2 1 ); (4.6) σ 1 ∼ScaleInverseχ 2 (ν 01 /2,ν 01 σ 0 2 1 /2); (4.7) μ 2 ∼N(μ 02 ,τ 0 2 2 ); (4.8) σ 2 ∼ScaleInverseχ 2 (ν 02 /2,ν 02 σ 0 2 2 /2); (4.9) 1−π∼Beta(α,β). (4.10) 4.3.2 Some notations Define some notations for easy expression later. Let S X1 = X (x i |θ i = 0), (4.11) which is the sum of all measurements x i given θ i =0; σ n 2 1 = 1/(1/τ 0 2 1 +n 1 /σ 2 1 ); (4.12) μ n1 =σ n 2 1 × (μ 01 /τ 0 2 1 +SX1/σ 2 1 ); (4.13) S X2 = X (x i |θ i = 1), (4.14) which is the sum of all measurements x i given θ i =1; σ n 2 2 = 1/(1/τ 0 2 2 +n 2 /σ 2 2 ); (4.15) 23 μ n2 =σ n 2 2 × (μ 02 /τ 0 2 2 +SX2/σ 2 2 ); (4.16) S 2 X1 = X ((x i −μ 1 ) 2 |θ i = 0), (4.17) which is the sum of the square of the difference between measurements x i and μ 1 given θ i =0; S 2 X2 = X ((x i −μ 2 ) 2 |θ i = 1), (4.18) which is the sum of the square of the difference between measurements x i and μ 2 given θ i =1. 4.3.3 Posterior distributions By choosing the conjugate prior distributions mentioned above, conjugate pos- terior distributions for all parameters in the two point normal mixture model are: μ 1 ∼N(μ n1 ,σ n 2 1 ); (4.19) μ 2 ∼N(μ n2 ,σ n 2 2 ); (4.20) σ 2 1 ∼ScaleInverseχ 2 ((ν 01 +n 1 )/2, (ν 01 σ 0 2 1 +SX1 2 )/2)); (4.21) σ 2 2 ∼ScaleInverseχ 2 ((ν 02 +n 2 )/2, (ν 02 σ 0 2 2 +SX2 2 )/2)); (4.22) 1−π∼Beta(α +n 1 ,β +n 2 ); (4.23) prob(θ i = 0|π,x i ,μ 1 ,σ 2 1 ) = (1−π)f(x i |μ 1 ,σ 2 1 ); (4.24) prob(θ i = 1|π,x i ,μ 2 ,σ 2 2 ) =πf(x i |μ 2 ,σ 2 2 ); (4.25) prob(θ i = 0) = (4.24)/((4.24) + (4.25)). (4.26) 24 Therefore,μ 1 andμ 2 follow a normal distribution; σ 1 andσ 2 follow a scale inverse χ 2 distribution; the proportion of nulls follows a beta distribution. In MCMC, θ i =0 or 1 should be sampled. So it is necessary to compute the probability ofθ i =0 by equation 4.24 and the probability of θ i =1 by equation 4.25. 
Then, following equation 4.26, θ i can be sampled. 4.3.4 MCMC algorithm MCMC (Chen et al. (2000)) is a widely used method to approximate poste- rior distributions. A particular algorithm, the Gibbs sampler, is especially useful in multi-dimensional parameter problems, such as the two point normal mixed model which has a number of parameters to estimate. It is also called alternating conditional sampling. The basic idea is drawing from a sample of one parameter first, then draw from a sample of another parameter (conditional on the current values of other parameters), and so on. The Gibbs sampler is employed in the data adaptive procedure to get the estimated parameter values. Now, assumingpreliminarydatax=(x 1 ,...,x m )(noreplicates), theMCMCalgo- rithm proceeds as follows: 1. initiate the value 0 or 1 for θ i and value of μ 1 and μ 2 ; 2. following equation 4.21 and 4.22, sample σ 2 1 and σ 2 2 ; 3. update μ 1 and μ 2 by equation 4.19 and 4.20; 4. update 1-π following equation 4.23; 5. update the value 0 or 1 for θ i following equation 4.26; 6. go back to step 2 and iterate until it converges. 25 In this way, by using MCMC, parameters in the computation algorithm are estimated. 4.3.5 Data adaptive procedure At stage I, there are some preliminary data x 1 = (x 11 ,...,x 1m1 ). Then by using MCMC, μ 10 , σ 1 2 0 , μ 11 , σ 1 2 1 , π 1 can be estimated. Following the random mixture model of equation (4.1), simulate measurements for a specific value of r 1 . By the Lfdr multiple testing procedure and a specific value of q 1 , decisions can be made and| A 1 | is determined. At stage II, assuming some preliminary data x 2 = (x 21 ,...,x 2m1 ), by using MCMC,μ 20 ,σ 2 2 0 ,μ 21 ,σ 2 2 1 can be estimated. At stage II, the values of each of θ=(θ 21 ,...,θ 2m2 ) are known from the simulated data of stage I, therefore it is not necessary to estimate proportion of non-nulls any more. Again, by using the Lfdr procedure and letting q 2 =0.05, decisions can be made and| A 2 | is determined. Different combinations of r 1 and q 1 are tested in order to reach the maximum|A 2 |. The overall computational algorithm is depicted in Figure 4.2 below. 4.4 Simulation results To evaluate the performance of the proposed method, the proposed data adap- tive procedure was compared with conventional methods: z score method, robust z score method, and quartile based method via simulation studies. The number of true positive compounds identified at stage II, and stage II FDR obtained by different methods were compared. These numbers were compared under various situations. 26 Figure 4.2: DATA ADAPTIVE PROCEDURE FOR TWO POINT MODEL Preliminary data: x 11 ,...,x 1m1 no replicates Stage I MCMC μ 10 ,σ 1 2 0 ,μ 11 ,σ 1 2 1 ,π 1 Simulation with r 1 Simulated data: x 11 ,...,x 1m1 r 1 replicates q 1 Lfdrs Decisions: δ 11 ,...,δ 1m1 |A 1 | Preliminary data: x 21 ,...,x 2m1 no replicates Stage II MCMC μ 20 ,σ 2 2 0 ,μ 21 ,σ 2 2 1 Simulation with r 2 : r 2 calculated by equa- tion (3.1), (3.2) and (3.3) Simulated data: x 21 ,...,x 2m2 r 2 replicates m 2 =|A 1 | q 2 =0.05 Lfdrs Decisions: δ 21 ,...,δ 2m2 |A 2 | |r 1 |,|q 1 |, and|r 2 | 27 In the simulation study, the two-component random mixture model (equation (2.17)) was used to simulate data for both stages, which were used as the pre- liminary data. The simulated true state of nature of compound i, θ i was used to determine the number of true positive compounds and FDR. 
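The Gibbs sampler of Section 4.3.4, which supplies the parameter estimates for the data adaptive procedure used in these comparisons, can be written compactly. The sketch below follows the posterior distributions (4.19)-(4.26); the hyperparameter values, initial allocation, and simulated preliminary data are illustrative assumptions, and label switching is not addressed.

```r
# Minimal Gibbs sampler for the two point normal mixture (Section 4.3.4).
gibbs_two_point <- function(x, n_iter = 5000,
                            mu01 = 0, tau01sq = 100,  # prior for the null mean
                            mu02 = 3, tau02sq = 100,  # prior for the non-null mean
                            nu01 = 1, s01sq = 1,      # prior for the null variance
                            nu02 = 1, s02sq = 1,      # prior for the non-null variance
                            a = 1, b = 1) {           # Beta prior on 1 - pi
  m <- length(x)
  theta <- as.integer(x > median(x))                  # step 1: crude initial allocation
  mu1 <- mean(x[theta == 0]); mu2 <- mean(x[theta == 1])
  out <- matrix(NA, n_iter, 5,
                dimnames = list(NULL, c("mu1", "sig1sq", "mu2", "sig2sq", "pi")))
  for (t in 1:n_iter) {
    n1 <- sum(theta == 0); n2 <- sum(theta == 1)
    # step 2: variances from the scaled inverse chi-square posteriors (4.21)-(4.22)
    sig1sq <- 1 / rgamma(1, (nu01 + n1) / 2,
                         rate = (nu01 * s01sq + sum((x[theta == 0] - mu1)^2)) / 2)
    sig2sq <- 1 / rgamma(1, (nu02 + n2) / 2,
                         rate = (nu02 * s02sq + sum((x[theta == 1] - mu2)^2)) / 2)
    # step 3: means from the normal posteriors (4.19)-(4.20)
    v1 <- 1 / (1 / tau01sq + n1 / sig1sq)
    mu1 <- rnorm(1, v1 * (mu01 / tau01sq + sum(x[theta == 0]) / sig1sq), sqrt(v1))
    v2 <- 1 / (1 / tau02sq + n2 / sig2sq)
    mu2 <- rnorm(1, v2 * (mu02 / tau02sq + sum(x[theta == 1]) / sig2sq), sqrt(v2))
    # step 4: null proportion 1 - pi from the Beta posterior (4.23)
    p_null <- rbeta(1, a + n1, b + n2)
    # step 5: re-allocate each compound following (4.24)-(4.26)
    p0 <- p_null * dnorm(x, mu1, sqrt(sig1sq))
    p1 <- (1 - p_null) * dnorm(x, mu2, sqrt(sig2sq))
    theta <- rbinom(m, 1, p1 / (p0 + p1))
    out[t, ] <- c(mu1, sig1sq, mu2, sig2sq, 1 - p_null)
  }
  out
}

# Example: posterior means after burn-in give the stage I parameter estimates.
set.seed(3)
x_prelim <- c(rnorm(450, 0, 1), rnorm(50, 2.5, 1))
draws <- gibbs_two_point(x_prelim, n_iter = 2000)
colMeans(draws[-(1:500), ])
```

The posterior means of these draws provide the estimates of the null and non-null means and variances and of π1 that enter the stage I simulation of Figure 4.2.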
The computations for the z score, robust z score, and quartile based methods followed the described in section 2.1. The proposed data adaptive method were followed by flowcharts mentioned above (Figure 4.2). Information from the generated data at stage I and stage II was utilized to help selection of hyperparameters and initial values for the MCMC procedure in the data adaptive method. For the proposed method, a series ofr 1 taking the values of 1,2,3, ... , and 24 was tested; for the conventional meth- ods, one replicate was tested. Table 4.1 shows the budget contraint (equations (3.1),(3.2), and (3.3)) and compounds number (m 1 ) used in the computation. 4.4.1 Numerical setting 1 In practice, depending on the design of the compound library, there might be different proportions of positive compounds (π 1 ). To mimic these scenarios, π 1 =0.01, 0.02, ..., 0.20 were simulated. The other parameters used to generate the simulated data are shown in Table 4.2. The histogram of generated data at stage II whenπ 1 =0.1 is shown, as an example of what the simulated data look like (figure 4.3). Table 4.1: BUDGET CONSTRAINT FOR THE SIMULATIONS OF THE TWO POINT NORMAL MIXTURE MODEL No. Compounds Total budget C 1 C 2 500 250000 20 100 The results for the simulations performed under these settings are shown in Figure 4.4, and Figure 4.5. The results show: 28 Table 4.2: STAGE PARAMETERS FOR THE SIMULATIONS OF THE TWO POINT NORMAL MIXTURE MODEL Stage Null mean Null variance Alternative mean Alternative variance Stage I 0 1 2.5 1 Stage II 0 1 3 1 Histogram of x x Density −2 0 2 4 0.0 0.1 0.2 0.3 0.4 0.5 0.6 Figure4.3: HISTOGRAMOFTHESIMULATEDDATAATSTAGEII(π 1 = 0.1) 1. In terms of the number of true positive compounds identified at stage II (Figure 4.4), the proposed data adaptive method gave the number of true positive compounds that are much larger than the conventional methods. In contrast, the Z score, robust Z score, and quartile methods showed a very low number of true positive compounds. This is not surprising since the proposed data adaptive method found the optimal number of replicates per compound. However, the conventional methods used only one replicate per compound; 2. In terms of stage II FDR (Figure 4.5), the proposed data adaptive method gave the value of FDR under the prespecified FDR control level 0.05. How- ever, the conventional methods had a wide range of values of FDR with the Z score and robust Z score having almost 0s and quartile based methods 29 0.05 0.10 0.15 0.20 0 20 40 60 80 Proportions of positive compounds Number of true positives Z score Robust Z score Quartile−base method Proposed Data−driven method Figure 4.4: SIMULATION 1: No. OF TRUE POSITIVES VS. π 1 0.05 0.10 0.15 0.20 0.0 0.2 0.4 0.6 0.8 1.0 Proportions of positive compounds FDR Z score Robust Z score Quartile−base method Proposed Data−driven method Figure 4.5: SIMULATION 1: FDR vs. π 1 having the FDR ranging from 0 to 0.16. This is expected because only the proposed method considers controlling FDR and the conventional methods do not; 3. Overall, the proposed data adaptive method is much better than the conven- tional methods in terms of the number of true positive compounds identified at stage II. 30 4.4.2 Numerical setting 2 In this setting of simulation, all parameters were kept the same as in setting 1 (Table 4.1 and Table 4.2) except that we tried different alternative means of stage I at 2, 2.5, and 3. π 1 =0.1 was used to simulate the data. The results are shown in Figure 4.6, and 4.7. 
It is easy to see the following: 1. Similar to the findings of simulation study 1, in terms of number of true positive compounds identified at stage II (Figure 4.6), the data adaptive method gave the number of true positive compounds that are much larger than the conventional methods. In contrast, the Z score, robust Z score, and quartile methods showed a very low number of true positive compounds; 2. In terms of stage II FDR (Figure 4.7), the proposed data adaptive method gave the value of FDR under the prespecified FDR control level 0.05. How- ever, the Z score method, robust Z score method, and quartile method had very low values of FDR; 3. Overall, theproposeddataadaptivemethodismuchbetterthanconventional methods in terms of the true positive compounds identified at stage II. 4.4.3 Numerical setting 3 In this setting of simulation, the number of true positive compounds identified at stage II for the proposed data adaptive method and quartile-based method was compared under both the same budget and same FDR. Budget information and number of compounds from Table 4.1 were used. The means for the positive compounds at stage I and stage II were simulated at 2.5 and 3, respectively; the standard deviations were 1; π 1 =0.01 was used to simulate the data. The quartile- based method was run 100 times first and gave the average FDR at 0.10. Then 31 Non−null mean of stage I Number of true positives Z score Robust Z score Quartile−base method Proposed Data−driven method 2.0 2.5 3.0 0 10 20 30 40 50 60 Figure 4.6: SIMULATION 2: No. OF TRUE POSITIVES VS. u 11 Non−null mean of stage I FDR Z score Robust Z score Quartile−base method Proposed Data−driven method 2.0 2.5 3.0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 Figure 4.7: SIMULATION 2: FDR vs. u 11 the proposed data adaptive method was run with the same FDR at 0.10 for stage II. The quartile-based method returned the number of true positive compounds for stage II 0.84, and the proposed data adaptive method gave 2.79. Therefore, under both the same budget and same FDR, the proposed data adaptive method gave the number of true positive compound that is 3 times higher than that of the quartile-based method. 32 4.5 Application to real HTS data This section applies the proposed method to an HTS study in stem cell biology. The objective of this study is to identify small molecules that may enhance insulin expression ability of pancreatic-like stem cells. In this study, a National Institute of Health Clinical Collection (NCC) compound library was screened. Four hundred forty compounds were screened in this study. At the primary screening stage, the assay method is fluorescence based. The lab technician performed duplicates for each compound at the primary screening stage and the principal investigator of the study used top percentage criteria (top 1%) where the most active compounds are selected. In total 5 were selected as “hits”. In the confirmatory screening stage, the assay method is real-time polymerase chain reaction (RT-PCR) and all 5 “hits” were confirmed at this stage. At the primary screening stage, the actual data from the principle investigator was used as the preliminary data (no replicates). The MCMC algorithm was employed to estimate the unknown parameters. The estimated mean and standard deviation for the null compounds are 1.19 FIU (1 FIU=107 fluorescence intensity unit) and 0.89 FIU; the estimated mean and standard deviation for the positive compounds are 4.27 FIU and 6.32 FIU. The MCMC estimates the proportion of positive compounds as 0.11. 
At the confirmatory screening stage, the actual data from the principle investigator is not applicable since the lab technician only performed the experiment for the 5 “hits”. However, the RT-PCR data from the previous experiments provides the reasonable values for the parameters needed in the computational algorithm. The mean and standard deviation of the null “hits” are 24 Ct (Ct: threshold cycle) and 3 Ct; the mean and standard deviation of the positive “hits” are 19 Ct and 3 Ct. The total budget for this study is $200,000, the 33 cost per replicate at the primary screening stage is $20, and the cost per replicate at the confirmatory screening stage is $50. If 20 replicates at the primary screening stage and 9 replicates at the confirma- tory screening stage are performed and 55 “hits” are selected at stage I in practice, there will be 49 positive compounds identified at the stage II. Compared to the “5” hits that were selected by the principal investigator, the efficiency is increased. 34 Chapter 5 K point normal mixture model In Chapter 4, a special yet very common case, a two point normal mixture model, is assumed. In real HTS data, the data might follow a k point normal mixture model. In Figure 5.1, simulated data from a 3 point normal mixture model is shown as an example. 5.1 Model set up If observed measurementsx 1 ,··· ,x n are from ak point normal mixture model, then x i ∼ k X j=1 p j f j (x i ), (5.1) Histogram of Dat$x Dat$x Density −5 0 5 10 15 20 0.00 0.05 0.10 0.15 0.20 Figure 5.1: THREE POINT NORMAL MIXTURE 35 where f j (1≤ j≤ k)∼N(μ j , σ j ), and 0 <p j <1 satisfying P k j=1 p j = 1. Also define a group label z i as a latent allocation variable where p(z i =j) =p j (5.2) It is easy to show that when r replicates are performed,f j (1≤ j ≤ k)∼N(μ j , σ j / √ r). 5.2 Estimating the value k and associated parameters Estimating the number of unknown points k in a finite mixture model is chal- lenging. Traditional approaches consider different numbers of k seperately and then use some procedures to decide the estimate of k that best represents the data. These procedures use different criteria, including hypothesis test, information cri- teria, classification criteria, and minimum information ratio and related criteria. Sylvia Richardson and Peter Green (Richardson & Green (1997)) presented a novel methodology for fully Bayesian mixture analysis. In this method, a reversible jump Markov chain Monte Carlo (RJMCMC) algorithm that is able to jump between the parameter subspaces corresponding to different values of k was used to estimate k and mixture parameters simultaneously. This method was shown to be accurate, convenient, flexible, and optimal and was used in this application. 36 5.2.1 Prior distributions The prior distributions for μ j and σ − j 2 are: μ j ∼N(ξ,κ −1 ); (5.3) σ − j 2 ∼ Γ(α,β); (5.4) p∼Dirichlet(δ,δ,··· ,δ). (5.5) Wherep=(p j ) k 1 . K follows a uniform prior distribution between 1 and some integer k max . The prior distributions are independent. The choices of prior distributions abovearenaturalwiththeadvantagesofconjugacy, easycomputationsandfeasible interpretations. To be flexible and generalizable, assume that there is no strong knowledge of the values of the hyperparameters. In other words, only weak prior information for the parameters is available. The data can provide fair information for the hyperparameters of μ j . However, the data might not provide valuable insights for the hyperparameters of σ j . 
So, following the guidelines by Sylvia Richardson and Peter Green (Richardson & Green (1997)), β follows: β∼ Γ(g,h). (5.6) The data can provide some weak information for the values of hyperparameters. Let R represent the data range, min as the minimum of the data and max as the maximum of the data, then initially set ξ = (min+max)/2; κ= 1/R 2 ;α = 2; g = 0.2; h = 10/R 2 ;δ= 1. 37 5.2.2 Posterior distributions Through conjugacy, μ j ∼N ( σ − j 2 P i:z i =j x i +κξ σ − j 2 n j +κ , (σ − j 2 n j +κ) −1 ) , (5.7) where n j =#{i:z i =j}, σ − j 2 ∼ Γ α + 1 2 n j ,β + 1 2 X i:z i =j (x i −μ j ) 2 , (5.8) and the proportion p remains Dirichlet: p∼Dirichlet(δ +n 1 ,δ +n 2 ,··· ,δ +n k ). (5.9) The allocation variable follows: p(z i =j)∝ p j σ j exp ( − (x i −μ j ) 2 2σ 2 j ) . (5.10) The random hyperparameter β follows: β∼ Γ(g +κα,h + X j σ − i 2 ). (5.11) 5.2.3 Reversible jumping mechanism The RJMCMC proceeds as follows: 1. update the proportion p; 2. update the parameters (μ,σ); 3. update the allocation variable z; 38 4. update the hyperparameter β; 5. split one point into two, or combine two into one; 6. the birth or death of an empty point (no observations of some point). Steps 5 and 6 change the value ofk by 1 and then all other parameters are updated. Letb k denotetheprobabilityofsplittingandd k =1-b k theprobabilityofcombining. Astothevalueselections,b 1 =1whenk=1sinceonlysplittingcanproceed;b kmax =0 when k is at the maximum value since only combining can proceed. b k takes 0.5 for k=2,3,··· ,k max -1. Then the combining starts by choosing a pair of points (j 1 ,j 2 ) at random. The means of these two points are adjacent so that there is no other mean in between. Then these two points are combined with probability d k leading to the number of points=k-1 now. After combining, there is a new point j ∗ which includes all observations from j 1 and j 2 . Next parameters are updated: p j ∗ =p j 1 +p j 2 ; (5.12) p j ∗μ j ∗ =p j 1 μ j 1 +p j 2 μ j 2 ; (5.13) p j ∗(μ 2 j ∗ +σ 2 j ∗) =p j 1 (μ 2 j 1 +σ 2 j 1 ) +p j 2 (μ 2 j 2 +σ 2 j 2 ). (5.14) Splitting goes in the opposite way to the combining proposal above. A point j ∗ is selected and split into two points j 1 and j 2 with the probability b k . The parameters follow: p j 1 =p j ∗μ 1 , (5.15) p j 2 =p j ∗(1−μ 1 ), (5.16) μ j 1 =μ j ∗−μ 2 σ j ∗ s p j 2 p j 1 , (5.17) 39 μ j 2 =μ j ∗ +μ 2 σ j ∗ s p j 1 p j 2 , (5.18) σ 2 j 1 =μ 3 (1−μ 2 2 )σ 2 j ∗ p j ∗ p j 1 , (5.19) σ 2 j 2 = (1−μ 3 )(1−μ 2 2 )σ 2 j ∗ p j ∗ p j 2 . (5.20) whereμ 1 andμ 2 follow a beta distribution (2,2), andμ 3 follow a beta distribution (1,1). Then all the observations that belong toj ∗ are re-allocated to the new points j 1 and j 2 . The acceptance probability for the splitting is min(1,A), where: A = (likelihoodratio) p(k + 1) p(k) (k + 1) p δ j 1 −1+l 1 p δ j 2 −1+l 2 p δ j ∗ −1+l 1 +l 2 B(δ,kδ) × r κ 2π exp[− 1 2 κ{(μ j 1 −ξ) 2 + (μ j 2 −ξ) 2 − (μ j ∗−ξ) 2 }] × β α Γ(α) σ 2 j 1 σ 2 j 2 σ 2 j ∗ ! −α−1 exp{−β(σ − j 1 2 +σ − j 2 2 −σ − j ∗ 2 )} × d k+1 b k P alloc {g 2,2 (μ 1 )g 2,2 (μ 2 )g 1,1 (μ 3 )} −1 × p j ∗|μ j 1 −μ j 2 |σ 2 j 1 σ 2 j 2 μ 2 (1−μ 2 2 )μ 3 (1−μ 3 )σ 2 j ∗ , where k is the number of points before the split, l 1 and l 2 are the number of observations assigned to the new points j 1 and j 2 after splitting. B is the beta function, P alloc is the probability of this particular allocation, and g denotes the beta density. The likelihood ratio is the ratio of the product of all terms off(x i ) for the new parameter sets to the old ones. 
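The split proposal above amounts to a moment-matching transformation. The sketch below implements equations (5.15)-(5.20); the auxiliary Beta draws written μ1, μ2, μ3 in the text are named u1, u2, u3 here to avoid clashing with the component means, and the starting component values are illustrative. The final line verifies numerically that the combine identities (5.12)-(5.14) hold exactly.

```r
# Split-move proposal of the reversible jump sampler, equations (5.15)-(5.20).
split_component <- function(p, mu, sig2) {
  u1 <- rbeta(1, 2, 2); u2 <- rbeta(1, 2, 2); u3 <- rbeta(1, 1, 1)
  p1 <- p * u1                                        # eq. (5.15)
  p2 <- p * (1 - u1)                                  # eq. (5.16)
  mu1 <- mu - u2 * sqrt(sig2) * sqrt(p2 / p1)         # eq. (5.17)
  mu2 <- mu + u2 * sqrt(sig2) * sqrt(p1 / p2)         # eq. (5.18)
  sig2_1 <- u3 * (1 - u2^2) * sig2 * p / p1           # eq. (5.19)
  sig2_2 <- (1 - u3) * (1 - u2^2) * sig2 * p / p2     # eq. (5.20)
  list(p = c(p1, p2), mu = c(mu1, mu2), sig2 = c(sig2_1, sig2_2))
}

# Check the combine identities (5.12)-(5.14): the weight, first moment, and
# second moment of the split pair match those of the original component.
set.seed(4)
prop <- split_component(p = 0.3, mu = 2, sig2 = 1)
with(prop, c(sum(p), sum(p * mu), sum(p * (mu^2 + sig2))))   # 0.3, 0.6, 1.5
```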
The acceptance probability for combining is min(1,A −1 ). For the birth and death mechanism in step 6, the selection between birth and death is also random following the sameb k andd k definition as above. For making a new point (birth): p j ∗∼beta(1,k) (5.21) 40 μ j ∗∼N(ξ,κ −1 ); (5.22) σ − j ∗ 2 ∼ Γ(α,β). (5.23) Then all the old p j needs to be rescaled to make room for p j ∗. For a death of an empty point, delete any existing empty point at random and update the p j . The acceptance probability for birth and death are min(1,A) and min(1,A −1 ), where A = p(k + 1) p(k) 1 B(kδ,δ) p δ−1 j ∗ (1−p j ∗) n+kδ−k (k + 1) × d k+1 b k (k 0 + 1) 1 g 1,k (p j ∗) (1−p j ∗) k , wherek is the number of points andk 0 is the number of empty points before birth. 5.3 Oracle procedure This procedure assumes that there is an oracle that knows k, μ j , σ j , and p j . This parameter information can be used to simulate the data. Specifically, at stage I, the oracle gives all parameters. Following the random mixture model (5.1), simulated measurements can be generated for a specific value of r 1 . Using the Lfdr multiple testing procedure and a specific value ofq 1 ,|A 1 | is determined. At stage II, following the same random mixture model (5.1) and information of θ 1i from stage I, simulated measurements can be generated. Again, by using the Lfdr procedure and letting q 2 =0.05,| A 2 | is determined. Different combinations of r 1 and q 1 are tested to achieve the maximum|A 2 |. The overall computational algorithm is depicted in Figure 5.2 below. 41 Figure 5.2: ORACLE PROCEDURE FOR K POINT MODEL Oracle: prior knowledge Stage I k,μ j ,σ j ,p j Simulation with r 1 Simulated data: x 11 ,...,x 1m1 r 1 replicates q 1 Lfdrs Decisions: δ 11 ,...,δ 1m1 |A 1 | Oracle: prior knowledge Stage II μ j ,σ j Simulation with r 2 : r 2 calculated by equa- tion (3.1), (3.2) and (3.3) Simulated data: x 21 ,...,x 2m2 r 2 replicates q 2 =0.05 Lfdrs Decisions: δ 21 ,...,δ 2m2 |A 2 | |r 1 |,|q 1 |, and|r 2 | 42 5.4 Data adaptive procedure If the oracle does not exist, then parameter values are not known. In this case, RJMCMC can be employed to estimate all parameters if there are preliminary data. At stage I, there are some preliminary data x 1 = (x 11 ,...,x 1m1 ). Then by using RJMCMC, k, μ j , σ j , p j can be estimated. Following the random mixture model (5.1), simulate measurements for a specific value ofr 1 . By the Lfdr multiple testing procedure and a specific value of q 1 ,| A 1 | is determined. At stage II, assuming some preliminary datax 2 = (x 21 ,...,x 2m1 ), by using RJMCMC,μ j andσ j can be estimated. At stage II, the values of each of θ=(θ 21 ,...,θ 2m2 ) are known from the simulated data of stage I, therefore it is not necessary to estimate proportions p j . Again, by using the Lfdr procedure and letting q 2 =0.05,|A 2 | is determined. Different combinations ofr 1 andq 1 are tested to achieve the maximum|A 2 |. The overall computational algorithm is depicted in Figure 5.2 below. 5.5 Simulation results To evaluate the performance of the proposed methods, the proposed oracle and data adaptive methods were compared with conventional HTS methods: z score method, robust z score method, and quartile based method, via simulation studies. The number of true positive compounds identified at stage II and stage II FDR obtained by different methods were compared. These numbers were compared under various situations. 
In the simulation study, a k-point normal mixture model (equation (5.1)) was used to generate the datasets for both stages, and the generated data were used as the preliminary data. The procedures described in Figure 5.2 and Figure 5.3 were then followed for the proposed oracle and data adaptive procedures, respectively. Table 5.1 shows the budget constraint (equations (3.1), (3.2), and (3.3)) and the number of compounds (m_1) used in the computation.

[Figure 5.3: DATA ADAPTIVE PROCEDURE FOR K POINT MODEL. Flowchart: at stage I, preliminary data x_11, ..., x_1m_1 (no replicates) are fed to RJMCMC to estimate k, μ_j, σ_j, p_j; data are then simulated with r_1 replicates and the Lfdr decisions δ_11, ..., δ_1m_1 at level q_1 give |A_1|. At stage II, preliminary data x_21, ..., x_2m_1 (no replicates) are fed to RJMCMC to estimate μ_j, σ_j; r_2 is calculated from equations (3.1), (3.2), and (3.3), data x_21, ..., x_2m_2 are simulated with r_2 replicates, and the Lfdr decisions δ_21, ..., δ_2m_2 at q_2 = 0.05 give |A_2|. The outputs are r_1, q_1, and r_2.]

Three points (k = 3) were used in the simulation. The parameters used to generate the simulated data are shown in Table 5.2, and the different proportion settings simulated are listed in Table 5.3 (a small data-generation sketch is given at the end of this section).

Table 5.1: BUDGET CONSTRAINT FOR THE SIMULATIONS OF THE K POINT NORMAL MIXTURE MODEL

No. compounds | Total budget | C_1 | C_2
500           | 200000       | 20  | 50

Table 5.2: STAGE PARAMETERS FOR THE SIMULATIONS OF THE K POINT NORMAL MIXTURE MODEL

Stage             | Mean | Variance
Stage I, point 1  | 0    | 1
Stage I, point 2  | 2    | 1
Stage I, point 3  | 3    | 1
Stage II, point 1 | 0    | 1
Stage II, point 2 | 2.5  | 1
Stage II, point 3 | 3    | 1

Table 5.3: PROPORTION PARAMETERS FOR THE SIMULATIONS OF THE K POINT NORMAL MIXTURE MODEL

Proportion of point 1 | Proportion of point 2 | Proportion of point 3
0.7                   | 0.15                  | 0.15
0.8                   | 0.10                  | 0.10
0.9                   | 0.05                  | 0.05

The results are summarized below; each point on the plots represents the mean over 100 simulation replications. Figure 5.4 plots the number of true positive compounds as a function of the proportion of negative compounds, and Figure 5.5 shows the FDR as a function of the proportion of negative compounds.

[Figure 5.4: No. OF TRUE POSITIVES AND FDR VS. NULL PROPORTION. Two panels plotting the number of true positives (0–150) and the FDR (0–1) against the null proportion (0.7, 0.8, 0.9) for the z score, robust z score, quartile-based, and proposed data-driven methods.]

The following observations can be made based on the results of this simulation study:

1. In terms of the number of true positive compounds identified at stage II (Figure 5.4), the proposed data adaptive method identified many more true positive compounds than the conventional methods; the z score, robust z score, and quartile-based methods identified very few. This is expected, since the proposed data adaptive method identifies the optimal number of replicates for each compound, whereas the conventional methods use only one replicate per compound.

2. In terms of stage II FDR (Figure 5.5), the proposed data adaptive method kept the FDR below the prespecified control level of 0.05, while the conventional methods all gave very low FDR values. This is not surprising, since only the proposed data adaptive method explicitly targets FDR control; the conventional methods are clearly very conservative.

3. Overall, in terms of the number of true positive compounds identified, the proposed data adaptive method is much better than the conventional methods.
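To make the data-generation step described at the start of this section concrete, the sketch below shows one way stage I data for the k = 3 setting could be generated from the mixture model (5.1) using the parameters in Tables 5.1–5.3. It is a minimal Python/NumPy illustration; the choice of r_1 = 4 replicates and the averaging of replicates into a per-compound summary are assumptions made for the example, not details stated in this section.

```python
# A minimal sketch of stage I data generation for the k = 3 simulation setting.
# r1 = 4 and the replicate averaging are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2024)

m1 = 500                               # number of compounds (Table 5.1)
props = np.array([0.8, 0.1, 0.1])      # one proportion setting from Table 5.3
means = np.array([0.0, 2.0, 3.0])      # stage I means (Table 5.2)
sds = np.array([1.0, 1.0, 1.0])        # stage I standard deviations (Table 5.2)
r1 = 4                                 # illustrative number of stage I replicates

theta = rng.choice(3, size=m1, p=props)                 # latent component labels
reps = rng.normal(means[theta][:, None],                # r1 replicate measurements
                  sds[theta][:, None], size=(m1, r1))   # per compound
x1 = reps.mean(axis=1)                                  # per-compound summary
```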
Chapter 6

Conclusions and future work

High throughput screening is a large-scale hierarchical process in which the number of investigated compounds can vary from hundreds to millions. HTS stages comprise target identification, assay development, primary screening, confirmatory screening, and follow-up of hits. The selection and use of valid and optimal statistical approaches are important at each stage of HTS. This dissertation focuses on statistical methods for the primary and confirmatory screening stages.

Conventional statistical methods, such as the Z-score, robust Z-score, quartile-based, and strictly standardized mean difference methods, have been used for the selection of compounds as "hits" or "confirmed hits" (Malo et al. (2006); N. et al. (2013)). However, these methods are highly inefficient due to three major issues.

First, the currently used methods ignore the multiple comparison problem. When a large number of compounds are tested simultaneously, the inflation of type I errors (i.e., false positives) becomes a serious issue and may lead to large financial losses in the follow-up stages. The control of type I errors is especially crucial at the confirmatory screening stage because of the larger costs in the "hits follow-up" stage. The family-wise error rate (FWER), the probability of making at least one type I error, is often used to control the multiplicity of errors. However, in HTS studies the FWER criterion is so conservative that it fails to identify most useful compounds. A more cost-effective and powerful framework for large-scale inference is considered in this dissertation; the goal is to control the false discovery rate (FDR).

Second, the data collected in conventional HTS studies often have very low signal to noise ratios. In most HTS analyses, only one measurement is obtained for each compound at the primary screening stage; existing analytical strategies therefore often lead to a high false negative rate and hence inevitable financial losses (since missed findings will not be pursued).

Finally, the current HTS designs are not data-adaptive, and an optimal budget allocation between the primary screening and confirmatory screening stages is not considered. Ideally the budget should be allocated efficiently and dynamically to maximize the statistical power.

This dissertation proposes a new approach to the design and analysis of HTS experiments to address the above issues. By utilizing MCMC techniques, the proposed data-driven design calculates the optimal number of replicates at each stage. The new design promises to significantly increase the signal to noise ratio in HTS data; this effectively reduces the number of false negative findings and helps identify more useful compounds for drug development. By controlling the FDR at the confirmatory screening stage, both the false positive findings and the financial burden on the "hits follow-up" stage can be effectively controlled. Finally, under the proposed computational framework, the funding is used efficiently with an optimized design that allocates the available budget dynamically according to the estimated signal strengths and the expected number of true discoveries. Specifically, the HTS design problem is formulated as a constrained optimization problem in which the goal is to maximize the expected number of true discoveries subject to constraints on the FDR and the study budget.
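Schematically, and using the stage-wise costs C_1 and C_2 reported in Chapter 5, the design problem can be written as follows. The exact form of the budget constraint is given by equations (3.1)–(3.3) in Chapter 3; the version written here is a paraphrase based on the quantities in Table 5.1 and should be treated as an assumption rather than a reproduction of those equations.

```latex
% Schematic statement of the two-stage design problem. The budget constraint as
% written (per-compound costs C_1, C_2 and replicate numbers r_1, r_2) is an
% assumed paraphrase of equations (3.1)-(3.3), not a verbatim reproduction.
\begin{align*}
  \max_{r_1,\; q_1,\; r_2}\quad & \mathbb{E}\,[\text{number of true discoveries in } A_2] \\
  \text{subject to}\quad        & \mathrm{FDR}_{\text{stage II}} \le q_2 = 0.05, \\
                                & m_1\, C_1\, r_1 \;+\; |A_1|\, C_2\, r_2 \;\le\; \text{total budget}.
\end{align*}
```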
Then a simulation-based computational procedure is developed to dynamically allocate the study budget between the two stages and to effectively control the FDR in the confirmatory stage. Simulation studies were conducted to show that, with the same study budget, the proposed method controls the FDR effectively and identifies more useful compounds than conventional methods. A real data example was analyzed to illustrate the implementation and merits of the proposed method.

Effective and powerful strategies and methodologies have been developed for the design and analysis of multistage experiments; however, these existing methods cannot be directly applied to the analysis of HTS data. Satagopan et al. (Satagopan et al. (2004)) proposed a two-stage design for genome-wide association studies; compared with conventional single-stage designs, their two-stage design substantially reduces the study cost while maintaining statistical power. However, the error control issue and the optimal budget allocation between the stages were not considered. Posch et al. (Zehetmayer et al. (2008)) developed an optimized multi-stage design for both FDR and FWER control in the context of genetic studies. Their methods are not suitable for HTS studies, since the differing cost per compound at different stages was not taken into account. Müller et al. (Müller et al. (2004)) and Rossell and Müller (Rossell & Müller (2013)) studied the optimal sample size problem and developed a two-stage simulation-based design in a decision theoretic framework with various utility functions. However, it is unclear how the sample size problem and budget constraints can be integrated into a single design; in addition, the differing stage-wise costs were not considered in their studies. Compared with existing methods, the proposed data-driven procedure in this dissertation simultaneously addresses the error control, measurement cost, and optimal design issues and is particularly suitable for HTS studies.

There are some limitations and open questions related to this research. The proposed method is computationally demanding, and a powerful computer is needed to run the proposed computational algorithm. A normal mixture model is assumed in this dissertation. Although this is a reasonable assumption in some applications, some HTS data might follow skewed normal or skewed t distributions; if so, it would be desirable to extend the theory and methodology to handle those situations.

Reference List

Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological) 57:289–300.

Benjamini Y, Hochberg Y (2000) On the adaptive control of the false discovery rate in multiple testing with independent statistics. Journal of Educational and Behavioral Statistics 25:60–83.

Brideau C, Gunter B, Pikounis B, Liaw A (2003) Improved statistical methods for hit selection in high-throughput screening. Journal of Biomolecular Screening 8:634–647. PMID: 14711389.

Cai TT, Jin J, Low MG (2007) Estimation and confidence sets for sparse normal mixtures. The Annals of Statistics 35:2421–2449.

Chen MH, Shao QM, Ibrahim JG (2000) Monte Carlo Methods in Bayesian Computation. Springer-Verlag.
Chung N, Zhang XD, Kreamer A, Locco L, Kuan PF, Bartz S, Linsley PS, Ferrer M, Strulovici B (2008) Median absolute deviation to improve hit selection for genome-scale RNAi screens. Journal of Biomolecular Screening 13:149–158. PMID: 18216396.

Dove A (2003) Screening for content – the evolution of high throughput. Nature Biotechnology 21:859–864.

Dudoit S, Shaffer J, Boldrick J (2002) Multiple hypothesis testing in microarray experiments. U.C. Berkeley Division of Biostatistics Working Paper Series.

Echeverri CJ, Perrimon N (2006) High-throughput RNAi screening in cultured cells: a user's guide. Nature Reviews Genetics 7:373–384.

Efron B (2004) Large-scale simultaneous hypothesis testing. Journal of the American Statistical Association 99:96–104.

Efron B, Tibshirani R, Storey JD, Tusher V (2001) Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association 96:1151–1160.

Entzeroth M, Flotow H, Condron P (2001) Overview of high-throughput screening. In Current Protocols in Pharmacology. John Wiley & Sons, Inc.

Genovese C, Wasserman L (2004) A stochastic process approach to false discovery control. The Annals of Statistics 32:1035–1061.

Genovese C, Wasserman L (2002) Operating characteristics and extensions of the false discovery rate procedure. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64:499–517.

Gillet VJ (2008) New directions in library design and analysis. Current Opinion in Chemical Biology 12:372–378. PMID: 18331851.

Hochberg Y (1988) A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75:800–802.

Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6:65–70.

Jin J, Cai TT (2007) Estimating the null and the proportion of non-null effects in large-scale multiple comparisons. Journal of the American Statistical Association 102:496–506.

Kümmel A, Parker CN (2011) The interweaving of cheminformatics and HTS. Methods in Molecular Biology 672:435–457. PMID: 20838979.

Macarron R (2006) Critical review of the role of HTS in drug discovery. Drug Discovery Today 11:277–279. PMID: 16580969.

Malo N, Hanley JA, Carlile G, Liu J, Pelletier J, Thomas D, Nadon R (2010) Experimental design and statistical methods for improved hit detection in high-throughput screening. Journal of Biomolecular Screening 15:990–1000. PMID: 20817887.

Malo N, Hanley JA, Cerquozzi S, Pelletier J, Nadon R (2006) Statistical practice in high-throughput screening data analysis. Nature Biotechnology 24:167–175. PMID: 16465162.

Meinshausen N, Rice J (2006) Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. The Annals of Statistics 34:373–393.

Moffat J, Sabatini DM (2006) Building mammalian signalling pathways with RNAi screens. Nature Reviews Molecular Cell Biology 7:177–187.

Müller P, Parmigiani G, Robert C, Rousseau J (2004) Optimal sample size for multiple testing: the case of gene expression microarrays. Journal of the American Statistical Association 99:990–1001.
N. A, C. S, Che T (2013) Data analysis approaches in high throughput screening. In El-Shemy H, editor, Drug Discovery. InTech.

Richardson S, Green PJ (1997) On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society. Series B (Methodological) 59:731–792.

Robbins H (1951) Asymptotically subminimax solutions of compound statistical decision problems. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, pp. 131–149.

Rossell D, Müller P (2013) Sequential stopping for high-throughput experiments. Biostatistics 14:75–86.

Satagopan JM, Venkatraman ES, Begg CB (2004) Two-stage designs for gene-disease association studies with sample size constraints. Biometrics 60:589–597.

Shaffer JP (1995) Multiple hypothesis testing. Annual Review of Psychology 46:561–584.

Storey JD (2002) A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64:479–498.

Storey JD (2003) The positive false discovery rate: a Bayesian interpretation and the q-value. The Annals of Statistics 31:2013–2035.

Sun W, Cai TT (2007) Oracle and adaptive compound decision rules for false discovery rate control. Journal of the American Statistical Association 102:901–912.

Westfall PH, Young SS (1993) Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment. Wiley, New York.

Wright GW, Simon RM (2003) A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics 19:2448–2455. PMID: 14668230.

Xu Y, Shi Y, Ding S (2008) A chemical approach to stem-cell biology and regenerative medicine. Nature 453:338–344.

Zehetmayer S, Bauer P, Posch M (2008) Optimized multi-stage designs controlling the false discovery or the family-wise error rate. Statistics in Medicine 27:4145–4160.

Zhang XD (2007) A new method with flexible and balanced control of false negatives and false positives for hit selection in RNA interference high-throughput screening assays. Journal of Biomolecular Screening 12:645–655. PMID: 17517904.

Zhang XD (2010) Assessing the size of gene or RNAi effects in multifactor high-throughput experiments. Pharmacogenomics 11:199–213. PMID: 20136359.

Zhang XD, Yang XC, Chung N, Gates A, Stec E, Kunapuli P, Holder DJ, Ferrer M, Espeseth AS (2006) Robust statistical methods for hit selection in RNA interference high-throughput screening experiments. Pharmacogenomics 7:299–309. PMID: 16610941.