Close
About
FAQ
Home
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Inference correction in measurement error models with a complex dosimetry system
(USC Thesis Other)
Inference correction in measurement error models with a complex dosimetry system
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Inference Correction in Measurement Error Models with a Complex Dosimetry System by Zhuo Zhang A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulllment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (BIOSTATISTICS) August 2017 Copyright 2017 Zhuo Zhang Dedication To my parents, Yihao Zhang, Jing Li and especially to my wife, Jingyu Chen for their love and support ii Acknowledgements This dissertation will not be possible without the support from many people. Especially, I want to thank my mentor, Dr. Daniel O. Stram, for stimulating discussion, giving insightful advice, and inspiring intellectual curiosity during my senior years working on my dissertation. I would like to thank my advisor, Dr. Meredith Franklin, for her kind support and help throughout my graduate life. I can't be more grateful for their mentorship and friendship. I also want to give special thanks to our collaborators, Dr. Dale L. Preston for invaluable discussions and Dr. Mikhail Sokolnikov for making the data available. I would like to sincerely express my gratitude to my dissertation committee members, Dr. W. James Gauderman, Dr. Juan Pablo Lewinger, and Dr. Jianfeng Zhang, for their involvement and helpful advice. iii Table of Contents Dedication ii Acknowledgements iii List Of Tables vii List Of Figures ix Abstract x Chapter 1: Introduction 1 1.1 Measurement Error in Epidemiology Study . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Types of Measurement Error and its Modeling . . . . . . . . . . . . . . . . . . . . 2 1.2.1 Systematic vs. Random . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2.2 Dierential vs. Non-dierential . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2.3 Shared vs. Unshared . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2.4 Structural vs. Functional . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.5 Classical vs. Berkson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Eects of Measurement Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3.1 Eects on Dose Response Estimation . . . . . . . . . . . . . . . . . . . . . . 4 1.3.2 Eects on Variance Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3.3 Introduced Correlation between Outcomes . . . . . . . . . . . . . . . . . . . 4 1.4 Current Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.4.1 Regression Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.4.2 Simulation Extrapolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.4.3 Likelihood Method & Bayesian Method . . . . . . . . . . . . . . . . . . . . 6 1.5 Complex Dosimetry System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.5.1 Limitations of Traditional Error Model . . . . . . . . . . . . . . . . . . . . . 7 1.5.2 Dose Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.5.3 Idealized Dosimetry System . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.5.4 SUMA Model for Measurement Error . . . . . . . . . . . . . . . . . . . . . 8 1.6 Current Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.6.1 Variance Correction for Shared Error . . . . . . . . . . . . . . . . . . . . . . 9 1.6.2 Relationship with Rosner's Correction . . . . . . . . . . . . . . . . . . . . . 12 1.6.3 Bayesian Model Averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 Chapter 2: Normal-Lognormal Approximation for Condence Intervals 17 2.1 Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2 Excess Relative Risk Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.2.1 Simple Excess Relative Risk Model . . . . . . . . . . . . . . . . . . . . . . . 24 iv 2.2.2 Relative Risk Model with Dose Eect Modier . . . . . . . . . . . . . . . . 26 2.3 Model Extension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.3.1 Model with Known Covariates . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.3.2 Hierarchical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.3.3 Model with Multiple Covariates Measured with Independent Error . . . . . 30 2.4 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.4.1 Simple Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.4.2 Simple ERR Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.4.3 Multiple Covariates with Measurement Error . . . . . . . . . . . . . . . . . 36 2.4.4 ERR Model with Dose Eect Modier . . . . . . . . . . . . . . . . . . . . . 37 Chapter 3: Simulations with the Mayak Worker Cohort 39 3.1 Background on the Mayak Worker Cohort . . . . . . . . . . . . . . . . . . . . . . . 39 3.1.1 Ionizing Radiations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.1.2 Characteristics of the Cohort . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.1.3 Importance and Relevance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.2 Data and Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.2.1 Mayak Worker Dosimetry System 2013 (MWDS-2013) . . . . . . . . . . . . 44 3.2.2 Survival Analysis and Poisson Regression . . . . . . . . . . . . . . . . . . . 46 3.2.3 Poisson Regression Model in Mayak Worker Cohort . . . . . . . . . . . . . 47 3.2.4 Ecient Calculation of M T Cov(XjZ)M . . . . . . . . . . . . . . . . . . . . 49 3.2.5 Dosimetry System Error Components Diagnostics . . . . . . . . . . . . . . 50 3.2.6 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.2.6.1 Dose simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.2.6.2 Survival data simulation . . . . . . . . . . . . . . . . . . . . . . . 53 3.2.7 Lung cancer risk modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.3 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.3.1 simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.3.2 The Null Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.3.3 Bias in Slope Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.3.4 Variance Components Diagnostics with MWDS-2013 . . . . . . . . . . . . . 60 3.3.5 Comparison of MWDS-2013 and MWDS-2008 cumulative doses . . . . . . . 61 3.3.6 Lung cancer survival analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.5 Implementation of a R Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Chapter 4: Comparisons between the Variance Correction Method and Bayesian Methods 70 4.1 Data and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.1.1 Risk Model, Dosimetry Model and Cohort Simulation . . . . . . . . . . . . 71 4.1.2 Bayesian Model Averaging Method . . . . . . . . . . . . . . . . . . . . . . . 72 4.1.2.1 Metropolis-Hastings Algorithm . . . . . . . . . . . . . . . . . . . . 73 4.1.2.2 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.1.2.3 Stochastic Approximation Monte Carlo . . . . . . . . . . . . . . . 76 4.1.3 Full Bayesian Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.1.3.1 Adaptive Metropolis-within-Gibbs Sampling . . . . . . . . . . . . 79 4.1.4 Variance Correction Method . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.2 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.3.1 Unshared Error Only . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.3.2 Shared Error Only . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 v 4.3.3 Shared and Unshared Errors . . . . . . . . . . . . . . . . . . . . . . . . . . 88 4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 Chapter 5: Impact and Future Research 90 Appendix A Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 vi List Of Tables 2.1 Bias, Coverage, Length of CI for Simple Linear Model with Dierent Dosimetry Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.2 Bias, Coverage, Length of CI for Simple ERR Model with Dierent Dosimetry Models with 5000 Subjects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.3 Bias, Coverage, Length of CI for ERR Model with Two Error-Prone Covariates . . 37 2.4 Bias, Coverage, Length of CI for ERR Model with Dose Eect Modier . . . . . . 38 3.1 Condence interval coverage for model parameters with SUMA-U, SUMA-S, SUMA- SU, SUMA-SUP dose uncertainty models . . . . . . . . . . . . . . . . . . . . . . . 57 3.2 Condence interval coverage for model parameters with MWDS-13 internal dose uncertainties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.3 Score Test for the Null Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.4 Estimates and Standard Deviations of b 1 . . . . . . . . . . . . . . . . . . . . . . . . 59 3.5 Variance components diagnostics for MWDS-2013 . . . . . . . . . . . . . . . . . . 61 3.6 Number of monitored workers and total mean cumulative MWDS-2013 plutonium dose distribution summaries by sex and workplace. . . . . . . . . . . . . . . . . . . 63 3.7 Number of monitored workers and total cumulative MWDS-2008 plutonium dose distribution summaries by sex and workplace. . . . . . . . . . . . . . . . . . . . . . 64 3.8 Number of monitored workers and total mean cumulative MWDS-2013 gamma dose distribution summaries by sex and workplace. . . . . . . . . . . . . . . . . . . . . . 65 3.9 Number of monitored workers and total mean cumulative MWDS-2008 gamma dose distribution summaries by sex and workplace. . . . . . . . . . . . . . . . . . . . . . 66 3.10 Lung cancer ERR estimates based on MWDS-2008 and MWDS-2013 dose estimates with corrected-information-matrix-adjusted Wald bounds for the MWDS-2013 es- timates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 vii 4.1 Coverages of Condence Intervals of Dierent Methods with SUMA-U . . . . . . . 80 4.2 Coverages of Condence Intervals of Dierent Methods with SUMA-S . . . . . . . 81 4.3 Coverages of Condence Intervals of Dierent Methods with SUMA-SU . . . . . . 82 4.4 Parameter Estimates with ^ X . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 A.1 Bias, Coverage, Length of CI for Simple ERR Model with Dierent Dosimetry Models with 5000 Subjects (with Dose Known) . . . . . . . . . . . . . . . . . . . . 95 A.2 Bias, Coverage, Length of CI for ERR Model with Two Error-Prone Covariates (with Dose Known) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 A.3 Bias, Coverage, Length of CI for ERR Model with Dose Eect Modier (with Dose Known) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 A.4 Condence interval coverage for model parameters tted with true dose . . . . . . 97 A.5 Condence interval coverage for model parameters with MWDS-13 internal dose tted with true dose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 viii List Of Figures 2.1 Condence interval containing1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2 QQ-Plot of Log-Normal distribution and Its Normal Modication . . . . . . . . . . 22 3.1 Satellite Map of Mayak Production Association . . . . . . . . . . . . . . . . . . . . 40 3.2 Na ve Standard Error of b 1 vs. b 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.3 Empirical distributions of the shared error components in MWDS-2013 . . . . . . . 61 4.1 Illustration of the Bayesian methods with only unshared errors. . . . . . . . . . . 85 4.2 Bayesian methods with only shared errors favor smaller shared errors . . . . . . . . 87 ix Abstract In epidemiological studies, exposures of interest are often measured with uncertainties, which may be independent or correlated. For some important studies, Monte Carlo dosimetry systems that provide multiple realizations of exposure estimates have been used to represent complex error structures. While the eects of independent measurement errors on parameter estimation and methods to correct these eects have been studied comprehensively, the literature on the eects of correlated errors, and associated correction methods is much more sparse. A novel method is derived and implemented that calculates corrected condence intervals based on the approximate asymptotic distribution of parameter estimates in linear excess relative risk models. These models are widely used in survival analysis, particularly in radiation epidemiology. Specically, for the dose eect estimate of interest, a mixture distribution consisting of a normal and a lognormal component is applied. This choice of asymptotic approximation guarantees that corrected condence intervals will always be bounded. A simulation study was conducted to evaluate the proposed method in survival analysis using a realistic ERR model. We used both simulated Monte Carlo dosimetry systems (MCDS) and actual dose histories from the Mayak Worker Dosimetry System 2013, a MCDS for plutonium exposures in the Mayak Worker Cohort. Results show our proposed methods provide much improved coverage probabilities for the dose eect parameter, and noticeable improvements for other model parameters. x The Bayesian model averaging method and a full Bayesian method were examined. The comparison between the Bayesian methods and the proposed method is based on simulation and the result is investigated. xi Chapter 1 Introduction 1.1 Measurement Error in Epidemiology Study In epidemiological studies, exact measurement is required to evaluate the eect of exposure on response. However, as in most observational studies, it is hard to get perfect measurement on exposure information. This situation happens when the exposure variable is subjective, when it is based on past information based on recall, when it involves complicated biological or physical process, or simply because of human data entry error. For example, in many studies in radiation epidemiology, either subjects were exposed long before the study started, or there was long-lasting exposure going on in the past, which renders no easy measurement for total radiation each subject receives. This is particularly true for internal dosimetry. The amount of radiation can thus only be reconstructed through biological and physical model where parameters in the model need to be approximated. In such case, error in measurement is inherited from the study. Generally, measurement error refers to any discrepancy between measured value of a variable and its true value (Thomas, Stram, and Dwyer, 1993). When used with categorical variables, it is often referred as `misclassication'. Also, since statistical models usually included error components for response, in the following we are only discussing errors in exposures. 1 1.2 Types of Measurement Error and its Modeling Measurement error can be classied into dierent categories or they can have components in each categories. In the following, suppose we have a variable of interest whose true value is X, with the measured value Z, and the response variable is Y . 1.2.1 Systematic vs. Random Systematic error means E(ZjX)X6= 0 saying the measured value is not distributed around the true value, while random error means Var(ZjX) > 0, i.e. there is random variability of the measured value. 1.2.2 Dierential vs. Non-dierential Depending on whether the measurement error is also related to the response, we can classify the error to be `dierential' or `non-dierential'. Formally, non-dierential error is dened as Z?YjX, i.e. conditioning on X, Y;Z are independent. This implies that Z is only a surrogate for X, containing no further information about Y when X is given. When this independence is violated, the error is said to be dierential, and (X;Z) will contain more information about Y then X alone since YjX6YjX;Z. 1.2.3 Shared vs. Unshared Further, error in measurement can be shared or unshared. It is called shared when subjects share the same one or more incompletely known dose determinants, and unshared when it is independent among subjects. Error can also have components that belong to the two categories so it has shared components and unshared components which will lead to a complex variance structure of the error. 2 1.2.4 Structural vs. Functional Traditionally, as referred in Thomas, Stram, and Dwyer (1993), structural approach treats X (unobserved) as random variables while it is treated as xed constants or parameters in the functional approach. This classication, referred as `classical structural model' and `classical functional model' by Carroll et al. (2006), brings problem when X is treated as parameters, and maximum likelihood is attempted for estimating X simultaneously with the model parameters. Such treatment works poorly except for the very special case of having Berkson error model in linear regression. Thus a more `fruitful' denition would consider functional modeling of X as making minimal assumptions about the distribution of X while structural modeling takes more aggressive steps to parameterize the distribution of X. 1.2.5 Classical vs. Berkson When (X;Z) are to be modeled, sometimes it is easier and more appropriate to modelX andZjX, and sometimes it is the other way around, i.e. Z andXjZ. The former case is called the classical error model when E(ZjX) =X. The latter can correspond to situation where we have an average measurement for a group of people, say in a village, and individual exposure will be distributed around this average. When E(XjZ) = Z, it is called the Berkson error model, or the error is of Berkson type. WhenX6=Z, we see that in classical error model Var(Z) Var(X), and in Berkson error model Var(X) Var(Z). Also, we can transforme classical error model into Berkson error model by letting Z 0 = E(XjZ);X =Z 0 + (X E(XjZ)), so that E(XjZ 0 ) = E(E(XjZ)jZ 0 ) =Z 0 . Notice E(XjZ) can be obtained through Bayes theorem from classical error model. Of course, this requires information about the distribution of X and ZjX. 3 1.3 Eects of Measurement Error 1.3.1 Eects on Dose Response Estimation Classical error will have an attenuation eect in linear models. Recall that in simple linear regression E(YjX) = a +bX, the best linear predictor is ^ Y = y + xy 2 x (X X ). In classical error model, we have Z = X +U;U? X; Var(U) = 2 u . Then E( ^ b X ) = b; E( ^ b Z ) = Cov(Y;Z) Var(Z) = xy 2 x + 2 u = 2 x 2 x + 2 u b, which means the regression based on Z will give an estimator attenuated by factor = 2 x 2 x + 2 u . Berkson error, however, will not introduce bias to the estimator as will be shown in 1.4.1. In some cases, the error associated with measurement is proportional to the amount of exposure, so that errors are multiplicative more than additive. This is often a key feature of radiation dosimetry. 1.3.2 Eects on Variance Structure Outcomes are generally more variable given Z than given true exposure X, i.e. Var(YjZ) > Var(YjX). In simple linear model, we have Var(YjZ) = Var(YjX) +b 2 Var(XjZ) for both the classical and Berkson error models. This over-dispersion may need to be accounted for to get valid inference. Since Var(YjZ)> Var(YjX), it implies the variance of the estimator will be increased when Z is used. Most measurement error models will have an adverse eect on the power of the analysis with the exception when there is only shared additive error, which is unrealistic. 1.3.3 Introduced Correlation between Outcomes Assume that we have two linear models E(Y 1 jX) = a 1 +b 1 X; E(Y 2 jX) = a 2 +b 2 X with Y 1 ? Y 2 jX. When we have onlyZ with classical error, we have Cov(Y 1 ;Y 2 jZ) = E(Cov(Y 1 ;Y 2 jX)jZ) + Cov(E(Y 1 jX); E(Y 2 jX)jZ) = b 1 b 2 Var(XjZ), which says there might be correlation induced by measurement error. This eect is also called residual confounding. There are important questions in radiation epidemiology related to this. For example, in A-bomb study(Neriishi et al., 1991), 4 survivors who reported a history of epilation have two fold higher risk of leukemia than those without. This could be evidence of dierences in sensitivity between individuals or it could simply result from the random error in dose. It is dicult to distinguish the two possibilities statistically. 1.4 Current Methods Here we review common techniques used to correct for exposure measurement error. 1.4.1 Regression Calibration Regression calibration is a simple, widely used approach for measurement error problem Carroll et al. (2006). The essence of this method is we substitute the true X with its estimate E(XjZ) and use E(XjZ) as if it is the true dose in the regression. For linear model E(YjX) =a +bX, we have E(YjZ) = E(E(YjX)jZ) = E(a+bXjZ) =a+bE(XjZ), so the mean structure is maintained, and we should get unbiased result using E(XjZ). This also implies that Berkson error does not introduce bias in linear models. Regression calibration can be generalized. For example, if we let M be other covariates, and the model is E(YjX;M) = a +f(X)b X +g(M)b M +h(X;M)b XM , then we need to estimate E(f(X)jZ;M) and E(h(X;M)jZ;M). Here f;g;h can be vectors of functions. 1.4.2 Simulation Extrapolation Simulation Extrapolation (SIMEX) is another method to estimate and reduce bias from mea- surement error, by simulating models with added error and extrapolating back to where there is no error (Cook and Stefanski, 1994). The assumption of the method is that the estimates for a model will change continuously with the amount of measurement error involved. Assume the model Y = a +bX +, and we have classical error model with Z = X +U;U? X. By adding 5 additional error, we construct Z() = Z +U 0 ;U 0 ? Z;U 0 U. Then we can use Z();Y to t the model and get ; ^ b. Now we t a non-linear model f such that E( ^ bj) = f(), and the estimate of b without measurement error will be f(1). The heuristics for this is that assume E( ^ ) =T ( 0 ; (1 +) 2 ), whereT is a continuous function and 0 is the true value, then by letting =1, lim !1 E( ^ ) =T ( 0 ; 0) = 0 . There are several parametric candidates for extrapolants, and fully nonparametric spline-based extrapolant can also be used. The model was developed further by Devanarayan and Stefanski (2002) to allow for unknown variance. However, there is obvious disadvantage of this method. It is easy to add additional error if there is only one type of error, and there is only one variable with error. When the types of error are complex, or when multiple variables are with error, or when error is not independent, it is dicult to estimate function T . 1.4.3 Likelihood Method & Bayesian Method Regression calibration and SIMEX method do not assume the distribution of error, which cor- responds to functional approach, while likelihood method and Bayesian method are structural approach, requiring specication of the distribution of the error model. The two methods are very similar in the sense that both of them require the calculation of likelihood of observing Y givenZ. The dierence is likelihood requires numerical integration/summation to get rid of latent variables while Bayesian method needs prior information and usually also Markov Chain Monte Carlo (MCMC) to get the posterior samples of all parameters including latent variables. Both methods are very computationally intensive, and they will take much longer time than the two approaches listed above. The procedures of both methods are given below, and further details can be found in Carroll et al. (2006). Likelihood method 1) Formulate the exposure response model f(YjX;). 2) Specify error model XjZ 6 3) Formulate the likelihood function L(jY;Z) = R f(YjX;)f(XjZ)dX 4) Maximize the likelihood function L(jY;Z) Bayesian method 1) Formulate the exposure response model f(YjX;) 2) Specify error model XjZ 3) Choose prior () 4) Formulate posterior distribution f(;XjY;Z)/f(YjX;)f(XjZ)() 5) Run MCMC 1.5 Complex Dosimetry System 1.5.1 Limitations of Traditional Error Model In traditional measurement error problems as listed above, the errors associated with each subject are treated as independent to each other, or in other words, error are assumed to be unshared. This might not be true in certain cases. Also, since classical error can be thought of as an exposure/dose meter, while Berkson error as an averaging error, there are situations that require the combination of these two error models. These limitations of traditional measurement error models call for a more comprehensive and advanced approach for modeling measurement error. 1.5.2 Dose Reconstruction In some settings, there is no measurement of dose for exposure. Physical and biological processes that model dose origination and transfer are used to reconstruct the true dose each subject receives. This reconstruction can be used for exposure that happens a long time ago, or for a continuous exposure going on for some period. Though dose reconstruction has been used since the study of A-bomb, the idea of accounting for uncertainty in dose reconstruction is adopted only recently. Monte Carlo method have become a widely used tool as both providing dose estimation and 7 representing uncertainty in dose estimation since the Hanford Thyroid Disease Study (Davis et al., 2007), where 100 doses are simulated for each subject. In the Hanford study, each dose realization was calculated based on source terms, environmental transport and uptake of 131 I, which are shared among participants in each realization. Thus in such dosimetry system, there are both shared and unshared components, which will give a complex correlation structure of the measurement error among subjects. 1.5.3 Idealized Dosimetry System In order to have a framework to work on to correct for the eect of errors in a complex dosimetry system, Stram and Kopecky (2003) dened an idealized dosimetry system where we can draw samples from the distributionf(XjW ), whereX is the true dose, andW is the known information about the dose. Here both X and W are vectors, each element associating with one subject. We assume that the error is non-dierential, i.e. W ? YjX. Dene Z = E(XjW ), we also have Z ? YjX (notice Z makes the error model look like Berkson type). Usually, the uncertainty of parameters in the dosimetry system is determined by expert who develop the system. The samples from the dosimetry system will represent the distribution of the true dose based on what is known and the judgement of the expert, and it re ects the uncertainty in dose reconstruction. 1.5.4 SUMA Model for Measurement Error Stram and Kopecky (2003) further proposed a simplication of a complex dosimetry system, referred here as SUMA (Shared Unshared Multiplicative Additive) model, by decomposing it into 4 components. LetZ i be the true dose for subjecti, andZ i be the measured/reconstructed dose, then assume X i = SM Mi Z i + SA + Ai 8 where SM ; Mi are shared and unshared multiplicative error, and SA ; Ai are shared and unshared additive error. Assume all are independent with E( SM ) = E( Mi ) = 1, and E( SA ) = E( Ai ) = 0, which makes E(XjZ) =Z. This dosimetry system serves a good purpose for both theoretical and simulation analysis. 1.6 Current Research Unshared error is relatively easy to adjust for since we have a full set of methods to correct for its eect. The current research is thus on situation where shared and unshared error exist simultaneously. 1.6.1 Variance Correction for Shared Error Stram and Kopecky (2003) used the SUMA model to illustrate the eect of ignoring shared error in such situation. Assume now we are interested in the slope coecient in a simple linear model which takes the form E(YjX) =a +bX, whereY is the response variable. This is practical because linear models are the most widely used in many elds including radiation epidemiology. Even when the model is not linear, around the MLE we can often utilize linear approximations to the model. There are two observations worth noticing here. 1) Unshared error doesn't aect the model very much. Assume there are only unshared error. With SUMA we have X i = Mi Z i + Ai , and thus we get E(Y i jZ i ) = E(E(Y i jX i )jZ i ) = E(a +b( Ai + Mi Z i )jZ i ) =a +bZ i , though the variance structure of Y may be changed when Mi 6= 0. Specically, let =Y E(YjX), then Var(YjZ) = E([b( Mi 1)Z +b Ai +] 2 jZ) = Var(YjX)+b 2 2 Ai I +b 2 2 Mi Z T Z. This change of variance structure leads to reduction in power of least square estimator, but ignoring measurement errors in this case does not lead to biased estimator, nor invalid condence interval of the estimator. 9 2) Now consider only shared error exists. X i = SM Z i + SA . Y i =a + SA +b SM Z i . Condition on SA ; SM , we get E(^ aj SA ; SM ;Z) =a + SA , E( ^ bj SA ; SM ;Z) =b SM . Now Var( ^ b) = Var(E( ^ bj SA ; SM ;Z i )) + E(Var( ^ bj SA ; SM ;Z)) =b 2 2 SM + Var(YjZ) N Var(Z) Hence ignoring shared error will greatly in uence constructing condence interval and con- ducting inference. Notice C ij = Cov(X i ;X j ) = Cov( SM Mi Z i + SA + Ai ; SM Mj Z j + SA + Aj ) =Z i Z j 2 SM + 2 SA C i = Var(X i ) =Z 2 i ( 2 SM + 1)( 2 M + 1) 1 + 2 SA + 2 A We can regress C ij on Z i Z j and C i on Z 2 i to obtain estimates for 2 SM ; 2 M ; 2 SA ; 2 A . Stram & Kopecky then proposed to use the test statistic (b 2 o ^ b) 2 b 2 o 2 SM + Var(Y ) NVar(Z) and 2 1 distribution to construct condence interval. However, this statistic will asymptote to 1 2 SM asjb 0 j!1. This means, we will have un- bounded condence interval whenever 1 2 SM < 2 1;1 . Moreover, this calculation is based on SUMA model, which might deviate from the true underlying error model. The performance of the method wasn't evaluated under dierent scenarios to check whether it is sucient to only account for shared error. 10 Stram et al. (2015) extends the same idea of correcting variance of the estimator while taking a dierent approach. Notice the estimator for the parameters in the model can be approximated using the Fisher Scoring equation ^ I 1 Z S Z where I Z is the information matrix and S Z is the score function. When there is not any mea- surement error we have Var(S Z ) = I Z , and we have the classical result Var( ^ ) = I 1 Z . When measurement error does exist, variance ofS Z will be increased due to the shared error. For linear model above, we have the score function S Z = 1 Var(YjZ) 1 T (Y (a +bZ));Z T (Y (a +bZ)) T with expectation 0, which means the MLE estimator will be unbiased. We need to pay special attention to the statement of unbiasedness here since the expectation is taken over XjZ, which means we are averaging over all possible possible X given Z. However, in real studies, only one realization of X is actually underlying the data. We can calculate the variance of S Z , which will be used to get variance of ^ , Var(S Z ) = E(Var(S Z jX)) + Var(E(S Z jX)) = Var(YjX) Var(YjZ) 2 2 6 6 4 N Z T 1 Z T 1 Z T Z 3 7 7 5 + b 2 Var(YjZ) 2 1 Z T Cov(XjZ) 1 Z which, if we let Z 0 = [1 Z], gives Var 0 B B @ 2 6 6 4 ^ a ^ b 3 7 7 5 1 C C A = Var(YjX)((Z 0 ) T Z 0 ) 1 +b 2 ((Z 0 ) T Z 0 ) 1 (Z 0 ) T Cov(XjZ)Z 0 ((Z 0 ) T Z 0 ) 1 (1.1) The author didn't state how this corrected variance should be used, but instead use Wald test to construct condence interval in an example, which is not a good choice since the variance is highly dependent on the value of b, which means we are missing possible high null value of b, which will make ^ b more variable. However, if we were to use the Score test as proposed in 11 previous method, there is chance we end up with unbounded condence interval, both positively and negatively, which is undesired. 1.6.2 Relationship with Rosner's Correction Rosner, Willett, and Spiegelman (1989) proposed linear approximation method to adjust for classical error with validation data with has similar structure as the variance correction method above. Here we intend to show that variance correction from Section 1.6.1 gives very similar result to Rosner's method. In linear approximation method, the classical error model takes the form X = 0 +Z +;N(0; 2 ) Instead of the logistic model originally used in the paper, we choose the simpler linear model EY = +X Assume we have observed dataset (Z;Y ) and a separate validation dataset (Z ;X ). It is trivial to use the validation data set to t the error model and get ^ 0 ; ^ , and generate E(XjZ) = ^ 0 + ^ Z. Regression calibration can then be used to t the main model to obtain ^ ; ^ , by using Y and E(XjZ). However, going through this path prevents us from obtaining proper condence interval for since we ignored the uncertainty estimating . Rosner suggests using another approach, which he referred to as linear approximation, with the same point estimate obtained while condence interval is correctly estimated. We t the same 12 error model rst with validation dataset. The main model is then tted withY;Z, instead of with Y; E(XjZ), giving ^ . is estimated by ^ = ^ = ^ since E(YjZ) = ( + 0 ) +Z. Delta method can be used to approximate the variance of ^ Var( ^ ) = 1 ^ 2 Var( ^ ) + ( ^ ) 2 ^ 4 Var( ^ ) (1.2) Now we use the variance correction method as in Section 1.6.1 with a Bayesian perspective in the dosimetry system. Assume a non-informative prior on, then the probability density function of the posterior distribution of is proportional to the likelihood function of the error model with the validation dataset (X ;Z ). That means asymptoticly 2 6 6 4 0 3 7 7 5 jX ;Z N 0 B B @ 2 6 6 4 ^ ^ 3 7 7 5 ; (I ) 1 1 C C A with I the information matrix in frequentist's view. Now when we generate realizations from the error model using the posterior distribution of ( 0 ;), and observed Z, we can get E(XjZ) = Z 0 2 6 6 4 0 3 7 7 5 and its covariance matrix, Cov(XjZ) =Z 0 Cov 0 B B @ 2 6 6 4 0 3 7 7 5 1 C C A (Z 0 ) T 13 Let ~ Z = [1; E(XjZ)], and we have ~ Z =Z 0 M where M = 2 6 6 4 1 ^ 0 ^ 3 7 7 5 When we t the main model with Y; E(XjZ), substitute Cov(XjZ) into Equation (1.1) we get Var 0 B B @ 2 6 6 4 ^ ^ 3 7 7 5 1 C C A = Var(YjX)(( ~ Z) T ~ Z) 1 + 2 (( ~ Z) T ~ Z) 1 ( ~ Z) T Z 0 Cov 0 B B @ 2 6 6 4 0 3 7 7 5 1 C C A (Z 0 ) T ~ Z(( ~ Z) T ~ Z) 1 = Var(YjX)(( ~ Z) T ~ Z) 1 + 2 M 1 (I ) 1 (M 1 ) T Knowing that M 1 = 1 ^ 2 6 6 4 ^ ^ 0 1 3 7 7 5 easily we can see for arbitrary 2 2 matrix A, M 1 A(M 1 ) T [2;2] = 1 ^ 2 A [2;2] So we have Var( ^ ) = Var(YjX) ^ 2 ((Z 0 ) T Z) 1 + 2 ^ 2 Var( ^ ) Notice Var( ^ ) = Var(YjZ)((Z 0 ) T Z) 1 , and = , we get Var( ^ ) = Var(YjX) Var(YjZ) Var( ^ ) ^ 2 + ^ 2 2 ( ) 2 ^ 4 Var( ^ ) (1.3) Since Var(YjX) is not evaluable, practically we replace it with Var(YjZ), and we need to estimate with ^ , so that the results from variance correction (Equation 1.3) and Rosner's linear approximation (Equation 1.2) are very comparable. 14 1.6.3 Bayesian Model Averaging Dierently from traditional Bayesian method mentioned above, recently Kwon et al. (2016) pro- posed a Bayesian Model Averaging (BMA) method by treating each dose realization as a `model' that needs to be averaged over all dose realizations. Each dose will be given weight proportional to their posterior distribution. To alleviate convergence issue, they resort to SAMC (Stochastic Approximation Monte Carlo) (Liang, Liu, and Carroll, 2007), which adjusts the weight of each dose as the MCMC goes. Specically, they used modied logistic model P (Y = 1jX;D) = exp P j j X j + log(1 +D) 1 + exp P j j X j + log(1 +D) where D represent the dose vector, and X are covariates measured without error. Notice this probability is quite linear in D over a wide range. Then l(;;Djy;X) = Y i p yi i (1p i ) 1yi with priors chosen as follows j N(0; 1000) Exponential(100) Kwon et al. (2016) provided simulation of their approach. In their simulations, 5000 doses are generated. In each simulation, in order to validate their methods, one of the 5000 is selected as the true value and other doses are used as samples to form the models to be averaged. The credible interval is calculated from the posterior samples combined with the weight associated with each dose. The author also introduced BMA used with conditional mean and conditional median of doses conditioned on shared parameters. 15 The performance of the BMA method regarding 95% condence interval is evaluated under 2 exposure scenarios in Kwon et al. (2016). In scenario 1 where the amount of uncertainty is relatively small, the coverage for Bayesian method is 95.6% and is 94.4% for naive regression with median doses. In scenario 2, BMA gives high inclusion percentage (70% to 91%) while conventional regression with median dose only have 52.2%. When relative bias (percentage dierence of the estimates from the true value) is considered, the authors found that BMA have smaller bias compared to conventional method. Further, with conditional mean and conditional median, smaller bias can be achieved, consistent with higher coverage of these two approach. There are two things worth noticing here: Median dose was used for simulation for conventional method. While it is known that with only unshared error, the mean dose should give unbiased estimate, the property of the median dose wasn't known. The simulation claims naive method gives high bias. This contradict the conclusion from Stram and Kopecky (2003) and Stram et al. (2015) where conventional method ignoring measurement error is shown to have unbiased estimate, when averaging over the realizations. 16 Chapter 2 Normal-Lognormal Approximation for Condence Intervals The improvement in dose reconstruction gives us complex system usually represented by dose realizations, as discussed in Chapter 1.5. Here we will be working on a complex dosimetry system represented by the SUMA model. Let X be the true dose underlying the data, which is unobsered. We observe Z instead. X and Z has the relation E(X) = Z. In the SUMA model, X i = Z i SM Mi + SA + Ai where i indexes subject, E( SM ) = E( Mi ) = 1, E( SA ) = E( Ai ) = 0. The restriction on expectation of these random errors guarantees the error is Berkson-type. An important observation is that in one study, SM and SA is shared among all subjects invovled. It is assumed that we can sample dose realizations from the conditional distribution XjZ. 2.1 Linear Model We start with linear model E(YjX) = a +bX. We see that the variance correction method in Stram and Kopecky (2003) and Stram et al. (2015) gives corrected variance for the estimator. However, if the distribution of the estimator is treated as normal, asb!1, the score test statistic has asymptote ( ^ bb) 2 (I 1 Z ) 2;2 +b 2 (I 1 Z M T Var(XjZ)MI 1 Z ) 2;2 ! 1 (I 1 Z M T Var(XjZ)MI 1 Z ) 2;2 17 whereM [i;] = ( 1 a+bZi ; Zi a+bZi ),i indexing individuals. Since the statistic is continuous, it is bounded above in the region [ ^ b;1). This can possibly lead to unbounded condence interval, which is not desirable (Figure 2.1). Notice when the upper condence interval is unbounded, the lower condence interval will also be unbounded. −100 −50 0 50 100 150 200 2 3 4 5 6 b under H 0 Test statistic Score Statistic χ 2 = 3.84 Figure 2.1 Condence interval containing1 Stram and Kopecky (2003) showed that E( ^ bj SA ; SM ;Z) =b SM and we also know that ^ bj SA ; SM ;Z should be asymptotically normal from maximum likelihood theory. The following result gives us the distribution of the slope estimator. Theorem 1. Under the SUMA modelX i =Z i SM Mi + SA + Ai ; 1in; E( SM ) = E( Mi ) = 1; E( SA ) = E( Ai ) = 0; Var( Mi ) = 2 M ; Var( Ai ) = 2 A , where X i is the unobserved true dose, Z i = E(X i ) is observed, consider the ordinary least square (OLS) slope estimator ^ b n of the simple linear regression modelY i =a+bX i + i , with E( i ) = 0; Var( i ) = 2 , Mi ; Ai ; i are independent. 18 1. When 2 M = 0, E Z 2 i <1 and max 1in Z i = p o(n 1=2 ), we have ^ b n ! d b SM +Q where QN 0; 2 + 2 A nVar(Z) ;Q? SM 2. When 2 M > 0, E Z 4 i <1 and max 1in Z 2 i = p o(n 1=2 ), we have ^ b n ! d b SM N 1 +N 2 where N 1 N 1; E(Z 2 (Z E(Z)) 2 ) nVar(Z) 2 2 M ;N 2 N 0; 2 + 2 A nVar(Z) and SM ;N 1 ;N 2 are independent. Proof. Let Z be a column vector offZ i ; 1ing. Similarly we dene X; Y, and dene 1 be a column vector of 1, M =f Mi ; 1in, A =f Ai g; 1ing; =f i ; 1ing. 1. When Mi 1, the OLS estimator ^ b n takes the form ^ b n j SM ; SA = (Z 1 n 11 T Z) T Y (Z 1 n 11 T Z) T (Z 1 n 11 T Z) = (Z 1 n 11 T Z) T (a1 +b SM Z + SA 1 + A +) Z T Z 1 n Z T 11 T Z =b SM + (Z 1 n 11 T Z) T ( A +) Z T Z 1 n Z T 11 T Z When max 1in Z i = p o(n 1=2 ) 19 we have max p n (Z 1 n 11 T Z) T Z T Z 1 n Z T 11 T Z = p o(1) and when E Z 2 <1, n (Z 1 n 11 T Z) 2 (Z T Z 1 n Z T 11 T Z) 2 = n Z T Z 1 n Z T 11 T Z a:s: ! 1 Var(Z) So the second term has asymptotic normal distribution from Lindeberg-Feller Central Limit Theorem (2.27 from Vaart (2000)). Let P = SM ;Q n = (Z 1 n 11 T Z) T ( A +) Z T Z 1 n Z T 11 T Z , then P?Q n , and thus ^ b n jP =bP +Q n , Q n ! d Q where QN 0; 2 + 2 A nVar(Z) . Since E e it ^ b = E(E e it ^ b jP ) = E e itbP E e itQn ! E e it(bP+Q) Applying Continuity theorem (3.3.6 from Durrett (2010)) we have ^ b n ! d b SM +Q where QN 0; 2 + 2 A nVar(Z) 2. When Var( Mi ) = 2 M > 0, we have ^ b n j SM ; SA = (Z 1 n 11 T Z) T (b SM diag(Z) M + A +) Z T Z 1 n Z T 11 T Z =b SM Z T (diag(Z) ( 1 n Z T 1)I) M Z T Z 1 n Z T 11 T Z + (Z 1 n 11 T Z) T ( A +) Z T Z 1 n Z T 11 T Z 20 From (1) we know the asymptotic distribution of the second term. Notice n Z T (diag(Z) ( 1 n Z T 1)I) Z T Z 1 n Z T 11 T Z 2 = (Z T (diag(Z) ( 1 n Z T 1)I)) 2 =n (Z T Z 1 n Z T 11 T Z) 2 =n 2 = P i Z 2 i (Z i Z) 2 =n (Z T Z 1 n Z T 11 T Z) 2 =n 2 a:s: ! E(Z 2 (Z E(Z)) 2 ) Var(Z) 2 When we have max 1in Z 2 i = p o(n 1 =2), we can apply Linderberg-Feller CLT to conclude that Z T (diag(Z) ( 1 n Z T 1)I) M Z T Z 1 n Z T 11 T Z ! d N 1; E(Z 2 (Z E(Z)) 2 ) nVar(Z) 2 2 M Using the continuity theorem of characteristic function as in (1), we conclude that ^ b n ! d b SM N 1 +N 2 where N 1 N 1; E(Z 2 (Z E(Z)) 2 ) nVar(Z) 2 2 M ;N 2 N 0; 2 + 2 A nVar(Z) One observation is that the distribution of ^ b is largely related to the distribution of SM , the shared multiplicative error. We can see in the proof that the normal distribution part comes from central limit theorem which depends on large sample theory while the SM part does not change with the sample size. This also illustrate a diculty rarely seen in other situation, that we cannot estimate b exactly even if we have innite number of subjects because of shared multiplicative error. 21 0 5 10 15 20 0 2 4 6 8 10 X Y Figure 2.2 QQ-Plot of Log-Normal distribution and Its Normal Modication Usually in a dosimetry system, the dose is constructed based on some models and there are global parameters shared among all subjects in the study. Such parameters usually have multiplicative eect since arithmetically, large errors are often associated with large observations. When these parameters are treated as random variables as in these dosimetry systems, the overall eect can be approximated by log-normal distribution from the central limit theorem. This means SM should approximately have log-normal distribution, and in the following, we assume that SM is log-normal distributed. Another observation is, when there is no unshared multiplicative error, we have ^ b distributed as a sum of a log-normal distribution and a normal distribution, and when there is shared mul- tiplicative error, the log-normal distribution is modied by a normal distribution related to the unshared multiplicative errorr. However, we should see that the variance of the modifying normal distribution is diminishing withn, the number of subjects, and the log-normal distribution should be dominating with moderate number of subjects. This can be seen from a simple simulation in Figure 2.2, where we have X has log-normal distribution, with expectation 1 and standard deviation 1, while Y has the same log-normal distribution but is modied by a normal random variable with mean 1 and SD 0.1. We sample 1,000 points from X and Y . The QQ-plot is close to a straight line, except at the extreme. 22 We suggest to approximate the distribution of ^ b by ^ bb logN(1;V 2 ) + N(0;V 1 ), with V 1 ;V 2 to be determined below. We will use dose realizations to nd V 1 ;V 2 . Specically, we want to match the variance of the approximation to that from the sample as given by Stram et al. (2015). Hence we assume ^ bV +W;V ?W;V blogN(1;V 2 );W N(0;V 1 ); Var( ^ b) =V 1 +b 2 V 2 and V 1 ;V 2 can be obtained as follows V 1 = Var(YjX) ((Z 0 ) T Z 0 ) 1 2;2 ;V 2 = ((Z 0 ) T Z 0 ) 1 (Z 0 ) T Cov(XjZ)Z 0 ((Z 0 ) T Z 0 ) 1 2;2 There are several things worth noticing here, Score test based on this approximation will not contradict previous tests against H 0 :b = 0 where measurement error is ignored. This follows since under H 0 : b = 0, Y ? X) Y ? Z) Var(YjX) = Var(YjZ)) ^ b N(0; (I 1 Z ) 2;2 ), which is as if we ignored measurement error. In general, Var(YjX) cannot be estimated directly and is essentially unidentiable since X is not observed. In the SUMA setting, this is very close to Var(YjZ; SA ; SM ) when b is small. We choose to replace this by using the residuals of tting Y to Z. This will partly improves the approximation by absorbing the normal part in b logN of the approximation resulting from unshared error making the approximation more conservative. The condence interval can always be bounded. We use Algorithm 1 to get upper and lower bound for the condence interval, represented by b u ;b l respectively. For the lower bound b l , by construction, we have P (b l logN(1;V 2 ) t l ) = 1 2 , and P (N(0;V 1 ) ^ bt l ) = 1 2 . Thus P (b l logN(1;V 2 ) +N(0;V 1 ) ^ b)P (b l logN(1;V 2 )t)P (N(0;V 1 ) ^ b t l ) = (1 2 ) 2 1 . So P (b l logN(1;V 2 ) + N(0;V 1 ) > ^ b) < . Similarly we 23 have P (b u logN(1;V 2 ) +N(0;V 1 ) ^ b) < . So we have valid upper and lower bounds for the condence interval. These bounds can also be used for obtaining the condence interval through numerical optimization procedure since the cumulative distribution function is increasing and we can nd solutions of cdf corresponding to dierent values. Algorithm 1 Find Upper and Lower Bounds for Condence Interval lnVar =log(1 +V 2 ) lnMean = lnVar 2 if t l = ^ bz 1 2 p V 1 < 0 then b l = t l lnMean+z =2 p lnVar else b l = t l lnMean+z 1=2 p lnVar end if if t u = ^ bz 2 p V 1 < 0 then b u = tu lnMean+z 1=2 p lnVar else b u = tu lnMean+z =2 p lnVar end if return b l ;b u 2.2 Excess Relative Risk Model 2.2.1 Simple Excess Relative Risk Model Another model of interest is the excess relative risk (ERR) model, which is widely used in survival analysis. We begin by examing a simple ERR model with only one covariate of the form E(YjX) = exp(a)(1 +bX), where YjX has a Poisson distribution. Use the SUMA model, we try to nd the eect of unshared errors and shared errors. The score function and information matrix is given below S = X i (Y i exp(a)(1 +bX i )); X i Y i exp(a)(1 +bX i ) 1 exp(a)X i ! T I = 0 B B @ P i exp(a)(1 +bX i ) P i exp(a)X i P i exp(a)X i P i exp(a)X 2 i (1+bXi) 1 C C A 24 1. When there are only unshared errors Mi ; Ai , E(YjZ) = E(exp(a)(1 +bX)jZ) =exp(a)(1 + bZ); Var(YjZ) = E(exp(a)(1+bX)jZ)+Var(exp(a)(1+bX)jZ) = exp(a)(1+bZ)+b 2 exp(a) 2 (Z 2 2 Mi + 2 Ai ). IfZ was used in place ofX to t the model, we nd that the score function has expec- tation 0, which means that the estimates will remain asymptotically unbiased. Also we see that the variance structure is changed, as Y no longer has Poisson distribution conditional onZ since now its variance is greater than its expectation. However, this variance in ation can often be ignored when the attributable risk b exp(a) is small. Also the estimating procedure stays valid since the score function has expectation 0 condi- tional on Z, though there might be some loss of eciency due to variance in ation. 2. When there are only shared error SM ; SA , and assume ^ b X is the MLE whenX was used, ob- serve that E(Y i jZ; SM ; SA ) = exp(a)(1+b(Z SM + SA )) = exp(a)(1+b SA )(1+ b SM 1+b SA Z), so we have ^ b Z consistently estimates SM 1+b SA b. When there is no shared additive error, we have the previous result that ^ b Z = SM ^ b X . The analysis above suggests that the normal-lognormal approximation should work well with the SUMA model when there is no shared additive error. To use the approximation ^ bV +W;V ?W;V blogN(1;V 2 );W N(0;V 1 ) we want to match the variance as in linear model. Using Fisher's Scoring ^ I 1 Z S Z , we get Var(S Z jZ) = Var(E(S Z jX)jZ) + E(Var(S Z jX)jZ) = Var X i b exp(a)(X i Z i ); X i b(X i Z i ) 1 +bZ i exp(a)Z i ! jZ ! +I Z =b 2 M T Cov(XjZ)M +I Z 25 where M [i;] = exp(a)(1; Zi 1+bZi ) Var( ^ b) = I 1 Z 2;2 +b 2 I 1 Z M T Cov(XjZ)MI 1 Z 2;2 Hence V 1 = I 1 Z 2;2 , V 2 = I 1 Z M T Cov(XjZ)MI 1 Z 2;2 . Dierent from linear model, bothI Z andM both depend onb. However, if we think of Var( ^ b) as containing two components, one from variability in Y and one from variability in XjZ, the quantity I 1 Z and I 1 Z M T Cov(XjZ)MI 1 Z evaluated at ^ b should would as a good representation of such variability, and then we can proceed to calculate the condence interval. Since the ERR model is linear in X, we expect this treatment to be reasonable. 2.2.2 Relative Risk Model with Dose Eect Modier In radiation epidemiology study, a more complete model would be the ERR model with possible modier on the eect of exposure Specically, Stram et al. (2015) used the model E(Y i ) = exp(a +a 1 A i )(1 +bX i exp(a 2 C i )) Here A i modies the baseline rate of disease while C i modies the eect of the exposure. More generally, we can have A i ;C i be vectors, so that the above becomes E(Y i ) = exp 0 @ a + X j a 1;j A i;j 1 A 1 +bX i exp X k a 2;k C i;k !! 26 Using the same notation as above, and from similar derivation as above, we can obtain S = 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 P i Y i exp(a + P j a 1;j A i;j ) (1 +bX i exp ( P k a 2;k C i;k )) P i A i;1 Y i exp(a + P j a 1;j A i;j ) (1 +bX i exp ( P k a 2;k C i;k )) . . . P i A i;J Y i exp(a + P j a 1;j A i;j ) (1 +bX i exp ( P k a 2;k C i;k )) P i Yi exp(a+ P j a1;jAi;j )(1+bXi exp( P k a 2;k C i;k)) 1 exp(a + P j a 1;j A i;j )X i exp ( P k a 2;k C i;k ) P i Yi exp(a+ P j a1;jAi;j )(1+bXi exp( P k a 2;k C i;k)) 1 exp(a + P j a 1;j A i;j )bX i C i;1 exp ( P k a 2;k C i;k ) . . . P i Yi exp(a+ P j a1;jAi;j )(1+bXi exp( P k a 2;k C i;k)) 1 exp(a + P j a 1;j A i;j )bX i C i;K exp ( P k a 2;k C i;k ) 3 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 5 I Z = X i exp 0 @ a + X j a 1;j A i;j 1 A 1 +bZ i exp X k a 2;k C i;k !! P i P T i P i = 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 1 A i;1 . . . A i;J Zi exp( P k a 2;k C i;k) 1+bZi exp( P k a 2;k C i;k) Ci;1bZi exp( P k a 2;k C i;k) 1+bZi exp( P k a 2;k C i;k) . . . C i;K bZi exp( P k a 2;k C i;k) 1+bZi exp( P k a 2;k C i;k) 3 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 5 and M [i;] = exp 0 @ a + X j a 1;j A i;j + X k a 2;k C i;k 1 A P T i Then Var( ^ b) = I 1 Z 2;2 +b 2 I 1 Z M T Cov(XjZ)MI 1 Z 2;2 27 and we choose V 1 ;V 2 similarly as in previous. Testing of a 1;j ;a 2;k will be addressed in 2.3.1. 2.3 Model Extension We now extend the method for correcting standard error of parameter estimates to two scenarios. 2.3.1 Model with Known Covariates Assume we are considering linear model of form E(YjX;W ) =a +bX + T W where W denotes variables measured without error. Under SUMA model, we see that E(YjZ;W; SA ; SM ) = (a +b SA ) +b SM Z + T W Notice ^ is unbiased since E( ^ ) = E(E( ^ j SA ; SM )) = . Var( ^ ) = E(Var( ^ j SA ; SM )) + Var(E( ^ j SA ; SM )) = Var( ^ j SA ; SM ), where the last equality holds since the variance is not dependent on SM ; SA . This speculation on SUMA model suggests we can proceed as if we don't have measurement error in the model and still get valid inference for while the inference for b stays the same as before and calls for the corrected variance. 2.3.2 Hierarchical Model In many studies, there are inherent hierarchies in the structure of the study population and observations. Multi-level models are usually used in such circumstances. 28 Assume we are working on a simple two-level hierarchical model with the form Y j =a j 0 +b j 0X j + j ; Var( j ) = 2 a j 0 =a + aj 0; Var( aj 0) = 2 a b j 0 =b + bj 0; Var( bj 0) = 2 b which can be simplied to E(YjX) =a +bX; Var(YjX) = = 2 I + 2 a K + 2 b XKX T where K is a block diagonal matrix characterizing the structure of the data. Generally, a hierar- chical model can be represented by E(YjX) =a +bX; Var(YjX) = = 2 I + X i i K i When X is known, the model can be tted using weighted least square and the estimator (^ a; ^ b) will be 2 6 6 4 ^ a ^ b 3 7 7 5 = ((X 0 ) T 1 X 0 ) 1 (X 0 ) T 1 Y whereX 0 is the extended design matrix. When onlyZ is observed instead ofX. Now Var(YjZ) = E(Var(YjX)jZ) + Var(E(YjZ)jX) = +b 2 Cov(XjZ). Thus we have V 1 = ((Z 0 ) T 1 Z 0 ) 1 2;2 ;V 2 = ((Z 0 ) T 1 Z 0 ) 1 Z 0T 1 Cov(XjZ) 1 Z 0 ((Z 0 ) T 1 Z 0 ) 1 2;2 Inclusion of other covariates without measurement error can be easily achieved as described in 2.3.1. 29 2.3.3 Model with Multiple Covariates Measured with Independent Error We start with only 2 error prone covariates in the model, and assumeX 1 jZ 1 ;Z 2 =X 1 jZ 1 ;X 2 jZ 1 ;Z 2 = X 2 jZ 2 , and X 1 jZ 1 ?X 2 jZ 2 . For ERR model, the model has form E(YjX) = exp(a)(1 +b 1 X 1 +b 2 X 2 ). The Score function and information matrix can be obtained to be as follows S X = 0 B B B B B B @ P i (Y i exp(a)(1 +b 1 X 1;i +b 2 X 2;i )) P i Yi exp(a)(1+b1X1;i+b2X2;i) 1 exp(a)X 1;i P i Yi exp(a)(1+b1X1;i+b2X2;i) 1 exp(a)X 2;i 1 C C C C C C A I X = 0 B B B B B B @ P i exp(a)(1 +b 1 X 1;i +b 2 X 2;i ) P i exp(a)X 1;i P i exp(a)X 2;i P i exp(a)X 1;i P i exp(a)X 2 1;i (1+b1X1;i+b2X2;i) P i exp(a)X1;iX2;i (1+b1X1;i+b2X2;i) P i exp(a)X 2;i P i exp(a)X1;iX2;i (1+b1X1;i+b2X2;i) P i exp(a)X 2 2;i (1+b1X1;i+b2X2;i) 1 C C C C C C A Using the same technique as in 2.2, we have Var(S Z jZ) = E(Var(S Z jX)) + Var(E(S Z jX)) =I Z + Var 0 B B B B B B @ 0 B B B B B B @ P i exp(a)(b 1 (X 1;i Z 1;i ) +b 2 (X 2;i Z 2;i )) P i 1+b1X1;i+b2X2;i 1+b1Z1;i+b2Z2;i 1 exp(a)Z 1;i P i 1+b1X1;i+b2X2;i 1+b1Z1;i+b2Z2;i 1 exp(a)Z 2;i 1 C C C C C C A 1 C C C C C C A =I Z +M T Cov(b 1 X 1 +b 2 X 2 jZ)M =I Z +b 2 1 M T Cov(X 1 jZ 1 )M +b 2 2 M T Cov(X 2 jZ 2 )M where M i; = exp(a) 1; Z1;i 1+b1Z1;i+b2Z2;i ; Z2;i 1+b1Z1;i+b2Z2;i . Notice that sinceX 1 ?X 2 jZ 1 ;Z 2 , the shared errors associated with each are also independent, which implies the variance due to the shared component of the errors should be quantied by the associated quadratic term. Also since the estimated I Z already contains the part from unshared 30 error, we can use I Z as the variance of the normal part of the approximation. This argument leads us to following choice: 1) Inference for b 1 , V 1 = I 1 Z 2;2 ;V 2 = I 1 Z M T Cov(X 1 jZ 1 )MI 1 Z 2;2 2) Inference for b 2 , V 1 = I 1 Z 3;3 ;V 2 = I 1 Z M T Cov(X 2 jZ 2 )MI 1 Z 3;3 This conguration can be extended to model with more than 2 error-prone covariates as long as the errors for dierent covariates are independent from each other. 2.4 Simulations We use 2 dosimetry models to evaluate the variance correction method for linear model and simple ERR model. Also we evaluate dierent extensions of this method with simulations. The dosimetry models are SUMA model and hierarchical SUMA model. The SUMA model has been described in 1.5.4, and hierarchical SUMA is an extension of SUMA model by introducing hierarchical error components, i.e. errors that are shared by dierent groups in the cohort. This allows for modeling more complicated error structure and it should be closer to a real dosimetry system. Here we restrict that all the partially shared errors in the hierarchy are multiplicative. Formally, we have X i = SM Mi m(i) k(i) Z i + SA + Ai where m(i)2 [M];k(i)2 [K], where M;K are the number of groups. We use M andK to dene groups in the cohort because they are dening partially shared errors in dierent aggregation patterns. We assume here that M represents a smaller aggregation pattern and MM represents a relatively larger aggregation pattern. For example, in a radiation exposure-response study, the dose might be constructed for people in several villages, and the number of villages is xed (K), i.e. not increasing with the number of subjects included in the study. However, there can be 31 shared errors among family members, and the number of such groups (M) will increase with the number of total subjects. The dosimetry models are parameterized as follows, 1. SUMA1 { Only shared errors. Set Z 4 2 4 , 2 SM = 1, 2 SA = 0:25. 2. SUMA2 { Only multiplicative errors. Set Z 4 2 4 , 2 SM = 1, 2 Mi = 0:25. 3. SUMA3 { All 4 errors. Set Z 4 2 4 , 2 SM = 0:25; 2 Mi = 1, 2 SA = 0:25; 2 Ai = 0:25. 4. HSUMA1 { small 2 k , no m . Set 2 k = 0:25. 5. HSUMA2 { big 2 k , no m . Set 2 k = 1 6. HSUMA3 { small 2 k , small 2 m . Set 2 k = 0:25, 2 m = 0:25. 7. HSUMA4 { big 2 k , big 2 m . Set 2 k = 1, 2 m = 1. For linear models, for HSUMA1-HSUMA4, we set 2 SM = 2 Mi = 2 SA = 2 Ai = 0:25. For ERR models, we set 2 SM = 2 Mi = 0:25; 2 SA = 2 Ai = 0 since additive errors are rarely seen in radiation dosimetry, where zero dose is often very well estimated while higher dose are subject to greater error arithmetically. 2.4.1 Simple Linear Model We use simple linear model Y =a +bX +; N(0; 2 ) to evaluate the method compared with naive method where no correction is used. We generate 1000 simulation dataset, where in each dataset, 100 subjects are included as follows (i) True exposure X are generated as one realization from each of the given dosimetry model, with E(XjZ) =Z, where Z 4 2 4 , which has E(Z) = 16. (ii) 500 realizations from the same dosimetry model are generated, and used to estimate the matrix Cov(XjZ), E(XjZ). 32 (iii) Y is generated from the above linear model, with a = 1;b = 2, having variance 2 chosen from the setf0:01; 0:25; 1; 4g. Table 2.1 Bias, Coverage, Length of CI for Simple Linear Model with Dierent Dosimetry Models Bias(SD) Coverage y Length of CI z 2 Dosimetry Corrected Uncorrected Corrected Uncorrected 0.01 SUMA1 0:057(2:222) 0:941(0:027; 0:032) 0:001(0:310; 0:689) 9:012 0:004 SUMA2 0:014(1:910) 0:952(0:025; 0:023) 0:151(0:249; 0:600) 10:463 0:478 SUMA3 0:019(1:279) 0:967(0:020; 0:013) 0:433(0:171; 0:396) 5:930 1:114 HSUMA10:059(1:270) 0:953(0:022; 0:025) 0:248(0:229; 0:523) 5:334 0:695 HSUMA20:078(1:659) 0:954(0:023; 0:023) 0:285(0:175; 0:540) 7:855 0:828 HSUMA30:088(1:450) 0:962(0:017; 0:021) 0:310(0:178; 0:512) 6:652 0:780 HSUMA40:073(2:696) 0:962(0:020; 0:018) 0:258(0:149; 0:593) 13:034 0:861 0.25 SUMA1 0:057(2:222) 0:940(0:027; 0:033) 0:004(0:308; 0:688) 9:004 0:018 SUMA2 0:014(1:910) 0:952(0:025; 0:023) 0:153(0:248; 0:599) 10:475 0:479 SUMA3 0:019(1:279) 0:967(0:020; 0:013) 0:431(0:170; 0:399) 5:933 1:115 HSUMA10:059(1:270) 0:952(0:022; 0:026) 0:246(0:231; 0:523) 5:347 0:696 HSUMA20:078(1:659) 0:954(0:023; 0:023) 0:284(0:176; 0:540) 7:856 0:827 HSUMA30:088(1:450) 0:962(0:017; 0:021) 0:310(0:178; 0:512) 6:654 0:781 HSUMA40:073(2:696) 0:963(0:020; 0:017) 0:259(0:149; 0:592) 13:049 0:861 1 SUMA1 0:057(2:222) 0:939(0:027; 0:034) 0:006(0:307; 0:687) 9:008 0:035 SUMA2 0:014(1:910) 0:953(0:025; 0:022) 0:153(0:248; 0:599) 10:498 0:482 SUMA3 0:019(1:279) 0:966(0:020; 0:014) 0:430(0:170; 0:400) 5:933 1:115 HSUMA10:059(1:269) 0:953(0:022; 0:025) 0:248(0:231; 0:521) 5:356 0:696 HSUMA20:078(1:659) 0:954(0:023; 0:023) 0:285(0:176; 0:539) 7:855 0:829 HSUMA30:088(1:450) 0:961(0:017; 0:022) 0:310(0:177; 0:513) 6:658 0:782 HSUMA40:073(2:696) 0:963(0:020; 0:017) 0:260(0:148; 0:592) 13:068 0:860 4 SUMA1 0:057(2:221) 0:937(0:027; 0:036) 0:013(0:303; 0:684) 9:031 0:071 SUMA2 0:013(1:910) 0:955(0:025; 0:020) 0:153(0:248; 0:599) 10:525 0:486 SUMA3 0:020(1:279) 0:967(0:020; 0:013) 0:430(0:171; 0:399) 5:913 1:119 HSUMA10:059(1:269) 0:954(0:022; 0:024) 0:251(0:230; 0:519) 5:363 0:697 HSUMA20:078(1:659) 0:954(0:023; 0:023) 0:290(0:175; 0:535) 7:879 0:829 HSUMA30:088(1:451) 0:959(0:018; 0:023) 0:315(0:175; 0:510) 6:662 0:782 HSUMA40:073(2:696) 0:962(0:020; 0:018) 0:262(0:147; 0:591) 13:131 0:860 y Overall coverage with percentage of missed coverage for upper and lower condence limits shown in parenthesis respectively. z Median of the length of the condence interval from all simulations. The results of the simulation is summarized in Table 2.1. From the table we see that there is no noticeable bias when ordinary maximum likelihood estimator (MLE) is used, but the standard error of the estimator is much larger than the one estimated from sher's information as can be inferred by the length of the uncorrected condence interval (CI). The corrected CI has good coverage, approximately equal to 0.95 as desired in all situations, and is balanced on both sides. 33 In contrary, the uncorrected CI is too narrow, without consideration of the additional uncertainty introduced by the shared error, yielding poor coverage ranging from 0.001 to 0.433 across all situations. It performs extremely bad when the underlying dosimetry model is SUMA1 and SUMA2, where shared error is relatively large, as expected from previous analysis. 2.4.2 Simple ERR Model We also evaluate simple ERR model Y Poisson(E(Y )); E(Y ) = exp(a)(1 +bX) We generate 1000 simulation dataset, and each includes 5000 subjects (i) True exposure X are generated as one realization from each of the given dosimetry model, with E(XjZ) =Z, where Z 4 2 4 , which has E(Z) = 16. (ii) 1000 realizations from the same dosimetry model are generated, and used to estimate the matrix Cov(XjZ), E(XjZ). (iii) Y is generated from the above ERR model, with exp(a) = 0:05. (iv) b ranges inf0:01; 0:05; 0:1; 0:5; 1g. The result of the simulation with 5000 subjects is summarized in Table 2.2. The variance correction signicantly improved the coverage of the 95% CI in al situations compared to the traditional information based Wald's CI. The estimator can be seen to be very close to the true value on average (over all 1000 simulation studies). We see some imbalance in the corrected CI as there is larger chance that the true b is bigger than the upper limit compared with thatb is smaller than the lower limit. However notice this is mainly due to underestimating of the lower limit of b using MLE to calculate the information matrix as can be seen from the uncorrected CI. Even with dose exactly known, the condence interval is not balanced (Table 34 A.1). This imbalance might be eliminated or lessened by using Score-based information matrix. Overall, variance correction oers a good adjustment to the CI and gives us a condence interval that performs well in simple ERR model. Table 2.2 Bias, Coverage, Length of CI for Simple ERR Model with Dierent Dosimetry Models with 5000 Subjects Bias(SD) Coverage y Length of CI z b Dosimetry * Corrected Uncorrected Corrected Uncorrected 0.01 SUMA1 0:000(0:012) 0:935(0:016; 0:049) 0:827(0:033; 0:140) 0:090 0:027 SUMA2 0:001(0:013) 0:950(0:019; 0:031) 0:822(0:046; 0:132) 0:093 0:028 SUMA3 0:001(0:009) 0:947(0:008; 0:045) 0:890(0:019; 0:091) 0:046 0:028 HSUMA1 *1 0:001(0:011) 0:934(0:016; 0:050) 0:871(0:033; 0:096) 0:051 0:028 HSUMA2 0:001(0:012) 0:926(0:022; 0:052) 0:849(0:035; 0:116) 0:066 0:028 HSUMA3 0:000(0:009) 0:951(0:005; 0:044) 0:879(0:015; 0:106) 0:051 0:028 HSUMA4 0:000(0:010) 0:932(0:013; 0:055) 0:843(0:027; 0:130) 0:067 0:028 0.05 SUMA1 0:001(0:051) 0:938(0:017; 0:045) 0:486(0:108; 0:406) 0:264 0:047 SUMA2 0:001(0:056) 0:946(0:020; 0:034) 0:508(0:103; 0:389) 0:280 0:048 SUMA3 0:001(0:031) 0:953(0:018; 0:029) 0:707(0:058; 0:235) 0:133 0:054 HSUMA1 0:002(0:035) 0:946(0:019; 0:035) 0:657(0:079; 0:264) 0:150 0:052 HSUMA2 0:002(0:047) 0:955(0:020; 0:025) 0:565(0:089; 0:346) 0:199 0:050 HSUMA3 0:000(0:033) 0:954(0:015; 0:031) 0:642(0:075; 0:283) 0:147 0:052 HSUMA4 0:001(0:040) 0:953(0:017; 0:030) 0:551(0:093; 0:356) 0:202 0:049 0.2 SUMA1 0:003(0:243) 0:961(0:007; 0:032) 0:427(0:083; 0:490) 1:007 0:129 SUMA2 0:011(0:271) 0:972(0:003; 0:025) 0:446(0:074; 0:480) 1:037 0:132 SUMA3 0:006(0:128) 0:970(0:003; 0:027) 0:653(0:034; 0:313) 0:493 0:163 HSUMA1 0:012(0:166) 0:969(0:003; 0:028) 0:611(0:043; 0:346) 0:569 0:163 HSUMA2 0:019(0:330) 0:973(0:001; 0:026) 0:527(0:060; 0:413) 0:764 0:148 HSUMA3 0:009(0:196) 0:972(0:002; 0:026) 0:614(0:038; 0:348) 0:543 0:155 HSUMA4 0:005(0:214) 0:977(0:001; 0:022) 0:498(0:056; 0:446) 0:799 0:135 1 SUMA1 *5 0:056(1:197) 0:967(0:000; 0:033) 0:547(0:001; 0:452) 5:297 0:824 SUMA2 *8 0:061(1:227) 0:973(0:000; 0:027) 0:539(0:002; 0:459) 5:352 0:769 SUMA3 *6 0:183(1:354) 0:968(0:000; 0:032) 0:686(0:000; 0:314) 2:730 0:977 HSUMA1 *2 0:164(1:166) 0:969(0:000; 0:031) 0:673(0:000; 0:327) 3:000 0:998 HSUMA2 *11 0:143(2:035) 0:978(0:000; 0:022) 0:551(0:001; 0:448) 4:184 0:793 HSUMA3 *3 0:155(1:387) 0:968(0:000; 0:032) 0:636(0:000; 0:364) 3:014 0:922 HSUMA4 *19 0:055(1:495) 0:980(0:000; 0:020) 0:524(0:000; 0:476) 4:884 0:710 * The number in the superscript after * indicates how many simulations cannot get a MLE due to diver- gence. y Overall coverage with percentage of missed coverage for upper and lower condence limits shown in parenthesis respectively. z Median of the length of the condence interval from all simulations. 35 2.4.3 Multiple Covariates with Measurement Error We also evaluate the extension where two error-prone covariates are included in the analysis, using ERR model. SUMA2, HSUMA3, HSUMA4 are used for X 1 , and SUMA2 is used for X2. The ERR model is specied as below YjX 1 ;X 2 Poisson(E(YjX 1 ;X 2 )); E(YjX 1 ;X 2 ) = exp(a)(1 +b 1 X 1 +b 2 X 2 ) Details of generating variables X 1 ;X 2 ;Y are given below, (i) True exposure X 1 ;X 2 are generated as one realization from each of their respective given dosimetry models, with E(X i jZ i ) =Z i ;i = 1; 2. We introduce correlation betweenZ 1 ;Z 2 by making Z 2 4 , Z 1 Z + 3 2 4 ;Z 2 Z + 3 2 4 ;Z 1 ? Z 2 jZ. Note that the covariates Z 1 ;Z 2 are correlated but not their errors. (ii) 5000 subjects are included. 1000 realizations from the same dosimetry model are generated, and used to estimate the matrix Cov(XjZ), E(XjZ). (iii) Y is generated from the above ERR model, with exp(a) = 0:05, (b 1 ;b 2 ) chosen from f(0:05; 0:1); (0:02; 0:2); (0:01; 1)g. The result is summarized in Table 2.3, compared with the result when X 1 ;X 2 are known in Table A.2. The variance correction works very well for (b 1 ;b 2 ) are (0:05; 0:1); (0:02; 0:2), with coverage very close to 95%. However when these two parameter dier in a great amount when (b 1 ;b 2 ) = (0:01; 1) (100x), the coverage of b 1 does not improve much from using information based CI (0.849{0.870 compared with 0.822{0.850) while the CI of b 2 is well-behaved. A conjecture of this result is the correlation of Z 1 ;Z 2 which will in uence the estimate of the eect of measurement error due to Z 1 . Since the two covariates have about the same variance, the large eect ofb 2 indicates a greater importance. The excellent performance of the method for b 2 , and both b 1 ;b 2 when they don't 36 dier dramatically suggests it is suitable for analysis with multiple exposure with measurement errors. Compared with the situation when the dose is perfectly measured (Table A.2), we see that the corrected CI performs even better. The reason for this could be the lognormal distribution in the mixture distribution compensate for the right-skewedness of the estimator when the sample size is not adequate for achieving approximate asymptotic property. Table 2.3 Bias, Coverage, Length of CI for ERR Model with Two Error-Prone Covariates Bias for (b1;b2) Coverage for b1 Coverage for b2 (b1;b2) Dosimetry Corrected Uncorrected Corrected Uncorrected (0.05, 0.1) SUMA2 *11 (0:008; 0:022) 0:953 0:697 0:960 0:540 HSUMA3 *6 (0:008; 0:018) 0:962 0:801 0:971 0:537 HSUMA4 *5 (0:005; 0:020) 0:958 0:733 0:973 0:519 (0.02, 0.2) SUMA2 *31 (0:014; 0:094) 0:939 0:837 0:965 0:313 HSUMA3 *27 (0:006; 0:022) 0:937 0:876 0:970 0:287 HSUMA4 *26 (0:008; 0:042) 0:929 0:856 0:975 0:293 (0.01, 1) SUMA2 *210 (0:005;0:089) 0:849 0:822 0:937 0:273 HSUMA3 *201 (0:009;0:151) 0:874 0:857 0:934 0:269 HSUMA4 *202 (0:017;0:016) 0:870 0:850 0:944 0:268 * The number in the superscript after * indicates how many simulations cannot get a MLE due to diver- gence. 2.4.4 ERR Model with Dose Eect Modier We evaluate a minimal ERR model with dose eect modier as follows Y i Poisson(E(Y i )); E(Y i ) = exp(a +a 1 A i )(1 +bX i exp(a 2 C i )) We generate 1000 simulation dataset, and each includes 5000 subjects. The detailed specication is that (i) True exposure X are generated as one realization from each of the given dosimetry model, with E(XjZ) = Z, where Z 4 2 4 , which has E(Z) = 16. A i N(0; 4);C i N(0; 4) are chosen to makeA i ;Z i ;C i comparable, which can be achieved by standardization in practice. 37 (ii) 1000 realizations from the same dosimetry model are generated, and used to estimate the matrix Cov(XjZ), E(XjZ). (iii) Y is generated from the above ERR model, with exp(a) = 0:05,b = 0:05,a 1 2f0:01; 0:1; 0:5g, a 2 2f0:01; 0:1; 0:5g. Only SUMA2 model will be used. Table 2.4 Bias, Coverage, Length of CI for ERR Model with Dose Eect Modier a1 a2 Bias(b) Coverage for b1 a1 a2 Corrected Uncorrected Bias Coverage Bias Coverage 0.01 0.01 *4 0:002 0:927 0:478 0:000 0:949 0:007 0:955 0.1 *6 0:001 0:941 0:493 0:000 0:945 0:002 0:945 0.5 *155 0:012 0:962 0:283 0:000 0:819 0:002 0:670 0.1 0.01 *5 0:001 0:935 0:455 0:000 0:962 0:007 0:949 0.1 *3 0:002 0:934 0:465 0:000 0:948 0:007 0:947 0.5 *199 0:013 0:951 0:251 0:000 0:815 0:003 0:627 0.5 0.01 *558 0:037 0:326 0:043 0:028 0:611 0:062 0:681 0.1 *522 0:039 0:310 0:050 0:036 0:584 0:034 0:519 0.5 *748 0:010 0:579 0:071 0:056 0:171 0:174 0:115 * The number in the superscript after * indicates how many simulations cannot get a MLE due to diver- gence. The result is summarized in Table 2.4. We see that the variance correction method works well in situations as in the rst 6 rows of the table. The failure occurs when either a 1 is big. This is actually more a problem of failing to achieve the asymptotic property of MLE due to non-linearity of ERR model, as can be seen from the coverage of the CI's of a 1 ;a 2 when a 1 is big. Our method relies on obtaining estimator with good distributional property. When MLE fails, it is expected that no easy x can be found. Analyzing the data assuming the dose is known, we can see the coverage of Wald's condence interval performs poorly whena 1 is big, which indicates the failure of asypmtotic property (Table A.3). 38 Chapter 3 Simulations with the Mayak Worker Cohort We further evaluate the performance of our method through simulations using dosimetry data from Mayak Worker Cohort. Also, collaborating with our US and Russian colleagues, we perform radiation risk analysis with lung cancer survival data of Mayak Worker Cohort. 3.1 Background on the Mayak Worker Cohort Located in the Southern Urals of Russia, southeast to Ozyorsk and 72 km northwest of Chelyabinsk (Figure 3.1), the Mayak Production Association (Mayak PA) was established in 1948 as nuclear facilities for plutonium production in the former Soviet Union (Koshurnikova et al., 1999). It consists of a nuclear reactor, a radiochemical plant, a plutonium production plant and several auxiliary plants. 3.1.1 Ionizing Radiations Plutonium has atomic number 94 with many stable isotopes such as 238 Pu, 239 Pu, 240 Pu, 241 Pu, 242 Pu, 244 Pu). Amongh them 239 Pu and 241 Pu are ssile, having applications in nuclear weapons and nuclear reactors with half-life of 24,100 and 14 years respectively (Lide, 2005). 239 Pu is the most used fuel in the ssion portion of nuclear weapons. It needs to be produced by bombarding 39 Figure 3.1 Satellite Map of Mayak Production Association Uranium-238 ( 238 U) with slow neutron changing it to 239 U. 239 U then undergoes two 1 decay and turn into 239 Pu as characterized by 238 U +n! 239 U! 239 Np +e + v e ; 239 Np! 239 Pu +e + v e ; where e denotes electron and v e denotes antineutrino. These high energy electron are called particles or rays. They can penetrate human body and damage DNA structure, causing mutations. ray can usually be shielded by foil. 239 Pu becomes harmless 235 U through decay, 239 Pu! 235 U +; 40 where denotes particle, identical to a helium nucleus ( 4 He 2+ ). Since particle is heavy and positively charged, it can only penetrate a few centimeters in air. This makes 239 Pu relatively safe as an external radiation source. However, when inhaled or ingested, it can cause cellular damage and be carcinogenic. During nuclear ssion in the plant and spontaneous ssion in nuclear fuel, ray will also be emitted. Dierent from particle, has a remarkable degree of penetrability and thus it requires shielding with large amounts of mass. ray is similar to X-ray, which is usually used in medical diagnostics, with dierence in whether the electron comes from the nucleus. particles, particles, rays, and X-rays, are all ionizing radiation, posing potential health risk to human body. 3.1.2 Characteristics of the Cohort Due to inadequate knowledge of the radioactive materials and their toxicity to human body, the workers in Mayak PA experienced a chronic exposure to radiation through the following decades, without sucient protection, especially during the rst decade of operation of the facility. The average cumulative career dose level to ionizing radiation is similar to those received by the atomic bomb survivor cohort of Life Span Study (LSS) in Japan and substantially larger than those received by other nuclear sites like Hanford site (Vasilenko et al., 2007). The cohort was rst identied in 1999 to include 18,831 workers (C1) from 1948 to 1972, where 14,072 (75%) are male and 4,759 (25%) are female. It was extended in the Mayak Worker Dosime- try System 2008 (MWDS 2008) to include workers who started employment from 1973 to 1982, reaching a total number of 22,349 workers (C2) with 25.5% of them being female (Khokhryakov et al., 2013). The entire cohort was exposed to external radiation and workers from radiochemical plant and plutonium production plant might also inhale plutonium aerosols, which leads to internal radiation exposure. 41 1) External Radiation: Workers were monitored for external radiation since 1948. Initially, both ionization chamber dosimeters and lm dosimeters were used until 1953 when only lm dosimeters were continued to be used because of lowered amount of radiation. From 1961, upgraded individual lm dosimeter (UIFD) was deployed, which includes four windows allowing for detection of the amount of-particle radiation, radiation and thermal neutron radiation. Starting from 1992, thermoluminescent dosimeters (TLDs) were used. Since Mayak facilities have a combination of , and neutron radiation, dierent scenarios are built to characterize the radiation environment for the cohort by Mayak sta with expert knowledge. Conversion factors were developed to be applied to the dosimeter readings for each working scenario to adjust for angular response, orientation of the worker, geometry and energy of the radiation led. In C1, about 83% of the workers were monitored. The measurement error (random error) associated with processing a single dosimeter is around 30%. Further detail can be found in Vasilenko et al. (2007). 2) Internal Radiation: Dose from inhaled plutonium can be calculated from urine bioassay measurement. Aerosols of plutonium were classied by investigators at the Southern Urals Biophysics Institute (SUBI) into three categories with dierent levels of transportability, which strongly in uences the distribution of inhaled plutonium between lung and systemic organs. During 1962 to 2000, autopsies of 530 workers in the cohort provided tissue samples, and distribution of plutonium between lungs and lymph nodes were ascertained. Smoking histories and working record at dierent sites were also considered to determine the distribution. Workers were classied into 5 categories depending on whether bioassay and autopsy are available and the number of transportability they have. Then the internal radiation dose is calculated through a model consisting of respiratory tract portion and systemic portion. Overall, 32% of cohort members have bioassay measurements and 4.5% have autopsy records. The parameters of absorption and formation of xed plutonium deposits in the respiratory 42 tract can be estimated from autopsy data. Data on transportability and smoking are also used. Further detail can be found in Vasilenko et al. (2007) and Khokhryakov et al. (2013). 97% of the deceased cohort members have known cause of death, with information collected from life event registration oce, pathology department, and medical institutions. The cause is coded based on ICD-9 with coding preference given to autopsy whenever it is available. Dose estimation for the cohort has been improved steadily and the Mayak Worker Dosime- try System 2013 (MWDS 2013) will include 25,757 workers from 1948-1982. There will be two important renements and enhancements to MWDS 2008 which is currently being used. 1) Improvement in models for plutonium exposure and improved dose characterization for each worker. 2) Estimation of the uncertainty associated with dose estimates are represented by multiple real- izations. 3.1.3 Importance and Relevance Compared to other cohort, the relatively high and protracted doses make the Mayak Worker Cohort extremely valuable for evaluating scientic hypothesis concerning dose rate and risk. The quality of death ascertainment together with the availability of both internal and external radiation exposures oers a unique opportunity to evaluate site-specic cancer mortality risks in both male and female under dierent exposure conditions, that relates more closely to typical nuclear workers. It may also provide better characterization of occupational radiation risk both internally and externally. 43 3.2 Data and Method 3.2.1 Mayak Worker Dosimetry System 2013 (MWDS-2013) The main dierence of MWDS-2013 from MWDS-2008 is that it takes into account of both uncertainty in measurement data and uncertainty in the values of model parameters, where point estimates of dose was used previously (Birchall and Puncher, 2016). Two models are used in the MWDS; the rst one is a biokinetic model where the intake of radiosity is estimated, and the second one is a dosimetric model, where the intake is transformed to organ dose. Previously model parameters were chosen to be the best estimates, while the new system uses prior distribution of both intake and calculate the posterior distribution of intake using samples of model parameters from their own prior distribution. The process can be summarized as follows (1) Sample model parameters from their prior distributions. The shared parameters are kept the same for all individuals while unshared parameters are sampled independently. (2) Obtain the posterior distribution of intake for each individual, and calculate the posterior distribution of dose. In the above process, each set of sampled model parameters contribute to a posterior distribution of dose, which is called hyper-realizations by the authors. Together with these hyper-realizations, the posterior distribution of these sampled model parameters can be calculated. Assume these parameters are referred to as, and the measurement data isD, then from Bayes' rule, it is simply P(jD)/ P(Dj)P(): Three spot estimates of dose are provided. These are 44 Spot 1 estimates Model parameters are chosen to be their representative values, which are either mean or median of their distribution. Maximum likelihood estimate of the intake is used with the dosimetric model to obtain the estimates. Spot 2 estimates Using the above mentioned prior distribution of both intake and model pa- rameters, the posterior distribution are obtained. The estimates correspond to the mean of the posterior distribution. Equivalently, this means a weighted mean of all the means from the hyper-realizations. Spot 1.5 estimates Only prior distribution of intake is used. Model parameters are chosen to be the means of their prior distribution. Spot 3 estimates This is a full Bayesian approach, where all the parameters and intake and mea- surements are evaluated simultaneously using Bayesian framework. Numerically this approach can be approximated using Spot 2 estimates. To reduce hyper-realizations to realizations, sampling from hyper-realizations needs to be done. Even though each hyper-realization has dierent weight, sampling based on weight has serious convergence problem since the weights are often order of magnitude o from each other resulting one dominating all others. In WMDS-2013, each hyper-realization is sampled once, which gives 1000 realizations. Also, a weighted and an unweighted mean dose of all workers are calculated using these hyper-realizations. We will use these 1000 realizations in belief that they represent well the posterior distribution of dose of Mayak Worker Cohort. A realization of dose includes a set of absorbed doses for each organ in each year. Our tasks here are (1)simulations designed to mimic the analysis of lung cancer risk in this Mayak Worker Cohort, and (2)survival analysis of lung cancer risk with survival data from Mayak Worker Cohort. Here (1) is intended to show the performance of the proposed method while (2) give us risk estimate for both internal and external radiation exposure. 45 3.2.2 Survival Analysis and Poisson Regression Since exposure of Mayak Worker Cohort is provided annually, we choose piecewise exponential survival function for analysis. Specically, relative risk (proportional hazard) is modeled annually. In survival analysis (Kalb eisch and Prentice, 2011), let T be the time of failure which is a pre-dened event, we dene the survival function S(t;X) = P(T >tjX) = 1F (t;X); where F (t;X) is the distribution function of T given X. The hazard function is dened as h(t;X) = lim h!0 + P(tTt +hjTt;X)=h =f(t;X)=S(t;X); where f(t;X) is the density function of T given X. Note the survival function and the hazard function satisfy S(T >tjX) = exp Z t 0 h(t;X) : Assume the following hazard model h i;j (X) = exp( 0 + 1 C 1;i + 2 C 2;j )(1 +X i;j exp( 3 C 2 ));i2 [K];j2 [N; where is the parameter describing the eect of dose X, i indexes the interval or the year the subject is in, j indexes the subject, C 1;i is the age of subject j at interval i,C 2;j is the sex of subject j. Let t i;j be the time the subject j experienced in interval i and d i;j indicates whether subject j experienced failure in interval i, we have the likelihood function for subject j L() = N Y j=1 K Y i=1 h i;j (X j ) di;j exp(h i;j (X j )t i;j ) 46 Assume now we are tting another model where d i;j Poisson(h i;j (X j )t i;j ) and assume all the d i;j are independent. Then the likelihood is L 0 () = N Y j=1 K Y i=1 (h i;j (X j )t i;j ) di;j exp(h i;j (X j )t i;j ) It is easily seen that these two likelihood are only dierent by a constant Q i;j t di;j i;j , not in- volving; 1 ; 2 ;3, and the asymptotic properties of likelihood estimator are equivalent. In the simulation, since we are analyzing yearly data, t i;j = 1;8i;j. More details can be found in Laird and Olivier (1981). Essentially, the equivalence of these two likelihood does not involve the exact form of h i;j , and we can use a more complicated model for the hazard function. 3.2.3 Poisson Regression Model in Mayak Worker Cohort From the above Section 3.2.2, we can do survival analysis using linear ERR model with Mayak Worker Cohort, given that the dose realizations are provided for each calendar year in MWDS- 2013. In our simulations we consider the following piecewise ERR hazard function, h i;j = exp (a 0 +a 1 C 1;i;j +a 2 C 2;i;j +a 3 C 3;i;j ) (1 +b 1 X 1;i;j exp (a 4 A 1;i;j +a 5 A 2;i;j ) +b 2 X 2;i;j ): (3.1) Here i indexes individual, j indexes time period, C 1 ;C 2 ;C 3 are covariates associated with baseline risk, X 1 ;X 2 are dose variables, and A 1 ;A 2 are eect modiers on dose eect of X 1 . For individual i, let the disease status be d i;j and time periods t i;j with j ranging from 1 to J i . We have the poisson model d i;j i:i:d: Poisson(t i;j h i;j );j2 [J i ]: 47 Following the derivation from Section 2.2.2, we get S Z1 =Q (DE Z1 (D)) I =Qdiag (E Z1 (D))Q T whereD is the vector of tabulated response variable, and matrixQ = [Q 1 Q n ] is the derivative of logE Z1 (D) with respect to the parameters. For the model of interest here, Q i = 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 1 C 1;i C 2;i C 3;i Z1;i exp(a4A1;i+a5A2;i) 1+b1Z1;i exp(a4A1;i+a5A2;i)+b2X2;i b1Z1;iA1;i exp(a4A1;i+a5A2;i) 1+b1Z1;i exp(a4A1;i+a5A2;i)+b2X2;i b1Z1;iA2;i exp(a4A1;i+a5A2;i) 1+b1Z1;i exp(a4A1;i+a5A2;i)+b2X2;i X2;i 1+b1Z1;i exp(a4A1;i+a5A2;i)+b2X2;i 3 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 5 : The variance of D conditioned on Z 1 is Var(DjZ 1 ) = E X1jZ1 (Var(DjZ 1 ;X 1 )) + Var X1jZ1 (E(DjZ 1 ;X 1 )) = diag(E Z1 (D)) +b 2 1 GCov(X 1 jZ 1 )G where G = 2 6 6 6 6 6 6 4 t 1 exp a + P J j=1 a 1;j C 1;j . . . t n exp a + P J j=1 a 1;j C n;j 3 7 7 7 7 7 7 5 : 48 Thus the variance of S Z1 is Var(S Z1 ) =QVar(DjZ 1 )Q T =I Z1 +b 2 1 QGCov(X 1 jZ 1 )GQ T : Let M =QG, the corrected variance estimate is Var( ^ ) =I 1 Z1 +b 2 1 I 1 Z1 MCov(X 1 jZ 1 )M T I 1 Z1 Note here I Z1 is conventional information matrix without considering measurement error, and both I Z1 and M depend on b 1 . Another observation is when there is no eect of X 1 , i.e. b 1 = 0, the variance of ^ is still I 1 Z1 . We use normal-lognormal approximation to make inference on estimated ^ b, by letting V 1 = I 1 Z1 b1;b1 and V 2 = I 1 Z1 MCov(X 1 jZ 1 )M T I 1 Z1 b1;b1 . For parameter associated with covariates not subject to measurement error, we denote their estimates as ^ a, and we use Wald condence interval with variance corrected to I 1 Z1 a;a + ^ b 2 1 I 1 Z1 MCov(X 1 jZ 1 )M T I 1 Z1 a;a 3.2.4 Ecient Calculation of M T Cov(XjZ)M In a large cohort with dosimetry system such as MWDS-2013, the calculation of Cov(X 1 jZ 1 ) will be very hard to perform. In the Mayak Worker Cohort, there are as many as 25,575 individuals with up to 60 years of follow-up at this time. The full-cohort analyses described below are based on person-year tables with roughly 350,000 individual person-year-dose contributions. In this case Cov(X 1 jZ 1 ) has more than 60 billion (109) distinct values and direct computation of or even storing this matrix in computer memory can be challenging. 49 Assume X 1 is of dimension nk, and M is of dimension pn , the pp matrix of primary interestMCov(X 1 jZ 1 )M T which can be computed using the more manageable intermediatenk matrix W =MX 1 , by observing that d Cov(W ) = 1 n 1 W T W 1 n W T 11 T W = 1 n 1 MX 1 X T 1 M T 1 n MX 1 11 T X T 1 M T =M 1 n 1 X 1 X T 1 1 n X 1 11 T X 1 M =M d Cov(X 1 )M The time complexity of calculating this product is reduced to (pnk +p 2 k) from (n 2 k +p 2 n 2 ). This is a signicant improvement when n is large, which is the case in our simulation. 3.2.5 Dosimetry System Error Components Diagnostics Stram and Kopecky (2003) also provided a way to infer the magnitudes of the dierent variance components in a SUMA dosimetry system from the dose replications. They observed that for dose X i , X j of dierent individuals (or individual person years) i;j, Cov(X i ;X j ) =Z i Z j 2 SM + 2 SA and for the same individual i Var(X i ) =Z 2 i ( 2 SM + 1)( 2 M + 1) 1 + 2 SA +jsigma 2 A With linear regression of sample variance and covariance on Z 2 i ;Z i Z j , estimates for all four vari- ance components can be obtained, which is very useful for exploratory analysis of given dosimetry system. However, it has huge computational burden because of the need for matrix Cov(XjZ). 50 Practically, we can randomly sample from individual person-years and perform regression on the covariance matrix of this subset. Here we propose an alternative approach for variance compo- nents diagnostic, with much less computation burden. Consider the k-th dose replication, for person-year i, we have X k;i =Z i SM;k M;k;i + SA;k + A;k;i If we think of Z i as imperfect measure of Z i M;k;i and A;k;i as random error, linear regression with X k on Z i will give us estimates for slope term SM;k and intercept term SA;k . We can run the regression for each dose replication, and the estimates for all SM;k and SA;k can give us an empirical distribution for shared error components. To estimate 2 M and 2 A , let ^ X k;i = Z i ^ SM;k + ^ SA;k , we have Since ^ SM;k SM;k , and ^ SA;k SA;k , (X k;i ^ X k;i ) 2 (Z i ^ SM;k ) 2 ( M;k;i 1) 2 + 2 A;k;i + 2Z i ^ SM;k ( M;k;i 1) A;k;i = (Z i ^ SM;k ) 2 2 M + 2 A +e k;i where e k;i = (Z i ^ SM;k ) 2 ( M;k;i 1) 2 2 M + ( 2 A;k;i 2 A ) + 2Z i ^ SM;k ( M;k;i 1) A;k;i which has expectation zero. This is obvious for the rst two terms, and it follows for the last term from the independence of M;k;i and A;k;i , and since in this regression the ^ 2 SM;k are constants. Thus regression of (X k;i ^ X k;i ) 2 on (Z i ^ SM;k ) 2 can provide estimates for 2 M = E(( M;k;i 1) 2 ) and 2 A = E( 2 A;k;i ), and we prefer to use this instead of random samples from Cov(X 1 jZ 1 ) because it uses all the data and runs faster. 51 3.2.6 Simulation We use simulation studies to evaluate the performance of the proposed method. For these studies, baseline rates and radiation eects for the outcome of interest were described using the ERR model in Equation 3.1 with X 1 representing internal dose in Gy, X 2 representing external dose in Gy. Covariates aecting baseline rates (that is rates in the absence of exposure) and those assumed to modify the internal dose eect are shown in the following table: Baseline rates Internal dose eect modication Covariate C1 C2 C3 A1 A2 Description Log(age/60) (Log(age/60)) 2 Sex Log(age/60) Sex This model is a simplication of models used in analyses of lung cancer risks analysis in Section 3.2.7. Since internal dose estimates are considerably more uncertain than external dose estimates, we focused on the situation in which only X 1 is measured with error. 3.2.6.1 Dose simulation For the simulation studies, we consider ve dosimetry systems. These include four SUMA dosime- tries with multiplicative but no shared or unshared additive errors. For each of the SUMA dosime- try systems, we used the average dose from MWDS-2013 across all dose replications as the mean doseZ in each of the SUMA models. The SUMA dosimetry systems used in the simulations have error components as listed below, Dose error model True dose X 1;i;j 2 SM 2 M 2 PM SUMA-S SM Z 1;i;j 0.318 0 0 SUMA-U Mi;j Z 1;i;j 0 0.223 0 SUMA-SU SM Mi;j Z 1;i;j 0.318 0.223 0 SUMA-SUP SM PMi Mi;j Z 1;i;j 0.318 0.223 0.200 52 For each of the SUMA uncertainty models, we generated 1,000 dose realizations. As in Kwon et al. (2016), for each simulation experiment one of the realizations is used to generate the outcome data. The remaining 999 realizations are then used to obtain the mean doses used in the modeling and in the calculation of MCov(X 1 jZ 1 )M and the adjusted condence intervals. The fth dose uncertainty model was dened by the actual MWDS-2013 realizations. In this case the true doses for a given simulation was dened to be one of the 1,000 MWDS-2013 realizations and used to generate the outcome data. As with the SUMA models, the remaining 999 realizations are used to obtain the mean doses used in the modeling and in the calculation of MCov(X 1 jZ 1 )M and the adjusted condence intervals. For external dose, we simply use the cumulative average annual exposure in MWDS-2013 as the true dose. 3.2.6.2 Survival data simulation For person i, we generated an event time following a piecewise exponential distribution specied by an ERR model of the form given in Equation 3.1. This event time was then compared to the observed time at the end of follow-up for this MWC member. If the generated event time was less than the end of follow-up, the person was treated as a case at the generated failure time, otherwise the subject was right-censored at their end-of-follow-up. Since last follow-up time is pre-dened, this is non-informative censoring and will not introduce bias into our analysis. For the Poisson regression we tabulated follow-up time, event counts, and covariate values for each distinct person-year of follow-up (approximately 125,000 total in any given simulation). We considered three models dened in terms of the baseline rate and internal dose ERR for men at age 60 as described in the following table: 53 Model Baseline rate (100,000 exp(a 0 )) (cases per 100,000 person years) Internal dose ERR /Gy (b 1 ) Expected cases Expected dose-associated cases Null model 100,000 exp(5:0) = 673.8 0 800 0 Moderate model 100,000 exp(6:5) = 150.3 3.6 300 100 Strong model 100,000 exp(5:0) = 673:8 3.6 1; 200 400 With the other parameters in the model taken as Baseline rates Internal dose eect modiers External dose (ERR/Gy) Risk factor log(age/60) log 2 (age/60) female:male log(age/60) female:male Parameter value a 1 = 5:64 a 2 =6:39 a 3 = log(0:5) a 4 =3:15 a 5 = 1:33 b 2 = 0:21 We evaluate moderate and strong models for each of the ve dosimetry systems described above. The null model is a little dierent for reasons we discuss later, and we evaluate it separately only for MWDS-2013. 3.2.7 Lung cancer risk modeling Lung cancer mortality is analyzed using data on workers with plutonium monitoring data and on unmonitored reactor and auxiliary plant workers (who are presumed to have no exposure from plutonium) with follow-up through 2010. The primary analyses used the MWDS-2013 plutonium and external dose realization data to estimate the eects of internal and external doses on the ERR and to examine the eects of adjusting these risk estimates for shared uncertainty in the dose estimates as represented by the dose realizations. In order to examine the impact of the change in dosimetry we also carried out an analysis based on the MWDS-2008 dose estimates. The primary purpose of the current analyses is to illustrate the impact of dose uncertainty on the uncertainty in the external and internal dose risk estimates. The analyses were based on 6,725 workers who survived at least two years after their initial monitoring and 8,375 unmonitored 54 reactor and auxiliary plant workers who are assumed to have no exposure to plutonium. Follow- up for unmonitored workers began at the time of initial employment in an eligible plant while follow-up for monitored workers began two years after the rst monitoring date. Migrants who were alive at the end of 2003 were censored at that date while those who migrated in 2004 or later were censored at their date of migration. There were a total of 498 lung cancer deaths with 420,479 person years of follow-up in this cohort. Among the workers with plutonium doses there were 269 lung cancers during 115,741 person years of follow-up. Unmonitored reactor and auxiliary plant workers had 305,007 person years of follow-up with 229 lung cancer deaths. The number of person years for monitored radiochemical and Plutonium production plant workers is smaller than that for the unmonitored workers used in these analyses because follow-up for monitored workers began two-years after the initial monitoring date while it began at the time of initial employment in an eligible plant for the unmonitored workers used here. Lung cancer risks were modeled using an ever-never smoking-adjusted in which the baseline rates (i.e. risk in the absence of internal or external radiation exposure) varied with attained age and sex. Using our R package model specication described in Section 3.5, the model we use can be represented as 1 sex * smoking + smoking * log(age/60) + smoking * log(age/60)^2 + 2 ERR(dose.internal | sex + log(age/60)) + ERR(dose.external) 55 3.3 Result 3.3.1 simulation The simulation results for overall condence interval coverage are given in Table 3.1 for the SUMA models. As expected, when there is only unshared error (SUMA-U), there is almost no dierence between the coverage of the na ve and corrected condence intervals. In the presence of shared error, the coverage of the na ve condence intervals for the slope parameter (b 1 ) is poor, ranging from 0.436 to 0.677 for the dierent models. For each of these models the coverage of the corrected condence interval is much closer to the desired value of 0.95. However, the coverage, especially for SUMA-SUP, tends to be asymmetric, i.e., when the condence interval fails to include the true value it is more likely to be above the upper limit than below the lower limit. (We would like to see probabilities of 0.025 for both the lower and upper limits). The correction has virtually no eect on condence interval coverage for parameters other than b1. However, the coverage of both the corrected and na ve condence intervals tends to be unbalanced. Further simulations using true dose are given in Table A.1 With MWDS-2013 in Table 3.2, the coverage of corrected condence intervals for all parameters is around 0.95 while na ve condence intervals have poor coverage, especially for b 1 . However, as with the SUMA models, the corrected condence limits appear to be somewhat unbalanced, especially for b 1 , i.e., the dose eect for the doses measured with error. In general, the proposed approximation markedly improved the condence-interval coverage for the dose-response parameter,b 1 relative to that of the na ve condence interval ignoring mea- surement error. For the Mayak simulation overall coverage is excellent but there is an imbalance in upper and lower condence limits for some parameters. The lack of balance in the condence intervals may be because M and I Z1 depend on the parameters that are being estimated. For example, based on a typical simulation for Mayak, the 56 Table 3.1 Condence interval coverage for model parameters with SUMA-U, SUMA-S, SUMA-SU, SUMA- SUP dose uncertainty models Model Para- meter Nave Condence Interval Corrected Condence Interval Moderate Strong Moderate Strong SUMA-U a0 .960 (.015, .025) 1 .953 (.015, .032) .961 (.014, .025) .954 (.015, .031) a1 .955 (.027, .018) .949 (.035, .016) .956 (.027, .017) .952 (.033, .015) a2 .957 (.024, .019) .933 (.031, .036) .957 (.024, .019) .935 (.030, .035) a3 .955 (.014, .031) .951 (.009, .040) .955 (.014, .031) .953 (.009, .038) b1 .945 (.054, .001) .906 (.091, .003) .949 (.050, .001) .918 (.079, .003) a4 .956 (.020, .024) .945 (.036, .019) .956 (.020, .024) .952 (.033, .015) a5 .957 (.025, .018) .939 (.052, .009) .960 (.023, .017) .948 (.044, .008) b2 .938 (.058, .004) .963 (.021, .016) .938 (.058, .004) .963 (.021, .016) SUMA-S a0 .950 (.014, .036) .946 (.025, .029) .950 (.014, .036) .946 (.025, .029) a1 .955 (.036, .009) .952 (.026, .022) .955 (.036, .009) .952 (.026, .022) a2 .952 (.028, .020) .952 (.029, .019) .952 (.028, .020) .952 (.029, .019) a3 .951 (.022, .027) .950 (.020, .030) .951 (.022, .027) .950 (.020, .030) b1 .663 (.254, .083) .436 (.358, .206) .943 (.039, .018) .947 (.027, .026) a4 .945 (.018, .037) .941 (.027, .032) .945 (.018, .037) .941 (.027, .032) a5 .963 (.018, .019) .940 (.031, .029) .963 (.018, .019) .940 (.031, .029) b2 .940 (.053, .007) .954 (.028, .018) .940 (.053, .007) .954 (.028, .018) SUMA-SU a0 .957 (.022, .021) .946 (.017, .017) .957 (.022, .021) .947 (.017, .036) a1 .956 (.029, .015) .949 (.025, .026) .956 (.029, .015) .953 (.023, .024) a2 .942 (.026, .032) .955 (.024, .021) .943 (.026, .031) .956 (.024, .020) a3 .945 (.020, .035) .937 (.017, .046) .945 (.020, .035) .938 (.017, .045) b1 .677 (.251, .072) .448 (.401, .151) .941 (.040, .019) .957 (.025, .018) a4 .942 (.032, .026) .949 (.035, .016) .944 (.032, .024) .956 (.029, .015) a5 .948 (.031, .021) .947 (.040, .013) .951 (.029, .020) .958 (.033, .009) b2 .954 (.042, .004) .950 (.035, .015) .954 (.042, .004) .950 (.035, .015) SUMA-SUP a0 .950 (.019, .031) .943 (.016, .041) .950 (.019, .031) .944 (.016, .040) a1 .949 (.033, .018) .954 (.028, .018) .951 (.033, .016) .955 (.027, .018) a2 .949 (.031, .020) .957 (.018, .025) .950 (.030, .020) .959 (.018, .023) a3 .950 (.018, .032) .936 (.015, .049) .950 (.018, .032) .938 (.014, .048) b1 .703 (.230, .067) .467 (.383, .150) .956 (.034, .010) .951 (.037, .012) a4 .949 (.024, .027) .939 (.044, .017) .952 (.024, .024) .946 (.038, .016) a5 .952 (.028, .020) .933 (.057, .010) .957 (.025, .018) .947 (.044, .009) b2 .938 (.056, .006) .954 (.034, .012) .938 (.056, .006) .954 (.034, .012) 1 Fraction of times that interval includes the true value with, in parenthesis, fraction of times that the upper bound is below the true value and that the lower bound is greater than the true value. na ve standard error forb 1 i.e. q I 1 Z1 b1;b1 equaled 0.67 when evaluated at the true valueb 1 =3.6 and 1.81 when evaluated at b 1 =10 as shown in Figure 3.2. One of the notable dierences between the simulations using the actual Mayak doses and those using the simplied SUMA models is that, with MWDS-2013 the coverage of other parameters thanb 1 are also aected by dosimetry error; the na ve CI's for most of the other parameters, and 57 Table 3.2 Condence interval coverage for model parameters with MWDS-13 internal dose uncertainties Parameter Nave Condence Interval Corrected Condence Interval Moderate Strong Moderate Strong a0 .943 (.015, .042) 1 .906 (.028, .066) .956 (.011, .033) .946 (.017, .037) a1 .954 (.032, .014) .924 (.050, .026) .956 (.031, .013) .937 (.044, .019) a2 .966 (.014, .020) .961 (.020, .019) .967 (.013, .020) .965 (.017, .018) a3 .936 (.019, .045) .917 (.016, .067) .944 (.012, .044) .941 (.010, .049) b1 .830 (.151, .019) .624 (.300, .076) .933 (.061, .006) .936 (.060, .004) a4 .946 (.034, .020) .916 (.061, .023) .959 (.024, .017) .958 (.033, .009) a5 .945 (.040, .015) .914 (.072, .014) .957 (.033, .010) .941 (.050, .009) b2 .914 (.076, .010) .876 (.079, .045) .942 (.054, .004) .964 (.021, .015) 1 Fraction of times that interval includes the true value with, in parenthesis, fraction of times that the upper bound is below the true value and that the lower bound is greater than the true value. especially the second dose response parameter,b 2 , is also lower than 95%. (For the SUMA models the coverage of na ve estimates was excellent except for the dose response b 1 ). The corrected CI's for the Mayak simulation show coverage close to 95% for all the parameters. This dierence between SUMA and Mayak is an indication of two things, rst that the Mayak dosimetry system is inherently more complicated than any of the SUMA models, and second that our correction method still works well in this more complicated setting. 0 2 4 6 8 10 0.5 1 1.5 b 1 Na¨ ıve Standard Error of b 1 Figure 3.2 Na ve Standard Error of b 1 vs. b 1 58 3.3.2 The Null Model Under the null hypotheses that b 1 = 0 the dose modifying eects of age and sex are no longer identiable. To be able to make proper tests about the null hypothesisH 0 :b 1 = 0 with or without measurement error, we perform the score test with a reduced model, i.e. with a 4 =a 5 = 0. The performance of the score test is given in Table 3.3. We choose the score test over the Wald test because the latter requires convergence of the model tting procedure. Under the null a large fraction fail to converge because of boundary conditions enforced on b 1 (i.e. that estimated probabilities may not be negative in value). We note that there is very little eect of measurement error on the performance of the score test. For example, for the moderate model the coverage is unbalanced (with less than desired rejections on the negative end) however this is seen also in a model tted with the true dose. Table 3.3 Score Test for the Null Model Parameter Strong Model Moderate Model With error b1 .954 (.021, .025) .967 (.008, .025) Without error .958 (.013, .029) .970 (.006, .024) 1 Fraction of times that interval includes the true value with, in parenthesis, fraction of times that the upper bound is below the true value and that the lower bound is greater than the true value. 3.3.3 Bias in Slope Estimates Table 3.4 Estimates and Standard Deviations of b 1 Model Parameter Mean, SD and p-value of estimates of b1 (true value=3.6) SUMA-U SUMA-S SUMA-SU SUMA-SUP MWDS-2013 Moderate mean 3.65 3.77 3.62 3.43 3.61 SD 0.97 2.19 1.98 1.89 1.48 p 0.094 0.013 0.802 0.004 0.813 Strong mean 3.43 3.58 3.39 3.11 3.33 SD 0.5 1.77 1.66 1.4 0.99 p < 10 4 0.717 < 10 4 < 10 4 < 10 4 59 Table 3.4 gives the means and standard deviations for ^ b 1 for all the models used. For the strong model there is a statistically signicant bias towards the null in the slope estimates except when there are only shared errors (i.e. model SUMA-S). This bias is likely due to the dilution eect described by Prentice (1982) in his discussion of survival analysis in the presence of random covariate errors. The attenuation in the risk estimate occurs because highly exposed individuals are more likely to have an early event, implying that the distribution of true dose given estimated dose has a lower mean value in the later time periods than in the earlier time periods. Since the dosimetry system does not use information from the response this results in a biased estimate of the slope parameter ^ b 1 . In the strong model about 5 percent of all individuals ( 400=7873) die because of exposure whereas in the moderate model this is reduced to about 1.25 percent and the dilution bias diminishes accordingly. The dilution eect is not dependent upon the size of the shared errors, rather it is a re ection of random unshared errors; the model with the greatest dilution (SUMA-SUP) has the most random error and the model where there is only shared error (SUMA-S) does not show a dilution eect. The latter follows because for SUMA-S everyone at a given estimated dose level has the same (unknown) true dose so that the distribution of true dose given estimated dose is not dependent on time. In many studies exposure-associated cases arise in only a very small fraction of the total number of individuals considered. For example, even in the A-bomb study the excess number of radiation- associated solid tumor cases over 40 years of follow-up (1958-1998) was estimated to be only about 2 percent of the exposed members of the cohort (http://www.rerf.jp/radefx/late e/cancrisk.html). Our moderate model, which does not show signicant dilution eects, is designed to give an excess number of cases which is very close to the actual Mayak analysis (about 100 cases). 3.3.4 Variance Components Diagnostics with MWDS-2013 The estimates for all variance components in the SUMA model using MWDS-2013 dose realizations are given in Table 3.5. When using the Stram-Kopecky method, we randomly selected 10,000 60 −0.1 −0.05 0 0.05 0.1 0 5 10 15 20 ǫ SA (a) 0 0.5 1 1.5 2 0 0.5 1 ǫ SM (b) Figure 3.3 Empirical distributions of the shared error components in MWDS-2013 elements from Cov(X 1;i ;X 1;j ) and Var(X 1;i ) each to do the regression. Estimates that are below 0 are truncated to 0 with the original value shown in parenthesis. The empirical distributions for the shared error components in MWDS-2013 are given in Figure 3.3 , where the curves are density estimates using kernel smoothing. From Table 3.5, we see that the estimates from the two methods are very comparable. However, from Figure 3.3b, we see the shared multiplicative error component has a less skewed distribution than the log normal, suggesting that further work on modeling the distribution of SM , could be undertaken. Table 3.5 Variance components diagnostics for MWDS-2013 Method ^ 2 SM ^ 2 SA ^ 2 M ^ 2 A Stram-Kopecky 0.091 0 (-0.000) 0.324 0 (-0.016) Regression estimates 0.109 0.001 0.285 0 (-0.008) 3.3.5 Comparison of MWDS-2013 and MWDS-2008 cumulative doses Table 3.6 provides summary information on the distribution of total cumulative MWDS-2013 plu- tonium dose for the monitored workers by the plant with the highest plutonium exposure potential in which they worked prior to 1983 and sex. These values are based on the mean doses over the 61 1000 realizations of annual plutonium doses provided by the current system. Over all monitored workers the total cumulative doses range up to almost 9 Gy. However, the mean cumulative plutonium dose is about 200 mGy with an upper 90th percentile of 386 mGy. The highest doses are seen among working in the Main 1 departments of the plutonium production plant. In this group of workers, women with a mean dose of almost 1.3 Gy, tend to have considerably higher doses than men, whose mean dose is about 0.5 Gy. Table 3.7 provides summary information on the distribution of cumulative MWDS-2008 dose for plutonium workers used in these analyses who have MWDS-2008 (deterministic) dose estimates. Although the maximum doses tend to be a bit higher, in general the MWDS-2008 doses are considerably lower than the MWDS-2013 estimates. In particular, the overall mean dose for men and women combined is about 132 mGy with an upper 90th percentile of 165 Gy. The sex dierences in the dose distributions were less pronounced with the older dosimetry. The following tables provide information on the distribution of mean MWDS-2013 (Table 3.8) and MWDS-2008 (Table 3.9) external dose estimates. The MWDS-2013 Monte-Carlo dosimetry system was designed so that the mean doses would be similar to the MWDS-2008 doses. This can be seen in the similarity of the dose distributions in the two tables. The mean gamma doses are generally slightly higher for men than women except in the plutonium Main 1 workplaces. Overall the mean total gamma dose was about 400 mGy. 3.3.6 Lung cancer survival analysis Lung cancer rates for both men and women increase roughly in proportion to the sixth power of age with some attening of the rates at older ages. Rates for male smokers were about 11 times those for male nonsmokers (95% CI 6 to 18) while the rates for female smokers were about 5 times those for non-smoking women (95% CI 2.2 to 11). The joint eect of radiation and smoking was assumed to be multiplicative. Using MWDS-2013, the eect of external exposure appears to 62 Table 3.6 Number of monitored workers and total mean cumulative MWDS-2013 plutonium dose distribu- tion summaries by sex and workplace. Workplace with maximumplutonium exposure potential Sex Number of monitored workers Total MWDS-2013 cumulative plutonium dose (mGy) to the lung at end of follow-up Median Mean 90th %-tile Maximum Auxiliary plant Men 9 34 72 186 186 Women - - - - - Total 9 34 72 186 186 Reactor complex Men 235 26 59 93 3,873 Women 46 30 49 111 469 Total 281 27 58 94 3,873 Radiochemical Plant Men 2,491 41 122 303 3,447 Women 1,053 34 87 171 5,052 Total 3,544 39 112 256 5,052 Plutonium auxiliary Men 584 18 86 201 2,819 Women 500 23 193 549 7,788 Total 1,084 20 136 313 7,788 Plutonium Main 2 Men 660 36 163 386 3,031 Women 232 26 157 387 3,980 Total 892 33 161 387 3,980 Plutonium Main 1 Men 684 101 496 1,645 5,976 Women 231 195 1,265 4,698 8,977 Total 915 117 690 2,159 8,977 Total Men 4,664 39 175 370 5,976 Women 2,062 34 252 421 8,977 Total 6,726 37 199 386 8,977 be linear in dose with the estimated ERR per Gy of 0.11. There was no evidence of external- dose eect modication by sex, age, or time since exposure. This estimate is almost the same as the MWDS-2008-based estimate given by Gilbert et al. (2013). As in the MWDS-2008 analyses in , the plutonium dose eect varies with sex and attained age. The estimated ERR/Gy for plutonium lung dose for men at attained age 60 is 3.5 while that for women is 13. These estimates are about half of the MWDS-2008-based estimates given in . This change in the point estimates is largely due to the fact that the MWDS-2013 plutonium dose estimates are considerably higher than the earlier estimates. The ERR associated with plutonium risks was estimated to decrease in proportion to one over age to the power 1.8. This decrease is somewhat less rapid than that reported in (3) in which the decrease was proportional to one over age 3.1 (95% CI -5.4 to -0.8). 63 Table 3.7 Number of monitored workers and total cumulative MWDS-2008 plutonium dose distribution summaries by sex and workplace. Workplace with maximumplutonium exposure potential Sex Number of monitored workers Total MWDS-2008 cumulative plutonium dose (mGy) to the lung at end of follow-up Median Mean 90th %-tile Maximum Auxiliary plant Men 8 18 36 122 122 Women - - - - - Total 8 18 36 122 122 Reactor complex Men 231 15 26 52 420 Women 46 11 33 109 369 Total 277 15 27 52 420 Radiochemical Plant Men 2,287 20 60 131 9,212 Women 1,018 20 58 102 4,407 Total 3,305 20 59 120 9,212 Plutonium auxiliary Men 559 14 41 99 871 Women 492 15 106 219 6,162 Total 1,051 15 71 138 6,162 Plutonium Main 2 Men 616 24 90 199 4,577 Women 227 17 64 153 1,050 Total 843 22 83 176 4,577 Plutonium Main 1 Men 601 49 582 546 218,758 Women 205 69 636 1,842 9,415 Total 806 53 596 748 218,758 Total Men 4,302 21 133 165 218,758 Women 1,988 20 130 167 9,415 Total 6,290 21 132 165 218,758 Table 3.10 contrasts the na ve and adjusted condence bounds for the MWDS-2013 risk esti- mates. The rst comparison of interest is the na ve Wald bounds to the CIM-adjusted Wald bounds. For internal exposures the adjusted condence intervals are considerably wider for both men and women. For the male risk the adjusted 95% CI is almost twice the width of the Wald bound while for women the ratio of the widths of the adjusted to that of the na ve interval is about 1.5. Based on the simulations above we feel that the changes in the widths of the condence intervals largely re ect the impact of shared uncertainty in the MWDS-2013 dose estimates. The same comparison for the MWDS-2013 external dose estimate indicates that there is very little change in the width of the condence intervals. However, it is also useful to compare the na ve likelihood 64 Table 3.8 Number of monitored workers and total mean cumulative MWDS-2013 gamma dose distribution summaries by sex and workplace. Workplace with maximumplutonium exposure potential Sex Workers in analysis cohort Total MWDS-2013 cumulative gamma dose (mGy) to the lung at end of follow-up Median Mean 90th %-tile Maximum Auxiliary plant Men 2,707 21 94 259 2,855 Women 675 18 90 259 985 Total 3,382 21 93 259 2855 Reactor complex Men 4,119 253 462 1,168 7,836 Women 1,164 99 199 515 6,274 Total 5,283 204 404 1,058 7,836 Radiochemical Plant Men 2,491 451 856 2,202 5,191 Women 1,053 501 767 1,884 4,948 Total 3,544 463 829 2,124 5,191 Plutonium auxiliary Men 584 39 202 507 3,515 Women 500 26 147 438 3,070 Total 1,084 32 177 459 3,515 Plutonium Main 2 Men 660 94 178 371 3,356 Women 232 81 140 336 1,523 Total 892 90 168 356 3,356 Plutonium Main 1 Men 684 125 310 850 3,506 Women 231 223 426 1,164 2,525 Total 915 141 339 962 3,506 Total Men 11,245 156 421 1,261 7,836 Women 3,855 104 338 1,052 6,274 Total 15,100 141 400 1,200 7,836 and Wald bounds. Wald bounds are based on an assumption that the asymptotic distribution of the maximum likelihood estimate is normal. Likelihood-based bounds are based directly on the likelihood function and can have better coverage (i.e. the actual coverage probability is closer to the nominal level) than Wald bounds. 3.4 Discussion We evaluated the performance of the resulting condence intervals in a series of simulation exper- iments using several types of dosimetry models including the actual Mayak Workers Dosimetry System 2013 (MWDS-2013). Overall we saw marked improvement in the coverage of 95 percent 65 Table 3.9 Number of monitored workers and total mean cumulative MWDS-2008 gamma dose distribution summaries by sex and workplace. Workplace with maximumplutonium exposure potential Sex Workers in analysis cohort Total MWDS-2008 cumulative gamma dose (mGy) to the lung at end of follow-up Median Mean 90th %-tile Maximum Auxiliary plant Men 2,707 21 92 253 2,803 Women 675 18 88 256 950 Total 3,382 21 91 254 2,803 Reactor complex Men 4,119 246 450 1,127 7,595 Women 1,164 97 194 507 6,000 Total 5,283 199 394 1,021 7,595 Radiochemical Plant Men 2,492 433 828 2,134 5,179 Women 1,053 481 746 1,845 4,918 Total 3,545 443 804 2,053 5,179 Plutonium auxiliary Men 584 39 196 476 3,500 Women 500 26 143 422 3,014 Total 1,084 32 172 443 3,500 Plutonium Main 2 Men 660 93 173 359 3,188 Women 232 79 137 325 1,490 Total 892 89 163 344 3,188 Plutonium Main 1 Men 684 122 297 823 3,438 Women 231 212 415 1,149 2,498 Total 915 134 326 945 3,438 Total Men 11,246 152 409 1,220 7,595 Women 3,855 103 329 1,024 6,000 Total 15,101 138 389 1,161 7,595 Table 3.10 Lung cancer ERR estimates based on MWDS-2008 and MWDS-2013 dose estimates with corrected-information-matrix-adjusted Wald bounds for the MWDS-2013 estimates Dose estimates End of follow-up ERR/Gy Nave 95% CI CIM-adjusted Likelihood Wald 95% CI Internal dose (men) MWDS-2008 2008 7.4 (5.0 to 11) * NA MWDS-2013 2010 3.5 (2.4 to 4.8) (2.2 to 4.8) (1.8 to 6.8) Internal dose (women) MWDS-2008 2008 24 (11 to 56) * NA MWDS-2013 2010 13.5 (6.3 to 29) (2.7 to 24) (2.6 to 34) External dose (men and women) MWDS-2008 2008 0.13 ( -0.04 to 0.38) * NA MWDS-2013 2010 0.11 (-0.04 to 0.32) (-0.072 to 0.296) (-0.072 to 0.299) 66 condence intervals for the dose-response parameter in the ERR model when the dose variable used in the analysis is the mean of dose realizations from a Monte Carlo dosimetry system that includes shared (i.e. non-independent) components of error. For example, using the replica- tions from the MWDS-2013 dosimetry system to adjust results obtained from our \strong" model (about 1,200 cancer cases, 400 of which were due to exposure) the uncorrected condence intervals covered the true value of the dose response parameter just 62.4 percent of the time compared to the desired 95 percent coverage. After correction, this was improved to 93.6 percent using the variance formula and the normal+lognormal approximation. In experiments with modications of this system (the SUMA models) we saw even larger corrections. We note, however, that even though overall coverage was excellent after correction, the condence intervals for the risk param- eter of interest in the ERR model were unbalanced, that is they tended to be quite conservative on the low end, and were overly liberal on the high end. Even when the dosimetry system included only unshared error this phenomenon was observed, and may be due to the dependence of the information matrix on the parameter estimates an aspect which we ignored in our variance calcu- lation. In particular, we see in Figure 3.2 that with or without dosimetry error the standard error of the dose response parameter b 1 in an ERR model appears to increase with positive b 1 . Table A.4 and A.5 show that even when there are no dose errors, the standard Wald test condence intervals for the two dose response parameters, b 1 and b 2 , have unbalanced coverage similar in direction to what is seen in Table 3.1 and Table 3.2. We therefore attribute at least part of the imbalance in our corrected condence intervals to be due to the structure of the risk model, rather than the measurement errors. Most likely this is because of the dependence of the information matrix on the parameters being estimated. A full treatment of this issue is deferred for later work since we note that treating the information matrix (and also the correction matrix M) as a function of b 1 would add considerable computational burden to our approach. 67 3.5 Implementation of a R Package An R package is implemented for the proposed method, as well as for generating survival data based on ERR model, and made public available on GitHub (https://github.com/zhuozhang/ Rerr). We designed a syntax for representing ERR model with formulae provided in R. We dene two functions in formula specication, `ERR' and `U', where `ERR' species the excess relative risk part and `U' denes uncertain dose variable, i.e., dose that comes with a dosimetry system. An example of such formula is the one we use for simulations in Chapter 3. 1 model.gen <- cell.cases ~ covariate.age + I(covariate.age^2) + sex + 2 ERR(dose.internal.true|covariate.age + sex) + 3 ERR(dose.external) + offset(cell.pyr) 4 5 model.est <- cell.cases ~ covariate.age + I(covariate.age^2) + sex + 6 ERR(U(dose.internal) | covariate.age + sex) + 7 ERR(dose.external) + offset(cell.pyr) In the code above, the rst formula species the model used for generating the survival data, and the second one species the model used for estimating parameters and making inference. They both have three parts. Take the rst formula for example, the rst part with covariate.age + I( covariate.age^2)+ sex denes baseline risk, and the second part ERR(dose.internal.true|covariate.age + sex) denes the rst ERR component, with covariate.age and sex as eect modiers, and the last one denes the second ERR part without any eect modiers. offset denes the variable to be used for the time spent in each person-year record when tabulating survival data into Poisson data. We see that the dierence between these two formulae above is that the formula for generating the data uses dose.internal.true, which represents the true doseX, while the second formula, used for estimation, uses dose.interval, which represents the mean dose Z, and also enclosed by U(), telling that this variable is measured with error and comes with a dosimetry system. 68 For one simulation, the sample code of the work ow with this package is given below 1 data 2 data.survival <- simulateERR(model.gen, data, theta, id, age, censor.time) 3 err <- parseERR(model.est, data=data.survival) 4 xm <- fitERR(err) 5 CI <- inferERR(err, xm$theta, dose.reps[data.new$index,], alpha=0.05) Here data is a data.frame containing all covariates information, including model related covariates covariate.age, sex, dose.internal.true, dose.internal, dose.external and survival data related infor- mation such as id, age, censor.time. The generated outcome variable is cell.cases as given by the formula, with oset variable cell.pyr. simulateERR then uses the given formula, data, and model parameters theta to generate tabulated survival data. Here model parameters are specied in a vector in the same order as in the formula. The returned variable data.survival containes the generated outcome variable. parseERR is subsequently used to parse the estimation formula with the survival data, and returns err, which is a list with parsed data structures. Afterwards err is parsed to fitERR to get ^ and also to inferERR to get condence intervals. inferERR also requires dose replications, and ^ which can be provided by fitERR. Currently only one variable is allowed to be associated with a dosimetry system. For other details, refer to the the manual of the package. 69 Chapter 4 Comparisons between the Variance Correction Method and Bayesian Methods As dicussed in Section 1.6.3, there are emerging research of using Bayesian method to solve measurement error problems with complex dosimetry systems, which comes naturally because the dosimetry system itself is constructed with Bayesian perspective. Dierent from traditional methods which separate the construction phase of the dosimetry system, and the analysis phase with the outcome data, methods that combine these two phases are suggested by somer esearchers with the intention of ahieving better eciency. Kwon et al. (2016) suggested that the tness of each dose realization to the data represents `likeliness' of it being the true dose, or `closeness' of it being to the true dose. Based on this idea, they proposed so-called Bayesian Model Averaging (BMA) method to get the overall condence interval of the model parameters. This method is claimed to greatly improve condence interval coverage and eliminating bias compared to classical methods. Meanwhile, some other researchers, including Birchall and Puncher (2016), notice that given reasonable number of dose realizations (such as 1,000), usually 1 dose realization dominates all others in likelihood, which suggests that averaging over all dose realizations based on weighting through likelihood should not do much better compared to using only the dominating one, which contradicts aforementioned BMA method. Since the adoption of complex dosimetry system is 70 only recent and there is no gold standard for dealing with it, this controversy is worth further investigation, which is the topic of this chapter. We simulated a cohort study using a simple ERR model to compare the performance of the BMA method proposed by Kwon et al., a full yet unrealistic Bayesian method, which is described shortly after, and our variance correction method. In the simulation, we use dosimetry model based on the SUMA model which includes either shared or unshared errors only, or both. 4.1 Data and Methods 4.1.1 Risk Model, Dosimetry Model and Cohort Simulation A simple ERR model is used for generating the simulation data. The hazard function is specied with baseline risk related to sex only and excess risk only associated with dose exposure, as follows h = exp(a 0 +a 1 S)(1 +bX) where S is a binary random variable representing sex, with S Bernoulli(0:5), (1 for male, 0 for female) and X is the dose of risk related exposure. The parameters are chosen as follows, a 0 a 1 b 6 log(2) 2 A cohort of 5,000 people is simulated; each person is followed for 10 years. In the whole follow-up, about 200 cases are due to baseline risk. For the dosimetry model, we use X i = Z i S i , with E( S ) = E( i ) = 1, both of which follow lognormal distribution, and Z i 2 1 . Note S represents shared multiplicative error, and i represents completely unshared multiplicative error, and the two errors are independent, i.e., S ? i ; i ? j ;8i;j. This dosemitry model is congured with 3 parameter settings, as listed below 71 Dosimetry Setting True Dose/X i Var( S )= 2 S Var( i )= 2 U DS-U Z i i 0 0.3 DS-S Z i S 0.3 0 DS-SU Z i S i 0.3 0.3 In each simulation with a specic dosimetry setting, we generate 400 dose realizations. One of the realizations is used as the true dose to generate the survival data while the rest are used for model tting and condence calculation with the BMA method and the varaince correction method. For BMA method, we also test its performance with only the rst 100 dose realizations made available to see how its performance depends on the number of available dose realizations. For the full Bayesian method, we assume the dosimetry model can be accessed directly. The survial data is tabulated into person-year format, which will be tted using Poisson regression as discussed in Section 3.2.2. 4.1.2 Bayesian Model Averaging Method In the BMA method proposed by Kwon et al., the posterior distribution of the parameters is computed by P(jD) = K X k=1 P(jD;X k )P(X k jD) wherek indexes dose realization from a total number of K, andD denotes the outcome data and covariates other than X. The exact formula used by the authors is a little dierent, which uses a indicator random variable that indicate which dose replication is used. Thus the posterior samples of this indicator variable represents the weight P(X k jD). This specication is discussed later in detail with the original MCMC method, Stochastic Approximation Monte Carlo (SAMC). However, there are tuning parameters in SAMC and it is hard to optimize them automatically. Also, we failed to achieve proper mixing in our preliminary experiments. Thus an alternative approxiamtion method is implemented, described as follows. 72 1. From the Bayes Theorem, with arbitrary k xed, we have P(jD;X k ) = P(Dj;X k )P(jX k ) R P(Dj;X k )P(jX k )d / P(Dj;X k )P(jX k ) Since both P(Dj;X k ) and P(jX k ) can be easily obtained,we can sample from P(jD;X k ) using Metropolis-Hastings algorithm. 2. We also have P(X k jD) = R P(Dj;X k )d() P K k=1 R P(Dj;X k )d() If we assume has a uniform prior distribution in region B = [M;M] p wherep = dim(), and we choose M large such that R B P(Dj;X k ) is small, we have P(X k jD) = R B P(Dj;X K )d P K k=1 R B P(Dj;X K )d R P(Dj;X K )d P K k=1 R P(Dj;X K )d We can obtain a niteM that satises this condition from the likelihood theory. Notice that the integral R P(Dj;X K )d can be approximated by importance sampling, which means P(X k jD) can be approximated. Together this means we can combine the samples from P(jD;X k ) with the weight P(X k jD) as samples from the posterior distribution of P(jD), which will then be used to easily calculate condence limits. Below we discuss the techniques we used and the method used by Kwon et al. 4.1.2.1 Metropolis-Hastings Algorithm Metropolis-Hastings (MH) algorithm was rst introduced by Hastings (1970) as generalizations of a earlier method proposed by Metropolis et al. (1953). 73 The idea of Metropolis-Hastings algorithm is to construct a Markov chain with the stationary distribution as the target distribution. When proper mixing is achieved, the samples from the Markov chain can be seen as obtained from the target distribution. In the case of discrete target distribution , assume the transition matrix P with elementp ij , wherei;j index possible values that can take. To make the target distribution stationary, we can have it satisfy reversibility condition i p ij = j p ji : Now assume the Q with element q ij is the proposal transition matrix to be used. To make it equivalent to P , modication is needed. A simple solution is to let p ij =q ij ij where ij is the probability of accepting the transition from i to j. This leads to i q ij ij = j q ji ji ) ij ji = j q ji i q ij : This can be obtained by simply letting a ij = min(1; j q ji i q ij ); and it can be veried that the reversibility condition is satised, andp ij is a valid transition matrix. Hence the transition can be implemented by rst sampling new state j from previous statei, and then accepted with probability a ij . For continuous target distributions, same argument holds except that we will be using a transition kernel instead of a matrix. 74 An important feature about MH algorithm is that normalizing constant is not required for target distribution since the probability of transition is only related to the ratio of two proba- bilities. However, this method does require proper choice of proposal transition kernel, since the ratio of (x i )=(x j ) in most cases depends on the relative values of or distance between x i and x j . If the proposal distribution is too aggressive, the new value may never get accepted. If the proposal distribution is too conservative, the time to convergence will be very long, and also the chain may be trapped in a region near one mode which has high barriers to other modes. In our simulation, we use multivariate normal distribution with quarter of the inverse of Fisher information matrix as its the variance as the proposal distribution for random walk of the parameters, and choose the Maximum Likelihood Estimates (MLE) as the initial values. Thus we can reasonably assume proper mixing has been achieved and the samples approximately come from the target distribution. 4.1.2.2 Importance Sampling Importance sampling is usually used to calculate integrals. Assume we want to calculate (assuming g(x) 0) I = Z g(x)dx: The integral is equivalent to I = Z g(x) f(x) f(x)dx; wheref(x) is a probability density function. If we can sample from the distribution with density functionf(x), the integral can be approximated as follows, by using the samplesx i ;i2f1;:::;Ng from f(x), ^ I = N X i=1 g(x i ) f(x i ) 1 N : The requirement for f is that the domain of f must contain that of g. 75 It can be easily seen that E( ^ I) =I since E xi g(x i ) f(x i ) =I;8i: Further Var xi g(x i ) f(x i ) = E g(x) 2 f(x) 2 I 2 Obviously the lowerst value for variance is 0 achieved by letting f(x) =g(x)=I. Then Var( ^ I) = 1 N E g(x) 2 f(x) 2 I 2 In practice, since we can't easily draw from g(x)=I, we draw from a proposal distribution f(x) such that thejg(x)=f(x)j is bounded so that we can get an approximation with good property. The advantage of using importance sampling is that when we can have a distribution that is similar to g(x)=I, we can obtain a estimate with small variance. For our analysis, we use multivariate t distribution with variance set to be the inverse Fisher information matrix. Multivariate t distribution is chosen for its heavy tail relative to multivariate normal distribution, with the intention of getting boundedjg(x)=f(x)j in our simulation setting. 4.1.2.3 Stochastic Approximation Monte Carlo Stochastic Approximation Monte Carlo (SAMC) algorithm is developed by Liang, Liu, and Carroll (2007) to better sample from a distribution with multiple modes, with intention to avoid being trapped in local optimum, with applications in importance sampling and model selection. The method has some improvement over widely used Wang-Laudau algorithm (Wang and Landau, 2001) and generalized Wang-Landau algorithm, both of which belong to adaptive MC methods, where the proposal distribution is updated using samples during running. The idea of SAMC is to divide the distribution space into subregions, where density in each subregion is 76 scaled, to achieve desired sampling frequency for each of the subregion. Specically, assume the target distribution isq(x), letE 1 ;:::;E m bem disjoint regions forming a partition of the sample space, = ( 1 ;:::; m ) be the desired sampling frquency vector with P m i i = 1, andf t g be a positive, non-decreasing sequence satisfying P 1 t=1 t =1; P 1 t=1 t <1 for some 2 (1; 2), and let be a compact set, which can be chosen to be [R;R] m for large R, the SAMC algorithm is described in Algorithm 2, Algorithm 2 The SAMC Algorithm t = 1, t1 = (0;:::; 0) while t<T do sample x t using MH algorithm with target distribution C t q t (x), where for x2 E i , q t (x) = q(x)= exp( i;t1 ), C t is a normalizing constant. calculatee t = (e 1;t ;:::;e m;t ) where e i;t =I Ei (x t ) set t = t1 + t (e t ) if t 62 then t = t +c1, where c is chosen such that t +c12 end if end while return fx t g Let ^ i;t be the sampling frequency of x within subregion E i up to iteration t, under mild conditions, the convergence result of the SAMC algorithm says that as t!1, ^ i;t ! +d for E i 6=; where d is a constant such that P i ^ i;t = 1. This means that we essentiall weight each subregion to change its sampling frequency from P(x2E i ) to i when P(x2E i )> 0;8i. In the framework of BMA by Kwon et al. (2016), let be an indicator variable for X, E i = fXj =kg, 2f1;:::;Kg, so we have P( =k) = P(X k jD). Let i = 1=K, it is expected that after large enough iterations, the sampling frequency at iteration t is approximately the same as , i.e. C t P(X k jD) exp( i;t ) i ) exp( i;t ) K C t P(X k jD): Thus we can obtain P(X k jD) throught exp( i;t ). The application of SAMC above comes from general model selection framework. However there are many problems in using the SAMC approach for this particular problem. Usually in model 77 selection problem, the number of models, equivalent asK in the above setting, is small. However, in a complex dosimetry system, thisK ranges from hundreds to thousands, which is relatively big, and the convergence takes much longer. Usually the choice for t is t = t0 max(t0;t) ;t2N + ;t 0 > 1. The convergence depends on the choice oft 0 , which is not easy to optimize given that the algorithm is expected to run for a long time. Assume that we need 50 samples with each dose to get a rough estimation of distributions of parameters, togheter we will need 5 10 4 samples. However, since the algorithm is stochastic, it takes much more iterations to have enough samples from every dose realization, not mentioning the number of iterations taken to achieve good mixing. Overall we think it is better to use the approximation method mentioned above, which is simpler to implement and takes mcuh less time, while following the same idea. 4.1.3 Full Bayesian Method The full Bayesian approach set-up is as follows,8i2f1;:::;ng, Y i Poisson( i ) i = i exp(a 0 +a 1 S i )(1 +bX i ) X i Z i S i S logNormal(1; 2 S ) i logNormal(1; 2 U ) a 0 ;a 1 ;b Gaussian(0;1) where i is the time record in person-year i. This model was rst implemented in JAGS, which have issues such as sampled i going to both1 and1. Thus we implement an adaptive Metropolis-within-Gibbs sampler as described below. 78 4.1.3.1 Adaptive Metropolis-within-Gibbs Sampling Gibbs sampling was rst introduced as a special case of MH algorithm and later developed into a more general framework. It is designed to sample from multivariate distribution. Assume we have a random variable X of dimension d, with its distribution described by density function f(x) =f(x 1 ;:::;x d ). The gibbs sampler algorithm is given below, Algorithm 3 Gibbs Sampler for t = 1 to T do for i = 1 to d do sample x i;t from f(x i jx 1 ;:::;x i1 ;x i+1 ;:::;x d ) end for record x t end for return fx t g The target distribution is f(x), guaranteed by the detailed balance equations as in MH algo- rithm. This algorithm serves as a general framework for Gibbs sampling and one modication, the Metropolis-within-Gibbs sampler (MwG), changes the sampling step to a MH step with the full conditional distribution as the target distribution, when this conditional distribution is hard to sample from directly. Original MH algorithm uses xed parameters that needs to be tuned afterwards by monitoring acceptance rate and convergence. Adaptive Metroplis algorithms have been developed addressing this diculty. The proposal distribution is adjusted during the sampling process to achieve optimal performance without manual interference. Convergence results and examples are given by Roberts and Rosenthal (2009). One example is an application of adaptive method to MwG sampling. For each random component in the model indexed by i, an associated variable l i is created as the logarithm of the standard error of the proposal distribution, usually a random walk with normal distribution. This l i is adjusted during sampling to achieve optimal acceptance rate 0.44 (Roberts and Rosenthal, 2001). The iterations are divided into equal batches, and the acceptance rate is estimated within 79 each batch. At the end ofj-th batch, if the acceptance rate is smaller than 0.44,l i is decreased by (j), otherwise increased by (j). (j) need to satisfy (j)! 0 as j!1, and a common choice is (j) = min(0:01;j 1=2 ). In our simulation, we use random walk with normal distribution with (j) = min(0:01;j 1=2 ), and the the batch size is chosen to be 50. 4.1.4 Variance Correction Method We use the variance correction method described in Chapter 2. The exact correction for the survival analysis is the same as in Chapter 3, and we use our R package described in Section 3.5. 4.2 Result Table 4.1 Coverages of Condence Intervals of Dierent Methods with SUMA-U Method Parameter Coverage (low, high) (%) y BMA a0 65.25 (1.75, 33.00) a1 76.50 (7.25, 16.25) b 55.25 (44, 0.75) BMA-100 reps z a0 56 (0, 44) a1 75 (6, 19) b 46 (54, 0) Full Bayesian a0 96.00 (3.50, 0.50) a1 94.0 (1.00, 5.00) b 72.00 (28.0, 0.00) Corrected CI a0 97.25 (1.50, 1.25) a1 94.25 (1.00, 4.75) b 94.5 (5.25, 0.25) Na ve CI a0 97.25 (1.50, 1.25) a1 93.75 (1.00, 5.25) b 94.00 (5.75, 0.25) CI with true dose a0 97.0 (1.75, 1.25) a1 94.5 (1.00, 4.50) b 95.0 (4.50, 0.50) y In parenthesis, the rst number is the fraction of times that the upper condence limit falls below the true value and the second number is the fraction of times that the lower condence limit goes beyond the true value. z Using only the rst 100 dose realizations from the total 400 dose realizations as the dosimetry system. Only 100 simulations performed. 80 Performance of dierent methods with dosimetry setting DS-U is given in Table 4.1. Among all tested methods, BMA method performs the worst. The coverage of its condence intervals for all parameters falls far below the 95%; the coverage for b is only 55.25%, and it is heavily biased towards the null, i.e., in 46% of the simulations the upper condence limit lies below the true value. Using BMA with 100 dose replications yields slightly worse performance compared to using 400, while their patterns are comparable. With full Bayesian approach and a correct specication of the dosimetry system, the coverage for base line risk eects a 0 and a 1 rises to around 95%, while the coverage for dose eect b 1 is only 72%, and also strongly biased towards the null. The uncorrected condence interval using the mean dose for tting the model performs very well; the coverage of all parameters are close to 95%. The corrected condence intervals give similar result, and these condence intervals perform almost identically well to those using true dose for tting the model, though all three of them have a little bias for a 1 and b. Table 4.2 Coverages of Condence Intervals of Dierent Methods with SUMA-S Method Parameter Coverage (low, high) (%) y BMA a0 89.25 (5.25, 5.50) a1 92.00 (2.00, 6.00) b 86.25 (0.50, 13.25) BMA-100 reps z a0 86 (7, 7) a1 88 (4, 8) b 86 (1, 13) Full Bayesian a0 93.25 (5.00, 1.75) a1 95.25 (1.75, 3.00) b 90.00 (2.50, 7.50) Corrected CI a0 94.75 (2.50, 2.75) a1 95.25 (1.75, 3.00) b 95.50 (2.25, 2.25) Na ve CI a0 94.75 (2.50, 2.75) a1 95.25 (1.75, 3.00) b 43.00 (20.25, 16.75) CI with true dose a0 94.75 (2.50, 2.75) a1 95.25 (1.75, 3.00) b 93.75 (5.50, 0.75) y In parenthesis, the rst number is the fraction of times that the upper condence limit falls below the true value and the second number is the fraction of times that the lower condence limit goes beyond the true value. z Using only the rst 100 dose realizations from the total 400 dose realizations as the dosimetry system. Only 100 simulations performed. 81 Performance of dierent methods for dosimetry setting DS-S is given in Table 4.2. Coverage for all parameters using BMA method stays around 90%. Strong bias is seen for b. However, dierent from Table 1, the bias here is away from the null. BMA with 100 dose realizations does almost as well. The full Bayesian approach gives good coverage around 95% for baseline risk eect a 0 and a 1 , but a slightly worse coverage for b, being at 90%. Variance correction method works very well with all coverages perfectly around 95%, performing as good as using the true dose for model tting. The na ve condence intervals without any correction gives good coverage fora 0 ,a 1 but terrible coverage for dose eect b at only 43%. Table 4.3 Coverages of Condence Intervals of Dierent Methods with SUMA-SU Method Parameter Coverage (low, high) (%) y BMA a0 60.55 (2.26, 37.19) a1 73.87 (11.06, 15.08) b 63.57 (16.83, 19.60) BMA-100 reps a0 68.69 (2.02, 29.29) a1 73.74 (6.06, 20.20) b 58.59 (20.20, 21.21) Full Bayesian a0 92.75 (6.00, 1.25) a1 93.00 (3.25, 3.75) b 89.75 (2.75, 7.50) Corrected CI a0 95.25 (2.75, 2.00) a1 94.00 (3.25, 2.75) b 94.75 (3.00, 2.25) Na ve CI a0 95.25 (2.75, 2.00) a1 93.75 (3.25, 3.00) b 44.25 (38.25, 17.50) CI with true dose a0 96.00 (2.00, 2.00) a1 94.00 (3.25, 2.75) b 94.25 (4.25, 1.50) y In parenthesis, the rst number is the fraction of times that the upper condence limit falls below the true value and the second number is the fraction of times that the lower condence limit goes beyond the true value. z Using only the rst 100 dose realizations from the total 400 dose realizations as the dosimetry system. Only 100 simulations performed. Performance of dierent methods for dosimetry setting DS-SU is given in Table 4.3. The performance of BMA methods, regardless of using 100 or 400 dose replications, is very similar to that in DS-U setting as shown in Table 4.1. The performance of full Bayesian method, variance 82 correction method and na ve method is very similar to that in DS-S setting, as shown in Table 4.2. 4.3 Discussion The performance of BMA method contrasts greatly to what Kwon et al. (2016) claimed. Since the performance of variance correction method meets our expectation and we have done extensive testing in the previous chapter, here we only analyze the problem of Bayesian method and interpret the result. 4.3.1 Unshared Error Only First consider the DS-U dosimetry setting, a situation with only unshared errors. With Z known, we dene the error vector of X to be (X) = (log 1 ;:::; log n ) where X = (Z 1 1 ;:::;Z n n ) and(Z) is a vector of 0. We dene the distance between X i ;X j to be d(X i ;X j ) = d((X i );(X j )) = v u u t n X k=1 (log i;k log j;k ) 2 This distance is not a metric, but simply a representation of similarity between two dose realiza- tions. 83 Since i follows lognormal distribution log i follows Gaussian distribution, or log i Gaussian log(1:3) 2 ; log(1:3) Immediately we see that log 2 ( i ) log(1:3) 2 1 ( log 2 (0:3) 4 ), and thus d 2 (X i ;Z) log(1:3) 2 n ( n log 2 (0:3) 4 ); and d 2 (X i ;X j ) 2 log(1:3) 2 n : In our simulation,n = 5; 00010, d 2 (X i ;Z) and d 2 (X i ;X j ) are concentrated in a relatively small hyperspherical shell region in this high dimensional space, as illustrated in Figure 4.1a and 4.1b. This means, for a limited number of dose realizations such as 400 or 100, all the realizations are very far away from each other. There are two observations here. First, assumeX i is the true dose, tting the model with X j ;j6=i is similar to a classical error problem; Cov(X j ;Y )<Cov(X i ;Y ) while Var(X i ) = Var(X j ), and it is expected the estimated value ^ b will be biased towards the null, which explains the tremendous bias of its condence interval. Second, because of curse of dimensionality,X j will be around this hyperspherical shell region, and this implies that P(DjX j ) will be much smaller than P(DjX i ) and thus the relative weight between X j 's, j6= i cannot be an eective indicator of how close the estimates are to the true values. In fact, all the parameter estimates are strongly biased no matter what their associated weights are, as observed in the results. Now consider the full Bayesian method. The log-likelihood for person-year i is logL i =Y i log i Z 2 i log i + log(1:3) 2 2 2 log(1:3) 84 Z X d 2 ∼ log(1.3)χ 2 n (λ) n = 5× 10 4 λ = nlog 2 (0.3)/4 P > 0.95 d 2 = 13,425 d 2 = 13,763 (a) Most of the probability of d 2 (X;Z) is concentrated in the gray region. X true X k d 2 ∼ 2log(1.3)χ 2 n n = 5× 10 4 P > 0.95 d 2 = 25,912 d 2 = 26,563 (b) Most of the probability of d 2 (X k ;Xtrue) is concentrated in the gray region. X true X k X ∗ ˆX (c) The MCMC chain approaches ^ X from a starting dose X k . X true X k (d) BMA method can only jumps between the available dose replications, failing to approach Xtrue Figure 4.1 Illustration of the Bayesian methods with only unshared errors. 85 where i = i exp(a 0 +a 1 S i )(1 +bZ i i ): There are two components contributing the the log-likelihood; the rst one is from P(YjX) and the second one is from P(XjZ). The rst part can be maximized when we take Y i = i , obtained by i , i = Y i bZ i i exp(a 0 +a 1 S i ) 1 bZ i : The second part can be optimized by taking log i = log(1:3)=2. Hence the overall optimal must be somewhere between these two, which we denote as ^ i . After proper mixing of the MwG algorithm, the MCMC chain should be sampling in region around ^ X, as shown in Figure 4.1c. This is dierent from the BMA approach, where the it is not possible to obtain new samples from the dosimetry system, so the chain will be forced to jump between dose realizations around the hyperspherical shell region, as shown in Figure 4.1d. ^ X corresponds to the Maximum A Posteriori (MAP) estimate in Bayesian problems. We estimate ^ X by iterative tting i 's and other model parameters. The results for estimates of a 0 ;a 1 ;b are given in Table 4.4, and it can be seen that estimates for b are biased. a0 a1 b Mean -5.77 0.71 1.55 SD 0.11 0.095 0.22 Table 4.4 Parameter Estimates with ^ X Estimates for log( i )'s are very close to their expected value log(1:3)=2 (results not shown). In fact, for a single dose-year record indexed by i, the only informative data for estimating i is Y i , and this will not improve with the size of the study. Thus the estimate for log( i ) is heavily weighted towards E(log i ) since Var(Y ) is much larger compared to 2 U . This really means that even with complete knowledge of the dosimetry system, the response data cannot help us much in determining the value of unshared error component. 86 0 1 2 3 b log L(b| a 0 ,a 1 ) ǫ SM = ǫ ǫ SM = 2ǫ (a) The shape of log likelihood function log L(bja0;a1) under dierent shared error components. 0 1 2 3 4 1/ǫ SM P(X k |D) (b) The calculated value of P(X k jD) vs. 1=SM , which should be linear in theory. Figure 4.2 Bayesian methods with only shared errors favor smaller shared errors 4.3.2 Shared Error Only Now consider the DS-S setting, one with only shared error component. Assume the true dose is X i , and for X j with j6=i, we have i = i exp(a 0 +a 1 S i ) 1 +b S;i S;j : ^ b usingX j is a consistent estimator forb S;i = S;j . SinceX i ,X j are only dierent by a multiplica- tive factor, the maximum log-likelihood achieved with dierent X j are the same. Assume that tting with true dose X i gives us MLE ^ b i , then with X j , we get ^ b j = ^ b i S;i = S;j . Thus the 95% condence interval constructed based on the distribution of S will be ^ b i S;i q S ;0:975 ; ^ b i S;i q S ;0:025 : 87 If ^ b i =b, this condence interval will have coverage exactly equals 0.95. However, not every X j , or S;j is treated the same; they have dierent posterior probability with a uniform prior. Observe that P(X j jD)/ Z b P(DjX j ;b)db: Even though maximal obtainable P(DjX j ;b) is the same across dierentj, the shape of P(DjX j ;b) are scaled for dierentj, as shown in Figure 4.2a. X j with smaller S;j , corresponding to a bigger ^ b j , have bigger P(X j jD). In theory, P(X k jD) is proportional to the inverse of S;j . The approxi- mate weights calculated in our simulations in one simulation are shown in Figure 4.2b, which also shows our importance sampling method works reasonably. Hence the posterior distribution of b favor bigger b, leading to a bias away from the null. In fact, in Bayesian statistics, there is no uninformative prior, since every prior is giving information and aecting the nal result. The argument above explains why full Bayesian method gives reasonable coverage while the condence interval is biased away from the null. The same argument holds for BMA method. The inferior performance of BMA method shown in Table 4.2 may be related to the approximation method we use rather than using only 400 dose realizations, as can be seen from Figure 4.2b that the approximate weights do not precisely follow a straight line. 4.3.3 Shared and Unshared Errors In dosimetry setting DS-SU, when both shared and unshared errors are present, the full Bayesian method and the BMA method behave dierently. BMA method, due to the restricted number of dose realizations, suers from the curse of dimensionality caused by unshared error components. In the full Bayesian method, considering only the unshared error, the chain will only sample from region around ^ X after mixing, which limit the eect of the unshared error components. Rather, the shared error component has a much greater impact on the posterior distribution. 88 4.4 Conclusion To conclude, the variance correction method has good coverage over all dosimetry settings, while Bayesian methods, which try to determine true dose, don't work well with complex dosimetry systems. We see that even with complete knowledge of the dosimetry system, using the full Bayesian approach, we can't get proper estimate for unshared error components, and the eect of shared components depends on the assumption of prior distribution. Even worse, the BMA method performs terribly due to curse of dimensionality. A possible solution for overcoming curse of dimensionality is to disable or get rid of the unshared error components in the dosimetry system, and use only the shared errors for performing Bayesian analysis. However, there are dosimetry systems that have relatively large unshared error components, and ignoring unshared error components might not be a good solution. 89 Chapter 5 Impact and Future Research The Mayak Worker Cohort provides the scientic community a well-constructed dataset along with a modern dosimetry system, which can be used to address questions of a wide range about the eect of radiation. The new dosimetry system MWDS-2013, established upon Bayesian perspective, will inspire epidemiologists to reassess the eect of both external exposure and internal exposure to various organs, such as liver, lung, bone marrow, kidney, etc. All these attempts strongly require the consideration and adjustment for measurement error, which coexists intrinsically with the dosimetry system. However, the huge amount of information provided by the new dosimetry system oers computational challenges to statisticians. Though we have numerous methods, such as Rosner's method, SIMEX, etc., that can be used to correct for measurement error in relatively simple settings, we don't have a standard approach in the community for a complex dosimetry system with both shared and unshared errors, usually implemented through dose realizations in Bayesian perspective. Our variance correction method, based on normal-lognormal approximation of the distribution of the eect estimator, provides impressive coverage for both linear models and ERR models, as can be seen from the simulation result in Chapter 2. It can also be easily used in survival analysis with real dosimetry system as seen in Chapter 3. Also, in Chapter 4, we compared our method 90 to a Bayesian approach proposed by (Kwon et al., 2016) and showed our method has a great advantage over Bayesian methods that aim to either perform weighting or estimate true dose. We have submitted our rst paper titled Implementing a corrected information method for ad- justing for shared error when using Monte Carlo dosimetry systems in analysis of epidemiological data to Plos One, which describes our method and results of simulation study with Mayak Worker Cohort and MWDS-2013. We are also under the process of writing another two papers; one on survival analysis of radiation eect on lung cancer with Mayak Worker Cohort and the other one on comparison between BMA method by Kwon et al. (2016) and our variance correction method. Our future research work will be focused on adding quadratic terms to the model and including multiple error associated covariates in our R package. We will also try extension of our method to generalized linear models. 91 Bibliography Birchall, A and M Puncher (2016). \The Mayak Worker Dosimetry System (MWDS-2013): How to Reduce Hyper-Realisations to Realisations". In: Radiation Protection Dosimetry. Carroll, Raymond J et al. (2006). Measurement error in nonlinear models: a modern perspective. CRC press. Cook, John R and Leonard A Stefanski (1994). \Simulation-extrapolation estimation in paramet- ric measurement error models". In: Journal of the American Statistical association 89.428, pp. 1314{1328. Davis, Scott et al. (2007). \Hanford Thyroid Disease Study Final Report". In: Fred Hutchinson Cancer Research Center, Seattle, WA. Devanarayan, Viswanath and Leonard A Stefanski (2002). \Empirical simulation extrapolation for measurement error models with replicate measurements". In: Statistics & Probability Letters 59.3, pp. 219{225. Durrett, Rick (2010). Probability: theory and examples. Cambridge university press. Gilbert, ES et al. (2013). \Lung cancer risks from plutonium: an updated analysis of data from the Mayak worker cohort". In: Radiation research 179.3, pp. 332{342. Hastings, W Keith (1970). \Monte Carlo sampling methods using Markov chains and their appli- cations". In: Biometrika 57.1, pp. 97{109. Kalb eisch, John D and Ross L Prentice (2011). The statistical analysis of failure time data. Vol. 360. John Wiley & Sons. Khokhryakov, Victor V et al. (2013). \Mayak Worker Dosimetry System 2008 (MWDS-2008): assessment of internal dose from measurement results of plutonium activity in urine". In: Health physics 104.4, pp. 366{378. Koshurnikova, NA et al. (1999). \Characteristics of the cohort of workers at the Mayak nuclear complex". In: Radiation research 152.4, pp. 352{363. Kwon, Deukwoo et al. (2016). \Bayesian dose{response analysis for epidemiological studies with complex uncertainty in dose estimation". In: Statistics in Medicine 35.3, pp. 399{423. Laird, Nan and Donald Olivier (1981). \Covariance analysis of censored survival data using log- linear analysis techniques". In:JournaloftheAmericanStatisticalAssociation 76.374, pp. 231{ 240. Liang, Faming, Chuanhai Liu, and Raymond J Carroll (2007). \Stochastic approximation in Monte Carlo computation". In: Journal of the American Statistical Association 102.477, pp. 305{320. Lide, David R et al. (2005). \Physical constants of organic compounds". In: CRC handbook of chemistry and physics 89, pp. 3{1. Metropolis, Nicholas et al. (1953). \Equation of state calculations by fast computing machines". In: The journal of chemical physics 21.6, pp. 1087{1092. Neriishi, Kazuo et al. (1991). \The observed relationship between the occurrence of acute radia- tion eects and leukemia mortality among A-bomb survivors". In: Radiation research 125.2, pp. 206{213. Prentice, Ross L (1982). \Covariate measurement errors and parameter estimation in a failure time regression model". In: Biometrika 69.2, pp. 331{342. 92 Roberts, Gareth O and Jerey S Rosenthal (2009). \Examples of adaptive MCMC". In: Journal of Computational and Graphical Statistics 18.2, pp. 349{367. Roberts, Gareth O, Jerey S Rosenthal, et al. (2001). \Optimal scaling for various Metropolis- Hastings algorithms". In: Statistical science 16.4, pp. 351{367. Rosner, B, WC Willett, and D Spiegelman (1989). \Correction of logistic regression relative risk estimates and condence intervals for systematic within-person measurement error". In: Statis- tics in medicine 8.9, pp. 1051{1069. Stram, Daniel O and Kenneth J Kopecky (2003). \Power and uncertainty analysis of epidemio- logical studies of radiation-related disease risk in which dose estimates are based on a complex dosimetry system: some observations". In: Radiation research 160.4, pp. 408{417. Stram, Daniel Oet al. (2015). \Shared dosimetry error in epidemiological dose-response analyses". In: PloS one 10.3, e0119418. Thomas, Duncan, Daniel Stram, and James Dwyer (1993). \Exposure measurement error: in u- ence on exposure-disease relationships and methods of correction". In: Annual review of public health 14.1, pp. 69{93. Vaart, Aad W Van der (2000). Asymptotic statistics. Vol. 3. Cambridge university press. Vasilenko, E K et al. (2007). \Mayak worker dosimetry study: an overview". In: Health physics 93.3, pp. 190{206. Wang, Fugao and DP Landau (2001). \Ecient, multiple-range random walk algorithm to calcu- late the density of states". In: Physical review letters 86.10, p. 2050. 93 Appendix A Tables 94 Table A.1 Bias, Coverage, Length of CI for Simple ERR Model with Dierent Dosimetry Models with 5000 Subjects (with Dose Known) b Dosimetry * Bias(SD) Coverage y Length of CI z 0.01 SUMA1 0:001(0:020) 0:931(0:001; 0:068) 0:039 SUMA2 0:001(0:013) 0:954(0:003; 0:043) 0:031 SUMA3 0:000(0:005) 0:959(0:001; 0:040) 0:019 HSUMA1 0:001(0:008) 0:942(0:011; 0:047) 0:023 HSUMA2 0:001(0:008) 0:935(0:011; 0:054) 0:022 HSUMA3 0:000(0:006) 0:939(0:003; 0:058) 0:021 HSUMA4 0:000(0:006) 0:933(0:009; 0:058) 0:017 0.05 SUMA1 0:001(0:026) 0:937(0:001; 0:062) 0:068 SUMA2 0:001(0:019) 0:955(0:003; 0:042) 0:057 SUMA3 0:001(0:011) 0:955(0:003; 0:042) 0:041 HSUMA1 0:002(0:014) 0:940(0:007; 0:053) 0:048 HSUMA2 0:001(0:014) 0:948(0:010; 0:042) 0:045 HSUMA3 *1 0:001(0:012) 0:949(0:006; 0:045) 0:044 HSUMA4 0:001(0:010) 0:948(0:010; 0:042) 0:037 0.2 SUMA1 0:007(0:065) 0:938(0:001; 0:061) 0:200 SUMA2 0:007(0:051) 0:957(0:001; 0:042) 0:169 SUMA3 0:004(0:034) 0:956(0:003; 0:041) 0:130 HSUMA1 0:007(0:043) 0:958(0:000; 0:042) 0:151 HSUMA2 0:005(0:037) 0:952(0:003; 0:045) 0:137 HSUMA3 0:005(0:037) 0:951(0:001; 0:048) 0:136 HSUMA4 0:004(0:030) 0:955(0:005; 0:040) 0:111 1 SUMA1 *3 0:098(0:485) 0:910(0:000; 0:090) 1:178 SUMA2 *3 0:068(0:347) 0:927(0:000; 0:073) 0:964 SUMA3 0:047(0:273) 0:936(0:000; 0:064) 0:794 HSUMA1 0:072(0:316) 0:938(0:000; 0:062) 0:960 HSUMA2 0:052(0:283) 0:948(0:000; 0:052) 0:787 HSUMA3 0:044(0:287) 0:929(0:000; 0:071) 0:807 HSUMA4 *6 0:017(0:173) 0:951(0:000; 0:049) 0:609 * The number in the superscript after * indicates how many simulations cannot get a MLE due to diver- gence. y Overall coverage with percentage of missed coverage for upper and lower condence limits shown in parenthesis respectively. z Median of the length of the condence interval from all simulations. 95 Table A.2 Bias, Coverage, Length of CI for ERR Model with Two Error-Prone Covariates (with Dose Known) (b1;b2) Dosimetry Bias for b1 Coverage y for b1 Bias for b2 Coverage y for b2 (0.05, 0.1) SUMA2 *31 0:008(0:087) 0:927(0:000; 0:073) 0:012(0:154) 0:932(0:000; 0:068) HSUMA3 *8 0:002(0:019) 0:928(0:000; 0:072) 0:006(0:040) 0:927(0:000; 0:073) HSUMA4 *3 0:001(0:016) 0:932(0:000; 0:068) 0:005(0:037) 0:931(0:000; 0:069) (0.02, 0.2) SUMA2 *44 0:010(0:123) 0:917(0:000; 0:083) 0:041(0:572) 0:927(0:000; 0:073) HSUMA3 *19 0:002(0:028) 0:906(0:001; 0:093) 0:015(0:145) 0:925(0:000; 0:075) HSUMA4 *14 0:000(0:012) 0:926(0:004; 0:070) 0:009(0:060) 0:929(0:000; 0:071) (0.01, 1) SUMA2 *178 0:001(0:060) 0:781(0:001; 0:218) 0:004(0:658) 0:781(0:000; 0:219) HSUMA3 *153 0:002(0:029) 0:860(0:000; 0:140) 0:079(0:767) 0:851(0:000; 0:149) HSUMA4 *135 0:003(0:054) 0:890(0:000; 0:110) 0:084(0:719) 0:873(0:000; 0:127) * The number in the superscript after * indicates how many simulations cannot get a MLE due to diver- gence. Table A.3 Bias, Coverage, Length of CI for ERR Model with Dose Eect Modier (with Dose Known) a1 a2 Bias(b) Coverage for b1 Bias(a1) Coverage(a1) Bias(a2) Coverage(a2) 0.01 0.01 *4 0:001 0:931 0:000 0:951 0:003 0:952 0.1 *4 0:002 0:937 0:000 0:948 0:000 0:945 0.5 *158 0:001 0:950 0:000 0:960 0:000 0:950 0.1 0.01 *2 0:001 0:929 0:000 0:959 0:002 0:949 0.1 *4 0:002 0:931 0:000 0:950 0:002 0:951 0.5 *188 0:001 0:945 0:000 0:950 0:000 0:951 0.5 0.01 *604 0:024 0:487 0:011 0:768 0:173 0:795 0.1 *590 0:025 0:483 0:016 0:724 0:066 0:695 0.5 *724 0:020 0:547 0:049 0:576 0:160 0:547 * The number in the superscript after * indicates how many simulations cannot get a MLE due to diver- gence. 96 Table A.4 Condence interval coverage for model parameters tted with true dose Dose Uncertainty Model Parameter y Na ve Condence Interval Moderate Strong SUMA-U a0 .967 (.014, .019) z .951 (.021, .028) a1 .954 (.030, .016) .947 (.033, .020) a2 .955 (.025, .020) .937 (.033, .030) a3 .961 (.017, .022) .945 (.021, .034) b1 .953 (.047, .000) .948 (.044, .008) a4 .956 (.015, .029) .947 (.027, .026) a5 .950 (.023, .027) .949 (.028, .023) b2 .940 (.056, .004) .958 (.028, .014) SUMA-S a0 .950 (.014, .036) .946 (.025, .029) a1 .955 (.036, .009) .952 (.026, .022) a2 .952 (.028, .020) .952 (.029, .019) a3 .951 (.022, .027) .950 (.020, .030) b1 .932 (.068, .000) .943 (.041, .016) a4 .945 (.018, .037) .941 (.027, .032) a5 .963 (.018, .019) .940 (.031, .029) b2 .940 (.053, .007) .954 (.028, .018) SUMA-SU a0 .953 (.023, .024) .949 (.021, .030) a1 .949 (.031, .020) .954 (.022, .024) a2 .949 (.022, .029) .952 (.023, .025) a3 .949 (.021, .030) .951 (.022, .027) b1 .936 (.057, .007) .942 (.044, .014) a4 .943 (.025, .032) .965 (.018, .017) a5 .949 (.027, .024) .946 (.019, .035) b2 .951 (.045, .004) .947 (.038, .015) SUMA-SUP a0 .956 (.017, .027) .953 (.021, .026) a1 .949 (.034, .017) .956 (.025, .019) a2 .950 (.030, .020) .948 (.021, .031) a3 .951 (.021, .028) .949 (.020, .031) b1 .938 (.061, .001) .949 (.041, .010) a4 .960 (.017, .023) .943 (.026, .031) a5 .960 (.014, .026) .950 (.023, .027) b2 .937 (.058, .005) .950 (.039, .011) y Model parameters. a0{a3 baseline rate parameters: centered rate (male age 60), log(age/60), log(age/6) 2 , and log of the female:male sex ratio, respectively;b1 internal dose eect (ERR/Gy);a4{a5 internal dose eect modiers log(age/60) internal dose eect modiers: log(age/60) and log of the female:male sex ratio; and b2 external dose eect (ERR/Gy). z Probability that interval includes the true value with, in parenthesis, the probability that the lower bound is below the true value and that the upper bound is greater than the true value. 97 Table A.5 Condence interval coverage for model parameters with MWDS-13 internal dose tted with true dose Dose Uncertainty Model Parameter y Na ve Condence Interval Moderate Strong MWDS-2013 a0 .966 (.013, .021) z .956 (.022, .022) a1 .960 (.024, .016) .947 (.027, .026) a2 .965 (.013, .022) .956 (.020, .024) a3 .943 (.023, .034) .953 (.021, .026) b1 .962 (.037, .001) .950 (.035, .015) a4 .959 (.019, .022) .960 (.019, .021) a5 .959 (.021, .020) .953 (.026, .021) b2 .926 (.070, .004) .952 (.036, .012) y Model parameters. a 0 {a 3 baseline rate parameters: centered rate (male age 60), log(age/60), log(age/6) 2 , and log of the female:male sex ratio, respectively; b 1 internal dose eect (ERR/Gy); a 4 { a 5 internal dose eect modiers log(age/60) internal dose eect modiers: log(age/60) and log of the female:male sex ratio; and b 2 external dose eect (ERR/Gy). z Probability that interval includes the true value with, in parenthesis, the probability that the lower bound is below the true value and that the upper bound is greater than the true value. 98
Abstract (if available)
Abstract
In epidemiological studies, exposures of interest are often measured with uncertainties, which may be independent or correlated. For some important studies, Monte Carlo dosimetry systems that provide multiple realizations of exposure estimates have been used to represent complex error structures. While the effects of independent measurement errors on parameter estimation and methods to correct these effects have been studied comprehensively, the literature on the effects of correlated errors, and associated correction methods is much more sparse. ❧ A novel method is derived and implemented that calculates corrected confidence intervals based on the approximate asymptotic distribution of parameter estimates in linear excess relative risk models. These models are widely used in survival analysis, particularly in radiation epidemiology. Specifically, for the dose effect estimate of interest, a mixture distribution consisting of a normal and a lognormal component is applied. This choice of asymptotic approximation guarantees that corrected confidence intervals will always be bounded. ❧ A simulation study was conducted to evaluate the proposed method in survival analysis using a realistic ERR model. We used both simulated Monte Carlo dosimetry systems (MCDS) and actual dose histories from the Mayak Worker Dosimetry System 2013, a MCDS for plutonium exposures in the Mayak Worker Cohort. Results show our proposed methods provide much improved coverage probabilities for the dose effect parameter, and noticeable improvements for other model parameters. ❧ The Bayesian model averaging method and a full Bayesian method were examined. The comparison between the Bayesian methods and the proposed method is based on simulation and the result is investigated.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Correcting for shared measurement error in complex dosimetry systems
PDF
A hierarchical physiologically-based pharmacokinetic modeling platform for genetic and exposure effects in metabolic pathways
PDF
Evaluating the associations between the baseline and other exposure variables with the longitudinal trajectory when responses are measured with error
PDF
ROC surface in the presence of verification bias
PDF
Using multi-angle imaging spectroradiometer aerosol mixture properties and meteorology for PM₂.₅ assessment in Iran
PDF
Prediction and feature selection with regularized regression in integrative genomics
PDF
Two-step testing approaches for detecting quantitative trait gene-environment interactions in a genome-wide association study
PDF
Covariance-based distance-weighted regression for incomplete and misaligned spatial data
PDF
Enhancing model performance of regularization methods by incorporating prior information
PDF
Hierarchical approaches for joint analysis of marginal summary statistics
PDF
Uncertainty quantification in extreme gradient boosting with application to environmental epidemiology
PDF
Latent unknown clustering with integrated data (LUCID)
PDF
Multilevel Bayesian latent class growth mixture model for longitudinal zero-inflated Poisson data
PDF
Comparison of nonlinear mixed effect modeling methods for exhaled nitric oxide
PDF
Statistical downscaling with artificial neural network
PDF
Machine learning approaches for downscaling satellite observations of dust
PDF
Essays on bioinformatics and social network analysis: statistical and computational methods for complex systems
PDF
Stochastic inference for deterministic systems: normality and beyond
PDF
Best practice development for RNA-Seq analysis of complex disorders, with applications in schizophrenia
PDF
Prediction modeling with meta data and comparison with lasso regression
Asset Metadata
Creator
Zhang, Zhuo
(author)
Core Title
Inference correction in measurement error models with a complex dosimetry system
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Biostatistics
Publication Date
07/07/2017
Defense Date
12/05/2016
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
complex dosimetry system,inference,measurement error,Monte Carlo dosimetry system,OAI-PMH Harvest
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Stram, Daniel Oscar (
committee chair
), Franklin, Meredith (
committee member
), Gauderman, William James (
committee member
), Lewinger, Juan Pablo (
committee member
), Zhang, Jianfeng (
committee member
)
Creator Email
zhangzhuo.nju@gmail.com,zhuozhan@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-393314
Unique identifier
UC11266407
Identifier
etd-ZhangZhuo-5486.pdf (filename),usctheses-c40-393314 (legacy record id)
Legacy Identifier
etd-ZhangZhuo-5486.pdf
Dmrecord
393314
Document Type
Dissertation
Rights
Zhang, Zhuo
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
complex dosimetry system
inference
measurement error
Monte Carlo dosimetry system