Robustifying Cox-Regression Model Estimation Using M-estimators with application to breast cancer patients
IRAQI JOURNAL OF STATISTICAL SCIENCES
Volume 20, Issue 2, December 2023, Pages 166-174
Document Type: Research Paper
DOI: 10.33899/iqjoss.2023.0181221
Authors
Salwa Salah Aldean Qassim Haidari* ; Bashar Abd Al-Aziz Al-Talib
Department of Statistics and Informatics, College of Computer Science and Mathematics, University of Mosul, Mosul, Iraq
Abstract
This paper estimates the survival time of breast cancer patients in Nineveh province using real data for the period 2007-2013. Robust estimation formulas are proposed for the Cox regression model in survival analysis in order to determine the degree of hazard faced by women with this disease. Robust weights are used, and some classical variance estimators are replaced with robust estimators, to obtain an efficient estimate of the model; robust weight functions are also suggested. The Huber weight function performed best and was applied with the three templates to obtain the best model for selecting the variables that influence the occurrence of the event.
Highlights
Acknowledgment
The authors are very grateful to the University of Mosul, College of Computer Science and Mathematics, for the facilities provided, which helped improve the quality of this work.
Conflict of interest
The authors have no conflict of interest.
Keywords
Cox regression; Robust regression; Robust weight; Outliers; Huber's M-estimate
Full Text
The Cox regression model, proposed by Cox (1972), is one of the most important models in survival analysis. It suits cases in which the dependent variable is a binary response related to time, as in data on cancer patients. When the data contain outliers, the primary objective is to estimate the parameters of a model that represents the majority of the data. The weighted estimation for the Cox regression model proposed by Schemper et al. (2009) has advantages over other proposed models with regard to the proportionality of hazards, and the model is not restricted by the distribution of the survival time.
The Cox regression model is one of the most important and most widely used models in survival analysis, which deals with cases where the time preceding the occurrence of a specific event is central to the phenomenon under study. The method has several advantages, the most important being the accuracy of its results and the ease with which it handles survival data, which arise when time is taken into account (Bradic, 2011), (Ali & Qadir, 2022). The original Cox regression model was proposed by Cox (1972). If T is a continuous random variable, the basic model is as follows:
$$h(t \mid x) = h_0(t)\,\exp(x\beta)$$
where:
$h(t \mid x)$: the conditional hazard of the event occurring at time t for items with the vector of explanatory variables x.
$h_0(t)$: the baseline hazard function, which depends on time and corresponds to a vector of explanatory variables equal to zero.
$\beta$: a column vector of dimension p×1 of unknown parameters. The partial likelihood method is used to estimate the unknown parameters.
$x$: a row vector of dimension 1×p of explanatory variables.
$\exp(x\beta)$: the relative hazard (hazard ratio), which does not depend on time. That is, the effect of the explanatory variables in increasing or decreasing the hazard is constant and does not change with the time point t. Likewise, the ratio between any two hazard rates is constant and does not depend on time (Al-Saqal, 2020), (Al-Kafrani, 2015).
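The model above can be illustrated with a minimal Python sketch of the hazard ratio and of the partial log-likelihood that is maximized to estimate β. The function names, the toy data, and the coding events[i] = 1 for an observed event are assumptions made for this example; ties are handled in the simple Breslow manner, and this is not the authors' implementation.

```python
import math

def hazard_ratio(x_a, x_b, beta):
    """Hazard ratio exp((x_a - x_b) * beta) between two covariate vectors."""
    return math.exp(sum((a - b) * w for a, b, w in zip(x_a, x_b, beta)))

def cox_partial_loglik(times, events, X, beta):
    """Partial log-likelihood of the Cox model (Breslow handling of ties).

    times  : follow-up times
    events : 1 if the event was observed, 0 if censored (assumed coding)
    X      : list of covariate rows, one per subject
    beta   : list of coefficients
    """
    # Linear predictor x_i * beta for every subject.
    eta = [sum(x * w for x, w in zip(row, beta)) for row in X]
    loglik = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i != 1:
            continue  # censored subjects contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk = sum(math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        loglik += eta[i] - math.log(risk)
    return loglik

# Toy usage with made-up numbers.
times = [5, 8, 12, 20]
events = [1, 0, 1, 1]
X = [[1.0], [0.0], [1.0], [0.0]]
print(cox_partial_loglik(times, events, X, [0.3]))
print(hazard_ratio([1.0], [0.0], [0.3]))
```

Note that the baseline hazard h_0(t) never appears in the computation, which is why the partial likelihood does not depend on the distribution of the survival time.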
Estimating the parameters of the ordinary regression model in the presence of outliers is inefficient, because the data under study no longer match the basic assumptions that the model requires. As a result, the traditional methods lose their good properties for estimating the parameters of the model. Alternative statistical methods that are more resistant to outliers have therefore been developed. These are called robust methods, and the resulting estimators are called robust estimators, which are insensitive to outliers (Al-Obeidi, 2015). Robust statistics is the approach least affected by outliers. It seeks to identify outliers and reduce their impact on the parameter estimates, and it works through residual analysis to down-weight outliers or to remove them altogether. The assigned weights must be examined for each observation, together with an account of the observations that were largely excluded and an assessment of whether these observations were significant in the analysis. Robust regression can be defined as an estimator that retains many of the desirable properties of estimates when some of the regression model's assumptions are violated, or as a general class of statistical procedures designed to reduce the sensitivity of the parameter estimates to failures of the model assumptions. An estimator or statistical procedure can also be called robust if it still provides useful information when such assumptions fail (Hamoudat, 2020).
Researchers may encounter many statistical problems, some obvious and others not, and so need methods that allow them to organize the course of an experiment by making the resulting error as small as possible while obtaining an unbiased estimate of the quantity of interest. The study of outliers began with simple ideas based on intuition and guesswork (Bhar & Cupta, 2001). Outliers have been discussed in many studies because they affect the accuracy and integrity of statistical data and of the results obtained. Outliers are defined as observations that deviate greatly from the other observations, that is, observations that are not consistent with the rest of the group for any of the variables of a particular phenomenon (Al-Baqaal, 2017). They are also defined as observations that seem illogical and deviate markedly from the rest of the sample in which they were found. Barnett (1978) described an outlying observation as one that appears illogical when compared with the rest of the data set. Many researchers have defined outliers, but all definitions converge on one concept: an outlier is an observation that is inconsistent with the rest of the observations (Keller & Brian, 2000). Rousseeuw showed that the belief that outliers can be detected by inspecting the residuals is wrong. When the extreme values are leverage points, they can pull the regression line toward themselves, so that their own residuals are very small while the residuals of the remaining points are large,
even though those remaining points are good. This shows the importance of diagnosing outliers as a step in analysis and decision-making, and it is one of the general goals of data analysis (Barnett & Lewis, 1978). Outliers appear for several reasons: when the data come from a heavy-tailed distribution, and when there is a contaminating distribution. Hawkins (1980) explained that the data come from two types of distributions. The first is the basic distribution, which generates the good data. The second is the contaminating distribution, which generates the outliers. Such values arise from errors in measurement, reading, or recording, or from sampling errors caused by poor sample selection and poor representation of the population (Al-Dabbagh, 2020).
Many researchers have studied outliers and how they affect the accuracy of results in order to reach better decisions. Methods for identifying outliers vary. The most common is the scatter plot, followed by the box plot. A distinction is made between univariate methods, which treat each variable separately, and multivariate methods, which take into account the correlations between variables in the same data set. Other methods for detecting outliers include the histogram, a very common graphical format (Hammodat, 2020). Tukey also introduced a method for organizing interval-scaled data called the stem-and-leaf plot, which is an alternative to the histogram. The presence of outliers in the observations of the explanatory variables or the response variable affects the estimates of the model parameters, the selection of the variables in the regression model, and the associated statistics (Dan & Ijeoma, 2013).
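As a small illustration of a univariate detection method, the following Python sketch applies the usual box plot rule, flagging values that lie more than 1.5 interquartile ranges outside the quartiles. The multiplier 1.5, the interpolation used for the quartiles, the function names, and the sample values are assumptions for this example; the source does not specify them.

```python
def quantile(sorted_vals, q):
    """Quantile by linear interpolation between order statistics."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def boxplot_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    s = sorted(values)
    q1, q3 = quantile(s, 0.25), quantile(s, 0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up survival times in months; 60 is flagged by the rule.
print(boxplot_outliers([10, 12, 11, 13, 12, 11, 60]))
```

Because the rule looks at one variable at a time, it cannot detect leverage points that are unusual only in combination with other variables, which is the motivation for the multivariate and robust methods discussed in the text.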
Many robust methods are available for estimating the model parameters. The most important of these are the robust M-estimators, which have received much recent attention because of their efficiency and their usefulness when the errors do not follow the normal distribution, that is, when the observations of the response variable are not normally distributed. The method gives less weight to outlying observations to reduce their impact and computes the estimate iteratively, which also reduces the effect of autocorrelation (Huber, 1973). In summary, the method limits the effect of large residuals: instead of minimizing the sum of squared errors, it minimizes the sum of a function ρ(.) of the residuals. The M-estimator corresponds to the maximum likelihood estimate because the function ρ(.) is tied to the likelihood function through the choice of an appropriate distribution for the residuals. The residual is the difference between the observed and the fitted response, $e_i = y_i - x_i\hat{\beta}$.
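Let us assume that a data set $x_1, \dots, x_n$ is a random sample from a continuous distribution with probability density function $f(x; \theta)$, where θ is the distribution parameter. The parameter θ can be estimated by the maximum likelihood method, and the likelihood function is as follows:
$$L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$$
$$\sum_{i=1}^{n} \log f(x_i; \theta) \qquad (4)$$
where (4) is maximized, and, in terms of the function $\rho(x; \theta) = -\log f(x; \theta)$, the sum $\sum_{i=1}^{n} \rho(x_i; \theta)$ is minimized. Differentiating with respect to θ gives
$$\sum_{i=1}^{n} \psi(x_i; \theta) = 0 \qquad (6)$$
where ψ is called the influence function and is the partial derivative of the function ρ with respect to θ.
Here K is an integer and δ is an initial estimate of the scale parameter. The likelihood function for a random sample takes the following form:
$$L(\mu, \delta) = \prod_{i=1}^{n} \frac{1}{\delta}\, f\!\left(\frac{x_i - \mu}{\delta}\right) \qquad (8)$$
The maximum likelihood estimates $\hat{\theta}, \hat{\delta}$ of μ, δ are chosen to maximize this likelihood. Differentiating the natural logarithm of the likelihood function with respect to μ, with $\psi = -f'/f$, we get
$$\sum_{i=1}^{n} \psi\!\left(\frac{x_i - \hat{\theta}}{\hat{\delta}}\right) = 0$$
(Al-Obeidi, 2015).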
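The iterative calculation behind an M-estimate can be sketched for the simplest case, a location parameter with a fixed scale. The sketch below assumes the Huber weight with cut-off c = 1.345, a scale estimated once by the median absolute deviation divided by 0.6745, and a median starting value; these choices and the function name huber_location are illustrative assumptions, not the procedure used in the paper.

```python
import statistics

def huber_location(x, c=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location by iteratively reweighted averaging."""
    mu = statistics.median(x)
    # Robust scale: median absolute deviation, rescaled for the normal case.
    mad = statistics.median([abs(v - mu) for v in x])
    scale = mad / 0.6745
    if scale == 0:
        return mu  # more than half the data are identical
    for _ in range(max_iter):
        weights = []
        for v in x:
            u = abs(v - mu) / scale          # standardized residual
            weights.append(1.0 if u <= c else c / u)
        new_mu = sum(w * v for w, v in zip(weights, x)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

# Made-up data: five values near 10 and one gross outlier.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 50.0]
print(statistics.mean(data))   # pulled toward the outlier
print(huber_location(data))    # stays close to the bulk of the data
```

Each pass solves the weighted version of the estimating equation with the current weights, so observations with large standardized residuals contribute less and less to the estimate.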
a- Huber function
The robust estimators are based on the Huber weight function, which has computational advantages but is sensitive to leverage points. The corresponding objective function is a parabola in the vicinity of zero and increases linearly beyond the cut-off c, and an efficiency of 95% is obtained when the errors are normally distributed. For a standardized residual $u = e/S$, the Huber weight function is:
$$w(u) = \begin{cases} 1, & |u| \le c \\ c/|u|, & |u| > c \end{cases} \qquad (10)$$
where c takes the default value c = 1.345. The cut-off (tuning) constant of each function is used to adjust the efficiency of the resulting estimators for specific distributions (S represents the standard deviation of the errors). Values smaller than the constant c give more resistance to extreme values, but at the expense of lower efficiency when the errors are normally distributed. The default tuning constant generally gives reasonably high efficiency in the normal case. In applications the standard deviation of the errors sometimes has to be estimated, and the value of the cut-off constant ranges from one to two standard deviations of the observations or errors. The Huber estimator is sometimes recommended for almost all cases (Al-Obeidi, 2015).
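b- Hampel function
For a standardized residual u, the Hampel weight function is:
$$w(u) = \begin{cases} 1, & |u| \le a \\ a/|u|, & a < |u| \le b \\ \dfrac{a\,(c - |u|)}{|u|\,(c - b)}, & b < |u| \le c \\ 0, & |u| > c \end{cases} \qquad (11)$$
The default values of the tuning constants are a = 2, b = 4 and c = 8.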
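c- Bisquare function
Also called the Tukey-Beaton or Tukey biweight function, it reaches 95% efficiency when the errors are normally distributed. For a standardized residual u:
$$w(u) = \begin{cases} \left[1 - (u/c)^2\right]^2, & |u| \le c \\ 0, & |u| > c \end{cases} \qquad (12)$$
where c defaults to c = 4.685.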
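The three weight functions can be compared side by side with a short Python sketch. It uses the standard textbook forms with the default tuning constants quoted in the text and assumes the residual has already been standardized (divided by the scale S); the function names are chosen for this example only.

```python
def huber_weight(u, c=1.345):
    """Huber: full weight inside [-c, c], then decaying like c/|u|."""
    r = abs(u)
    return 1.0 if r <= c else c / r

def hampel_weight(u, a=2.0, b=4.0, c=8.0):
    """Hampel three-part redescending weight; zero beyond c."""
    r = abs(u)
    if r <= a:
        return 1.0
    if r <= b:
        return a / r
    if r <= c:
        return a * (c - r) / (r * (c - b))
    return 0.0

def bisquare_weight(u, c=4.685):
    """Tukey bisquare (biweight): smooth decay to zero at |u| = c."""
    r = abs(u)
    return (1.0 - (r / c) ** 2) ** 2 if r <= c else 0.0

# Weights given to increasingly extreme standardized residuals.
for u in (0.5, 2.0, 5.0, 10.0):
    print(u, huber_weight(u), hampel_weight(u), bisquare_weight(u))
```

The comparison shows the main design difference: the Huber weight never reaches zero, so every observation keeps some influence, while the Hampel and bisquare weights reject sufficiently extreme observations completely.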
In survival analysis, hazards and hazard ratios are often expected to change over time, so statistical methods require either fixing a time point or taking an average value over time. There are three equations for the average hazard ratio, (13), (14) and (15), where AHR denotes the average hazard ratio, sAHR the simple average hazard ratio, and gAHR the geometric average hazard ratio. The weight function w(t) is chosen to reflect the relative importance of the hazard ratios in different periods. The most commonly used choices are w(t) = 1 and w(t) = S(t), the survival function or, equivalently, the proportion of individuals to whom the hazard ratio at time t applies (Schemper, 2009).
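To make the weight choice w(t) = S(t) concrete, the sketch below estimates the survival function with the Kaplan-Meier product-limit estimator. Using Kaplan-Meier here, the function name, the toy data, and the coding events[i] = 1 for an observed event are assumptions for illustration; the text does not state how S(t) is computed.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of S(t) at each distinct event time.

    times  : follow-up times
    events : 1 if the event was observed, 0 if censored (assumed coding)
    Returns a list of (t, S(t)) pairs in increasing order of t.
    """
    surv = 1.0
    curve = []
    event_times = sorted({t for t, d in zip(times, events) if d == 1})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, d in zip(times, events) if u == t and d == 1)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve

# Toy data: the subject with time 2 is censored.
for t, s in kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]):
    print(t, s)
```

With w(t) = S(t), hazard ratios at late event times, where few individuals remain, receive less weight than those at early event times, whereas w(t) = 1 treats all event times equally.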
The consequences of violating the assumptions of the Cox proportional hazards model are discussed, and the options for dealing with non-proportional hazards are reviewed. An additional option for analysis is proposed, which produces weighted estimates at the time points at which events occur. This procedure can be considered a generalization of the tests to multiple covariates, in the same way that the proportional hazards model is a generalization of the log-rank test. Its advantage is that it also yields estimates of the average hazard ratios for covariates with non-proportional, and especially converging, hazard functions (Breslow, 1974). An empirical study found that the estimated average hazard ratios are very close to the exactly calculated average hazard ratios. Suppose that $(T_i, C_i, Z_i(\cdot))$, $i = 1, \dots, n$, is a random sample from the distribution of $(T, C, Z(\cdot))$ that satisfies the model (16), where T is the observed (failure) time of the random variable and the time-dependent covariate Z(.) is a predictable process. For each value of i we note (Prentice, 1978).
10. Study data and its variables
The data were obtained from Hazem Al-Hafiz Hospital for Cancer in Nineveh Governorate. The research population consists of patients with breast cancer who were diagnosed in the period from 2007 to 2013, 246 patients in total. The data record the date of diagnosis and the date of death or of the patient's last follow-up in 2013. The follow-up period, which is the survival time, was calculated in months from the date of diagnosis until death or the date of the last follow-up. The survival status is given a value of (0) when the patient is dead and a value of (1) when the patient is alive. This is a descriptive variable whose mathematical relationship with the rest of the variables is built by the Cox regression model as follows:
That is, the variable adopted in the above equation is h(t) (the hazard function), a function of the original response variable (Y), since h(t) represents the death rate. The explanatory variables are as follows:
1- Age: age in years; a quantitative variable.
2- Presentation: a descriptive variable.
3- Tumor Site: a descriptive variable that takes (1) for the right side and (2) for the left side.
4- Mass Site: a descriptive variable.
5- Lymph Node: a descriptive variable that takes (0) if negative and (1) if positive.
6- Metastasis: a descriptive variable.
7- Estrogen: takes (1) for yes and (0) for no.
8- Progesterone: takes (1) for yes and (0) for no.
9- Her2, an antigen on the surface of cells: takes (0) for no and (1) for yes.
10- DXT: deep X-ray therapy; takes (0) for "not treated" and (1) for "treated".
11- Hormonal Therapy: takes (0) for "not treated" and (1) for "treated".
Month: the time variable on which the Cox regression method depends, measured from the diagnosis of the disease until the survival outcome is known (the occurrence of the event), which is death in our case.
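The construction of the time and status variables described above can be sketched as follows. The average month length of 30.4375 days, the function names, and the example dates are assumptions for illustration; only the status coding (0 = dead, 1 = alive) comes from the text.

```python
from datetime import date

DAYS_PER_MONTH = 30.4375  # assumed average month length (365.25 / 12)

def follow_up_months(diagnosis, last_seen):
    """Survival time in months from diagnosis to death or last follow-up."""
    return (last_seen - diagnosis).days / DAYS_PER_MONTH

def event_indicator(status):
    """Convert the source coding (0 = dead, 1 = alive) to 1 = death observed."""
    return 1 - status

# Hypothetical patient: diagnosed in March 2009, died in September 2011.
months = follow_up_months(date(2009, 3, 1), date(2011, 9, 1))
print(round(months, 1), event_indicator(0))
```

The conversion step matters in practice because most survival software expects the event indicator to be 1 when the event occurred, which is the reverse of the status coding described in the text.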
Cox regression model with Huber weight function and Robustness to outlier data
After analyzing the data using R, we note from Tables (1, 2, 3) that when the three templates (PH, AHR, ARE) are applied in the absence of robustness and with the Huber weight function applied to the variables, the progesterone hormone variable is significant in the (PH) and (AHR) templates. Its probability value was (0.000000), that is, (sig = 0.000000 < 0.05) in both templates, which means that this variable affects the occurrence of the death event. From the Wald statistics, the (PH) template had the largest value, (1087.745), meaning that this template was more efficient than the other templates in determining the variables affecting the model.
Table 1: Template model (PH) with weighted estimation, outliers, and Robustness
Table 2: AHR template model with weighted estimation, outliers, and Robustness
Table 3: Template model (ARE) with weighted estimation, outliers, and Robustness
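The significance decisions reported for these tables compare a probability value with 0.05. As a hedged sketch of how such a value arises for a single coefficient, the Wald statistic is the squared ratio of the estimate to its standard error, referred to a chi-square distribution with one degree of freedom. The function name and the numbers below are illustrative and are not taken from the study; the overall Wald statistics quoted in the text presumably test several coefficients jointly, which would require the full covariance matrix.

```python
import math

def wald_test(beta_hat, std_err):
    """Wald statistic and p-value for H0: beta = 0 (single coefficient)."""
    w = (beta_hat / std_err) ** 2
    # Upper tail of chi-square with 1 df equals the two-sided normal tail.
    p_value = math.erfc(math.sqrt(w / 2.0))
    return w, p_value

# Illustrative values only.
w, p = wald_test(0.98, 0.5)
print(w, p, "significant" if p < 0.05 else "not significant")
```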
Proposed algorithm for robustifying Cox's regression model
In this work, an algorithm was developed to robustify the estimation process in the Cox regression model in order to identify the main factors affecting the dependent variable (survival time in our data). This was done through the following series of steps:
Figure 1: The proposed algorithm to robustify the Cox regression model
The results of the proposed method to robustify the Cox regression model
The results of the analysis indicated that, when the three templates (PH, AHR, ARE) were applied to data with outliers and without robustness to the variables, only the disease progression variable was significant, with a probability value of (0.04473447), that is, (sig = 0.04473447 < 0.05), so it affects the model. With robustness to the variables, the disease progression variable (the way the disease progressed) and the progesterone variable were significant, with probability values of (0.02479655, 0.000000, 0.03495496) respectively, meaning that (sig = 0.02749655 < 0.05), (sig = 0.000000 < 0.05) and (sig = 0.03495496 < 0.05), so they affect the occurrence of the event. When the robust Cox regression model weighted by the (Huber, Hampel, Bisquare) functions was applied to the data containing outliers, the Huber function gave the best results: the progesterone hormone variable was significant under robustness to the variables, which means that the progesterone hormone is the most influential variable on the occurrence of the event (death). From the Wald statistics across all analyses, for data without outliers the (ARE) template with robustness had the largest value, (739.0424). For data with outliers and with robustness, the Wald statistic for the (PH) template was the largest, at (938.2368), and when weights were applied to the outlier data the (PH) template was again the largest, at (1087.745). Comparing the templates by p-value and the Wald test, we found that the (PH) template in the presence of outliers and robust weights is the best among all the templates tested, as it produced the model with the highest significance.
Table 4: The proposed method for robustifying the Cox regression model
References