Various regression approaches with different regularization methods have been proposed to identify disease-related biomarker genes from gene expression data with high dimension and low sample size. Nevertheless, the high noise in biological data significantly reduces the performance of these methods. The accelerated failure time (AFT) model was designed for gene selection and survival time estimation in cancer survival analysis. In this article, we propose a novel robust sparse accelerated failure time model (RS-AFT) that combines the least absolute deviation (LAD) with Lq regularization. An iterative weighted linear programming algorithm that requires no regularization parameter tuning is proposed to solve the RS-AFT model. The experimental results show that our method performs better in both gene selection and survival time estimation than some widely used regularization methods such as lasso, elastic net and SCAD. Hence the RS-AFT model may be a competitive regularization method in cancer survival analysis.
Accurately estimating cancer patients’ survival time from gene expression datasets with high dimension and low sample size is a significant challenge in survival analysis. An efficient method that identifies the relevant genes associated with tumours may be helpful for cancer research and treatment. In the last two decades, the Cox proportional hazards (Cox) model with regularization has been widely used for patient risk classification and the identification of relevant biomarkers [1, 2, 3]. However, the Cox model may not be suitable if the data do not meet the proportional hazards assumption. Meanwhile, estimating the patient’s survival time has become a very important requirement in clinical treatment. Hence the accelerated failure time (AFT) model has become an alternative to the Cox model in cancer survival analysis. Nevertheless, the small sample size limits the performance of the AFT model; in particular, the censored data in cancer clinical records cannot be used directly in model training. To increase the amount of usable data in the AFT model, several imputation methods have been proposed in cancer survival analysis. The most widely used is the Buckley-James estimation method [4, 5, 6], which estimates the censored data using the Kaplan-Meier approach; another is the rank-based estimator, which estimates the survival time by computing the score function of the partial likelihood [7, 8]. In our proposed RS-AFT model, we used the Kaplan-Meier approach to deal with the censored data.
The ordinary least squares (OLS) approach has long been used to construct prediction models. However, the OLS approach is sensitive to noise in the data, which significantly reduces its robustness in practical applications. Moreover, the OLS estimate is not unbiased under certain conditions, and its estimated variance can be quite large. To improve on OLS estimation, robust regression and regularization methods were proposed. The least absolute deviation (LAD) is a robust regression method designed to cope with noise, while regularization approaches are widely used for variable selection in high dimensional data analysis. To overcome the shortcomings of the OLS method, Li et al. proposed the RLAD method, which combines robust regression with regularization. Subsequently the LAD-lasso and the LAD-adaptive lasso were developed. However, compared with L1-type regularization, Lq (0 < q < 1) type regularization yields sparser results and has some attractive properties, such as unbiasedness, oracle properties and consistency of variable selection [14, 15]. Therefore, Chang et al. proposed LAD-Lq regularization, which outperforms some existing methods based on OLS with L1-type regularization in variable selection.
Considering the high dimensional, low sample size data in cancer survival analysis, many different regularization methods have been used as the penalty function combined with the regression loss function, such as lasso, elastic net, the smoothly clipped absolute deviation (SCAD) and L1/2 regularization. These methods help the model predict the outcome and select the feature genes related to the disease; however, we found it difficult to balance prediction accuracy against sparsity. High prediction accuracy usually comes with a large number of selected genes, which means researchers have to spend considerable time investigating unrelated genes. We considered LAD-Lq regularization, which has the advantages of both LAD and Lq (0 < q < 1), to be a good replacement for these older regularization methods. Hence we proposed a robust sparse AFT model with the LAD-Lq regularization approach (RS-AFT). We expected the new model to perform well in survival time estimation and, because of its sparsity, to be powerful in finding cancer-related genes.
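Suppose the dataset includes $n$ samples and $(y_i, \delta_i, x_i)$ represents a single patient’s sample, where $y_i$ is the observed survival time of patient $i$, $\delta_i = 0$ indicates that the sample is censored, $\delta_i = 1$ indicates that the sample is complete, and $x_i = (x_{i1}, \ldots, x_{ip})^T$ denotes the $p$-dimensional covariates.
The AFT model can be written as a linear regression model:
$$h(T_i) = x_i^T \beta + \varepsilon_i, \quad i = 1, \ldots, n,$$
where $T_i$ is the survival time of patient $i$, $h(\cdot)$ is the log transformation or some other monotone function, $\varepsilon_i$ is an independent random error with a normal distribution, and $\beta = (\beta_1, \ldots, \beta_p)^T$ is the regression coefficient vector of the $p$ variables.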
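To estimate the censored times, we used the Kaplan-Meier weights estimation method because it is simple and fast. The estimated value of the transformed survival time can be written as:
$$y_i^* = \delta_i h(y_i) + (1 - \delta_i) \{\hat{S}(y_i)\}^{-1} \sum_{t_{(r)} > y_i} h(t_{(r)}) \, \Delta \hat{S}(t_{(r)}),$$
where $\hat{S}$ is the Kaplan-Meier estimator of the survival function and $\Delta \hat{S}(t_{(r)})$ is the step of $\hat{S}$ at time $t_{(r)}$.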
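The imputation step can be illustrated with a minimal sketch. It assumes the usual Kaplan-Meier product-limit estimator, and it treats the largest observation as an event so that the steps of the estimator sum to one; this tail convention and the function name `km_impute` are assumptions of the sketch, not details taken from the text.

```python
import math


def km_impute(times, events, h=math.log):
    """Impute censored transformed survival times with Kaplan-Meier weights.

    times  : observed times y_i
    events : 1 if the sample is complete, 0 if it is censored
    h      : monotone transformation (log by default)
    """
    n = len(times)
    # Sort by time; at tied times, events are processed before censorings.
    order = sorted(range(n), key=lambda i: (times[i], -events[i]))
    ev = list(events)
    # Assumption: the largest observation is treated as an event so that
    # the Kaplan-Meier steps sum to one.
    ev[order[-1]] = 1

    surv = 1.0
    at_risk = n
    steps = []  # (event time, size of the drop of S-hat at that time)
    for i in order:
        if ev[i] == 1:
            drop = surv / at_risk
            steps.append((times[i], drop))
            surv -= drop
        at_risk -= 1

    imputed = []
    for i in range(n):
        later = [(t, d) for t, d in steps if t > times[i]]
        mass = sum(d for _, d in later)  # equals S-hat(y_i)
        if events[i] == 1 or mass == 0.0:
            # Complete sample, or nothing observed later: keep h(y_i).
            imputed.append(h(times[i]))
        else:
            # Conditional mean of h(T) given T > y_i under the KM estimate.
            imputed.append(sum(h(t) * d for t, d in later) / mass)
    return imputed
```

For example, `km_impute([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1])` leaves the three complete samples at their log times and replaces the censored sample at time 2 by a weighted average of the log times of the later events.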
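The least squares approach is widely used to estimate $\beta$:
$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i^* - x_i^T \beta)^2,$$
where $y_i^*$ is the transformed (and, for censored samples, imputed) survival time of sample $i$.
To overcome the shortcomings of the least squares approach, especially for data with high noise, the least absolute deviation (LAD) was adopted:
$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} |y_i^* - x_i^T \beta|.$$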
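The difference in robustness between the two losses can be seen in the simplest possible case, a model with a single constant coefficient. There the least squares solution is the sample mean and the LAD solution is the sample median. The data values below are made up purely for illustration.

```python
import statistics


def ols_location(y):
    """Minimiser of sum (y_i - b)^2 over b: the sample mean."""
    return statistics.mean(y)


def lad_location(y):
    """Minimiser of sum |y_i - b| over b: the sample median."""
    return statistics.median(y)


clean = [1.8, 1.9, 2.0, 2.1, 2.2]
noisy = [1.8, 1.9, 2.0, 2.1, 20.0]  # one grossly corrupted observation

ols_shift = abs(ols_location(noisy) - ols_location(clean))
lad_shift = abs(lad_location(noisy) - lad_location(clean))
print(f"OLS estimate moved by {ols_shift:.2f}, LAD estimate moved by {lad_shift:.2f}")
```

A single corrupted value moves the least squares estimate by several units, while the LAD estimate stays where it was.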
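In fact, not all genes in a microarray dataset are associated with the patient’s survival time, which means some coefficients are zero in the true model. A good method should select biomarker genes consistently and efficiently, and regularization methods have been widely used to find the true disease-related genes. The AFT model using the LAD approach with a general penalty function $P(\beta)$ can be written as:
$$\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} |y_i^* - x_i^T \beta| + \lambda P(\beta) \right\},$$
where $\lambda > 0$ is the regularization parameter.
The AFT model with the LAD-lasso regularization approach is:
$$\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} |y_i^* - x_i^T \beta| + \lambda \sum_{j=1}^{p} |\beta_j| \right\}.$$
To obtain sparser solutions, we proposed the robust sparse AFT model with the LAD-Lq approach (RS-AFT):
$$\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} |y_i^* - x_i^T \beta| + \lambda \sum_{j=1}^{p} |\beta_j|^q \right\}, \quad 0 < q < 1.$$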
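A small sketch may help to show why an Lq penalty with 0 < q < 1 favours sparser solutions than the L1 penalty. The two coefficient vectors below have the same L1 norm, but the Lq penalty with q = 1/2 charges the dense vector twice as much as the sparse one. The function names and the choice q = 1/2 are illustrative assumptions.

```python
def lq_penalty(beta, q):
    """Sum of |beta_j|^q; q = 1 gives the lasso (L1) penalty."""
    return sum(abs(b) ** q for b in beta)


def rs_aft_objective(X, y_star, beta, lam, q=0.5):
    """LAD loss plus lam times the Lq penalty (plain Python lists)."""
    lad = sum(
        abs(yi - sum(xij * bj for xij, bj in zip(xi, beta)))
        for xi, yi in zip(X, y_star)
    )
    return lad + lam * lq_penalty(beta, q)


sparse = [1.0, 0.0, 0.0, 0.0]
dense = [0.25, 0.25, 0.25, 0.25]

# Same L1 norm, different Lq (q = 1/2) penalty.
print(lq_penalty(sparse, 1.0), lq_penalty(dense, 1.0))
print(lq_penalty(sparse, 0.5), lq_penalty(dense, 0.5))
```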
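Solving the RS-AFT model is a non-convex optimization problem, so we designed a weighted iterative algorithm to solve it. Given the current estimate $\beta^{(k)}$, the regularization term in the RS-AFT model can be replaced by its first-order Taylor expansion:
$$|\beta_j|^q \approx |\beta_j^{(k)}|^q + q \, |\beta_j^{(k)}|^{q-1} \left( |\beta_j| - |\beta_j^{(k)}| \right).$$
Dropping the terms that do not depend on $\beta$, the minimization problem of the RS-AFT model becomes a weighted LAD-lasso problem:
$$\beta^{(k+1)} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} |y_i^* - x_i^T \beta| + \lambda \sum_{j=1}^{p} q \, |\beta_j^{(k)}|^{q-1} |\beta_j| \right\}.$$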
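The linearization can be checked numerically. The sketch below evaluates the first-order Taylor expansion of |b|^q around a current value and computes the resulting penalty weights. The small constant `eps`, which keeps the weight finite when a coefficient is exactly zero, is an assumption of this sketch and is not taken from the text.

```python
def lq_linearised(b, b_k, q=0.5):
    """First-order Taylor expansion of |b|^q in |b| around |b_k| (b_k != 0)."""
    a_k = abs(b_k)
    return a_k ** q + q * a_k ** (q - 1.0) * (abs(b) - a_k)


def lq_weights(beta_k, q=0.5, eps=1e-6):
    """Weights q * (|beta_j| + eps)^(q - 1) of the reweighted L1 penalty.

    eps is an assumption of this sketch: it avoids division by zero.
    """
    return [q * (abs(b) + eps) ** (q - 1.0) for b in beta_k]


# The expansion is exact at the expansion point and, because |b|^q is
# concave in |b|, it never falls below the true penalty elsewhere.
for b in (0.1, 0.5, 1.0, 2.0, 4.0):
    print(b, abs(b) ** 0.5, lq_linearised(b, 1.0))
```

Coefficients that are currently small receive large weights and are pushed further towards zero in the next iteration, while large coefficients are penalized only lightly.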
In the literature , the BIC method was used to select the optimal regularization parameter . The likelihood function of the posterior probability by BIC is given by:
( is obtained by the least absolute deviation of the AFT model) can be seen as the estimator of can be written as:
Since the variable selection consistency of the Lq method has been proved in , we simply set 1 in the weighted iterative algorithm for tuning the parameter . The detailed procedure of the weighted iterative algorithm for the RS-AFT model is given below:
|The weighted iterative algorithm for the RS-AFT model|
|Input: The training dataset|
|Output: The AFT estimator|
|1:||Initialize 0 (), compute the by using the least absolute deviation in the AFT model;|
|2:||Set , ;|
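One way to realize a weighted iterative scheme of this kind is to solve each reweighted LAD-lasso subproblem as a linear programme. The sketch below does so with `scipy.optimize.linprog`. It is only an illustration under stated assumptions: the regularization parameter `lam`, the exponent `q`, the constant `eps`, the stopping rule and all function names are choices made here, and the rule used in the paper for the regularization parameter is not reproduced.

```python
import numpy as np
from scipy.optimize import linprog


def weighted_lad_lp(X, y, pen):
    """Solve min_b sum_i |y_i - x_i^T b| + sum_j pen_j |b_j| as an LP.

    Variables are b_plus (p), b_minus (p), u (n), v (n), all non-negative,
    with X (b_plus - b_minus) + u - v = y, so that u + v = |residual|.
    """
    n, p = X.shape
    c = np.concatenate([pen, pen, np.ones(2 * n)])
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p] - res.x[p:2 * p]


def rs_aft_fit(X, y, lam=1.0, q=0.5, eps=1e-6, max_iter=20, tol=1e-6):
    """Iteratively reweighted LAD fit with an Lq-type penalty (sketch)."""
    p = X.shape[1]
    # Step 1: initial estimate from the unpenalized LAD fit.
    beta = weighted_lad_lp(X, y, np.zeros(p))
    for _ in range(max_iter):
        # Weights from the first-order expansion of |b|^q (eps is assumed).
        w = q * (np.abs(beta) + eps) ** (q - 1.0)
        new_beta = weighted_lad_lp(X, y, lam * w)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    beta[np.abs(beta) < 1e-8] = 0.0  # clean up numerical dust
    return beta
```

Here `y` stands for the transformed and imputed survival times. Each iteration is a convex problem, so the only non-convexity is handled through the reweighting.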
In this section, we compared AFT models with four different regularization approaches (LAD-Lq, lasso, SCAD and elastic net (EN)). First, we generated vectors from independent standard normal distributions and used them to construct the covariates $x_i$, where $\rho$ is the correlation coefficient between covariates; the patient’s survival time $T_i$ was then computed from the covariates. The number of censored samples was decided by the censoring rate, and the censoring times $C_i$ were drawn from a random distribution accordingly. The observed survival time in the simulated data was defined as $y_i = \min(T_i, C_i)$ with $\delta_i = I(T_i \le C_i)$. To test the performance of the different methods in a noisy environment, noise was added through the term $\sigma \varepsilon_i$, where $\sigma$ is the noise control parameter and the $\varepsilon_i$ are independent random errors from $N(0, 1)$. Finally, the simulated data were represented as $(y_i, \delta_i, x_i)$.
We set the dimension of the simulated datasets to $p = 1500$. The coefficients of 10 of these 1500 genes were nonzero, and the coefficients of the remaining 1490 genes were zero. The right-censoring rate was 30%. We set the training sample size to $n = 150$, the correlation coefficient to $\rho = 0$ and $0.3$, and the noise control parameter to $\sigma = 0$ and $0.3$. Each model obtained by the different methods was tested on a dataset of 50 samples, and the final outcomes were averaged over 100 repetitions.
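A simulation of this kind could be sketched as follows. Several details are assumptions made only for illustration, since the exact formulas are not recoverable from the text: the covariates are correlated through a shared standard normal factor, the log survival time is linear in the covariates plus scaled standard normal noise, the nonzero coefficients are set to plus or minus one, and a fixed share of samples is censored at a random fraction of the true survival time.

```python
import numpy as np


def simulate_aft(n=150, p=1500, n_nonzero=10, rho=0.3, sigma=0.3,
                 censor_rate=0.3, seed=0):
    """Generate (y, delta, X, beta) for a sparse AFT model (sketch)."""
    rng = np.random.default_rng(seed)
    gamma0 = rng.standard_normal((n, 1))  # shared factor
    gamma = rng.standard_normal((n, p))
    # Assumption: pairwise correlation rho induced by the shared factor.
    X = np.sqrt(1.0 - rho) * gamma + np.sqrt(rho) * gamma0

    beta = np.zeros(p)
    # Assumption: nonzero coefficients are +1 or -1.
    beta[:n_nonzero] = rng.choice([-1.0, 1.0], size=n_nonzero)

    # Assumption: log survival time is linear in X plus sigma * N(0, 1) noise.
    T = np.exp(X @ beta + sigma * rng.standard_normal(n))

    # Assumption: exactly round(censor_rate * n) samples are censored,
    # each at a random fraction of its true survival time.
    n_cens = int(round(censor_rate * n))
    C = np.full(n, np.inf)
    idx = rng.choice(n, size=n_cens, replace=False)
    C[idx] = rng.uniform(0.05, 0.95, size=n_cens) * T[idx]

    y = np.minimum(T, C)             # observed time
    delta = (T <= C).astype(int)     # 1 = complete, 0 = censored
    return y, delta, X, beta
```

With the default arguments this yields 150 samples with 1500 covariates, 10 nonzero coefficients and 45 censored observations.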
In this article, we used four evaluation measures to compare the performance of the different methods: sensitivity, specificity, efficiency and absolute error. Sensitivity, specificity and efficiency were used to assess gene selection performance. True positive (TP) is the number of correct genes that are selected, true negative (TN) is the number of irrelevant genes that are not selected, false negative (FN) is the number of disease-related genes that are not selected, and false positive (FP) is the number of irrelevant genes that are selected by the different methods.
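The counts above translate directly into code. Sensitivity and specificity follow their standard definitions, TP / (TP + FN) and TN / (TN + FP). The formula for efficiency is not given in the text; the sketch below assumes it is the share of selected genes that are correct, TP / (TP + FP), which should be read as an assumption rather than the paper's definition.

```python
def selection_metrics(selected, relevant, n_genes):
    """Gene selection metrics from index sets (sketch).

    selected : indices of genes chosen by the model
    relevant : indices of truly disease-related genes (must be non-empty)
    n_genes  : total number of genes
    """
    selected, relevant = set(selected), set(relevant)
    tp = len(selected & relevant)   # correct genes that are selected
    fp = len(selected - relevant)   # irrelevant genes that are selected
    fn = len(relevant - selected)   # correct genes that are missed
    tn = n_genes - tp - fp - fn     # irrelevant genes that are not selected
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # Assumption: efficiency = fraction of selected genes that are correct.
        "efficiency": tp / (tp + fp) if tp + fp else 0.0,
    }


print(selection_metrics(selected=[0, 1, 2, 10], relevant=[0, 1, 2, 3], n_genes=20))
```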
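The absolute error was computed to test the ability of survival time estimation:
$$\text{absolute error} = \frac{1}{m} \sum_{i=1}^{m} |y_i - \hat{y}_i|,$$
where $m$ is the number of patients in the dataset, $y_i$ is the survival time of patient $i$ in the dataset, and $\hat{y}_i$ is the estimated survival time of patient $i$ using our model.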
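As a minimal sketch, the absolute error can be computed as the average absolute difference between observed and estimated survival times. Averaging over all patients in the test set is an assumption here, and the numbers in the example are made up.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |y_i - y_hat_i| over all patients."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


# Hypothetical survival times (for example in months) and model estimates.
observed = [10.0, 20.0, 30.0]
estimated = [12.0, 18.0, 33.0]
print(mean_absolute_error(observed, estimated))
```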
|Control parameter||Number of total selected genes||Number of correct genes|
Tables 1 and 2 show the gene selection performance of the different methods under the different parameter settings. As the noise parameter and the correlation coefficient decreased, the performance of all models improved. In Table 1, the RS-AFT model always selected the fewest genes across the datasets. Conversely, the AFT model with elastic net invariably selected the most genes. The total number of genes selected by the AFT model with SCAD was larger than that of our model but smaller than that of lasso. In terms of the number of correct genes selected, the elastic net selected the most correct genes because it selected the largest number of genes overall; the numbers of correct genes selected by the remaining three methods were very close.
In Table 2, elastic net obtained the highest sensitivity because it selected the most correct genes, but its specificity was the lowest because it also selected the most irrelevant genes. The specificity values obtained by our model were very close to 1, which means the RS-AFT model rarely selected irrelevant genes and most of the genes it selected were correct. The gaps between the specificity values obtained by RS-AFT and SCAD were very small. In terms of efficiency, the RS-AFT model was the highest, which means users can easily find the true disease-related genes among the genes it selects. These results indicate that, in gene selection, our RS-AFT model outperformed the AFT models with lasso, elastic net and SCAD, and it can help researchers find the real biomarker genes quickly.
The absolute errors obtained by the different methods in the simulation experiments are shown in Table 3. The absolute errors of the elastic net model were always the largest, SCAD was better than lasso, and the RS-AFT model achieved the smallest absolute error. Hence our method had the best performance in survival time estimation among the four methods.
From the above discussion, the RS-AFT model appears to be a more appropriate approach for cancer survival analysis of microarray gene expression data because of its good gene selection performance and its high precision in estimating patients’ survival time.
4.2 Real data experiments
In this section, the different methods were applied to four real survival microarray datasets: the diffuse large B-cell lymphoma dataset of 2002 (DLBCL2002), DLBCL2003, the lung cancer dataset and the AML dataset. DLBCL2002 contains information on about 240 lymphoma patients and was first published by Rosenwald. Each patient sample includes the expression data of 7399 genes and the observed survival or censoring time. Compared with DLBCL2002, DLBCL2003 has only 92 lymphoma patient samples, but the number of observed genes increases to 8810. The lung cancer dataset was published by Beer; it has samples from 86 cancer patients, each of which includes 7129 genes. The AML dataset was first reported by Bullinger and has 116 patients, each with 6283 genes. These datasets are summarized in Table 4.
|Dataset||No. of genes||No. of samples||No. of censored||No. of training||No. of testing|
To compare the performance of the four AFT models, two thirds of the samples in each real dataset were used for training and the remaining samples were used as the test data. The regularization parameters of the different methods were tuned by 5-fold cross validation.
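The data partitioning described above can be sketched with index lists only. The random shuffling, the seed and the way folds are formed are assumptions of this illustration; the text states only the two-thirds split and the use of 5-fold cross validation.

```python
import random


def train_test_split_indices(n, train_fraction=2 / 3, seed=0):
    """Shuffle sample indices and split them into training and test sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(n * train_fraction)
    return idx[:n_train], idx[n_train:]


def k_fold_indices(indices, k=5):
    """Return k (fit, validation) pairs that partition the given indices."""
    folds = [indices[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        pairs.append((fit, folds[i]))
    return pairs


# Example with 86 samples, the size of the lung cancer dataset.
train_idx, test_idx = train_test_split_indices(86)
cv_pairs = k_fold_indices(train_idx, k=5)
```

For each candidate value of a regularization parameter, a model would be fitted on every `fit` part and scored on the matching validation part, and the value with the best average score would be kept.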
The gene selection performance of the different AFT models on the four real datasets is shown in Table 5. The RS-AFT model selected the fewest genes. SCAD selected the second fewest, close to the RS-AFT model, followed by the AFT model with lasso. The AFT model with EN selected many more genes than the other three methods, which means researchers would have to spend much time eliminating irrelevant genes.
Table 6 reports the average absolute errors obtained by the different AFT models on the four datasets. SCAD clearly performed better than lasso, and the elastic net produced the largest absolute errors. We reach the same conclusion as in the simulation experiments: the RS-AFT model achieved the highest estimation precision, with errors much smaller than those of the other methods.
Taken together, the results in Tables 5 and 6 indicate that our RS-AFT model performs better in both gene selection and survival time estimation. These are very important considerations in disease research and in the clinical application of cancer survival analysis. Hence our method appears more competitive than the other regularization methods.
For biological analysis of the results, the 15 top-ranked genes selected by the different AFT methods in the lung cancer dataset are shown in Table 7. Compared with the other AFT models based on the least squares approach with different regularization methods, the RS-AFT model selected some unique genes, such as SMAD4, ENPP2 and LLGL1. SMAD4 is a member of the Smad family of signal transduction proteins. The Smad family proteins play a key role in transmitting TGF-beta signals from the cell-surface receptor to the cell nucleus, and mutation or deletion of SMAD4 has been shown to lead to pancreatic cancer. We think it may also be strongly associated with lung cancer. ENPP2, also known as ATX, can stimulate the motility of tumour cells, and its expression has been found to be upregulated in several kinds of cancer. The protein encoded by LLGL1 is reported to be very similar to a tumour suppressor in Drosophila, which makes it a gene highly relevant to cancer. Moreover, some relevant genes selected by the AFT models with lasso, elastic net and SCAD were also found by the RS-AFT model, for example TRA2A, WWP1, DOC2A and HUWE1. Their significant association with lung cancer has been discussed previously.
We obtained similar experimental results from the analysis of the other three real datasets. The biological analysis showed that the RS-AFT model not only finds the relevant genes selected by AFT models with other regularization methods, but also finds some unique genes that were not selected by the other AFT models yet are significantly associated with the disease. Hence the RS-AFT model may find disease-related genes accurately and efficiently.
The experimental results show that the RS-AFT model outperforms some existing survival estimation approaches. It can effectively select biomarker genes and accurately estimate patients’ survival time in biological datasets with high dimension and low sample size. With fewer marker genes and accurate survival time prediction, this method may be a more practical tool for cancer research and treatment.
In the data experiments we found that a large proportion of censored data greatly affects the accuracy of the RS-AFT model: the more censored data there are, the more difficult the estimation becomes. Hence, in future work we will try to combine the RS-AFT model with machine learning methods, such as semi-supervised methods, which may cope well with censored data; having more complete data should clearly improve the accuracy of our RS-AFT model.
This work is supported by the Macau Science and Technology Development Funds (Grant No. 003/2016/AFJ) from the Macau Special Administrative Region of China.
Conflict of interest
None to report.
[1] Tibshirani R. The lasso method for variable selection in the Cox model. Stat Med 1997; 16: 385-395.
[2] Gui J, Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size setting, with applications to microarray gene expression data. Bioinformatics 2005; 21: 3001-3008.
[3] Liu C, et al. The L1/2 regularization method for variable selection in the Cox model. Appl Soft Comput 2014; 14(c): 498-503.
[4] Buckley J, James I. Linear regression with censored data. Biometrika 1979; 66: 429-436.
[5] Tsiatis A. Estimating regression parameters using linear rank tests for censored data. Ann Stat 1990; 18: 354-372.
[6] Huang J, Ma S, Xie H. Regularized estimation in the accelerated failure time model with high dimensional covariates. Biometrics 2006; 62: 813-820.
[7] Cai T, Huang J, Tian L. Regularized estimation for the accelerated failure time model. Biometrics 2009; 65: 394-404.
[8] Jin Z, Lin DY, Wei LJ, Ying Z. Rank-based inference for the accelerated failure time model. Biometrika 2003; 90: 341-353.
[9] Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc 1958; 53: 457-481.
[10] Chang XY, Xu ZB, Zhang H, et al. Robust regularization theory based on Lq (0 < q < 1) regularization: The asymptotic distribution and variable selection consistence of solutions (in Chinese). Sci Sin Mat 2010; 40(10): 985-998. doi: 10.1360/012010-77.
[11] Li W, Michael DG, Ji Z. Regularized least absolute deviations regression and an efficient algorithm for parameter tuning. In: Proceedings of the Sixth International Conference on Data Mining, Washington, IEEE Computer Society 2006; 690-700.
[12] Wang H, Li G, Jiang G. Robust regression shrinkage and consistent variable selection through the lad-lasso. J Business Economic Statist 2007; 25: 347-355.
[13] Xu JF, Ying ZL. Simultaneous estimation and variable selection in median regression using lasso-type penalty. Ann Inst Stat Math 2010; 62: 487-514.
[14] Xu ZB, Zhang H, Wang Y, et al. L1/2 regularization. Sci China Ser F 2010; 53: 1159-1169.
[15] Chartrand R, Staneva V. Restricted isometry properties and nonconvex compressive sensing. Inverse Problems 2008; 24: 1-14.
[16] Rajaratnam B, Sparks D. Fast Bayesian lasso for high-dimensional regression. Statistics 2015.
[17] Jing LI, Wang J, Hui LI, et al. Selection and classification of elastic net feature with fused electroencephalogram features. Journal of Biomedical Engineering 2016.
[18] Miao L, Zhou J, Naylor C, et al. Application of penalized linear regression methods to the selection of environmental enteropathy biomarkers. Biomarker Research 2017; 5(1): 9.
[19] Liu C, Liang Y, Luan XZ, et al. The L1/2 regularization method for variable selection in the Cox model. Applied Soft Computing 2014; 14(1): 498-503.
[20] Datta S. Estimating the mean life time using right censored data. Stat Methodol 2005; 2: 65-69.
[21] Hurvich CM, Tsai CL. Regression and time series model selection in small samples. Biometrika 1989; 76: 297-307.
[22] Sohn I, Kim J, Jung SH, Park C. Gradient lasso for Cox proportional hazards model. Bioinformatics 2009; 25(14): 1775-1781.
[23] Rosenwald A, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large B-cell lymphoma. N Engl J Med 2002; 346: 1937-1946.
[24] Rosenwald A, et al. The proliferation gene expression signature is a quantitative integrator of oncogenic events that predicts survival in mantle cell lymphoma. Cancer Cell 2003; 3: 185-197.
[25] Beer DG, et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat Med 2002; 8: 816-824.
[26] Bullinger L, et al. Use of gene-expression profiling to identify prognostic subclasses in adult acute myeloid leukemia. N Engl J Med 2004; 350: 1605-1616.
[27] Boone BA, et al. Loss of SMAD4 staining in pre-operative cell blocks is associated with distant metastases following pancreaticoduodenectomy with venous resection for pancreatic cancer. J Surg Oncol 2014; 110(2): 171-175.
[28] Umezu-Goto M, et al. Autotaxin has lysophospholipase D activity leading to tumor cell growth and motility by lysophosphatidic acid production. J Cell Biol 2002; 158(2): 227-233.
[29] Schimanski CC, et al. Reduced expression of Hugl-1, the human homologue of Drosophila tumour suppressor gene lgl, contributes to progression of colorectal cancer. Oncogene 2005; 24(19): 3100-3109.
[30] Chai H, et al. The L1/2 regularization approach for survival analysis in the accelerated failure time model. Comput Biol Med 2014.