You are viewing a javascript disabled version of the site. Please enable Javascript for this site to function properly.
Go to headerGo to navigationGo to searchGo to contentsGo to footer
In content section. Select this link to jump to navigation

Robust sparse accelerated failure time model for survival analysis

Abstract

To identify the bio-mark genes related to disease with high dimension and low sample size gene expression data, various regression approaches with different regularization methods have been proposed to solve this problem. Nevertheless, high-noises in biological data significantly reduce the performances of methods. The accelerated failure time (AFT) modelwas designed for gene selection and survival time estimation in cancer survival analysis. In this article, we proposed a novel robust sparse accelerated failure time model (RS-AFT) through combining the least absolute deviation (LAD) and Lq regularization. An iterative weighted linear programming algorithm without regularization parameter tuning was proposed to solve this RS-AFT model. The results of the experiments show our method has better performancebothin gene selection and survival time estimationthan some widely used regularization methods such as lasso, elastic net and SCAD. Hence we thought the RS-AFT model may be a competitive regularization method in cancer survival analysis.

1.Introduction

Accurate estimation for the cancer patients’ survival time with high dimension and low sample size gene expression dataset is a significant challenge in survival analysis. The efficient method which can identify the relevant genes associated with tumours may be helpful for cancer research and treatment. In the last two decades, the Cox proportional hazards (Cox) model with the regularization approach has been widely used for the patient risk classification and relevant biomarkers identification [1, 2, 3]. However the Cox proportional hazards model may not be suitable if the data does not meet the proportional hazards assumption. Meanwhile, the patient’s survival time estimationhas become a very important requirement in clinical treatment. Hence the accelerated failure time (AFT) model has already become one succedaneum of the Cox model in cancer survival analysis. Nevertheless the small sample size limits the performance of AFT model construction, such as the censored data in the cancer clinical data cannot be directly used in the model training. To increase the number of available data in the AFT model, some different imputation methods were proposed in cancer survival analysis. The most widely used one is Buckley-James estimation method [4, 5, 6], it estimates the censored data using the Kaplan-Meier approach; the other one is called ranking based estimator, it estimated the survival time from computing the score function of the partial likelihood [7, 8]. In our proposed RS-AFT model, we used Kaplan-Meier approach [9] to deal with the censored data.

The traditional ordinary least squares (OLS) approach has been used to construct the prediction model for a long time. However the OLS approach is sensitive to noise in the data, which significantly reduces its robustness in the practical application. Meanwhile the OLS estimation cannot achieve an unbiased solution under some certain conditions, and its estimated variance is quite large [10]. To improve the performance of the OLS estimation, the robust regression and the regularization methods were proposed. The least absolute deviation (LAD) is the kind of the robust regression method to confront the noise. The regularization approaches are widely used for variable selection in high dimensional data analysis. To overcome the shortcomings in the OLS method, Li et al. [11] proposed a RLAD method that combines the robust regression and regularization approach together. After that the LAD-lasso [12] and LAD-Adaptive lasso [13] were implemented. However compared to the L1 type regularization, the Lq(0 <q< 1) type regularization can obtain more sparse result, and it has some attractive properties, such as unbiasedness, oracle properties and consistency of variable selection [14, 15]. Therefore, Chang et al. [10] proposed LAD-L q regularization, which outperforms some existing methods based on the OLS with L1 type regularization approaches in variable selection.

Considering the high dimensional and low sample size data in cancer survival analysis, many different kinds of regularization methods were used as the penalty function to combine with the regression loss function, such as lasso [16], elastic net [17], the smoothly clipped absolute deviation (SCAD) [18], L1/2 regularization [19]. These methods help the model predict the objective function value and select the feature genes related to the disease; however we found it was a difficult work to get a balance between the prediction accuracy and sparsity. Usually high prediction accuracy means large numbers of the selected genes; it means people have to waste much time for researching some unrelated genes. We considered the LAD-Lq regularization, which have the advantages of LAD and Lq (0 <q< 1), was a good choice to instead of these old regularization methods. Hence we proposed a robust sparse AFT model with LAD-Lq regularization approach (RS-AFT), we thought the new model can generate good performances in survival time estimation, and it has a powerful ability to find the cancer related genes because of its sparsity.

2.Method

Supposing the dataset included n samples, (yi,δi,xi)i=1n represents the single patient’s sample, where yi is the observed survival time pf the patient, δi=0 represents the sample is the censored data and if δi=1 means the sample is the completed data, xi=(xi1,,xip)indicate the p dimensional covariates.

The AFT model can be written as a linear regression model: h(yi)=xiTβ+εi,i=1,,n where h(.) is the log transformation or some other monotone functions, εi is the independent random error with a normal distribution function, and β=(β1,β2,,βp) is the regression coefficient vector of p variables.

For estimating the censored time, we used the Kaplan-Meier weights estimation method because of its simple and fast [6]. The estimated value h(yi) of the censored time yi can be written as:

(1)
h(yi)=(δi)h(yi)+(1-δi){S^(yi)}-1t(r)>th(t(r))ΔS^(t(r))

where ΔS^(t(r)) is the step of at time t(r) [20].

As we know, the least squares approach method is widely used to find β:

(2)
βLS=argmini=1n(h(yi)-xiTβ)2

To overcome the shortcomings of least squares approach, especially for data X with high noise, the least absolute deviation (LAD) was adopted:

(3)
βLAD=argmini=1n|h(yi)-xiTβ|

In fact, not all genes in the microarray dataset may be associated with the patient’s survival time, which means some coefficients β may be zero in the true model. A good method should select bio-mark genes consistently and efficiently. Some regularization methods have been widely used to find the true disease related genes. The different penalty function regularized AFT model using LAD approach will be written as:

(4)
βLAD=argmin{i=1n|h(yi)-xiTβ|+λP(β)}

The AFT model with the LAD-lasso regularization approach is:

(5)
β=argmin{i=1n|h(yi)-xiTβ|+j=1pλ|βj|1}

Trying to get more sparse solutions, we proposed the robust sparse AFT model with the LAD-Lq approach (RS-AFT):

(6)
β=argmin{i=1n|h(yi)-xiTβ|+j=1pλ|βj|q}

3.Algorithm

Solving RS-AFT model is a non-convex optimization problem. We designed the weighted iterative algorithm to solve it. The regularization part |β|q in the RS-AFT model can be replaced by the first-order Taylor expansion:

(7)
|β|q|β0|q+1|β0|1-q(|β|-|β0|)=|β||β0|1-q

The minimization problem of the RS-AFT model will be shown:

(8)
β=argmin{i=1n|h(yi)-xiTβ|+j=1pλ|β0,j|1-q|βj|}

In the literature [21], the BIC method was used to select the optimal regularization parameter λ. The likelihood function of the posterior probability by BIC is given by:

(9)
l(β)=i=1n|h(yi)-xiTβ|+j=1p(λ|βj|q-log(λ)log(n)-log(n))

Minimizing l(β):

(10)
λn,j=log(n)|βj|q

|β𝐿𝐴𝐷|q (β𝐿𝐴𝐷 is obtained by the least absolute deviation of the AFT model) can be seen as the estimator of |β|q,λ can be written as:

(11)
λ=log(n)|β𝐿𝐴𝐷|q

Since the variable selection consistency of the L(0<q1)q method has been proved in [1], we simply set q= 1 in the weighted iterative algorithm for turning the parameter λ. The detail procedure of the weighted iterative algorithm for the RS-AFT model is given:

The weighted iterative algorithm for the RS-AFT model
Input: The training dataset (Y,δ,X)
Output: The AFT estimator
1:Initialize β= 0 (j=1,,p), compute the β𝐿𝐴𝐷 by using the least absolute deviation in the AFT model;
2:Set β=β𝐿𝐴𝐷, t=1;
3:whilej=1p|βtj-β(t-1)j|<10E-5do
4: Update β:
5:argmin{i=1n|h(yi)-xiTβ|+j=1pλ|β0,j|1-q|βj|};
6:t=t+1;
7:end while
8:return β.

4.Results

4.1Simulation experiments

In this section, we compared the AFT models with four different regularization approaches (LAD-Lq, lasso, SCAD, elastic net (EN)). Firstly we generated the vectors of independent standard normal distribution γ,0γi1,γi2,,γip,(i=1,2,,n) and set xij=γij1-c+γi0c,(j=1,,p), where c is the correlationcoefficient [22], the patient’s survival time was computed as: ti=exp(j=1pβijxij),(j=1,2,,p). The number of the censored data was decided by the censored rate r, and the censored time ei were determined from a random distribution accordingly. The observed survival time in the simulated data was defined as: yi=min(ti,ei) & δi=I(ytiyti). To test the performances of the different methods in the noise environment, we calculated yi=yi+sε, where s is the noise control parameters and ε is the independent random errors from N(0, 1). Finally the simulated data were represented as (yi,δi,xi).

We set the dimension of the simulated datasets p= 1500. The coefficients of the 10 genes in these 1500 genes were nonzero, and the coefficients β of the remaining 1490 genes are zero. The right censored rate r= 30%. We set training sample size n= 150, the correlation coefficient c= 0, 0.3 and the noise control parameter s= 0, 0.3 respectively. Each result obtained by different method was tested on a dataset including 50 samples, and the final outcomes were averaged over 100 repeats in the programme.

In this article, we used four evaluation parameters to compare the performances of different methods, the sensitivity, specificity, efficiency and absolute error E. The sensitivity, specificity, and efficiency parameters were used to test the gene selection performance. Supposing true positive (TP) is the number of selected correct genes, true negative (TN) is the number of the irrelevant genes which are selected, false negative (FN) is the number of the related genes to the disease which are not selected, and the false positive (FP) is the number of the irrelevant genes which are not selected by different methods.

(12)
𝑆𝑒𝑛𝑠𝑖𝑡𝑖𝑣𝑖𝑡𝑦=𝑇𝑃𝑇𝑃+𝐹𝑁
(13)
𝑆𝑝𝑒𝑐𝑖𝑓𝑖𝑐𝑖𝑡𝑦=𝑇𝑁𝑇𝑁+𝐹𝑃
(14)
𝐸𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑐𝑦=selected correct genestotal selected genes

The absolute error E was computed to test the ability of survival time estimation:

(15)
E=i=1n|yei-yi|n & δi=1

where the yi is the survival time of patient i in the dataset, and the yei is the estimated survival time of the patient using our model.

Table 1

Theperformanceofgene selection obtainedby different AFT methods

Control parameterNumber of total selected genesNumber of correct genes
RS-AFTLassoSCADENRS-AFTLassoSCADEN
c= 0.3, s= 0.328.4273.6039.41115.858.158.218.118.55
c= 0.3, s= 0.018.2449.0326.9781.438.798.858.829.17
c= 0, s= 0.320.4755.1728.7887.428.608.658.648.93
c= 0, s= 0.013.8736.4720.1955.389.159.189.129.42

Table 2

The gene selection performancesofdifferent methods in simulation experiments

Control parameterSensitivitySpecificityEfficiency
RS-AFTLassoSCADENRS-AFTLassoSCADENRS-AFTLassoSCADEN
c= 0.3, s= 0.30.8150.8210.8110.8550.9860.9560.9790.9280.2870.1110.2060.074
c= 0.3, s= 0.00.8790.8850.8820.9170.9940.9730.9880.9520.4820.1810.3270.113
c= 0, s= 0.30.8600.8650.8640.8930.9920.9690.9860.9470.4200.1560.3000.102
c= 0, s= 0.00.9150.9180.9120.9420.9970.9820.9930.9690.6590.2520.4520.170

Tables 1 and 2 show gene selection performances of different methods in the different parameter settings. We found that with the decreasing of the noise parameter s and the correlation coefficient c, the models’ performances become better. In Table 1 the RS-AFT always selected the least disease related genes in different datasets. Conversely, the AFT model with elastic net invariably selected most genes. The number of total genes selected by AFT model with SCAD was more than our model but less than lasso. Compared the number of selected correct genes, the elastic net selected most correct genes because its largest number of selected genes; the number of correct genes selected by remain three methods were much closed.

In Table 2, elastic net obtained the highest sensitivity because it selected most correct genes, but the specificity of elastic net was lowest because most irrelevant genes. The values of specificity obtained by our model were much closed to 1, it means RS-AFT model rarely selected irrelevant genes, we can say most of the selected genes obtained by RS-AFT were correct. And also we found the gabs between the values of specificity obtained by RS-AFT and SCAD were very small. Compared the efficiency, it is easy to find the gene selection efficiency of RS-AFT was highest, it means the users can easily find the true disease related genes in the RS-AFT model selected genes. These above results indicate that compared the gene selection performance, our RS-AFT model was better than the AFT models with lasso, elastic net and SCAD, it can help researchers find the real bio-mark genes fast.

The absolute errors obtained by different methods in simulation experiments were shown in Table 3, we can find the absolute errors obtained by elastic net model were always biggest, the SCAD was better than lasso, and the RS-AFT model achieved the smallest absolute error. Hence we thought our method has best performance in survival time estimation compared other three methods.

From the above discussion, we thought the RS-AFT model was a more appropriate approach for can survival analysis in the microarray gene expression data because of its good performance of gene selection, and the high estimation precision for the patients’ survival time.

Table 3

The absolute error E obtained by different methods

Control parameterRS-AFTLassoSCADEN
c= 0.3, s= 0.31.953.852.564.37
c= 0.3, s= 0.01.162.431.752.94
c= 0, s= 0.31.282.611.833.12
c= 0, s= 0.00.731.741.172.05

4.2Real data experiments

In this section, different methods were applied to the four real survival microarray datasets respectively, Diffuse large B-cell lymphoma dataset (DLBCL) 2002, DLBCL (2003), Lung cancer dataset and AML dataset. The DLBCL 2002 contains about 240 lymphoma patients’ information and was first published in [23] by Rosenwald. Each patient sample includes the expression data of 7399 genes and the observed survival or censored time. Compared to DLBCL2002, DLBCL2003 only have 92 samples about the lymphoma patient, but the number of observed genes increased to 8810 [24]. The lung cancer dataset was published by van Beer [25], it has 86 cancer patients’ samples which each sample include 7129 genes. The AML dataset was first mentioned by Bullinger[26], and has 116 patients which contains 6283 genes. A brief introduction of these datasets is summarized in Table 4.

Table 4

The detail information of four real gene expression datasets used in the experiments

DatasetNo. of genesNo. of samplesNo. of censoredNo. of trainingNo. of testing
DLBCL (2002)739924010216872
DLBCL (2003)881092286428
Lung cancer712986626026
AML6283116498135

Table 5

The number of selected genes obtained by different AFT models on the real datasets

DatasetRS-AFTLassoSCADEN
DLBCL (2002)58.52131.2673.70168.43
DLBCL (2003)29.8483.2830.61109.71
Lung cancer28.4386.5139.42102.46
AML39.11110.3768.57152.83

Trying to compare the performance of four different AFT models, two thirds of the samples in the real dataset were used for the training and the other samples were seen as the data. The regularization parameters of different methods are tuned by the 5-fold cross validation.

The relevant gene selection performances of different AFT models in the four real datasets were shown in Table 5. The number of genes selected by our RS-AFT model was the least. The results of the SCAD were second-least and closed to the results of RS-AFT model. The third-least one is the number of genes selected by AFT model with lasso. The number of genes selected by AFT model with EN was much more compared with the other three methods. It means the researchers will pay much time to eliminate the irrelevant genes.

Table 6 describes the averaged absolute error E obtained by different AFT models in four datasets. It was obviously the performance of SCAD was better than lasso and the elastic net achieved the biggest absolute errors. And we can get the same conclusion as in the simulation experiments: the RS-AFT model achieved highest estimation precision with the least errors, which are much smaller than other method.

Comparing the performances in Tables 5 and 6, the results proved our RS-AFT model both have better performances in gene selection and survival time estimation. These are very important considerations in disease research and clinical application in cancer survival analysis. Hence we thought our method is more competitive than other regularization methods.

Table 6

Absolute error obtained by different AFT models on the real microarray datasets

DatasetRS-AFTLassoSCADEN
DLBCL (2002)0.651.310.841.87
DLBCL (2003)1.162.281.412.83
Lung cancer1.843.302.664.18
AML3.114.933.746.08

Table 7

The disease related genes selected by different AFT methods in lung cancer dataset

RankRSLassoENSCAD
1SMAD4WWP1TRA2AWWP1
2ENPP2HUWE1WWP1TRA2A
3TRA2ATRA2ACCL21HUWE1
4LLGL1CCL21HUWE1CCL21
5WWP1ADMADMADM
6DYNLT3PBXIP1RPL36ALPHKG1
7DOC2ARPS29HLA-CHLA-C
8HUWE1TNNC2PEX7RPS29
9TEKDOC2AZNF148DOC2A
10PHKG1HLA-CINHAATRX
11PFN1HTR6RPS29ENPP2
12RPL23TFAP2CDOC2AZNF148
13ENPP2ZNF148SERINC3TFAP2C
14POLR2AHUMBINDCGNSTNNC2
15CFTRRPL36ALATRXRAD23B

5.Conclusion

For biological analysis of the results, 15 top-ranked genes selected by the different AFT methods in Lung cancer dataset were shown in Table 7. Compared with the other AFT models based on the least squares approaches with different regularization methods, the RS-AFT model selected some unique genes, such as SMAD4, ENPP2, LLGL1. SMAD4 belongs to the member of Smad family which is one kind of signal transduction proteins. The Smad family proteins play a key role in transmitting the TGF-beta signals from the cell-surface receptor to cell nucleus, mutation or deletion of SMAD4, which has been proved to lead to the pancreatic cancer [27]. We think it may be strongly associated with the lung cancer. ENPP2 is also known as ATX, this gene can stimulate the motility of tumour cells. The expression of ENPP2 has been found to be up regulated in some different kinds of cancers [28]. The protein encoded by the gene LLGL1 was said to be very similar to the tumour suppressor of drosophila which is a highly relevant gene to cancer [29]. What is more, some relevant genes selected by other AFT models with lasso, elastic net and SCAD, were also found by the RS-AFT, for example, TRA2A, WWP1, DOC2A and HUWE1. They are significantly associated to the lung cancer which has been discussed [30].

We also obtained the similar experimental results from the analysis of the other three real datasets. The biological analysis showed that the RS-AFT model not only can find the relevant genes which were selected by AFT models with other regularization methods, but also can find some unique genes, which were not selected by other AFT models but also significantly associated to disease. Hence, we can say the RS-AFT model may find the disease related genes accurately and efficiently.

The experiment results show that the RS-AFT model outperforms some existing survival estimation approaches. It can effectively select the bio-mark genes and estimate the patients’ survival time accurately in high dimensional and low sample size biological datasets. With the less mark genes and accurate survival time prediction, this method will be a more practical tool for cancer research and treatment.

In the data experiments we found that large number of the censored data great effect the accuracy of the RS-AFT model. The more censored data, the more difficulty we get in the experiments. Hence in the future work, we will try to combine the RS-AFT model with some machine learning methods, such as some semi supervised methods, we thought they may have strong ability to against with the censored data, the more completed data will improve the accuracy of our RS-AFT model obviously.

Acknowledgments

This work is supported by the Macau Science and Technology Development Funds (Grand No. 003/2016/AFJ) from the Macau Special Administrative Region of China.

Conflict of interest

None to report.

References

[1] 

Tibshirani R. The lasso method for variable selection in the Cox model. Stat Med (1997) ; 16: : 385-395.

[2] 

Gui J, Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size setting, with applications to microarray gene expression data. Bioinformatics (2005) ; 21: : 3001-3008.

[3] 

Liu C, et al. The L1/2 regularization method for variable selection in the Cox model. Appl Soft Comput (2014) ; 14: (c): 498-503.

[4] 

Buckley J, James I. Linear regression with censored data. Biometrika (1979) ; 66: : 429-436.

[5] 

Tsiatis A. Estimating regression parameters using linear rank tests for censored data. Ann Stat (1990) ; 18: : 354-372.

[6] 

Huang J, Ma S, Xie H. Regularized estimation in the accelerated failure time model with high dimensional covariates. Biometrics (2006) ; 62: : 813-820.

[7] 

Cai T, Huang J, Tian L. Regularized estimation for the accelerated failure time model. Biometrics (2009) ; 65: : 394-404.

[8] 

Jin Z, Lin DY, Wei LJ, Ying Z. Rank-based inference for the accelerated failure time model. Biometrika (2003) ; 90: : 341-353.

[9] 

Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc (1958) ; 53: : 457-481.

[10] 

Chang XY, Xu ZB, Zhang H, et al. Robust regularization theory based on Lq (0 < q < 1) regularization: The asymptotic distribution and variable selection consistence of solutions (in Chinese). Sci Sin Mat (2010) ; 40: (10): 985-998. doi: 10.1360/012010-77.

[11] 

Li W, Michael DG, Ji Z. Regularized least absolute deviations regression and an efficient algorithm for parameter tuning. In: Proceedings of the Six International Conference on Data Mining Washington, IEEE Computer Society (2006) ; 690-700.

[12] 

Wang H, Li G, Jiang G. Robust regression shrinkage and consistent variable selection through the lad-lasso. J Business Economic Statist (2007) ; 25: : 347-355.

[13] 

Xu JF, Ying ZL. Simultaneous estimation and variable selection in median regression using lasso-type penalty. Ann Inst Stat Math (2010) ; 62: : 487-514.

[14] 

Xu ZB, Zhang H, Wang Y, et al. L1/2 regularization. Sci China Ser F (2010) ; 53: : 1159-1169.

[15] 

Chartrand R, Staneva V. Restricted isometry properties and nonconvex compressive sensing. Inverse Problem (2008) ; 24: : 1-14.

[16] 

Rajaratnam B, Sparks D. Fast Bayesian lasso for high-dimensional regression. Statistics (2015) .

[17] 

Jing LI, Wang J, Hui LI, et al. Selection and classification of elastic net feature with fused electroencephalogram features. Journal of Biomedical Engineering (2016) .

[18] 

Miao L, Zhou J, Naylor C, et al. Application of penalized linear regression methods to the selection of environmental enteropathy biomarkers. Biomarker Research (2017) ; 5: (1): 9.

[19] 

Liu C, Liang Y, Luan XZ, et al. The L1/2, regularization method for variable selection in the Cox model. Applied Soft Computing (2014) ; 14: (1): 498-503.

[20] 

Datta S. Estimating the mean life time using right censored data. Stat Methodol (2005) ; 2: : 65-69.

[21] 

Hurvich CM, Tsai CL. Regression and time series model selection in small samples. Biometrika (1989) ; 76: : 297-307.

[22] 

Sohn I, Kim J, Jung SH, Park C. Gradient lasso for Cox proportional hazards model. Bioinformatics (2009) ; 25: (14): 1775-1781.

[23] 

Rosenwald A, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large B-cell lymphoma. N Engl J Med (2002) ; 346: : 1937-1946.

[24] 

Rosenwald A, et al. The proliferation gene expression signature is a quantitative integrator of oncogenic events that predicts survival in mantle cell lymphoma. Cancer Cell (2003) ; 3: : 185-197.

[25] 

Beer DG, et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat Med (2002) ; 8: : 816-824.

[26] 

Bullinger L, et al. Use of gene-expression profiling to identify prognostic subclasses in adult acute myeloid leukemia. N Engl J Med (2004) ; 350: : 1605-1616.

[27] 

Boone BA, et al. Loss of SMAD4 staining in pre-operative cell blocks is associated with distant metastases following pancreaticoduodenectomy with venous resection for pancreatic cancer. J Surg Oncol (2014) ; 110: (2): 171-5.

[28] 

Umezu-Goto M, et al. Autotaxin has lysophospholipase D activity leading to tumor cell growth and motility by lysophosphatidic acid production. J Cell Biol (2002) ; 158: (2): 227-33.

[29] 

Schimanski CC, et al. Reduced expression of Hugl-1, the human homologue of Drosophila tumour suppressor gene lgl, contributes to progression of colorectal cancer. Oncogene (2005) ; 24: (19): 3100-9.

[30] 

Chai H, et al. The L1/2 regularization approach for survival analysis in the accelerated failure time model. Comput Biol Med (2014) .