You are viewing a javascript disabled version of the site. Please enable Javascript for this site to function properly.
Go to headerGo to navigationGo to searchGo to contentsGo to footer
In content section. Select this link to jump to navigation

EBSRMF: Ensemble based similarity-regularized matrix factorization to predict anticancer drug responses

Abstract

Drug sensitivity prediction to a panel of cancer cell lines using computational approaches has been a challenge for two decades. With the emergence of high-throughput screening technologies, thousands of compounds and cancer cell lines panels with drug sensitivity data are publicly available at various pharmacogenomics databases. Analyzing these data is crucial to improve cancer treatment and develop new anticancer drugs. In this work, we propose EBSRMF: Ensemble Based Similarity-Regularized Matrix Factorization, which is a bagging based framework to improve the drug sensitivity prediction on the Cancer Cell Line Encyclopedia (CCLE) data. Based on the fact that similar drugs and cell lines exhibit similar drug response, we have investigated cell line and drug similarity matrices based on gene expression profiles and chemical structure respectively. The drug sensitivity value is used as outcome values which are the half maximal inhibitory concentrations (IC50). In order to improve the generalization ability of the proposed model, a homogeneous ensemble based bagging learning approach is also investigated where multiple SRMF models are used to train N subsets of the input data. The outcome of each training algorithm is aggregated using the averaging method to predict the outcome. Experiments are conducted on two benchmark datasets: CCLE and GDSC. The proposed model is compared with state-of-the-art models using multiple evaluation metrics including Root Means Square Error (RMSE) and Pearson Correlation Coefficient (PCC). The proposed model is quite promising and achieves better performance on CCLE dataset when compared with the existing approaches.

1Introduction

Cancer is a disease that spreads genetically and can be caused by the irregular growth of human cells. Around 200 types of cancers have been diagnosed so far that are impacting the global public health sector. The human genetic micro environment is complex, making it difficult to treat cancer. Similar cancer types can react differently to the same drug for different people which is due to the genetic and molecular variations among peoples. So these variations have caused prominent challenges in predicting drug responses for patients. Precision medicines consider the human genomics profile and prescribe drugs that could best work to control the cancer growth in humans [1]. The relationship between drugs and human genomics profile is revealed by performing large throughput screening and is available in the form of pharmacogenomics datasets [2]. These large datasets are now available publicly like Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE). Both datasets contain omics data like gene expressions, mutations, methylation, etc. The CCLE dataset has compiled around 1457 cell lines from different cancer types such lungs, kidneys, etc., and drug responses of 24 drugs across 479 cell lines [3]. The GDSC dataset has grouped around 652 cell lines and their drug responses against 135 drugs. The drug responses are in the form of IC50 values that are used to measure the sensitivity of drugs [4]. The availability of these large datasets helps in understanding the human profiles and plays a role in predicting drug responses, drug discoveries, and drug repositioning. There is a need to develop an evaluator that can understand the hidden relationship between drug responses and omics data and try to predict the best drug responses against cell lines.

In the past decade, machine learning algorithms have been widely used in many applications like robotics, affecting computing, facial biometric verification and even in medical diagnosis [5, 6]. Recently different machine learning techniques have been used for drug response prediction including random forest, support vector machine (SVM), and neural network [7–9]. Many approaches have been developed by assuming the fact that similar drugs which are common in chemical structure can possess similar responses on similar cell lines [10–12]. But these single models are good in learning the linear relationship among data and to some extent the non-linear relationship as well but do not perform well with high dimensional data and have poor generalization ability. The ensemble-based approach can deal with high dimensional nonlinear data [13]. The outcome of multiple models is aggregated together to predict the outcome. The ensemble-based models improved the predictability of the model [14]. Another machine learning approach that has been used recently is Matrix factorization (MF). MF is mostly used in recommendation systems where users provide ratings or voting to particular items. MF helps in finding missing values or predicting new values by mapping features in K latent space and trying to find the relationship between users and items. Similarity-based matrix factorization technique in drug response prediction problems, has shown remarkable progress in recent research work discussed in the literature. The idea is to find the similarities of users and items and then use them in predicting the outcome. Based on these facts we have used drug and cell lines similarities in our research for predicting drug responses.

Although many approaches have been proposed to predict the anti-cancer drug responses, it is still an uphill task to produce an evaluator that can predict drug responses accurately. The only solution is to use power of genomics data and drug’s properties that helps to develop a more reliable model that will be able to predict results with more accuracy.

Inspired by recent advances in ensemble learning and matrix factorization, we have proposed an ensemble based similarity-regularized matrix factorization (EBSRMF). In our proposed model, bagging based ensemble technique is investigated in which multiple SRMF [15] models are combined.In addition, similarity matrices and IC50 drug response values are also integrated. Each SRMF model is used to train one of the subsets of the input dataset. The size and shape of each subset are similar. The outcomes from multiple models are aggregated to predict the final outcome. The proposed model exploits the similarities and successfully interprets non-linear relationships, dimensionality reduction and drug response model. We have performed the 10-fold cross-validation on both CCLE and GDSC datasets. The model is then compared with other state-of-the-art models using Root Mean Squared Error (RMSE) and Pearson Correlation Coefficient (PCC) as performance measures. The model achieves 0.21 RMSE on CCLE and 0.69 RMSE on GDSC datasets. The results show that ensemble based matrix factorization approach has potential to produce better result in thisproblem context.

In summary, this research has the following key contributions:

  • - We move a step in the direction of improving the drug sensitivity prediction using ensemble based matrix factorization approach. The novelty of this research is to find the best gene-drug association. To the best of our knowledge, this ensemble based SRMF for the drug response prediction has never been proposed in the literature. The proposed approach has achieved significant performance when compared with the five state-of-the-models.

  • - We apply and test our EBSRMF approach on CCLE and GDSC dataset on the basis of RMSE and PCC scores. In comparison with state-of-the-art models, the lowest RMSE of EBSRMF on CCLE dataset and cumulative RMSE score on both CCLE GDSC datasets shows the feasibility of the proposed technique.

This paper is organized as follows. Section 2 reviews related work followed by methods and methodology in Section 3. Section 4 discusses experimental settings with results and discussion in Section 5. Section 6 concludes this paper.

2Related work

Most of the machine learning algorithms extracted key features from the dataset and used them in model training. Researchers have been using publicly available genomic profiles and drug responses and used them in proposing in-silico predictors. Jaing Sheng et al. [16] proposed a model using drug and cell line similarities and uses GDSC drug responses as training data. The model was then tested using a 10-fold cross-validation on CCLE dataset which is based on the assumption that drugs that are similar in structure possess similar responses on cell lines. [17] also worked with the same similarity aspect of cell lines and using a logistic matrix factorization approach. In this study, they treated drug response prediction as a classification problem. They achieved higher accuracy as compare to existing classificationmodel.

Another method proposed by Amanud din et al. [18] that incorporates the QSAR technique, in which they integrated the cell lines features, drug target information and drug features to predict drug responses. The authors applied a Kernelized Bayesian Matrix factorization (KBMF) model that used a pairwise similarity matrix (kernel) between all drugs and achieved the coefficient of determination R2 score of 0.32. This shows that the uses of drug and cell line’s similarities matrices could probably help in predicting better results. Zhang et al. [19] proposed a dual-layer integrated model (DLN) by using cell line and drug similarity networks. The similarity matrix was constructed using drug chemical structure and gene expression correlation. They also used CCLE and CGP datasets for model validation. The model uses PCC as a performance measure and got a score of 0.6. The model also predicts drug responses for missing values. Li et al. [20] proposed a deep learning model, in which they integrated gene expression features with compound chemical features to predict drug sensitivity. They used first deep auto-encoder to get the optimized gene expression features and then integrate reduced gene expression features with compound chemical features i.e. Morgan fingerprints features. The final matrix with drug responses was fed into a deep feedforward network to train the model on CCLE and GDSC datasets. Menden et al. [21] used multi-omics information along with 1D and 2D drug chemical compounds features to model the drug responses using three layers neural network (NN). The R2 and RMSE were used as predictive measures. The model was able to achieve R2 of 0.6 and RMSE of 0.97. In a very recent study, Wang et al. [15] uses similarity-regularized matrix factorization (SRMF) to predict the drug responses by using drug and cell lines similarities. The drug chemical structure was collected from Pubchem and then converted into 256 vector morgan fingerprints from which drug similarity was constructed. Similarly, the cell line similarity matrix was generated by finding the correlation among gene expressions. The model was able to predict better results than the KBMF and DLN. Another similar study was conducted by Aman et al. [2] in which they have also used drug and cell line similarities for prediction. They have converted the drug and cell lines response matrix into latent space with a reduced dimensionality to capture the non-linear relationship between the drug andcell line.

Apart from matrix factorization the ensemble-based model has also shown significant improvements in the predictions. The STREAM [22] is a ridge-regression based model for drug sensitivity prediction. It was a single-task learner on gene expression profile and computationally efficient. The authors evaluated performance of their STREAM model on SANGER [23] and CCLE datasets. Aman et al. [24] proposed a model using multi-task learning and stacking together four different base learners. The model was trained and tested using GDSC and CCLE datasets. Liu et al. [25] also used these two datasets and applied them to a model that ensemble together ridge regression (RR) and low-rank matrix completion (MC).The model was compared with others and showed high prediction accuracy.

3Material and methods

3.1Datasets and data preprocessing

In this study, a novel method EBSRMF is proposed to predict drug responses for cancer cell lines. The expression profiles of cell lines and drug sensitivity data were collected from two large publicly available datasets CCLE and GDSC, and drug chemical structures are downloaded from PubChem [26].

Table 1 shows the description about datasets.

3.1.1CCLE

The CCLE consists of gene expression arrays of around 1457 cell lines. The gene expression represented in all cell lines is around 20000. In our work, we have selected 363 cell lines with response data of around 24 drugs [3]. The drug responses are in IC50 values. The lower the IC50 value the higher the cell line is sensitive to that drug and vice versa. The IC50 values are converted into negative logarithms. The chemical structures of 24 drugs are downloaded from PubChem in the form of 2D structures standard delay format (SDF) files. The chemical structure is converted into 256 bits Morgan fingerprints using camb [27]. The drug similarities are then calculated by using the Jaccard package using R language. 363 cell lines where each cell line has 20000 genes expression features values that are used to produce cell line similarities by calculating the correlation among them.

3.1.2GDSC

The gene expression data of around 789 cell lines have been downloaded in the form of CEL files. The gene expression then normalizes using the oligo R package [28]. 2D structures of around 135 drugs are also downloaded from PubChem and converted to drug similarity matrix as mentioned above. The final GDSC drug response matrix contains 652 cell lines against 135 drugs.

Table 1

Datasets summary

GDSCCCLE
Response matrix (n × p)135×65224×363
Drug Similarity (d × d)135×13524×24
Cell Line similarity (c × c)652×652363×363

3.2Methodology

The proposed model is shown in Fig. 2 consists of some major components such as Matrix Factorization, Ensemble Learning. These components are described below.

Fig. 2

The input data for EBSRMF includes drug responses (with some unknown values) and drug and gene similarity based on the chemical structure of drugs and gene expressions. The input data splits into N subsets. Each subset then passed to the SRMF method that mapped the drugs and cell lines into shared latent space with low dimensionality Y and Z. Y and Z are used to reconstruct drug responses with new responses.The outcome of each SRMF is aggregated to predict the final drug responses.

The input data for EBSRMF includes drug responses (with some unknown values) and drug and gene similarity based on the chemical structure of drugs and gene expressions. The input data splits into N subsets. Each subset then passed to the SRMF method that mapped the drugs and cell lines into shared latent space with low dimensionality Y and Z. Y and Z are used to reconstruct drug responses with new responses.The outcome of each SRMF is aggregated to predict the final drug responses.

3.2.1Matrix factorization

The matrix factorization (MF) technique is mostly used in the recommendation system in which we have a response matrix with some missing values. Matrix factorization helps in predicting those missing values. It is a type of collaborative filtering algorithm. In the context of drug response prediction, the response matrix R is of size n × p where n represents the number of drugs and p is the number of cell lines. The matrix factorization split matrix R into two matrices let say Y and Z where Y is n × k matrix and Z is k × p matrix defining cell line –cell line and drug-drug relation respectively and K is lower dimensionality shared latent space. The splitting of the matrix should be done in such a way that if we again multiply the two matrices it should approximately equal to the actual matrix Y . ZT ≈ R. Each column and row in the resultant matrices show the strong bonding between cell lines and drugs.

3.2.2Ensemble learning

Ensemble learning is a technique in which multiple models are trained on a set of data and the outcomes of all these models are aggregated to predict the final improved outcome [29]. There are different ensemble methods including bagging, boosting, and stacking. In the Bagging technique, the training dataset is divided into multiple subsets with replacement, and each base algorithm is trained on one of these subsets. The outcome of these algorithms which are trained on a random sampling of data is aggregated by either voting or aggregating approaches. The main idea here is to improve the generalization ability of the system and produce more accurate and robust results. Figure 1 ensemble shows the ensemble bagging technique.

Fig. 1

Ensemble bagging technique.

Ensemble bagging technique.

3.2.3EBSRMF

EBSRMF: Ensemble Based Similarity-Regularized Matrix Factorization is our proposed approach and Fig. 2 proposedmodel shows the complete picture of our proposed model. The goal is to develop a framework that is based on matrix factorization and uses similarities between drugs and genomics profiles to predict the drug responses. Initially, gene expressions and drug responses data are taken from CCLE and GDSC datasets, whereas drug’s chemical structures are taken from PubChem dataset. These drug’s structures are then converted into Morgan fingerprints using camb [30] to make compound feature vectors of 256 bits length. After generating drug and cell lines feature vectors, then we transformed these vectors into drug and cell lines similarities matrix. The dataset is split into n subsets of equal sizes with replacement. We adopted the same SRMF [15] approach to perform the training on each subset to predict the drug responses. The outcomes from all subsets are aggregated to get the final outcome.

The final response matrix contains the mapping of m drugs and n cell lines into a shared K latent feature space. The properties of the drug and cell line are represented by Y and Z matrices with y and z latent coordinates respectively. The inner product of Y and Z are used to reconstruct the final response matrix that also contains the newly predicted drug responses. The overall aim is to approximate the already known drug mi response value for ni cell line in a latent space. This is goal is achieved through the following objective function:

(1)
miny,z||W.(DR-YZT)||F2

Here DR is the known drug response matrix containing some missing responses, Y and Z are estimated matrices containing drug mi and ni cell lines as row vectors respectively. These matrices are used to reconstruct the response matrix. The values of Y and Z are estimated by using the gradient descent technique. The process continues iteratively until we find the lowest error. The Equation 1 is the simplest technique where W represents a weighted matrix. The F is a frobenius norm regularization [31] parameter that is used to avoid the overfitting during the training phase.

Algorithm 4 shows the main steps of our proposed method EBSRMF. Moreover Table 2 symbols lists down all the common symbols used in this research work. The proposed approach is inspired by the already defined similarity-regularized matrix factorization [15] framework for drug prediction. This research work is based on the assumption that similar drugs and cell lines give similar drug responses. The drug response predictor is also capable of predicting missing values.

Table 2

Symbols used in research work

SymbolsMeaning
DRKnown drug responses
KLatent space dimension
CSCell line - cell line similarity matrix
DSDrug - drug similarity matrix
λDsRegularization parameter for drug features
λCsRegularization parameter for cell lines features
νLearning rate
di, djDrugs morgan fingerprints
ciCell line

To avoid the overfitting during the training process, the latent matrices Y and Z are also regularized using Equation 2 which is defined as:

(2)
miny,z||W.(DR-YZT)||F2+λl(||Y||F2+||Z||F2)

The main idea here is to exploit the relationship between drugs and cell lines by calculating their similarities and use this information in predicting the responses by reducing the differences in similarity matrices. Here the similarity differences are also used as regularizing terms as shown in Equation 3 sim.

(3)
miny,z||W.(DR-YZT)||F2+λl(||(Y||F2+||Z||F2)+λDs||DS-YYT||F2+λCs||CS-ZZT||F2

Where λDs and λCs are regularization parameters for drug and cell lines similarities. Equations 1, 2 and 3 are similar to objective function mentioned in [15].

4Experimental settings

The model is executed on the dell machine with 16GB RAM, 1 TB hard disk, and windows as the operating system. Hyperparameter K is set to 14 and 47 for the CCLE and GDSC datasets respectively and tuned from the training data. The drug responses are normalized and converted in the range of [-1 to 1] by taking the maximum absolute value and dividing each record with that value. This has been done to make the data consistent with the similarity matrices. The λl, λDs, andλCs are selected from the range of 2-5 to 0. The σ, τ, and weight parameters are selected from range 0 to 1 with gradually incrementing with 0.001 as learning rate. For bagging, bootstrap sample size of N = 4 is being used.

4.1Compared methods

To measure the accuracy and robustness of the proposed model, we have compared it with other state-of-the-art models including DeepDSC, KBMF, SRMF, DLN and RF are briefly summarized below. All of these models used cell line features and drug chemical structures. DeepDSC A Deep Learning Method to Predict Drug Sensitivity of Cancer Cell Lines [20]. KBMF Kernelized Bayesian Matrix factorization [18]. SRMF Similarity-regularized matrix factorization [15] DLN Dual-Layer Integrated Cell Line-Drug Network Mode [19] RF Random Forest Based Drug Sensitivity Prediction [7]

4.2Evaluation measures

In this research, we have used two evaluation measures to evaluate the performance of our proposed model including Root Mean Square Error (RMSE) as defined in Equation 4 and Pearson Correlation Coefficient (PCC) as defined in Equation 6. RMSE measure is used to calculate the difference between the actual and the predicted values. Whereas PCC is used to find the correlation between the drugs and cell lines.

(4)
RMSE(Dr)=C(R(Dr,Cl)-R(Dr,Cl))2n
where n is the number of cell lines that contain the known drug responses and R (Dr, Cl) and Rsquo (Dr, Cl) represents the actual and predicted values for drug (Dr) and cell lines (Cl) respectively. Also averaged PCC, averaged RMSE, and averaged MSE are also calculated for all drugs. The sensitive and resistant information of the cell line against each drug is also considered to understand drug behavior. PCC and RMSE are also calculated separately for sensitive and resistant cell lines [32].

The average PCC and RMSE are also calculated for these cell lines against each drug. All the performance measurements are calculated on each subset; the outcome was obtained by aggregating the prediction of each subset.

We have performed 10-fold cross-validation on CCLE and GDSC datasets to obtain a predictive measurement. The data is randomly split into 10 folds iteratively with one fold is used for validation and the remaining is used in training.

Table 6
Algorithm 1 EBSRMF
  Input: Drug similarity matrix (DS), Cell-line similarity matrix
  (CS), Drug response matrix (DR), Latent space dimension (K),
  regularization parameter, dataset spilt count (N), λl, λds, λcs
  Output: Predict response matrix
  Algorithm Steps:
  - Calculate drug similarity matrix (DS) based on drugs
      structure and by using jaccard similarity coefficient
           
DS(di,dj)=didjdidj(5)
      where di and dj are the morgan fingerprints of drugs.
  - Calculate cell lines similarity matrix (CS) based on
      genes of cell lines by using PCC.
           
CS(Ci,Cj)=g=1n(Ci,g-Ci)(Cj,g-Cj)g=1n(Ci,g-Ci)2g=1n(Cj,g-Cj)2
    (6)
      where Ci,g denotes the expression of gene g in cell line
      Ci and Csquoi represents the mean expressions of all genes in
      cell line Ci.
  - Split the data with replacement into N bootstrap samples
      such that each set contains the same number of records.
      So, assuming that we have N bootstrap samples of size
      B denoted as
                  
{z11,z21,....,zB1},{z12,z22,....,zB2},.....,{z1N,z2N,....,zBN}
        (7)
      where z is the given drug response matrix.
  - Each subset then fixed to the weak learner (SRMF) that
      mapped the drugs and cell lines into shared latent space.
      we can fit L almost independent weak learners (one on
      each dataset)
                              
w1(.),w2(.),....,wL(.)
                                      (8)
  - The outcome of each weak learner (SRMF) is then
      aggregated to predict the final drug responses.
             
sL(.)=1Li=1Lwi(.)(9)

5Result and discussion

In this section, we will compare our proposed model with the state-of-the-art methods. We will first compare our proposed bagging based ensemble model with individual model (SRMF) i.e. without bagging.

5.1Comparison between EBSRMF and SRMF

The performance of the proposed ensemble based model is first evaluated by comparing with individual model i.e. single SRMF model proposed by Wang et al [15] where all the samples are used in the training data. Table 3 shows the comparison between the individual and ensemble model. On the CCLE dataset, the proposed model achieves an average RMSE of 0.2185 whereas the individual model achieves 0.2459 and thus overall improvement of 2.74% by using the proposed model. The RMSE for each subsets of the proposed model are 0.21, 0.2137, 0.2208, and 0.2216 respectively. It is interesting to observe that in some models, bagging has better performance than overall ensemble based model but it would be difficult to know which models in bagging will guarantee how well it is generalized on unseen data.

Table 3

The performance comparison of proposed model and individual matrix factorization model

RMSE ProposedRMSE IndividualPCC ProposedPCC IndividualR2ProposedR2 Individual
CCLE0.21850.24590.860.810.730.64
GDSC0.69360.91440.920.850.810.688

Our proposed ensemble based approach has solved this problem and gives us more confidence in generalization error as it is a combination of various small models. The model is also validated on the PCC. The average PCC of the individual model has got 0.81, whereas it has been 0.86 for the ensemble based proposed model and thus overall improvement of 5% . It should be noted that the higher the value of PCC, the better the model is.

A similar test run has been conducted on the GDSC dataset (Table 3). RMSE score for the ensemble based proposed model is 0.6963 whereas it is 0.91 on the individual model and thus overall improvement of 22% . Similarly, the average PCC for ensemble and individual are 0.92 and 0.85 respectively.

5.2Comparison with the state-of-the-art models

The comparison with the other state-of-the-art models such as DeepDSC, SRMF, KBMF, DLN and RF has shown in Tables 4 and 5. We have RMSE and PCC as performance metrics. The outcome of all the models considered in the same way as mentioned in their research work. The proposed model shows better performance by achieving a low RMSE of 0.2185 for the CCLE data set. Hence EBSRMF predicts better drug sensitivity than other models.

Table 4

Comparison with other models on GDSC datasets

RMSEPCC
DeepDSC [20]0.52
KBMF [18]1.590.49
SRMF [15]1.430.62
DLN [19]2.080.44
RF [7]1.690.40
Our EBSRMF0.690.92
Table 5

Comparison with other models on CCLE datasets

RMSEPCC
DeepDSC [20]0.23
KBMF [18]0.710.64
SRMF [15]0.570.71
DLN [19]0.860.64
RF [7]0.610.62
Our EBSRMF0.210.86

6Conclusion and future work

We have proposed an ensemble-based matrix factorization model which predicts the drug responses of anti-cancer. The idea is to map the data into the low dimensional space to extract a non-linear relationship between the drug and cancer cell lines [33]. We have introduced drug-drug and cell line-cell line similarity matrices and use them based on Pearson correlation objective function to predict responses. The similarity matrices help in constructing intermediate matrices which are being used to predict the outcome. The two similarity matrices also help in reducing the error loss during training. In addition, ensemble learning is also investigated to reduce the overfitting by bringing diversity in the various models of ensemble. The model has achieved low error on CCLE dataset as compared to other state-of-the-art algorithms which indicate that the proposed model has good ability to predict the missing and new drug responses.

Future work aims to train the proposed model on the large collection of genomic profiles which includes mutations, pathways, copy number, and drug-target interactions. The genomic profile can help in constructing a cell line similarity matrix with a high correlation which can lead to better prediction.

Acknowledgment

This work was supported in part by the Engineering and Physical Sciences Research Council (EPSRC) Grant EP/P009727/1 and in part by the Leverhulme Trust Grant RF-2019-492. This work is also partially supported by HEC NRPU Grant 10225.

References

[1] 

Guanghua Xiao , Shuangge Ma , John Minna and Yang Xie , Adaptive prediction model in prospective molecular signature–based clinical studies, Clinical Cancer Research 20: (3) ((2014) ), 531–539.

[2] 

Aman Sharma and Rinkle Rani , Ksrmf: Kernelized similarity based regularized matrix factorization framework for predicting anti-cancer drug responses, Journal of Intelligent & Fuzzy Systems 35: (2) ((2018) ), 1779–1790.

[3] 

Jordi Barretina , Giordano Caponigro , Nicolas Stransky , Kavitha Venkatesan , Adam Margolin A , Sungjoon Kim , Christopher Wilson J. , Joseph Leha´r , Gregory Kryukov V. and Dmitriy Sonkin , et al. The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity, Nature 483: (7391) ((2012) ), 603–607.

[4] 

Wanjuan Yang , Jorge Soares , Patricia Greninger , Elena Edelman J. , Howard Lightfoot , Simon Forbes , Nidhi Bindal , Dave Beare , James Smith A. and Richard Thompson I. , et al. Genomics of drug sensitivity in cancer (gdsc): a resource for therapeutic biomarker discovery in cancer cells, Nucleic Acids Research 41: (D1) ((2012) ), D955–D961.

[5] 

Richard Jiang , Anthony Ho T.S. , Ismahane Cheheb , Noor Al-Maadeed , Somaya Al-Maadeed and Ahmed Bouridane , Emotion recognition from scrambled facial images viamany graph embedding, Pattern Recognition 67: ((2017) ), 245–251.

[6] 

Richard Jiang , Ahmed Bouridane , Danny Crookes , Emre Celebi M. and Hua-Liang Wei , Privacyprotected facial biometric verification using fuzzy forest learning, IEEE Transactions on Fuzzy Systems 24: (4) ((2015) ), 779–790.

[7] 

Isidro Cortés-Ciriano , Gerard van Westen J.P. , Guillaume Bouvier , Michael Nilges , John Overington P. , Andreas Bender and Thérèse Malliavin E. , Improved large-scale prediction of growth inhibition patterns using the nci60 cancer cell line panel, Bioinformatics32: (1) ((2016) ), 85–95.

[8] 

Turki Turki and Zhi Wei , A link prediction approach to cancer drugsensitivity prediction, BMC Systems Biology 11: (5) ((2017) ), 1–14.

[9] 

Cai Huang , Roman Mezencev , John McDonald F. and Fredrik Vannberg , Open source machine-learning algorithms for the prediction ofoptimal cancer drug therapies, PLoS One 12: (10) ((2017) ), e0186906.

[10] 

Artem Cherkasov , Eugene N. Muratov , Denis Fourches , AlexandreVarnek , Igor I. Baskin , Mark Cronin , John Dearden , Paola Gramatica , Yvonne C. Martin , Roberto Todeschini , et al. Qsar modeling: wherehave you been? where are you going to? Journal of MedicinalChemistry 57: (12) ((2014) ), 4977–5010.

[12] 

Andreas Bender and Robert Glen C. , Molecular similarity: a keytechnique in molecular, informatics, & Biomolecular Chemistry 22: (12) ((2004) ), 3204–3218.

[13] 

Aman Sharma and Rinkle Rani , Ensembled machine learning frameworkfor drug sensitivity prediction, IET Systems Biology 14: (1) ((2020) ), 39–46.

[14] 

Ko A.H.-R. Robert Sabourin de Souza Britto A. Combining diversity and classification accuracy for ensemble selection in random subspaces, In The 2006 IEEE International Joint Conference on Neural Network Proceedings, pages 2144–2151. IEEE, 2006.

[15] 

Lin Wang , Xiaozhong Li , Louxin Zhang and Qiang Gao , Improvedanticancer drug response prediction in cell lines using matrixfactorization with similarity regularization, BMC cancer 17: (1) ((2017) ), 1–12.

[16] 

Jianting Sheng , Fuhai Li and Stephen Wong T.C. , Optimal drugprediction from personal genomics profiles, , IEEE Journal ofBiomedical and Health Informatics 19: (4) ((2015) ), 1264–1270.

[17] 

Akram Emdadi and Changiz Eslahchi , Dsplmf: a method for cancer drugsensitivity prediction using a novel regularization approach inlogistic matrix factorization, Frontiers in Genetics 11: ((2020) ), 75.

[18] 

Muhammad Ammad-Ud-Din , Elisabeth Georgii , Mehmet Gonen , Tuomo Laitinen , Olli Kallioniemi , Krister Wennerberg , Antti Poso and Samuel Kaski , Integrative and personalized qsar analysis in cancerby kernelized bayesian matrix factorization, Journal ofchemical information and modeling 54: (8) ((2014) ), 2347–2359.

[19] 

Naiqian Zhang , Haiyun Wang , Yun Fang , Jun Wang , Xiaoqi Zheng and Shirley Liu X. , Predicting anticancer drug responses using a dual-layerintegrated cell line-drug network model, PLoS Computational Biology 11: (9) ((2015) ), e1004498.

[20] 

Min Li , YakeWang , Ruiqing Zheng , Xinghua Shi , Yaohang Li , Fang-XiangWu and Jianxin Wang , Deepdsc: a deep learning method to predict drugsensitivity of cancer cell lines, IEEE/ACM transactions oncomputational biology and Bioinformatics 18: (2) ((2019) ), 575–582.

[21] 

Michael P. Menden , Francesco Iorio , Mathew Garnett , Ultan McDermott , Cyril H. Benes , Pedro J. Ballester and Julio Saez-Rodriguez , Machinelearning prediction of cancer cell sensitivity to drugs based ongenomic and chemical properties, PLoS One 8: (4) ((2013) ), e61318.

[22] 

Elias Chaibub Neto , In Sock Jang , Stephen Friend H. and Adam A Margolin , The stream algorithm: computationally efficient ridge-regression via bayesian model averaging, and applications to pharmacogenomic prediction of cancer cell line sensitivity, In Biocomputing 2014, pages 27–38.World Scientific, 2014.

[23] 

Mathew Garnett J. , Elena Edelman J. , Sonja Heidorn J. , Chris Greenman D. , Anahita Dastur , King Wai Lau , Patricia Greninger , Richard Thompson I. , Xi Luo , Jorge Soares , et al., Systematicidentification of genomic markers of drug sensitivity in cancercells, , Nature 483: ((2012) ), 570–575.

[24] 

Aman Sharma and Rinkle Rani , Drug sensitivity prediction frameworkusing ensemble and multi-task learning, , International Journalof Machine Learning and Cybernetics 11: (6) ((2020) ), 1231–1240.

[25] 

Chuanying Liu , Dong Wei , Ju Xiang , Fuquan Ren , Li Huang , Jidong Lang , Geng Tian , Yushuang Li and Jialiang Yang , An improvedanticancer drug-response prediction based on an ensemble methodintegrating matrix completion and ridge regression, , MolecularTherapy-Nucleic Acids 21: ((2020) ), 676–686.

[26] 

Evan Bolton E. , Yanli Wang , Paul Thiessen A. and Stephen Bryant H. , Pubchem: integrated platform of small molecules and biological activities, In Annual reports in computational chemistry, volume 4, pages 217–241. Elsevier, (2008) .

[27] 

Daniel Murrell S. , Isidro Cortes-Ciriano , Gerard van Westen J.P. , Ian Stott P. , Andreas Bender , Thérèse Malliavin E. and Robert Glen C. , Chemically aware model builder(camb): an r package for property and bioactivity modelling of smallmolecules, Journal of Cheminformatics 7: (1) ((2015) ), 1–10.

[28] 

Benilton Carvalho S. and Rafael Irizarry A. , A framework foroligonucleotide microarray preprocessing, Bioinformatics 26: (19) ((2010) ), 2363–2367.

[29] 

Josef Kittler , Mohamad Hatef , Robert Duin P.W. and Jiri Matas , Oncombining classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence 20: (3) ((1998) ), 226–239.

[30] 

Catherine Brooksbank , Mary Todd Bergman , Rolf Apweiler , Ewan Birney and Janet Thornton , The european bioinformatics institute’s dataresources , Nucleic Acids Research 42: (D1) ((2014) ), D18–D25.

[31] 

Ana Luísa Custódio , Humberto Rocha and Luís NunesVicente , Incorporating minimum frobenius norm models in directsearch, Computational Optimization and Applications 46: (2) ((2010) ), 265–278.

[32] 

Richard Marcotte , Azin Sayad , Kevin Brown R. , Felix Sanchez-Garcia , Jüri Reimand , Maliha Haider , Carl Virtanen , James Bradner E. , Gary Bader D. , Gordon Mills B. , et al., Functional genomic landscapeof human breast cancer drivers, vulnerabilities, and resistance, Cell 164: (1-2) ((2016) ), 293–309.

[33] 

Mehmet Gönen and Ethem Alpaydin , Multiple kernel learningalgorithms, The Journal of Machine Learning Research 12: ((2011) ), 2211–2268.