Predicting drug-target interactions using matrix factorization with self-paced learning and dual similarity information

Abstract

BACKGROUND:

Drug repositioning (DR) refers to a method used to find new targets for existing drugs. This method can effectively reduce the development cost of drugs, save time on drug development, and reduce the risks of drug design. The traditional experimental methods related to DR are time-consuming, expensive, and have a high failure rate. Several computational methods have been developed with the increase in data volume and computing power. In the last decade, matrix factorization (MF) methods have been widely used in DR issues. However, these methods still have some challenges. (1) The model easily falls into a bad local optimal solution due to the high noise and high missing rate in the data. (2) Single similarity information makes the learning power of the model insufficient in terms of identifying the potential associations accurately.

OBJECTIVE:

We proposed self-paced learning with dual similarity information and MF (SPLDMF), which introduced the self-paced learning method and more information related to drugs and targets into the model to improve prediction performance.

METHODS:

First, incorporating self-paced learning effectively reduces the model's tendency to fall into a bad local optimum caused by high noise and a high rate of missing data. Second, we incorporated more similarity data into the model to improve its capacity for learning.

RESULTS:

Our model achieved the best results on each dataset tested. For example, the area under the receiver operating characteristic curve and the precision-recall curve of SPLDMF was 0.982 and 0.815, respectively, outperforming the state-of-the-art methods.

CONCLUSION:

The experimental results on four benchmark datasets and two extended datasets demonstrated the effectiveness of our approach in predicting drug-target interactions.

1.Introduction

Predicting drug-target interaction (DTI) is a crucial phase in drug discovery (DD) [1] and drug repositioning (DR) [2] for discovering novel targets of existing drugs [3, 4, 5]. The traditional methods for new DD are time-consuming and have a high failure rate; therefore, traditional new drug development is not a good choice [3, 6]. Various computer prediction methods have been proposed in recent years to improve the efficiency of new drug research and discovery, thus increasing the development efficiency and reducing expenditure to a certain extent. According to previous works [7, 8, 9], the current methods are mainly categorized into three groups [10, 11, 12, 13, 14, 15, 16, 17]: (1) molecular docking (MD) methods, (2) ligand-based methods, and (3) chemical genomics methods.

The MD methods involve simulation experiments based on the 3D structures of the drug and protein [11, 18]. However, simulating the 3D structures of massive numbers of ligands and targets, and running the corresponding MD-based calculations, requires a lot of time and computing equipment [19, 20]. The ligand-based methods assume that drugs with similar structures have similar functional properties and may therefore share targets. They predict the drug target using ligand similarity. However, this approach cannot make predictions for targets without known ligands. In addition, errors in chemical structure and physiological effects beyond structural relationships (e.g., the metabolites may be active molecules) may limit its use in drug repurposing. The chemical genomics method facilitates rapid and large-scale DTI predictions to generate drug candidates and targets, making it the most efficient method in drug research [21, 22]. Adopting this method for DTI prediction has become a prominent research issue with the continuous increase in drug-related data and the launch of a large number of databases, such as DrugBank [23], KEGG [24], PubChem [25], BRENDA [26], and SuperTarget [27].

Recently, chemical genomics-based computational approaches for DTI prediction have advanced rapidly. They are mainly categorized into three groups: classification-based methods, network diffusion (network propagation), and matrix factorization (MF). The classification-based methods treat a DR prediction task as a binary classification task that decides whether an association exists between a drug and a target. These predictions have not yet been confirmed by wet-lab experiments. In 2008, Yamanishi et al. [28] established a bipartite network technique that combines chemical and genomic spaces to predict DTIs for four target classes: G protein-coupled receptors (GPCRs), nuclear receptors (NR), ion channels (IC), and enzymes (E). Yamanishi’s dataset [28] is regarded as the gold standard by many researchers; several newly developed algorithms based on it have displayed better performance. Based on this benchmark dataset, Bleakley et al. [29] suggested a novel supervised inference method for predicting unknown DTIs, namely, a kernel-based support vector machine (KN-SVM) model.

In recent years, MF methods have been widely used in DR prediction; they approximate the interaction matrix by the product of two low-rank matrices. Liu et al. [30] proposed a neighborhood regularized logistic MF model. Hao et al. [31] designed a logistic MF based on a dual network (DNILMF) approach to predict DTIs. Yang et al. [32] applied the nonlinear MF technique and the negative sampling technique to DR prediction. SPLCMF, a collaborative MF method combined with self-paced learning (SPL), is an efficient DTI prediction method proposed by Xia et al. [33]. Yang et al. [34] developed a multi-similarity bilinear MF method for DR prediction. Ding et al. [35] developed a multiple kernel-based triple collaborative MF method to predict DTIs. Wang et al. [36] used a neighborhood regularized logistic MF method based on features extracted with a neural tangent kernel to predict DTIs. These previous studies showed the feasibility of MF in DR prediction tasks, but two challenges remain. (1) The model easily falls into a bad local optimal solution due to the high noise and high missing rate in the data. (2) Single similarity information makes the learning power of the model insufficient in terms of identifying the potential associations accurately.

To cope with the aforementioned challenges, we propose a model named Self-Paced Learning with Dual similarity information and Matrix Factorization (SPLDMF), which integrates the self-paced learning method into MF. Furthermore, more similarity information related to drugs and targets is integrated into the model to improve the prediction performance. First, many previous works demonstrate that SPL helps relieve the problem of bad local optima, especially when data are sparse [37, 38]. Inspired by the human learning process, the core idea of SPL is to automatically include more samples, from simple to complex, for training in a purely self-paced manner. Thus, we improve MF with the SPL mechanism to adapt it to data with high noise and a high missing rate. Second, SPLDMF incorporates more data into the model to improve its capacity for learning, so that potential relationships can be predicted more accurately. Experimental results on four benchmark datasets and two extended datasets demonstrate the effectiveness of our approach in predicting drug-target interactions. Our model obtains the best results on each dataset we tested; for example, the AUC and AUPR of SPLDMF reach 0.982 and 0.815, respectively, outperforming state-of-the-art models among similar methods to our knowledge.

2.Materials

The Yamanishi [28], Kuang [39], and Hao [31] datasets are three critical resources used for validating the proposed DTI-related algorithm. The Yamanishi dataset is regarded as a benchmark database; it contains drug-target relationships from databases such as KEGG BRITE [40], BRENDA [41], SuperTarget [27], and DrugBank [23], target protein sequences from the KEGG Gene Database [40], and drug compounds from the KEGG Drug and Compound Database [40]. Moreover, the Yamanishi database is divided into four datasets: NR, GPCR, IC, and E. It contains 445 drugs and 664 targets in E, 210 drugs and 204 targets in IC, 223 drugs and 95 targets in GPCR, and 54 drugs and 26 targets in NR. The details of the datasets are given in Table 1. The Kuang dataset has 3681 known interaction pairs [39] among 786 drugs and 809 targets (Table 1). The Hao dataset comprises 829 drugs, 733 targets, and 3688 identified interaction pairs [31] (Table 1).

Table 1

Summary of four benchmark and two expanded datasets

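| Dataset | No. of drugs | No. of targets | No. of interactions | Sparsity |
|---|---|---|---|---|
| E | 445 | 664 | 2,926 | 0.010 |
| IC | 210 | 204 | 1,476 | 0.034 |
| GPCR | 223 | 95 | 635 | 0.030 |
| NR | 54 | 26 | 90 | 0.064 |
| Kuang | 786 | 809 | 3,681 | 0.006 |
| Hao | 829 | 733 | 3,688 | 0.006 |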
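
The sparsity column of Table 1 appears to be the fraction of all drug-target pairs that have a known interaction. The short sketch below recomputes it under that assumption (the definition is inferred from the numbers, not stated in the text); the function name `sparsity` is only illustrative.

```python
# Dataset sizes copied from Table 1: (drugs, targets, known interactions).
datasets = {
    "E": (445, 664, 2926),
    "IC": (210, 204, 1476),
    "GPCR": (223, 95, 635),
    "NR": (54, 26, 90),
    "Kuang": (786, 809, 3681),
    "Hao": (829, 733, 3688),
}


def sparsity(n_drugs, n_targets, n_interactions):
    """Fraction of drug-target pairs with a known interaction (assumed definition)."""
    return n_interactions / (n_drugs * n_targets)


# Round to three decimals, as in Table 1.
sparsity_table = {name: round(sparsity(*dims), 3) for name, dims in datasets.items()}
for name, value in sparsity_table.items():
    print(f"{name}: {value:.3f}")
```

Under this reading, fewer than 7% of the pairs are labeled in every dataset, which is the high missing rate the model is designed to cope with.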

For targeted analysis and prediction, we ensured that each drug contained at least one FDA-approved ATC code in the dataset.

3.Methods

This study introduced a novel DTI prediction model, self-paced learning with dual similarity information and MF method (SPLDMF), to predict unknown DTIs.

3.1Task description

Five matrices St, Sd, Pt, Pd, and Y represented target similarity, drug similarity, target topological feature similarity, drug topological feature similarity, and known DTIs, respectively. The task was to explore how to use known information to predict unknown DTIs. Four scenarios based on DTI were created to display the performance of the model more comprehensively (Fig. 1). To describe these four scenarios, we used five drugs (i.e., D1 to D5) and four targets (i.e., T1 to T4) as an example. The D1T1 interaction pair on the orange background can represent four scenarios depending on the conditions: (1) known drug-known target (scenario 1 in Fig. 1a); (2) known drug-new target (scenario 2 in Fig. 1b); (3) new drug-known target (scenario 3 in Fig. 1c); and (4) new drug-new target (scenario 4 in Fig. 1d).

Figure 1.

Four scenarios of DTI predictions. The pair with orange background represents (a) known drug-known target; (b) known drug-new target; (c) new drug-known target; and (d) new drug-new target.

Figure 2.

Process of our proposed model.

In this protocol, a “known drug” is an experimental drug that has at least one interaction with the targets (e.g., D1 in Fig. 1a and b). Similarly, a “known target” is an experimental target that has at least one interaction with drugs. In contrast, a “new drug” is an experimental drug that has no known interactions with the targets (e.g., D1 in Fig. 1c and d). Similarly, a “new target” is an experimental target that has no existing interaction with drugs. The focus of this study is to use the SPLDMF method to improve the DTI prediction ability of the model. Specifically, the algorithm assigns scores to drug-target pairs to estimate the likelihood of their interaction: the higher the score, the more likely the drug and target are to interact.

Suppose the set of $N_d$ known drugs is denoted by $D = \{d_1, d_2, \ldots, d_{N_d}\}$, and the set of $N_t$ known targets by $T = \{t_1, t_2, \ldots, t_{N_t}\}$. Let $\{S_d, P_d\}$ represent the similarity matrices related to drugs; the dimension of $S_d$ and $P_d$ is $N_d \times N_d$. Similarly, let $\{S_t, P_t\}$ be the similarity matrices involving targets; the dimension of $S_t$ and $P_t$ is $N_t \times N_t$. Let $Y$ be an $N_d \times N_t$ adjacency matrix that encodes the known DTIs. When $Y_{ij} = 1$, the drug $d_i$ interacts with the target $t_j$; when $Y_{ij} = 0$, no interaction between the drug $d_i$ and the target $t_j$ has been observed. Our goal was to reconstruct $F$, an $N_d \times N_t$ score matrix. A higher score $F_{ij}$ means that the drug $d_i$ is more likely to interact with the target $t_j$.

3.2Network topology feature calculation

In this study, the attributive and topological properties of the drug and the target were used. The drug and target attributive features referred to the drug structure and the amino acid sequence of the target protein, respectively. Yamanishi et al. [28] also collected a dataset including the attributive feature similarity of the drug and the target. The structural data of all network nodes were referred to as topological features. To obtain the drug-drug and target-target topological feature similarities, the topological features of drugs and targets were first extracted from the DTI network using the Node2vec method, and the similarities were then measured using cosine similarity [43].

The DTI matrix $Y \in \mathbb{R}^{N_d \times N_t}$ was obtained from the dataset. Then, an unweighted and undirected network graph $G = (V, E)$ was constructed based on the DTI matrix $Y$, where $V$ denotes the set of nodes and $|V| = N_d + N_t$ is the number of nodes, and $E$ denotes the set of edges and $|E| = \sum_{i}^{N_d} \sum_{j}^{N_t} Y_{ij}$ is the number of edges. When $Y(i,j) = 1$, an edge exists such that $V_i$ and $V_j$ are connected; when $Y(i,j) = 0$, no edge exists, and $V_i$ and $V_j$ are not connected. Then, a second-order random walk was performed on the network graph $G$ using the Node2vec method to obtain the $d$-dimensional topological features of drugs and targets. Next, we calculated the drug-drug and target-target topological feature similarities using cosine similarity. The cosine similarity between drugs represented the similarity of two drug vectors in the topological feature space. Likewise, the cosine similarity of target-target topological features was computed as the similarity of two target vectors in the topological feature space. The topological feature vectors of two drugs $d_i$ and $d_j$ are denoted as $x_i$ and $x_j$, both of which are $d$-dimensional. Finally, the drug-drug topological feature similarity was measured with the help of cosine similarity using the sampling vertex sequence:

(1)
$$Sim_{d}^{tp} = \frac{x_i x_j^{\top}}{\|x_i\| \, \|x_j\|}$$

For ease of description, the drug-drug topological feature similarity matrix can be represented as $P_d \in \mathbb{R}^{N_d \times N_d}$, where $P_d(i,j)$ denotes the topological feature similarity between the $i$-th and the $j$-th drugs. Correspondingly, the target-target topological feature similarity matrix is represented by $P_t \in \mathbb{R}^{N_t \times N_t}$, where $P_t(i,j)$ denotes the topological feature similarity between the $i$-th and the $j$-th targets.
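
As a rough illustration of this section, the sketch below builds the bipartite graph from a toy interaction matrix and computes a cosine-similarity matrix as in Eq. (1). The toy matrix is invented for the example, and the random embeddings are only placeholders for Node2vec output, which is not reproduced here; the helper name `cosine_similarity_matrix` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy interaction matrix Y: 3 drugs x 2 targets (illustrative values only).
Y = np.array([[1, 0],
              [1, 1],
              [0, 1]])
n_drugs, n_targets = Y.shape

# Bipartite graph G = (V, E): drugs are nodes 0..Nd-1, targets are Nd..Nd+Nt-1.
n_nodes = n_drugs + n_targets      # |V| = Nd + Nt
n_edges = int(Y.sum())             # |E| = sum_ij Y_ij
adjacency = np.zeros((n_nodes, n_nodes), dtype=int)
adjacency[:n_drugs, n_drugs:] = Y      # drug -> target edges
adjacency[n_drugs:, :n_drugs] = Y.T    # same edges, undirected graph

# Placeholder 4-dimensional embeddings standing in for Node2vec features.
drug_embeddings = rng.normal(size=(n_drugs, 4))


def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X, as in Eq. (1)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms
    return unit @ unit.T


P_d = cosine_similarity_matrix(drug_embeddings)   # Nd x Nd similarity matrix
```

In practice the rows of `drug_embeddings` would come from random walks on `adjacency`; the target-side matrix follows by applying the same function to the target embeddings.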

3.3SPLDMF

The goal of MF was to factorize the known DTI matrix $Y$ into two low-rank matrices $A$ and $B$ of dimensions $N_d \times r$ and $N_t \times r$, respectively, where $r$ denotes the dimensionality of the feature space, $A$ denotes the potential feature representation of the drugs, and $B$ denotes the potential feature representation of the targets. The product $AB^{\top}$ then approximates the DTI scores, and $Y$ is represented as:

(2)
$$Y \approx AB^{\top}$$

To calculate $A$ and $B$, the squared error of Eq. (2) was minimized:

(3)
$$\arg\min_{A,B} \|Y - AB^{\top}\|_F^2$$

where $\|\cdot\|_F$ is the Frobenius norm.

Solving Eq. (3) directly might lead to overfitting during training. Therefore, an L2 regularization term was added, and Eq. (3) was rewritten as:

(4)
$$\arg\min_{A,B} \|W \odot (Y - AB^{\top})\|_F^2 + \lambda_l \left( \|A\|_F^2 + \|B\|_F^2 \right)$$

where $\lambda_l$ represents the regularization parameter, $W$ is the weight matrix, and $\odot$ denotes the element-wise product.

Based on the idea that drugs with a higher degree of similarity tend to act on a similar set of targets, and vice versa, we integrated drug-related similarity matrices Sd and Pd and target-related similarity matrices St and Pt into the model to more accurately discover potential DTIs. Based on a previous study [33], the inner product of the corresponding two drug feature vectors and two target feature vectors was used to approximate the drug similarity and target similarity matrices, respectively. The detailed decomposition process was as follows:

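(5)
$$S_d \approx AA^{\top}, \quad S_t \approx BB^{\top}, \quad P_d \approx AA^{\top}, \quad P_t \approx BB^{\top}$$

Therefore, we added the drug similarity matrix $S_d$, the target similarity matrix $S_t$, the drug topological feature similarity matrix $P_d$, and the target topological feature similarity matrix $P_t$ into Eq. (4). The new objective was as follows:

(6)
$$\begin{aligned} \arg\min_{A,B} \; & \|W \odot (Y - AB^{\top})\|_F^2 + \lambda_l \left( \|A\|_F^2 + \|B\|_F^2 \right) + \lambda_d \|S_d - AA^{\top}\|_F^2 \\ & + \lambda_t \|S_t - BB^{\top}\|_F^2 + \lambda_m \|P_d - AA^{\top}\|_F^2 + \lambda_n \|P_t - BB^{\top}\|_F^2 \end{aligned}$$

where $\lambda_d$, $\lambda_t$, $\lambda_m$, and $\lambda_n$ are the regularization parameters.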
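
To make the structure of the objective concrete, the following sketch evaluates the loss of Eq. (6) for given factors. It assumes that $W$ is applied element-wise to the residual; the function names `fro2` and `spldmf_objective` and the toy data are illustrative and not taken from the authors' code.

```python
import numpy as np


def fro2(M):
    """Squared Frobenius norm of a matrix."""
    return float(np.sum(M * M))


def spldmf_objective(Y, W, A, B, S_d, S_t, P_d, P_t,
                     lam_l, lam_d, lam_t, lam_m, lam_n):
    """Value of the Eq. (6) objective for factors A (Nd x r) and B (Nt x r)."""
    drug_gram = A @ A.T       # approximates S_d and P_d
    target_gram = B @ B.T     # approximates S_t and P_t
    return (fro2(W * (Y - A @ B.T))                  # weighted reconstruction error
            + lam_l * (fro2(A) + fro2(B))            # L2 regularization
            + lam_d * fro2(S_d - drug_gram)
            + lam_t * fro2(S_t - target_gram)
            + lam_m * fro2(P_d - drug_gram)
            + lam_n * fro2(P_t - target_gram))


# Consistency check: data generated exactly by the model gives zero loss
# once the L2 term is switched off.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 2))
B = rng.normal(size=(3, 2))
Y = A @ B.T
W = np.ones_like(Y)
perfect = spldmf_objective(Y, W, A, B, A @ A.T, B @ B.T, A @ A.T, B @ B.T,
                           lam_l=0.0, lam_d=1.0, lam_t=1.0, lam_m=1.0, lam_n=1.0)
```

Because $S_d$ and $P_d$ are both matched against the same product $AA^{\top}$, the two drug similarity sources pull the drug factors toward a compromise between them, and likewise for the targets.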

The objective function of most recent MF-based approaches for DTI prediction is nonconvex. As a result, the optimization can easily be trapped in local minima, particularly when dealing with heavy noise and a large amount of missing data. Many studies showed that SPL could keep the model from falling into a bad local optimal solution because of its training strategy of selecting samples from easy to complex [44, 45]. Thus, we integrated the SPL algorithm into the MF model to improve its robustness. Consequently, Eq. (6) could be modified as:

(7)
$$\begin{aligned} \arg\min_{A,B} \; & \|W \odot (Y - AB^{\top})\|_F^2 + \lambda_l \left( \|A\|_F^2 + \|B\|_F^2 \right) + \lambda_d \|S_d - AA^{\top}\|_F^2 \\ & + \lambda_t \|S_t - BB^{\top}\|_F^2 + \lambda_m \|P_d - AA^{\top}\|_F^2 + \lambda_n \|P_t - BB^{\top}\|_F^2 + \sum_{i,j} \frac{\gamma^2}{W_{ij} + \gamma k} \end{aligned}$$

where $k$ denotes the model age and $\gamma$ controls the weights assigned to the selected samples.

According to Zhao et al. [44], the optimal $W_{ij}$ was calculated using Eq. (8) when $A$ and $B$ were fixed.

(8)
$$W_{ij} = \begin{cases} 1 & \text{if } l_{ij} \le \dfrac{1}{(k + 1/\gamma)^2} \\ 0 & \text{if } l_{ij} \ge \dfrac{1}{k^2} \\ \gamma \left( \dfrac{1}{\sqrt{l_{ij}}} - k \right) & \text{otherwise} \end{cases}$$

where $l_{ij} = [(Y - AB^{\top})_{ij}]^2$. When $l_{ij} \le 1/(k + 1/\gamma)^2$, the corresponding weight was 1, implying that the sample was taken as a simple sample and selected by the model during the training; when $l_{ij} \ge 1/k^2$, the sample was considered a difficult sample and was temporarily not selected by the model; in other cases, the sample was assigned a non-zero weight between 0 and 1 and was still considered an easy sample.
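
The weighting rule of Eq. (8) can be sketched in a few lines. The function name `spl_weights` and the example values of `k` and `gamma` are chosen for illustration only and are not values reported in the paper.

```python
import numpy as np


def spl_weights(loss, k, gamma):
    """Soft self-paced weights of Eq. (8) for an array of squared residuals."""
    loss = np.asarray(loss, dtype=float)
    easy_threshold = 1.0 / (k + 1.0 / gamma) ** 2   # at or below: weight 1
    hard_threshold = 1.0 / k ** 2                   # at or above: weight 0
    # Guard against division by zero for perfectly fitted entries.
    safe_loss = np.maximum(loss, np.finfo(float).tiny)
    soft = gamma * (1.0 / np.sqrt(safe_loss) - k)   # weight strictly between 0 and 1
    return np.where(loss <= easy_threshold, 1.0,
                    np.where(loss >= hard_threshold, 0.0, soft))


# Example with k = 1 and gamma = 1: thresholds are 0.25 and 1.0.
example_weights = spl_weights([0.0, 0.25, 0.5, 1.0, 4.0], k=1.0, gamma=1.0)
```

As `k` shrinks, both thresholds grow, so entries with larger residuals are gradually admitted into training.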

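Because the potential feature vectors of the drug and the target are coupled in the objective and cannot easily be solved for jointly, the alternative search strategy (ASS) was used to calculate $A$ and $B$. The potential feature vector of the drug was represented by $a_i$, which is a row vector of matrix $A$. Furthermore, the potential feature vector of the target was represented by $b_j$, which is a row vector of matrix $B$. The objective function was transformed as in Eq. (9) to implement the ASS algorithm.

(9)
$$\begin{aligned} L = {} & \sum_{i=1}^{N_d} \sum_{j=1}^{N_t} W_{ij} \left( Y_{ij} - a_i b_j^{\top} \right)^2 + \lambda_l \left( \sum_{i=1}^{N_d} \|a_i\|^2 + \sum_{j=1}^{N_t} \|b_j\|^2 \right) \\ & + \lambda_d \sum_{i=1}^{N_d} \sum_{p=1}^{N_d} \left( S_d(d_i, d_p) - a_i a_p^{\top} \right)^2 + \lambda_t \sum_{j=1}^{N_t} \sum_{q=1}^{N_t} \left( S_t(t_j, t_q) - b_j b_q^{\top} \right)^2 \\ & + \lambda_m \sum_{i=1}^{N_d} \sum_{p=1}^{N_d} \left( P_d(d_i, d_p) - a_i a_p^{\top} \right)^2 + \lambda_n \sum_{j=1}^{N_t} \sum_{q=1}^{N_t} \left( P_t(t_j, t_q) - b_j b_q^{\top} \right)^2 \end{aligned}$$

We fixed $B$ and computed the partial derivative of $L$ with respect to $a_i$ to minimize $L$. Afterward, $A$ was updated by setting $\partial L / \partial a_i = 0$. The resulting update was:

(10)
$$a_i = \left( \sum_{j=1}^{N_t} W_{ij} Y_{ij} b_j + \lambda_d \sum_{p=1}^{N_d} S_d(d_i, d_p) a_p + \lambda_m \sum_{p=1}^{N_d} P_d(d_i, d_p) a_p \right) \left( \sum_{j=1}^{N_t} W_{ij} b_j^{\top} b_j + \lambda_l I_k + \lambda_d \sum_{p=1}^{N_d} a_p^{\top} a_p + \lambda_m \sum_{p=1}^{N_d} a_p^{\top} a_p \right)^{-1}$$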
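
The step from Eq. (9) to Eq. (10) can be spelled out as follows. This is a sketch that assumes all other rows $a_p$ are held fixed while $a_i$ is updated and that constant factors are dropped, as is common for this kind of alternating update; the intermediate lines are not given in the source.

```latex
\begin{aligned}
\frac{\partial L}{\partial a_i}
 &= -2 \sum_{j=1}^{N_t} W_{ij} \left( Y_{ij} - a_i b_j^{\top} \right) b_j
    + 2 \lambda_l a_i \\
 &\quad - 2 \lambda_d \sum_{p=1}^{N_d} \left( S_d(d_i, d_p) - a_i a_p^{\top} \right) a_p
    - 2 \lambda_m \sum_{p=1}^{N_d} \left( P_d(d_i, d_p) - a_i a_p^{\top} \right) a_p = 0 \\[4pt]
\Longrightarrow \quad
 & a_i \left( \sum_{j=1}^{N_t} W_{ij} b_j^{\top} b_j + \lambda_l I_k
    + \lambda_d \sum_{p=1}^{N_d} a_p^{\top} a_p
    + \lambda_m \sum_{p=1}^{N_d} a_p^{\top} a_p \right) \\
 &\quad = \sum_{j=1}^{N_t} W_{ij} Y_{ij} b_j
    + \lambda_d \sum_{p=1}^{N_d} S_d(d_i, d_p) a_p
    + \lambda_m \sum_{p=1}^{N_d} P_d(d_i, d_p) a_p
\end{aligned}
```

The rearrangement uses the fact that $a_i b_j^{\top}$ is a scalar, so $(a_i b_j^{\top}) b_j = a_i (b_j^{\top} b_j)$. Right-multiplying both sides by the inverse of the bracketed $r \times r$ matrix gives Eq. (10).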

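Similarly, we fixed $A$ and computed the partial derivative of $L$ with respect to $b_j$. Then, $B$ was updated by setting $\partial L / \partial b_j = 0$. The resulting update was:

(11)
$$b_j = \left( \sum_{i=1}^{N_d} W_{ij} Y_{ij} a_i + \lambda_t \sum_{q=1}^{N_t} S_t(t_j, t_q) b_q + \lambda_n \sum_{q=1}^{N_t} P_t(t_j, t_q) b_q \right) \left( \sum_{i=1}^{N_d} W_{ij} a_i^{\top} a_i + \lambda_l I_k + \lambda_t \sum_{q=1}^{N_t} b_q^{\top} b_q + \lambda_n \sum_{q=1}^{N_t} b_q^{\top} b_q \right)^{-1}$$

where $I_k$ in Eqs (10) and (11) is the $r \times r$ identity matrix.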
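
A minimal NumPy sketch of the two closed-form updates is given below. It assumes that every row is updated from the previous iterate of the factor matrix (the text does not say whether updated rows are reused within a sweep), and the function name `update_rows`, the identity similarity matrices, and the parameter values are placeholders for illustration.

```python
import numpy as np


def update_rows(Y, W, A, B, S, P, lam_l, lam_s, lam_p):
    """Apply the Eq. (10) update to every row a_i of A with B held fixed.

    Calling it with (Y.T, W.T, B, A, S_t, P_t, lam_l, lam_t, lam_n) gives the
    Eq. (11) update for B, because the two equations are symmetric.
    """
    r = A.shape[1]
    gram = A.T @ A                   # sum_p a_p^T a_p, an r x r matrix
    sim = lam_s * S + lam_p * P      # combined similarity weights
    A_new = np.empty(A.shape, dtype=float)
    for i in range(A.shape[0]):
        numer = (W[i] * Y[i]) @ B + sim[i] @ A
        denom = (B.T * W[i]) @ B + lam_l * np.eye(r) + (lam_s + lam_p) * gram
        # a_i = numer @ inv(denom); solve the transposed system instead of inverting.
        A_new[i] = np.linalg.solve(denom.T, numer)
    return A_new


# Toy data (illustrative only): 6 drugs, 5 targets, r = 2.
rng = np.random.default_rng(0)
Y = (rng.random((6, 5)) < 0.3).astype(float)
W = np.ones_like(Y)
A = rng.normal(size=(6, 2))
B = rng.normal(size=(5, 2))
S_d, P_d = np.eye(6), np.eye(6)      # placeholder drug similarities
S_t, P_t = np.eye(5), np.eye(5)      # placeholder target similarities

A = update_rows(Y, W, A, B, S_d, P_d, lam_l=0.1, lam_s=0.5, lam_p=0.5)        # Eq. (10)
B = update_rows(Y.T, W.T, B, A, S_t, P_t, lam_l=0.1, lam_s=0.5, lam_p=0.5)    # Eq. (11)
F = A @ B.T                                                                    # score matrix
```

With both similarity weights set to zero and all entries of `W` equal to one, each row update reduces to ordinary ridge regression, which gives a convenient sanity check.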

Algorithms 1 and 2 describe the parameter estimation procedure. The potential drug feature representation $A$ and the potential target feature representation $B$ were obtained after several iterations of Eqs (10) and (11). We obtained the DTI prediction matrix $F$ by reconstructing the DTI matrix $Y$ as follows:

(12)
$$F = AB^{\top}$$

Algorithm 1: Pseudocode of parameter estimation for MF
Input:
Y: true drug-target interaction matrix; W: weight matrix; Sd, St: drug and target similarity matrices;
Pd, Pt: drug and target topological feature similarity matrices; r: dimensionality of the feature space; λl, λd, λt, λm, λn: regularization parameters
Output:
 drug potential representation A, target potential representation B, and score matrix F
1: initialize A and B randomly;
2: repeat
3:  Update A using Eq. (10);
4:  Update B using Eq. (11);
5:  Update F using Eq. (12);
6: until convergence

Candidate drugs (compounds) and targets (proteins) could then be identified from the prediction result, that is, by scoring and ranking the entries of matrix F. The workflow of the whole method is shown in Fig. 2.
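
The ranking step can be illustrated with a small helper that lists the highest-scoring pairs of F. It assumes that pairs already known to interact (Y = 1) are excluded from the candidate list, which the text does not state explicitly; the name `top_k_predictions` and the toy matrices are hypothetical.

```python
import numpy as np


def top_k_predictions(F, Y, k=3):
    """Return the k highest-scoring (drug, target, score) triples among pairs with Y == 0."""
    scores = np.where(Y == 0, F, -np.inf)              # mask out known interactions
    flat_order = np.argsort(scores, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat_order, F.shape)
    return [(int(i), int(j), float(F[i, j])) for i, j in zip(rows, cols)]


# Toy example: 2 drugs x 3 targets, with two known interactions.
Y = np.array([[1, 0, 0],
              [0, 1, 0]])
F = np.array([[0.95, 0.80, 0.10],
              [0.40, 0.90, 0.70]])
ranked = top_k_predictions(F, Y, k=3)
```

Here the two known pairs are skipped even though they carry the highest scores, so the list contains only new candidate interactions.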

Algorithm 2: Pseudocode of parameter estimation for SPLDMF
Input:
Y: true drug-target interaction matrix; Sd, St: drug and target similarity matrices;
Pd, Pt: drug and target topological feature similarity matrices; r: dimensionality of the feature space; λl, λd, λt, λm, λn: regularization parameters;
μ > 1: step size; k0: initial model age; kend: final model age
Output:
 score matrix F
1: initialize: solve the MF problem with all observations equally weighted to obtain A0 and B0; calculate lij; set t ← 0, k ← k0
2: while k > kend do
3:  Update W using Eq. (8);
4:  Update A and B using Algorithm 1;
5:  Update F using Eq. (12);
6:  Compute the current losses lij = [(Y − F)ij]^2;
7:  t ← t + 1, k ← k/μ;
8: end while
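
The outer loop of Algorithm 2 can be sketched as follows. To keep the example short, the inner solver is simplified to weighted ridge alternating least squares, so the similarity terms of Eq. (6) are omitted; all function names and default values (`k0`, `k_end`, `mu`, `gamma`, `lam`) are illustrative assumptions rather than the authors' settings.

```python
import numpy as np


def spl_weights(loss, k, gamma):
    """Eq. (8): clipping the soft weight to [0, 1] reproduces the three cases."""
    safe_loss = np.maximum(loss, np.finfo(float).tiny)   # avoid division by zero
    return np.clip(gamma * (1.0 / np.sqrt(safe_loss) - k), 0.0, 1.0)


def weighted_mf(Y, W, A, B, lam, n_iter=10):
    """Simplified inner solver: weighted ridge ALS without similarity terms."""
    ridge = lam * np.eye(A.shape[1])
    for _ in range(n_iter):
        for i in range(Y.shape[0]):      # update drug factors with B fixed
            A[i] = np.linalg.solve((B.T * W[i]) @ B + ridge, (W[i] * Y[i]) @ B)
        for j in range(Y.shape[1]):      # update target factors with A fixed
            B[j] = np.linalg.solve((A.T * W[:, j]) @ A + ridge, (W[:, j] * Y[:, j]) @ A)
    return A, B


def spl_mf(Y, r=3, lam=0.1, gamma=1.0, k0=2.0, k_end=0.1, mu=1.3, seed=0):
    """Self-paced outer loop: admit harder entries as the model age k decreases."""
    rng = np.random.default_rng(seed)
    A = rng.normal(scale=0.1, size=(Y.shape[0], r))
    B = rng.normal(scale=0.1, size=(Y.shape[1], r))
    W = np.ones(Y.shape)
    A, B = weighted_mf(Y, W, A, B, lam)        # step 1: all entries equally weighted
    k = k0
    while k > k_end:
        loss = (Y - A @ B.T) ** 2              # current squared residuals l_ij
        W = spl_weights(loss, k, gamma)        # step 3: update W with Eq. (8)
        A, B = weighted_mf(Y, W, A, B, lam)    # step 4: update A and B
        k /= mu                                # step 7: decrease the model age
    return A @ B.T, W                          # score matrix F and final weights


# Toy rank-3 matrix (illustrative only).
rng = np.random.default_rng(1)
Y = rng.normal(size=(30, 3)) @ rng.normal(size=(20, 3)).T
F, W = spl_mf(Y)
```

Dividing `k` by `mu > 1` raises both thresholds of Eq. (8) at every pass, which is how the loop moves from easy to complex samples.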

4.Results

The performance of the proposed model was first compared with that of other methods in simulation experiments under different missing rates and noise ratios. Then, it was compared with that of advanced models in four application scenarios. Further, two realistic and challenging extended datasets were selected for experimental comparison. We used four metrics, namely root-mean-squared error (RMSE), mean absolute error (MAE), area under the receiver operating characteristic curve (AUC), and area under the precision-recall curve (AUPR), to evaluate the effectiveness of SPLDMF.

4.1Simulation data experiment

Simulation experiments were carried out to test the robustness of the model under different missing rates and noise ratios. We compared the proposed SPLDMF with two popular DTI prediction methods: MF and SVD. According to the studies by Xia et al. [33], Zheng et al. [46], and Zhao et al. [44], a matrix Y following Gaussian distribution N(0,1) was generated randomly using n = 300, m = 200, and r = 3. We set three missing ratios (10%, 50%, and 90%) and five noise ratios (5%, 10%, 20%, 25%, and 40%) to verify the validity and robustness of the models. The noise added to Y was uniform noise in the range [-20, 20]. Based on a previous study [47], the conversion between matrices Y and Y was possible, and Y with a well-fitting effect could help explore new DTIs. RMSE and MAE criteria were used for evaluating the performance of the three methods, where $\mathrm{RMSE} = \frac{1}{\sqrt{mn}} \|Y - AB^{\top}\|_F$, $\mathrm{MAE} = \frac{1}{mn} \|Y - AB^{\top}\|_1$, and m, n are the rows and columns of the matrix Y, respectively. We performed 30 replicate experiments for each method, and the performance of each method was quantified as the average of the experimental results (Table 2). SPLDMF achieved the best RMSE and MAE in every case across the three missing ratio levels and five noise ratio levels. For instance, when missing ratio = 10% and noise ratio = 10%, the RMSE and MAE of SPLDMF reached 0.886 and 0.296, respectively, which were much better than the values of MF (1.472 and 0.667, respectively) and SVD (1.970 and 0.935, respectively). The predictive performance of the models decreased as the missing ratio increased. The proposed SPLDMF imposed more regularization constraints on the self-similarity of drugs and targets, allowing more similar DTIs to be accurately predicted. Table 2 demonstrates that our method obtained the best performance at all three missing ratios.
Additionally, the prediction error of all models increased with the increase in the noise ratio. However, the proposed SPLDMF was capable of adaptively weighting both clean and noisy samples due to the introduction of the SPL strategy. This learning strategy enabled the model to avoid falling into bad local optima and had better robustness to mitigate the effects of noise. Overall, the results of the simulation experiments revealed that the SPLDMF outperformed the MF and SVD methods under noise and missing data conditions.
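
The simulation setup and the two error metrics can be sketched as below. The generation procedure is an assumption: the text does not specify whether the Gaussian draw applies to Y itself or to its low-rank factors, nor how the noise ratio is applied, so this sketch draws rank-r factors from N(0,1) and corrupts a fraction of the observed entries. The function names are illustrative.

```python
import numpy as np


def rmse(Y, A, B):
    """RMSE = ||Y - A B^T||_F / sqrt(m n)."""
    m, n = Y.shape
    return np.linalg.norm(Y - A @ B.T, ord="fro") / np.sqrt(m * n)


def mae(Y, A, B):
    """MAE = ||Y - A B^T||_1 / (m n), with the entry-wise 1-norm."""
    m, n = Y.shape
    return np.abs(Y - A @ B.T).sum() / (m * n)


def make_synthetic(n_rows=300, n_cols=200, r=3,
                   missing_ratio=0.5, noise_ratio=0.1, seed=0):
    """Rank-r matrix with missing entries and uniform noise in [-20, 20] (assumed setup)."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n_rows, r))            # factors drawn from N(0, 1)
    V = rng.normal(size=(n_cols, r))
    Y_clean = U @ V.T
    observed = rng.random(Y_clean.shape) >= missing_ratio      # True where an entry is kept
    noisy = observed & (rng.random(Y_clean.shape) < noise_ratio)
    Y_noisy = Y_clean.copy()
    Y_noisy[noisy] += rng.uniform(-20, 20, size=int(noisy.sum()))
    return Y_clean, Y_noisy, observed, U, V


Y_clean, Y_noisy, observed, U, V = make_synthetic()
print(rmse(Y_noisy, U, V), mae(Y_noisy, U, V))   # error of the true factors on noisy data
```

A fitted model would be trained on the entries of `Y_noisy` selected by `observed` and then scored with `rmse` and `mae`.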

Table 2

Performance comparison of MF, SVD, and SPLDMF on synthetic data in terms of MAE and RMSE

| Missing ratio (%) | Noise ratio (%) | MAE: CMF | MAE: SVD | MAE: SPLDMF | RMSE: CMF | RMSE: SVD | RMSE: SPLDMF |
|---|---|---|---|---|---|---|---|
| 10 | 5 | 0.497 (0.040) | 0.755 (0.048) | 0.218 (0.005) | 1.340 (0.040) | 1.804 (0.048) | 0.743 (0.029) |
| 10 | 10 | 0.667 (0.026) | 0.935 (0.034) | 0.296 (0.009) | 1.472 (0.038) | 1.970 (0.034) | 0.886 (0.031) |
| 10 | 20 | 0.864 (0.023) | 1.159 (0.048) | 0.432 (0.016) | 1.635 (0.038) | 2.164 (0.048) | 1.078 (0.042) |
| 10 | 25 | 0.930 (0.022) | 1.426 (0.045) | 0.514 (0.020) | 1.694 (0.026) | 2.230 (0.045) | 1.189 (0.043) |
| 10 | 40 | 1.113 (0.025) | 1.481 (0.049) | 0.872 (0.032) | 1.833 (0.036) | 2.411 (0.049) | 1.710 (0.060) |
| 50 | 5 | 0.681 (0.039) | 0.795 (0.045) | 0.259 (0.008) | 1.776 (0.047) | 2.153 (0.045) | 0.889 (0.047) |
| 50 | 10 | 0.911 (0.033) | 1.018 (0.038) | 0.351 (0.013) | 1.970 (0.052) | 2.346 (0.038) | 1.075 (0.053) |
| 50 | 20 | 1.151 (0.022) | 1.297 (0.031) | 0.552 (0.022) | 2.231 (0.037) | 2.565 (0.031) | 1.393 (0.063) |
| 50 | 25 | 1.228 (0.032) | 1.411 (0.038) | 0.659 (0.032) | 2.289 (0.051) | 2.650 (0.038) | 1.554 (0.085) |
| 50 | 40 | 1.462 (0.026) | 1.734 (0.042) | 1.094 (0.037) | 2.469 (0.042) | 2.889 (0.042) | 2.157 (0.078) |
| 90 | 5 | 0.656 (0.027) | 0.775 (0.078) | 0.402 (0.019) | 2.453 (0.072) | 2.571 (0.078) | 0.996 (0.059) |
| 90 | 10 | 1.247 (0.035) | 1.138 (0.034) | 0.497 (0.017) | 3.315 (0.080) | 2.881 (0.034) | 1.337 (0.072) |
| 90 | 20 | 2.027 (0.037) | 1.514 (0.032) | 0.890 (0.042) | 4.262 (0.064) | 3.186 (0.032) | 2.171 (0.120) |
| 90 | 25 | 2.307 (0.036) | 1.683 (0.037) | 1.110 (0.045) | 4.559 (0.083) | 3.322 (0.037) | 2.520 (0.128) |
| 90 | 40 | 2.940 (0.052) | 2.138 (0.044) | 1.846 (0.079) | 5.129 (0.085) | 3.649 (0.044) | 3.540 (0.148) |

4.2Benchmark data experiment

We used the same dataset and cross-validation technique as the state-of-the-art methods (i.e., 5-time-10-fold cross-validation using Yamanishi’s benchmark dataset in four different application scenarios) to validate the performance of the model. Four cross-validation settings were used to better evaluate the model in these four scenarios: (1) CVP, which was based on the cross-validation of drug-target pairs; (2) CVR, which was based on cross-validation on rows; (3) CVC, which was based on cross-validation on columns; and (4) CV4S, which was based on random cross-validation. Table 3 lists the application scenarios as well as the optimal potential feature dimensionality settings in our experiments. We employed the CVP settings to predict known drug-known target interactions (i.e., scenario 1, named CVPS). Figure 3 illustrates the model’s AUPR and AUC values for several potential feature dimensionalities. The findings revealed that a higher potential feature dimensionality yielded more consistent AUPR and AUC values. In the CVP scenario, the GPCR dataset reached the optimal feature dimensionality at r = 80 (Fig. 3a). We used the CVR settings (i.e., scenario 3, named CVRS) for predicting new drug-known target interactions. The model’s AUPR and AUC values were calculated for various potential feature dimensionalities.
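
The first three cross-validation settings differ only in what is held out: individual pairs, whole rows (drugs), or whole columns (targets). The sketch below generates test masks for these settings on an Nd x Nt matrix; it is an assumed reading of the settings, the generator name `cv_folds` is hypothetical, and CV4S is not covered.

```python
import numpy as np


def cv_folds(Y, setting, n_folds=10, seed=0):
    """Yield boolean test masks for CVP (pairs), CVR (rows/drugs), or CVC (columns/targets)."""
    rng = np.random.default_rng(seed)
    n_drugs, n_targets = Y.shape
    if setting == "CVP":
        units = rng.permutation(n_drugs * n_targets)   # individual drug-target pairs
    elif setting == "CVR":
        units = rng.permutation(n_drugs)               # whole drugs (rows)
    elif setting == "CVC":
        units = rng.permutation(n_targets)             # whole targets (columns)
    else:
        raise ValueError(f"unknown setting: {setting}")
    for fold in np.array_split(units, n_folds):
        mask = np.zeros(Y.shape, dtype=bool)
        if setting == "CVP":
            mask.flat[fold] = True
        elif setting == "CVR":
            mask[fold, :] = True     # every pair of a held-out drug is tested
        else:
            mask[:, fold] = True     # every pair of a held-out target is tested
        yield mask


# Example: 20 drugs x 15 targets, 5 folds over rows (new-drug scenario).
Y_demo = np.zeros((20, 15))
cvr_masks = list(cv_folds(Y_demo, "CVR", n_folds=5))
```

During training, the entries selected by a mask would be hidden (for example, set to zero in Y), and AUC and AUPR would be computed on those entries only.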

Table 3

Application scenarios and dataset settings and optimal feature dimensionality

| | CVPS | CVCS | CVRS | CV4S |
|---|---|---|---|---|
| Dataset settings | CVP | CVC | CVR | CVP/CVC/CVR |
| Best feature dimension | 80 | 100 | 100 | 100 |

Figure 3.

Performance comparison of SPLDMF and other advanced models, and the influence and change of r on AUC and AUPR in different scenarios. (a) Changes in AUC and AUPR under different feature dimensions under CVPS. (b) Changes in AUC and AUPR under different feature dimensions under CVRS. (c) Variation in AUC and AUPR under different feature dimensions under CVCS. (d) Performance comparison of SPLDMF and other advanced models under the GPCR dataset in four scenarios.

The values are the average findings of 30 runs. The best results are shown in bold, and the values in parentheses are standard deviations.

The values were highest at r = 100. In the CVR scenario, the GPCR dataset also achieved the optimal feature dimensionality at r = 100 (Fig. 3b).

The CVC configuration was applied (i.e., scenario 2, named CVCS) for predicting known drug-new target interactions. Figure 3c illustrates the model’s AUPR and AUC values for several potential feature dimensionalities. The experimental findings revealed that the AUC curves in the CVC scenario differed significantly from those in the CVP and CVR scenarios, particularly at the potential feature dimensionality r = 70 (a variation amplitude of more than 0.2). In the CVC scenario, the GPCR dataset also had the best feature dimensionality at r = 100.

The fourth of the four scenarios (CV4S, new drug-new target) was the most difficult for DTI prediction. Since this type of cross-validation was random and the training and test datasets were also generated randomly, the test dataset might contain new drugs and new targets, so that drug-target pairs in the new drug-new target category were included (D1T1 pairs in Fig. 1d). In the CV4S setting, we performed 50 runs of 5-time-10-fold cross-validation tests based on the GPCR data. The optimal AUPR and AUC results were 0.651 ± 0.050 and 0.910 ± 0.012, respectively. The detailed calculation procedure is given in the code.

Table 4

Comparison of the metrics of the major algorithms in the CVPS, CVRS, CVCS, and CV4S scenarios based on the GPCR dataset

| Scenario | Method | AUC | AUPR |
|---|---|---|---|
| CVPS | NRLMF | 0.969 ± 0.004 | 0.749 ± 0.015 |
| CVPS | DNILMF | 0.975 ± 0.003 | 0.812 ± 0.009 |
| CVPS | SPLCMF | 0.976 ± 0.012 | 0.779 ± 0.015 |
| CVPS | SPLDMF | 0.982 ± 0.004 | 0.815 ± 0.015 |
| CVRS | NRLMF | 0.895 ± 0.011 | 0.364 ± 0.023 |
| CVRS | DNILMF | 0.967 ± 0.006 | 0.781 ± 0.050 |
| CVRS | SPLCMF | 0.967 ± 0.002 | 0.784 ± 0.023 |
| CVRS | SPLDMF | 0.971 ± 0.012 | 0.792 ± 0.050 |
| CVCS | NRLMF | 0.930 ± 0.012 | 0.556 ± 0.038 |
| CVCS | DNILMF | 0.933 ± 0.009 | 0.684 ± 0.036 |
| CVCS | SPLCMF | 0.931 ± 0.010 | 0.675 ± 0.015 |
| CVCS | SPLDMF | 0.941 ± 0.023 | 0.710 ± 0.050 |
| CV4S | NRLMF | 0.706 ± 0.008 | 0.385 ± 0.006 |
| CV4S | DNILMF | 0.897 ± 0.004 | 0.633 ± 0.025 |
| CV4S | SPLCMF | 0.856 ± 0.008 | 0.645 ± 0.025 |
| CV4S | SPLDMF | 0.910 ± 0.012 | 0.651 ± 0.050 |

Table 5

Top 10 drug-target relationship prediction scores and their validation

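| Rank | Drug name | Target name | Score | Databases | Literature |
|---|---|---|---|---|---|
| 1 | Verapamil | SCN4A | 0.983 | C, D, K | [49, 50] |
| 2 | Clozapine | DD5R | 0.978 | D | [51] |
| 3 | Mirtazapine | 5HR1A | 0.902 | D | [52] |
| 4 | Diethylstilbestrol | ESR1 | 0.896 | C, D, K | [53, 54] |
| 5 | Norethindrone | ESR1 | 0.894 | | |
| 6 | Methysergide | 5HR1D | 0.893 | C, D, K | [55] |
| 7 | Flunitrazepam | GARSA1 | 0.891 | C, K | [56] |
| 8 | Clozapine | ADRA1A | 0.886 | C, D | [57, 58] |
| 9 | Loxapine | 5HR2B | 0.879 | C, D, K | [59] |
| 10 | Isoflurane | GABRA1 | 0.876 | D | [60] |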

Table 6

Comparison of the metrics of the DNILMF, SPLCMF, and SPLDMF algorithms in four scenarios based on the Kuang and Hao datasets

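| Dataset | Scenario | AUC: DNILMF | AUC: SPLCMF | AUC: SPLDMF | AUPR: DNILMF | AUPR: SPLCMF | AUPR: SPLDMF |
|---|---|---|---|---|---|---|---|
| Kuang | CVP | 0.941 | 0.933 | 0.949 | 0.649 | 0.733 | 0.842 |
| Kuang | CVR | 0.803 | 0.831 | 0.840 | 0.602 | 0.491 | 0.710 |
| Kuang | CVC | 0.862 | 0.886 | 0.888 | 0.643 | 0.456 | 0.731 |
| Kuang | CV4S | 0.897 | 0.826 | 0.903 | 0.633 | 0.435 | 0.742 |
| Hao | CVP | 0.943 | 0.935 | 0.943 | 0.748 | 0.721 | 0.816 |
| Hao | CVR | 0.811 | 0.792 | 0.843 | 0.736 | 0.740 | 0.741 |
| Hao | CVC | 0.852 | 0.868 | 0.881 | 0.683 | 0.710 | 0.726 |
| Hao | CV4S | 0.901 | 0.816 | 0.912 | 0.621 | 0.593 | 0.735 |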

We conducted comparative experiments for the aforementioned four scenarios to verify the effectiveness of the proposed method. Specifically, we compared SPLDMF with three other state-of-the-art methods, and the results are given in Table 4. The results indicated that the AUC and AUPR of SPLDMF were the best among the compared methods. Our method could deal with noisy data more robustly due to the introduction of the SPL strategy, thus achieving better performance. SPLDMF outperformed NRLMF and DNILMF in AUC and AUPR under all scenarios, suggesting that the proposed SPLDMF was more robust in predicting the interactions between ligands and target proteins. Our method also outperformed SPLCMF, which likewise uses the SPL strategy, in all scenarios. A likely explanation is that we leveraged more drug-drug and target-target similarities to improve the predictive capacity for unknown outcomes. In the most difficult scenario, CV4S, SPLDMF improved on SPLCMF by 0.054 in AUC and 0.006 in AUPR.

The prediction matrix was scored using Eq. (12). After combining the DTI prediction scores for the NR, GPCR, IC, and E datasets, we took the 10 DTI pairs with the highest prediction scores; they are listed in Table 5. The predictions were validated against the ChEMBL, DrugBank, and KEGG databases, labeled C, D, and K, respectively, and some were additionally checked against previous studies. The fifth and sixth columns of Table 5 list the databases and the studies used for validation, respectively. The top-ranked interaction was between DB00661 (verapamil) and P35499 (SCN4A), with a predicted score of 0.983. This interaction was found in all three databases (C, D, and K) and was also reported in previous studies (Shafi et al., 2022; Stee et al., 2020). Except for the fifth pair, all predictions were supported by database records or the literature, which verifies them to a certain extent. The fifth pair, norethindrone (DB00717) and ESR1 (P03372), had no supporting record in the current databases or literature.
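The ranking step described above can be sketched as follows. This is a minimal illustration, assuming that the prediction scores are held in a drug-by-target matrix and that interactions already known during training are masked out before ranking; the text does not state the masking explicitly. The function name `top_k_novel_pairs` and the identifiers in the toy example are hypothetical.

```python
import numpy as np

def top_k_novel_pairs(scores, known, drug_ids, target_ids, k=10):
    """Return the k highest-scoring (drug, target, score) triples that are not already known."""
    scores = np.asarray(scores, dtype=float)
    known = np.asarray(known, dtype=bool)
    # Known interactions get -inf so they can never enter the ranking.
    masked = np.where(known, -np.inf, scores)
    flat_order = np.argsort(-masked, axis=None, kind="mergesort")[:k]
    rows, cols = np.unravel_index(flat_order, masked.shape)
    return [
        (drug_ids[r], target_ids[c], float(masked[r, c]))
        for r, c in zip(rows, cols)
        if np.isfinite(masked[r, c])
    ]

# Toy example (not data from the paper): 2 drugs x 2 targets, one known interaction.
toy_scores = [[0.9, 0.2],
              [0.4, 0.8]]
toy_known = [[1, 0],
             [0, 0]]
for drug, target, score in top_k_novel_pairs(toy_scores, toy_known, ["D1", "D2"], ["T1", "T2"], k=2):
    print(drug, target, score)
```

The resulting list is what would then be checked manually against external databases and the literature.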

According to the FDA, norethindrone (DB00717), a drug similar to diethylstilbestrol (DB00255), is a progestin used for contraception, the prevention of endometrial hyperplasia in hormone replacement therapy, and the treatment of other hormone-mediated diseases such as endometriosis. Diethylstilbestrol is also used to treat diseases such as breast and prostate cancer, but it is listed as a known carcinogen. The predicted results indicated that norethindrone has the same target (ESR1) as diethylstilbestrol. Given this shared target, norethindrone might, besides its established contraceptive use, also be a candidate for treating breast cancer, prostate cancer, and other diseases. We preliminarily verified this speculation through KEGG pathway analysis.

4.3 Expanded data experiment

Besides the simulated data and the common benchmark datasets, SPLDMF was also tested on two expanded datasets (prepared by Kuang [39] and Hao [31]) to further verify its effectiveness. The Kuang dataset contains 3681 known interactions, 786 drugs, and 809 targets; the Hao dataset contains 3688 known interactions, 829 drugs, and 733 targets. Table 6 compares the performance of SPLDMF with that of the other methods on the expanded datasets, indicating that SPLDMF achieved the best prediction performance on both. We attribute this mainly to the SPL strategy, which improves the generalization performance of the model and enables it to perform more robustly on noisy data. Meanwhile, the use of more similarity information also enhanced the prediction accuracy, which is conducive to the discovery of potential DTIs.
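The dataset sizes quoted above make the sparsity of the interaction matrices concrete. The short calculation below, a sketch with the illustrative helper name `interaction_density`, divides the number of known interactions by the number of possible drug-target pairs.

```python
def interaction_density(n_interactions, n_drugs, n_targets):
    """Fraction of all drug-target pairs that are known interactions."""
    return n_interactions / (n_drugs * n_targets)

# (known interactions, drugs, targets) as reported for the two expanded datasets.
datasets = {"Kuang": (3681, 786, 809), "Hao": (3688, 829, 733)}

for name, (n_int, n_drugs, n_targets) in datasets.items():
    density = interaction_density(n_int, n_drugs, n_targets)
    print(f"{name}: {n_drugs * n_targets} possible pairs, {density:.2%} known interactions")
```

In both datasets roughly 0.6% of all possible pairs are known interactions, so more than 99% of the entries are unlabeled. This is the high missing rate that motivates a training strategy that is robust to noisy and incomplete labels.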

5. Discussion and conclusion

Several computational methods, including similarity-based methods, standard machine learning methods, and MF-based methods, have been developed in recent years to achieve efficient and accurate DTI prediction. A recent study by Shi et al. [48] revealed that MF-based methods had the best prediction accuracy. Existing MF-based methods, however, might easily fall into bad local minima because of noise and missing data, as well as the nonconvex nature of MF models. Meanwhile, the lack of prior information makes it challenging for the model to accurately predict more potential associations. Therefore, we proposed a DTI prediction model that is based on an SPL strategy and incorporates more similarity information. The novelty of SPLDMF lies in two aspects. First, introducing the SPL strategy helps the model avoid falling into a bad local optimum and thus makes it more robust, so that SPLDMF had better prediction performance when the data were affected by noise. Second, we employed more prior similarity information to improve the feature extraction capability of the model, thus enabling it to identify more potential DTIs accurately.
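The easy-to-hard idea behind the SPL strategy can be illustrated with a generic sketch. It is not the SPLDMF objective: it assumes a plain squared-error MF loss, the common hard-threshold self-paced weighting (an entry is used only while its current loss is below an age parameter), and simple gradient updates, and it omits the similarity regularizers entirely. All names and hyperparameters (`spl_weights`, `spl_matrix_factorization`, `lam`, `growth`, and so on) are assumptions made for illustration.

```python
import numpy as np

def spl_weights(losses, lam):
    """Hard self-paced weights: 1 for entries whose loss is below lam, else 0."""
    return (losses < lam).astype(float)

def spl_matrix_factorization(Y, rank=2, lam=0.5, growth=1.5, outer=5,
                             inner=200, lr=0.05, reg=0.01, seed=0):
    """Weighted MF, Y ~ A @ B.T, where the set of trusted entries grows over time."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    A = 0.1 * rng.standard_normal((n, rank))
    B = 0.1 * rng.standard_normal((m, rank))
    for _ in range(outer):
        # Step 1: fix the factors, select the currently "easy" entries.
        losses = (Y - A @ B.T) ** 2
        V = spl_weights(losses, lam)
        # Step 2: fix the selection, update the factors on the selected entries only.
        for _ in range(inner):
            R = V * (Y - A @ B.T)
            A, B = A + lr * (R @ B - reg * A), B + lr * (R.T @ A - reg * B)
        # Step 3: raise the age parameter so harder entries are admitted next round.
        lam *= growth
    return A, B, V

# Toy interaction matrix (not data from the paper).
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
A, B, V = spl_matrix_factorization(Y)
print(np.round(A @ B.T, 2))
```

Entries with large residuals, which are the ones most likely to be noisy or mislabeled, are excluded in early rounds and admitted only once the age parameter has grown. This is the mechanism by which self-paced training is meant to steer a nonconvex model away from poor local optima.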

Extensive experiments on synthetic data and four benchmark datasets were performed to assess the validity of the proposed SPLDMF method, which was compared with three state-of-the-art DTI prediction methods. Two expanded datasets were also used to verify the validity of each method. Comprehensive analysis demonstrated that SPLDMF outperformed the other state-of-the-art approaches. On synthetic data, for example, SPLDMF was more robust to noisy and missing data. Furthermore, it outperformed the other methods in all four scenarios and on the two expanded datasets in terms of common machine learning evaluation metrics. The prediction results revealed that 9 of the top 10 DTI pairs were found in the databases and literature, where they were proven or considered effective. The remaining unproven DTI pair (DB00717-P03372) was preliminarily supported by pathway enrichment experiments. These results suggest that SPLDMF might provide a useful tool for predicting new DTIs and repositioning existing drugs.

Acknowledgments

This work was supported in part by the Macau Science and Technology Development (Grant no. 0056/2020/AFJ) from the Macau Special Administrative Region of the People’s Republic of China and the Key Project from the University of Educational Commission of Guangdong Province of China (Natural, grant no. 2019GZDXM005).

Conflict of interest

None to report.

References

[1] 

Hopkins AL. Predicting promiscuity. Nature. 167-168.

[2] 

Swamidass SJ. Mining small-molecule screens to repurpose drugs. Briefings in bioinformatics. 327-335.

[3] 

Iorio F, Rittman T, Ge H, Menden M, Saez-Rodriguez J. Transcriptional data: a new gateway to drug repositioning? Drug discovery today. 350-357.

[4] 

Luo Y, Zhao X, Zhou J, Yang J, Zhang Y, Kuang W, et al. A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Nature communications. 1-13.

[5] 

Quinn JG, Pitts KE, Steffek M, et al. Determination of affinity and residence time of potent drug-target complexes by label-free biosensing. Journal of Medicinal Chemistry.

[6] 

Ashburn TT, Thor KB. Drug repositioning: identifying and developing new uses for existing drugs. Nature reviews Drug discovery. 673-683.

[7] 

Ezzat A, Wu M, Li X-L, Kwoh C-K. Computational prediction of drug-target interactions using chemogenomic approaches: an empirical survey. Briefings in bioinformatics. 1337-1357.

[8] 

Huang S-Y, Li M, Wang J, Pan Y. Hybriddock: a hybrid protein-ligand docking protocol integrating protein-and ligand-based approaches. Journal of Chemical Information and Modeling. 1078-1087.

[9] 

Xue H, Li J, Xie H, Wang Y. Review of drug repositioning approaches and resources. International journal of biological sciences. 1232.

[10] 

Sousa SF, Ribeiro AJ, Coimbra J, Neves R, Martins S, Moorthy N, et al. Protein-ligand docking in the new millennium – a retrospective of 10 years in the field. Current medicinal chemistry. 2296-2314.

[11] 

Huang S-Y, Zou X. Advances and challenges in protein-ligand docking. International Journal of Molecular Sciences. (2010); 11: 3016-3034.

[12]

Ekins S, Williams AJ, Krasowski MD, Freundlich JS. In silico repositioning of approved drugs for rare and neglected diseases. Drug Discovery Today. (2011); 16: 298-310.

[13]

Sperandio O, Andrieu O, Miteva MA, Vo M-Q, Souaille M, Delfaud F, et al. Med-sumolig: A new ligand-based screening tool for efficient scaffold hopping. Journal of Chemical Information and Modeling. (2007); 47: 1097-1110.

[14]

Keiser MJ, Setola V, Irwin JJ, Laggner C, Abbas AI, Hufeisen SJ, et al. Predicting new molecular targets for known drugs. Nature. (2009); 462: 175-181.

[15]

Wang L, Ma C, Wipf P, Liu H, Su W, Xie X-Q. Targethunter: An in silico target identification tool for predicting therapeutic potential of small organic molecules based on chemogenomic database. The AAPS Journal. (2013); 15: 395-406.

[16]

Haupt VJ, Schroeder M. Old friends in new guise: Repositioning of known drugs with structural bioinformatics. Briefings in Bioinformatics. (2011); 12: 312-326.

[17]

Ma D-L, Chan DS-H, Leung C-H. Drug repositioning by structure-based virtual screening. Chemical Society Reviews. (2013); 42: 2130-2141.

[18]

Sousa SF, Ribeiro AJM, Coimbra JTS, et al. Protein-ligand docking in the new millennium – a retrospective of 10 years in the field. Current Medicinal Chemistry. (2013); 20(18): 2296-2314.

[19] 

Ekins S, Williams AJ, Krasowski MD, et al. In silico repositioning of approved drugs for rare and neglected diseases. Drug Discov Today. 16(7-8): 298-310.

[20] 

Sperandio O, Andrieu O, Miteva MA, Vo MQ, Souaille M, Delfaud F, et al. Medsumolig: A new ligand-based screening tool for efficient scaffold hopping. Journal of Chemical Information and Modeling. 1097-1110.

[21] 

Jarada TN, Rokne JG, Alhajj R. A review of computational drug repositioning: strategies, approaches, opportunities, challenges, and directions. BioMed Central.

[22] 

Pliakos K, Vens C, Tsoumakas G. Predicting drug-target interactions with multilabel classification and label partitioning. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 1-1.

[23] 

Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, et al. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic acids research. D901-D906.

[24]

Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, et al. From genomics to chemical genomics: new developments in KEGG. Nucleic acids research. D354-D357.

[25]

Chen B, Wild D, Guha R. PubChem as a source of polypharmacology. Journal of chemical information and modeling. 2044-2055.

[26]

Schomburg I, Chang A, Ebeling C, Gremse M, Heldt C, Huhn G, et al. BRENDA, the enzyme database: updates and major new developments. Nucleic acids research. D431-D433.

[27]

Gunther S, Kuhn M, Dunkel M, Campillos M, Senger C, Petsalaki E, et al. SuperTarget and Matador: resources for exploring drug-target relationships. Nucleic acids research. D919-D922.

[28] 

Yamanishi Y, Araki M, Gutteridge A, Honda W, Kanehisa M. Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics. i232-i240.

[29] 

Bleakley K, Yamanishi Y. Supervised prediction of drug-target interactions using bipartite local models. Bioinformatics. 2397-2403.

[30] 

Liu Y, Wu M, Miao C, Zhao P, Li X-L. Neighborhood regularized logistic matrix factorization for drug-target interaction prediction. PLoS computational biology. e1004760.

[31] 

Hao M, Wang Y, Bryant SH. Improved prediction of drug-target interactions using regularized least squares integrating with kernel fusion technique. Analytica chimica acta. 41-50.

[32] 

Yang X, Liu Y, He J, et al. Additional neural matrix factorization model for computational drug repositioning. BMC bioinformatics. 1-11.

[33] 

Xia L-Y, Yang Z-Y, Zhang H, Liang Y. Improved prediction of drug-target interactions using self-paced learning with collaborative matrix factorization. Journal of chemical information and modeling. 3340-3351.

[34] 

Yang M, Wu G, Zhao Q, Li Y, Wang J. Computational drug repositioning based on multi-similarities bilinear matrix factorization. Briefings in Bioinformatics. bbaa267.

[35] 

Ding Y, Tang J, Guo F, Zou Q. Identification of drug-target interactions via multiple kernel-based triple collaborative matrix factorization. Briefings in Bioinformatics.

[36] 

Wang Y, Zhang Y, Wang J, Xie F, Zheng D, Zou X, et al. Prediction of drug-target interactions via neural tangent kernel extraction feature matrix factorization model. Computers in Biology and Medicine. 106955.

[37] 

Kumar MP, Packer B, Koller D. Self-paced learning for latent variable models. In International Conference on Neural Information Processing Systems.

[38] 

Kumar MP, Turki H, Dan P, Koller D. Learning specific-class segmentation from diverse data. In International Conference on Computer Vision.

[39] 

Kuang Q, Xu X, Li R, Dong Y, Li Y, Huang Z, et al. An eigenvalue transformation technique for predicting drug-target interaction. Scientific reports. 13867.

[40] 

Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, et al. From genomics to chemical genomics: new developments in KEGG. Nucleic acids research. D354-D357.

[41]

Schomburg I, Chang A, Ebeling C, Gremse M, Heldt C, Huhn G, et al. BRENDA, the enzyme database: updates and major new developments. Nucleic acids research. D431-D433.

[42]

Gunther S, Kuhn M, Dunkel M, Campillos M, Senger C, Petsalaki E, et al. SuperTarget and Matador: resources for exploring drug-target relationships. Nucleic acids research. D919-D922.

[43] 

Grover A, Leskovec J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 855-864.

[44] 

Zhao Q, Meng D, Jiang L, Xie Q, Xu Z, Hauptmann AG. Self-paced learning for matrix factorization. In AAAI. 3: 4.

[45] 

Meng D, Zhao Q, Jiang L. A theoretical understanding of self-paced learning. Information Sciences. 319-328.

[46] 

Zheng X, Ding H, Mamitsuka H, Zhu S. Collaborative matrix factorization with multiple similarities for predicting drug-target interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1025-1033.

[47] 

Van Laarhoven T, Nabuurs SB, Marchiori E. Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics. 3036-3043.

[48] 

Shi J-Y, Yiu S-M, Li Y, Leung HC, Chin FY. Predicting drug-target interaction for new drugs using enhanced similarity measures and super-target clustering. Methods. 98-104.

[49] 

Shafi O, Latief M, Hassan Z, Abbas F, Farooq S. Familial hypokalemic periodic paralysis: A case series and review. Hemoglobin (g/dL). 13-8.

[50] 

Stee K, Van Poucke M, Peelman L, Lowrie M. Paradoxical pseudomyotonia in English springer and cocker spaniels. Journal of veterinary internal medicine. 253-257.

[51] 

Von Coburg Y, Kottke T, Weizel L, Ligneau X, Stark H. Potential utility of histamine h3 receptor antagonist pharmacophore in antipsychotics. Bioorganic & medicinal chemistry letters. 538-542.

[52] 

Langham JJ, Cleves AE, Spitzer R, Kirshner D, Jain AN. Physical binding pocket induction for affinity prediction. Journal of medicinal chemistry. 6107-6125.

[53] 

Adam AHB, de Haan LH, Louisse J, Rietjens IM, Kamelia L. Assessment of the in vitro developmental toxicity of diethylstilbestrol and estradiol in the zebrafish embryotoxicity test. Toxicology in Vitro. 105088.

[54] 

Gomez AL, Delconte MB, Altamirano GA, Vigezzi L, Bosquiazzo VL, Barbisan LF, et al. Perinatal exposure to bisphenol a or diethylstilbestrol increases the susceptibility to develop mammary gland lesions after estrogen replacement therapy in middle-aged rats. Hormones and Cancer. 78-89.

[55] 

Wishart D, Arndt D, Pon A, Sajed T, Guo AC, Djoumbou Y, et al. T3DB: The toxic exposome database. Nucleic Acids Research. 43: D928-D934.

[56] 

Collins I, Davey WB, Rowley M, Quirk K, Bromidge FA, McKernan RM, et al. N-(indol-3-ylglyoxylyl) piperidines: high affinity agonists of human gaba-a receptors containing the subunit. Bioorganic & medicinal chemistry letters. 1381-1384.

[57] 

Gundlach M, Di Paolo C, Chen Q, Majewski K, Haigis A-C, Werner I, et al. Clozapine modulation of zebrafish swimming behavior and gene expression as a case study to investigate effects of atypical drugs on aquatic organisms. Science of The Total Environment. 152621.

[58] 

Masellis M, Basile V, DeLuca V, Meltzer H, Lieberman J, Potkin S, et al. Alpha-1a adrenergic (adra1a) and serotonin 6 (htr6) receptor gene polymorphisms and clinical response to clozapine. American Journal of Medical Genetics-Neuropsychiatric Genetics.

[59] 

Alaimo S, Bonnici V, Cancemi D, Ferro A, Giugno R, Pulvirenti A. DT-Web: a web-based application for drug-target interaction and drug combination prediction through domain-tuned network-based inference. BMC systems biology. (2015); 1-11.

[60] 

Hall AC, Rowan KC, Stevens RJ, Kelley JC, Harrison NL. The effects of isoflurane on desensitized wild-type and α1 (s270h) γ-aminobutyric acid type a receptors. Anesthesia & Analgesia. 1297-1304.