You are viewing a javascript disabled version of the site. Please enable Javascript for this site to function properly.
Go to headerGo to navigationGo to searchGo to contentsGo to footer
In content section. Select this link to jump to navigation

Prediction of Breast cancer using integrated machine learning-fuzzy and dimension reduction techniques

Abstract

In the last two decades, regardless of epidemiological, and clinical studies, the incidence of breast cancer (BC) is still increasing. However, so far, a lot of research has been done in this field to diagnose BC, and some of them have been discussed in the literature section. But still, happening major issues while dealing with fault feature matrix, generated from traditional feature extraction methods. As a result, the complexity of fault classification has raised, which will negatively impact fault identification’s accuracy and effectiveness. Thus, in this research, a novel hybridized machine learning-fuzzy and dimension reduction (MLF-DR) model has been proposed to improve the decision capabilities and efficiency of an ML model. A feature-based class-togetherness fuzzification method has been used for every feature. The novelty of our research work is to find all possibilities between cancerous and non-cancerous cells by implementing a fuzzy inference system (FIS) in the data analysis phase, and DR techniques at preprocessing phase to select the best optimizing features. This research tries to reduce the incidence of BC and prevent needless deaths, thus will probably follow necessary action to perform i.e. (i) FIS to interpret input values; (ii) principal component analysis (PCA), and recursive feature elimination (RFE) to select best features, and (ii) logistic regression (LR) and random forest (RF) models to predict BC with these features. Furthermore, all the experiments have been done on Wisconsin Breast Cancer Dataset (WBCD), freely available on the Kaggle repository using Python programming on Jupyter Notebook version 6.4.3. The key findings of this research are that the LR-PCA (8 components) model can reliably and successfully obtain the defect diagnosis results with 99.1% accuracy, as compared to individual LR and RF models.

1Introduction

The most often diagnosed disease and the primary cause of cancer death in women globally is breast cancer (BC), however, patterns and trends vary from country to country. According to a survey by the World Health Organization, half of the one million women who received a new diagnosis of BC had died since the disease was typically discovered too late [1]. Thus, prevention measures have failed, in part due to an increase in diagnoses brought on by the deployment of mammographic screening. However, an estimated 2.3 million cases and 685,000 deaths from BC worldwide in 2020 [2] and an expected increase to 4.4 million cases by 2070 [3], which has surpassed lung cancer as the most frequently diagnosed malignancy. In women, BC accounted for roughly 24.5% of all cancer cases and 15.5% of cancer deaths in 2020, making it the most common cancer in terms of incidence and mortality across much of the world. In this survey, it has been found that African citizens were badly affected by BC, accounting for 10.44 %, the second-most disease after lung cancer from 1900 to 2019 (Fig. 1).

Fig. 1

Cancer death rate from the year 1900 to 2019 in African women.

Cancer death rate from the year 1900 to 2019 in African women.

Implementing risk management will need to have a user-friendly device that allows women and physicians to recognize and manage risk in the breast areas. To make sure that women at higher risk are recognized early on, researchers and health systems should think about how models might be used for BC patient records. Given their many diagnostic criteria, the ideal intervention would target all significant features of BC, but this is improbable, thus models that predict the disease and enable the administration of future drugs that target specific features would be ideal. While benign masses cannot spread to other tissues and are therefore only able to expand within the benign mass, malignant tumours can spread to the nearby cells, which can result in breast cancer.

However, the use of machine learning (ML) for the classification of cancers has recently gained more attention [4, 5]. Thus, widely applicable in many healthcare firms for diagnosing diseases these days. Since, due to the increase in the availability of many features day by day, it is quite critical to handle ML models in classifying the cells into the right categories. Choosing the finest possible traits is crucial for more accurate disease prediction. Accordingly, the main goal of this study is to describe how this research presents fuzzy-machine learning, including a study of the fuzzy sets in the machine learning model. This method tries to improve the decision-making capabilities of an ML model by using the fuzzy membership function at the data analysis stage. Moreover, the focus of our research is highly correlated to the dimensionality reduction technique to find the best promising features. Thus, to enhance the model capabilities using both fuzzy and feature extraction techniques, we have proposed a methodology, namely MLF-DR in this paper. This technique is responsible for fuzzifying the BC data using an II-type membership function at the first stage and secondly applying the dimension reduction method to those fuzzified variables. That’s why we have taken the two most common feature-selection techniques as PCA and RFE for choosing the best ideal features and applying these features to ML models (here are LR and RF) to classify cells into either malignant or benign.

The rest of the paper is as follows: Section 2, describes the related works on BC; Section 3, specifies the material and methods for this work; Section 4, shows the performance result of five different techniques; Section 5, discusses all the useful insights that we have done in this work; and finally Section 6, concludes with a significant conclusion.

2Literature review

A powerful method for modeling ambiguity in clinical practice is fuzzy logic. Most medical perceptions are ambiguous in the world of medicine. These ideas are typically challenging to formalize and quantify. Making an analysis using fuzzy logic involves doing so in a situation that is inaccurate, ambiguous, and imprecise. The MYCI, INTERNIST, and DOCTORMOON applications are a few examples of how fuzzy set theory has been applied in the field of medicine. As a result, several sorts of studies have been done in the context of medical diagnostics. Therefore, the main goal of the current study is to investigate the studies that have used fuzzy logic approaches to study infectious disorders to identify prevalent patterns and strategies. This will be done by performing a comprehensive review of the literature.

3Materials & methods

It is crucial to understand that both women with and without a known genetic component to BC belong to the group of women who are at a high risk of getting the disease. Initial precaution is very much essential to avoid unwanted deaths due to this disease [25, 26]. Several different and novel techniques, including the FIS, genetic neuro-fuzzy, and FDSS, were used after 2004, as shown in Fig. 2. The most intriguing finding was that rule-based fuzzy logic, Adaptive Neuro-Fuzzy Inference System, Fuzzy Reed-Forest model, Neuro-fuzzy, Fuzzy Analytic Hierarchy Processes, and Gaussian-Fuzzy neural networks were more commonly implemented in different disease data analysis than any other fuzzy logic techniques in the studied articles. Although, selecting good features can improve the performance of an ML model significantly. As we have seen in the above figure fuzzy systems were widely applied in the research fields since the year 2004. There have been different applications generated so far using fuzzy logic till 2019. This helps in disease diagnosis, monitoring patients at regular intervals, and making better decisions these days. As per a report from NCBI (National Centre for Biotechnology Information), many research articles have been designed using fuzzy logic every year. However, a single symptom can sign many diseases causes difficult to identify the disease. Thus, to tackle the uncertainty and vagueness fuzzy logic has been introduced here.

Fig. 2

Fuzzy logic implementation since 2004.

Fuzzy logic implementation since 2004.

Thus, in this research, we have proposed machine learning-fuzzy and dimension-reduction techniques for BC prediction using the WBCD dataset, as shown in Fig. 3. The main idea behind these two is to implement a modeling rule-based fuzzy approach and PCA technique to select the best optimal features that are greatly responsible for BC, which has been discussed in this section.

Fig. 3

A Proposed methodology MLF-DR to classify abnormal cells into either ‘B’ or ‘M’ using fuzzification and dimension reduction techniques.

A Proposed methodology MLF-DR to classify abnormal cells into either ‘B’ or ‘M’ using fuzzification and dimension reduction techniques.
Table 1

Some literature review on BC disease classification and prediction using different techniques since 2018

Author &yearPurposeMethodsFindingsProsCons
Cardoso et al. 2018To focus on public awareness regarding metastatic breast cancer to avoid unnecessary deaths globallyCarried Multi-layered techniques to detect mBC in patients, nearly about 15,000 from 34 countries in the years 2015 and 2016.Most clinical decisions include the treatment of HR+mBC technique to minimize the BCHelps the researchers to make a qualitative decision when dealing with mBCRemedies can improve survival rates but cannot remove completely worldwide.
Caswell et al. 2018To find the survival of mBC patients after metastasis between two time periods.Meta-regression model to test the longevity of mBC patients over timeNo survival improvement from the year 1980 to 1990but slide changes from 1990 to 2010 in case of meta-regressionThese findings can help patients and doctors discuss the prognosis and course of treatment for mBC.Only 25% of individuals had recurring diseases examined, creating a major challenge.
Pilevarzadeh et al. 2019To assess the prevalence of depression among BC patients worldwide.The meta-publications analysis from 1 January 2000 to 30 March 2019 was evaluated using the Hoy tool.The Eastern region had the highest frequency of depression, and middle-income countries had double the prevalence of depression compared to developed countries.Helps in finding the quality of evaluation strategies, screening processes, and data extraction techniques used in those articlesGiven the high rate of depression in breast cancer patients, it is crucial to conduct screening within the allotted time frames.
Xie et al. 2019Implementing supervised and unsupervised deep convolutional neural networks (CNN) to assess BC histopathological images.Inception_V3 and Inception_ResNet_V2 [10, 11] methods to classify binary and multi-class BC imagesThe proposed model outperforms existing ML models in providing better clustering resultsThis pre-trained model works well when there has been fewer amount of images availableCreate challenges when evaluating this model on larger image BC datasets
Mao et al. 2019To determine whether radiology can enhance mammography’s diagnostic performance in comparison to what expert radiologists can produce.A predictive model was built using SVM, LR, k-NN, and NB, and also an independent testing data set was utilized to verify the model’s potential while predicting BCLR model results with 97.8% as accuracy, 97.5 % as specificity, and 98.3 % as sensitivityComparative analysis helps the researchers and also the physicians choose the best classifying modelOnly 173 patients were implemented which might be a bigger challenge for the medical field containing thousands of images.
Debelee et al. 2020To extract features, and detect unprocessed raw images collected using breast tomosynthesis, mammography, (MRIs), and ultrasound imaging modalities.Reviewed articles from 2004 to 2018 to evaluate and compare the BC imaging modalitiesBuilding large datasets with medical images and making them accessible to researchers in the first place will allow for the availability of various pre-trained models.Analysis helps the researcher to design an algorithm that would work on a real-time larger image dataset.Emphasizing real-time larger medical image datasets might be the cause for doing less research.
Ak. 2020To predict the breast tumour types, collected from Dr. William H. Walberg at Wisconsin HospitalML models such as LR [15–18], k-NN [19, 20], SVM [18, 21], NB [22], DT [17], and RF [21, 23], for comparative analysis and visualizationProvide significant benefits and impact cancer detection in the decision-making process, especially the LR model of has 98.1% accuracyThe decision-making process for cancer detection is impacted by various ML and data mining techniques.Results may vary on larger real-time imbalanced BC datasets.
Khandezamin, et al. 2020To analyze the proposed model Group Method Data Handling (GMDH) with three independent BC datasetFirstly implementing the LR model to extract features and secondly, GMDH to diagnose BCGMDH found as the best predictive model with a precision of 99.4% for WBCD, 99.6% for WDBC, and 96.9% for the WPBC dataset.The proposed model provides an improved model for BC prognosis and therapy.Implementing this proposed model might cause less accurate results but can improve with Dl models.

3.1Data collection

However, keeping major facts is accomplished by data collection. Thus, accurate data collection procedures are much more essential for building a high-performing model. So, in this work, we have tried to remove the null values from the Wisconsin Breast Cancer Dataset (WBCD), collected from the Kaggle repository. This dataset includes 569 individual patient records, which are classified into two different classes i.e. benign (i.e. 357 records) and malignant (i.e. 212 records) as displayed in Fig. 4.

Fig. 4

WBCD classification into benign and malignant.

WBCD classification into benign and malignant.

3.2Methods

3.2.1Data analysis

3.2.1.1. Outlier detection Working with data requires identifying and controlling outliers, which is a crucial factor in the creation and application of ML algorithms. Outliers have a significant impact on model accuracy and can affect patterns. Although, to maintain a model’s effectiveness an outlier detection is much needed after model deployment, as shown in Fig. 5.

Fig. 5

Outlier detection using box-plot in WBCD.

Outlier detection using box-plot in WBCD.

3.2.1.2. Selected joint and marginal feature distributions The probability distribution for two random variables (i.e. X and y) is represented as a joint probability distribution (JPD) and is calculated as follows in Equation 1.

(1)
fXy(x,y)=P(X=x,y=y)=P((X=x)and(y=y))

Also, we can define it jointly as follows in Equations 2, 3, and 4.

(2)
BCXy={(X,y)|fXy(x,y)>0},
(3)
BCX={X1,X2,.}andBCy={y1,y2,.},
(4)
BCXyBCX*BCy,(foreach{(Xj,yj)BCX,y}),

So, JPD for these two random variables is specified as in Equation 5.

(5)
(xj,yj)fXy(x,y)=1

Figure 6, shows the distribution of each of these distinct variables is referred to as a marginal distribution (MD) and calculated as follows in Equations 6 and 7.

Fig. 6

MD on target variable (here is diagnosis: ‘M’ and ‘B’).

MD on target variable (here is diagnosis: ‘M’ and ‘B’).

(6)
g(X)=yf(X,y),forXrandomvariables
(7)
h(y)=Xf(X,y),foryrandomvariables

3.2.1.3. Correlation matrix Apart from that, when there have been a large number of attributes in a dataset, the significance of data correlation becomes significant. It makes a potential relationship between two sets of data. However, some features are more crucial to accurately predicting the disease. Therefore, when there are more than two features, the data visualization technique can aid the researcher in identifying irrelevant features. Thus, the correlation matrix between all attributes in WBCD is as follows in Fig. 7.

Fig. 7

Correlation matrix between all attributes in WBCD.

Correlation matrix between all attributes in WBCD.

As we discussed, there may be chances of uncertainty in raw BC data, which can affect to recognize the useful insights like difficulty in identifying patterns as well. Therefore, it is much required to implement fuzzy logic to find all possibilities of BC in women [27].

3.2.1.4. Fuzzy inference system (FIS) Uncertainty is one of the biggest barriers to real-world problems, which results in imprecise knowledge about the training dataset for pattern recognition tasks. Nevertheless, it is essential to make sufficient provisions to deal with uncertainty. For the removal of fuzziness caused to redundant attributes of patterns within the dataset, a FIS is used [28]. In the MLF model, fuzzy values are used to feed the ML models instead of the usual crisp input value. The fuzzification procedure creates a membership matrix, and the resulting fuzzified matrix shows the total number of elements present [29, 30]. The number of features and classes present in the WBCD, which serves as the input to the ML models, is multiplied by the number of features in this matrix.

Fuzzification, membership function, and Defuzzification are the three major components of the FIS, as shown in Fig. 8. The correlations between input and output are expressed using the fuzzy IF and THEN rule, allowing for the simulation of the meaningful input and, in turn, producing an output. Fuzzification offers a way whereby each observation is given a level of affiliation with each of the fuzzy sets [31, 32]. This enables us to create a language summary of a set of numerical data and generate a sense of the underlying patterns [33–35]. This will be revealed by transforming all 30 observations and one target variable into textual information to define their classes as either benign, normal, or malignant as displayed in Fig. 9. Thus, it is much required before going for data preprocessing to build a good model.

Fig. 8

Flowchart design for FIS.

Flowchart design for FIS.
Fig. 9

Representation of Fuzzy variable set.

Representation of Fuzzy variable set.

3.2.2Data preprocessing and feature engineering

3.2.2.1. Data preprocessing Preprocessing of data is an important stage in the ML process because the quality of the data and the information that can be extracted from it directly influence how well our model can learn. For this reason, we must preprocess the data before introducing it to the model. Label encoding and dataset splitting are two major concerns in this section. Label encoding transforms the categorical values into their respective numeric values, so that model can easily understand. For this reason, we have used LabelEncoder () class that converts the ‘diagnosis’ field values from ‘M’ to ‘1’ and ‘B’ to ‘0’. And, we have found a total of 357 and 212 records belonging to classes ‘B’ and ‘M’ respectively. Another aspect of this research is feature engineering, which works on relevant features that are directly responsible for this BC disease.

3.2.2.2. Feature engineering The act of choosing, modifying, and converting unprocessed data into attributes that can be included in ML models is known as feature engineering. It may be important to create and train better features to make ML models perform effectively on new tasks. Thus, in this section we will discuss two major areas a) using RobustScaler () for removing outliers, and b) implementing PCA and RFE for selecting the best optimal features.

Scaler

Scaling numerical input variables to a common range is crucial since many ML algorithms are sensitive to features with varying sizes. Among scaling methods, Normalizer and StandardScaler are the most frequently used. In this instance, we find in boxplots that some features have a few very severe outliers that are difficult to replace or eliminate. Extreme outliers frequently harm the sample mean and variance. The two scalers mentioned above hence might not perform well in this situation. As a result, we utilize RobustScaler as an alternative, which, by deleting the median and scaling the data following the quantile range, is more resistant to outliers.

Dimensionality reduction

From Figs. 5 and 6, we can see that certain features, such as radius mean, perimeter mean, and the area mean, are significantly associated with the joint marginal distributions and correlation matrix plots. These characteristics virtually all affect the dependent variable in the same way. The “Curse of Dimensionality” represents more data, complex computation, and the risk of overfitting effects on classification algorithms in real-world problems where there are too many features in the dataset [36]. The fundamental feature selection methods mostly focus on the distinct features’ qualities and how they relate to one another. A more practical method, however, would base feature selection on how each feature impacts the performance of a specific model. These issues can be successfully avoided using both feature extraction and feature selection. Therefore, in this instance, we’ll test two distinct approaches: principal component analysis (PCA) and recursive features elimination (RFE), and compare the outcomes by applying each to classification algorithms individually. These can be plotted graphically by using the two most common methods such as (i) receiver operating characteristic (ROC), and (ii) precision-recall (PR) curves to visualize their performance. However, a ROC curve builds a relationship between the correctly positive and incorrectly positive instances for a predictive model for various probability thresholds. Besides that, a PR curve represents both the precision and recall values on the X and Y axis respectively using various probability thresholds.

PCA

PCA converts a piece of correlated variables (for example ‘c’) into a smaller collection of uncorrelated variables (i.e. k) s.t k < c as principal components, while preserving as much variation in the original dataset as possible. This is due to, how sensitive PCA approaches are to the amount of data present for analysis. However, choosing the optimal number of components for the given dataset is the most crucial method in PCA. Thus, by taking linear combinations of the original predictors, a new and smaller set of predictors can be generated, which can help to reduce the feature space’s dimensions. The original model is defined as in Equation 8.

(8)
Y=α+γ1X1+γ2X2++γcXc+ɛ

Where, γistheslope, Y - interceptα. Now, choose L1, L2  . . .  Lk, and k < c, for ‘c’ instances from (Xi, Yi), i = 1 . . .  c. However, for each constant Piq, the principal component (Li) can be determined as in Equation 9.

(9)
Li=q=1cpiqXi

After successfully evaluating PCA, we have found that after the seventh component, there is an elbow, where the first seven components account for 91% of the overall variance as shown in Fig. 10. Additionally, we can retain roughly 95% or even more than 99% of the overall variance if we keep the first 10 or 17 primary components.

Fig. 10

PCA plot design between explained variance and number of components using ROC curves.

PCA plot design between explained variance and number of components using ROC curves.

RFE (Recursive features elimination)

However, it is required to eliminate features with weights that are almost zero, to reduce model complexity. However, we must keep in mind that even the removal of one feature causes the coefficients of other features to alter. Therefore, by ranking the fitted model coefficients, we can remove them one at a time, starting with the feature with the lowest weight. It would be laborious to do this manually for 30 features, but thankfully Sklearn offers RFE. It employs a separate model that has been appropriately trained and removes the weakest characteristics one at a time. To evaluate how many significant features to include based on how well they function. Recursive Feature Elimination ross-validation (RFECV), a class offered by Sklearn, automatically determines the ideal number of features to keep. Although, it minimizes the complexity of a model by selecting important features and eliminating the less relevant ones. This evaluation process gradually excludes each of these less important features until it reaches the optimal number required to ensure high performance. And, finally, the experimental result shows that the optimal number of features is ‘4’, from where it gradually increases as displayed in Fig. 11.

Fig. 11

RFE plot design between cross-validation scores and number of features using ROC curves.

RFE plot design between cross-validation scores and number of features using ROC curves.

3.2.3Grid Search Cross-validation (GSCV)

Accordingly, the raw data will be splitted into two categories (i.e. train set and test set), before going for training into a model. Cross-validation (CV) subsequently divides the train data into the train data and the validation data. The model’s performance will be recorded at each iteration, and at the conclusion, the average of all the performances will be provided. As a result, performing a lengthy process. Thus, to analyze the optimum hyperparameters, GridSearch, and CV require a significantly aggregated amount of time. GSCV from Sklearn will let us input our preferred estimator, a grid of parameters, and the number of cross-validation folds. To improve model performance, it applies the Grid Search technique to identify the best hyperparameters. For that, GSCV uses GridSearchCv (), containing information about the estimator, param_grid, scoring, CV, and n_jobs as –1 (representing all available computing power for execution). In this work, we have implemented the GSCV technique to find the best parameters for both classifiers such as LR and RF, while using PCA and RFE.

Find the best hyperparameters

However, choosing a good model implies a successful evaluation of the data in the medical field. Thus, to measure the model performance, a performance metric has been drawn between the number of components in PCA and model accuracy as in Fig. 12.

Fig. 12

Depicting the ROC curve between the numbers of components in PCA versus model accuracy/training time to find the best hyperparameters.

Depicting the ROC curve between the numbers of components in PCA versus model accuracy/training time to find the best hyperparameters.

Besides that, to map the input variables to the target variables, it is necessary to choose the optimal parameters. Therefore, identifying the optimum hyperparameters would enable us to create the model that performs the best and regulates the learning process during the training phase.

3.2.4Model measures

3.2.4.1. Performance measure The performance for all models can be measured using a confusion matrix (CM) which classifies the predictions based on how closely they correspond to a true value. However, it summarizes the performance of the classification algorithm in their class either ‘0’ or ‘1’, in the form of Table 2. Four major parameters comprise the CM, which is used to provide the classifier’s performance as follows:

Table 2

Representation of CM

Predicted Label
Actual LabelB (0)M (1)
  B (0)TNFP
  M (1)FNTP

  • TP (True Positive), the number of patients whose ‘M’ nodes have been correctly identified as having BC is represented by this value.

  • TN (True Negative), is the proportion of correctly identified healthy patients.

  • FP (False Positives) are patients who were incorrectly diagnosed as having an illness when they were healthy.

  • False Negatives, or FNs, are patients who were mistakenly classified as healthy while they had BC.

3.2.4.2. Performance metrics Accuracy, precision, recall, and F1 score are the algorithm performance metrics as given in Equations 10, 11, 12, 13, and 14. These have been derived from the previously discussed TP, TN, FP, and FN.

  • Accuracy (A), the proportion of patients who were correctly classified as all patients serve as a measure of an algorithm’s accuracy as follows:

    (10)
    A=TP+TNTP+FP+TN+FN

  • Precision (P), is the percentage of positive values out of all projected positive cases or the positive predictive value.

    (11)
    P=TPTP+FP

  • Sensitivity (Sen), is the percentage of correctly identifying instances, also known as recall, or the TP rate (TPR).

    (12)
    Sen=TPTP+FN

  • Specificity (Spe), is the percentage of correctly identifying negative instances. The percentage of true negative cases that are correctly detected. FP rate is provided by (1 –specificity).

    (13)
    Spe=TNTP+FP

  • F1-score (F1 s), often called the F-score is calculated by the mean of P and Sen.

    (14)
    F1s=2*P*SenP+Sen

Although, for each test point, a classifier typically computes a prediction score, or “p,” which can usually also be regarded as a probability. Second, one selects a decision threshold, “T,” and predicts that all occurrences where p > T will result in 1 and all others will result in 0. T = 0.5 is an indication of such a threshold. There are numerous cases, nevertheless, where implementing T = 0.5 is not strictly necessary, and various T’s can produce better outcomes. Thus, ROC and PR curves have been designed in this article that compares all possible thresholds for a classifier.

3.2.5Model comparison

3.2.5.1. Logistic regression (LR) A classification model, to determine the relationship between dependent variables with one or more independent variables. The dependent variable, in this case, is a binary variable with data coded as 1 or 0. The basic idea behind this model is to predict the probability of class labels (Yi ∈ [0, 1]). It is calculated using an S-shaped function, namely the sigmoid function (Fig. 13) i.e. σ(k)=11+e-θTx=11+e-k , where - θTx as a linear function.

Fig. 13

Representation of Sigmoid function starting from ‘0’ to ‘1’.

Representation of Sigmoid function starting from ‘0’ to ‘1’.

Thus, the LR model finds the probability of malignant class (i.e. ‘1’) as P(Y=1|X)=11+e-θTx , and for benign class (i.e. ‘0’) as P (Y = 0|X) = 1- P (Y = 1|X). However, this model provides the result based on statistics that lie in between ‘0’ and ‘1’, instead of giving exact ‘0’ or ‘1’. For every instance of Xi (where i = 1, 2 . . .  30), the predicted output will be 1 if P (Xi)>0.5 and 0 if otherwise (Equation (15)).

(15)
P(Y)={0ifσ(k)<00.5ifσ(k)=01ifσ(k)>1

3.2.5.2. LR with PCA To accelerate training without significantly reducing the LR model’s capacity for prediction as compared to a model with all ‘p’ predictors [37]. It is very crucial to visualize how predictive the traits can be, particularly in the classification of tumours. Additionally, to increase the computational efficiency of fitting this model and decrease the dimensionality and multicollinearity. From Table 3, it has been found the best parameter and training scores for the LR model are ’C’: 10, ‘penalty’: ‘l2’, and 0.978 respectively. Fig. 14, displays a relation between the numbers of components in PCA versus model accuracy/training time using the ROC curve.

Table 3

Representation of LR-PCA model performance scores

No. of componentsBest parameter valueBest training ScorePCA_test_scorePCA_cv_ training_time
21000.9210.9730.009
410.9540.9820.010
6100.9650.9820.011
80.10.9670.9910.011
1010.9740.9910.013
1210.9690.9820.015
1410.9740.9820.017
1610.9780.9820.023
18100.9780.9820.024
20100.9760.9820.027
22100.9760.9820.027
24100.9780.9820.026
26100.9780.9820.031
28100.9780.9730.024
30100.9780.9730.025
Fig. 14

Depicting the ROC curve for the LR model between the number of components in PCA versus model accuracy/training time.

Depicting the ROC curve for the LR model between the number of components in PCA versus model accuracy/training time.

Data variability is captured by the first component alone to the extent of around 45 percent, by the second to the extent of about 20 percent, and so on. Together, the first 8 and 14 components account for roughly 93.15% and 98.44% of the data variability, respectively. Therefore, we would like to continue with 8 components in this project.

LR with PCA (8 components)

We now have the 8-component transformed dataset. To accomplish this, we must run PCA once again with n_components set to 8. Now, to plot ROC and PR curves to measure LR_PCA model performance between no_of_components in PCA (i.e. 8) and model accuracy score. At last, we found that our proposed model gives an accuracy of 99.1% (Fig. 15), which is better than the individual LR model.

Fig. 15

Depicting ROC curve for LR-PCA, representing eight components in PCA versus model accuracy/training time.

Depicting ROC curve for LR-PCA, representing eight components in PCA versus model accuracy/training time.

However, the transformed dataset comprises 8 components as compared to the original dataset’s 30 features. Only roughly 93.15% of the variability in the original dataset is preserved in the modified dataset. The two datasets’ corresponding values are entirely dissimilar. The original BC dataset contains certain variables that have a strong correlation with one or more of the other variables. There is no correlation between any variable in the converted dataset and any other variable. Also, Table 4 and Fig. 16 have been depicted here which describe the model performance for each threshold value (T) starting from 0.1 to 0.9.

Table 4

Adjusting thresholds for metrics

Threshold (T)AccuracySensitivitySpecificityPrecision
T=0.10.9121.0000.8590.811
T=0.20.9561.0000.9300.896
T=0.30.9560.9770.9440.913
T=0.40.9820.9770.9860.977
T=0.50.9910.9771.0001.000
T=0.60.9740.9301.0001.000
T=0.70.9560.8841.0001.000
T=0.80.9470.8601.0001.000
T=0.90.9300.8141.0001.000
Fig. 16

Representing confusion-matrix for LR-PCA model at different threshold values (where, T = 0.1 to 0.9).

Representing confusion-matrix for LR-PCA model at different threshold values (where, T = 0.1 to 0.9).

3.2.5.3. Random forest (RF) Another common ML algorithm, namely RF creates decision trees from data samples, then gets predictions from each one before choosing the best one. It’s an ensemble method that avoids overfitting by averaging the results rather than having a single decision tree. It predicts by averaging or averaging the output of various trees. It generates a forest at random, mixing numerous decision trees. Each tree attempts to estimate a ranking, which is referred to as a “vote,” resulting in a more accurate and consistent prediction.

Algorithm

  • 1. Select ‘k’ features randomly from ‘X’ features.

  • 2. Calculate the node ‘n’ from ‘k’ features using a best-fit algorithm.

  • 3. Split the node again

  • 4. Repeat the steps from 1 to 3 until reach one node.

  • 5. Continue steps from 1 to 4 for ‘N’ times to build a forest.

  • 6. Find the total votes for each class.

  • 7. Choose the class containing majority votes.

  • 8. The outcome is the predicted class

After successfully evaluating the RF model on the BC dataset, we have found the best parameters and training score as ’bootstrap’: True, ‘criterion’: ‘entropy’, ‘n_estimators’: 100 and 0.967 respectively.

3.2.5.4. Random Forest with PCA When features are periodic changes of other features, RF does not perform well. However, the same thing occurs when there are more features than samples, in which case our model will probably overfit the dataset. Nevertheless, dimensionality reduction is performed via PCA, which can lower the number of features that the RF must process. PCA aids in training efficiency and improves the RF model’s score as shown in Fig. 17.

Fig. 17

RF-PCA model performance score between no_of_components in PCA and model accuracy/Training Time.

RF-PCA model performance score between no_of_components in PCA and model accuracy/Training Time.

3.2.5.5. Random forest with RFE RFE is a feature selection method that chooses the most important features in a training dataset that are more crucial for predicting the target variable. The reason RFE is so well-liked is that it is simple to set up, straightforward to use, and efficient in identifying the features. Therefore, while implementing RFE, there are 2 major configuration options: the number of features to choose, from and using the RF technique to aid in feature selection. We compared the important scores obtained after running RF once and after running it again, using the initial RF as the first of the recursive runs in the RF-RFE strategy, to see whether RF-RFE outperformed RF alone.

3.2.6Model performance plot

And finally, a comparative analysis between all five models has been designed in our research, describing their performance scores concerning the accuracy, sensitivity, specificity, precision, and f1-score as shown in Fig. 18. This diagram has been drawn graphically using a bar plot, describing LR-PCA model results to a greater extent in all respect than others.

Fig. 18

Comparison of all model’s performance scores in terms of their accuracy, sensitivity, specificity, precision, and f1-score.

Comparison of all model’s performance scores in terms of their accuracy, sensitivity, specificity, precision, and f1-score.

4Results

In comparison to LR, RF, RF-PCA, and RF-RFE, the test accuracy utilizing LR-PCA has increased by 2%, 3%, and 4%, respectively. False results, both good and negative, have decreased as well. However, only roughly 93.15 percent and 98.44 percent of the variability in the original dataset is captured in the transformed dataset (the dataset produced by applying PCA for LR and RF). So, we discovered that the multicollinearity has been successfully reduced via LR-PCA. Furthermore, Table 6 demonstrates that our proposed model LR-PCA has outperformed all other four models.

Table 6

Comparison of all model performance

ModelAccuracySensitivitySpecificityPrecisionF1
LR0.9740.9770.9720.9550.966
LR-PCA(8 components)0.9910.9771.0001.0000.988
RF0.9650.9300.9860.9760.952
RF-PCA(8 components)0.9560.9530.9580.9320.943
RF-RFE(8 components)0.9560.9300.9720.9520.942

5Discussion

As we have discussed, Breast Cancer accounts for the second-largest number of deaths globally and mostly in African women. Thus, proper analysis and clinical decisions at the early stages can increase the survival rate of patients having a higher risk.

Figure 3, specifies the clear view of the proposed methodology that has been taken during this research work. However, WBCD has been collected from the Kaggle repository for our experiment, which contains 569 individual patient records (as shown in Fig. 4) and 31 unique features that help to implement ML models with PCA and RFE techniques.

Figures 5, 6, and 7 have resulted during the BC data analysis, where Fig. 5, detects the outliers using a box plot; Fig. 6, shows the JPD and MD for two random variables while classifying both M and F; and Fig. 7, represents the correlation matrix between all attributes to find the relationship between them.

Figure 8, describes the process for FIS, where the Fuzzifier block first turns crisp inputs into fuzzy sets, and the inference block then maps the input fuzzy sets into fuzzy output sets [38]. Furthermore, the type-reduction has been done which is reduced to type-1 fuzzy sets, and secondly, these fuzzy values are converted into crisp values at the Defuzzification stage.

Figures 10 and 11 give the results for two-dimensional reduction techniques such as PCA and RFE to determine the ideal number of features to keep for implementation with ML models (i.e. for LR and RF). PCA takes the linear combinations of the original predictors, where the x-axis represents the number of components and the y-axis for cumulative explained variances ranges from ‘0’ to ‘1’. Similarly, in Fig. 11, RFE represents the features having the lowest weight to remove, where the x-axis represents the number of components and the y-axis for cross-validation scores ranges from ‘0’ to ‘1’.

Figure 12, is used to find the best optimal hyperparameters for implementing with LR and RF models. Moreover, Table 2, shows the confusion–matrix, representing TP, FP, TN, and FN for both ‘B’ and ‘M’.

Table 3, displays the result of the LR-PCA model for 30 components with a difference of 2 and found the best training score is 0.978. Also, the ROC curve has been designed to represent the model accuracy concerning many components, as shown in Fig. 14. It has been found that the first 8 and 14 components account for roughly 93.15% and 98.44% of the data variability. Thus, we have taken 8 components to predict with LR and RF models.

Figure 16, describes the LR-PCA (8 components) model accuracy ranging from ‘0’ to ‘1’ in the form of ROC and PR curves and found with the highest accuracy of 99.1 %. Although, Table 4, shows the performance metrics for each threshold (T) value starting from ‘0.1’ to ‘0.9’ and found the greater accuracy value at T = 0.5. These ‘T’ values are plotted using a confusion matrix, containing actual and predicted results, as given in Fig. 15.

Similarly, Table 5, shows the result of the best training score of 96.7 % and PCA_test_score with 95.6% for the RF-PCA model. These results are plotted graphically using ROC curves as described in Fig. 17, where 17 (a) shows the best test score corresponding to 97 %, and 17 (b) shows the result for the model with the best training time closer to 0.8.

Table 5

Representation of RF-PCA model performance scores

No. of componentsBest Parameter valueBest Training scorePCA_test_ scorePCA_cv_training_time
BootstrapCriterionn_estimators
2Trueentropy1500.9190.9560.352
4TrueGini500.9410.9560.36
6FalseGini2000.9670.9560.422
8FalseGini2000.9650.9470.442
10Trueentropy2000.9630.9560.562
12Falseentropy1000.9670.9470.661
14Trueentropy200.9540.9640.581
16Falseentropy1000.9560.9560.553
18Falseentropy200.960.9380.548
20Falseentropy2000.9490.9380.550
22Falseentropy1000.9450.9380.528
24FalseGini200.9450.9290.557
26FalseGini200.9490.9290.612
28Trueentropy200.9430.9210.603
30Trueentropy500.9380.9380.639

However, it is necessary to perform performance metrics representing all models’ performance, to determine how well a model classifies the class data into either malignant or benign. Thus, we have designed a bar plot using Python programming language on the Jupyter Notebook platform to visualize all model performances, as shown in Fig. 18. And finally, we have found that LR-PCA has outperformed with 99.1 % accuracy, 97.7 % sensitivity, 100% specificity, and precision, and 98.8 % f1-score in corresponding to all other models, especially than single LR model.

6Conclusion

In this research work, our proposed hybrid machine learning-fuzzy and dimension reduction (MLF-DR) technique has been found most effective and successful model while implementing WBCD. However, MLF improves the decision-making capabilities of ML models and finds all possibilities between yes or no which means all possible values of abnormality in cells. From this experiment, we have found the application of the dimensionality reduction (DR) technique improves the performance of a model significantly as compared to an individual model. For this reason, two feature selection techniques (i.e. PCA and RFE) on LR and RF models are being proposed here. Besides that, at first, individual LR and RF models are implemented with all 30 features and provide accurate results of 97.4 % and 96.5% respectively. Meanwhile, the application of both PCA and RFE methods minimizes the linear correlation between data in WBCD to generate a matrix with low dimensionality. And finally, we have found the best optimal number of features for PCA as ‘8’ to make an input to LR and RF models to classify the class.

The following shows the key findings:

  • a) FIS uses fuzzy set values to map crisp inputs to fuzzy outputs during the fuzzification stage, a membership function to choose appropriate features by applying If-Then rules, and a defuzzification unit to convert fuzzy values into their respective crisp values, which further input to the data-preprocessing stage.

  • b) PCA enhances visualization, reduces the number of dimensions in the training dataset, and accelerates the performance of machine learning algorithms. Although eliminating the correlation between features, aids in resolving the overfitting problem.

  • c) RFE is popular because, as was previously said, it is simple to set up and use and because it is efficient at choosing the attributes in a training dataset that are particularly more suitable for predicting the model performance.

  • d) The LR and RF reduce the overfitting issues and the variance thus improving the model accuracy. These are capable of interpreting model coefficients as measures of the significance of a feature and involve zero hypotheses on the prevalence of classes in the feature space.

  • e) As compared to other models, LR-PCA effectively classifies the breast abnormal cells with either ‘M’ or ‘B’. This study’s results highlight the LR-PCA model’s strong advantage. Additionally, this strategy has good adaptation in the healthcare areas for binary classification problems regardless of features.

In this study, we conclude that our proposed MLF-DR methodology can reliably and successfully classify diagnosis field class values into benign or malignant with very less time and also with good accuracy scores. Soon, we will plan to collect the real-time BC dataset and implement a novel MLF-DR technique to find the best solution.

7Declarations

Competing interests

The authors declare that we have no conflict of interest relevant to this article.

Funding

The corresponding author states that there is no funding provided during this research work.

Availability of data and materials

The datasets presented in this study can be found online at Kaggle repositories and the link is https://www.kaggle.com/datasets/anacoder1/wisc-bc-data/download?datasetVersionNumber=1.

References

[1] 

Aaltonen L.A. , Salovaara R. , Kristo P. , Canzian F. , Hemminki A. , Peltomäki P. and Valkamo E. , Incidence of hereditary nonpolyposis colorectal cancer and the feasibility of molecular screening for the disease, New England Journal of Medicine 338: (21) ((1998) ), 1481–1487.

[2] 

Sung H. , Ferlay J. , Siegel R.L. , Laversanne M. , Soerjomataram I. , Jemal A. and Bray F. , Global cancer statistics: GLOBOCAN estimates incidence and mortality worldwide for 36 cancers in 185 countries, CA: A Cancer Journal for Clinicians 71: (3) ((2021) ), 209–249.

[3] 

Soerjomataram I. and Bray F. , Planning for tomorrow: Global cancer incidence and the role of prevention 2020–2070, Nature Reviews Clinical Oncology 18: (10) ((2021) ), 663–672.

[4] 

Montazeri M. , Montazeri M. , Montazeri M. and Beigzadeh A. , Machine learning models in breast cancer survival prediction, Technology and Health Care 24: (1) ((2016) ), 31–42.

[5] 

Prusty S. , Patnaik S. and Dash S.K. , SKCV: Stratified K-fold cross-validation on ML classifiers for predicting cervical cancer, Frontiers in Nanotechnology 4: ((2022) ), 972421.

[6] 

Cardoso F. , Spence D. , Mertz S. , Corneliussen-James D. , Sabelko K. , Gralow J. and Mayer M. , Global analysis of advanced/metastatic breast cancer: decade report (2005–2015), The Breast 39: ((2018) ), 131–138.

[7] 

Caswell-Jin J.L. , Plevritis S.K. , Tian L. , Cadham C.J. , Xu C. , Stout N.K. ,... and Kurian A.W. , Change in survival in metastatic breast cancer with treatment advances: meta-analysis and systematic review, JNCI Cancer Spectrum 2: (4) ((2018) ), pky062.

[8] 

Pilevarzadeh M. , Amirshahi M. , Afsargharehbagh R. , Rafiemanesh H. , Hashemi S.M. and Balouchi A. , The global prevalence of depression among breast cancer patients: a systematic review and meta-analysis, Breast Cancer Research and Treatment 176: (3) ((2019) ), 519–533.

[9] 

Xie J. , Liu R. , Luttrell IV J. and Zhang C. , Deep learning based analysis of histopathological images of breast cancer, Frontiers in Genetics 10: ((2019) ), 80.

[10] 

El Alaoui O. , Zerouaoui H. and Idri A. , Deep Stacked Ensemble for Breast Cancer Diagnosis, In World Conference on Information Systems and Technologies (pp. 435–445), (2022), Springer, Cham.

[11] 

Bandaru S.B. and Babu G. , A Review on Advanced Methodologies toIdentify the Breast Cancer Classification using the Deep LearningTechniques, International Journal of Computer Science & Network Security 22: (4) ((2022) ), 420–426.

[12] 

Mao N. , Yin P. , Wang Q. , Liu M. , Dong J. , Zhang X. and Hong N. , Added value of radiomics on mammography for breast cancer diagnosis: a feasibility study, Journal of the American College of Radiology 16: (4) ((2019) ), 485–491.

[13] 

Debelee T.G. , Schwenker F. , Ibenthal A. and Yohannes D. , Survey of deep learning in breast cancer image analysis, Evolving Systems 11: (1) ((2020) ), 143–163.

[14] 

Ak M.F. , A comparative analysis of breast cancer detection and diagnosis using data visualization and machine learning applications, In Healthcare (Vol. 8, No. 2, p. 111), (2020, April), MDPI.

[15] 

Gupta P. and Garg S. , Breast cancer prediction using varying parameters of machine learning models, Procedia Computer Science 171: ((2020) ), 593–601.

[16] 

Nourelahi M. , Zamani A. , Talei A. and Tahmasebi S. , A model to predict breast cancer survivability using logistic regression, Middle East Journal of Cancer 10: (2) ((2019) ), 132–138.

[17] 

Momenyan S. , Baghestani A.R. , Momenyan N. , Naseri P. and Akbari M.E. , Survival prediction of patients with breast cancer: comparisons of decision tree and logistic regression analysis, International Journal of Cancer Management 11: (7) ((2018) ).

[18] 

Prusty S. , Patnaik S. and Dash S.K. , Comparative analysis and prediction of coronary heart disease, Indonesian Journal of Electrical Engineering and Computer Science 27: (2) ((2022) ), 944–953.

[19] 

Htay T.T. and Maung S.S. , Early stage breast cancer detection system using glcm feature extraction and k-nearest neighbor (k-NN) on mammography image, In 2018 18th International Symposium on Communications and Information Technologies (ISCIT) (pp. 171–175), (2018, September), IEEE.

[20] 

Cherif W. , Optimization of K-NN algorithm by clustering and reliability coefficients: application to breast-cancer diagnosis, Procedia Computer Science 127: ((2018) ), 293–299.

[21] 

Aroef C. , Rivan Y. and Rustam Z. , Comparing random forest and support vector machines for breast cancer classification, TELKOMNIKA (Telecommunication Computing Electronics and Control) 18: (2) ((2020) ), 815–821.

[22] 

Witteveen A. , Nane G.F. , Vliegen I.M. , Siesling S. and IJzerman M.J. , Comparison of logistic regression and Bayesian networks for risk prediction of breast cancer recurrence, Medical Decision Making 38: (7) ((2018) ), 822–833.

[23] 

Wang S. , Wang Y. , Wang D. , Yin Y. , Wang Y. and Jin Y. , An improved random forest-based rule extraction method for breast cancer diagnosis, Applied Soft Computing 86: ((2020) ), 105941.

[24] 

Khandezamin Z. , Naderan M. and Rashti M.J. , Detection and classification of breast cancer using logistic regression feature selection and GMDH classifier, Journal of Biomedical Informatics 111: ((2020) ), 103591.

[25] 

Prusty S. , Dash S.K. and Patnaik S. , A Novel Transfer Learning Technique for Detecting Breast Cancer Mammograms Using VGG16 Bottleneck Feature, ECS Transactions 107: (1) ((2022) ), 733.

[26] 

Romualdo Cardoso S. , Gillespie A. , Haider S. and Fletcher O. , Functional annotation of breast cancer risk loci: current progress and future directions, British Journal of Cancer 126: (7) ((2022) ), 981–993.

[27] 

Chidambaram S. , Ganesh S.S. , Karthick A. , Jayagopal P. , Balachander B. and Manoharan S. , Diagnosing Breast Cancer Based on the Adaptive Neuro-Fuzzy Inference System, Computational and Mathematical Methods in Medicine 2022: ((2022) ).

[28] 

Mehmood M. , Ayub E. , Ahmad F. , Alruwaili M. , Alrowaili Z.A. , Alanazi S. and Alyas T. , Machine learning enabled early detection of breast cancer by structural analysis of mammograms, Comput Mater Contin 67: ((2021) ), 641–657.

[29] 

Rathnayake N. , Dang T.L. and Hoshino Y. , A novel optimization algorithm: Cascaded adaptive neuro-fuzzy inference system, International Journal of Fuzzy Systems 23: (7) ((2021) ), 1955–1971.

[30] 

Germashev I. , Dubovskaya V. , Losev A. and Popov I. , Fuzzy Inference of the Effectiveness Factors of The Computational Model for the Diagnosis of Breast Cancer, In 2021 3rd International Conference on Control Systems, Mathematical Modeling, Automation and Energy Efficiency (SUMMA) (pp. 528–533), (2021, November), IEEE.

[31] 

Tang L. , Wu D. , Wang H. , Chen M. and Xie J. , An adaptive fuzzy inference approach for color image steganography, Soft Computing 25: (16) ((2021) ), 10987–11004.

[32] 

Ullah A. , Ullah A. , Ahmad S. , Ahmad I. and Akgül A. , On solutions of fuzzy fractional order complex population dynamical model, Numerical Methods for Partial Differential Equations, (2020).

[33] 

Algehyne E.A. , Jibril M.L. , Algehainy N.A. , Alamri O.A. and Alzahrani A.K. , Fuzzy neural network expert system with an improved Gini index random forest-based feature importance measure algorithm for early diagnosis of breast cancer in Saudi Arabia, Big Data and Cognitive Computing 6: (1) ((2022) ), 13.

[34] 

Tabakov M. , Chlopowiec A. , Chlopowiec A. and Dlubak A. , Classification with Fuzzification Optimization Combining Fuzzy Information Systems and Type-2 Fuzzy Inference, Applied Sciences 11: (8) ((2021) ), 3484.

[35] 

Ahmad S. , Ullah A. , Akgül A. and Abdeljawad T. , Numerical analysis of fractional human liver model in fuzzy environment, Journal of Taibah University for Science 15: (1) ((2021) ), 840–851.

[36] 

Bahrami M. and Vali M. , Wise Feature Selection for Breast Cancer Detection from a Clinical Dataset, In 2021 28th National and 6th International Iranian Conference on Biomedical Engineering (ICBME) (pp. 160–164), (2021, November), IEEE.

[37] 

Okagbue H.I. , Adamu P.I. , Oguntunde P.E. , Obasi E. and Odetunmibi O.A. , Machine learning prediction of breast cancer survival using age, sex, length of stay, mode of diagnosis and location of cancer, Health and Technology 11: (4) ((2021) ), 887–893.

[38] 

Hanif A. , Butt A.I.K. , Ahmad S. , Din R.U. and Inc M. , A new fuzzy fractional order model of transmission of Covid-19 with quarantine class, The European Physical Journal Plus 136: (11) ((2021) ), 1–28.