
Machine learning approach for corona virus disease extrapolation: A case study


Supervised and unsupervised machine learning methods are widely used in data mining and big data. Assessing coronavirus disease from COVID-19 health data has recently emerged as a potential application area for these methods. This study examines significant trends in unsupervised machine learning with K-Means clustering and its function and use in disease assessment. We apply structural risk minimization, under which several factors affect classification efficiency: changes in the training data (the characteristics of the input space), the natural environment, and the structure of the classifier and of the learning process. Addressing these three issues gives a broader view of trajectory-cluster prediction on coronavirus data, helps to control linear classification capability, and provides indications for each individual. K-Means clustering is an effective way to compute the inherent grouping of coronavirus data. A hyperplane is used to separate unknown variables in the database for the disease detection process. The proposed MapReduce programming model for K-Means maps the data with the help of the hyperplane, using distance-based nearest-neighbor classification to group the input patient records into subgroups. Linear regression and logistic regression on coronavirus data are used for estimation, and the tracing of disease records is tested experimentally.


In modern life, people suffer from a wide variety of diseases for which they routinely seek medical care. Medical professionals rely on an assortment of clinical tests to diagnose and treat these diseases. Clinical pathology is an important part of the study of the causes of disease and a major area of modern medicine and diagnosis. There are different kinds of pathology, including general medical pathology, anatomical pathology, dermatopathology, cytopathology, forensic pathology and neuropathology. Since pathology is an important part of the medical field, it will continue to grow in the near future. As new diseases emerge, innovative improvements are needed to diagnose, treat and classify them, COVID-19 among them. Coronavirus disease is an infectious disease caused by a recently discovered coronavirus. Most people infected with the virus experience mild to moderate respiratory illness and recover without requiring special treatment. In this setting, pathologists favor genetic-based laboratory testing and diagnosis. Statistics-based machine learning can help address the epidemic problem that humanity faces. It can be used to maintain a database of the people affected at each hospital and to estimate the number of people affected. It is therefore useful for preliminary investigation in hospitals, and it can inform health care organizations about the affected areas. Infectious diseases such as coronavirus are among the diseases that cause the most problems in society, affecting economic conditions, health, and mortality.

In this study, an extensive search was carried out to identify studies that apply more than one supervised machine learning method to disease prediction. Machine learning for predicting epidemic data offers statistical inference that is useful to society. The coronavirus is not a living organism but a particle of genetic material (RNA) encased in a protective layer of lipid (fat). When it is absorbed by cells of the ocular, nasal, or buccal mucosa, it alters their genetic code and turns them into aggressor and multiplier cells. Since the virus is not a living organism, it is not killed but decays on its own. The decay time depends on humidity, temperature and the type of material on which it lies. The coronavirus is very fragile; the only thing that protects it is its thin outer layer of fat. A good remedy is therefore any soap or detergent, because the foam cuts the fat; this is why hands should be rubbed for at least 20 seconds, to make plenty of foam. Once the fat layer dissolves, the virus particle breaks down on its own. Heat melts fat, which is why it is advisable to use water above 25 °C to wash hands, clothes, and everything else. In addition, warm water makes more foam, which makes it more useful. Alcohol, or any mixture with more than 65% alcohol, dissolves the outer lipid layer of the coronavirus. Any mixture of one part bleach to five parts water directly dissolves the virus protein and breaks it down from the inside. Hydrogen peroxide (oxygenated water) also helps, after soap, alcohol and chlorine, because peroxide dissolves the virus protein, but it must be used pure and it harms the skin. A virus is not a living organism like a bacterium: antibiotics cannot kill what is not alive, but the measures above quickly break down its structure. Used or unused clothing, sheets and cloth should never be shaken.
While the virus adheres to a surface it is very inert, and it disintegrates within 3 hours (fabric and porous materials), 4 hours (copper, because it is naturally antiseptic, and wood, because it removes all moisture and does not let the virus peel off), 24 hours (cardboard), 42 hours (metal) and 72 hours (plastic). However, if the surface is shaken or cleaned with a feather duster, the virus particles float in the air for up to 3 hours and can lodge in the nose. Virus particles remain very stable in cold conditions, whether outdoors or artificial, such as air conditioning in homes and cars. They also need moisture and, especially, darkness to stay stable. Therefore, dehumidified, dry, warm and bright environments degrade the virus faster. UV light on any object that may carry the virus breaks down the virus protein; for example, it is suitable for disinfecting and reusing a mask. Care is needed, because UV light also breaks down collagen, a protein in the skin, which can eventually lead to wrinkles and skin cancer. In the real world, lumbering elephants are exposed to the aggression of speeding midgets. Such data may offer information and evidence for making decisions, for instance by informing patients about their medical condition as recorded in the COVID-19 database.

COVID-19 patient data can be analyzed, extracted, interpreted and adapted to specific uses. Data mining is the process of analyzing large volumes of data stored in various databases, such as a data warehouse, to discover patterns that are previously unknown, valid, understandable, and useful. It uses classification and clustering methods to extract hidden patterns from very large databases. The benefits of data mining include faster retrieval of data or information, retrieval of knowledge from several databases, detection of hidden and previously undetected patterns, reduction of complexity, time savings, and so on. Data mining techniques collect relevant information from structured patient data, which then helps to achieve specific benefits. The purpose of a data mining effort is usually to create a descriptive or a predictive model. A descriptive model presents the main features of the data set in summarized form. The distinguishing feature of a predictive model is that it allows the data miner to estimate the unknown value of a specific target variable. Both predictive and descriptive goals can be achieved using a variety of data mining methods. Data mining and machine learning offer more general-purpose tools than earlier approaches, and they can be applied to high-dimensional inputs such as numeric values, text, and images. Clustering-based unsupervised learning includes classic clustering techniques such as K-Means. Unsupervised learning can extract important features from input data without additional data or guidance. These features can then be provided as input to a subsequent supervised learning stage, which allows more effective learning on intricate tasks than training on raw data. A typical instance is the extraction of features from X-ray images.
This capability is critical for scaling machine learning, because such features are otherwise hand-crafted by human domain experts in image processing, who are a rare, slow and very expensive resource and often have limited insight into machine learning outcomes. One of the most straightforward methods for feature extraction is the autoencoder. An autoencoder is a simple encoder-decoder architecture in which the encoder encodes the input into a compressed representation. The decoder then attempts to reconstruct the original input from the compressed representation. In this paper, supervised and unsupervised machine learning procedures are treated as prevalent methods in the field of data mining. Disease assessment using COVID-19 health data has recently shown the potential application area for these methods. This study classifies important trends in a variety of supervised machine learning procedures and their function and use for disease assessment.

2. Demographics of COVID-19 using Naïve Bayes

In the COVID-19 trial, several teams of computational investigators applied the Naïve Bayes classification method to a single cluster of patient data gathered from hundreds of patients with severe coronavirus disease. The COVID-19 assessment and methods challenge is a platform for crowdsourced studies that focus on emerging computational tools for solving biomedical problems, and such competitions are large and long-standing. Severe coronavirus disease presents a worthy challenge because the disease has no single genetic cause, which makes it hard to select treatments for patients suffering from breathing difficulty, persistent pain or pressure in the chest, bluish lips or face, and sudden confusion. Each team is given training data from patients that includes demographic information such as age and gender, as well as more complex data describing signaling protein pathways believed to play a role in the disease. Given a demographic patient record X to classify, the general approach is to output the class Ci whose probability of occurrence P(Ci|X) is maximum. To approximate P(Ci|X), the classifier naively assumes that the attributes of X are independent of each other given the class, hence the name Naïve Bayes. Once independence has been assumed, P(Ci|X) is computed by the following derivation:
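In standard notation, and assuming the attributes $A_j$ are conditionally independent given the class, the derivation can be sketched as follows (a reconstruction from the description in the text, not the authors' original typesetting):

```latex
\begin{aligned}
P(C_i \mid X) &= \frac{P(X \mid C_i)\, P(C_i)}{P(X)} \\
              &\propto P(X \mid C_i)\, P(C_i) \\
              &= P(C_i) \prod_{j} P(A_j = x_j \mid C_i)
\end{aligned}
```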


Here, the record X contains attributes Aj with values xj. The denominator P(X) is ignored because it is the same for all classes. The last line of the derivation is obtained by assuming independence between the attributes. For classification, the values of P(Aj=xj|Ci) are pre-calculated and stored for all possible attribute values and classes. At classification time, these stored probabilities are used to approximate P(Ci|X) as per the above derivation, and the class with the maximum probability of occurrence is output for each group or cluster.
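The pre-calculation and lookup described above can be illustrated with a minimal Python sketch. The records, the attribute values, the class names and the add-one (Laplace) smoothing are all assumptions made for illustration; the paper does not specify them, and its own code is written in R.

```python
from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    """Pre-calculate P(Ci) and the counts needed for P(Aj = xj | Ci)."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / n for c in class_counts}
    # value_counts[c][j][v] = number of class-c records whose attribute j equals v
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for record, c in zip(records, labels):
        for j, v in enumerate(record):
            value_counts[c][j][v] += 1
    # Number of distinct values seen for each attribute (used by the smoothing)
    n_values = [len({r[j] for r in records}) for j in range(len(records[0]))]

    def conditional(c, j, v):
        # Add-one smoothing is an assumption added here so that an unseen
        # attribute value does not force the whole product to zero.
        return (value_counts[c][j][v] + 1) / (class_counts[c] + n_values[j])

    return priors, conditional

def classify(record, priors, conditional):
    """Return the class Ci maximizing P(Ci) * prod_j P(Aj = xj | Ci)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for j, v in enumerate(record):
            score *= conditional(c, j, v)
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical demographic records: (age group, gender)
records = [("60+", "M"), ("60+", "F"), ("<60", "M"),
           ("<60", "F"), ("60+", "M"), ("<60", "F")]
labels = ["severe", "severe", "mild", "mild", "severe", "mild"]

priors, conditional = train_naive_bayes(records, labels)
predicted = classify(("60+", "F"), priors, conditional)
```

P(X) never appears in the code because, as noted above, it is the same for every class and does not change which class scores highest.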

3. Analysis of trajectory of COVID-19 data clustering network

Figure 1. COVID-19 patients trajectory clustering network.

Clustering is an effective way to uncover the inherent grouping of COVID-19 data and structure that is otherwise undetectable. With the spread of GPS devices, a growing number of movement features are recorded as trajectories. This method relates those trajectories to the spread of the disease.

With this technique, the moving objects are identified first. The similarities among them are then determined from their trajectories. Finally, the resulting clusters are checked for correctness. Patients with coronavirus infection are recognized by these means, and a cluster is formed for easy identification. The network diagram in Figure 1 illustrates the concept: the trajectory clustering network covers districts of Andhra Pradesh such as West Godavari, East Godavari and Krishna, each grouped in turn.

4. K-Means approach for COVID-19 unsupervised learning

K-Means clustering is a simple clustering approach. The number of clusters, k, must be chosen in advance, and the dataset may contain an enormous amount of data. The method iteratively computes k centroids and assigns each data point to the cluster of its nearest centroid. Unsupervised machine learning is the branch of machine learning that infers hidden structure from unlabeled data, that is, data whose observations carry no class or category labels. Unsupervised learning is typically used for exploratory data analysis, outlier detection, and pattern discovery.

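  • 1. Initialize the cluster centroids $\mu_1, \mu_2, \ldots, \mu_k$ randomly.

  • 2. Repeat until convergence:

    • (a) For every $i$, set: $c^{(i)} := \arg\min_{j} \lVert x^{(i)} - \mu_j \rVert^2$

    • (b) For each $j$, set: $\mu_j := \dfrac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}$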
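The steps above can be sketched in plain Python as follows. The sample points, the fixed random seed and the rule of keeping a centroid unchanged when its cluster is empty are assumptions made for illustration, not part of the paper.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize the centroids with k distinct data points
    centroids = rng.sample(points, k)
    assignments = None
    # Step 2: repeat until the assignments stop changing
    for _ in range(max_iterations):
        # (a) assign every point to its nearest centroid
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # (b) move every centroid to the mean of the points assigned to it
        for j in range(k):
            members = [p for p, c in zip(points, assignments) if c == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, assignments

# Hypothetical two-dimensional feature vectors forming two obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
```

The result depends on the initial centroids in general; for well-separated groups such as these, the same partition is reached from any starting pair.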

5. Structural risk minimization of COVID-19

Let us select a family of classifiers $\{F(x, w)\}$ and define a structure consisting of nested subsets of elements of the family: $S_1 \subset S_2 \subset S_3 \subset \cdots \subset S_n$. Defining such a structure ensures that the capacity $h_n$ of the subset of classifiers $S_n$ is less than the capacity $h_{n+1}$ of the subset $S_{n+1}$. Structural Risk Minimization amounts to finding the subset $S_{opt}$ for which the classifier $F(x, w^*)$ that minimizes the empirical risk within that subset yields the best overall generalization performance. Two problems arise in applying Structural Risk Minimization: first, how to select $S_{opt}$, and second, how to find a good structure. The first problem arises because we have no direct access to $E_{gen}$. In our demographic COVID-19 trials, we use the minimum of either $E_{test}$ or $E_{guarant}$ to select $S_{opt}$, and show that these two minima are very close. The designer must find the best compromise between two competing terms, $E_{train}$ and $\epsilon$. Reducing $h$ causes $\epsilon$ to decrease, but $E_{train}$ to increase. A good structure should be such that decreasing the dimension happens at the expense of the smallest possible increase in training error. We now examine several ways in which such a structure can be built.
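The trade-off between $E_{train}$ and $\epsilon$ can be illustrated with a small Python sketch that selects $S_{opt}$ as the subset with the smallest guaranteed risk $E_{train} + \epsilon$. The confidence term below uses the common textbook form of Vapnik's VC bound, which the text does not spell out, and the capacities and training errors are invented numbers; both are assumptions.

```python
import math

def vc_confidence(h, n, eta=0.05):
    """Confidence term epsilon for capacity h and n training examples.

    Assumed textbook form of the VC bound, holding with probability 1 - eta.
    """
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

def select_s_opt(train_errors, capacities, n, eta=0.05):
    """Return the index of the subset with the smallest E_train + epsilon."""
    guaranteed = [e + vc_confidence(h, n, eta)
                  for e, h in zip(train_errors, capacities)]
    best = min(range(len(guaranteed)), key=guaranteed.__getitem__)
    return best, guaranteed

# Hypothetical nested structure S1 ... S5: capacity grows, training error shrinks
capacities = [2, 5, 10, 20, 50]
train_errors = [0.30, 0.18, 0.10, 0.08, 0.07]

best, guaranteed = select_s_opt(train_errors, capacities, n=1000)
```

With these numbers the middle subset is selected: smaller subsets pay too much in training error, and larger ones pay too much in the confidence term.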


The objective of the hyperplane (Fredrick Jury, 2002) is to separate unknown variables in the COVID-19 database for the disease detection process. Each cluster of input patient data is taken from the input devices. The features of each input COVID-19 patient are measured for classification and stored in the local database. Database D already contains the records for which features were collected. When a user sends a request as input, the hyperplane method computes the distance between the request and each coronavirus disease record in the database. The record with the minimum distance to the input patient is then returned from the database. Hyperplanes are used to separate disjoint convex sets. Suppose that $C$ and $D$ are two convex sets that do not intersect, i.e., $C \cap D = \emptyset$. Here $C$ contains the collection of input patient records and $D$, the database, contains the collection of COVID-19 records. Then there exist $a \neq 0$ and $b$ such that $a^T x \le b$ for all $x \in C$ and $a^T x \ge b$ for all $x \in D$. In other words, the function $a^T x - b$ is non-positive on $C$ and non-negative on $D$. The hyperplane $\{x \mid a^T x = b\}$ is called a separating hyperplane for the sets $C$ and $D$, as discussed by Stephen Boyd (2004) in Convex Optimization, ISBN: 978-0-521-83378-3, Cambridge University Press.
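As a minimal sketch of the two ideas in this paragraph, the Python code below checks whether a given pair (a, b) separates two finite sets of records and looks up the database record nearest to a query. The feature vectors and the values of a and b are hypothetical, and the function names are introduced here only for illustration.

```python
import math

def dot(a, x):
    """Inner product a^T x."""
    return sum(ai * xi for ai, xi in zip(a, x))

def separates(a, b, set_c, set_d):
    """True if a^T x <= b for all x in C and a^T x >= b for all x in D."""
    return (all(dot(a, x) <= b for x in set_c)
            and all(dot(a, x) >= b for x in set_d))

def nearest_record(query, database):
    """Return the database record at minimum Euclidean distance from query."""
    def distance(record):
        return math.sqrt(sum((q - r) ** 2 for q, r in zip(query, record)))
    return min(database, key=distance)

# Hypothetical feature vectors: C holds input records, D holds database records
C = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
D = [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]

is_separating = separates((1.0, 1.0), 5.0, C, D)   # hyperplane x1 + x2 = 5
closest = nearest_record((4.2, 4.1), D)
```

Checking a candidate hyperplane is straightforward; finding one for arbitrary data is a separate optimization problem that this sketch does not attempt.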

7. Association of COVID-19 cluster data using MapReduce based hyperplane

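Table 1. COVID-19 dataset for India, state-wise

| S. No. | State / Union territory | Confirmed | Active | Recovered | Deceased |
| 1      | Andaman and Nicobar     | 33        | 18     | 15        | 0        |
| 2      | Andhra Pradesh          | 1332      | 1014   | 287       | 31       |
| 3      | Arunachal Pradesh       | 1         | 0      | 1         | 0        |
| 12     | Himachal Pradesh        | 40        | 14     | 25        | 1        |
| 13     | Jammu and Kashmir       | 581       | 381    | 192       | 8        |
| 18     | Madhya Pradesh          | 2561      | 1971   | 461       | 129      |
| 27     | Tamil Nadu              | 2162      | 925    | 1210      | 27       |
| 30     | Uttar Pradesh           | 2134      | 1585   | 510       | 39       |
| 32     | West Bengal             | 758       | 612    | 124       | 22       |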

The software works on the network of trajectories in parallel to find solutions for large volumes of COVID-19 data, which it processes using the MapReduce procedure. Hyperplane based MapReduce is a programming framework that allows distributed and parallel processing of large data sets in a distributed setting. The first step in COVID-19 data processing using MapReduce is the Mapper class. Here, the Record Reader processes each input COVID-19 patient record and generates the corresponding key-value pair. The Mapper saves this intermediate patient data to the local repository. The input split is the logical representation of the COVID-19 data: it represents a chunk of work that makes up a single map task in the Hyperplane based MapReduce program. The Record Reader interacts with the input split of COVID-19 patient data and converts the data it obtains into key-value pairs. The intermediate output generated by the mapper is fed to the reducer, which processes it and produces the final disease output, which is then stored via the hyperplane. The main component of a MapReduce job is the Driver class. It is responsible for setting up a MapReduce job to run in Hadoop. In it, we specify the names of the Mapper and Reducer classes, along with their data types and respective job names.
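The flow from records to key-value pairs to reduced output can be sketched without Hadoop in a few lines of Python. This is an in-memory illustration only: the record layout, the status labels and the function names are assumptions, and a real job would run the mapper and reducer on separate nodes.

```python
from collections import defaultdict

def mapper(record):
    """Turn one input patient record into (key, value) pairs."""
    state, status = record
    yield (state, status), 1

def reducer(key, values):
    """Combine all values that share a key into one output pair."""
    return key, sum(values)

def map_reduce(records, mapper, reducer):
    # Shuffle phase: group the intermediate values by key
    intermediate = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            intermediate[key].append(value)
    # Reduce phase: one call per distinct key
    return dict(reducer(key, values) for key, values in intermediate.items())

# Hypothetical patient records: (state, status)
records = [
    ("Andhra Pradesh", "active"),
    ("Andhra Pradesh", "recovered"),
    ("Tamil Nadu", "active"),
    ("Andhra Pradesh", "active"),
]
counts = map_reduce(records, mapper, reducer)
```

Here the role of the Driver class is played by the `map_reduce` function, which wires the mapper and reducer together.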

8. Understanding COVID-19 data analysis with machine learning

This section describes an approach to well-known business problems with the help of Libra, which can be used to perform machine learning operations on the Hadoop platform and to overcome some memory problems. Two important machine learning techniques are considered, namely linear regression and logistic regression. Regression is one of the most important machine learning techniques for finding the relationship between target variables and explanatory variables. We use it to estimate the target variables in numerical form. To understand the two types of regression, we first need to define the target variables and the explanatory (descriptive) variables. Target variables: the variables in the problem whose values are to be estimated are called “target variables.” Explanatory variables: the variables that help to estimate the values of the target variables are called “explanatory variables.”

A. Linear Regression: Regression refers to the machine learning method, and “linear” refers to a straight line. When we draw a graph of the variables in a given problem and the points are clustered around a straight line, those variables have a linear relationship. The main purpose of linear regression is to estimate and predict values based on historical information. Two types of variables, namely target variables and explanatory (descriptive) variables, take part in the regression and are the key factors in fitting it. Using the linear relationship, we can identify the effect of the explanatory variables on the target variables and how the target changes with them. The mathematical expression for regression is given by:
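A common way of writing this expression, given here as an assumed notation with $a$ for the intercept and $b$ for the slope, is:

```latex
y = a + b\,x
```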


Figure 2. Linear regression for confirmed cases.

Figure 3. Linear regression for active cases.

Further formulas are needed for the slope and the intercept of the regression line. The slope is given below:
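In the usual least-squares form, writing $b$ for the slope (an assumed symbol) and using the paired values $x$, $y$ and their count $N$, this is:

```latex
b = \frac{N \sum xy - \left(\sum x\right)\left(\sum y\right)}
         {N \sum x^{2} - \left(\sum x\right)^{2}}
```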


The intercept point of regression is given by:
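In the same least-squares form, writing $a$ for the intercept and $b$ for the slope of the line (both assumed symbols), with $N$ paired values $x$, $y$:

```latex
a = \frac{\sum y - b \sum x}{N}
```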


Figure 4. Linear regression for recovered cases.

Figure 5. Linear regression for deceased cases.

Here, x and y are the variables that make up the dataset, and N is the total number of values. For each x value in the data, the regression line gives an estimate of the corresponding y value.

The estimated y values are obtained by substituting the x values into the regression equation. The same procedure is followed to find the y value for a new x value.
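A short Python sketch of this procedure computes the slope and intercept from the sums over x and y, estimates y for a new x, and evaluates the mean square error (MSE) and its root (RMSE). The data points are invented for illustration and do not come from the paper's dataset.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept from the sums over x and y."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept

def predict(x, slope, intercept):
    """Estimate y for a given x from the fitted line."""
    return intercept + slope * x

def mean_square_error(xs, ys, slope, intercept):
    """Average squared difference between observed and estimated y."""
    return sum((y - predict(x, slope, intercept)) ** 2
               for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data lying exactly on the line y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
mse = mean_square_error(xs, ys, slope, intercept)
rmse = math.sqrt(mse)
new_y = predict(6, slope, intercept)
```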

Table 2

Statistical approaches: result values

Statistical approachConfirmedActiveRecoveredDeceased
MSE (Mean square error)64227611542311446861838532621.34206820.1711
RMSE (Root MSE)21043.7024838043.91862483.986756107.7077759

Procedure: This section deals with the implementation of linear regression. For large data sets, the data cannot be handled in the memory of a single machine. We therefore keep the same call() function summary for this model, but compute the regression in parallel with the help of a Mapper and a Reducer. Because the Hadoop computation nodes divide the database and process it in parts, this does not cause memory problems.

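```r
library(rmr2)

# X.index (the row-indexed design matrix stored in HDFS) and y are assumed
# to have been defined earlier.

# Reducer: add up the partial matrices emitted by the mappers
Sum = function(., YY) keyval(1, list(Reduce("+", YY)))

# The MapReduce job that produces Xty
Xty = values(from.dfs(mapreduce(
  input = X.index,
  # Mapper: compute and emit this chunk's contribution to Xty
  map = function(., Xi) {
    yi = y[Xi[, 1], ]   # rows of y matching the row ids in column 1
    Xi = Xi[, -1]       # drop the row-id column
    keyval(1, list(t(Xi) %*% yi))
  },
  # Reducer: sum the mapper outputs
  reduce = Sum,
  combine = TRUE)))[[1]]

# XtX is assumed to come from an analogous job whose mapper emits t(Xi) %*% Xi.
# For the output values, solve the normal equations:
solve(XtX, Xty)
```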

Figure 6. Logistic regression for confirmed cases.

Figure 7. Logistic regression for active cases.

B. Logistic Regression: In statistics, logistic regression (or logit regression) is a type of probabilistic classification model. It plays a key role in many fields, including medicine, where numerical measurements are used to assess the incidence of disease, here the presence of coronavirus. Logistic regression is implemented with the logistic function given below. The log odds are predicted with the following formula:
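In the usual notation, with coefficients $\beta_0, \ldots, \beta_n$ (assumed symbols) and explanatory variables $x_1, \ldots, x_n$, the log odds can be written as:

```latex
\operatorname{logit}(p) = \ln\frac{p}{1 - p}
  = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n
```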


Figure 8. Logistic regression for recovered cases.

Figure 9. Logistic regression for deceased cases.

The probability formula is as follows:
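Written in terms of the log odds $\operatorname{logit}(p)$, the standard logistic function gives the probability as:

```latex
p = \frac{e^{\operatorname{logit}(p)}}{1 + e^{\operatorname{logit}(p)}}
  = \frac{1}{1 + e^{-\operatorname{logit}(p)}}
```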


Logit(p) is a linear function of the explanatory (descriptive) variables x (x1, x2, x3, …, xn), which makes it similar to linear regression. The probability p obtained from it lies in the range 0 to 1. Based on this probability score, the outcome is assigned to a class: in most cases, a score greater than 0.5 is treated as 1, and otherwise as 0. In this sense, the model provides a classification boundary for classifying the outcome variable.

Algorithm: From the plot of the training dataset, we can say that the model generates a classification boundary. We define the logistic.regression MapReduce function with the following input parameters. Calling this function starts the MapReduce execution of logistic regression.

The function takes four parameters, input, iterations, dims, and alpha, as follows:

  • 1. Input: This is an input dataset.

  • 2. Iterations: This is the fixed number of iterations for calculating the gradient.

  • 3. Dims: This is the dimension of input variables.

  • 4. Alpha: This is the learning rate.

Let us see how to develop the logistic regression function.

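```r
# MapReduce job: logistic regression by gradient steps
# lr.map is assumed by analogy with lr.reduce (the map argument is left blank
# in the source); both must be defined where they can see plane and g.
logistic.regression = function(input, iterations, dims, alpha) {
  plane = t(rep(0, dims))               # initial plane: all zeros
  g = function(z) 1 / (1 + exp(-z))     # logistic (sigmoid) function
  for (i in 1:iterations) {
    gradient =
      values(
        from.dfs(
          mapreduce(
            input,
            map = lr.map,
            reduce = lr.reduce,
            combine = TRUE)))
    plane = plane + alpha * gradient    # move the plane along the gradient
  }
  plane
}
```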
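For readers without a Hadoop setup, the same loop can be sketched in plain Python. This is an illustrative analogue, not the paper's implementation: it uses labels in {0, 1}, computes the gradient over all rows in a single process instead of a MapReduce job, and runs on invented data whose first feature is a constant bias term.

```python
import math

def sigmoid(z):
    """Logistic function, mapping any real z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_regression(rows, iterations, dims, alpha):
    """rows holds (features, label) pairs; returns the fitted plane."""
    plane = [0.0] * dims
    for _ in range(iterations):
        gradient = [0.0] * dims
        # Map-like step: each row contributes a gradient term;
        # reduce-like step: the terms are summed per dimension.
        for features, label in rows:
            error = label - sigmoid(sum(w * x for w, x in zip(plane, features)))
            for j in range(dims):
                gradient[j] += error * features[j]
        plane = [w + alpha * g for w, g in zip(plane, gradient)]
    return plane

def classify(plane, features):
    """Apply the 0.5 threshold to the predicted probability."""
    score = sigmoid(sum(w * x for w, x in zip(plane, features)))
    return 1 if score > 0.5 else 0

# Hypothetical rows: (bias term, one measurement), label
rows = [((1.0, 0.5), 0), ((1.0, 1.0), 0), ((1.0, 1.5), 0),
        ((1.0, 3.0), 1), ((1.0, 3.5), 1), ((1.0, 4.0), 1)]

plane = logistic_regression(rows, iterations=2000, dims=2, alpha=0.1)
low_class = classify(plane, (1.0, 0.8))
high_class = classify(plane, (1.0, 3.8))
```

The four arguments mirror input, iterations, dims and alpha from the list above.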

9. Experimental result

The COVID-19 application uses linear regression and logistic regression to assess coronavirus data and to trace coronavirus disease records from various sources and campaigns. Consider a statistical technique for fitting a regression model to the provided dataset. Assume that we are given a number n of statistical units.

Its formula is as follows:
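Using the symbols $Y$, $x_i$ and $e_0$ from this section, and writing $\beta_0, \ldots, \beta_n$ for the regression coefficients (an assumed notation), one standard form of the model is:

```latex
Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + e_0
```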


Here, Y is the target variable (response variable), xi is an explanatory (descriptive) variable, and e0 is the error term, which can be considered as noise. We can use the “call” function to reduce the error and get an accurate estimate.


This paper applied structural risk minimization to the data for linear classification, then formed trajectory clusters, and applied the data prediction trial to each individual, supported by data exploration, outlier detection, and pattern detection. Finally, the three issues were addressed to develop a broader view of coronavirus in the trajectory data prediction trial, to control linear classification efficiency, and to provide indications for each individual.


