
Mining association rules between stroke risk factors based on the Apriori algorithm

Abstract

BACKGROUND:

Stroke is a frequently-occurring disease and is a severe threat to human health.

OBJECTIVE:

We aimed to explore the associations between stroke risk factors.

METHODS:

Subjects aged 40 or above completed a unified questionnaire and underwent laboratory examinations. The Apriori algorithm was applied to find meaningful association rules. The selected association rules were divided into 8 groups by the number of former items. The rule with the highest confidence degree in each group was viewed as a meaningful rule.

RESULTS:

The training set used in the association analysis consisted of 957,325 samples, including 15,835 stroke patients (1.65%) and 941,490 subjects without stroke (98.35%). With the thresholds set for the Apriori algorithm, eight meaningful association rules were obtained between stroke and its high risk factors, and 25 meaningful association rules were obtained among the high risk factors themselves.

CONCLUSIONS:

Based on the Apriori algorithm, meaningful association rules between the high risk factors of stroke were found, suggesting that early intervention is a feasible way to reduce the risk of stroke.

1. Introduction

The frequent occurrence of stroke has inspired researchers to investigate potential risk factors. Understanding the association between stroke and risk factors is important in understanding, preventing, and controlling the occurrence of stroke. An understanding of the early risk factors that lead to stroke, such as known hypertension, diabetes mellitus, and other common diseases, can lead to more accurate predictions of incidents of stroke. At present, multiple factors are used to diagnose strokes. However, if we are able to correlate risk factors, we can simplify the process of disease diagnosis and reduce medical costs.

A number of studies have been conducted on the risk factors of stroke. Zhang et al. found that coronary heart disease, hypertension, smoking, and obesity were all related to stroke occurrence [1]. Khedr et al. investigated stroke risk factors in Egypt; their results showed that hypertension and diabetes mellitus are the chief risk factors of stroke [2]. Research has also indicated that atrial fibrillation increases the risk of stroke, and the occurrence rate of stroke is about 5% among atrial fibrillation patients [3, 4]. Knottnerus et al. found that family history of stroke is an independent risk factor for lacunar stroke [5].

Research on the risk factors of stroke has gradually deepened. Researchers have found that stroke risk factors include hypertension, diabetes mellitus, heart disease, dyslipidemia, smoking, excessive drinking, aging, and genetic factors [6, 7]. These risk factors can be divided into those that are fixed and those that are modifiable. Sex and age are fixed. Modifiable factors include biological factors (heart disease, hypertension, hyperlipidemia, diabetes mellitus, etc.) and behavioral factors (smoking, drinking, weight, depression, etc.).

This study analyzes the association between stroke and risk factors, to determine the combination of high risk factors that leads to strokes. Understanding how risk factors work in combination will improve prevention, early diagnosis, and early treatment.

This paper is organized as follows: in the next section, the research dataset and research methods are described in detail. In Section 3, the association rules are described, and an assessment of these rules is displayed. A discussion and conclusions are presented in Section 4.

2. Method

2.1 Participants

The data used in this study were obtained from a stroke screening and prevention investigation that took place in 2012 and were provided by the Chinese People’s Liberation Army General Hospital Clinical Data Center. The research subjects of the stroke screening database came from a cluster sample of 16 provinces, municipalities and autonomous regions throughout China (including Beijing, Tianjin, Henan Province, Heilongjiang Province, Xinjiang Uygur Autonomous Region, and Sichuan Province). Hospitals, community health service centers, and township health centers throughout China were used as intake points. Every intake unit selected one project-screening site in an urban community and one in a rural township. At each screening site, all residents who were 40 or older (born before December 31, 1973) were registered as screening subjects. Residents who had lived outside of the screening site for more than half a year were excluded. The sixth national population census was used to determine the ratio of the number of urban communities to the number of rural townships. The stroke screening database consists of 1,196,422 screening subjects. All survey groups used the same questionnaire, the “Assessment form of paroxysm of high risk group and stroke patient recurrence risk.” Trained and qualified investigators filled in the form following a face-to-face interview with each participant. The information collected included basic demographic information (age, sex, nationality, and district), stroke risk screening items (hypertension, diabetes mellitus, atrial fibrillation, dyslipidemia, obesity, smoking, stroke family history, etc.), and other preliminary screening items.

2.2 Data set characteristics

Our study considered whether or not each subject had any of the following 8 stroke-inducing factors: hypertension, atrial fibrillation, dyslipidemia, diabetes mellitus, smoking, lack of exercise, overweight, and family history of stroke.

The criteria for the 8 factors were:

  • Hypertension: blood pressure ≥ 140/90 mmHg or taking antihypertensive drugs.

  • Diabetes mellitus: diagnosed by a doctor with diabetes mellitus or taking drugs prescribed for treating diabetes.

  • Atrial fibrillation: medical history of atrial fibrillation.

  • Dyslipidemia: triglyceride ≥ 2.26 mmol/L, or total cholesterol ≥ 6.22 mmol/L, or low density lipoprotein cholesterol ≥ 4.14 mmol/L, or high density lipoprotein cholesterol < 1.04 mmol/L.

  • Overweight: body mass index (BMI) ≥ 26 kg/m².

  • Smoking: smoking one or more cigarettes per day for at least one year, including both past and current smoking.

  • Lack of exercise: exercising fewer than three times per week, with exercise time < 30 minutes/session; physically demanding work can be counted as physical training.

  • Family history of stroke: stroke in a direct or collateral relative within three generations of the patient.

Among 1,196,422 research subjects, 957,325 were chosen randomly as the training set for association analysis. The training set included 15,835 stroke patients (1.65% of the training set) and 941,490 without stroke (98.35%).

2.3 Association rules

In 1993, R. Agrawal of the International Business Machines Corporation (IBM) Almaden Research Center first presented association rules mining between item sets in a customer transaction database [8]. The Apriori algorithm that grew out of this work [9] has become the classic algorithm for association rules analysis. Researchers have conducted numerous follow-up studies on association rules mining, including algorithm optimization and the expansion of application areas. As an important topic in data mining, association rules mining has received extensive attention, and has been widely applied in physical activities, business affairs, financial areas, medicine, and other fields.

The association rule algorithm involves two steps: first, all high frequency item sets in the data are listed; then, association rules are generated from these high frequency item sets [9]. High frequency indicates that the frequency with which an item set occurs has reached or exceeded a certain level; this frequency is the support degree [10]. It is defined as follows:

(1)
Support(A=>B)=P(AB)

When the support degree of {A, B} is greater than or equal to the minimum support degree, then {A, B} is put in the high frequency item group.

The second step of the association algorithm is the generation of association rules. A rule derived from the high frequency item groups obtained in the first step is an association rule if it meets the minimum confidence degree. The confidence degree is defined as follows:

(2)
Confidence(A=>B)=P(B|A)
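The two definitions above can be illustrated with a minimal Python sketch. The toy records and the function names `support` and `confidence` are assumptions made for illustration; they are not taken from the study's data or software.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(A and B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical toy records: each set lists the items present for one subject.
records = [
    {"AF", "stroke"},
    {"AF"},
    {"AF", "hypertension", "stroke"},
    {"hypertension"},
    {"hypertension"},
]

print(support(records, {"AF", "stroke"}))       # 2 of 5 records contain both items
print(confidence(records, {"AF"}, {"stroke"}))  # 2 of the 3 AF records also have stroke
```

Here the support is computed over both A and B, as in Eq. (1); some software packages instead report the support of the former items alone, so the definition in use should be checked before comparing values.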

Common association rules algorithms include the Apriori, Generalized Rule Induction (GRI), and Frequent Pattern-tree (FP-tree) algorithms. The Apriori algorithm is the classical algorithm for mining the frequent item sets of Boolean association rules [11]. It uses a breadth-first search strategy to count the support of item sets, together with a candidate generation function that exploits the downward closure property of support: every subset of a frequent item set must itself be frequent. The Apriori algorithm takes a “bottom up” approach, in which frequent subsets are extended by one item at a time. This step is known as candidate generation, and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found. The core of the Apriori algorithm is the level-by-level recursion over frequent item sets, carried out in two phases. The resulting association rules belong to the single-level, single-dimension, Boolean class. The Apriori algorithm can be summarized as follows: first, determine all of the frequent item sets that satisfy the minimum support degree; then, from these frequent item sets, generate the strong association rules that meet both the minimum support degree and the minimum confidence degree. The expected rules can be acquired from the above-mentioned frequent item sets [12]. Each rule has only one item as its consequent.
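The level-wise search described above can be sketched in a few lines of Python. This is a simplified illustration of frequent item set generation only, not the implementation used in this study; the function name `apriori` and the toy records are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every item set meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def supp(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    level = {}
    for i in items:
        s = supp(frozenset([i]))
        if s >= min_support:
            level[frozenset([i])] = s

    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unite frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (downward closure): all (k-1)-subsets must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Test the surviving candidates against the data.
        level = {}
        for c in candidates:
            s = supp(c)
            if s >= min_support:
                level[c] = s
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical toy records.
records = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
result = apriori(records, min_support=0.6)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With these records, every single item and every pair reaches the 0.6 threshold, while {A, B, C} appears in only 2 of 5 records and is dropped, so the search stops at the third level.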

The Apriori algorithm, which is a basic algorithm of association rules, is one of the ten classic algorithms in the field of data mining. The Apriori algorithm can be used to mine potential relationships among data items in various fields. In our research, we introduce the Apriori algorithm to stroke datasets to discover possible associations between stroke risk factors.

2.4 Assessment metric of association rules

Occasionally, the support degree and confidence degree are insufficient for filtering out uninteresting rules. In these cases, a correlation measurement can be added to the association rules framework to resolve the issue. The result is a correlation rule, which is assessed not only by the support degree and confidence degree, but also by the correlation measurement between item sets A and B.

Researchers studied many assessment metrics even before frequent pattern mining was written about extensively. Some of these previously proposed evaluation measures are still used frequently.

Lift is a simple correlation measurement. If P(AB) = P(A)P(B), then the appearance of item set A is independent of the appearance of item set B; otherwise, item set A and item set B are dependent and correlated. The Lift of A and B can be computed by the following formula:

(3)
Lift(A,B)=P(AB)/(P(A)P(B))

If the result of the formula is less than one, then the appearance of A and B are negatively correlated, that is, when one is present the other is likely to be absent. If the result of the formula is greater than one, then A and B are positively correlated, that is if one is present, the other is likely to be present. If the result of the formula is equal to one, then A and B are independent, and there is no correlation between the two.
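As a worked illustration of the Lift formula, the short Python sketch below uses the equivalent form Lift(A,B) = Confidence(A=>B)/P(B), which follows from P(AB) = P(B|A)P(A). The function names are assumptions; the inputs are the training-set stroke prevalence (15,835 of 957,325) and the 7.22% confidence reported for the rule with atrial fibrillation as the only former item.

```python
def lift(p_ab, p_a, p_b):
    """Lift(A, B) = P(AB) / (P(A) * P(B))."""
    return p_ab / (p_a * p_b)


def lift_from_confidence(conf_a_to_b, p_b):
    """Equivalent form: Confidence(A => B) / P(B)."""
    return conf_a_to_b / p_b


p_stroke = 15_835 / 957_325  # stroke prevalence in the training set
value = lift_from_confidence(0.0722, p_stroke)
print(f"{value:.2f}")
```

The result is roughly 4.4, in line with the lift of 4.37 listed for this rule in Table 2; the small difference is presumably due to rounding of the published confidence.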

2.5 Association rules model establishment

For the association rules model of the stroke risk factors (“suffering hypertension or not” (dfHypertension = 1), “suffering atrial fibrillation or not” (dfAF = 1), “smoking or not” (dfSmoking = 1), “dyslipidemia or not” (dfLDL = 1), “suffering diabetes mellitus or not” (dfGlycuresis = 1), “physical training frequently or not” (dfSportsLack = 1), “overweight or not” (dfOverweight = 1), “have a family stroke history or not” (dfStrokeFamily = 1)), the 8 risk factors are set as former items of the association model, and “has past medical history of stroke” (dfStroke = 1) is set as the consequent. The maximum number of former items was set to 8, so that all possible combinations of the 8 risk factors were considered.

For the association rules models among risk factors, each of the above-mentioned 8 stroke risk factors is allowed to appear either as a former item or as the consequent. Likewise, the maximum number of former items is set to 8.

3. Results

3.1 All possible association rules between stroke and risk factors

In the initial experiment, in order to find all possible association rules, the minimum support degree was set to 0.0%, and the minimum confidence degree was set to 1.0%. After executing the association model, 256 association rules were obtained, including 70 rules whose confidence degree is greater than 50%.

Table 1 shows how these 256 association rules were organized. Among all the rules, the confidence degree is greatest, 86.03%, when the number of former items is 8. The maximum confidence degree within each group decreased as the number of former items decreased: with 7 former items, the maximum confidence degree was 85.66%; with 6, it was 83.86%; with 5, it was 78.48%; and with 4, it was 69.05%. When the number of former items was less than or equal to 3, the confidence degree of all rules was less than 50%, with a maximum of 42.87%.

Table 1

Association rules between risk factors and stroke: consequent = suffering from stroke

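| Number of former items | Number of frequent item sets (confidence > 1%) | Number of frequent item sets (confidence > 50%) | Maximum confidence (%) | Minimum confidence (%) | Mean confidence (%) |
|---|---|---|---|---|---|
| 8 | 1 | 1 | 86.03 | | |
| 7 | 8 | 8 | 85.66 | 67.19 | 78.87 |
| 6 | 28 | 25 | 83.86 | 40.79 | 66.49 |
| 5 | 56 | 29 | 78.48 | 19.05 | 49.69 |
| 4 | 70 | 7 | 69.05 | 10.81 | 32.68 |
| 3 | 56 | 0 | 42.87 | 7.43 | 18.25 |
| 2 | 28 | 0 | 17.82 | 4.01 | 9.33 |
| 1 | 8 | 0 | 7.22 | 2.36 | 4.56 |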

Most rules whose confidence degree is greater than 50% have 6 or 5 former items. These two groups contain 25 and 29 such rules, respectively, which together account for 77.14% (54 of 70) of the rules whose confidence degree is greater than 50%. The above experimental results indicate that the more stroke risk factors a person has, the higher his or her risk of stroke.

3.2 Meaningful association rules between stroke and risk factors

When the number of former items is fixed, the number of combinations formed by the 8 risk factors is fairly large. For instance, when the number of former items is equal to 5, the possible number of combinations is C(8,5) = 56. The ultimate purpose of our study is to identify the meaningful rules among this large set of association rules.
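The combination counts can be checked with a small Python sketch using the standard-library function `math.comb`; the variable name `counts` is chosen here for illustration.

```python
from math import comb

# Number of ways to choose k former items out of the 8 risk factors.
counts = {k: comb(8, k) for k in range(1, 9)}
for k in sorted(counts, reverse=True):
    print(k, counts[k])
```

The values for k = 8 down to k = 1 are 1, 8, 28, 56, 70, 56, 28 and 8, which match the "Number of frequent item sets (confidence > 1%)" column of Table 1.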

To find the meaningful association rules, we divided the 256 association rules into 8 groups by the number of former items, from 1 to 8. The confidence degree indicates the probability that the consequent (suffering stroke) occurs when the former items are present. Therefore, the rule with the highest confidence degree in each group is viewed as the meaningful rule. Altogether, 8 rules are shown in Table 2. These 8 rules mean that, for a fixed number of former items, a person whose screening result matches the listed combination of risk factors has a higher risk of suffering stroke than a person with any other combination of the same size.
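The selection step described above (group the rules by number of former items, then keep the rule with the highest confidence in each group) can be sketched as follows. The function name `best_rule_per_size` is an assumption, and two of the four example rules carry made-up confidence values, as marked in the comments.

```python
def best_rule_per_size(rules):
    """Map each number of former items to its highest-confidence rule."""
    best = {}
    for antecedent, conf in rules:
        k = len(antecedent)
        if k not in best or conf > best[k][1]:
            best[k] = (antecedent, conf)
    return best


rules = [
    (("atrial fibrillation",), 7.22),                            # as in Table 2
    (("smoking",), 3.00),                                        # hypothetical value
    (("atrial fibrillation", "family stroke history"), 17.82),   # as in Table 2
    (("hypertension", "smoking"), 6.00),                         # hypothetical value
]

best = best_rule_per_size(rules)
for k in sorted(best):
    print(k, best[k])
```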

Table 2

Meaningful association rules between risk factors and stroke: consequent = suffering from stroke

| Number of former items | Frequent item set with maximum confidence | Confidence (%) | Support (%) | Lift |
|---|---|---|---|---|
| 8 | Eight risk factors | 86.03 | 0.02 | 52.01 |
| 7 | Atrial fibrillation, diabetes mellitus, family stroke history, obvious overweight, smoking, lack of physical training, hypertension | 85.66 | 0.03 | 40.62 |
| 6 | Atrial fibrillation, diabetes mellitus, family stroke history, obvious overweight, smoking, lack of physical training | 83.86 | 0.02 | 50.70 |
| 5 | Atrial fibrillation, diabetes mellitus, family stroke history, obvious overweight, smoking | 78.48 | 0.03 | 47.45 |
| 4 | Atrial fibrillation, diabetes mellitus, family stroke history, smoking | 69.05 | 0.03 | 41.75 |
| 3 | Atrial fibrillation, diabetes mellitus, family stroke history | 42.87 | 0.05 | 25.92 |
| 2 | Atrial fibrillation, family stroke history | 17.82 | 0.08 | 10.78 |
| 1 | Atrial fibrillation | 7.22 | 0.22 | 4.37 |

Table 3

Meaningful association rules among risk factors (1): consequent = hypertension

| Sequence number | Former item 1 | Former item 2 | Former item 3 | Support (%) | Confidence (%) |
|---|---|---|---|---|---|
| 1 | Diabetes mellitus | Obvious overweight | Dyslipidemia | 0.805 | 71.62 |
| 2 | Family stroke history | Obvious overweight | Dyslipidemia | 0.829 | 68.26 |
| 3 | Diabetes mellitus | Obvious overweight | | 1.451 | 62.85 |
| 4 | Family stroke history | Obvious overweight | | 1.525 | 60.90 |
| 5 | Obvious overweight | Smoking | Dyslipidemia | 0.898 | 60.57 |
| 6 | Diabetes mellitus | Lack of physical training | Dyslipidemia | 0.873 | 59.39 |
| 7 | Obvious overweight | Lack of physical training | Dyslipidemia | 1.657 | 58.77 |
| 8 | Diabetes mellitus | Dyslipidemia | | 2.614 | 57.47 |
| 9 | Atrial fibrillation | Dyslipidemia | | 1.235 | 57.36 |
| 10 | Diabetes mellitus | Lack of physical training | | 1.617 | 53.27 |
| 11 | Family stroke history | Dyslipidemia | | 2.831 | 53.11 |
| 12 | Diabetes mellitus | | | 5.946 | 50.74 |

When the screening results contain relatively many risk factors, the individual has a high risk of suffering from stroke; when they contain relatively few, the individual has a lower risk. When the screening results contain 3–5 risk factors, the risk of suffering stroke is difficult to estimate by intuition; this is especially true for rules 7–8. For screening and prevention of stroke, rules 4–6 as listed in Table 2 are therefore more meaningful. Rule 4: Among individuals who possess 5 risk factors, those who have atrial fibrillation, diabetes mellitus, a smoking history, obvious overweight, and a family history of stroke have a 78.48% probability of stroke. Rule 5: Among individuals who possess 4 risk factors, those who have atrial fibrillation, diabetes mellitus, a smoking history, and a family history of stroke have a 69.05% probability of stroke. Rule 6: Among individuals who possess 3 risk factors, those who have a family history of stroke, atrial fibrillation, and diabetes mellitus have a 42.87% probability of stroke.

3.3 Association analysis among risk factors

In the initial experiment, the minimum support degree was set to 0.0%, and the minimum confidence degree was set to 1.0%. After executing the association model, 1,016 association rules were found, including 407 rules whose confidence degree was greater than 50%. The minimum support degree was then raised to 0.3% and the minimum confidence degree to 50%, which yielded 42 association rules. Finally, with the minimum support degree set to 0.5% and the minimum confidence degree kept at 50%, 25 association rules were found.
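The effect of tightening the thresholds can be sketched with a simple filter. The function name `filter_rules`, the rule names and three of the four example rules are hypothetical; only the first entry uses values from Table 3.

```python
def filter_rules(rules, min_support, min_confidence):
    """Keep rules whose support and confidence (both in %) reach the thresholds."""
    return [
        r for r in rules
        if r["support"] >= min_support and r["confidence"] >= min_confidence
    ]


rules = [
    {"name": "rule_a", "support": 0.805, "confidence": 71.62},  # as in Table 3, row 1
    {"name": "rule_b", "support": 0.40, "confidence": 65.00},   # hypothetical
    {"name": "rule_c", "support": 0.20, "confidence": 80.00},   # hypothetical
    {"name": "rule_d", "support": 2.00, "confidence": 30.00},   # hypothetical
]

loose = filter_rules(rules, min_support=0.3, min_confidence=50)
strict = filter_rules(rules, min_support=0.5, min_confidence=50)
print([r["name"] for r in loose])   # ['rule_a', 'rule_b']
print([r["name"] for r in strict])  # ['rule_a']
```

Raising the minimum support from 0.3% to 0.5% removes rules that are confident but rest on very few subjects, which is the same mechanism that reduced the rule set from 42 to 25 above.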

The 25 rules were split into two classes: 12 rules with hypertension as the consequent, shown in Table 3, and 13 rules with atrial fibrillation as the consequent, shown in Table 4. Within each table, the rules are ordered by confidence degree from largest to smallest.

Table 4

Meaningful association rules among risk factors (2): consequent = atrial fibrillation

| Sequence number | Former item 1 | Former item 2 | Former item 3 | Support (%) | Confidence (%) |
|---|---|---|---|---|---|
| 1 | Diabetes mellitus | Obvious overweight | Hypertension | 0.912 | 63.28 |
| 2 | Family stroke history | Obvious overweight | Hypertension | 0.929 | 60.94 |
| 3 | Obvious overweight | Lack of physical training | Hypertension | 1.602 | 60.77 |
| 4 | Diabetes mellitus | Lack of physical training | Hypertension | 0.861 | 60.24 |
| 5 | Family stroke history | Lack of physical training | Hypertension | 1.0003 | 58.16 |
| 6 | Family stroke history | Lack of physical training | | 2.137 | 56.96 |
| 7 | Obvious overweight | Smoking | Hypertension | 0.968 | 56.19 |
| 8 | Diabetes mellitus | Obvious overweight | | 1.451 | 55.54 |
| 9 | Family stroke history | Obvious overweight | | 1.526 | 54.37 |
| 10 | Diabetes mellitus | Lack of physical training | | 1.617 | 54.02 |
| 11 | Family stroke history | Hypertension | | 2.785 | 53.98 |
| 12 | Smoking | Lack of physical training | Hypertension | 1.138 | 52.10 |
| 13 | Lack of physical training | Hypertension | | 6.195 | 51.65 |

The results displayed in Table 3 show that individuals who simultaneously have diabetes mellitus, dyslipidemia, and obvious overweight, have a 71.62% probability of having high blood pressure. The second highest probability of suffering from hypertension is 68.26%, which occurs in those who have dyslipidemia, obvious overweight and a family history of stroke.

The results in Table 4 show that individuals who have diabetes mellitus, obvious overweight, and hypertension are the most likely to experience atrial fibrillation (63.28% probability). The group with the second largest probability of experiencing atrial fibrillation (60.94%) consists of those who have hypertension, obvious overweight, and a family stroke history.

4. Discussion

This population-based study of Chinese adults aged 40 years and over found that the most important stroke risk factor is atrial fibrillation, followed by diabetes mellitus, and family history of stroke.

In our training set, 1.65% of the respondents were stroke patients. According to the last association rule in Table 2, with atrial fibrillation as the former item and occurrence of stroke as the consequent, the confidence degree was 7.22%, which is about 4.4 times the prevalence rate in the whole training set. Research has shown that atrial fibrillation is one of the highest risk factors of stroke, and is especially common in the elderly. It has previously been shown that the occurrence rate of cerebral arterial thrombosis among atrial fibrillation patients is five to seven times higher than among people who do not have atrial fibrillation [13]. This is similar to the results of our study. Some scholars have proposed the following mechanism: because the hearts of atrial fibrillation patients may beat quickly and irregularly, blood in the atrium cannot be pumped out entirely, and the resulting stasis of blood in the atrium leads to thrombus formation. After the thrombus breaks off and travels to the brain through the bloodstream, it blocks a blood vessel and eventually causes a cerebral arterial thrombosis [14].

Changes in insulin and plasma lipoprotein levels and disordered glucose metabolism caused by diabetes mellitus may contribute to the formation of arteriosclerosis and thrombus, which is why diabetes mellitus has been found to be a high risk factor of stroke. One third of acute stroke patients suffer from diabetes mellitus; these patients tend to be young and female [15]. The results of this paper are in accordance with the above conclusion.

Table 5 shows a further analysis of the association rules among the risk factors. Among the 12 association rules with hypertension as the consequent, dyslipidemia appeared as a former item 8 times, and diabetes mellitus and obvious overweight each appeared 6 times. With atrial fibrillation as the consequent, there were 13 interesting association rules, in which hypertension appeared as a former item 9 times, lack of exercise 7 times, and obvious overweight 6 times. The above results indicate that dyslipidemia, diabetes mellitus, and obvious overweight are the factors most associated with hypertension, while hypertension, lack of exercise, and obvious overweight are the factors most associated with atrial fibrillation.

Table 5

Summary of meaningful association rules among risk factors

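| Former item | Meaning of former item | Consequent = hypertension: frequency | Consequent = hypertension: ranking | Consequent = atrial fibrillation: frequency | Consequent = atrial fibrillation: ranking |
|---|---|---|---|---|---|
| dfAF | Atrial fibrillation | 1 | 4 | | |
| dfHypertension | Hypertension | | | 9 | 1 |
| dfSmoking | Smoking | 1 | 4 | 2 | 6 |
| dfLDL | Dyslipidemia | 8 | 1 | 0 | 7 |
| dfGlycuresis | Diabetes mellitus | 6 | 2 | 4 | 5 |
| dfSportsLack | Lack of physical training | 3 | 3 | 7 | 2 |
| dfOverweight | Obvious overweight | 6 | 2 | 6 | 3 |
| dfStrokeFamily | Family stroke history | 3 | 3 | 5 | 4 |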
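The frequency column for hypertension as the consequent can be reproduced by counting how often each former item occurs across the 12 rules of Table 3. The Python sketch below uses `collections.Counter`; the rule list is transcribed from Table 3, and the variable names are chosen for illustration.

```python
from collections import Counter

# Former items of the 12 rules with hypertension as the consequent (Table 3).
table3_antecedents = [
    ("Diabetes mellitus", "Obvious overweight", "Dyslipidemia"),
    ("Family stroke history", "Obvious overweight", "Dyslipidemia"),
    ("Diabetes mellitus", "Obvious overweight"),
    ("Family stroke history", "Obvious overweight"),
    ("Obvious overweight", "Smoking", "Dyslipidemia"),
    ("Diabetes mellitus", "Lack of physical training", "Dyslipidemia"),
    ("Obvious overweight", "Lack of physical training", "Dyslipidemia"),
    ("Diabetes mellitus", "Dyslipidemia"),
    ("Atrial fibrillation", "Dyslipidemia"),
    ("Diabetes mellitus", "Lack of physical training"),
    ("Family stroke history", "Dyslipidemia"),
    ("Diabetes mellitus",),
]

frequency = Counter(item for rule in table3_antecedents for item in rule)
for item, count in frequency.most_common():
    print(item, count)
```

The counts (dyslipidemia 8; diabetes mellitus and obvious overweight 6 each; lack of physical training and family stroke history 3 each; smoking and atrial fibrillation 1 each) agree with the hypertension columns of Table 5.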

A large number of medical studies have verified that effective interventions exist for hypertension, and that among the prime risk factors it is the easiest to intervene on and the one for which intervention is most effective. If blood pressure can be lowered to a reasonable level, the morbidity of stroke will drop by at least 35%–38% [16].

Prevention of and intervention on high risk factors can greatly reduce stroke morbidity. The most direct means of prevention is a change of lifestyle, such as eating a healthy diet with little oil and salt, taking regular physical exercise, and reducing smoking and drinking.

Conflict of interest

None to report.

Acknowledgments

The authors would like to thank the project leader and research assistants who collected research data, and the people who participated in this study.

References

[1] Zhang J, Chen J, Wang Y, Huang Y. Risk factors in 174 patients with acute ischemic stroke. J Beijing Univ Tradit Chin Med. 2013; 36: 417-420.

[2] Khedr EM, Elfetoh NA, Al Attar G, Ahmed MA, Ali AM, Hamdy A, et al. Epidemiological Study and Risk Factors of Stroke in Assiut Governorate, Egypt: Community-Based Study. Neuroepidemiology. 2013; 40: 288-294.

[3] Kannel WB, Abbott RD, Savage DD, McNamara PM. Epidemiologic features of chronic atrial fibrillation: the Framingham study. N Engl J Med. 1982; 306: 1018-1022.

[4] Wolf PA, Abbott RD, Kannel WB. Atrial Fibrillation as an Independent Risk Factor for Stroke: The Framingham Study. Stroke. 1991; 22: 983-988.

[5] Knottnerus ILH, Gielen M, Jan L, Rouhl RPW, Julie S, Robert V, et al. Family history of stroke is an independent risk factor for lacunar stroke subtype with asymptomatic lacunar infarcts at younger ages. Stroke. 2011; 42: 1196-1200.

[6] Marshall IJ, Wang Y, McKevitt C, Rudd AG, Wolfe CD. Trends in Risk Factor Prevalence and Management Before First Stroke: Data From the South London Stroke Register 1995–2011. Stroke. 2013; 4: 3298-3304.

[7] Sacco RL, Kasner SE, Broderick JP, Caplan LR, Connors JJ, Culebras A, et al. An updated definition of stroke for the 21st century: a statement for healthcare professionals from the American Heart Association/American Stroke Association. Stroke. 2013; 44: 2064-2089.

[8] Agrawal R, Imielinski T, Swami A. Mining Association Rules between Sets of Items in Large Databases. In: Buneman P, Jajodia S, eds. SIGMOD 1993. Proceedings of the 1993 ACM SIGMOD Conference on Management of Data; 26-28 May 1993; Washington, DC, USA. New York: ACM Press; 1993; 207-216.

[9] Agrawal R, Srikant R. Fast algorithms for mining association rules. In: Bocca J, Jarke M, Zaniolo C, eds. VLDB94. Proceedings of the 20th International Conference on Very Large Databases; 12-15 September 1994; Santiago, Chile. San Francisco: Morgan Kaufmann Publishers; 1994; 487-499.

[10] Han J, Kamber M, Pei J. Data Mining: Concepts and Techniques. 3rd ed. San Francisco: Morgan Kaufmann; 2011.

[11] Minaei-Bidgoli B, Barmaki R, Nasiri M. Mining numerical association rules via multi-objective genetic algorithms. Inf Sci. 2013; 233: 15-24.

[12] Luo M. The Research of Decision in Colleges Based on Apriori Algorithm for Boolean Association Rules. Computer Knowledge and Technology. 2014; 10: 170-171.

[13] Ogawa S, Koretsune Y, Yasaka M, Aizawa Y, Atarashi H, Inoue H, et al. Antithrombotic therapy in atrial fibrillation: evaluation and positioning of new oral anticoagulant agents. Circ J. 2011; 75: 1539-1547.

[14] Lin L, Huang Y, Lin Y, Cai R. Research Status of rt-PA Thrombolytic Therapy and Anticoagulant Therapy in Stroke Patients with Atrial Fibrillation. Med Recapitulate. 2013; 19: 4112-4115.

[15] Piernik-Yoder B, Ketchum N. Rehabilitation Outcomes of Stroke Patients With and Without Diabetes. Arch Phys Med Rehabil. 2013; 8: 1508-1512.

[16] Faraco G, Iadecola C. Hypertension: A Harbinger of Stroke and Dementia. Hypertension. 2013; 62: 810-817.