ARCADE: A Prediction Method for Nominal Variables
Article type: Research Article
Authors: Costa, J.F.P. [a], [*] | Lerman, I.C. [b], [1]
Affiliations: [a] Departamento de Matemática Aplicada (Fac. de Ciências) LIACC, Univ. do Porto, Rua das Taipas, 135, 4050 Porto, Portugal | [b] IRISA-INRIA-Rennes, France
Correspondence: [*] Corresponding author. E-mail: [email protected].
Note: [1] E-mail: [email protected].
Abstract: The main problem considered in this paper is the binarization of categorical (nominal) attributes that have a very large number of values (204 in our application). A small number of relevant binary attributes are extracted from each initial attribute. Suppose that we want to binarize a categorical attribute $v$ with $L$ values, where $L$ is large or very large. The total number of binary attributes that can be extracted from $v$ is $2^{L-1}-1$, which is prohibitive when $L$ is large. Our idea is to select only those binary attributes that are predictive; these constitute a small fraction of all possible binary attributes. To do this, we group the $L$ values of a categorical attribute by means of a hierarchical clustering method, which requires defining a similarity between values that reflects their predictive power. By clustering the $L$ values into a small number $J$ of clusters, we define a new categorical attribute with only $J$ values. The hierarchical clustering method that we use, AVL, makes it possible to choose a significant value for $J$. We could then consider using all $2^{J-1}-1$ binary attributes associated with this new categorical attribute. However, because a hierarchical clustering method was used, the $J$ values are tree-structured. We take advantage of this and consider only about $2 \times J$ binary attributes. If $L$ is extremely large, complexity and statistical reasons may prevent us from applying a clustering algorithm directly. In this case, we start by “factorizing” $v$ into a pair $(v_1, v_2)$, each component having about $\sqrt{L}$ values. As a simple example, consider an attribute $v$ with only four values $m_1, m_2, m_3, m_4$. Obviously, there is no need to factorize the set of values of $v$ here, because it is very small. Nevertheless, for illustration purposes, $v$ could be decomposed (factorized) into two attributes with only two values each; the correspondence between the values of $v$ and $(v_1, v_2)$ would be $m_1 \mapsto (1, 1)$, $m_2 \mapsto (1, 2)$, $m_3 \mapsto (2, 1)$, $m_4 \mapsto (2, 2)$. We then apply the clustering method separately to the sets of values of $v_1$ and $v_2$, thereby defining a new synthetic pair $(\bar{v}_1, \bar{v}_2)$. Next, we “multiply” these new attributes to obtain another attribute v10 with $J_1 \times J_2$ values, where $J_1$ (resp. $J_2$) is the number of values of $\bar{v}_1$ (resp. $\bar{v}_2$). Finally, we apply a clustering to the values of v10 and proceed as above. The solution that we propose is independent of the number of classes and can be applied to various situations. The application of ARCADE to the protein secondary structure prediction problem demonstrates the validity of our approach.
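The counting argument and the factorization step in the abstract can be illustrated with a minimal Python sketch. It is not the ARCADE implementation: the function names (`count_binary_splits`, `factorize`, `tree_binary_attributes`) are hypothetical, the nested-tuple encoding of the cluster tree is an assumption made for illustration, and neither the AVL clustering nor the predictive similarity between values is implemented. The sketch only shows how the number of candidate binary attributes drops once the values are tree-structured, and how a value index can be mapped to a pair of coordinates.

```python
from math import isqrt


def count_binary_splits(n_values: int) -> int:
    """Number of ways to split n_values categories into two non-empty groups."""
    return 2 ** (n_values - 1) - 1


def factorize(index: int, n_values: int) -> tuple[int, int]:
    """Map a 0-based value index to a 1-based pair (v1, v2).

    Each coordinate takes at most ceil(sqrt(n_values)) levels.
    This row-major layout is an illustrative assumption.
    """
    side = isqrt(n_values - 1) + 1  # ceil(sqrt(n_values)) for n_values >= 1
    return index // side + 1, index % side + 1


def leaves(node) -> frozenset:
    """Set of leaf labels under a node; leaves are strings, inner nodes are tuples."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def tree_binary_attributes(tree: tuple) -> list[frozenset]:
    """One binary attribute ("value belongs to this subtree") per non-root node."""
    attributes = []
    stack = list(tree)  # start below the root, whose attribute would be trivial
    while stack:
        node = stack.pop()
        attributes.append(leaves(node))
        if not isinstance(node, str):
            stack.extend(node)
    return attributes


if __name__ == "__main__":
    # The four-value example: m1..m4 mapped to pairs with two levels each.
    print([factorize(i, 4) for i in range(4)])

    # A hypothetical tree over J = 5 clusters.
    tree = ((("c1", "c2"), "c3"), ("c4", "c5"))
    print(count_binary_splits(5), len(tree_binary_attributes(tree)))
```

For a binary tree with J leaves this yields 2J − 2 subtree attributes, which is in line with the "about 2 × J" figure in the abstract, whereas `count_binary_splits(204)` is of the order of 2^203.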
Keywords: Decision trees, Binarization, Complexity reduction, Categorical attributes, Hierarchical clustering
DOI: 10.3233/IDA-1998-2402
Journal: Intelligent Data Analysis, vol. 2, no. 4, pp. 265-286, 1998