Measuring the interestingness of discovered knowledge: A principled approach

Hilderman, Robert J.; Hamilton, Howard J.

doi:10.3233/IDA-2003-7406


Article type: Research Article

Authors: Hilderman, Robert J. | Hamilton, Howard J.

Affiliations: Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2.

Abstract: When mining a large database, the number of patterns discovered can easily exceed the capabilities of a human user to identify interesting results. To address this problem, various techniques have been suggested to reduce and/or order the patterns prior to presenting them to the user. In this paper, our focus is on ranking summaries generated from a single dataset, where attributes can be generalized in many different ways and to many levels of granularity according to taxonomic hierarchies. We theoretically and empirically evaluate twelve diversity measures used as heuristic measures of interestingness for ranking summaries generated from databases. These measures have previously been utilized in various disciplines, such as information theory, statistics, ecology, and economics. We describe five principles that any measure must satisfy to be considered useful for ranking summaries. Theoretical results show that the proposed principles define a partial order on the ranked summaries in most cases and a total order in some cases, and that seven of the twelve diversity measures satisfy all five principles. We empirically analyze the rank order of the summaries as determined by each of the twelve measures; the results show that the measures tend to rank the less complex summaries as most interesting. We also analyze the distribution of the index values generated by each of the twelve diversity measures. Empirical results, obtained using synthetic data, show that the distribution of index values tends to be highly skewed about the mean, median, and middle index values. Finally, we demonstrate a technique, based upon our principles, for visualizing the relative interestingness of summaries. The objective of this work is to gain some insight into the behaviour that can be expected from our principled approach in practice.

Keywords: data mining, diversity measures, theory of interestingness, statistics and probability, visualization


Journal: Intelligent Data Analysis, vol. 7, no. 4, pp. 347-382, 2003

Received 19 September 2002

Accepted 7 December 2002

Published: 27 August 2003
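
As a rough illustration of the idea in the abstract (scoring each summary with a diversity index and ranking summaries by that score), the following minimal Python sketch may help. It is not the authors' implementation. Shannon entropy and the Simpson concentration index are standard diversity measures from information theory and ecology, but whether they are among the twelve measures evaluated in the paper is an assumption here. The summary names, the tuple counts, and the helper `rank_by` are hypothetical, and the sort direction that corresponds to "most interesting" is left to the caller because the abstract does not state it.

```python
import math


def proportions(counts):
    """Convert tuple counts of a summary into proportions (assumes a positive total)."""
    total = sum(counts)
    return [c / total for c in counts]


def shannon(counts):
    """Shannon entropy in bits: higher means a more even distribution."""
    return -sum(p * math.log2(p) for p in proportions(counts) if p > 0)


def simpson(counts):
    """Simpson concentration index: higher means a more concentrated distribution."""
    return sum(p * p for p in proportions(counts))


def rank_by(measure, summaries, descending=True):
    """Return summary names ordered by the index value the measure assigns."""
    return sorted(summaries, key=lambda name: measure(summaries[name]),
                  reverse=descending)


# Hypothetical summaries: each list holds the tuple counts of one summary.
summaries = {
    "uniform_4": [25, 25, 25, 25],
    "skewed_4": [70, 10, 10, 10],
    "uniform_2": [50, 50],
}

if __name__ == "__main__":
    for name, counts in summaries.items():
        print(name, round(shannon(counts), 4), round(simpson(counts), 4))
    print(rank_by(simpson, summaries))
```

Different measures can order the same summaries differently, which is one reason a set of principles for comparing measures, as the abstract proposes, is useful.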

