Analysis and summarization of correlations in data cubes and its application in microarray data analysis

Chen, Chien-Yu; Hwang, Shien-Ching; Oyang, Yen-Jen

doi:10.3233/IDA-2005-9104

Article type: Research Article

Authors: Chen, Chien-Yu [*] | Hwang, Shien-Ching | Oyang, Yen-Jen

Affiliations: Department of Computer Science and Information Engineering, National Taiwan University, #1 Roosevelt Rd. Sec. 4, Taipei 106, Taiwan. E-mail: [email protected]; [email protected]; [email protected]

Correspondence: [*] Corresponding author. C.-Y. Chen is now with the Graduate School of Biotechnology and Bioinformatics, Yuan Ze University, Chung-Li 320, Taiwan. Tel.: +886 3 463 8800 ext. 2185; Fax: +886 3 463 8850; E-mail: [email protected].

Abstract: This paper presents a novel mechanism for analyzing and summarizing the statistical correlations among the attributes of a data cube. To support the analysis and summarization, it proposes a new measure of statistical significance. The main motivation for the new measure is an essential closure property, which is exploited in the summarization stage of the data mining process. Beyond the closure property, the proposed measure has two other important properties. First, it is more conservative than the well-known chi-square test of classical statistics and therefore inherits that test's statistical robustness. The chi-square test itself is not employed because it lacks the desired closure property, and using it may therefore cause a precision problem in the summarization process. Second, although the proposed measure is more conservative than the chi-square test, in most cases it yields a value almost equal to that of the z test, a conventional measure of statistical significance based on the normal distribution. Building on the closure property, the paper develops an algorithm that summarizes the results of the statistical analysis performed on the data cube. The closure property lets the proposed measure avoid the precision problem, but its conservative nature may lead to a recall problem in the data mining process. The paper also shows a possible application in bioinformatics: the proposed mechanism is applied to a microarray dataset to identify groups of genes with similar expression patterns in subspaces of the feature space.
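The abstract compares the proposed measure with two classical tests but does not define the measure itself, so it is not reproduced here. As background only, the following is a minimal sketch of those two baselines on a 2x2 contingency table: Pearson's chi-square statistic (without continuity correction) and the pooled two-proportion z statistic. The function names and the example counts are hypothetical and are not taken from the paper; how the paper applies these tests inside a data cube may differ.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction)
    for the 2x2 table [[a, b], [c, d]]. Illustrative helper only."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


def two_proportion_z(a, b, c, d):
    """Pooled two-proportion z statistic comparing a/(a+b) with c/(c+d)."""
    n1, n2 = a + b, c + d
    p1, p2 = a / n1, c / n2
    pooled = (a + c) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / std_err


def two_sided_p_from_z(z):
    """Two-sided p-value of a standard normal statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: 30 of 100 in one group versus 50 of 100 in the other.
chi2 = chi_square_2x2(30, 70, 50, 50)
z = two_proportion_z(30, 70, 50, 50)
p_value = two_sided_p_from_z(z)
print(f"chi-square = {chi2:.4f}, z = {z:.4f}, two-sided p = {p_value:.4f}")
```

For a 2x2 table the two baselines coincide: the chi-square statistic equals the square of the z statistic, so both give the same two-sided p-value. Any difference between these tests and the proposed measure comes from properties the abstract only names, such as closure and conservativeness.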

Keywords: correlation, data cube, statistical test, segmentation, summarization

Journal: Intelligent Data Analysis, vol. 9, no. 1, pp. 43-57, 2005

Received 10 December 2003

Accepted 5 April 2004

Published: 30 March 2005
