Article type: Research Article
Authors: Shani, Guy [a],[*] | Gunawardana, Asela [b] | Meek, Christopher [b]
Affiliations: [a] Department of Information Systems Engineering, Ben Gurion University, Beer Sheva, Israel | [b] Microsoft Research, Redmond, WA, USA
Correspondence: [*] Corresponding author: Department of Information Systems Engineering, Ben Gurion University, Beer Sheva, 84105, Israel. E-mail: [email protected].
Note: [1] Parts of this paper appeared in the International Conference on Data Mining, ICDM, 2009 [18].
Abstract: Segmentation, the task of splitting a long sequence of symbols into chunks, can reveal information about the nature of the sequence in a form that humans can understand. We focus on unsupervised segmentation, where the algorithm never sees examples of correct segmentations but must still discover meaningful segments. In this paper we present an unsupervised learning algorithm for segmenting sequences of symbols or categorical events. Our algorithm hierarchically builds a lexicon of segments and computes a maximum likelihood segmentation given the current lexicon. It is therefore best suited to hierarchical sequences, in which smaller segments are grouped into larger ones. Our probabilistic approach also allows us to propose conditional entropy as a measure of segmentation quality in the absence of labeled data. We compare our algorithm with two previous approaches from the unsupervised segmentation literature and show that it provides superior segmentation on a number of benchmarks. Our specific motivation for developing this general algorithm is to understand the behavior of software programs after deployment by analyzing their traces. We explain and motivate the importance of this problem, and present segmentation results from the interactions of a web service and its clients.
Keywords: Software analysis, sequence segmentation, probabilistic segmentation, multigram
DOI: 10.3233/IDA-2011-0479
Journal: Intelligent Data Analysis, vol. 15, no. 4, pp. 483-501, 2011
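
The maximum likelihood segmentation step mentioned in the abstract can be illustrated with a small dynamic program. This is a minimal sketch, not the authors' implementation: it assumes a fixed lexicon in which each segment is drawn independently with a known probability (a multigram-style assumption), and the function name `ml_segmentation` and the toy lexicon are hypothetical. The hierarchical construction of the lexicon is not shown.

```python
import math


def ml_segmentation(sequence, lexicon):
    """Return (segments, log_likelihood) for the most likely segmentation.

    sequence: a string or list of symbols.
    lexicon:  dict mapping a segment (tuple of symbols) to its probability.
              Assumption: segments are generated independently.
    """
    n = len(sequence)
    max_len = max(len(seg) for seg in lexicon)

    # best[i] is the best log-likelihood of any segmentation of sequence[:i];
    # back[i] is the start index of the last segment in that segmentation.
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prob = lexicon.get(tuple(sequence[start:end]))
            if prob and best[start] > -math.inf:
                score = best[start] + math.log(prob)
                if score > best[end]:
                    best[end] = score
                    back[end] = start

    if best[n] == -math.inf:
        raise ValueError("sequence cannot be segmented with this lexicon")

    # Follow the back-pointers from the end to recover the segments.
    segments = []
    end = n
    while end > 0:
        start = back[end]
        segments.append(tuple(sequence[start:end]))
        end = start
    segments.reverse()
    return segments, best[n]


# Hypothetical toy lexicon; the probabilities are made up for illustration.
toy_lexicon = {
    tuple("a"): 0.2,
    tuple("b"): 0.2,
    tuple("c"): 0.2,
    tuple("ab"): 0.3,
    tuple("abc"): 0.1,
}
segments, log_likelihood = ml_segmentation("abcab", toy_lexicon)
print(segments, log_likelihood)
```

With this toy lexicon the split into `abc` and `ab` has probability 0.1 × 0.3 = 0.03, which is higher than any split that uses more, shorter segments, so longer lexicon entries are preferred when they are probable enough.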