Efficient String Matching in Huffman Compressed Texts

Fredriksson, Kimmo; Tarhio, Jorma

Efficient String Matching in Huffman Compressed Texts

Article type: Research Article

Authors: Fredriksson, Kimmo | Tarhio, Jorma

Affiliations: Department of Computer Science, University of Joensuu, PO Box 111, 80101 Joensuu, Finland | Department of CSE, Helsinki University of Technology, PO Box 5400, 02015 Hut, Finland

Abstract: We present an efficient algorithm for scanning Huffman compressed texts. The algorithm parses the compressed text in O(n{[log2σ]/b}) time, where n is the size of the compressed text in bytes, σ is the size of the alphabet, and b is a user specified parameter. The method uses a variable size super-alphabet, with an average size of O({b/[H log2σ]}) characters, where H is the entropy of the text. Each super-character is processed in O(1) time. The algorithm uses O(2b) space and O(b2b) preprocessing time. The method can be easily augmented by auxiliary functions, which can e.g. decompress the text or perform pattern matching in the compressed text. We give three example functions: decoding the text in average time O(n{[log2σ]/[Hw]}), where w is the number of bits in a machine word; an Aho-Corasick dictionary matching algorithm, which works in time O(n{[log2σ]/b}+t), where t is the number of occurrences reported; and a shift-or string matching algorithm that works in time O(n{[log2σ]/b}⌈(m+s-1)/w⌉+t), where m is the length of the pattern and s depends on the encoding. The Aho-Corasick algorithm uses an automaton with variable length moves, i.e. it processes variable number of states at each step. The shift-or algorithm makes variable length shifts, effectively also processing variable number of states at each step. The number of states processed in O(1) time is O(b/[H log2σ]). The method can be applied to several other algorithms as well. Finally, we apply the methods to natural language taking the words (vocabulary) as the alphabet. This improves the compression ratio and allows more complex search problems to be solved efficiently. We conclude with some experimental results.

Keywords: Huffman compression, string matching, natural language

Journal: Fundamenta Informaticae, vol. 63, no. 1, pp. 1-16, 2004

Received 18 September 2004

Accepted 18 September 2004

Published: 2004

Price: EUR 27.50

North America

IOS Press, Inc.
6751 Tepper Drive
Clifton, VA 20124
USA

Tel: +1 703 830 6300
Fax: +1 703 830 2300
[email protected]

For editorial issues, like the status of your submitted paper or proposals, write to [email protected]

Europe

IOS Press
Nieuwe Hemweg 6B
1013 BG Amsterdam
The Netherlands

Tel: +31 20 688 3355
Fax: +31 20 687 0091
[email protected]

For editorial issues, permissions, book requests, submissions and proceedings, contact the Amsterdam office [email protected]

Asia

Inspirees International (China Office)
Ciyunsi Beili 207(CapitaLand), Bld 1, 7-901
100025, Beijing
China

Free service line: 400 661 8717
Fax: +86 10 8446 7947
[email protected]

For editorial issues, like the status of your submitted paper or proposals, write to [email protected]

如果您在出版方面需要帮助或有任何建, 件至: [email protected]

Share this:

North America

Europe

Asia