High-Order Entropy-Compressed Text Indexes

View/ Open
Issue Date
2003Author
Grossi, Roberto
Gupta, Ankur
Vitter, Jeffrey Scott
Publisher
Society for Industrial and Applied Mathematics Philadelphia
Type
Article
Article Version
Scholarly/refereed, publisher version
Published Version
http://portal.acm.org/citation.cfm?id=644250Metadata
Show full item recordAbstract
We present a novel implementation of compressed su~x arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet E, where each symbol is encoded by lg ]E I bits. We show that compressed su1~x arrays use just
nHh + O(n lglg n~ lgl~ I n) bits, while retaining full text indexing functionalities, such as searching any pattern sequence of length m in O(mlg [E[ + polylog(n)) time. The term Hh < lg IEI denotes the hth-order empirical entropy of the text, which means that our index is nearly optimal in space apart from lower-order terms, achieving asymptotically the empirical entropy of the text (with a multiplicative constant 1). If the text is highly compressible so that H~ = o(1) and the alphabet size is small, we obtain a text index with
o(m) search time that requires only o(n) bits. Further results and tradeoffs are reported in the paper.
Collections
Citation
R. Grossi, A. Gupta, and J. S. Vitter. “High-Order Entropy-Compressed Text Indexes,” being partitioned with more recent work for submission to journals. An extended abstract appears in Proceedings of the 14th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA ’03), Baltimore, MD, January 2003, 841–850.
Items in KU ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.
We want to hear from you! Please share your stories about how Open Access to this item benefits YOU.