Abstract
We present a novel implementation of compressed su~x arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet E, where each symbol is encoded by lg ]E I bits. We show that compressed su1~x arrays use just
nHh + O(n lglg n~ lgl~ I n) bits, while retaining full text indexing functionalities, such as searching any pattern sequence of length m in O(mlg [E[ + polylog(n)) time. The term Hh < lg IEI denotes the hth-order empirical entropy of the text, which means that our index is nearly optimal in space apart from lower-order terms, achieving asymptotically the empirical entropy of the text (with a multiplicative constant 1). If the text is highly compressible so that H~ = o(1) and the alphabet size is small, we obtain a text index with
o(m) search time that requires only o(n) bits. Further results and tradeoffs are reported in the paper.
Citation
R. Grossi, A. Gupta, and J. S. Vitter. “High-Order Entropy-Compressed Text Indexes,” being partitioned with more recent work for submission to journals. An extended abstract appears in Proceedings of the 14th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA ’03), Baltimore, MD, January 2003, 841–850.