I/O-efficient Compressed Text Indexes: From Theory to Practice (extended abstract)

View/ Open
Issue Date
2010Author
Chiu, Sheng-Yuan
Hon, Wing-Kai
Shah, Rahul
Vitter, Jeffrey Scott
Publisher
IEEE
Type
Article
Article Version
Scholarly/refereed, author accepted manuscript
Metadata
Show full item recordAbstract
Pattern matching on text data has been a fundamental field of Computer Science for nearly 40 years. Databases supporting
full-text indexing functionality on text data are now widely used by biologists. In the theoretical literature, the most popular
internal-memory index structures are the suffix trees and the suffix arrays, and the most popular external-memory index structure
is the string B-tree. However, the practical applicability of these indexes has been limited mainly because of their space
consumption and I/O issues. These structures use a lot more space (almost 20 to 50 times more) than the original text data
and are often disk-resident.
Ferragina and Manzini (2005) and Grossi and Vitter (2005) gave the first compressed text indexes with efficient query
times in the internal-memory model. Recently, Chien et al (2008) presented a compact text index in the external memory based
on the concept of Geometric Burrows-Wheeler Transform. They also presented lower bounds which suggested that it may be
hard to obtain a good index structure in the external memory.
In this paper, we investigate this issue from a practical point of view. On the positive side we show an external-memory
text indexing structure (based on R-trees and KD-trees) that saves space by about an order of magnitude as compared to the
standard String B-tree. While saving space, these structures also maintain a comparable I/O efficiency to that of String B-tree.
We also show various space vs I/O efficiency trade-offs for our structures.
Description
(c) 2010 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other users, including reprinting/ republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.
Collections
Citation
S.-Y. Chiu, W.-K. Hon, R. Shah, and J. S. Vitter. “I/O-efficient Compressed Text Indexes: From Theory to Practice,” in preparation. An extended abstract appears in Proceedings of the 2010 IEEE Data Compression Conference (DCC ’10), Snowbird, UT, March 2010. http://dx.doi.org/10.1109/DCC.2010.45
Items in KU ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.
We want to hear from you! Please share your stories about how Open Access to this item benefits YOU.