Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching

Grossi, Roberto; Vitter, Jeffrey Scott

dc.contributor.author	Grossi, Roberto
dc.contributor.author	Vitter, Jeffrey Scott
dc.date.accessioned	2015-11-20T20:50:24Z
dc.date.available	2015-11-20T20:50:24Z
dc.date.issued	2005
dc.identifier.citation	Grossi, Roberto, and Jeffrey Scott Vitter. "Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching (extended Abstract)." Proceedings of the Thirty-second Annual ACM Symposium on Theory of Computing - STOC '00 (2000). http://dx.doi.org/10.1137/S0097539702402354	en_US
dc.identifier.uri	http://hdl.handle.net/1808/18962
dc.description	This is the published version. Copyright 2005 Society for Industrial and Applied Mathematics	en_US
dc.description.abstract	The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet $\Sigma$. The text T can be represented in $n \lg |\Sigma|$ bits by encoding each symbol with $\lg |\Sigma|$ bits. The goal is to support fast online queries for searching any string pattern P of m symbols, with T being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require $\Omega(n \lg n)$ additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need $\Omega(n)$ memory words, each of $\Omega(\lg n)$ bits. These indexes are larger than the text itself by a multiplicative factor of $\Omega(\lg_{|\Sigma|} n)$, which is significant when $\Sigma$ is of constant size, such as in ASCII or Unicode. On the other hand, these indexes support fast searching, either in $O(m \lg |\Sigma|)$ time or in $O(m + \lg n)$ time, plus an output-sensitive cost $O(\mathit{occ})$ for listing the $\mathit{occ}$ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast $O(m / \lg_{|\Sigma|} n + \lg_{|\Sigma|}^\epsilon n)$ search time in the worst case, for any constant $0 < \epsilon \leq 1$, using at most $(\epsilon^{-1} + O(1)) \, n \lg |\Sigma|$ bits of storage. Our result thus presents for the first time an efficient index whose size is provably linear in the size of the text in the worst case, and for many scenarios, the space is actually sublinear in practice.
As a concrete example, the compressed suffix array for a typical 100 MB ASCII file can require 30--40 MB or less, while the raw suffix array requires 500 MB. Our theoretical bounds improve both time and space of previous indexing schemes. Listing the pattern occurrences introduces a sublogarithmic slowdown factor in the output-sensitive cost, giving $O(\mathit{occ} \, \lg_{|\Sigma|}^\epsilon n)$ time as a result. When the patterns are sufficiently long, we can use auxiliary data structures in $O(n \lg |\Sigma|)$ bits to obtain a total search bound of $O(m / \lg_{|\Sigma|} n + \mathit{occ})$ time, which is optimal.	en_US
dc.publisher	Society for Industrial and Applied Mathematics	en_US
dc.title	Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching	en_US
dc.type	Article
kusw.kuauthor	Vitter, Jeffrey
kusw.kudepartment	Electrical Engr & Comp Science	en_US
dc.identifier.doi	10.1137/S0097539702402354
kusw.oaversion	Scholarly/refereed, publisher version
kusw.oapolicy	This item does not meet KU Open Access policy criteria.
dc.rights.accessrights	openAccess
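
For orientation, the uncompressed baseline that the abstract compares against can be sketched in a few lines. The Python below is a minimal illustration under stated assumptions, not the paper's compressed structure: it builds a plain suffix array naively (by sorting suffixes) and answers a pattern query with two binary searches, which costs O(m lg n) character comparisons rather than the O(m + lg n) bound quoted for refined suffix-array search. The function names (build_suffix_array, find_occurrences, index_to_text_ratio) are hypothetical, chosen for this example, and the ratio helper assumes each suffix-array entry occupies exactly lg n bits.

```python
from math import log2


def build_suffix_array(text: str) -> list[int]:
    """Naive construction (an assumption for brevity): sort the start
    positions of all suffixes by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return all start positions of `pattern` in `text`, in increasing order.

    Suffixes that start with `pattern` form one contiguous run in the
    suffix array, so two binary searches locate the run's boundaries.
    """
    m = len(pattern)

    # Lower boundary: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper boundary: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


def index_to_text_ratio(n: int, sigma: int) -> float:
    """Size of a plain suffix array (n entries of lg n bits) divided by the
    size of the text (n symbols of lg sigma bits), i.e. lg n / lg sigma."""
    return log2(n) / log2(sigma)


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(sa)
    print(find_occurrences(text, sa, "ana"))
    print(index_to_text_ratio(2**24, 256))
```

With text = "banana", the suffix array is [5, 3, 1, 0, 4, 2] and the query for "ana" returns [1, 3]. For n = 2^24 symbols over a 256-symbol alphabet, the ratio helper returns 3.0, which is the multiplicative factor $\lg_{|\Sigma|} n$ named in the abstract.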

Files in this item

Name: Grossi_compressed_suffix2005.pdf
Size: 293.2Kb
Format: PDF
