KUKU

KU ScholarWorks

  • myKU
  • Email
  • Enroll & Pay
  • KU Directory
    • Login
    View Item 
    •   KU ScholarWorks
    • Office of the Provost
    • Provost Office Published Articles
    • View Item
    •   KU ScholarWorks
    • Office of the Provost
    • Provost Office Published Articles
    • View Item
    JavaScript is disabled for your browser. Some features of this site may not work without it.

    Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching

    Thumbnail
    View/Open
    vitter_2005.pdf (316.3Kb)
    Issue Date
    2005
    Author
    Grossi, Roberto
    Vitter, Jeffrey Scott
    Publisher
    Society for Industrial and Applied Mathematics
    Type
    Article
    Article Version
    Scholarly/refereed, publisher version
    Metadata
    Show full item record
    Abstract
    The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. The text T can be represented in n lg |Σ| bits by encoding each symbol with lg |Σ| bits. The goal is to support fast online queries for searching any string pattern P of m symbols, with T being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require Ω(n lg n) additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need Ω(n) memory words, each of Ω(lg n) bits. These indexes are larger than the text itself by a multiplicative factor of Ω(lg|Σ| n), which is significant when Σ is of constant size, such as in ascii or unicode. On the other hand, these indexes support fast searching, either in O(mlg |Σ|) time or in O(m+lg n) time, plus an output-sensitive cost O(occ) for listing the occ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast O(m/lg|Σ| n + lg | Σ| n) search time in the worst case, for any constant 0 < ≤ 1, using at most −1 + O(1) n lg |Σ| bits of storage. Our result thus presents for the first time an efficient index whose size is provably linear in the size of the text in the worst case, and for many scenarios, the space is actually sublinear in practice. As a concrete example, the compressed suffix array for a typical 100 MB ascii file can require 30–40 MB or less, while the raw suffix array requires 500 MB. Our theoretical bounds improve both time and space of previous indexing schemes. Listing the pattern occurrences introduces a sublogarithmic slowdown factor in the output-sensitive cost, giving O(occ lg | Σ| n) time as a result. When the patterns are sufficiently long, we can use auxiliary data structures in O(n lg |Σ|) bits to obtain a total search bound of O(m/lg|Σ| n + occ) time, which is optimal.
    Description
    AMS subject classifications. 68W05, 68Q25, 68P05, 68P10, 68P30 DOI. 10.1137/S0097539702402354
    URI
    http://hdl.handle.net/1808/7177
    DOI
    https://doi.org/10.1137/S0097539702402354
    Collections
    • Distinguished Professors Scholarly Works [918]
    • Electrical Engineering and Computer Science Scholarly Works [302]
    • Provost Office Published Articles [95]
    Citation
    R. Grossi and J. S. Vitter. “Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching,” SIAM Journal on Computing, 35(2), 2005, 378–407. An extended abstract appears in Proceedings of the 32nd Annual ACM Symposium on Theory of Computing (STOC ’00), Portland, OR, May 2000, 397–406. http://dx.doi.org/10.1137/S0097539702402354

    Items in KU ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.


    We want to hear from you! Please share your stories about how Open Access to this item benefits YOU.


    Contact KU ScholarWorks
    785-864-8983
    KU Libraries
    1425 Jayhawk Blvd
    Lawrence, KS 66045
    785-864-8983

    KU Libraries
    1425 Jayhawk Blvd
    Lawrence, KS 66045
    Image Credits
     

     

    Browse

    All of KU ScholarWorksCommunities & CollectionsThis Collection

    My Account

    LoginRegister

    Statistics

    View Usage Statistics

    Contact KU ScholarWorks
    785-864-8983
    KU Libraries
    1425 Jayhawk Blvd
    Lawrence, KS 66045
    785-864-8983

    KU Libraries
    1425 Jayhawk Blvd
    Lawrence, KS 66045
    Image Credits
     

     

    The University of Kansas
      Contact KU ScholarWorks
    Lawrence, KS | Maps
     
    • Academics
    • Admission
    • Alumni
    • Athletics
    • Campuses
    • Giving
    • Jobs

    The University of Kansas prohibits discrimination on the basis of race, color, ethnicity, religion, sex, national origin, age, ancestry, disability, status as a veteran, sexual orientation, marital status, parental status, gender identity, gender expression and genetic information in the University’s programs and activities. The following person has been designated to handle inquiries regarding the non-discrimination policies: Director of the Office of Institutional Opportunity and Access, IOA@ku.edu, 1246 W. Campus Road, Room 153A, Lawrence, KS, 66045, (785)864-6414, 711 TTY.

     Contact KU
    Lawrence, KS | Maps