Loading...
Thumbnail Image
Publication

Compressed Index for Dictionary Matching with One Error

Hon, Wing-Kai
Lam, Tak-Wah
Shah, Rahul
Siu-Lung, Tam
Vitter, Jeffrey Scott
Citations
Altmetric:
Abstract
Given a set D of d patterns of total length n, the dictionary matching problem is to index D such that for any query text T, we can locate the occurrences of any pattern within T efficiently. This problem can be solved in optimal O(|T|+occ) time by the classical AC automaton (Aho and Corasick, 1975) where occ denotes the number of occurrences. The space requirement is O(n) words. In the {approximate} dictionary matching problem with one error, we consider a substring of T[i..j] an occurrence of P whenever the edit distance between T[i..j] and P is at most one. For this problem, the best known indexes are by Cole et al. (2004), which requires O(n+ dlog{d}) words of space and reports all occurrences in O(|T|log{d}log{log{d}}+occ) time, and by Ferragina et al. (1999), which requires O(n^{1+epsilon}) words of space and reports all occurrences in O(|T|loglog n + occ) time. Recently, there have been successes in compressing the dictionary matching index while keeping the query time optimal (Belazzougui, 2010, Hon et al., 2010). However, a compressed index for approximate dictionary matching problem is still open. In this paper, we propose the first such index which requires an optimal nH_k+O(n)+o(nlogsigma)-bit index space, where H_k denotes the kth-order empirical entropy of D, and sigma is the size of alphabet set from which all the characters in D and T are drawn. The query time of our index is O(σ|T|log3 n log log n + occ).
Description
DOI: 10.1109/DCC.2011.18
Date
2011-03-29
Journal Title
Journal ISSN
Volume Title
Publisher
Institute of Electrical and Electronics Engineers
Research Projects
Organizational Units
Journal Issue
Keywords
Approximation algorithms, Automata, Computers, Dictionaries, Entropy, Pattern matching, Computational complexity, Data compression, Index, Query processing, String matching, AC automaton, O(n) word, Compressed dictionary matching, Kth order empirical entropy, Query text, Approximate matching, Compressed indexes, Dictionary matching, Suffix tree
Citation
W.-K. Hon, Tsung-Han Ku, R. Shah, S. V. Thankachan, and J. S. Vitter. “Compressed Dictionary Matching With One Error,” in preparation. An extended abstract appears in Proceedings of the 2011 IEEE Data Compression Conference (DCC ’11), Snowbird, UT, March 2011. http://dx.doi.org/10.1109/DCC.2011.18
Embedded videos