dc.contributor.author | Hon, Wing-Kai | |
dc.contributor.author | Lam, Tak-Wah | |
dc.contributor.author | Shah, Rahul | |
dc.contributor.author | Siu-Lung, Tam | |
dc.contributor.author | Vitter, Jeffrey Scott | |
dc.date.accessioned | 2011-07-25T21:23:21Z | |
dc.date.available | 2011-07-25T21:23:21Z | |
dc.date.issued | 2011-03-29 | |
dc.identifier.citation | W.-K. Hon, Tsung-Han Ku, R. Shah, S. V. Thankachan, and J. S. Vitter. “Compressed Dictionary Matching With One Error,” in preparation. An extended abstract appears in Proceedings of the 2011 IEEE Data Compression Conference (DCC ’11), Snowbird, UT, March 2011. http://dx.doi.org/10.1109/DCC.2011.18 | |
dc.identifier.uri | http://hdl.handle.net/1808/7806 | |
dc.description | DOI: 10.1109/DCC.2011.18 | |
dc.description.abstract | Given a set D of d patterns of total length n, the dictionary matching problem is to index D such that for any query text T, we can locate the occurrences of any pattern within T efficiently. This problem can be solved in optimal O(|T|+occ) time by the classical AC automaton (Aho and Corasick, 1975) where occ denotes the number of occurrences. The space requirement is O(n) words. In the {approximate} dictionary matching problem with one error, we consider a substring of T[i..j] an occurrence of P whenever the edit distance between T[i..j] and P is at most one. For this problem, the best known indexes are by Cole et al. (2004), which requires O(n+ dlog{d}) words of space and reports all occurrences in O(|T|log{d}log{log{d}}+occ) time, and by Ferragina et al. (1999), which requires O(n^{1+epsilon}) words of space and reports all occurrences in O(|T|loglog n + occ) time. Recently, there have been successes in compressing the dictionary matching index while keeping the query time optimal (Belazzougui, 2010, Hon et al., 2010). However, a compressed index for approximate dictionary matching problem is still open. In this paper, we propose the first such index which requires an optimal nH_k+O(n)+o(nlogsigma)-bit index space, where H_k denotes the kth-order empirical entropy of D, and sigma is the size of alphabet set from which all the characters in D and T are drawn. The query time of our index is O(σ|T|log3 n log log n + occ). | |
dc.language.iso | en_US | |
dc.publisher | Institute of Electrical and Electronics Engineers | |
dc.subject | Approximation algorithms | |
dc.subject | Automata | |
dc.subject | Computers | |
dc.subject | Dictionaries | |
dc.subject | Entropy | |
dc.subject | Pattern matching | |
dc.subject | Computational complexity | |
dc.subject | Data compression | |
dc.subject | Index | |
dc.subject | Query processing | |
dc.subject | String matching | |
dc.subject | AC automaton | |
dc.subject | O(n) word | |
dc.subject | Compressed dictionary matching | |
dc.subject | Kth order empirical entropy | |
dc.subject | Query text | |
dc.subject | Approximate matching | |
dc.subject | Compressed indexes | |
dc.subject | Dictionary matching | |
dc.subject | Suffix tree | |
dc.title | Compressed Index for Dictionary Matching with One Error | |
dc.type | Article | |
kusw.kuauthor | Vitter, Jeffrey Scott | |
kusw.kudepartment | Office of the Provost | |
kusw.oastatus | fullparticipation | |
dc.identifier.doi | 10.1109/DCC.2011.18 | |
kusw.oaversion | Scholarly/refereed, author accepted manuscript | |
kusw.oapolicy | This item meets KU Open Access policy criteria. | |
dc.rights.accessrights | openAccess | |