Compressed Index for Dictionary Matching with One Error

Hon, Wing-Kai; Lam, Tak-Wah; Shah, Rahul; Siu-Lung, Tam; Vitter, Jeffrey Scott

dc.contributor.author	Hon, Wing-Kai
dc.contributor.author	Lam, Tak-Wah
dc.contributor.author	Shah, Rahul
dc.contributor.author	Siu-Lung, Tam
dc.contributor.author	Vitter, Jeffrey Scott
dc.date.accessioned	2011-07-25T21:23:21Z
dc.date.available	2011-07-25T21:23:21Z
dc.date.issued	2011-03-29
dc.identifier.citation	W.-K. Hon, Tsung-Han Ku, R. Shah, S. V. Thankachan, and J. S. Vitter. “Compressed Dictionary Matching With One Error,” in preparation. An extended abstract appears in Proceedings of the 2011 IEEE Data Compression Conference (DCC ’11), Snowbird, UT, March 2011. http://dx.doi.org/10.1109/DCC.2011.18
dc.identifier.uri	http://hdl.handle.net/1808/7806
dc.description	DOI: 10.1109/DCC.2011.18
dc.description.abstract	Given a set D of d patterns of total length n, the dictionary matching problem is to index D such that for any query text T, we can locate the occurrences of any pattern within T efficiently. This problem can be solved in optimal O(\|T\|+occ) time by the classical AC automaton (Aho and Corasick, 1975) where occ denotes the number of occurrences. The space requirement is O(n) words. In the {approximate} dictionary matching problem with one error, we consider a substring of T[i..j] an occurrence of P whenever the edit distance between T[i..j] and P is at most one. For this problem, the best known indexes are by Cole et al. (2004), which requires O(n+ dlog{d}) words of space and reports all occurrences in O(\|T\|log{d}log{log{d}}+occ) time, and by Ferragina et al. (1999), which requires O(n^{1+epsilon}) words of space and reports all occurrences in O(\|T\|loglog n + occ) time. Recently, there have been successes in compressing the dictionary matching index while keeping the query time optimal (Belazzougui, 2010, Hon et al., 2010). However, a compressed index for approximate dictionary matching problem is still open. In this paper, we propose the first such index which requires an optimal nH_k+O(n)+o(nlogsigma)-bit index space, where H_k denotes the kth-order empirical entropy of D, and sigma is the size of alphabet set from which all the characters in D and T are drawn. The query time of our index is O(σ\|T\|log3 n log log n + occ).
dc.language.iso	en_US
dc.publisher	Institute of Electrical and Electronics Engineers
dc.subject	Approximation algorithms
dc.subject	Automata
dc.subject	Computers
dc.subject	Dictionaries
dc.subject	Entropy
dc.subject	Pattern matching
dc.subject	Computational complexity
dc.subject	Data compression
dc.subject	Index
dc.subject	Query processing
dc.subject	String matching
dc.subject	AC automaton
dc.subject	O(n) word
dc.subject	Compressed dictionary matching
dc.subject	Kth order empirical entropy
dc.subject	Query text
dc.subject	Approximate matching
dc.subject	Compressed indexes
dc.subject	Dictionary matching
dc.subject	Suffix tree
dc.title	Compressed Index for Dictionary Matching with One Error
dc.type	Article
kusw.kuauthor	Vitter, Jeffrey Scott
kusw.kudepartment	Office of the Provost
kusw.oastatus	fullparticipation
dc.identifier.doi	10.1109/DCC.2011.18
kusw.oaversion	Scholarly/refereed, author accepted manuscript
kusw.oapolicy	This item meets KU Open Access policy criteria.
dc.rights.accessrights	openAccess

Files in this item

Name:: HLS08compressedindex.pdf
Size:: 175.4Kb
Format:: PDF

View/Open

This item appears in the following Collection(s)

The current literature on propensity score matching is missing imperative information for educational researchers regarding the practical implications of utilizing this method with limited sample sizes. The purpose of this ...

We study the coexistence of monetary and credit transactions in a model where exchange is decentralized. Agents belong to different locations which are informationally separated. The equilibrium mix of monetary and credit ...

PURPOSE:We investigated how overt visual attention and oculomotor control influence successful use of a visual feedback brain-computer interface (BCI) for accessing augmentative and alternative communication (AAC) devices ...

The University of Kansas prohibits discrimination on the basis of race, color, ethnicity, religion, sex, national origin, age, ancestry, disability, status as a veteran, sexual orientation, marital status, parental status, gender identity, gender expression and genetic information in the University’s programs and activities. The following person has been designated to handle inquiries regarding the non-discrimination policies: Director of the Office of Institutional Opportunity and Access, IOA@ku.edu, 1246 W. Campus Road, Room 153A, Lawrence, KS, 66045, (785)864-6414, 711 TTY.