The problems involved in searching very large text databases are discussed. It is shown that conventional techniques for searching current databases cannot be scaled up to larger ones, and that hardware must be built to search the database in parallel if reasonable search times are expected. The part of the search process requiring the highest bandwidth is scanning the database to detect instances of search terms. Hardware methods for this scan from the literature are examined, the problems with using them in large systems are discussed, and design criteria that a successful search architecture must meet are defined.
A new model for text searching, using a nondeterministic finite state automaton (NFSA) to control matching, is introduced. First the NFSA model itself is discussed, with examples showing how it can be used to search for a wide variety of textual patterns. Next, implementation of the NFSA Searcher in hardware is discussed. Partitioning the nondeterministic state table on the basis of pairwise compatibilities, and assigning blocks of states to interconnected sub-machines, is shown to allow the NFSA to be built with simple logic in a manner that lends itself to LSI implementation.
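As an illustration of the NFSA model described above, the following is a minimal software sketch, not the hardware design from the source. It tracks a set of simultaneously active states while scanning the text once. The function name `nfsa_search`, the `?` single-character wildcard, and the restriction to literal patterns are all assumptions made for this example.

```python
from typing import Iterator

WILDCARD = "?"  # hypothetical single-character "don't care" symbol


def nfsa_search(patterns: list[str], text: str) -> Iterator[tuple[int, str]]:
    """Yield (end_index, pattern) for every match of every pattern in text."""
    if any(not p for p in patterns):
        raise ValueError("patterns must be non-empty")
    # An active state is (pattern index, number of characters matched so far).
    active: set[tuple[int, int]] = set()
    for pos, ch in enumerate(text):
        # Nondeterminism: every pattern may also begin at this character,
        # so the start states are re-armed on every input symbol.
        candidates = active | {(p, 0) for p in range(len(patterns))}
        active = set()
        for p, k in candidates:
            expected = patterns[p][k]
            if expected == WILDCARD or expected == ch:
                if k + 1 == len(patterns[p]):
                    yield pos, patterns[p]  # final state reached: report a hit
                else:
                    active.add((p, k + 1))  # advance to the next state


# Several patterns are matched in one pass over the text.
hits = sorted(nfsa_search(["cat", "c?t", "at"], "a cat cut"))
```

Each character is examined once regardless of how many patterns are loaded; the cost moves into the number of active states, which is what a hardware realization would have to bound.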
It is critically important that the state table for a group of search patterns can be partitioned quickly. Methods for partitioning tables efficiently are developed and their performance is analyzed. Methods of detecting instances of higher-level search expressions from instances of their component patterns detected by the NFSA Searcher are also discussed. Finally, the configuration and performance of the search system as a function of user load and other parameters are discussed. By comparing the hardware required and response time afforded by the NFSA Searcher with that for an alternative implementation, it is seen that the NFSA Searcher is a significant advance in architecture for text search applications.
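The partitioning step can be pictured with a small sketch. The abstract does not define the compatibility relation or the partitioning methods, so everything here is an assumption: two states are taken to be compatible when they can never be active at the same time, the incompatible pairs are supplied as input, and a simple first-fit heuristic stands in for the methods developed in the source. The name `partition_states` is hypothetical.

```python
def partition_states(
    states: list[str], incompatible: set[frozenset[str]]
) -> list[set[str]]:
    """First-fit partition: each block holds only pairwise compatible states."""
    blocks: list[set[str]] = []
    for s in states:
        for block in blocks:
            # A state may join a block only if it conflicts with no member.
            if all(frozenset((s, t)) not in incompatible for t in block):
                block.add(s)
                break
        else:
            blocks.append({s})  # no existing block fits: open a new one
    return blocks


# Assumed example: A conflicts with B, and B conflicts with C.
conflicts = {frozenset(("A", "B")), frozenset(("B", "C"))}
blocks = partition_states(["A", "B", "C", "D"], conflicts)
```

Under these assumptions each block could be assigned to one sub-machine that holds at most one active state at a time. First-fit runs quickly but does not guarantee the minimum number of blocks.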