Content-Based Document Retrieval Using Natural Language

Abstract

We introduce a system for content-based querying of large databases that contain documents of different classes (texts, images, image sequences, etc.). Queries are formulated in natural language (NL) and evaluated for their semantic content. To evaluate the documents, a knowledge model consisting of a set of domain-specific concept interpretation methods is constructed. The semantics of the query and of the documents can thus be connected: the retrieval process searches for a match between query and document on the semantic level, not merely on the level of keywords or global image properties. Methods from fuzzy set theory are used to find the matches. Furthermore, the retrieval methods associate information from different document classes. To avoid the loss of information inherent in pre-indexing, documents need not be indexed; in principle, every search may be performed on the raw data under a given query. The system can therefore answer every query that can be expressed in the semantic model. To achieve the high data rates necessary for on-line analysis, dedicated VLSI search processors are being developed along with a parallel high-throughput media server.
In the sequel, we outline the system architecture and detail specific aspects of the two modules that together implement natural language search: the natural language interface NatLink, which performs the syntactic analysis and constructs a formal semantic interpretation of the queries, and the subsequent fuzzy retrieval module, which establishes an operational model for concept-based NL interpretation.
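To make the idea of fuzzy, index-free matching more concrete, the following is a minimal sketch and not the system's actual implementation. It assumes that each concept is a function mapping a raw document to a membership degree in [0, 1], and that a query combines concepts with the standard min/max/complement operators of fuzzy set theory. The concept names (`warm`, `cloudy`), the feature keys, and all thresholds are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical raw document: a bag of numeric features (no index is built).
Document = Dict[str, float]
# A concept interpretation method returns a membership degree in [0, 1].
Concept = Callable[[Document], float]


def ramp(x: float, lo: float, hi: float) -> float:
    """Piecewise-linear membership: 0 at or below lo, 1 at or above hi."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


# Illustrative concepts; the thresholds are assumptions, not from the paper.
def cloudy(doc: Document) -> float:
    return ramp(doc["cloud_cover_pct"], 20.0, 80.0)


def warm(doc: Document) -> float:
    return ramp(doc["temperature_c"], 15.0, 25.0)


# Standard fuzzy connectives (min, max, complement).
def fuzzy_and(*degrees: float) -> float:
    return min(degrees)


def fuzzy_or(*degrees: float) -> float:
    return max(degrees)


def fuzzy_not(degree: float) -> float:
    return 1.0 - degree


def rank(docs: Dict[str, Document], query: Concept) -> List[Tuple[str, float]]:
    """Evaluate the query on every raw document and sort by degree of match."""
    scored = [(name, query(doc)) for name, doc in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical query "warm and not cloudy", evaluated at query time.
def warm_and_clear(doc: Document) -> float:
    return fuzzy_and(warm(doc), fuzzy_not(cloudy(doc)))


docs = {
    "a": {"cloud_cover_pct": 10.0, "temperature_c": 30.0},
    "b": {"cloud_cover_pct": 50.0, "temperature_c": 20.0},
    "c": {"cloud_cover_pct": 90.0, "temperature_c": 10.0},
}
ranked = rank(docs, warm_and_clear)
```

Because the query is evaluated directly on the raw features, any query expressible with the available concepts can be answered without a precomputed index. The cost is a full pass over the data for each query, which is the motivation the abstract gives for dedicated search hardware.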

Comment

This is a revised version of an earlier report.

Reference

I. Glöckner, S. Hartrumpf, H. Helbig and A. Knoll
Content-Based Document Retrieval Using Natural Language
In Tagungsband zum Workshop "Virtuelle Wissensfabrik: Neue Techniken der Mensch-Maschine-Interaktion, der Strukturierung komplexer Informationen und deren Kommunikation in kooperativen Umgebungen", Schloß Birlinghoven, Sankt Augustin, 23./24. Sept. 1999.

Ingo Glöckner, Ingo.Gloeckner@FernUni-Hagen.DE