A system for the content-based querying of large databases
containing documents of different classes (texts, images, image
sequences etc.) is introduced.
Queries are formulated in
natural language (NL) and are evaluated for their semantic content. For
the document evaluation, a knowledge model consisting of a set of
domain-specific concept interpretation methods is constructed. Thus,
the semantics of both the query and the documents can be
interconnected, i.e. the retrieval process searches for a match on the
semantic level (not merely on the level of keywords or global
image properties) between the query and the document. Methods from
fuzzy set theory are used to find the matches. Furthermore, the
retrieval methods associate information from different document
classes.
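
To make the idea of matching by fuzzy set theory concrete, the following minimal sketch ranks documents by their degree of match to a query. It is not the system's actual retrieval module: it assumes the standard min/max/complement operators for fuzzy AND/OR/NOT, and the names `concept`, `f_and`, `f_or`, `f_not` and `rank` are hypothetical. The membership degrees are given as precomputed numbers here; in the system described above they would be produced by the domain-specific concept interpretation methods applied to the raw documents.

```python
from typing import Callable, Dict, List, Tuple

# A fuzzy predicate maps a document (here: concept name -> membership
# degree in [0, 1]) to a degree of match in [0, 1].
Document = Dict[str, float]
Predicate = Callable[[Document], float]


def concept(name: str) -> Predicate:
    """Degree to which a document satisfies a concept (0.0 if unknown).

    Assumption: degrees are precomputed; a real system would compute them
    from the raw data with a concept interpretation method.
    """
    return lambda doc: doc.get(name, 0.0)


def f_and(*terms: Predicate) -> Predicate:
    """Fuzzy conjunction using the minimum operator."""
    return lambda doc: min(t(doc) for t in terms)


def f_or(*terms: Predicate) -> Predicate:
    """Fuzzy disjunction using the maximum operator."""
    return lambda doc: max(t(doc) for t in terms)


def f_not(term: Predicate) -> Predicate:
    """Fuzzy negation using the standard complement 1 - x."""
    return lambda doc: 1.0 - term(doc)


def rank(query: Predicate, docs: Dict[str, Document]) -> List[Tuple[str, float]]:
    """Score every document against the query, best match first."""
    scored = [(doc_id, query(doc)) for doc_id, doc in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical example data: three documents with concept degrees.
docs = {
    "a": {"cloudy": 0.75, "rain": 0.25},
    "b": {"cloudy": 0.5, "rain": 0.0},
    "c": {"cloudy": 1.0, "rain": 0.75},
}

# Query: "cloudy AND NOT rain"
query = f_and(concept("cloudy"), f_not(concept("rain")))
ranking = rank(query, docs)
```

Because every document receives a graded score rather than a yes/no answer, partially matching documents remain in the result list and are simply ranked lower.
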
To avoid the loss of information
inherent in pre-indexing, documents need not be indexed; in principle,
every search may be performed on the raw data under a given query.
The system can therefore answer every query that can be expressed in
the semantic model. To achieve the high data rates necessary for
on-line analysis, dedicated VLSI search processors are being developed
along with a parallel high-throughput media server.
In the following, we outline the system architecture and detail
specific aspects of the two modules that together implement
natural language search: the natural language interface NatLink,
which performs the syntactic analysis and constructs a formal
semantic interpretation of the queries, and the subsequent fuzzy
retrieval module, which establishes an operational model for
concept-based NL interpretation.
This is a revised version of an earlier report.
I. Glöckner, S. Hartrumpf, H. Helbig and A. Knoll.
Content-Based Document Retrieval Using Natural Language.
In: Tagungsband zum Workshop "Virtuelle Wissensfabrik:
Neue Techniken der Mensch-Maschine-Interaktion, der
Strukturierung komplexer Informationen und deren Kommunikation in
kooperativen Umgebungen", Schloß Birlinghoven, Sankt Augustin,
23./24. Sept. 1999.