
Title: Experiments in Automatic Phrase Indexing For Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods
Authors: Fagan, Joel L
Keywords: computer science; technical report
Issue Date: Sep-1987
Publisher: Cornell University
Abstract: In order for an automatic information retrieval system to effectively retrieve documents related to a given subject area, the content of each document in the system's database must be represented accurately. This study examines the hypothesis that better representations of document content can be constructed if the content analysis method takes into consideration the syntactic structure of document and query texts. Two methods of automatically generating phrases for use as content indicators have been implemented and tested experimentally. The non-syntactic (or statistical) method is based on simple text characteristics such as word frequency and the proximity of words in text. The syntactic method uses augmented phrase structure rules (production rules) to selectively extract phrases from parse trees generated by an automatic syntactic analyzer. Experimental results show that the effect of non-syntactic phrase indexing is inconsistent. For the five collections tested, increases in average precision ranged from 22.7% to 2.2% over simple, single term indexing. The syntactic phrase indexing method was tested on two collections. Precision figures averaged over all test queries indicate that non-syntactic phrase indexing performs significantly better than syntactic phrase indexing for one collection, but that the difference is insignificant for the other collection. More detailed analysis of individual queries, however, indicates that the performance of both methods is highly variable, and that there is evidence that syntax-based indexing has certain benefits not available with the non-syntactic approach. Possible improvements of both methods of phrase indexing are considered. It is concluded that the prospects for improving the syntax-based approach to document indexing are better than for the non-syntactic approach. The PLNLP system was used for syntactic analysis of document and query texts, and for implementing the syntax-based phrase construction rules. 
The SMART information retrieval system was used for retrieval experimentation.
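
As an illustration of the non-syntactic idea described in the abstract (phrases chosen from word frequency and word proximity), here is a minimal sketch. It is not the report's procedure: the stopword list, the `window` and `min_df` parameters, and the function names `tokenize`, `candidate_phrases`, and `phrase_index` are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def tokenize(text):
    """Lowercase, strip surrounding punctuation, and drop stopwords."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]


def candidate_phrases(text, window=2):
    """Pair each word with the words that follow it within `window` positions."""
    tokens = tokenize(text)
    pairs = []
    for i, head in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != head:
                # Sort so that word order does not create distinct phrases.
                pairs.append(tuple(sorted((head, other))))
    return pairs


def phrase_index(docs, window=2, min_df=2):
    """Keep word pairs that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for text in docs:
        # Count each pair once per document (document frequency).
        doc_freq.update(set(candidate_phrases(text, window)))
    return {pair: n for pair, n in doc_freq.items() if n >= min_df}


docs = [
    "Automatic indexing of documents for information retrieval.",
    "Information retrieval with automatic phrase indexing.",
    "Parsing text with phrase structure rules.",
]
index = phrase_index(docs)
```

With these three toy documents, only the pairs ("automatic", "indexing") and ("information", "retrieval") appear in two documents and survive the frequency threshold. A syntactic method would instead select phrases from parse trees, which this sketch does not attempt.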
Appears in Collections: Computer Science Technical Reports

Files in This Item:

File         Size       Format
87-868.pdf   19.13 MB   Adobe PDF
87-868.ps    3.33 MB    PostScript


Items in eCommons are protected by copyright, with all rights reserved, unless otherwise indicated.


© 2014 Cornell University Library