

eCommons@Cornell


Please use this identifier to cite or link to this item: http://hdl.handle.net/1813/7281
Title: Term Weighting Revisited
Authors: Singhal, Amitabh
Keywords: computer science
technical report
Issue Date: Mar-1997
Publisher: Cornell University
Citation: http://techreports.library.cornell.edu:8081/Dienst/UI/1.0/Display/cul.cs/TR97-1626
Abstract: Term weighting is an essential part of modern information retrieval systems. Of the three main components of a term weighting strategy --- term frequency, inverse document frequency, and document length normalization --- the term frequency factor has recently been investigated by researchers. In this work, we study the inverse document frequency and document length normalization components of term weights. We observe that a document length normalization scheme that retrieves documents of all lengths with chances similar to their likelihood of relevance will outperform a scheme that retrieves documents with chances very different from their likelihood of relevance. We present pivoted normalization, a technique that can be used to modify normalization functions to reduce the gap between the relevance and retrieval probabilities. We present two new normalization functions --- pivoted unique normalization and pivoted byte size normalization --- both of which yield significant improvements over the previous state-of-the-art normalization functions.

When optical character recognition (OCR) is used to create large information bases, term weighting schemes can be highly sensitive to the errors introduced into the input text by the OCR process. This work examines the behavior of the well-known cosine normalization method in the presence of OCR errors and proposes a new, more robust normalization method. Experiments show that the new scheme is less sensitive to OCR errors and facilitates the use of more diverse basic weighting schemes. This study also explains why the use of cosine normalization in the presence of the inverse document frequency factor is not advisable in large document collections.

When a user types a natural language query for an IR system, certain keywords in the query are more pertinent to the user's information need than others. Most modern IR systems capture these distinctions by using an inverse document frequency (idf) factor in term weighting. Preliminary experiments show that the usefulness of an idf-type function is high at low ranks, mainly because of the widened gap between the weights of the rare terms and the non-rare query terms. The standard idf function works very well across query sets, but experiments show that there is still room for improvement, and further studies are needed to discover a better replacement for it.
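As a rough illustration of the kind of weighting the abstract describes, here is a minimal Python sketch of a tf-idf term weight with pivoted document length normalization. The function name, the dampened-tf and idf forms, and the default slope value are illustrative assumptions, not the report's exact formulation:

```python
import math

def pivoted_norm_weight(tf, doc_len, avg_doc_len, df, num_docs, slope=0.25):
    """Sketch of a tf-idf weight with pivoted length normalization.

    The normalizer 'pivots' around the average document length: with
    slope < 1, documents longer than average are penalized less than
    under plain length normalization, narrowing the gap between the
    probability of retrieval and the probability of relevance.
    NOTE: illustrative only; not the report's exact formula.
    """
    if tf == 0:
        return 0.0
    tf_factor = 1.0 + math.log(tf)            # dampened term frequency
    idf = math.log(num_docs / df)             # inverse document frequency
    # Pivoted normalizer: interpolate between avg length and actual length.
    norm = (1.0 - slope) * avg_doc_len + slope * doc_len
    return tf_factor * idf / norm
```

For a document of exactly average length, the pivoted normalizer reduces to the average length itself; for a document four times longer than average, a slope of 0.25 yields a much milder penalty than dividing by the full length, which is the point of the technique.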
URI: http://hdl.handle.net/1813/7281
Appears in Collections: Computer Science Technical Reports

Files in This Item:

File         Size     Format
97-1626.pdf  1.52 MB  Adobe PDF
97-1626.ps   2.21 MB  Postscript


Items in eCommons are protected by copyright, with all rights reserved, unless otherwise indicated.

 

© 2014 Cornell University Library