Please use this identifier to cite or link to this item: http://hdl.handle.net/1813/7281
Title: Term Weighting Revisited
Authors: Singhal, Amitabh
Keywords: computer science; technical report
Issue Date: Mar-1997
Publisher: Cornell University
Citation: http://techreports.library.cornell.edu:8081/Dienst/UI/1.0/Display/cul.cs/TR97-1626
Abstract: Term weighting is an essential part of modern information retrieval systems. Of the three main components of a term weighting strategy --- term frequency, inverse document frequency, and document length normalization --- the term frequency factor has received recent attention from researchers. In this work, we study the inverse document frequency and document length normalization components of term weights. We observe that a document length normalization scheme that retrieves documents of all lengths with chances similar to their likelihood of relevance will outperform a scheme that retrieves documents with chances very different from their likelihood of relevance. We present pivoted normalization, a technique that can be used to modify normalization functions to reduce the gap between the retrieval and relevance probabilities. We present two new normalization functions --- pivoted unique normalization and pivoted byte size normalization --- both of which yield significant improvements over the previous state-of-the-art normalization functions. When optical character recognition (OCR) is used to create large information bases, term weighting schemes can be highly sensitive to errors introduced into the input text by the OCR process. This work examines the behavior of the well-known cosine normalization method in the presence of OCR errors and proposes a new, more robust normalization method. Experiments show that the new scheme is less sensitive to OCR errors and facilitates the use of more diverse basic weighting schemes.
This study also explains why using cosine normalization together with the inverse document frequency factor is not advisable in large document collections. When a user types a natural language query into an IR system, certain keywords in the query are more pertinent to the user's information need than others. Most modern IR systems capture these distinctions by using an inverse document frequency (idf) factor in term weighting. Preliminary experiments show that the usefulness of an idf-type function is highest at low ranks. We observe that the main reason for this effect is the widened gap between the weights of the rare terms and the non-rare query terms. The standard idf function works well across query sets, but experiments show there is still room for improvement, and further studies are needed to discover a better replacement for the standard idf function.
URI: http://hdl.handle.net/1813/7281
Appears in Collections: Computer Science Technical Reports
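The pivoted normalization idea summarized in the abstract can be sketched in a few lines. This is a minimal illustration, not the report's exact formulation: the normalizer follows the commonly cited pivoted form (1 - slope) * pivot + slope * (number of unique terms), and the tf factor (1 + log tf), the idf factor (log N/df), and the default slope of 0.2 are assumptions chosen for the sketch.

```python
import math
from collections import Counter

def pivoted_unique_norm(num_unique_terms, pivot, slope=0.2):
    """Pivoted unique normalization (assumed form):
    (1 - slope) * pivot + slope * num_unique_terms.
    Documents at the pivot length get norm == pivot for any slope,
    so the pivot is where the old and new normalizers agree."""
    return (1.0 - slope) * pivot + slope * num_unique_terms

def score(query_terms, doc_terms, df, num_docs, pivot, slope=0.2):
    """Illustrative tf-idf score with pivoted length normalization.
    df maps term -> document frequency; num_docs is the collection size.
    The (1 + log tf) and log(N/df) factors are assumed weighting choices."""
    tf = Counter(doc_terms)
    norm = pivoted_unique_norm(len(set(doc_terms)), pivot, slope)
    total = 0.0
    for term in query_terms:
        if term in tf and term in df:
            total += (1.0 + math.log(tf[term])) * math.log(num_docs / df[term]) / norm
    return total
```

Varying the slope below 1.0 penalizes long documents less than a plain length-proportional normalizer would, which is how the pivoted scheme narrows the gap between retrieval probability and relevance likelihood across document lengths.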
