Skip to main content


eCommons@Cornell

eCommons@Cornell >
Faculty of Computing and Information Science >
Computing and Information Science >
Computing and Information Science Technical Reports >

Please use this identifier to cite or link to this item: http://hdl.handle.net/1813/5743
Title: Plagiarism Detection in arXiv
Authors: Sorokina, Daria
Gehrke, Johannes
Warner, Simeon
Ginsparg, Paul
Keywords: computer science
technical report
Issue Date: 28-Sep-2006
Publisher: Cornell University
Citation: http://techreports.library.cornell.edu:8081/Dienst/UI/1.0/Display/cul.cis/TR2006-2046
Abstract: We describe a large-scale application of methods for finding plagiarism and self-plagiarism in research document collections. The methods are applied to a collection of 284,834 documents collected by arXiv.org over a 14 year period, covering a few different research disciplines. The methodology efficiently detects a variety of problematic author behaviors, and heuristics are developed to reduce the number of false positives. The methods are also efficient enough to implement as a real-time submission screen for a collection many times larger.
URI: http://hdl.handle.net/1813/5743
Appears in Collections:Computing and Information Science Technical Reports

Files in This Item:

File Description SizeFormat
TR2006-2046.pdf171.46 kBAdobe PDFView/Open

Refworks Export

Items in eCommons are protected by copyright, with all rights reserved, unless otherwise indicated.

 

© 2014 Cornell University Library Contact Us