How to Find Near-Duplicate Text Documents
What is Near-Duplication?
Near-duplication of electronic documents is easy to describe, but difficult to define. Near-duplicate documents have similar, but not necessarily identical content. Document similarity is often estimated by a percentage, where 100% is identical. Whilst identical documents are easily defined and identified (via a variety of checksum algorithms which produce a string based on the byte content of the file), the definition of similarity depends on a deeper understanding of the document and its meaning. Existing methods of similarity estimation use statistical analysis of groups of words as a robust substitute for analysis of meaning. The variety of algorithms and the statistical nature of the analysis means that similarity measurements are not absolute – two documents measured as 75% similar using one algorithm or set of parameters will not have the same similarity using a different algorithm or parameter set. Hence, statements such as “40% of all text documents are similar” are highly qualified.
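Exact duplicates, by contrast, really are easy to identify with a checksum. A minimal sketch using Python's standard `hashlib` module (the function name `file_digest` is illustrative): two files have the same SHA-256 digest exactly when their byte content is identical, so duplicates can be found by grouping files on their digest.

```python
import hashlib

def file_digest(path: str) -> str:
    """Return the SHA-256 hex digest of a file's byte content.

    Identical byte content always yields an identical digest, so
    comparing digests identifies exact (but not near-) duplicates.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```

Note that changing a single byte produces a completely different digest, which is precisely why checksums cannot measure similarity, only identity.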
Why is Near-Duplication Detection Important?
Search engines are often used to try to detect all the documents related to a particular issue or question, but the keyword-based approach tends to give unreasonably large numbers of results, and the ranking of these results does not always correspond to the wishes of the user.
Web search engine results frequently contain large numbers of duplicate and near-duplicate results; being able to filter these out would be a significant advantage. Google holds a number of patents for this process, and a page's Google ranking is decreased by the presence of duplicates and near-duplicates.
On the organization level, near-duplicate documents abound, often through multiple drafts of the same document being kept. Finding the latest version of a document may be straightforward if version control is used rigorously and universally in the document repository, but this is seldom the case.
How can Similarity be Estimated?
A reasonable estimate of similarity for images can be obtained by resampling the images being compared to a very small number of pixels and then using the proportion of pixels which are identical, or within a specified color space tolerance (often using only a grey scale), as a similarity measure. This process is fast and is the known basis of several readily available image matching programs. Other programs are coy about the nature of their comparison algorithms, but may well use this method.
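The resample-and-compare idea can be sketched in a few lines of pure Python. This is an illustrative toy, not the algorithm of any particular product: images are represented as 2D grids of grayscale values, block-averaged down to a small grid, and scored by the proportion of cells that match within a tolerance (the function names and the default grid size of 8 are assumptions for the sketch).

```python
def downsample(pixels, size):
    """Block-average a 2D grid of grayscale values down to size x size."""
    h, w = len(pixels), len(pixels[0])
    out = []
    for r in range(size):
        row = []
        for c in range(size):
            # Average the block of source pixels that maps to this cell.
            r0, r1 = r * h // size, (r + 1) * h // size
            c0, c1 = c * w // size, (c + 1) * w // size
            block = [pixels[i][j] for i in range(r0, r1) for j in range(c0, c1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def image_similarity(a, b, size=8, tolerance=16):
    """Proportion of downsampled cells that match within a tolerance."""
    da, db = downsample(a, size), downsample(b, size)
    matches = sum(
        1
        for r in range(size)
        for c in range(size)
        if abs(da[r][c] - db[r][c]) <= tolerance
    )
    return matches / (size * size)
```

The tolerance is what makes this a *near*-duplicate measure: small differences in brightness or compression artifacts do not prevent a match.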
For text documents, the task is much more difficult, as the order of words is significant, as well as their meanings. Extraction of words from text documents is not a straightforward matter, although a multitude of text extraction components exist, as text extraction forms a key part of the process of building search engine indexes. It is often the case that using a different text extractor on the same document will give different results.
Assuming perfect text extraction, is the proportion of identical words in two documents a measure of similarity? Unfortunately not – two documents with the same words in different orders will appear identical by this measure. Synonyms are another complication – multiple words may describe the same thing. There are many approaches to similarity estimation, but they fall into two groups. One divides the text into small, sometimes overlapping groups of sequential words called shingles, and measures similarity by the proportion of identical shingles found in pairs of documents. The other constructs a vector of words which characterize the document and performs its comparison on the vectors. Both methods have a wide range of parameters and methods of comparison, some using highly sophisticated statistics. The variety of similarity algorithms and parameters means that there is no absolute measure of text similarity.
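The shingle approach can be sketched very compactly. The following is a minimal illustration, assuming word-level shingles of length 3 and the Jaccard index (shared shingles divided by total distinct shingles) as the similarity score; real systems vary the shingle size, hash the shingles, and weight them in many ways.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (overlapping word sequences)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard_similarity(a: str, b: str, k: int = 3) -> float:
    """Proportion of shingles the two documents share (Jaccard index)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents are trivially identical
    return len(sa & sb) / len(sa | sb)
```

Because shingles preserve local word order, two documents containing the same words in a different order score low by this measure, fixing the weakness of the simple word-proportion approach described above.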
A further problem is that every document needs to be compared with every other document in a collection, making comparisons very slow for large collections such as big websites.
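The cost of the all-pairs comparison described above grows quadratically: n documents require n(n-1)/2 comparisons, so 1,000 documents need 499,500 comparisons and a million documents need roughly 500 billion. A minimal sketch of the naive approach (the function name and the pluggable similarity callback are illustrative):

```python
from itertools import combinations

def pairwise_similarities(docs, similarity):
    """Score every pair of documents: n * (n - 1) / 2 comparisons.

    'similarity' is any function taking two documents and returning
    a score, e.g. a shingle-based Jaccard measure.
    """
    return {
        (i, j): similarity(docs[i], docs[j])
        for i, j in combinations(range(len(docs)), 2)
    }
```

Production systems avoid this blow-up with techniques such as locality-sensitive hashing, which only compares documents that are likely to be similar.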
Where are the Products?
Similarity estimation for text documents is the subject of many academic studies, as a search for “near-duplicate document detection” will indicate, but only a single study appears to have metamorphosed into a free-standing, unbundled product.
Whilst legal discovery is a well-known and lucrative area of demand for near-duplicate detection, issues arising from near-duplication of documents are encountered in many organizations, especially where multiple authors contribute, via drafts exchanged by email, to a single document which is to be submitted to an external agency. The location of the latest version of such a document (for example, a tender response) may not be known, resulting in the submission of a document without the most recent revisions. The consistent use of a document management system with version control by all authors may guard against this situation, but such a system may not be implemented, or it may be used in such a way as to make it difficult to find the latest document version.
The other domain for near-duplicate document detection is website crawling. Identification of near-duplicate web pages can be very helpful in keeping large websites up to date by ensuring that modifications are applied to all pages where they are required, by achieving the highest possible Google Page Rank, and by reducing the volume of search results.
Near-duplicate Detection in Legal Discovery
Legal discovery is a pre-trial procedure where each party in a legal case may request the production of documents held by the other which are relevant to the case being considered. This may require the evaluation of very large numbers of electronic documents and emails for their relevance to a particular case and their export in a standardized format, a process commonly known as eDiscovery. If one document is considered relevant, other documents similar to it may also be relevant. As relevance is determined by highly paid legal and paralegal staff, any reduction in the number of documents being inspected or streamlining of the process by grouping similar documents and eliminating exact duplicates will yield substantial cost savings. The large savings, and other requirements of the legal discovery process, notably the efficient handling of emails, mean that software for this purpose is much more expensive than common consumer software, but many different packages are available.
One vendor (Casefleet) has a useful blog post on comparison criteria for eDiscovery tools. Enterprise Information Management vendor OpenText provides another. Both emphasize the importance of the availability of machine learning algorithms to answer questions like ‘Find documents like this one’, which near-duplicate detection provides, although it is not a learning algorithm. Vendor ImageMaker’s Discovery Assistant product includes a sophisticated document near-duplicate detection algorithm, but its price point and design restrict it to use as an eDiscovery tool.
Machine-learning algorithms operate using training sets. The process of manually collecting some relevant documents and using these as a training set to find other documents in a large collection via machine learning algorithms is a common approach. The application of machine learning algorithms to document classification is described in Google's machine learning documentation. Due to its computational intensiveness, it is frequently implemented as a cloud service.
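The training-set idea can be illustrated with a deliberately simple sketch, not the algorithm any vendor or Google actually uses: build a word-count centroid from the manually collected relevant documents, then rank candidates by cosine similarity to that centroid (the function names and the 0.2 threshold are assumptions for the example).

```python
from collections import Counter

def word_vector(text: str) -> Counter:
    """Bag-of-words representation of a document."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def find_similar(training_docs, candidates, threshold=0.2):
    """Return candidates resembling the training set, best match first."""
    centroid = Counter()
    for doc in training_docs:
        centroid.update(word_vector(doc))
    scored = [(cosine(centroid, word_vector(d)), d) for d in candidates]
    return [d for score, d in sorted(scored, reverse=True) if score >= threshold]
```

Real eDiscovery classifiers are far more sophisticated (stemming, TF-IDF weighting, trained models), but the workflow is the same: a small hand-labelled set steers the search through a large collection.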
Do I Need to Worry About Near-duplicate Documents?
As the cost of storage has decreased and document retrieval via search has become more powerful, the efficiency gains and space savings made by removing duplicate and near-duplicate documents have become less significant. However, the retention of multiple drafts of a document may increase legal exposure in the event of an organization being served with a discovery order, as all documents which the organization has stored must be presented to the other party. Early drafts may contain content which is prejudicial to the organization and their identification and removal may reduce legal exposure.
Legal exposure is a key driver to moving organizational storage from shared drives, on which it is very difficult to apply a document disposal policy, to document management systems (DMSs). DMSs offer many advantages over file shares including:
- Definition of document check-in date. This date provides a basis for retention periods and is not subject to unplanned resetting, as may occur with filesystem date metadata.
- Definition of document ownership. Like check-in date, ownership is not subject to unplanned resetting or volatility when accounts are removed.
- Easy implementation of disposal policies and the application of a ‘legal freeze’ on document changes which has to be applied after a discovery order is served.
- Version control. Different versions of a document can be accessed systematically, but users may not take advantage of this.
Despite these advantages, and the availability of free versions of most DMS products, shared drives continue to be used for organizational document storage, sometimes without official sanction. Common reasons are performance and familiarity, as well as the requirement for additional hardware on which to run the DMS. DMS performance is often much poorer than that of a file share, especially for large files, and some applications (such as Excel linked files) rely on relative paths between files, which are not preserved by DMSs, as these often store files in a database. Even cloud storage of files in folder structures can have problems in this area due to the use of absolute path names which differ between users. Users are also generally familiar with file operations on a file share and may find the check-in/check-out and compulsory metadata entry required by DMSs onerous. The popular DMS product Microsoft SharePoint has gone to some lengths to make its working environment as similar as possible to a file share.
Near-duplication in Web Sites
As Google page rank is reduced if Google determines that a website has a high level of duplication (their definition includes near-duplication), most website maintenance and search engine optimization services include duplicate detection as part of their reports, and a few (including OnCrawl and DeepCrawl) explicitly include near-duplication in their reports.
Unbundled Near-duplicate Detection
If you are not performing legal discovery or trying to optimize your website rank, there are a few software packages which can perform near-duplicate analysis on a collection of documents.
Neardup
This is a Java command-line program from SoftCorporation, with a free 3-month license. It is more of a framework than a consumer program, requiring the installation of a number of free library packages in order to function. Its output is a collection of file clusters in XML format, and the documentation indicates an academic origin. Potential users would have to be highly proficient with computers in order to apply it.
FindAlike
This is an ambitious consumer-grade Windows product from Aleka Consulting, offering near-duplicate detection, federated search, and tagging. Unlike Neardup, it does not provide a list of all document clusters in a static collection but finds near-duplicates of a particular document, or of particular text content, by interfacing with Windows search indexes, which include Outlook email messages as well as disk content. This allows it to automatically find near-duplicates in document and Outlook email collections which are continually being updated, and hence to find all the different versions of a document and order them by date to identify the most recent. Four preset levels of similarity are provided for the clustering. FindAlike also offers federated search of multiple disk drives, and tagging of emails and documents, either manually or automatically, using statistical and rule-based classifiers. An Office add-in provides this functionality within Word, Outlook, Excel, and PowerPoint for the text content of the opened document. FindAlike costs $89 per year for a single-user desktop license, with a free 30-day evaluation. Workgroup licenses are also available.