How to Find Near-Duplicate Text Documents

Updated on December 3, 2018

What is Near-Duplication?

Near-duplication of electronic documents is easy to describe but difficult to define. Near-duplicate documents have similar, but not necessarily identical, content. Document similarity is often estimated as a percentage, where 100% means identical. Whilst identical documents are easily defined and identified (via a variety of checksum algorithms which produce a string based on the byte content of the file), the definition of similarity depends on a deeper understanding of the document and its meaning. Existing methods of similarity estimation use statistical analysis of groups of words as a robust substitute for analysis of meaning. The variety of algorithms and the statistical nature of the analysis mean that similarity measurements are not absolute: two documents measured as 75% similar using one algorithm or set of parameters will not have the same similarity score under a different algorithm or parameter set. Hence, statements such as “40% of all text documents are similar” must be heavily qualified.
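The contrast between exact and near duplication can be made concrete. A minimal sketch of exact-duplicate detection via checksums, using Python's standard `hashlib` (the document text and function name here are illustrative, not from any particular product):

```python
import hashlib

def file_checksum(data: bytes) -> str:
    """Return a SHA-256 digest string derived from the raw byte content."""
    return hashlib.sha256(data).hexdigest()

# Byte-identical documents always hash to the same string...
a = b"The quick brown fox jumps over the lazy dog."
b = b"The quick brown fox jumps over the lazy dog."
print(file_checksum(a) == file_checksum(b))  # True

# ...but changing even one byte gives a completely different digest,
# which is why checksums can detect identity but not similarity.
c = b"The quick brown fox jumps over the lazy dog!"
print(file_checksum(a) == file_checksum(c))  # False
```

Because the digest changes unpredictably with any edit, near-duplicate detection needs the content-aware methods described below.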

Why is Near-Duplication Detection Important?

Search engines are often used to try to find all the documents related to a particular issue or question, but the keyword-based approach tends to return unreasonably large numbers of results, and the ranking of these results does not always correspond to the wishes of the user.

Web search engine results frequently contain large numbers of duplicate and near-duplicate results; being able to filter these out would be a significant advantage. Google holds a number of patents for this process, and a page's Google PageRank is decreased by the presence of duplicates and near-duplicates.

At the organizational level, near-duplicate documents abound, often because multiple drafts of the same document are kept. Finding the latest version of a document may be straightforward if version control is used rigorously and universally in the document repository, but this is seldom the case.

How can Similarity be Estimated?

A reasonable estimate of similarity for images can be obtained by resampling the images being compared to a very small number of pixels and then using the proportion of pixels which are identical, or within a specified color space tolerance (often using only a grey scale), as a similarity measure. This process is fast and is the known basis of several readily available image-matching programs. Other programs are coy about the nature of their comparison algorithms, but may well use this method.
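A minimal sketch of this resample-and-compare idea, assuming greyscale images represented as lists of rows of integer grey levels; the 8×8 target size and the tolerance value are illustrative choices, not taken from any particular product:

```python
def resample(img, size=8):
    """Nearest-neighbour downsample of a greyscale image (a list of pixel rows)."""
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]

def image_similarity(img_a, img_b, size=8, tolerance=16):
    """Fraction of downsampled pixels whose grey levels agree within `tolerance`."""
    a, b = resample(img_a, size), resample(img_b, size)
    matches = sum(abs(pa - pb) <= tolerance
                  for row_a, row_b in zip(a, b)
                  for pa, pb in zip(row_a, row_b))
    return matches / (size * size)

# A uniformly brightened copy still matches within tolerance.
base = [[10] * 16 for _ in range(16)]
brighter = [[20] * 16 for _ in range(16)]
print(image_similarity(base, brighter))  # 1.0
```

The tolerance parameter is what turns exact pixel matching into a similarity estimate: widening it trades precision for robustness to small edits, recompression, and brightness changes.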

For text documents, the task is much more difficult, as the order of words is significant, as well as their meanings. Extraction of words from text documents is not a straightforward matter, although a multitude of text extraction components exist, as text extraction forms a key part of the process of building search engine indexes. It is often the case that using a different text extractor on the same document will give different results.

Assuming perfect text extraction, is the proportion of identical words in two documents a measure of similarity? Unfortunately not – two documents with the same words in different orders will appear identical by this measure. Synonyms are another complication – multiple words may describe the same thing. There are many approaches to similarity estimation, but they fall into two groups. One divides the text into small, sometimes overlapping, groups of sequential words called shingles, and measures similarity by the proportion of identical shingles found in pairs of documents. The other constructs a vector of words which characterizes the document and performs its comparison on the vectors. Both methods have a wide range of parameters and methods of comparison, some using highly sophisticated statistics. The variety of similarity algorithms and parameters means that there is no absolute measure of text similarity.
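The shingle approach can be sketched in a few lines; this is a toy illustration with an arbitrary shingle length of three words and Jaccard similarity (shared shingles divided by total distinct shingles), not the algorithm of any specific product:

```python
import re

def shingles(text, k=3):
    """Set of overlapping k-word shingles extracted from the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard_similarity(text_a, text_b, k=3):
    """Proportion of shingles the two documents share (Jaccard index, 0.0-1.0)."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    if not (a or b):
        return 1.0
    return len(a & b) / len(a | b)

# Word order matters: the same words rearranged produce different shingles,
# so the score drops even though the vocabulary is identical.
print(jaccard_similarity("the cat sat on the mat",
                         "on the mat the cat sat"))  # 2 of 6 shingles shared
```

Note how this fixes the word-order blindness described above: two documents with identical vocabularies but different word orders share few shingles, while the vector-of-words approach would still score them as identical.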

A further problem is that every document needs to be compared with every other document in a collection, so the number of comparisons grows quadratically with collection size, making analysis very slow for large collections such as big websites.
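This growth can be made concrete: a collection of n documents requires n(n-1)/2 pairwise comparisons, a simple counting sketch being:

```python
def pair_count(n):
    """Number of distinct document pairs in a collection of n documents."""
    return n * (n - 1) // 2

# A million documents already need roughly 5 x 10^11 comparisons,
# which is why naive all-pairs comparison does not scale.
for n in (100, 10_000, 1_000_000):
    print(f"{n:>9,} documents -> {pair_count(n):>15,} comparisons")
```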

Where are the Products?

Similarity estimation for text documents is the subject of many academic studies, as a search for “near-duplicate document detection” will indicate, but only a single study appears to have metamorphosed into a free-standing, unbundled product.

Whilst legal discovery is a well-known and lucrative area of demand for near-duplicate detection, issues arising from near-duplication of documents are encountered in many organizations, especially where multiple authors contribute, via drafts exchanged by email, to a single document which is to be submitted to an external agency. The location of the latest version of such a document (for example, a tender response) may not be known, resulting in the submission of a document without the most recent revisions. The consistent use of a document management system with version control by all authors may guard against this situation, but such a system may not be implemented, or it may be used in such a way as to make it difficult to find the latest document version.

The other domain for near-duplicate document detection is website crawling. Identification of near-duplicate web pages can be very helpful in keeping large websites up to date by ensuring that modifications are applied to all pages where they are required, achieving the highest possible Google PageRank, and reducing the volume of search results.

Near-duplicate Detection in Legal Discovery

Legal discovery is a pre-trial procedure where each party in a legal case may request the production of documents held by the other which are relevant to the case being considered. This may require the evaluation of very large numbers of electronic documents and emails for their relevance to a particular case and their export in a standardized format, a process commonly known as eDiscovery. If one document is considered relevant, other documents similar to it may also be relevant. As relevance is determined by highly paid legal and paralegal staff, any reduction in the number of documents being inspected or streamlining of the process by grouping similar documents and eliminating exact duplicates will yield substantial cost savings. The large savings, and other requirements of the legal discovery process, notably the efficient handling of emails, mean that software for this purpose is much more expensive than common consumer software, but many different packages are available.

One vendor (Casefleet) has a useful blog post on comparison criteria for eDiscovery tools. Enterprise Information Management vendor OpenText provides another. Both emphasize the importance of the availability of machine learning algorithms to answer questions like ‘Find documents like this one’, which near-duplicate detection provides, although it is not a learning algorithm. Vendor ImageMaker’s Discovery Assistant product includes a sophisticated document near-duplicate detection algorithm, but its price point and design restrict it to use as an eDiscovery tool.

Machine-learning algorithms operate using training sets. A common approach is to manually collect some relevant documents and use these as a training set from which a machine-learning algorithm finds other relevant documents in a large collection. Google describes the application of machine learning algorithms to document classification; due to its computational intensiveness, it is frequently implemented as a cloud service.
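A toy sketch of this training-set idea, emphatically not any vendor's algorithm: rank candidate documents by cosine similarity to the centroid of a hand-picked training set of relevant documents. The bag-of-words representation and all document texts are illustrative assumptions.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Word-frequency vector for a document."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse word-frequency vectors."""
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_by_relevance(training_docs, candidates):
    """Rank candidates by similarity to the centroid of the training documents."""
    centroid = Counter()
    for doc in training_docs:
        centroid.update(bag_of_words(doc))
    return sorted(candidates,
                  key=lambda doc: cosine(centroid, bag_of_words(doc)),
                  reverse=True)

training = ["contract terms and payment obligations",
            "payment schedule and contract breach"]
candidates = ["contract payment dispute", "recipe for chocolate cake"]
print(rank_by_relevance(training, candidates)[0])  # contract payment dispute
```

Production eDiscovery systems use far more sophisticated models, but the workflow is the same: reviewers label a seed set, and the system scores the rest of the collection against it.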

Do I Need to Worry About Near-duplicate Documents?

As the cost of storage has decreased and document retrieval via search has become more powerful, the efficiency gains and space savings made by removing duplicate and near-duplicate documents have become less significant. However, the retention of multiple drafts of a document may increase legal exposure in the event of an organization being served with a discovery order, as all documents which the organization has stored must be presented to the other party. Early drafts may contain content which is prejudicial to the organization and their identification and removal may reduce legal exposure.

Legal exposure is a key driver for moving organizational storage from shared drives, on which it is very difficult to apply a document disposal policy, to document management systems (DMSs). DMSs offer many advantages over file shares, including:

  • Definition of document check-in date. This date provides a basis for retention periods and is not subject to unplanned resetting, as may occur with filesystem date metadata.
  • Definition of document ownership. Like check-in date, ownership is not subject to unplanned resetting or volatility when accounts are removed.
  • Easy implementation of disposal policies and the application of a ‘legal freeze’ on document changes which has to be applied after a discovery order is served.
  • Version control. Different versions of a document can be accessed systematically, but users may not take advantage of this.


Despite these advantages, and the availability of free versions of most DMS products, file shares continue to be used for organizational document storage, sometimes without official sanction. Common reasons are performance and familiarity, as well as the requirement for additional hardware on which to run the DMS. DMS performance is often much poorer than that of a file share, especially for large files, and some applications (such as Excel linked files) rely on relative paths between files, which are not preserved by DMSs that store files in a database. Even cloud storage of files in folder structures can have problems in this area due to the use of absolute path names which differ between users. Users are also generally familiar with file operations on a file share and may find the check-in/check-out and compulsory metadata entry required by DMSs onerous. The popular DMS product Microsoft SharePoint has gone to some lengths to make its working environment as similar as possible to a file share.

Near-duplication in Web Sites

As Google PageRank is reduced if Google determines that a website has a high level of duplication (their definition includes near-duplication), most website maintenance and search engine optimization services include duplicate detection as part of their reports, and a few (including OnCrawl and DeepCrawl) explicitly include near-duplication in their reports.

Unbundled Near-duplicate Detection

If you are not performing legal discovery or trying to optimize your website rank, there are a few software packages which can perform near-duplicate analysis on a collection of documents.

Neardup

This is a Java command-line program from SoftCorporation, with a free 3-month license. It is more of a framework than a consumer program, requiring the installation of a number of free library packages in order to function. Its output is a collection of file clusters in XML format and the documentation indicates an academic origin. Potential users would have to be highly proficient with computers in order to apply it.

Output Clusters from Neardup

FindAlike

This is an ambitious consumer-grade Windows product from Aleka Consulting, offering near-duplicate detection, federated search, and tagging. Unlike Neardup, it does not provide a list of all document clusters in a static collection but finds near-duplicates of a particular document, or of particular text content, by interfacing with Windows search indexes, which cover Outlook email messages as well as disk content. This allows it to automatically find near-duplicates in document and Outlook email collections which are continually being updated, which in turn enables it to find all the different versions of a document and order them by date to identify the most recent. Four preset levels of similarity are provided for the clustering. FindAlike also offers federated search of multiple disk drives, and tagging of emails and documents, either manually or automatically, using statistical and rule-based classifiers. An Office add-in provides this functionality within Word, Outlook, Excel, and PowerPoint for the text content of the opened document. FindAlike costs $89 per year for a single-user desktop license, with a free 30-day evaluation. Workgroup licenses are also available.

Example of Similar Files found by FindAlike. The blue file is the original.
