What’s the Best Software for De-Duplicating Similar Photos and Text Documents?

Updated on September 15, 2018
Simon Kravis

Simon has been involved in software development since the days of paper tape.

Why Does it Matter for Photos?

With the advent of digital cameras and cheap, abundant storage, many people taking photographs are trigger-happy, perhaps coming back from a holiday with thousands of digital images where they once might have had a few rolls of color slide film (which commonly held 36 images). Slightly different shots of the same scene frequently exist as groups within the thousands of digital images but the post-holiday intention to pick the best shot from each group is seldom realized.

While no software can pick the best photo from a group, software can identify groups of similar photos and provide a facility to delete unwanted photos or move selected ones to another location. This can greatly reduce the effort required to edit large collections of digital images down to a tractable size.

Why Does it Matter for Text Documents?

Similar, but not identical, text documents are surprisingly common, especially in storage shared by a number of users who may collaborate on creating them. In the author’s studies in large and small organizations, it was not unusual to find that 40% of all text documents were members of a group of two or more with similar or identical content. Even for single domestic users, the process of saving an Office document to PDF format can create two documents which differ in their bit patterns but have the same text content.

Collaborative authoring is very common within organizations and there are often difficulties in finding the latest version of a collaboratively authored document before it is released outside the organization. “You didn’t pick up my changes!” is a frequent accusation in this situation. Document management systems address this issue with their check in/check out facility, but they are not universally deployed, and even if they are available, users may not make use of them.

Algorithms for Detecting Similar Photos

There are many possible algorithms for detecting similarity in photos, and most software does not give any detail of how it operates. However, one that does (dupeGuru) works by creating a very low-resolution 15 x 15-pixel version of each input image and comparing pixel color components. The proportion of these 225 pixels that match is used to determine the similarity. The process is simple, but compute-intensive and slow: matching 1300 photos took 13 minutes on a medium-spec laptop. Differences in how the programs reviewed below perform on the test image pair indicate that they use different algorithms.
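The thumbnail comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not dupeGuru's actual code: images are assumed to be plain lists of rows of (R, G, B) values, and the block-averaging downscale, the per-channel tolerance of 16 and the function names thumbnail and similarity are all assumptions made for the example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each 0-255
Image = List[List[Pixel]]      # rows of pixels


def thumbnail(img: Image, size: int = 15) -> Image:
    """Shrink img to size x size by averaging each block of source pixels."""
    h, w = len(img), len(img[0])
    out = []
    for ty in range(size):
        y0 = ty * h // size
        y1 = max((ty + 1) * h // size, y0 + 1)   # always at least one source row
        row = []
        for tx in range(size):
            x0 = tx * w // size
            x1 = max((tx + 1) * w // size, x0 + 1)
            block = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(tuple(sum(p[c] for p in block) // len(block) for c in range(3)))
        out.append(row)
    return out


def similarity(a: Image, b: Image, tolerance: int = 16, size: int = 15) -> float:
    """Fraction of thumbnail pixels whose R, G and B values all agree within tolerance."""
    ta, tb = thumbnail(a, size), thumbnail(b, size)
    matches = 0
    for row_a, row_b in zip(ta, tb):
        for pa, pb in zip(row_a, row_b):
            if all(abs(ca - cb) <= tolerance for ca, cb in zip(pa, pb)):
                matches += 1
    return matches / (size * size)   # 225 pixels for the default 15 x 15 thumbnail
```

A pair of images would then be reported as similar when the returned fraction exceeds whatever threshold the user has chosen. Every image has to be compared with every other image, which is one reason the approach becomes slow on large collections.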

What About the Web?

There are now a number of image search engines (e.g. Google Images, Preposteo) that will find an image similar to one that you upload or select. However, there does not appear to be any Web-based facility at present for finding and editing groups of similar photos within a large collection. This may change in the future as upload speeds increase and more computationally demanding matching methods are required.

Software for Finding Similar Photos

There are a large number of products available for de-duplication of various types of files, almost all dealing with exact duplication, where the duplicated files have the same bit pattern and thus the same checksum. Some also offer detection of similar images which do not have identical bit patterns, and a selection of these is reviewed below. To evaluate the quality of similarity matching, the two images shown below were used as a test. To a human, they are very similar, but not to all the programs tested.
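Exact duplication is the easy case, and it can be illustrated with the Python standard library alone. The sketch below groups files by their SHA-256 checksum; the choice of SHA-256 and the function names file_digest and find_exact_duplicates are assumptions for the example, not a description of how any of the reviewed products work.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 checksum of a file, read in chunks to limit memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(folder):
    """Group files under folder (recursively) that share the same checksum."""
    by_digest = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    # Only checksums shared by two or more files indicate duplicates.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Two photos that differ by even one byte (for example, after re-saving or resizing) produce different checksums, which is why similar-image detection needs a different approach.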

Test Images for Similarity Matching

Software download sites such as Softpedia and CNET are good sources for specialized software but many programs (especially shareware) have not been modified for years and support in the event of problems may be non-existent. Softpedia offers independent reviews of all downloadable software.

dupeGuru

This is a free, open source product offering various methods of file comparison as well as image analysis (or picture mode). These include file name, size, and checksum, which can rapidly identify identical files. It runs on Windows, Linux, and OS X. dupeGuru has a help option (dated 2016), and an API. The threshold similarity is set from the Options menu as Filter Hardness. Example output is shown below.

dupeGuru output detail

A checkbox in the left-hand column for the non-reference files allows a file to be selected. Options for marked and selected files available under the Actions menu item include moving, copying, deleting and many others.

There is no easy way of comparing similar images: if all images in a cluster are selected and Open with Default application is clicked, each image appears in a separate instance of the default program, making comparison difficult.

dupeGuru did not find any similarity between the two test images even at the Most Results threshold setting.

dupeGuru’s ability to find and manipulate duplicates of non-image files comes at the cost of ease of making selections from clusters of duplicate images.

Similar Image Finder

This is another free product (from Tago Software). Processing is somewhat faster than dupeGuru, taking 7.5 minutes to process 1288 images for the most accurate scanning option. It does allow comparison of similar images as shown below but does not offer any actioning options. Its clustering is very basic, with the same file appearing as a duplicate of two different originals. There is no help, and the About screen is dated 2012, so it seems probable that there has been no development for many years.

Similar Image Finder found a similarity of 74% between the two test images.

Similar Image Finder Interface

Duplicate Photo Cleaner

This product, from WebMinds, is described as shareware on some download sites, but it is better described as a commercial product with an evaluation or demo mode. In evaluation mode, most features other than scanning are disabled, so it is not possible to take any action without product registration, which in practice means purchasing a license. A license costs US$49.90.

Results from a Standard Scan are shown below. The scan is fast, taking 3.5 minutes for the same 1288 images used previously. The result screen in Multi-Viewer mode, shown below after clicking Select All Originals, displays image thumbnails, allowing easy inspection of results. Table view mode displays images in pairs (as in other software), and Tree mode shows originals and duplicates as a tree.

Duplicate Photo Cleaner Interface

The quality of the grouping is generally very good on unprocessed camera images, but a failure of the algorithm is evident on the two clusters highlighted in red, which have similar content but have been splintered (not grouped together). The similarity between the two test images was 34%, indicating a more restrictive algorithm than other programs. However, any automated similarity algorithm will fail sometimes when compared to a human evaluator.

Actioning options are moving or deleting either Originals (as flagged) or Duplicates. There is an undo function if required. However, the action of moving both Originals and unduplicated files to a designated folder is not available, although this can be achieved by deleting all duplicates and copying or moving the folder to the designated location.

Duplicate Photo Cleaner has a number of other very useful features: adjusting the thumbnail size allows detailed inspection of clustered images, and changing the image marked as original (all of which can be exported) is simply a matter of ticking and unticking thumbnails.

Best results were obtained by multiple passes through the data, first with a high threshold and then with a lower one.
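The multi-pass approach can be sketched as follows. This is a simplified illustration under stated assumptions, not Duplicate Photo Cleaner's method: the caller supplies a hypothetical score function returning a similarity between 0 and 1, the first file in each group is treated as the original, and the threshold values 0.9 and 0.7 are arbitrary. In practice the user would review each pass before anything is removed.

```python
def group(files, score, threshold):
    """Greedy grouping: each file joins the first group whose first member it matches."""
    groups = []
    for f in files:
        for g in groups:
            if score(g[0], f) >= threshold:
                g.append(f)
                break
        else:
            groups.append([f])   # no existing group matched, so start a new one
    return groups


def multi_pass(files, score, thresholds=(0.9, 0.7)):
    """Run one pass per threshold, keeping only each group's first file between passes."""
    survivors = list(files)
    removed = []
    for t in thresholds:
        groups = group(survivors, score, t)
        removed += [f for g in groups for f in g[1:]]
        survivors = [g[0] for g in groups]
    return survivors, removed
```

The first pass clears out the near-identical shots, so the second, more permissive pass has fewer files to compare and its results are easier to inspect.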

SimilarImages

This is freeware, but the downloaded version is dated 2013. The interface is unsophisticated and would be off-putting to a naïve user. There is no help file. The button to start processing is labeled “Search”. The threshold value is interpreted differently from all other programs tested: reducing the threshold reduces the number of matches found.

SimilarImages Interface

Processing is fast (3 minutes for the 1288 images), but comparison results are only displayed as a series of pairs of images, making it difficult to process clusters of more than two files.
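To show why a list of pairs is harder to work with than clusters, the sketch below merges matching pairs into groups using a union-find structure. It is an illustration only; the function name cluster_pairs and the file names are hypothetical, and none of the reviewed programs is known to work this way.

```python
def cluster_pairs(pairs):
    """Merge (file_a, file_b) matches into clusters of mutually linked files."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a         # join the two clusters

    groups = {}
    for name in parent:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


pairs = [("a.jpg", "b.jpg"), ("b.jpg", "c.jpg"), ("x.jpg", "y.jpg")]
clusters = cluster_pairs(pairs)
```

Here a.jpg, b.jpg and c.jpg end up in one cluster even though a.jpg and c.jpg were never reported as a pair. The same transitive linking can also chain loosely related photos into a single over-large cluster, so the grouping rule matters as much as the similarity score.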

Similar Photo Pair Displayed by SimilarImages

Actioning is by deleting one of the pair of images shown. Various automated deletion rules can be applied, based on file date, size, resolution or whether the image is in the right or left pane. An automated rule can be used to remove all duplicates.
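Rules of this kind are straightforward to express in code. The following sketch is hypothetical: the Photo record, its field names and the rule names are invented for illustration and do not correspond to SimilarImages' own options.

```python
from dataclasses import dataclass


@dataclass
class Photo:
    name: str
    modified: float    # modification time as a timestamp
    size_bytes: int
    width: int
    height: int


# Each rule maps a photo to a value; the photo with the highest value is kept.
RULES = {
    "keep_newest": lambda p: p.modified,
    "keep_largest_file": lambda p: p.size_bytes,
    "keep_highest_resolution": lambda p: p.width * p.height,
}


def apply_rule(photos, rule):
    """Return (photo to keep, photos to delete) for one group of similar photos."""
    keep = max(photos, key=RULES[rule])
    delete = [p for p in photos if p is not keep]
    return keep, delete
```

Applying one rule to every pair removes all duplicates without further interaction, which is convenient but also removes the chance to inspect each decision.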

SimilarImages hung when processing the folder containing only the two test images, so no estimate of performance could be obtained.

Find.Same.Images.OK

This is freeware from a very enthusiastic developer based in Germany with a large number of free products. There is no detailed help file, but the product is dated 2018, and so is probably still under development. The interface is again unsophisticated, with a profusion of displays and settings that are likely to put off a naïve user. However, scanning is fast (< 3 minutes for 1288 images), and the scan results are displayed below:

Find.Same.Images.OK Interface

Results are displayed as pairs of matching files, based on a similarity threshold which can be set between 55% and 90% from the similarity dropdown above the results list. Other scanning options controlling detection of rotated, flipped or negative images can also be set.
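One plausible way to detect rotated, flipped or negative copies is to compare one image against transformed variants of the other and keep the best score. The sketch below does this on small grayscale grids (lists of rows of 0-255 values). It is an assumption about how such options might work, not a description of Find.Same.Images.OK, and all function names and the tolerance of 16 are hypothetical.

```python
def rotate90(g):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*g[::-1])]


def flip_h(g):
    """Mirror a grid left to right."""
    return [row[::-1] for row in g]


def negative(g):
    """Invert brightness values."""
    return [[255 - v for v in row] for row in g]


def variants(g, rotated=True, flipped=True, negated=False):
    """Return g plus the transformed copies enabled by the options."""
    out = [g]
    if rotated:
        r = g
        for _ in range(3):
            r = rotate90(r)
            out.append(r)
    if flipped:
        out += [flip_h(v) for v in list(out)]
    if negated:
        out += [negative(v) for v in list(out)]
    return out


def grid_similarity(a, b, tolerance=16):
    """Fraction of cells that agree within tolerance; 0.0 if the shapes differ."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return 0.0
    total = len(a) * len(a[0])
    matches = sum(abs(x - y) <= tolerance
                  for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    return matches / total


def best_similarity(a, b, **options):
    """Highest similarity between a and any enabled variant of b."""
    return max(grid_similarity(a, v) for v in variants(b, **options))
```

Each enabled option multiplies the number of comparisons, so a scan with every option switched on can be expected to take noticeably longer.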

Files can be actioned by right-clicking on the selected file (or files) to move, copy or delete them.

The similarity measured between the two test images was less than 55%, which is the minimum value available.

Visual Similarity Duplicate Image Finder

This is a commercial product from MindGems. In demo mode, only the names of the first 10 duplicate groups are displayed and actioning of files is disabled. A license costs US$24.95. It has a help file and the product is dated 2017. The interface goes beyond showing duplicate pairs, addressing the need to view all files in a cluster before actioning, but contains much more functionality than a naïve user would wish to see. For the user willing to climb the learning curve, there are a large number of options and settings available.

After selecting the folder containing the images, and running the scan (which again takes less than 3 minutes for 1288 files), the following screen is shown.

Visual Similarity Duplicate Image Finder Showing a Single Similar Cluster

The display shows thumbnails of all the images which have been grouped together as a similar cluster if the Multi-Preview option is chosen and any file in the group is selected. In Preview mode, only the first file in the group and the selected file are shown. The group ID is shown in the rightmost column of the display.

A failure of the similarity algorithm is evident in the image shown above, where two clusters of similar files have been merged, all with a similarity of more than 90% with the first file in the group. This problem is the opposite of the cluster splintering which occurs in other products, but it appears to be much more common. On the test image pair, Visual Similarity Duplicate Image Finder detected a similarity of 78%, which is consistent with its similarity algorithm being more prone to false positives than those of the other programs.

Actioning is performed by selecting the Autocheck & Delete/Move or Copy tab as shown below and clicking the oddly named Perform button.

Visual Similarity Duplicate Image Finder Action Tab

Summary of Similar Photo Software

| Product | Cost | Interface Quality | Speed | Performance on Test Images | Notes |
|---|---|---|---|---|---|
| dupeGuru | Free | 2 | 1 | 1 | No built-in viewing of matches |
| Similar Image Finder | Free | 2 | 4 | 4 | No actioning |
| Duplicate Photo Cleaner | US$49.90 | 5 | 5 | 3 | Simple actioning & operation |
| SimilarImages | Free | 1 | 4 | 1 | Complex actioning, hangs on some folders |
| Find.Same.Images.OK | Free | 1 | 3 | <2 | Idiosyncratic interface |
| Visual Similarity Duplicate Image Finder | US$24.95 | 3 | 4 | 5 | Complex interface |

Rating scale: 1 (Poor), 3 (Average), 5 (Excellent). Note that performance on the test image pair does not necessarily reflect performance on other images, as the false positive/negative rate will depend on the nature of the images being matched.

Overall, Duplicate Photo Cleaner would be the recommended product, but you have to be prepared to pay the license fee. It tends to give false negative results, but this can be overcome by multiple passes, first with a high threshold and then with a lower one to pick up other matches. The free products have poor interfaces and require some patience from the user. Of these, SimilarImages is probably the best, but it hangs on some folders.

Finding Similar Text Documents

Software for detecting similar text documents is much less common than for photos. Currently, this capability is most commonly used in legal discovery, and many software packages intended for this purpose include some capacity for finding such documents. These packages are not generally available for download and test. The area is of considerable research interest as one of the frontiers of Artificial Intelligence, and there are many papers on methods of similarity estimation.

The task of finding the latest version of a document is straightforward if all documents are always stored in a document management system, but ‘off-system’ storage and processing frequently occur, making the latest version in the document management system not necessarily the actual latest version.

There appears to be only one similar text document detection product targeted more broadly than legal discovery and available for download and test.

FindAlike

This is a product from Aleka Consulting, an Australian company. It costs $89 for a single user license and downloads have a 30-day evaluation period. FindAlike operates by creating a document vector from the text content of documents and matching these vectors to estimate similarity and detect clusters of similar documents. Document creation and movement on local and shared filesystems are tracked using Microsoft Windows Indexing. FindAlike comprises a standalone component and an Office Add-in. When using the Office Add-in, files with text similar to the text of the currently open document are displayed, together with their Modified date, allowing easy detection of more recent versions of the open document. The standalone component allows selection of any file as the target for similarity matching. Both components support tagging (manual and automatic, based on content) and search, and can suggest a container destination when used in conjunction with a document management system. Where similar files are attached to emails, the email sender and recipient are shown.
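The general idea of matching document vectors can be illustrated with a simple word-count vector and cosine similarity. How FindAlike actually builds and compares its vectors is not described, so everything in this sketch (the tokenization, the use of raw word counts, the cosine measure and the function names) is an assumption chosen for illustration.

```python
import math
import re
from collections import Counter


def document_vector(text):
    """Represent a document as a count of each lower-cased word it contains."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0 = unrelated, 1 = same mix of words)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


draft = document_vector("The quarterly report shows sales rising in all regions.")
edited = document_vector("The quarterly report shows sales falling in most regions.")
unrelated = document_vector("Bring walking boots and a waterproof jacket.")
```

Because the vector depends only on the extracted text, a Word document and the PDF saved from it would score as near-identical even though their bit patterns differ, while cosine_similarity(draft, edited) falls somewhere between the unrelated and identical cases.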

FindAlike features adjustable similarity tolerance and its scanning of disk storage can include local and network drives. The network drives do not necessarily have to be running a Windows operating system. It also provides indexed search over these drives (and local emails).

FindAlike Similar Files Results
FindAlike Word Add-in Interface
