What’s the Best Software for De-Duplicating Similar Photos and Text Documents? - TurboFuture - Technology

What’s the Best Software for De-Duplicating Similar Photos and Text Documents?

Simon has been involved in software development since the days of paper tape. He has developed niche software for information management.

Why Does it Matter for Photos?

With the advent of digital cameras and cheap, abundant storage, many people taking photographs are trigger-happy, perhaps coming back from a holiday with thousands of digital images where they once might have had a few rolls of color slide film (which commonly held 36 images). Slightly different shots of the same scene frequently exist as groups within the thousands of digital images but the post-holiday intention to pick the best shot from each group is seldom realized.

While no software can pick the best photo from a group, software can identify groups of similar photos and provide a facility to delete unwanted photos or move selected ones to another location. This can greatly reduce the effort required to edit large collections of digital images down to a tractable size.

Why Does it Matter for Text Documents?

Similar, but not identical, text documents are surprisingly common, especially in storage shared by a number of users who may collaborate on creating them. In the author’s studies in large and small organizations, it was not unusual to find that 40% of all text documents were members of a group of two or more with similar or identical content. Even for single domestic users, the process of saving an Office document to PDF format can create two documents which differ in their bit patterns but have the same text content.

Collaborative authoring is very common within organizations and there are often difficulties in finding the latest version of a collaboratively authored document before it is released outside the organization. “You didn’t pick up my changes!” is a frequent accusation in this situation. Document management systems address this issue with their check in/check out facility, but they are not universally deployed, and even if they are available, users may not make use of them.

Algorithms for Detecting Similar Photos

There are many possible algorithms for detecting similarity in photos, and most software does not give any detail of how it operates. However, one that does (dupeGuru) works by creating a very low resolution 15 x 15-pixel version of each input image and comparing pixel color components. The proportion of these 225 pixels which match is used to determine the similarity. The process is simple, but compute intensive and slow: matching 1300 photos took 13 minutes on a medium spec laptop. Differences in the performance of the programs reviewed below on the test image pair indicate that they use different algorithms.
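The low-resolution comparison described above can be sketched in a few lines of Python. This is an illustration of the general idea only, not dupeGuru's actual code: the block-averaging used to shrink the image, the `tolerance` value that decides whether two pixels "match", and the function names are all assumptions made for the example. Images are represented as plain nested lists of (R, G, B) tuples so that the sketch needs no imaging library.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each 0-255
Image = List[List[Pixel]]      # rows of pixels

GRID = 15  # thumbnail is GRID x GRID = 225 pixels


def shrink(image: Image, grid: int = GRID) -> Image:
    """Reduce an image to grid x grid by averaging the pixels in each cell."""
    height, width = len(image), len(image[0])
    thumb: Image = []
    for gy in range(grid):
        # Rows of the source image covered by this cell (at least one row).
        y0 = gy * height // grid
        y1 = max(y0 + 1, (gy + 1) * height // grid)
        row: List[Pixel] = []
        for gx in range(grid):
            x0 = gx * width // grid
            x1 = max(x0 + 1, (gx + 1) * width // grid)
            cell = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(tuple(sum(p[c] for p in cell) // len(cell) for c in range(3)))
        thumb.append(row)
    return thumb


def similarity(a: Image, b: Image, tolerance: int = 16) -> float:
    """Proportion of thumbnail pixels whose R, G and B all differ by <= tolerance."""
    thumb_a, thumb_b = shrink(a), shrink(b)
    matches = sum(
        all(abs(pa[c] - pb[c]) <= tolerance for c in range(3))
        for row_a, row_b in zip(thumb_a, thumb_b)
        for pa, pb in zip(row_a, row_b)
    )
    return matches / (GRID * GRID)


# Synthetic 30 x 30 test images: flat grey, slightly brighter grey,
# and an image whose top half is black.
grey = [[(128, 128, 128)] * 30 for _ in range(30)]
brighter = [[(135, 135, 135)] * 30 for _ in range(30)]
half_black = [[(0, 0, 0)] * 30 for _ in range(15)] + [[(128, 128, 128)] * 30 for _ in range(15)]

print(similarity(grey, brighter))    # small exposure change
print(similarity(grey, half_black))  # large change in content
```

Every pair of images in a collection has to be compared in this way, which is consistent with the slow run times reported above for large collections.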

What About the Web?

There are now a number of image search engines (e.g. Google Images, Preposteo) that will find images similar to one which you upload or select. However, there does not appear to be any Web-based facility at present for finding and editing groups of similar photos within a large collection. This may change in the future as upload speeds increase and more computationally demanding matching methods are required. The closest is Similar.Pictures, a technically sophisticated web application for identifying groups of similar photos and performing image search. It describes its similarity measurement algorithm in detail but lacks any capability for changing similarity thresholds or actioning groups of similar photos. Because it runs in a web browser, it works on any platform, but it is very slow on large groups of files.

Software for Finding Similar Photos

There are a large number of products available for de-duplication of various types of files, almost all dealing with exact duplication, where the duplicated files have the same bit pattern and thus the same checksum. Some also offer detection of similar images which do not have identical bit patterns, and a selection of these is reviewed below. To evaluate the quality of similarity matching, the two images shown below were used as a test. To a human, they are very similar, but not to all the programs tested.

Test Images for Similarity Matching
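For contrast with similarity matching, exact-duplicate detection by checksum, as mentioned above, is straightforward. The following is a minimal sketch using only the Python standard library; the choice of SHA-256 and the function names are assumptions for illustration, and the reviewed products may well use other checksums.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 checksum of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def exact_duplicates(folder: Path) -> List[List[Path]]:
    """Group files under folder that have identical content (same checksum)."""
    by_checksum: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            by_checksum[file_checksum(path)].append(path)
    # Only checksums shared by two or more files indicate duplicates.
    return [paths for paths in by_checksum.values() if len(paths) > 1]
```

A single changed byte gives a completely different checksum, which is why this approach cannot find two slightly different shots of the same scene, or a Word document and its PDF version.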

Software download sites such as Softpedia and CNET are good sources for specialized software but many programs (especially shareware) have not been modified for years and support in the event of problems may be non-existent. Softpedia offers independent reviews of all downloadable software.

dupeGuru

This is a free, open source product offering various methods of file comparison, including file name, size, and checksum (which can rapidly identify identical files), as well as image analysis (or picture mode). It runs on Windows, Linux, and OS X. dupeGuru has a help option (dated 2016), and an API. The threshold similarity is set from the Options menu as Filter Hardness. Example output is shown below.

dupeGuru output detail

A checkbox in the left-hand column for the non-reference files allows a file to be selected. Options for marked and selected files available under the Actions menu item include moving, copying, deleting and many others.

There is no easy way of comparing similar images: if all images in a cluster are selected and Open with Default application is clicked, each image appears in a separate instance of the default program, making comparison difficult.

dupeGuru did not find any similarity between the two test images even at the Most Results threshold setting.

dupeGuru’s ability to find and manipulate duplicates of non-image files comes at the cost of ease of making selections from clusters of duplicate images.

Similar Image Finder

This is another free product (from Tago Software). Processing is somewhat faster than dupeGuru, taking 7.5 minutes to process 1288 images for the most accurate scanning option. It does allow comparison of similar images as shown below but does not offer any actioning options. Its clustering is very basic, with the same file appearing as a duplicate of two different originals. There is no help, and the About screen is dated 2012, so it seems probable that there has been no development for many years.

Similar Image Finder found a similarity of 74% between the two test images.

Similar Image Finder Interface

Duplicate Photo Cleaner

This product, from WebMinds, is described as shareware on some download sites, but it is better described as a commercial product with an evaluation or demo mode. In evaluation mode, most features other than scanning are disabled, so it is not possible to take any action without product registration, which in practice means purchasing a license. A license costs US$49.90.

Results from a Standard Scan are shown below. The scan is fast: 18 images/sec on a local drive. The result screen in Multi-Viewer mode displays image thumbnails, allowing easy inspection of results after clicking Select All Originals. Table view mode displays images in pairs (as for other software) and Tree mode shows originals and duplicates as a tree.

Duplicate Photo Cleaner Interface

The quality of the grouping is generally very good on unprocessed camera images, but a failure of the algorithm is evident on the two clusters highlighted in red, which have similar content but have been splintered (not grouped together). The similarity between the two test images was 34%, indicating a more restrictive algorithm than other programs. However, any automated similarity algorithm will fail sometimes when compared to a human evaluator.

Actioning options are moving or deleting either Originals (as flagged) or Duplicates. There is an undo function if required. However, the action of moving both Originals and unduplicated files to a designated folder is not available, although this can be achieved by deleting all duplicates and copying or moving the folder to the designated location.

Duplicate Photo Cleaner has a number of other very useful features: adjusting the thumbnail size allows detailed inspection of clustered images, and changing the image marked as original (all of which can be exported) is simply a matter of ticking and unticking thumbnails.

Best results were obtained by multiple passes through the data, first with a high threshold and then with a lower one.

SimilarImages

This is freeware, but the downloaded version is dated 2013. The interface is unsophisticated and would be off-putting to a naïve user. There is no help file. The button to start processing is labeled “Search”. The threshold value is interpreted differently from all other programs tested: reducing the threshold reduces the number of matches found.

SimilarImages Interface

Processing is fast (7 images/sec), but comparison results are only displayed as a series of pairs of images, making it difficult to process clusters of more than two files.

Similar PhotoPair displayed by SimilarImages
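A list of matching pairs can, in principle, be turned into clusters by treating each pair as a link and collecting everything that is linked together. The sketch below does this with a simple union-find structure; it illustrates the general technique and does not describe how any of the reviewed programs work. The file names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def clusters_from_pairs(pairs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Merge pairwise matches into clusters of mutually linked files."""
    parent: Dict[str, str] = {}

    def find(item: str) -> str:
        # Follow parent links to the cluster representative, shortening the path as we go.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # join the two clusters

    groups: Dict[str, List[str]] = {}
    for item in parent:
        groups.setdefault(find(item), []).append(item)
    return sorted(sorted(members) for members in groups.values())


pairs = [("IMG_001.jpg", "IMG_002.jpg"), ("IMG_002.jpg", "IMG_003.jpg"), ("IMG_010.jpg", "IMG_011.jpg")]
print(clusters_from_pairs(pairs))
```

Note that IMG_001 and IMG_003 end up in the same cluster even though they were never matched directly. This chaining is convenient, but it is also one way in which distinct groups of photos can be merged when a single borderline pair links them.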

Actioning is by deleting one of the pair of images shown. Various automated deletion rules can be applied, based on file date, size, resolution or whether the image is in the right or left pane. An automated rule can be used to remove all duplicates.

SimilarImages hung when processing the folder containing only the two test images, so no estimate of performance could be obtained.

Find.Same.Images.OK

This is freeware from a very enthusiastic developer based in Germany with a large number of free products. There is no detailed help file, but the product is dated 2018, and so is probably still under development. The interface is again unsophisticated, with a profusion of displays and settings that are likely to put off a naïve user. However, scanning is fast (< 3 minutes for 1288 images), and the scan results are displayed below:

Find.Same.Images.OK Interface

Results are displayed as pairs of matching files, based on a similarity threshold which can be set between 55% and 90% from the similarity dropdown above the results list. Other scanning options controlling detection of rotated, flipped or negative images can also be set.

Files can be actioned by right-clicking on the selected file (or files) to move, copy or delete them.

The similarity measured between the two test images was less than 55%, which is the minimum value available.

Visual Similarity Duplicate Image Finder

This is a commercial product from MindGems. In demo mode, only the names of the first 10 duplicate groups are displayed and actioning of files is disabled. A license costs US$24.95. It has a help file and the product is dated 2017. The interface goes beyond showing duplicate pairs, addressing the need to view all files in a cluster before actioning, but contains much more functionality than a naïve user would wish to see. For the user willing to climb the learning curve, there are a large number of options and settings available.

After selecting the folder containing the images, and running the scan (which again takes less than 3 minutes for 1288 files), the following screen is shown.

Visual Similarity Duplicate Image Finder Showing a Single Similar Cluster

The display shows thumbnails of all the images which have been grouped together as a similar cluster if the Multi-Preview option is chosen and any file in the group is selected. In Preview mode, only the first file in the group and the selected file are shown. The group ID is shown in the rightmost column of the display.

A failure of the similarity algorithm is evident in the image shown above, where two clusters of similar files have been merged, all with a similarity of more than 90% with the first file in the group. This problem is the opposite of the cluster splintering which occurs in other products, but it appears to be much more common. On the test image pair, Visual Similarity Duplicate Image Finder detected a similarity of 78%, which is consistent with the similarity algorithm being more prone to false positives than other programs.

Actioning is performed by selecting the Autocheck & Delete/Move or Copy tab as shown below and clicking the oddly named Perform button.

Visual Similarity Duplicate Image Finder Action Tab

Duplicate Cleaner Pro (ver 4.1.1)

This product from UK firm Digital Volcano includes duplicate detection for photos, audio files, and documents using either image, document or audio modes. Exact duplication can be estimated from a range of file metadata and from checksums of the binary content. Detection mode includes a variable similarity threshold for document and image scanning modes.

Identification of similar but not identical text documents is a major feature only found in a few consumer products (notably FindAlike). However, the program does not detect PDF versions of a Word document as being identical, and it does not identify Word documents saved at different times, or with small changes in the text content as being similar, even with a 10% similarity threshold. It appears that the term similar file content does not refer to the text content of documents.

The product bears some of the hallmarks of feature creep - there is very extensive functionality available, but not all of it is adequately documented, although the help and support facilities look very good, with an online forum available for problem resolution. Some experimentation is needed to use the product effectively, which may put off users without the inclination to explore and experiment with software. An example of the detail available in the Search (or rather matching) criteria is shown below.

Duplicate Cleaner Pro Search Criteria Window

The fixed image similarity categories of Very Close, Good and Loose correspond to similarities of 97%, 88% and 65%, but the method for estimating these is not specified. It is likely to be the same as that used by dupeGuru, where small changes in position have a dramatic effect on the similarity measure, as shown below.

Pairs of Images and Similarity Range

All of the above pairs of images would be rated as very similar by a human viewer but are not by the similarity algorithm.
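Why a small change in position can matter so much to a pixel-by-pixel comparison can be seen in a toy example. The sketch below is not the algorithm of any reviewed product (which, as noted, is unspecified); it assumes a simple greyscale grid, a hypothetical `tolerance` for deciding whether two pixels match, and a wrap-around shift standing in for a slight camera movement.

```python
from typing import List

Grid = List[List[int]]  # greyscale values, 0-255


def match_fraction(a: Grid, b: Grid, tolerance: int = 16) -> float:
    """Proportion of positions where the two grids differ by <= tolerance."""
    total = sum(len(row) for row in a)
    matches = sum(
        abs(pa - pb) <= tolerance
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )
    return matches / total


def shift_right(grid: Grid, pixels: int = 1) -> Grid:
    """Shift every row right with wrap-around, mimicking a small camera movement."""
    return [row[-pixels:] + row[:-pixels] for row in grid]


# A 16 x 16 grid of one-pixel-wide vertical stripes (railings, say),
# and a featureless grey wall of the same size.
stripes = [[255 if x % 2 == 0 else 0 for x in range(16)] for _ in range(16)]
wall = [[128] * 16 for _ in range(16)]

print(match_fraction(stripes, stripes))               # 1.0: identical
print(match_fraction(stripes, shift_right(stripes)))  # 0.0: same scene, moved one pixel
print(match_fraction(wall, shift_right(wall)))        # 1.0: no detail, so the shift is invisible
```

The result depends heavily on image content: finely detailed scenes are penalized by small shifts, while smooth scenes are not.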

Processing speed for image similarity is moderate - about 5 images/sec. Exact match processing is much faster. No estimate of time remaining for a scan is given after it has started.

Groups of images clustered by Duplicate Cleaner Pro are shown via a separate button, and different groups can be scrolled through, and files marked for deletion, movement or renaming. Folders with similar content can also be identified.

Actioning files within duplicate or near-duplicate clusters is well supported, with a number of options for deciding which files to action within a cluster group, and for actions to take, which include deleting, moving, copying and replacement by a link. The identification of folders with duplicated content is particularly useful. However, the sorting of files and folders by size, which is very useful in this process, does not work.

Despite these limitations, Duplicate Cleaner Pro offers a wide range of functionality at a reasonable price (list price A$49, or US$35) and seems to have been rewarded by over 2 million downloads. It offers a free trial period, but with some performance limitations.

PictureEcho (v 2.0)

PictureEcho comes from Sorcim (Pvt) Ltd, a Pakistani company in Rawalpindi which offers a number of de-duplication and data management applications. PictureEcho claims to 'perform a human-like analysis of visually similar images'. Registration of the program costs US$39.97 per year, but there is no indication of what facilities registration makes available. The unregistered version may be limited in some fashion, but the limitations are not stated.

Whilst the Exact Match option detects identical images adequately, the Similar Match provides four options, three of which group images solely on the basis of the differences between times of image capture. Scanning with these options is very fast. The fourth option does not include time comparisons and appears to use some form of image analysis. With this option, the scanning operation is much slower and the results are unimpressive.

Image pair found similar using PictureEcho Image Analysis (left) and image pairs not found similar using Image Analysis (middle, right)
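Grouping by time of capture alone, as the first three Similar Match options appear to do, needs no image analysis at all, which would explain its speed. A minimal sketch follows; PictureEcho's actual rules are not documented, so the `max_gap` parameter and the grouping rule (start a new group whenever the gap to the previous photo exceeds it) are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Photo = Tuple[str, datetime]  # (file name, capture time)


def group_by_capture_time(photos: List[Photo], max_gap: timedelta) -> List[List[str]]:
    """Group photos taken in quick succession, splitting wherever the gap exceeds max_gap."""
    groups: List[List[str]] = []
    previous: Optional[datetime] = None
    for name, taken in sorted(photos, key=lambda photo: photo[1]):
        if previous is None or taken - previous > max_gap:
            groups.append([])  # gap too large: start a new group
        groups[-1].append(name)
        previous = taken
    return groups


# Hypothetical capture times: a burst of three shots, then one much later.
photos = [
    ("a.jpg", datetime(2018, 7, 1, 10, 0, 0)),
    ("b.jpg", datetime(2018, 7, 1, 10, 0, 5)),
    ("c.jpg", datetime(2018, 7, 1, 10, 0, 8)),
    ("d.jpg", datetime(2018, 7, 1, 10, 30, 0)),
]
print(group_by_capture_time(photos, max_gap=timedelta(seconds=10)))
```

The weakness is that the grouping says nothing about content: two quite different scenes shot seconds apart are grouped, while the same scene revisited an hour later is not.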

PictureEcho may be useful if near-duplicate status is indicated by the time difference between images, but its image analysis near-matching lacks control over the degree of similarity between images. The product is not recommended.

Summary of Similar Photo Software

Rating scale: 1 (Poor), 3 (Average), 5 (Excellent).

Note that performance on the test image pair does not necessarily reflect performance on other images, as the false positive/negative rate will depend on the nature of the images being matched.

| Product | Cost | Interface Quality | Speed | Performance on Test Images | Notes |
| --- | --- | --- | --- | --- | --- |
| dupeGuru | Free | 2 | 1 | 1 | No built-in viewing of matches |
| Similar Image Finder | Free | 2 | 4 | 4 | No actioning |
| Duplicate Photo Cleaner | US$49.90 | 5 | 5 | 3 | Simple actioning & operation |
| SimilarImages | Free | 1 | 4 | 1 | Complex actioning, hangs on some folders |
| Find.Same.Images.OK | Free | 1 | 3 | <2 | Idiosyncratic interface |
| Visual Similarity Duplicate Image Finder | US$24.95 | 3 | 4 | 5 | Complex interface |
| Duplicate Cleaner Pro | US$35 | 4 | 2 | 3 | Includes audio and document exact matching. Exploration and experimentation needed. |

Overall, Duplicate Photo Cleaner would be the recommended product, but you have to be prepared to pay the license fee. It tends to give false negative results, but this can be overcome by multiple passes, first with a high threshold and then with a lower one to pick up other matches. Its interface is simple and well-designed. The free products have poor interfaces and require some patience from the user. Of these, SimilarImages is probably the best, but it hangs on some folders. Duplicate Cleaner Pro includes matching for audio and exact matching for documents at an attractive price. Its interface is comprehensive but may be daunting for a naïve user.

Finding Similar Text Documents

Software for detecting similar text documents is much less common than for photos. Currently, this capability is most commonly used in legal discovery, and many software packages intended for this purpose include some capacity for finding such documents. These packages are not generally available for download and test. The area is of considerable research interest as one of the frontiers of Artificial Intelligence, and there are many papers on methods of similarity estimation.

The task of finding the latest version of a document is straightforward if all documents are always stored in a document management system, but ‘off-system’ storage and processing frequently occur, making the latest version in the document management system not necessarily the actual latest version.

There appears to be only one similar text document detection product targeted more broadly than legal discovery and available for download and test.

FindAlike

FindAlike is a product from Aleka Consulting, an Australian company. It costs $89 for a single user license and downloads have a 30-day evaluation period. FindAlike operates by creating a document vector from the text content of each document and matching these vectors to estimate similarity and detect clusters of similar documents. Document creation and movement on local and shared filesystems are tracked using Microsoft Windows Indexing. FindAlike comprises a standalone component and an Office Add-in. When using the Office Add-in, files with text similar to the text of the currently open document are displayed, together with their Modified date, allowing easy detection of more recent versions of the open document. The standalone component allows selection of any file as the target for similarity matching. Both components support tagging (manual and automatic, based on content) and search, and can suggest a container destination if used in conjunction with a document management system. Where similar files are attached to emails, the email sender and recipient are shown.

FindAlike features adjustable similarity tolerance and its scanning of disk storage can include local and network drives. The network drives do not necessarily have to be running a Windows operating system. It also provides indexed search over these drives (and local emails).
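The general idea of matching document vectors can be illustrated with a simple word-count vector and cosine similarity. This is a common textbook technique chosen for the sketch; how FindAlike actually builds and compares its vectors is not stated in this article, and the function names and sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def document_vector(text: str) -> Counter:
    """Represent a document as counts of its lower-cased words."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (1.0 means same word mix)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


draft = "The quarterly report shows sales rising in all regions"
revised = "The quarterly report shows sales rising in most regions"
unrelated = "Remember to water my tomato plants every evening"

print(cosine_similarity(document_vector(draft), document_vector(revised)))
print(cosine_similarity(document_vector(draft), document_vector(unrelated)))
```

Because the comparison works on extracted text rather than on the file's bytes, a Word document and its PDF version would score as near-identical under a scheme of this kind, provided the text can be extracted from both. A similarity tolerance then amounts to a cut-off on this score.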

FindAlike Similar Files Results

FindAlike Word Add-in Interface

This article is accurate and true to the best of the author’s knowledge. Content is for informational or entertainment purposes only and does not substitute for personal counsel or professional advice in business, financial, legal, or technical matters.
