Is De-Cluttering Your Electronic Storage Worth It?
More storage or clean up?
In 1958, Professor C. Northcote Parkinson observed that ‘Work expands to fill the time available for its completion’. A digital derivative of this adage is that ‘Data expands to fill the storage space available’. As the marginal cost of storage moves towards zero, so does the incentive to tidy up storage. It’s far more appealing to add a few more terabytes of storage or buy a new, more capacious piece of hardware than to address the task of de-cluttering by deleting or archiving unwanted files.
So why should you bother tidying up your storage? It will:
- Save you the trouble and expense of a hardware upgrade.
- Improve the performance of your hardware if its storage is almost full.
- Help you locate data – it’s easier to find one photo out of 100 rather than one out of 10,000.
With search now ubiquitous, if you want it, you can find it with a few clicks - or so vendors would like you to believe. Search works excellently on text but has a long way to go on photos and videos, which constitute the bulk of domestic storage use nowadays. Time ordering, geo-location and automatic image analysis, such as that provided by Google Photos, certainly help in this task and will improve over time, but their limitations become apparent when you start on an actual task. Reducing the number of files you have to look through will always help.
Once you’ve decided to clean up, how do you do it? With their focus on apps, mobile phones give the storage use of files associated with an app and the app itself. This makes it obvious what is using most of the space and the answer is almost always photos and videos. Going through these and deleting those you don’t want to keep gives a quick and easy increase in space. If you want to archive them, first connect the phone to a computer. Photos and videos can usually be seen directly and copied to archive media before deleting them. For iPhones connecting to PCs, you will need to have iTunes installed and running.
Tablets are half-way between mobile phones and PCs. They provide a view of the internal folder structure and an analysis of storage use from a built-in app. Budget tablets often have little storage space (as little as 8 Gbytes) and although they provide extra storage potential via an SD card socket, this storage is not usable for all purposes in the way that the built-in storage is. The amount of space used by apps is often significant and the easiest savings may be from removing unwanted apps. Use of cloud storage can relieve storage pressure on your tablet or phone, but you will be using your mobile data allowance every time you retrieve a file and this can end up being quite costly.
Modern PCs often come with a terabyte or more of rotating disk storage, but solid-state storage is becoming more widespread, especially for laptops where its use can result in a slimmer device. Solid-state storage provision is generally less than for rotating disks, but half a terabyte is now not uncommon, although older devices had much less. PC operating systems have grown over the years, with Windows 10 64 bit now requiring 20 GBytes, but storage provision has increased much more, so if you’re short of space on a modern PC, the major culprits will be applications and media files. The Windows Disk Cleanup utility does a good job. Windows upgrades often leave large files behind, which are identified by Disk Cleanup as possible candidates for removal.
Right-clicking on the OS (C:) icon in This PC shows the amount of space free on the C: drive:
If Disk Cleanup doesn’t offer much in the way of savings, then Windows Search can locate large files using the Size: gigantic option as shown below.
The results of a search with a size filter are not ordered. If you want to see files ordered by descending size, the entire search will be run again, which may take some time. Large files tend to take up most of the space in electronic storage, so identifying these can be very helpful. The largest files are frequently C:\pagefile.sys and C:\hiberfil.sys, the Windows paging and hibernation files. A paging file managed by Windows can grow as memory demand rises and returns to its initial size when the system restarts, so simply restarting your computer may free up several gigabytes of space.
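For readers comfortable with a script, a sorted list of the largest files can be produced in a single pass. The following is a minimal sketch in Python rather than a feature of Windows Search; the function name `largest_files` and its parameters are illustrative assumptions.

```python
import heapq
import os


def largest_files(root, count=10, min_bytes=0):
    """Return up to `count` (size, path) pairs under root, largest first."""
    found = []
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # skip files that vanish or cannot be read
            if size >= min_bytes:
                found.append((size, path))
    # nlargest returns the entries sorted in descending order of size
    return heapq.nlargest(count, found)
```

Calling `largest_files("C:\\Users", count=20, min_bytes=100 * 1024**2)` would list the twenty largest files of at least 100 MB under that folder. The threshold is arbitrary and is not the same as the Windows Search size categories.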
If you see a large file that seems to be a candidate for deletion, type its name into Google before you remove it to see what its purpose is. Removing it may have unintended consequences.
Windows offers the option to compress large files to save storage space – folders with their contents compressed are shown in blue. Compression may result in large space savings for log files and files with a high proportion of repeated content, but savings will be much smaller for media files, which are often in compressed format anyway.
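The difference in savings can be illustrated with a short experiment. This sketch uses Python's `zlib` module as a stand-in, which is an assumption: it is not the algorithm Windows uses for folder compression, and the sample log line is invented. Random bytes stand in for media that is already compressed.

```python
import os
import zlib


def compression_saving(data):
    """Return the fraction of space saved by compressing data (0.0 to 1.0)."""
    if not data:
        return 0.0
    compressed = zlib.compress(data)
    # Incompressible data can grow slightly, so clamp the saving at zero
    return max(0.0, 1 - len(compressed) / len(data))


# Highly repetitive content, similar in character to a log file
log_like = b"2024-01-01 12:00:00 INFO request served in 12 ms\n" * 2000
# Random bytes behave like already-compressed photos or videos
media_like = os.urandom(len(log_like))

print(f"log-like saving:   {compression_saving(log_like):.0%}")
print(f"media-like saving: {compression_saving(media_like):.0%}")
```

The repetitive data shrinks to a small fraction of its original size, while the random data does not shrink at all.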
If you have folders containing large numbers of non-gigantic files which consume significant space, these folders can be difficult to identify with native Windows functionality. However, there are a number of applications which can help you find these. TreeSize Free gives a very rapid overview of the amount of space used by folders:
Clicking on any of the folders shows the space usage of subfolders. This application can rapidly home in on just what is consuming your storage space. The Professional version (costing 46.95 Euros) includes actioning, age profiles and many other features.
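The kind of per-folder summary described above can be approximated with a short script. This is a rough sketch of the idea only and says nothing about how TreeSize itself works; the function name `folder_sizes` is an assumption made for illustration.

```python
import os


def folder_sizes(root):
    """Map each immediate subfolder of root to the total bytes beneath it,
    ordered from largest to smallest."""
    with os.scandir(root) as entries:
        subfolders = [e.path for e in entries if e.is_dir(follow_symlinks=False)]
    totals = {}
    for sub in subfolders:
        total = 0
        for folder, _dirs, names in os.walk(sub):
            for name in names:
                try:
                    total += os.path.getsize(os.path.join(folder, name))
                except OSError:
                    pass  # unreadable or vanished file: leave it out
        totals[sub] = total
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))
```

Running the function again on the largest subfolder mirrors the drill-down obtained by clicking on a folder in the application.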
WinDirStat is a free open-source application giving similar data, but with the addition of visualization of file and folder size:
Networked PCs and Shared Storage
Most workplaces now have networked PCs, usually with some storage shared between all users, often as a group drive, or drives, which are accessible to all users or a group of users, and Home drives, accessible only to individual users. Storage quotas may be applied to restrict the amount of space available for group and home drives. A common scenario is that only shared storage is backed up. Storage on local machines may or may not be accessible to individual users. Making shared drives the only accessible storage can ensure that all documents created and stored are backed up. The role of individual PCs then becomes similar to that of the ‘dumb terminals’ which were widely used before the advent of the PC and which had no local storage.
File ownership may be problematic on shared storage. The tie between file ownership and Windows accounts on file servers means that when accounts are removed as staff leave, ownership of large numbers of files may become indeterminate.
Cloud-based systems have the advantage that capacity can be easily increased, and content is available via the Web. Both of these features are attractive to organizations but come at a significant cost, both for software licensing and data movement.
Wherever shared storage exists, its management is the responsibility of IT staff rather than individual users. Management is often by exhortation, often on the lines of “The G: drive is 98% full. Can users please remove any unneeded files”. Such exhortations may result in massive amounts of time being wasted by users as they examine small files whose removal will only minimally reduce storage demand. Many users have no idea of file size, further complicating management. Storage quotas may curb the profligate use of shared storage by the handful of users with large holdings, but most users have very small shared storage usage, so a quota policy makes poor use of available capacity and may result in users storing important documents outside the backup umbrella on local drives or removable storage.
Another approach to shared storage management is to remove all files, or files which are commonly large (such as Microsoft Access databases or media files), at a specified date and to restore only those that users ask to have restored. This process is certainly effective but may cause considerable disruption and reveal problems with the backup/restore process.
Given the difficulties of managing shared storage, the path of expanding capacity is usually taken. The only pressure to clean up usually comes from legal departments of organizations who are concerned about liability – if an organization becomes involved in legal action they may be required to disclose all the relevant documents in their possession, whether or not they were required to retain them. The default policy of ‘keep everything forever’ can lead to increased legal exposure and is one of the drivers for the introduction of document management systems, where disposal of documents which no longer need to be retained is straightforward.
Document Management Systems
Document management systems may also be used to store files and may run into storage limitations. These systems are often cloud-based, sometimes completely replacing the user desktop so that all documents and communications are stored automatically. Document ownership and permissions can be managed more effectively in most document management systems, but performance may be poor for large media files, which are increasingly common. Migration from a shared drive to a document management system may be problematic due to difficulty in mapping permissions. File and folder naming rules also may be much more restricted in a document management system. Poor performance may result in an increase in ‘off-system’ processing which may negate the advantages of the document management systems.
However, one virtue of document management systems is that they record the check-out dates of files by users, making it possible to record file usage. The first check-in date can be used to set the start of a retention period (or sentence) for the files, making it possible to implement a policy to remove documents in a particular category after a set period of time. This automatic removal process means that storage problems are less likely to arise, as files whose retention period has expired can be detected easily and removed. However, storage for document management systems is commonly in a database, which requires much higher performance than a shared disk drive and capacity addition is likely to be much more expensive.
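The retention logic described above amounts to adding a per-category period to the first check-in date and comparing the result with today's date. The sketch below is an illustration under stated assumptions: the category names, retention periods and document names are invented, and a real document management system would expose this through its own interface.

```python
from datetime import date


def add_years(start, years):
    """Return the date `years` after start, mapping 29 February to 28 February
    when the target year is not a leap year."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return start.replace(year=start.year + years, day=28)


def expired_documents(documents, retention_years, today):
    """Return the names of documents whose retention period has ended.

    documents: mapping of name -> (category, first check-in date)
    retention_years: mapping of category -> retention period in years
    """
    expired = []
    for name, (category, first_check_in) in documents.items():
        years = retention_years.get(category)
        if years is None:
            continue  # no policy for this category, so keep the document
        if add_years(first_check_in, years) <= today:
            expired.append(name)
    return sorted(expired)
```

Documents in a category with no policy are deliberately kept, on the assumption that an unknown retention requirement is a reason to retain rather than to delete.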
Duplication and De-Duplication
A common concern of many computer users is file duplication or the retention of multiple copies of the same electronic document. The most exact definition of duplication is that duplicate files contain the same pattern of bits. Duplication can be simply established by calculating the checksum of the file binary content and comparing the checksums of two files. Duplicated files will have the same checksum.
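Checksum comparison is straightforward to script. The following is a minimal sketch, assuming SHA-256 as the checksum (the article does not name one) and using illustrative function names. Files are first grouped by size, since files of different sizes cannot be exact duplicates, and only the remaining candidates are read and hashed.

```python
import hashlib
import os
from collections import defaultdict


def file_checksum(path, chunk_size=1024 * 1024):
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return clusters (lists of paths) of files with identical content."""
    by_size = defaultdict(list)
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                continue
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        for path in paths:
            try:
                by_hash[file_checksum(path)].append(path)
            except OSError:
                continue
    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

The script only reports clusters. Deciding which copy in each cluster to keep remains a manual task.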
Given that most files stored are small, high levels of exact duplication seldom affect storage use. In the author’s experience of profiling shared storage in commercial, government and not-for-profit organizations, removing all duplicate files rarely yields storage reductions of more than 15%. As the numbers of duplicate clusters of files are extremely large, deciding which files in a cluster to keep and which to delete is a highly laborious task yielding only a small increase in available storage.
A complication of de-duplication is that humans may perceive as duplicates electronic documents which do not have the same bit pattern. Two photos taken from a hand-held camera of the same scene will have different pixels due to the camera position being slightly different for each, and thus will not have the same bit pattern. The same Word document saved by two different people will have different bit patterns, as Word stores metadata about the user and time of saving inside the file. A PDF document containing electronic text will not have the same bit pattern as a scan of the same document, and two different scans of the same document will not have the same bit pattern due to different placement of the original on the scanner.
DupeGuru is a sophisticated free program for identifying duplicate files and folders. It uses file names and checksums for comparisons. It does not flag as duplicates files with the same text content but different bit patterns. DupeGuru can detect folders with duplicated content, which can easily be created and are the “low-hanging fruit” of de-duplication. Picture mode addresses the problem of detecting visually identical photos with different bit patterns by creating a very low-resolution version of the photo and comparing pixels to give a percentage match.
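The general idea of comparing very low-resolution versions of two pictures can be sketched in a few lines. This is not DupeGuru's actual algorithm, which the article does not describe in detail. It is an illustration that assumes each image has already been decoded into a grid of grayscale values from 0 to 255, and the function names are hypothetical.

```python
def shrink(pixels, blocks=4):
    """Average a grayscale image (a list of equal-length rows) into a
    blocks x blocks grid of mean brightness values."""
    height, width = len(pixels), len(pixels[0])
    if height < blocks or width < blocks:
        raise ValueError("image is smaller than the requested grid")
    small = []
    for by in range(blocks):
        y0, y1 = by * height // blocks, (by + 1) * height // blocks
        row = []
        for bx in range(blocks):
            x0, x1 = bx * width // blocks, (bx + 1) * width // blocks
            cell = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(cell) / len(cell))
        small.append(row)
    return small


def percent_match(image_a, image_b, blocks=4):
    """Return a 0-100 similarity score from the low-resolution versions."""
    small_a, small_b = shrink(image_a, blocks), shrink(image_b, blocks)
    difference = sum(
        abs(a - b)
        for row_a, row_b in zip(small_a, small_b)
        for a, b in zip(row_a, row_b)
    )
    return 100 * (1 - difference / (255 * blocks * blocks))
```

Because each cell is an average over many pixels, small differences such as sensor noise or a slight change in exposure barely affect the score, whereas two exact-duplicate checksums would differ completely.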
Files which are no longer being accessed are a much greater problem than duplication. From the author’s experience over many types of shared storage and domestic computers, 50% of files are likely to have a Modified date more than 3 years before the current date. In one instance, 50% of files had a Modified date of more than 8 years before the date of scanning. An example of file date and count profiles is shown below:
Unfortunately, the Modified date of a file does not indicate the date at which it was last accessed by a user, only the date at which it was last changed. Examples of files which are in frequent use but not changed include office floor plans, document templates, and logos. These may have old Modified dates but are frequently used as read-only files. Any action to increase storage space by removing files with old Modified dates runs the risk of removing such files unless its application is restricted to folders where these files are unlikely to be found. However, the volume savings are substantial: removing files with a Modified date before May 2014 in the example shown above would save 50% of storage volume.
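Before choosing a cut-off date, it helps to measure what share of files and of storage volume falls before it. The following is a minimal sketch with an illustrative function name; it reads only the Modified date, so it inherits the limitation described above for read-only files that are still in regular use.

```python
import os
import time


def age_profile(root, years=3, now=None):
    """Return (share of files, share of bytes) whose Modified date is more
    than `years` before `now` (a timestamp; defaults to the current time)."""
    now = time.time() if now is None else now
    cutoff = now - years * 365.25 * 24 * 3600
    files = old_files = total_bytes = old_bytes = 0
    for folder, _dirs, names in os.walk(root):
        for name in names:
            try:
                info = os.stat(os.path.join(folder, name))
            except OSError:
                continue  # skip files that cannot be examined
            files += 1
            total_bytes += info.st_size
            if info.st_mtime < cutoff:
                old_files += 1
                old_bytes += info.st_size
    return (
        old_files / files if files else 0.0,
        old_bytes / total_bytes if total_bytes else 0.0,
    )
```

Running the function with several values of `years` gives a coarse age profile, and restricting `root` to folders unlikely to hold templates, floor plans or logos reduces the risk of archiving files that are still needed.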
If local email archive file storage is used, the opposite problem may occur: email archive files can be very large, and their Modified date is updated to the current date each time the Mail application checks for new mail.
Files do store a Last Accessed date, but this access date may be set from a backup program, virus scanner or even the operating system as well as the parent application. This makes it of little use for storage management.
If you want to do more than guess an age threshold for an archive-by-modified date policy, the FolderSizes program provides comprehensive filesystem analysis, including a file age histogram as shown below:
FolderSizes also provides a visualization of folder size similar to WinDirStat and many other useful displays for storage analysis and management. FolderSizes costs US$60 for a single user license, with discounts for multiple users. It has a 15-day free evaluation period.
To De-clutter or not to De-clutter?
If you are a domestic PC user seeing the message “There is not enough disk space to complete this operation” then de-cluttering is probably the best way to go. On Windows PCs, Disk Cleanup should be your first step, followed by some of the actions described in this article if required. With terabyte removable disk drives readily available at very modest cost, deleting some files or moving them onto removable storage is the way to go if you’re unable to create enough space.
If you manage a collection of networked PCs in an organization with shared storage running out, any de-cluttering action needs to be carefully thought out to minimize disruption to users and avoid discouraging them from using backed-up storage. If a disruption occurs, its cost can exceed the cost of expanding storage capacity. If a security breach occurs from increased use of off-system storage or processing, the consequences may be serious.