Digital Preservation for Future Genealogists
Digital storage of anything is remarkably cheap and efficient: the price paid for this is the separation of the digital storage medium from the delivery mechanism, which is an application running on a separate device. Sound recording has lived with this separation since its inception, but the simplicity of the encoding (wobbles in a spiral groove, or magnetization on a wire or tape) meant that machines to decode the recording were fairly simple, even if the fidelity was poor. Anything that's written or printed on paper needs nothing other than light to make it accessible, but digital storage of data requires a means to access the bits comprising the data, then an application running on a computing device, and a visual or audio display to make the stored data accessible.
Advice from Archiving Institutions
Cultural institutions whose briefs include archiving sometimes offer advice to individuals on digital preservation, usually concerning file naming conventions and the necessity to keep multiple copies in different locations. Examples of such advice can be found at the USA's Library of Congress, the University of Michigan, and Tufts University. None of the advice appears to address the contemporary issue of image display devices not providing access to file names or the difficulty of maintaining access to cloud storage.
The rapid evolution of storage media and applications makes it difficult to preserve anything digital for more than a decade or so. Floppy discs, Zip drives, and DAT tapes were once commonplace, but now retrieving any data on them is a specialized and costly activity. CDs and DVDs will probably go the same way - whilst Blu-Ray disks are backward-compatible with DVDs and CDs, there's no guarantee that future storage media will be.
The most 'future-proof' storage medium is probably the one with the least electronics. Optical storage on a DVD meets that criterion. However, there are still choices to make: single layer or double layer, archival or standard discs. These are discussed in more detail later.
The best archival storage format is simply the most widely used one which meets any technical requirements, as any migration tools needed will be most readily available for such formats. For images, this means JPEG; for videos, MP4; and for sound recordings, MP3. Microsoft Word and Adobe PDF are the two commonest document formats. From an archival perspective, Word is superior, but PDF is a fact of life and conversion from it to other formats relies on performing Optical Character Recognition (OCR) on each page, which can give poor results.
Word processing was one of the first computer applications. Once there was a slew of them: Wikipedia lists 63 in its historical section. Each had its own vociferous adherents. Now the choice is often Microsoft Word or Microsoft Word. To its credit, Word is able to read a wide variety of document formats, even though the capability was probably part of its plan to gain market share.
Word has also subsumed many of the capabilities of its rivals, to the extent that many of the features requested for Word are actually present already - the problem is that users can't find them. Although Word's capacity to read legacy formats was reduced with the introduction of Office 2007, it can still read WordPerfect files. Fortunately, older formats are generally unsophisticated and the task of extracting text, if not formatting, from obsolete word processing documents is not difficult.
For applications with a smaller user base, such as video editing, backward compatibility may be considerably worse, with users having to maintain an elderly computer for the sole purpose of running a particular version of an application.
Media File Formats
The massive shakeout of storage formats and applications that has happened for word processing has not extended to media files (images, videos and audio recordings). Hundreds of formats are available but a few have become common. For images, JPEG files tend to be the standard for non-professional digital cameras, including those in mobile devices. This format includes a variable level of compression, which makes it attractive for designers.
Image viewing is now an essential part of all operating systems and applications for this purpose are built into all modern devices. All of them can read JPEG files. As well as image formatting, consideration should be given to adding information about the image - who, when and where. This can be added as file metadata, or more robustly, added to the image pixels as a caption - see this TurboFuture article on the topic. The web page All About Digital Photos - Genealogy deals with the issues of long-term preservation.
For videos, compressed storage is essential, especially as video resolution continues to increase. The compression factor is thus much more important than for images, and the constant evolution of compression algorithms means that storage formats are constantly changing, with manufacturers even changing the storage format between models of the same brand of video camera. Devices capable of recording videos, such as mobile phones, can display videos they have recorded, but general purpose video display applications, such as those built into Windows 10, will fail to decode some of the formats. The MPEG-4 or MP4 format is possibly the most widely used format, readable on nearly all systems. AVI is another popular format.
For sound recordings, the MP3 format has assumed a leading position amongst a host of options, some using compression and some not. The MP3 format is compressed, and for many years audiophiles regarded MP3 recordings as inferior because compression artifacts appear at low bit rates. However, as bit rates have increased, artifacts have diminished to the point where the objections to the format are no longer valid. For home recordings of speech, MP3 recordings using a bit rate of 192 kbps provide more than adequate quality.
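To get a feel for what a bit rate means for storage, the size of a constant-bit-rate recording can be estimated from its duration. The short Python sketch below is illustrative only: the function name is made up for this example, and the estimate ignores the small overhead of file headers and tags.

```python
def mp3_size_bytes(bit_rate_kbps: float, duration_seconds: float) -> float:
    """Approximate size of a constant-bit-rate recording, ignoring tags and headers."""
    bits = bit_rate_kbps * 1000 * duration_seconds  # kbps means thousands of bits per second
    return bits / 8  # 8 bits per byte


one_hour = mp3_size_bytes(192, 60 * 60)
print(f"One hour at 192 kbps is about {one_hour / 1_000_000:.1f} MB")
```

At 192 kbps, an hour of recorded speech comes to roughly 86 MB, so many hours of family interviews fit comfortably on a single DVD.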
The best format for documents is a more complex question. The commonest formats are currently Adobe PDF (Portable Document Format), and Microsoft Word.
The PDF format touts itself as facilitating the presentation and exchange of documents reliably, independent of software, hardware, or operating system. It offers standards for engineering, printing and archival use, can contain embedded images, and has recently added digital signing to its capabilities. The archival option excludes some features which may be difficult to maintain in future. Backed by a large company, the Adobe PDF Reader is widely available and free, and PDF files have become extremely common on Web sites and within organizations.
Although the format is based on an open standard, unreadable PDF files are not uncommon, as a Google search for this phrase will indicate. Organizers of conferences receiving hundreds of PDF files will attest to this. There are many programs which can write, edit and annotate PDF files, and some of them may produce unreadable output files. Microsoft Word has finally acknowledged the prevalence of PDF and offers it as a native save format option. For unreadable PDF files, there is very little that can be done without specialist help.
It would be highly desirable to ensure that PDF files are readable before adding them to a consignment for the distant future, but at present, there does not seem to be any application for checking the accessibility of large numbers of PDF files as a batch.
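In the absence of such a tool, a rough first-pass check can be scripted. The Python sketch below only looks for the `%PDF-` marker near the start of each file and the `%%EOF` marker near the end, which catches empty, truncated or misnamed files. It is not a full validation: a file can pass this test and still fail to open in a reader. The function names are hypothetical and chosen for this example.

```python
from pathlib import Path


def looks_like_pdf(path: Path) -> bool:
    """Rough structural check: PDF header near the start, %%EOF marker near the end."""
    data = path.read_bytes()
    has_header = data[:1024].find(b"%PDF-") != -1
    has_trailer = data[-1024:].find(b"%%EOF") != -1
    return has_header and has_trailer


def find_suspect_pdfs(folder: Path) -> list[Path]:
    """Return the .pdf files under `folder` (including subfolders) that fail the rough check."""
    return [p for p in sorted(folder.rglob("*.pdf")) if not looks_like_pdf(p)]
```

Calling `find_suspect_pdfs(Path("archive"))`, where `archive` stands for whatever folder holds your documents, returns a list of files worth opening by hand before they are consigned to long-term storage. Note that the `*.pdf` pattern may miss files with an upper-case `.PDF` extension on some systems.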
Microsoft Word's internal format went from a proprietary (although well-known) binary format to the Office Open XML format in 2007. This format, used by all Microsoft Office authoring applications for documents with four-letter extensions (e.g. .docx, .docm, .xlsx), is actually a Zip file containing a number of XML files describing the document. If you have a Word or other Office document that you are unable to open or which renders incorrectly, its elements are far more accessible than they are in a PDF file, and for this reason it is preferable as an archive format. There is little advantage to be gained in saving documents created in Word to a PDF/A (Archive) format.
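Because a .docx file is a Zip archive of XML files, its text can be pulled out with general-purpose tools even if Word itself is unavailable. The Python sketch below uses only the standard library and assumes the main text lives in `word/document.xml`, which is where Word normally puts it. The function name is made up for this example, and all formatting, tables and images are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> (paragraph) and <w:t> (text)
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each paragraph in a .docx file, ignoring all formatting."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")  # main document body
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return paragraphs
```

Renaming a copy of a .docx file to .zip and opening it with any unzip program shows the same structure without writing any code.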
What is Data Compression?
Data compression is a means of using fewer bits to represent digital data. Lossless compression means that the original data can be recovered with perfect fidelity. Lossy compression means that the original data cannot be recovered, but if the quality of the compression algorithm is good enough and the degree of compression is small enough, the loss of data is not perceptible or minimally perceptible.
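Lossless compression is easy to demonstrate. The Python sketch below compresses some made-up, repetitive sample text with the standard `zlib` module and confirms that decompression returns exactly the original bytes. Lossy formats such as JPEG and MP3 cannot make that guarantee.

```python
import zlib

original = b"Born 1901, died 1978. " * 200  # repetitive sample data compresses well
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "bytes before,", len(compressed), "bytes after")
print("Recovered exactly:", restored == original)
```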
Should I Scan Documents?
Scanning of documents and images for long-term preservation has many advantages. For color prints, prolonged exposure to light removes the red component of the color, resulting in the green-blue appearance of many color prints displayed in homes and offices. Scanning the original allows the recreation of the original print at modest cost when fading becomes apparent. The colors may not be exactly the same as in the print as scanned, but the result will be much better than a faded original. There is even some scope for restoring missing red elements in the image.
Scanned images can also be shared easily and it may be that recipients' copies are still available when the images on the sender's computer are lost.
Scanners usually offer PDF format as an output option, as the format can accommodate images and a multi-page format. These PDF files have little or no electronic text and there is less likelihood of the files being unreadable than PDFs generated by other means.
However, scanned documents store the pages as images rather than electronic text. Optical Character Recognition (OCR) is required to generate searchable text. Scanners may have this facility built-in and OCR is available from a number of applications running on desktops or as Web applications. The quality of OCR has improved greatly in recent years but high-quality source documents are generally required. If you want to ensure that your scanned PDF files are searchable, you'll need to run them through an OCR program.
Whether a PDF file contains searchable text or not can be determined by opening the PDF file with Acrobat Reader and clicking on a page. If all of it is highlighted, and the cursor changes to an arrow, the text exists only as an image and is not searchable.
If the cursor changes to the text tool, and highlighting only occurs when the cursor is moved, then the document contains electronic text and can be searched.
Storage Media Options for Digital Data Archiving
In the pre-digital era, a shoebox or scrapbook provided a natural way of keeping family photos – additional information could be written on the back of photos or below them on a scrapbook page. Most families have a box of old photos, often dating from the early 20th century, and the photos in it are often very well preserved, thanks to the excellent archival qualities of the paper used in that era and the gelatin silver process used for black and white printing. These prints may be in better condition than color prints from 20 years ago, which frequently stick together in the envelopes they were received in.
Paper prints can have issues from insect attack, fungi, poor archival qualities of paper and glues, and the fading of color images. These can result in the degradation or even destruction of photos but generally, these are less serious than the total inaccessibility which can afflict stored digital data.
For today’s digital images, there are choices to be made for storage format and storage media. Whatever is chosen now is unlikely to be current in 20 years’ time – the challenge is to make it easy to migrate to whatever is current in the future. This approach is taken by government archival institutions, whose brief is usually to make records of Government decisions available in perpetuity. Their approach to preservation is to store at least two copies of each electronic document on a storage system isolated from the Internet in a secure environment. Copies are generally stored in different physical locations. At least one copy is left in its original state and another is updated to whatever format is current. The storage system is updated as required. These institutions have far more resources available than domestic users but the challenges facing both are similar.
How can I Change the Format of Multiple Digital Documents?
If you want to convert digital documents you wish to keep to a format with better archival properties, there are a number of tools that may be helpful. For images, the popular free image editing program IrfanView has batch conversion and rename facilities. For videos, there are a number of batch conversion programs available both as desktop and web applications. A 2018 review can be found at https://www.techradar.com/au/news/the-best-free-video-converter. A similar review of batch audio conversion tools can be found at https://www.lifewire.com/free-audio-converter-software-programs-2622863.
Options for Storage Media
Despite their likely future obsolescence as media with higher data capacities and transfer rates become available, DVD disks, which encode data optically, are not subject to the kinds of failures which can occur with rotating magnetic disk drives. As they are removable media, drivers are only needed for the devices which read them. DVD-R and DVD+R disks can only be written (or burned) once. For these discs, the laser burning process changes the opacity of a dye layer above a reflective metal layer in a small pit to encode bit values.
For rewritable disks (DVD-RW, DVD+RW) the burning process changes the phase of a metal alloy layer and data can be erased after being written. The archival qualities of all types of DVD are described in detail at http://www.cd-info.com/archiving/longevity/index.html. Dual layer disks (DVD-R DL) have two recordable layers within each disk and offer storage capacities of 8.5 GBytes instead of the 4.7 GBytes for single layer disks. This additional capacity may be useful, but the blank discs are more expensive and problems with recording and playback are more common.
For recorded single-layer DVD-Rs a lifetime of up to 30 years is predicted, but with considerable variation between brands. Rewritable DVDs are predicted to have a shorter lifetime, and repeated rewriting can also diminish their performance. This, and their additional cost, make write-once rather than rewritable DVD disks the recommended choice for archival use. Whether the additional cost of using gold rather than aluminum as the reflective layer is justified is not clear, as technological obsolescence is likely to affect the disks before physical degradation does. Use of a reputable brand (such as Verbatim) and avoidance of 'No Name' brands is also recommended. To minimize degradation, disks should be kept out of ultra-violet light, handled carefully and not rewritten excessively. High humidity may also cause damage.
To allow early detection of any problems with DVD writing, any disks created should be read after creation.
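Reading a disc back is most convincing when every file on it is compared with the original. The Python sketch below is one way to do this under simple assumptions: the originals and the burned copy are both visible as ordinary folders, and the function names are invented for this example. It compares SHA-256 checksums and lists any file that is missing from the copy or differs from the original.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large videos do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_files(source: Path, copy: Path) -> list[str]:
    """List files under `source` that are missing from `copy` or differ from it."""
    problems = []
    for original in sorted(source.rglob("*")):
        if not original.is_file():
            continue
        relative = original.relative_to(source)
        duplicate = copy / relative
        if not duplicate.is_file() or sha256_of(original) != sha256_of(duplicate):
            problems.append(relative.as_posix())
    return problems
```

An empty list from `mismatched_files(...)` means every original file was found on the copy with identical contents. The same check can be repeated every few years to detect degradation of a stored disc while the originals still exist.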
External USB drives (using magnetic disks) offer multi-terabyte storage capacity at very low cost but these devices require driver software which needs to be compatible with the operating system of the computer they are attached to. Lack of driver software is a common reason for technological obsolescence of the devices. Mechanical and electronic failures can also give rise to the dreaded “USB device not recognized” message.
Solid state USB drives (also called flash drives or thumb drives) are fast approaching magnetic disk drives in capacity and falling rapidly in price, but they are intended only for data transfer and are unsuitable for long-term storage. Solid state drives installed internally as replacements for magnetic disk drives are generally of high quality and have a lifetime similar to that of magnetic drives. Lower grade solid state storage devices are limited to 3,000 to 5,000 write cycles and may suffer mechanical failure, as well as the failure of components other than the memory itself. Most households have at least one solid state USB drive which is unreadable by the operating system.
Cloud storage is widely advertised as a convenient solution for backup and it can be used for archival storage. Cloud storage is commonly on magnetic disks in a remote data center (possibly in another country), where maintenance and updates are performed by skilled personnel. Many cloud providers (such as Dropbox and OneDrive) offer a free storage quota of a number of gigabytes, with charges applying if the quota is exceeded. This may be adequate for photos, but modern high-resolution video files can be very large. Other cloud providers only offer paid storage, but usually at a lower cost per gigabyte. Data transfer to and from the cloud may be a problem for large volumes if your monthly data quota is exceeded - upload and download speed may be drastically reduced or additional charges incurred.
One risk of cloud storage is the cloud provider going out of business or being taken over. Takeovers may result in increased charges, reduced quotas or even the loss of stored data. However, quotas may also be increased after a takeover, or charges reduced in the face of competition. Outages may result in access to archived data being delayed. There is some risk of the provider being taken offline for legal reasons - this happened in spectacular fashion to the Megaupload service in 2012 over alleged copyright infringement, but there has been no comparable event since then.
Over the long term, loss of access credentials may result in difficulty in accessing cloud storage. It can be difficult to keep access credentials over long periods of time and password complexity requirements may increase.
A more serious problem is the cessation of payments for paid cloud storage. Most cloud storage providers will terminate access if regular payments are not made, and eventually delete stored data if the arrears are large enough. As anyone who has dealt with administering a deceased estate will know, it can be very difficult to establish the ongoing financial commitments of the deceased and the credentials for accessing cloud storage, especially if the death was unexpected. If regular payments are made via a credit card, and the credit card and email accounts are closed, email reminders about overdue fees will not be acted on. This may result in the deletion of archive data stored in the cloud.
While the majority of email messages are ephemeral or of little interest to future generations, email may be included in the digital data which you wish to preserve.
If you use a local email client such as Microsoft Outlook, messages may be kept on the mail server or downloaded to a local archive file when mail is received or sent. The mail server capacity is usually limited, so downloading messages ensures that the mail server capacity is never exceeded. If you leave emails on the server, you will need to download them for archiving. Local archive files may be very large and may be updated each time a mail server is checked for new mail, making backup difficult, but they can be archived in the same way as photos and videos. Large email archive files are prone to corruption, and Microsoft supply a utility (scanpst.exe) for detecting and correcting errors in PST email archive files used by the Outlook email client. Other email clients may store individual messages as individual files.
If a web email service such as Gmail is used and accessed via web browser, the messages themselves are stored on a remote server as cloud storage, not on your local machine. This arrangement is highly convenient but again relies on your credentials for access. Changes in email addresses due to takeovers and mergers may result in loss of access to emails even if the credentials are valid.
The only insurance against loss of web emails is to regularly download your email archive from your web email provider and treat the downloaded file in the same way as your photos and videos. Google advises Gmail users to back up their messages regularly and provides instructions for doing this. The downloaded archive file will require a local email client to read messages.
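Downloaded mail archives are often delivered in the long-established MBOX format, and to the best of my knowledge this is what Gmail's export currently produces. If no email client is to hand, such a file can still be inspected with a few lines of Python using the standard `mailbox` module. This is a minimal sketch: the function name is invented for this example, and headers containing encoded non-English characters are returned undecoded.

```python
import mailbox


def list_messages(mbox_path):
    """Return (date, sender, subject) for every message in an MBOX file."""
    summary = []
    # create=False raises an error for a mistyped path instead of making an empty file
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            summary.append((message["Date"], message["From"], message["Subject"]))
    finally:
        box.close()
    return summary
```

Printing the result of `list_messages(...)` for a downloaded archive gives a quick inventory of what it contains, which is also a useful check that the file is readable before it is put into storage.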
All kinds of activities take place to try and earn money on the Internet - mostly legal and mostly selling goods and services. Services to facilitate selling (like PayPal) go to great lengths to inspire confidence in a transaction with someone or an organization quite unfamiliar to a buyer, quite possibly in another country. Illegal methods of obtaining money have also arisen - these can include fake invoices, fictitious blackmail and attempts to steal credentials by requesting you to go to a fake website (phishing). All of these require some level of gullibility on the part of the user in order to succeed.
Other malware does not require any action by a user - malicious programs to log all keystrokes (such as those used for Internet banking or financial services) and transmit them to a cyber-criminal may be inadvertently loaded by visiting a compromised or malicious web site or opening an email attachment. This is the reason that operating systems such as Windows make installation of any program so involved.
If malware succeeds in evading the detection systems now built into Windows and any other antivirus (or security) programs that you use, one of the most insidious threats is denial of access to your data unless a ransom is paid. This form of malware is known as ransomware and the combination of powerful encryption built into modern computers and the untraceable financial transactions provided by cryptocurrencies such as Bitcoin have made it a very attractive proposition for cyber-criminals. Ransomware usually operates by encrypting data files, such as digital photos, and then charging a ransom to download an application to decrypt them. Most people with large collections of encrypted family photos and documents will simply pay up to regain access.
To minimise the chance of this happening, there are a few simple guidelines to follow:
- Always keep your operating system updated. Many updates are plugging security holes which malware can exploit. It may be tedious to wait while updates run, but recent versions of Windows contain anti-malware software which tries to identify and quarantine malicious software by distinctive bit patterns present in malware. As new malware appears, the signatures are shared amongst the providers of software for detecting malware. This is one of the reasons why updates are so frequent.
- If your operating system does not have anti-malware built in, use a 3rd party application for this purpose.
- Be very suspicious of opening email attachments from an unknown source.
- Back up frequently and keep your backup media disconnected from your computer. Ransomware will encrypt your backups if they are accessible from your computer. Your backup of un-encrypted data is about the only thing which will save you from ransomware. Sometimes the encryption can be broken by a security company wizard, which will then release a decryption application, but this does not always happen.
What Storage Media Should I use for Digital Data?
For modest storage volumes that can be accommodated on a manageable number of discs, DVDs probably represent the best local storage option for domestic users, as long as DVD reading and writing equipment is readily available. The custodians of the discs need to be aware of their potential obsolescence and be prepared to copy data onto a different storage medium if necessary. DVD disks will generally survive immersion in dirty water in case of flooding, as the data is stored in reflective pits inside the polycarbonate body of the disk. However, they will be destroyed by fire unless they are kept in a fireproof safe. High levels of humidity and dust may also affect readability.
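Whether the number of discs is "manageable" is a matter of simple arithmetic. The Python sketch below uses the nominal capacities quoted earlier (4.7 GBytes single layer, 8.5 GBytes dual layer) and assumes files pack perfectly onto discs, so the real number may be slightly higher. The function name and the 60 GB collection are hypothetical.

```python
import math

SINGLE_LAYER_BYTES = 4.7e9  # nominal capacity of a single-layer DVD
DUAL_LAYER_BYTES = 8.5e9    # nominal capacity of a dual-layer DVD


def discs_needed(total_bytes: float, disc_bytes: float = SINGLE_LAYER_BYTES) -> int:
    """Number of discs for one copy of the data, assuming files pack perfectly."""
    return math.ceil(total_bytes / disc_bytes)


print(discs_needed(60e9))  # a hypothetical 60 GB photo collection
```

For the 60 GB example this gives 13 single-layer discs per copy, and keeping copies in two locations doubles that figure.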
If you have terabytes of data, then a removable USB drive with a magnetic disk may be the best storage option. Use of a popular brand and type makes it likely that drivers will be available in future operating systems. Only have the drive connected to the computer while copying data, as permanent connection means that ransomware on the host computer may be able to encrypt your archive data. Keeping the drive unpowered when not in use will minimize the risk of mechanical and electronic failure. How long USB disks last when kept mostly unpowered is an occasionally asked question in technical forums. One response to such a question on SuperUser observes that the magnetization in the disk platters decays over time with a half-life of about 70 years, eventually rendering the disk unreadable, and suggests rewriting the data every few years, but another response disagrees with this. Other responses give personal experiences with long-term storage issues.
Removable USB drives using magnetic disks will not work after immersion in water, as they have internal moving parts and circuitry, but data on them is generally recoverable by specialist service providers. Like DVDs, removable drives will be destroyed by fire unless they are kept in a fireproof safe. Data on fire-damaged USB drives may be recoverable, depending on the degree of damage.
Cloud storage provides geographic diversity and insurance against physical destruction of storage media, as can happen through fire or flood. Its use, in conjunction with storage on physical media, probably represents the best long-term solution for domestic users, but careful storage of access credentials is required, together with robust payment arrangements if paid cloud storage is used. Google's Inactive Account Manager allows stored data to be downloaded via a trusted email address after a period of account inactivity. Physical objects, such as storage media, are much more easily kept over a period of decades than intangible items such as credentials.
Storing any data on a cloud platform does mean that you lose absolute control of it: your data can be accessed by cloud storage staff and potentially by Government agencies or hackers. Whilst this scenario is not particularly threatening for family photos, there may be circumstances in which you don't want anyone you haven't authorized to see your data. In this situation, avoid using cloud storage.
Help! I Can't Read From my Storage Device.
DVD reading and writing is usually the first functionality to be lost from laptops, due to the mechanical demands on the DVD drive of tracking the very narrow data tracks on optical storage media and the restricted space available for the drive in a laptop. If DVDs cannot be read, this may be the cause rather than DVD degradation. Try reading the disc on another machine or purchase an external DVD drive which can be connected via a USB port.
For other problems, type your question into a Web search engine and you may find a way of dealing with your problem. If you don’t find one or don’t feel able or willing to do what it suggests, search for “data recovery”. Companies specializing in this area may be able to help, but their services aren’t cheap. If your data is on a medium no longer widely supported (such as a floppy disc or Zip drive at present) you will probably be able to find a company which will copy data onto a device or medium that you can read from.