File Formats and Preservation

File formats are the rock stars of digital preservation.  After all, one of the goals of digital preservation is to prevent a loss of access to files due to file format obsolescence.  If you are using a file format migration strategy for preservation, then you will be refreshing the digital files over time to keep the content stored in formats that are readable by the current technology.  If you are practicing a software emulation strategy for preservation, then you are maintaining software that will be able to read the old file formats.

When a digital object is deposited into a digital repository, its file type is declared by its extension (.jpg, .pdf, etc.).  The type of file you are dealing with has big implications for how preservation practices can be applied to it now and in the future.  This is because being able to access the contents of a digital object depends on the ability to store, read, and edit the digital files – actions governed by the file format’s specification and the software needed to understand that format.  The specification is a description of the file format that includes its basic building blocks and technical byte-by-byte descriptions of the format’s layout.  Cornell’s digital preservation tutorial says a bit more about it.
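An extension is only a declaration, though; the bytes inside the file are what the specification actually describes.  As a toy illustration (this is my own sketch, not part of any repository software, and the signature table covers just a handful of common formats), here is how a tool might check a file’s leading “magic bytes” instead of trusting its extension:

```python
# Minimal, illustrative file-format sniffer: compares a file's leading
# bytes ("magic numbers") against a few well-known signatures.
SIGNATURES = {
    b"%PDF-": "PDF document",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"II*\x00": "TIFF image (little-endian)",
    b"MM\x00*": "TIFF image (big-endian)",
}

def identify(path):
    """Return a format name based on magic bytes, or None if unknown."""
    with open(path, "rb") as f:
        header = f.read(16)  # longest signature above is 8 bytes
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None
```

A PDF renamed to end in .jpg would still identify correctly here, which is exactly why preservation tools lean on signatures rather than extensions.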


Photo: dinosaur bones, by Charles Tilford, CC license

As you know, when a software program creates a file, the program can re-open the file to view it, edit it, and so on.  This is because the program knows the file format’s specifications and was designed to work with them.  As software programs get upgraded or disappear, the ability to read the files they created becomes riskier.  Software upgrades happen all the time, and it is usually possible to open a file created with the previous version of a program.  But over time and numerous updates, this might not always be the case.  And it certainly won’t be possible once the software stops being upgraded and eventually can no longer run on new machines.

To illustrate this point, let’s look at the old Mac program MacPaint, a basic painting program that shipped with Apple computers from 1984-1988.  Files created with this program were “MacPaint bitmap images” and received the extension .mac (there were a few other extensions for this format, but let’s focus on this one).  MacPaint won’t run on modern machines, and hardly any programs from after 1988 were designed to read this format.  So .mac files became orphaned, and the only way to read them was to boot up an old machine with MacPaint on it.  (Happily, Apple released the source code of MacPaint to the Computer History Museum, meaning that with a little work these files are readable.)

Open & Proprietary Formats

But we’ve come to an interesting juncture in this discussion.  File formats can be grouped into two categories: open and proprietary.  Open file formats are those whose specifications are publicly available.  When this information is available, programs other than the one that created a file can be made to interpret the file’s format (or migrate an old file into a newer format), and we are not dependent on the original program.  This promises greater longevity for the file in its original format.  Some open file formats that I’m sure you’ve come to love include .pdf, .jpg, and .tif.

When a file format is proprietary, the format’s specifications are not available, because they are usually guarded as the property of the company whose program creates the files.  If the .mac file format had been open, it is far less likely that content would ever have gotten trapped in this extinct format.

With digital preservation, the rule of thumb is to move your content into file formats that are 1. open, and/or 2. popular.  When a file format is open, we can get inside its structure and know what’s going on, even if the software that a file was originally created with no longer functions.  The thinking behind choosing a popular file format over one that is used less frequently is that, with so many people having invested their content in the format, someone will inevitably find a way to “get inside” it – and hopefully share their secret.

Here is a case demonstrating the issue of open versus proprietary formats.  The University of Minnesota’s University Digital Conservancy explicitly determines how much preservation action it will put into specific files based on their format:

More extensive actions will be taken to preserve usability for objects in file formats that are fully disclosed, well documented, widely adopted, and are most accessible for migration, emulation, or normalization actions. Fewer actions will be taken to preserve usability for file formats that are proprietary and/or undocumented, and those that are considered working formats (e.g., Photoshop .psd) and/or are not widely adopted.

You can view the tables outlining their levels of preservation support per file format here.
I also liked this table of recommended formats put together by the Florida Digital Archive (PDF).

File Format Resources

PRONOM is a remarkable project of the UK’s National Archives.  They have created a comprehensive directory of file formats and the programs that can understand them.  It’s truly a great resource for digital archivists because a search for a file format will yield information about its origin, its particular specification signatures, associated rights, and more.  The National Archives also developed DROID to work in conjunction with PRONOM.  DROID can automatically identify file formats in batch operations.
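Setting DROID’s real interface aside (I’m not reproducing its actual API or PRONOM’s signature files here), the general idea of batch identification can be sketched: walk a directory tree, read each file’s leading bytes, and tally the formats found.  A toy Python version, with a deliberately tiny made-up signature table:

```python
import os
from collections import Counter

# Toy signature table; a real registry like PRONOM holds many hundreds.
SIGNATURES = {
    b"%PDF-": "PDF",
    b"\xff\xd8\xff": "JPEG",
}

def identify_batch(root):
    """Walk `root` and return a Counter of identified formats.

    Files whose leading bytes match no signature are tallied as 'unknown'.
    """
    tally = Counter()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            with open(os.path.join(dirpath, name), "rb") as f:
                header = f.read(8)
            fmt = next((label for magic, label in SIGNATURES.items()
                        if header.startswith(magic)), "unknown")
            tally[fmt] += 1
    return tally
```

The “unknown” bucket is the interesting output for an archivist: those are the files that need human attention before ingest.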

Growing from a partnership between PRONOM and the Global Digital Format Registry (GDFR) is the forthcoming Unified Digital Format Registry (UDFR).  The aim of this project is to create a larger, open registry, based on the PRONOM database, to which community participants can add formats.

If you’re looking for new fodder for your RSS feed, here is a blog that is entirely devoted to discussing file formats in the context of digital preservation.  It’s written by Gary McGath, who worked on the JHOVE and JHOVE2 projects, which validate file format claims upon repository ingest.  Here is an older post about the projects.


JHOVE & JHOVE2

When curating digital files for storage in a digital repository, being certain of an object’s file format is very important for preservation purposes and for future accessibility.  JHOVE is an open-source, Java-based framework that will identify, validate, and characterize the formats of digital objects.  This tool can be integrated into an institution’s workflow for populating a digital repository.  If the repository is OAIS-compliant, this integration would occur during the creation and validation of an information package at the digital object’s ingest.

JHOVE’s three steps of identification, validation, and characterization will result in us knowing a great deal about a digital object’s technical properties.  We’ll know the object’s format (identification), we’ll know that it is what it says it is (validation), and we’ll know about the significant format-specific properties of the object (characterization).
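To make the three steps concrete, here is a hedged toy illustration for a single format, JPEG – my own sketch, bearing no resemblance to JHOVE’s real module architecture.  It identifies by signature (JPEG streams begin with the bytes FF D8), does one shallow validation check (a well-formed stream ends with FF D9), and then reports a couple of properties:

```python
def characterize_jpeg(path):
    """Toy identify -> validate -> characterize pass for one format (JPEG).

    Returns a dict with 'format', 'valid', and a basic property.
    """
    with open(path, "rb") as f:
        data = f.read()

    # 1. Identification: JPEG streams start with the SOI marker, FF D8.
    if not data.startswith(b"\xff\xd8"):
        return {"format": None, "valid": False}

    # 2. Validation (shallow): a well-formed stream ends with EOI, FF D9.
    valid = data.endswith(b"\xff\xd9")

    # 3. Characterization: report some properties of the object.
    return {"format": "JPEG", "valid": valid, "size_bytes": len(data)}
```

Even this crude version shows why validation matters: a truncated download would still *identify* as a JPEG while failing the validity check.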

JHOVE’s name comes from the partnership that spawned between JSTOR and the Harvard University Library in 2003 to create the software, and stands for JSTOR/Harvard Object Validation Environment.

The Digital Curation Centre did a case study in 2006 which provides a concise background of the JHOVE project.


The September 2008 Library of Congress Digital Preservation Newsletter reports on the development of JHOVE2.  The users and creators of JHOVE decided to address what they saw as shortcomings and improve the tool for JHOVE2.  The project has since moved under the guidance of the California Digital Library, Portico, and Stanford University.

The most notable change to come in JHOVE2 is a reworking of the original three-step process outlined by JHOVE.  In JHOVE2, the whole process is considered characterization: identifying, validating, and reporting the inherent properties of a digital object that are significant to its preservation.  Added to this process is an assessment feature that determines the digital object’s acceptability for an institution’s repository, based on locally defined policies.

A really exciting and valuable improvement in JHOVE2 will be the ability to characterize digital objects that are composed of more than one format.

One of the developers of JHOVE and JHOVE2 is Stephen Abrams of the California Digital Library, who has been designated as a Digital Preservation Pioneer by the Library of Congress.

JHOVE2 is expected to be released in early 2010, but the prototype code was made available last month for viewing and comments.  Here is the FAQ from the project’s site.  The completed product, like its predecessor, will be available under an open source license.

There will be a JHOVE2 Workshop following iPRES 2009.

Cloud Computing

Let’s talk about cloud computing.

At its simplest, things that are in the “cloud” are things that float around in a sort of digital airspace and don’t exist on your computer. They exist on remote servers which can be accessed from many computers.

Photo by mansikka under a Creative Commons license

For this reason, the cloud is a good metaphor for the Internet.  For most of us, keeping things in the cloud is a convenient and logical way to make life simpler.

You can access things in the cloud from anywhere that is connected to the Internet…depending on the service and its security (private cloud or public cloud). It’s kind of like your email or Facebook account. You have lots of stuff stored in these accounts that is specific to you, but you can log in from anywhere. And it will always look the same and have all your stuff in it. Your stuff is always just…there.

Photo by AJC1 under a Creative Commons license

Getting a bit more technical, your stuff is actually physically stored somewhere as bits on servers that are run by whoever is providing the service. For example, some institutions have servers dedicated to an institutionally-based digital repository. These servers might live on the campus and will store everything that is added to the repository. But the whole repository will not exist on the specific machine that you might use to access documents stored there. Your computer will connect to the remote server to access the repository.

What makes this fun for digital preservationists is that cloud computing can really increase the scale and sharing of preservation duties. Maureen Pennock, the Web archive preservation Project Manager at the British Library, recognizes this in her blog: “This minimises costs for all concerned, addresses the skills shortage, and produces a more efficient, sustainable and reliable preservation infrastructure.”

In the future – and as we are seeing with DuraCloud – all the tech work behind storing and retrieving data may be provided as part of a single repository product.  (This type of service, by the way, is referred to as IaaS – Infrastructure as a Service.)  This would be excellent news for the many institutions that don’t have the means or skills to set up a repository themselves.

Cloud computing offers a huge potential for an off-site alternative to the out-of-the-box repository products that most institutions currently must use. Instead, external organizations will be able to do the tech work while the institutions will be able to focus on non-technical repository maintenance.


Original publication date: 7/15/09