File Formats and Preservation

File formats are the rock stars of digital preservation. After all, one of the goals of digital preservation is to prevent a loss of access to files due to file format obsolescence. If you are using a file format migration strategy for preservation, then you will be converting the digital files over time to keep the content stored in formats that are readable by current technology. If you are practicing a software emulation strategy for preservation, then you are maintaining software that will be able to read the old file formats.

When a digital object is deposited into a digital repository, its file type is usually declared by its extension (.jpg, .pdf, etc.). The type of file you are dealing with has big implications for how preservation practices can be applied to it now and in the future. This is because access to the contents of a digital object depends on the ability to store, read, and edit the digital files, and those actions depend on the file format’s specification and on software that understands the format. The specification is a description of the file format that includes its basic building blocks and a technical, byte-by-byte description of the file’s layout. Cornell’s digital preservation tutorial says a bit more about it.

Extinction

[Image: dinosaur bones. Photo by Charles Tilford, CC license]

As you know, when a software program creates a file, the program can re-open the file to view it, edit it, etc. This is because the program knows the file format’s specifications and was designed to work with it. As software programs get upgraded or disappear, the ability to read the files they created is put at risk. Software upgrades happen all the time, and it is usually possible to open a file created with the previous version of a program. But over time and after numerous updates, this might not always be the case. And it certainly won’t be possible if the software stops being upgraded and eventually can no longer run on new machines.

To illustrate this point, let’s look at the old Mac program MacPaint, a basic painting program that shipped with Apple computers from 1984 to 1988. Files created with this program were “MacPaint bitmap images,” and received the extension .mac (there were a few other extensions for this format, but let’s focus on this one). MacPaint won’t run on modern machines, and few, if any, mainstream programs released after 1988 were designed to read this format. So .mac files were effectively orphaned, and for most people the only way to read them was to boot up an old machine with MacPaint on it. (Happily, Apple released the source code of MacPaint to the Computer History Museum, meaning that with a little work these files are readable.)
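To give a feel for what “a little work” can look like once a format’s layout is known, here is a minimal Python sketch that converts a MacPaint image to a plain PBM bitmap. It assumes the commonly described layout of a MacPaint file: a 512-byte header followed by a 576 x 720, 1-bit image stored as PackBits-compressed rows of 72 bytes each. It also assumes the file has no extra wrapper (such as a MacBinary header) in front of it. The function names are my own, and this is an illustration rather than a tested migration tool.

```python
# Sketch only: assumes a bare MacPaint file (512-byte header, then
# PackBits-compressed 1-bit rows). Function names are hypothetical.

HEADER_SIZE = 512      # assumed size of the MacPaint header
WIDTH, HEIGHT = 576, 720
ROW_BYTES = WIDTH // 8  # 72 bytes per row at 1 bit per pixel


def unpack_bits(data: bytes, expected_len: int) -> bytes:
    """Decode PackBits run-length data until expected_len bytes are produced."""
    out = bytearray()
    i = 0
    while i < len(data) and len(out) < expected_len:
        n = data[i]
        i += 1
        if n < 128:
            # Literal run: copy the next n + 1 bytes unchanged.
            count = n + 1
            out += data[i:i + count]
            i += count
        elif n > 128:
            # Repeat run: the next byte is repeated 257 - n times.
            out += data[i:i + 1] * (257 - n)
            i += 1
        # n == 128 is a no-op and is skipped.
    if len(out) < expected_len:
        raise ValueError("compressed data ended before the image was complete")
    return bytes(out[:expected_len])


def macpaint_to_pbm(raw: bytes) -> bytes:
    """Return the image as a binary PBM (P4) file; in both formats a 1 bit is black."""
    pixels = unpack_bits(raw[HEADER_SIZE:], ROW_BYTES * HEIGHT)
    return b"P4\n%d %d\n" % (WIDTH, HEIGHT) + pixels
```

Reading a file with `open(path, "rb").read()`, passing the bytes to `macpaint_to_pbm`, and writing the result to a `.pbm` file would be one way to use it. PBM is chosen as the target only because it is about as simple as an open bitmap format gets.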

Open & Proprietary Formats

But we’ve come to an interesting juncture in this discussion. File formats can be grouped into two categories: open and proprietary. Open file formats are those whose specifications are publicly available. When this information is available, programs other than the one that created the file can be written to interpret the file’s format (or to migrate an old file into a newer format), and we are not dependent on the original program. This makes it more likely that the file will remain usable in its original format. Some open file formats that I’m sure you’ve come to love include .pdf, .jpg, and .tif.

When a file format is proprietary, the format’s specifications are not publicly available, usually because they are guarded as the property of the company that created the program that creates the files. If the .mac file format had been open, then it is far less likely that content would ever have gotten trapped in this extinct format.

With digital preservation, the rule of thumb is to move your content into file formats that are (1) open and/or (2) popular. When a file format is open, we can get inside its structure and know what’s going on, even if the software that a file was originally created with no longer functions. The thinking behind choosing a popular file format over one that is used less frequently is that a way to “get inside” the format becomes very likely, since so many people will have invested their content in that format. Someone will find a way in, and hopefully share their secret.

Here is a case demonstrating the issue of open versus proprietary formats. The University of Minnesota’s University Digital Conservancy explicitly determines how much preservation action it can put into specific files based on their format:

More extensive actions will be taken to preserve usability for objects in file formats that are fully disclosed, well documented, widely adopted, and are most accessible for migration, emulation, or normalization actions. Fewer actions will be taken to preserve usability for file formats that are proprietary and/or undocumented, and those that are considered working formats (e.g., Photoshop .psd) and/or are not widely adopted.

You can view the tables outlining their levels of preservation support per file format here.
I also liked this table of recommended formats put together by the Florida Digital Archive (PDF).
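The criteria in the policy quoted above can be sketched as a simple decision rule. This is only my reading of the quoted paragraph, not the Conservancy’s actual procedure: the class, the field names, and the two labels are hypothetical, and the attribute values in the usage note below are illustrative rather than authoritative assessments of those formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatProfile:
    """Hypothetical summary of the properties the quoted policy mentions."""
    name: str
    fully_disclosed: bool
    well_documented: bool
    widely_adopted: bool
    working_format: bool = False  # e.g. an editing format such as Photoshop .psd


def support_level(fmt: FormatProfile) -> str:
    """Return a coarse label for how much preservation action a format would get."""
    favourable = fmt.fully_disclosed and fmt.well_documented and fmt.widely_adopted
    if favourable and not fmt.working_format:
        return "more extensive"
    return "fewer"
```

For example, `support_level(FormatProfile("TIFF", True, True, True))` gives "more extensive", while a profile marked as a working format, or one that is undocumented, gives "fewer". A real policy would have more than two tiers and would record the reasoning behind each assessment.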

File Format Resources

PRONOM is a remarkable project of the UK’s National Archives.  They have created a comprehensive directory of file formats and the programs that can understand them.  It’s truly a great resource for digital archivists because a search for a file format will yield information about its origin, its particular specification signatures, associated rights, and more.  The National Archives also developed DROID to work in conjunction with PRONOM.  DROID can automatically identify file formats in batch operations.
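To make the idea of signature-based identification concrete, here is a minimal Python sketch that identifies a few formats by the bytes at the start of a file instead of trusting the extension. It is not how DROID is implemented and it does not use PRONOM data; real PRONOM signatures are far more detailed. The signature table, function names, and labels are my own, and only a handful of well-known signatures are included.

```python
from pathlib import Path

# A tiny, hand-written signature table (leading "magic" bytes).
# Real registries such as PRONOM hold far richer signatures.
SIGNATURES = [
    ("PDF", b"%PDF-"),
    ("JPEG", b"\xff\xd8\xff"),
    ("TIFF (little-endian)", b"II*\x00"),
    ("TIFF (big-endian)", b"MM\x00*"),
    ("PNG", b"\x89PNG\r\n\x1a\n"),
]


def identify(header: bytes) -> str:
    """Return a format label for the given leading bytes, or 'unknown'."""
    for label, magic in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def identify_tree(root: str) -> dict[str, str]:
    """Batch-identify every file under root, keyed by path."""
    results = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            with path.open("rb") as handle:
                # 16 bytes is enough for every signature in the table above.
                results[str(path)] = identify(handle.read(16))
    return results
```

Comparing the label returned by `identify` with a file’s extension is a quick way to spot files whose extension does not match their contents, which is one reason identification tools look inside the file.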

Growing from a partnership between PRONOM and the Global Digital Format Registry (GDFR) is the forthcoming Unified Digital Format Registry (UDFR). The aim of this project is to create a larger, open registry that is based on the PRONOM database and to which community participants can add formats.

If you’re looking for new fodder for your RSS feed, here is a blog that is entirely devoted to discussing file formats in the context of digital preservation.  It’s written by Gary McGath, who worked on the JHOVE and JHOVE2 projects, which validate file format claims upon repository ingest.  Here is an older post about the projects.

8 thoughts on “File Formats and Preservation”

  1. Steve Hitchcock October 5, 2010 / 10:22 am

    This is a helpful overview of the two ends of the digital preservation workflow for file formats, that is, format identification at one end, and taking preservation actions such as format migration at the other. The question is how you connect the two in a productive and consistent way. The answer is preservation planning and a tool called Plato, developed by the European Planets project. For interested readers, there is a course to explain how this works for a digital repository, starting with the preservation workflow and covering format risk management and preservation planning. This part of the course (KeepIt course 4) comprises a number of blog entries, which include the original presentations and practical resources, and can be followed using the links and starting here http://blogs.ecs.soton.ac.uk/keepit/2010/09/21/keepit-course-4-putting-storage-format-management-and-preservation-planning-in-the-repository/

    • M. Amaral October 6, 2010 / 1:27 pm

      Hi Steve –
      Thank you so much for directing us to this resource. I’m glad the KeepIt courses are summarized like this for those of us who weren’t in attendance!

  2. Daniel Ransom October 5, 2010 / 11:06 am

    I’d really love to see the MacPaint files worthy of perpetual conservation. I always loved the “spraypaint” option making my prepubescent masterpieces on the Mac Plus. 😉

    • Euan Cochrane October 5, 2010 / 8:46 pm

      I like your post, Megan, and it’s led me to discover the rest of your blog, which I’m very impressed with.

      I would suggest though that file formats never become obsolete, rather the software (needed to fulfill the instructions contained in the files formatted according to the particular formatting standard) becomes obsolete.
      This reminds me that I’m seriously concerned about the lack of tools to identify the creating applications of files. In order to do migration or emulation, digital preservation practitioners need to know the creating applications of files in order to know what the outcome of their preservation actions will be. A file that purports to be formatted according to the PDF standard but which was created using Word 2007 can often be significantly different from one created using Acrobat Professional, for example, when rendered using the same application.

      p.s. go spraypaint brush!

      • M. Amaral October 6, 2010 / 1:05 pm

        Hi Euan –
        Thanks for making this distinction. I’m also glad that you mentioned that files of the “same” format can actually have big differences when created by different programs. That was certainly something that I hadn’t thought much about, and I’ll have to investigate a bit more in order to fully understand it!

  3. Gary McGath December 6, 2010 / 1:37 pm

    Thanks for the plug for my blog! I haven’t actually worked on JHOVE2, though I’m on the advisory board. Today I came across your blog and just added a link to it on my blogroll.
