Thank you for visiting my collection of digital preservation concepts!
As you have probably guessed, I am no longer updating this blog, but I sincerely hope that it continues to serve as a resource for the basic concepts and processes that make up this (no-longer) nascent field. I’d imagine that some of these writings are no longer up-to-date, as is the inevitable nature of anything that tries to capture a digitally-focused moment in time.
The constant changes in our digital environments are exciting and call for consistent adaptation. Since leaving grad school, I have moved on to roles in the tech industry, and have learned this: regardless of the institution or industry, the challenge of ‘keeping up’ is exhilarating, pervasive, and always involves a gamble. In particular, digital preservation is not for the faint of heart! I applaud the passion and competency with which this field has grown to approach the preservation of valuable digital resources. Please feel free to reach out for anything.
Really briefly, I wanted to direct you to the new page I added to this blog, Digitization Specs. Currently there is a compilation of various specs for digitizing photographic prints and negatives.
Just for fun (and reference) I compiled the photograph digitization specifications that I could locate from the websites and publications of various libraries and cultural heritage institutions. I thought it would be neat to see them all side by side, and as far as I know, there isn’t such a resource yet. I’ve included specifications for color and black & white photos, in print and film form.
I also intend to eventually compile specifications for digitizing text/print-based documents, but that’s for another day!
File formats are the rock stars of digital preservation. After all, one of the goals of digital preservation is to prevent a loss of access to files due to file format obsolescence. If you are using a file format migration strategy for preservation, then you will be refreshing the digital files over time to keep the content stored in formats that are readable by the current technology. If you are practicing a software emulation strategy for preservation, then you are maintaining software that will be able to read the old file formats.
When a digital object is deposited into a digital repository, its file type is typically declared by its extension (.jpg, .pdf, etc.). The type of file you are dealing with has big implications for how preservation practices can be applied to it now and in the future. This is because access to the contents of a digital object depends on the ability to store, read, and edit the digital files – actions that are governed by the file format’s specification and the software needed to understand that format. The specification is a description of the file format that includes its basic building blocks and a technical, byte-by-byte description of the format’s layout. Cornell’s digital preservation tutorial says a bit more about it.
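Because extensions can be renamed or simply wrong, format identification tools lean on the “magic numbers” a specification prescribes: the fixed bytes at the start of a file. As a rough illustration (the signature table below is a tiny, hand-picked sample, not a real registry), identification can be sketched like this:

```python
# Sketch: identifying a file format from its leading bytes ("magic numbers"),
# as laid out in the format's specification, rather than trusting the extension.
# The signature table covers just a few common formats.

SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"%PDF-": "PDF document",
    b"II*\x00": "TIFF image (little-endian)",
    b"MM\x00*": "TIFF image (big-endian)",
}

def identify(data: bytes) -> str:
    """Return a format name for the given leading bytes, or 'unknown'."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return "unknown"
```

So a file that begins with the bytes `%PDF-` identifies as a PDF no matter what its extension claims; real identification tools consult far larger signature registries.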
As you know, when a software program creates a file, the program can re-open the file to view it, edit it, and so on. This is because the program knows the file format’s specification and was designed to work with it. As software programs get upgraded or disappear, the ability to read the files they created becomes riskier. Software upgrades happen all the time, and it is usually possible to open a file created with the previous version of a program. But over time and numerous updates, this might not always be the case. And it certainly won’t be possible once the software stops being upgraded and eventually cannot run on new machines.
To illustrate this point, let’s look at the old Mac program MacPaint, a basic painting program that shipped with Apple computers from 1984 to 1988. Files created with this program were “MacPaint bitmap images,” and received the extension .mac (there were a few other extensions for this format, but let’s focus on this one). MacPaint won’t run on modern machines, and few later programs were designed to read this format. So many .mac files became orphaned, and the only way to read them was to boot up an old machine with MacPaint on it. (Happily, Apple released the source code of MacPaint to the Computer History Museum, meaning that with a little work these files are readable.)
Open & Proprietary Formats
But we’ve come to an interesting juncture in this discussion. File formats can be grouped into two categories: open and proprietary. Open file formats are those whose specifications are publicly available. When this information is available, programs other than the one that created a file can be written to interpret the file’s format (or migrate an old file into a newer format), so we are not dependent on the original program. This implies better longevity prospects for the file in its original format. Some open file formats that I’m sure you’ve come to love include .pdf, .jpg, and .tif.
When a file format is proprietary, the format’s specifications are not available because they are usually guarded as property of the company that created the program that creates the files. If the .mac file format had been open, then it is far less likely that content would have ever gotten trapped in this extinct format.
With digital preservation, the rule of thumb is to move your content into file formats that are 1. open, and/or 2. popular. When a file format is open, we can get inside its structure and know what’s going on, even if the software that a file was originally created with no longer functions. The thinking behind choosing a popular file format over one that is used less frequently is that a way to “get inside” the format will inevitably emerge, since so many people will have invested their content in that format. Someone will find a way in, and hopefully share their secret.
Here is a case demonstrating the issue of open versus proprietary formats. The University of Minnesota’s University Digital Conservancy explicitly determines how much preservation action it can put into specific files based on their format:
More extensive actions will be taken to preserve usability for objects in file formats that are fully disclosed, well documented, widely adopted, and are most accessible for migration, emulation, or normalization actions. Fewer actions will be taken to preserve usability for file formats that are proprietary and/or undocumented, and those that are considered working formats (e.g., Photoshop .psd) and/or are not widely adopted.
You can view the tables outlining their levels of preservation support per file format here.
I also liked this table of recommended formats put together by the Florida Digital Archive (PDF).
File Format Resources
PRONOM is a remarkable project of the UK’s National Archives. They have created a comprehensive directory of file formats and the programs that can understand them. It’s truly a great resource for digital archivists because a search for a file format will yield information about its origin, its particular specification signatures, associated rights, and more. The National Archives also developed DROID to work in conjunction with PRONOM. DROID can automatically identify file formats in batch operations.
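To give a feel for what signature-based batch identification involves (DROID itself is a Java tool backed by the full PRONOM signature registry; the toy signature list and function below are purely my own illustration), a minimal sketch might look like this:

```python
# Sketch of DROID-style batch identification: walk a directory tree, read the
# first bytes of each file, and tally formats by signature. The signature
# table is a toy stand-in for the PRONOM registry DROID actually consults.

import os
from collections import Counter

SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"%PDF-", "PDF"),
    (b"II*\x00", "TIFF"),
    (b"MM\x00*", "TIFF"),
]

def identify_tree(root: str) -> Counter:
    """Tally identified formats for every file under root."""
    tally = Counter()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            with open(os.path.join(dirpath, name), "rb") as f:
                head = f.read(16)
            fmt = next((n for m, n in SIGNATURES if head.startswith(m)), "unknown")
            tally[fmt] += 1
    return tally
```

A real tool goes much further – reporting PRONOM identifiers (PUIDs), format versions, and signatures that sit deeper in the file – but the batch principle is the same.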
If you’re looking for new fodder for your RSS feed, here is a blog that is entirely devoted to discussing file formats in the context of digital preservation. It’s written by Gary McGath, who worked on the JHOVE and JHOVE2 projects, which validate file format claims upon repository ingest. Here is an older post about the projects.
METS is the Metadata Encoding and Transmission Standard, which encodes metadata via a standardized XML schema. METS handles all types of metadata relevant to preservation: descriptive, administrative, and technical/structural metadata are all included in the schema, and a METS document serves as the container for all of this information about a digital object.
The schema was initially developed for the digital library community, and has since extended to the digital repository and preservation communities. The fact that METS consolidates the varying types of an object’s metadata into one standard XML-based file type is excellent news for sharing and preserving resources.
As is evident from the experience of many current digital preservation programs, collaboration among multiple institutions is a very strategic move for a successful digital preservation program. Using METS as a guideline for creating readable and transferable metadata ensures a more seamless sharing experience. It also aids in escape strategies should the repository, or the institution hosting it, fail and the digital objects need to be transferred to someone else’s care.
The beginnings of a standardized metadata scheme for collections of digital objects can be traced back to 1997, when UC Berkeley and the Digital Library Federation (DLF) initiated a project to further the concept of digital libraries sharing resources. By 2001, the DLF-sponsored METS schema had emerged; it is supported by the Library of Congress, was made a NISO standard in 2004, and was renewed in 2006.
By 2006, it had become clear that METS could serve not only as an answer to the interoperability needs of sharing digital objects, but also as a valuable tool for preservation. Jerome McDonough (2006) states that “the METS standard can be considered one of many efforts to try to determine…how complex sets of data and metadata might best be encoded to support both information exchange and information longevity.”
How it’s Used
The OAIS reference model considers an acceptable digital object as one that includes the original content as well as the metadata required to understand the content, its structure, its rendering needs, and its preservation history. This information plus the actual content forms a complete “information package,” which comes in the flavors of SIPs, AIPs, and DIPs, depending on the object’s role in a repository, as discussed in the OAIS reference model. The metadata in each of these packages is referred to as the Preservation Description Information (PDI). (Note: DIPs do not always have PDIs since they are the dissemination versions.)
We know from the OAIS model that a PDI categorizes a digital object’s metadata into reference, provenance, context, and fixity categories. METS is capable of fulfilling these metadata requirements with corresponding sections in each METS document:
Administrative <amdSec> (covers provenance and rights)
File Groups <fileGrp> (lists any and all files that comprise the digital object)
Structural Map <structMap>
Structural Links <structLink>
It is important to realize, however, that according to the METS standard, the only required part of a METS document is the Structural Map. So in order for METS to be effective when applied to preservation, there must be information in each of these sections (FYI – a truly complete METS file will also include a header <metsHdr>).
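To make those sections concrete, here is a hypothetical, pared-down METS document using the elements named above (all identifiers, dates, and file names are invented for illustration; a real document would carry actual metadata inside each section):

```xml
<mets xmlns="http://www.loc.gov/METS/"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <metsHdr CREATEDATE="2010-01-01T00:00:00"/>
  <amdSec ID="amd01">
    <!-- administrative metadata: provenance and rights -->
  </amdSec>
  <fileSec>
    <fileGrp USE="preservation master">
      <file ID="file01" MIMETYPE="image/tiff">
        <FLocat LOCTYPE="URL" xlink:href="photo001.tif"/>
      </file>
    </fileGrp>
  </fileSec>
  <structMap>
    <div TYPE="photograph" LABEL="Example object">
      <fptr FILEID="file01"/>
    </div>
  </structMap>
  <!-- a structLink section would record links between divs in the structMap -->
</mets>
```

Note that the file groups (`<fileGrp>`) live inside a parent `<fileSec>` element.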
So where do we get this information to fill up a METS file? The answer is PREMIS.
METS and PREMIS – A Perpetual Preservation Honeymoon
You may recall that PREMIS is also an XML schema that has been developed for preservation metadata. The PREMIS structure is based on entities and semantic units that will harbor information about a digital object that is necessary for supporting and recording digital preservation actions.
What’s important here is that PREMIS will sit inside the METS document. You can see an example of this here. All of the preservation information will be present in the PREMIS file, and by nesting the PREMIS data into the METS file, the metadata becomes transferable to other repositories.
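As a rough sketch of what that nesting looks like (the element values here are invented for illustration; see the linked example for the real thing), a PREMIS event recording an ingest might sit inside a METS administrative section like so:

```xml
<amdSec ID="amd01">
  <digiprovMD ID="dp01">
    <mdWrap MDTYPE="PREMIS:EVENT">
      <xmlData>
        <premis:event xmlns:premis="info:lc/xmlns/premis-v2">
          <premis:eventIdentifier>
            <premis:eventIdentifierType>local</premis:eventIdentifierType>
            <premis:eventIdentifierValue>event-001</premis:eventIdentifierValue>
          </premis:eventIdentifier>
          <premis:eventType>ingestion</premis:eventType>
          <premis:eventDateTime>2010-01-01T00:00:00</premis:eventDateTime>
        </premis:event>
      </xmlData>
    </mdWrap>
  </digiprovMD>
</amdSec>
```

The `<mdWrap>` element is what lets METS carry the PREMIS record along with everything else, so the whole package travels as one document.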
The flexibility of both of these schemas means that integrating PREMIS and METS involves variations and complications. The Library of Congress created a working draft of guidelines for this process, which is viewable here (PDF, 25K).
Helpful METS Resources
METS Primer (Revised 4/2010) (PDF, 1.53MB) – Readable, and has color images and examples.
PREMIS in METS toolbox, information about the project here.
On Monday, I was thrilled to attend a workshop entitled Digital Preservation for Video, presented by Linda Tadic for Independent Media Art Preservation (IMAP). The workshop was held in San Francisco at the Bay Area Video Coalition (BAVC). The scope of the event was to cover some of the key considerations in digitizing video and creating a digital preservation program at the DIY level (i.e. without a huge IT department backing you up). A few of the institutions represented by attendees included BAVC, the Pacific Film Archive, the California Institute of the Arts, the California Academy of Science, the Sierra Club, the San Francisco Symphony, and the California Film Institute.
Prior to this workshop, I hadn’t had a great deal of exposure to the digital preservation challenges of moving image materials. In fact, I confess that I hardly knew anything about the current physical formats used for video storage, nor much about the hard work necessary to digitize them. Most of the attendees have done their share of digitizing moving images (or of outsourcing the digitization), and I think that most of us were there to explore answers to the question of “now what?”
The Move to File-Based Video Storage
Physical moving image storage formats are on death row. We spent the bulk of the morning going over the characteristics of different physical media and their expiration dates, which served as an effective motivator for digitization and instilled something close to panic among the attendees.
Unlike paper, the magnetic tapes, reels, and discs that moving images are physically stored on are on a very tight deadline; aside from succumbing to format obsolescence, most of this media is reaching the end of its life expectancy, after which the images it holds will simply not exist anymore. To give some examples of formats I was more familiar with: the life span of VHS is approximately 15 years, while formats like MiniDV and DVCam last 5-10 years. This illustrates that, in some cases, it isn’t necessary to digitize the oldest things first.
Digitization is increasingly the best preservation option for some of these formats, and it is important that the road to digitization doesn’t result in a dead end. That is why we need to ensure that once digitization has occurred, there is a digital preservation plan in place so that the video content will continue to survive, especially since the original physical sources of the content will be dead in a short while.
Indeed, we are observing a shift from format-based physical video storage to the file-based storage of digital video content. Preservation will no longer be about making the tapes last as long as possible, but about caring for the digital files representing the content that the tapes once held.
Preservation Concerns for Digital Video
I appreciate how Linda was adamant in reminding us that digital preservation is not a one-time fix for digital video longevity. She was very clear in telling us that it requires a constant guardianship consisting of a deliberate, scheduled management of the digital files. To use her phrasing, there is no “store-and-ignore” solution. Preservation activities involve keeping file formats current so that they can be accessed by the software of the now. It also involves exercising the hard drives that your files may be stored on and not letting them sit idle for more than 6 months. It requires diligent updating of the files’ accompanying preservation metadata so that changes to the files can be tracked and managed.
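One concrete piece of that scheduled management is fixity checking: record a checksum for each file at ingest, then periodically re-compute and compare it so that silent corruption is caught early rather than discovered years later. A minimal sketch (how checksums are stored and how the schedule is run are left open):

```python
# Sketch of a routine fixity check: compute a SHA-256 checksum for a file
# at ingest, then re-verify it on a schedule to detect silent corruption.

import hashlib

def checksum(path: str) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: str, recorded: str) -> bool:
    """True if the file still matches its recorded checksum."""
    return checksum(path) == recorded
```

If `verify` ever returns False, the file has changed since its checksum was recorded – time to restore from another copy and note the event in the preservation metadata.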
Linda also stated what nobody likes to hear about digital preservation: that there is no one way to do things, and that there is no one set of instructions to follow that will help you save your content. As with all file types, the preservation decisions you make will depend on your content, your files types, your storage, and your intended access methods. So, in the case of making storage selections and creating a plan, knowledge is power. I’ll try to summarize some of the key points covered.
My Internet radar is starting to pick up some buzz about iPRES 2010. iPRES is an annual, international conference on “the preservation of digital objects” (see my previous iPRES gushing from when I was an intern). The call for papers for the October 2010 meeting has been issued, and this year there is also a call for leading workshops and tutorials on digital preservation activities. This will surely lead to great opportunities to learn and share experiences and skill sets.
iPRES 2010 is being held in Vienna and is hosted by the Austrian National Library and the Technical University of Vienna. I think the statement that this year’s organizers have made with their logo design is very apt for the conference subject matter. It reads “iPRES 02010.” Expressing the year in five digits instead of four is an excellent reminder that we are only existing at a particular place in time. As climactic as the year 2010 seemed to us on January 1st, it really isn’t any sort of finale.
Speak to a person involved in digital preservation, and they may be able to forecast what the next five years of digital information preservation management will look like. Maybe. The five-digit year expresses the future, and encourages thoughts about the people coming after us who will stand to benefit from our digital output. Thinking that far ahead when talking about digital preservation is rather lofty, I know. But it illustrates a pervasive point.
It is not likely that I will attend iPRES this year, so I’ve appeased myself by reliving some of the great presentations I saw last year in San Francisco. The host of last year’s iPRES, the California Digital Library, recently put up the proceedings of iPRES 2009 on its open access publishing platform, eScholarship. The slides from the presentations have been available for a while on the website.
Here are some 2009 papers that have been influential in my own thinking since the conference:
Tyler Walters, Liz Bishoff, Emily B. Gore, Mark Jordan, and Thomas C. Wilson: Distributed Digital Preservation: Technical, Sustainability, and Organizational Developments. This panel relayed its members’ experiences participating in private LOCKSS networks with geographically distributed institutions. Really great for looking at the benefits and challenges of joining a network versus going solo. (Also, I’m taking a course from Tyler Walters this semester, which is an additional bonus.)
The entire proceedings are available here for free, individual downloading.
To be clear, digitization and digital preservation are not the same thing. Digitization is the process of making digital copies of physical items. Digital preservation refers to the activities associated with maintaining the viability of, and access to, digital files over time. Thus, the activities of digitization will result in things that can be (need to be?) included in a digital preservation project.
I like to think about all of the large scale book scanning that is happening. Massive amounts of digital files are being created from physical books. If these files aren’t taken care of properly, then over time they will become unusable…and all the news coverage of the Google Books settlement seems like a laughable waste of time.
This semester, I’m taking a course specifically on Digitization. One of the first questions posed to us was whether or not a physical item (book, document, etc.) that has been digitized can be discarded. Now, I am aware of the intrinsic value that a book hand-printed in the 1500s has, but in my response, I chose to focus on the intellectual (or informational) content. This question made me think about how intricately tied digitization projects should be to digital preservation programs. Why would we risk the total loss of an item’s content by relying on a digital version that receives no stewardship after the physical copy has been tossed? So here are my conditions for tossing a physical copy once it’s been digitized.
The digitized copy should be of preservation quality, meeting (what seem to be) the non-standardized requirements of 600+ dpi, the TIFF file format, etc.
The organization charged with keeping the digital file of the digitized item should have a solid and reliable digital preservation program in place. In a successful digital preservation program, the risks of file format obsolescence, file corruption, and crashed hard drives are mitigated, as the program should account for these disasters ahead of time and be ready with plans to prevent such events. Under this condition, it is safe to assume the analog copy would no longer need to be retained, since its informational content is safe in a digital format.
Access to the digitized copy must be equal to or greater than the access that was allowed with the physical copy. Preferably, access should be increased, as the new format enables more avenues of access, by nature. As Oya (2007) points out, the investments made in large scale digitization initiatives to aggregate and store digitized collections are huge. “Such investments will be more worthwhile if discovery, access, and delivery are given equal emphasis.” The argument could be made that increased access to content is as much of a justification for digitization as are any reasons associated with preservation of the content of the physical item.
Finally, it must be determined that the physical copy of a digitized item has no other value than what can be conveyed through its digital copy. If the physical item is valuable for more than its informational content, then perhaps discarding it after it has been digitized is not a reasonable option.