Dear Readers

Thank you for visiting my collection of digital preservation concepts!

As you have probably guessed, I am no longer updating this blog, but I sincerely hope that it continues to serve as a resource for the basic concepts and processes that make up this (no-longer) nascent field. I’d imagine that some of these writings are no longer up-to-date, as is the inevitable nature of anything that tries to capture a digitally-focused moment in time.
The constant changes in our digital environments are exciting and call for consistent adaptation. Since leaving grad school, I have moved on to roles in the tech industry, and have learned this: regardless of the institution or industry, the challenge of ‘keeping up’ is exhilarating, pervasive, and always involves a gamble. In particular, digital preservation is not for the faint of heart! I applaud the passion and competency with which this field has grown to approach the preservation of valuable digital resources. Please feel free to reach out for anything.

Onward!

Digitization Specifications Compilation

Really briefly, I wanted to direct you to the new page I added to this blog, Digitization Specs. Currently there is a compilation of various specs for digitizing photographic prints and negatives.

Just for fun (and reference) I compiled the photograph digitization specifications that I could locate from the websites and publications of various libraries and cultural heritage institutions.  I thought it would be neat to see them all side by side, and as far as I know, there isn’t such a resource yet.  I’ve included specifications for color and black & white photos, in print and film form.

I also intend to eventually compile specifications for digitizing text/print-based documents, but that’s for another day!

File Formats and Preservation

File formats are the rock stars of digital preservation.  After all, one of the goals of digital preservation is to prevent a loss of access to files due to file format obsolescence.  If you are using a file format migration strategy for preservation, then you will be refreshing the digital files over time to keep the content stored in formats that are readable by the current technology.  If you are practicing a software emulation strategy for preservation, then you are maintaining software that will be able to read the old file formats.

When a digital object is deposited into a digital repository, its file type is declared by its extension (.jpg, .pdf, etc.).  The type of file you are dealing with has big implications for how preservation practices can be applied to it now and in the future.  This is because access to the contents of a digital object depends on the ability to store, read, and edit the digital files – actions that depend on the file format’s specification and on the software needed to interpret that format.  The specification is a description of the file format that includes its basic building blocks and a technical, byte-by-byte description of its layout.  Cornell’s digital preservation tutorial says a bit more about it.

Extinction

dinosaur bones
Photo by Charles Tilford, CC license

As you know, when a software program creates a file, the program can re-open the file to view it, edit it, etc.  This is because the program knows the file format’s specifications and was designed to work with it.  As software programs get upgraded or disappear, the ability to read the files that they created becomes less certain.  Software upgrades happen all the time, and it is usually possible to open a file created with the previous version of a program.  But over time and numerous updates, this might not always be the case.  And it certainly won’t be possible if the software stops getting upgraded and eventually can no longer run on new machines.

To illustrate this point, let’s look at the old Mac program MacPaint, which was a basic painting program that shipped with Apple computers from 1984-1988.  Files created with this program were “MacPaint bitmap images,” and received the extension .mac (there were a few other extensions for this format, but let’s focus on this one).  MacPaint won’t run on modern machines, and there are certainly no programs from after 1988 that were designed to read this format.  So all .mac files became orphaned, and the only way to read them was to boot up an old machine with MacPaint on it.  (Happily, Apple released the source code of MacPaint to the Computer History Museum, meaning that with a little work these files are readable.)

Open & Proprietary Formats

But we’ve come to an interesting juncture in this discussion.  File formats can be grouped into two categories: open and proprietary.  Open file formats are those whose specifications are publicly available.  When this information is available, programs other than the one that created the file can be made to interpret the file’s format (or migrate an old file into a newer format), and we are not dependent on the original program.  This improves the odds that the file will remain usable in its original format.  Some open file formats that I’m sure you’ve come to love include .pdf, .jpg, and .tif.

When a file format is proprietary, the format’s specifications are not available because they are usually guarded as property of the company that created the program that creates the files.   If the .mac file format had been open, then it is far less likely that content would have ever gotten trapped in this extinct format.

With digital preservation, the rule of thumb is to move your content into file formats that are 1. open, and/or 2. popular.  When a file format is open, we can get inside its structure and know what’s going on, even if the software that originally created the file no longer functions.  The thinking behind choosing a popular file format over one that is used less frequently is that a way to “get inside” the format becomes all but inevitable, since so many people will have invested their content in that format.  Someone will find a way in, and hopefully share their secret.

Here is a case demonstrating the issue of open versus proprietary formats.  The University of Minnesota’s University Digital Conservancy explicitly bases how much preservation action it will put into specific files on their format:

More extensive actions will be taken to preserve usability for objects in file formats that are fully disclosed, well documented, widely adopted, and are most accessible for migration, emulation, or normalization actions. Fewer actions will be taken to preserve usability for file formats that are proprietary and/or undocumented, and those that are considered working formats (e.g., Photoshop .psd) and/or are not widely adopted.

You can view the tables outlining their levels of preservation support per file format here.
I also liked this table of recommended formats put together by the Florida Digital Archive (PDF).
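
To make the idea of format-based support levels concrete, here is a minimal Python sketch of how a repository might turn the criteria quoted above into a rule. Everything in it is an assumption for illustration: the FormatProfile fields, the support_level function, and the level names "full", "partial", and "limited" are hypothetical, and are not taken from the Conservancy's or the Florida Digital Archive's actual tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatProfile:
    """Hypothetical description of a file format for policy purposes."""
    extension: str
    fully_disclosed: bool   # is the specification publicly available?
    widely_adopted: bool    # is the format in broad use?
    working_format: bool    # an editing format, such as Photoshop .psd


def support_level(profile: FormatProfile) -> str:
    """Map a format profile to an illustrative level of preservation support."""
    # Working formats, and formats that are neither open nor popular,
    # get the fewest preservation actions.
    if profile.working_format or not (profile.fully_disclosed or profile.widely_adopted):
        return "limited"
    # Open AND popular formats get the most extensive actions.
    if profile.fully_disclosed and profile.widely_adopted:
        return "full"
    # Everything else (open or popular, but not both) lands in between.
    return "partial"


# Example profiles; the True/False values are assumptions for the sketch.
tiff = FormatProfile(".tif", fully_disclosed=True, widely_adopted=True, working_format=False)
psd = FormatProfile(".psd", fully_disclosed=False, widely_adopted=True, working_format=True)

print(tiff.extension, support_level(tiff))
print(psd.extension, support_level(psd))
```

The point of writing the rule down this way is that the decision depends only on properties of the format, not on the individual file, which mirrors the "open and/or popular" rule of thumb.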

File Format Resources

PRONOM is a remarkable project of the UK’s National Archives.  They have created a comprehensive directory of file formats and the programs that can understand them.  It’s truly a great resource for digital archivists because a search for a file format will yield information about its origin, its particular specification signatures, associated rights, and more.  The National Archives also developed DROID to work in conjunction with PRONOM.  DROID can automatically identify file formats in batch operations.
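
Signature-based identification, the general idea behind a tool like DROID, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SIGNATURES table holds a handful of widely documented magic numbers rather than PRONOM's data, and identify_bytes and identify_file are hypothetical helper names, not part of DROID.

```python
# Illustrative signature table: (format label, byte offset, magic bytes).
# A real registry such as PRONOM describes far more formats in far more detail.
SIGNATURES = [
    ("PDF", 0, b"%PDF-"),
    ("JPEG", 0, b"\xff\xd8\xff"),
    ("TIFF (little-endian)", 0, b"II*\x00"),
    ("TIFF (big-endian)", 0, b"MM\x00*"),
    ("PNG", 0, b"\x89PNG\r\n\x1a\n"),
]


def identify_bytes(header: bytes) -> str:
    """Return the label of the first signature that matches the header bytes."""
    for label, offset, magic in SIGNATURES:
        if header[offset:offset + len(magic)] == magic:
            return label
    return "unknown"


def identify_file(path: str) -> str:
    """Read the first bytes of a file and identify it by signature."""
    with open(path, "rb") as handle:
        return identify_bytes(handle.read(16))


# The extension is only a claim; the leading bytes are the evidence.
# A file named "photo.jpg" that begins with these bytes is really a PDF.
print(identify_bytes(b"%PDF-1.4 ..."))
```

Checking the bytes rather than trusting the extension is what lets a repository catch mislabeled files in a batch before they are ingested.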

Growing from a partnership between PRONOM and the Global Digital Format Registry (GDFR) is the forthcoming Unified Digital Format Registry (UDFR).  The aim of this project is to create a larger, open registry that is based on the PRONOM database and to which community participants can add formats.

If you’re looking for new fodder for your RSS feed, here is a blog that is entirely devoted to discussing file formats in the context of digital preservation.  It’s written by Gary McGath, who worked on the JHOVE and JHOVE2 projects, which validate file format claims upon repository ingest.  Here is an older post about the projects.

iPRES and 02010

My Internet radar is starting to pick up some buzz about iPRES 2010. iPRES is an annual, international conference on “the preservation of digital objects” (see my previous iPRES gushing from when I was an intern).  The 2010 call for papers for the October 2010 meeting has been issued, and this year, there is also a call for leading workshops and tutorials for digital preservation activities.  This will surely lead to great opportunities to learn and share experiences and skill sets.

iPRES 2010 is being held in Vienna and is hosted by the Austrian National Library and the Technical University of Vienna.  I think the statement that this year’s organizers have made with their logo design is very apt for the conference subject matter.  It reads “iPRES 02010.”  Expressing the year in five digits instead of four is an excellent reminder that we are only existing at a particular place in time.  As climactic as the year 2010 seemed to us on January 1st, it really isn’t any sort of finale.

Speak to a person involved in digital preservation, and they may be able to forecast what the next five years of digital information preservation management will look like.  Maybe.  The five-digit year expresses the future, and encourages thoughts about the people coming after us who will stand to benefit from our digital output.  Thinking that far ahead when talking about digital preservation is rather lofty, I know.  But it illustrates a pervasive point.

It is not likely that I will attend iPRES this year, so I’ve appeased myself by reliving some of the great presentations I saw last year in San Francisco.  The host of last year’s iPRES, the California Digital Library, recently put up the proceedings of iPRES 2009 on their open access publishing platform, eScholarship.  The slides from the presentations have been available for a while on the website.

Here are some 2009 papers that have been influential in my own thinking since the conference:

The entire proceedings are available here for free, individual downloading.

Briefly Exploring Digitization

To be clear, digitization and digital preservation are not the same thing.  Digitization is the process of making digital copies of physical items.  Digital preservation refers to the activities associated with maintaining the viability of, and access to, digital files over time.  Thus, the activities of digitization will result in things that can be (need to be?) included in a digital preservation project.

Automated book scanner

I like to think about all of the large scale book scanning that is happening.  Massive amounts of digital files are being created from physical books.  If these files aren’t taken care of properly, then over time they will become unusable…and all the news coverage of the Google Books settlement seems like a laughable waste of time.

This semester, I’m taking a course specifically on Digitization.  One of the first questions posed to us was whether or not a physical item (book, document, etc.) that has been digitized can be discarded.  Now, I am aware of the intrinsic value that a book hand-printed in the 1500s has, but in my response, I chose to focus on the intellectual (or informational) content.  This question made me think about how intricately tied digitization projects should be to digital preservation programs.  Why would we risk the total loss of an item’s content by relying on a digital version that receives no stewardship after the physical copy has been tossed?  So here are my conditions for tossing a physical copy once it’s been digitized.

  • The digitized copy should be of preservation quality, meeting the commonly cited (though, it seems, not standardized) requirements of 600+ dpi, TIFF file format, etc.
  • The organization charged with keeping the digital file of the digitized item should have a solid and reliable digital preservation program in place.  In a successful digital preservation program, the issues related to file format obsolescence, file corruption, and crashed hard drives will be nullified, as the program should account for these disasters ahead of time and be ready with plans to prevent such events.  Under this condition, it is safe to assume the analog copy would no longer need to be retained since its informational content is safe in a digital format.
  • Access to the digitized copy must be equal to or greater than the access that was allowed with the physical copy.  Preferably, access should be increased, as the new format by its nature enables more avenues of access.  As Rieger (2007) points out, the investments made in large scale digitization initiatives to aggregate and store digitized collections are huge.  “Such investments will be more worthwhile if discovery, access, and delivery are given equal emphasis.”  The argument could be made that increased access to content is as much of a justification for digitization as are any reasons associated with preservation of the content of the physical item.
  • Finally, it must be determined that the physical copy of a digitized item has no other value than what can be conveyed through its digital copy.  If the physical item is valuable for more than its informational content, then perhaps discarding it after it has been digitized is not a reasonable option.
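
One routine that a preservation program like the one described in the second condition might run is a fixity check: record a checksum for each file, then recompute it later to detect corruption or loss. The sketch below is a minimal illustration in Python, assuming SHA-256 and an in-memory manifest; build_manifest and find_problems are hypothetical names, and a real program would also store the manifest safely and keep multiple copies of the files.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Record a checksum for every file under the folder."""
    return {
        str(p.relative_to(folder)): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def find_problems(folder: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or whose contents no longer match."""
    problems = []
    for name, expected in manifest.items():
        target = folder / name
        if not target.is_file():
            problems.append(f"missing: {name}")
        elif sha256_of(target) != expected:
            problems.append(f"changed: {name}")
    return problems
```

In use, build_manifest would run once when the digitized files are deposited, and find_problems would run on a schedule afterwards; an empty list means every file is still present and unchanged.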

Rieger, Oya, Preservation in the Age of Large-Scale Digitization (DRAFT). Washington, DC: Council on Library and Information Resources, 2007. http://www.clir.org/pubs/abstract/pub141abst.html.

Book scanning photo by cogdogblog on Flickr, Creative Commons Attribution 2.0 Generic license.

Copyright and Digital Preservation

Before you read any further, please note that I am not an expert in copyright law…by a long shot.  What you’ll find below is a discussion about how copyright law affects digital preservation as I understand it.  Copyright law is very complex, especially with regard to the “new” issues presented by the digital environment.  Hopefully you will find the references I have listed below, and the items on the Resources page, useful in getting started.

The Problem with Copyright:

The absolute biggest barrier that copyright presents to preserving digital materials is the copyright owner’s exclusive right to reproduce and adapt a work.  Making copies of digital items and adapting them in various ways are generally the first steps of preservation — think of making copies to back things up, and the act of making changes to digital objects during digital format migrations.

Another impediment to digital preservation efforts is the set of dissemination restrictions that copyright law upholds.  Digital preservation is closely tied to access, yet access, a main goal of any preservation effort, is restricted by current copyright law.  The glory of digital items is that they can theoretically be accessed from anywhere, and by multiple simultaneous users.  But copyright law hasn’t quite caught up to accommodate the digital environment and allow us to (legally) use and preserve digital items in the full capacity that the medium allows.

Determining the duration of copyright is somewhat confusing since it depends on when the work was created (or in some cases, when it was published versus when it was created).  Various acts of legislation over many years complicate the law because they have resulted in different copyright durations and renewal lengths.  Bitlaw provides a concise write-up for the summary-inclined among us.

Exceptions to Copyright Law:

Libraries and archives follow the copyright provisions laid out by Section 108 of Title 17 (The Copyright Law) of the US Code (available here).  Libraries and archives are strong candidates for hosting digital preservation initiatives, so that’s why I’m focusing on them.  If the library or archive making the copies is open to the public or allows access to researchers from non-affiliated institutions, then it is not an infringement to make copies for preservation or replacement purposes under the conditions that:

  • the item is already currently held in the collections
  • the item “is damaged, deteriorating, lost, or stolen [not good for digital items, as it will be too late once damage has occurred] or if the existing format in which the work is stored has become obsolete.”
  • the copy is not distributed in a digital format outside the walls of the library (italics added to emphasize the impracticality of this rule)

Additionally, libraries and archives are allowed to make up to three copies of unpublished works for preservation purposes, and up to three copies of published works for replacement purposes.  So, even with the compromises made for libraries in Section 108, there are problematic implications for digital preservation.  Since digital preservation is so closely tied to accessibility, libraries would be extremely limited in how they can preserve – and then share – digital material.

There is hope; people are aware of these limitations.  In March 2008, the Section 108 Study Group released a report of suggestions to improve Section 108 and advance it into a more digitally-oriented mindset.  These suggestions include allowing copies of works to be made prior to damage or loss; allowing copies of publicly accessible websites to be made, with an opt-out option (see the Internet Archive in the following section); and lifting the three-copy preservation or replacement limit.

And let’s not forget about Fair Use.  I won’t get into it deeply here, but it’s a doctrine within Title 17 (Section 107) that actually limits the copyright holder’s exclusive rights.  It allows people to reproduce parts of copyrighted works under certain circumstances.  It is a totally vague and subjective doctrine, and seems to be more of a defense against infringement lawsuits than a right.

An Aside about Copyright and the Web:

There are many web-archiving projects, the resulting files of which will need to be included in preservation processes.  Like content that is created off the web, web-based content is also protected under copyright law unless it is stated otherwise.  The Internet Archive’s approach to harvesting web content for archiving is to collect everything from which its crawlers are not excluded, and to provide an opt-out policy for anyone who specifically does not want to be included.  While the legality of this method is up for debate, the Internet Archive has avoided many infringement suits via their “willingness to respect the wishes of those copyright owners who want to limit and control the reproduction of their copyrighted works” (Hirtle, 2003).

Since the introduction of Web 2.0, web content on a given web page may also have more than one creator.  So, obtaining copyright permission for preservation purposes may be more challenging than contacting one person.  In the case of blogs, for example, blog writers do not own the copyright to comments other people have left (Biederman & Andrews, 2008).  To take this one step further up the difficulty scale, think of the challenges introduced by anonymous comments with no clear author.

Additional Restrictions:

Outside of the general US copyright law that is applied to a work, we must also take into consideration the licensing restrictions that may be associated with subscription materials.  These will likely have their own rules and implications for preservation, especially in light of the Digital Millennium Copyright Act (DMCA).  The DMCA prohibits “circumventing technological access controls to obtain access to copyrighted works,” meaning that if access to the work is password-protected, you cannot create a work-around to allow others to get to it (Besek, 2003).

No Real Precedents:

Finally, I think another basic challenge with copyright is that there are no precedents for many of the issues that digital preservation activities bring to the surface.  This is especially true with regard to the Fair Use exemptions, which are judged on a subjective basis.  The Fair Use exemption could be a saving grace for preservation activities, but until it has proven to be so in an infringement challenge or lawsuit, it is a very big risk to assume that it will hold in every instance.  It’s likely not the right preservation decision to wait until copyright law catches up with the needs of our digital environment.  So…who wants to try first?

Biederman, C. J., & Andrews, D. (2008, May 1). Applying copyright law to user-generated content. Los Angeles Lawyer, 12.
Besek, J. (Jan 2003). Copyright issues relevant to the creation of a digital archive: A preliminary assessment.  CLIR.  Retrieved Jan 5, 2010 from http://www.clir.org/pubs/reports/pub112/contents.html
Hirtle, P.B. (2003).  Digital preservation and copyright.  Retrieved Jan 5, 2010 from http://fairuse.stanford.edu/commentary_and_analysis/2003_11_hirtle.html

A Budding Branch?

This evening I sat in on a lecture given by John Phillips, a Management Consultant at Information Technology Decisions.  John was giving an overview of what he saw as the similarities and differences between the three main branches of information management professionals: librarians, archivists, and records managers.  What was not included in this list were digital preservationists.

Now, as someone who is not actually working in the field, I may be wrong to assume that digital preservation has already earned professionals who carry that title.  But I think if this is not yet the case, then it certainly will be in the future…once it becomes clear that professionals from the other three branches of information management cannot all be expected to have expert-level knowledge of digital preservation practices….which will become clear because everyone in information management really needs to start thinking about technological obsolescence.

The point of this post, though, is to point out a major correlation between records managers and what I would be inclined to think of as digital record preservationists.  As John pointed out, records managers differ from librarians and archivists because 1) they tend to work in business or corporate environments, and 2) they are OK with, and are expected to be, throwing things away once those things are no longer of value to the owning organization.

photo by Sebastiano Pitruzzello

Upon an item’s accession into a repository, records managers will assess the value of an object, and then revisit that assessment later on in the course of retention decisions.  If the item is no longer worth keeping, it is discarded.  This is also the (theoretical) case with digital preservationists.  In digital preservation, OAIS-type repositories are intended to preserve digital items for as long as those items are of value to their designated communities.  This implies that at some point, a digital item may no longer have value, and therefore continued preservation efforts for that item are not economically justified.  Throwing things away is a dirty job, but just as we can’t possibly collect everything out there, perhaps we can’t keep it all, either.

But let’s not discredit those clingy librarians.  John gave an interesting guesstimate regarding the types of repositories that each group of information professionals works with.  Among records managers, archivists, and librarians, librarians deal with the highest proportion of electronic to physical records.  (John’s guesstimate was 40% electronic / 60% physical, in comparison to IT professionals, who are 100% electronic by nature.)  The numbers for records managers were 30% electronic / 70% physical, which is still quite a lot of paper to be dealing with.
So if librarians are handling the highest proportions of electronic items out of these three groups, we can make a big case for libraries to be the battle grounds for creating leaders in digital preservation.  Technological and file format obsolescence will hit libraries the hardest if these numbers are accurate.  As contenders with the most to lose, libraries are poised to harbor the most institutional support for digital preservation initiatives…and perhaps spawn the fourth major branch of information professionals.