Why There is No Single Preservation Strategy

The following are some thoughts that I had about why there isn’t one digital preservation strategy that can be applied to all digital preservation programs.  As wonderful as it would be to find one standardized solution that fits everyone’s needs, it’s essentially an impossibility.  What’s below is something I wrote up for some coursework, but I thought I’d share it here, too.

There are two ways to answer the question of why there is no universally applicable digital preservation strategy.  The first is at the institutional level, and the second is at the level of the digital objects intended for preservation.

Digital preservation efforts so far have been tied to institutions interested in maintaining access to digital objects over time.  Being tied to an institution for funding and support comes with governance, policies, administrations, departments, units, stakeholders to please, and service missions, all of which are specific to the institution.  These factors guide the way a digital preservation strategy is created at a given institution.  And this is fine; Pennock (2006) even goes so far as to state that “digital preservation policies are most effective when integrated into the overall organisational policy framework.”  But this rules out a universal digital preservation model, simply because of all the “personalizations” that would need to take place in order to meet the needs of the institution as well as the capabilities allowed by whatever funding is available.

The second way to answer the question of why there is not a single digital preservation method that can be applied to everything is at the level of the digital objects.  When it comes to determining the actual preservation method, some ways work better than others depending on the type of file at hand, and the needs associated with that individual file.  For example, van der Hoeven (2004) points out that migration is an effective preservation method for widely supported file formats, but it might not be good for files that must maintain high levels of authenticity.  He even goes on to state that “…no one size fits all solution is possible.  Digital documents differ from each other in too many ways and are used for many different purposes by many different users.”
If we are to come up with an effective digital preservation strategy (at both the institutional and document levels), we must remain aware of the options, and expect to employ more than one method, strategy, and tool set.
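
To make the object-level point concrete, here is a minimal sketch in Python of how a per-object decision might be written down.  The class, the two flags, and the returned labels are all my own assumptions for illustration; they only encode the point attributed to van der Hoeven above (migration suits widely supported formats, but may not suit objects that must keep a high level of authenticity) and are not taken from any real tool.

```python
from dataclasses import dataclass


@dataclass
class DigitalObject:
    # All fields are hypothetical, chosen only for this illustration.
    name: str
    widely_supported_format: bool   # many current tools can read the format
    needs_high_authenticity: bool   # the original must be kept as intact as possible


def suggest_strategy(obj: DigitalObject) -> str:
    """Return a rough, illustrative suggestion for a single object."""
    if obj.needs_high_authenticity:
        # Migration changes the file, so flag these objects for another approach.
        return "review alternatives to migration"
    if obj.widely_supported_format:
        return "migration"
    return "review case by case"


report = DigitalObject("annual_report.pdf", True, False)
artwork = DigitalObject("interactive_artwork.bin", False, True)
print(suggest_strategy(report))
print(suggest_strategy(artwork))
```

Even in this toy version, two objects in the same collection end up with different suggestions, which is the whole argument in miniature.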

Pennock, M. (2006).  “JISC Briefing paper: digital preservation, continued access to authentic digital assets.”  Retrieved September 30, 2009 from
van der Hoeven, J. R. (2004). “Permanent Access Technology for the virtual heritage.”  Retrieved September 30, 2009 from http://jeffrey.famvdhoeven.nl/dd/Researchtask%20IBM%20TU%20Delft%20-%20J.R.%20van%20der%20Hoeven.pdf
Photo by Tambako the Jaguar under a Creative Commons Attribution-No Derivative Works 2.0 Generic license.

JHOVE and JHOVE2

When curating digital files for storage in a digital repository, being certain of an object’s file format is very important for preservation purposes and for future accessibility.  JHOVE is an open-source, Java-based framework that will identify, validate, and characterize the formats of digital objects.  This tool can be integrated into an institution’s workflow associated with populating a digital repository.  If the repository is OAIS-compliant, the workflow integration would occur during the creation and validation phase of an information package in the digital object’s ingestion.

JHOVE’s three steps of identification, validation, and characterization will result in us knowing a great deal about a digital object’s technical properties.  We’ll know the object’s format (identification), we’ll know that it is what it says it is (validation), and we’ll know about the significant format-specific properties of the object (characterization).
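
To give a feel for what those three steps mean in practice, here is a small Python sketch.  This is not JHOVE’s code or API (JHOVE is a Java framework); the function names are my own, and the checks are greatly simplified.  It identifies a format from its leading “magic” bytes, validates only the start of a PNG file, and characterizes it by reading a few properties from the PNG header.

```python
import struct

# Leading "magic" bytes for a few well-known formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"%PDF-": "PDF",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
}


def identify(data: bytes) -> str:
    """Identification: guess the format from the leading bytes."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return "unknown"


def validate_png(data: bytes) -> bool:
    """Validation (greatly simplified): the PNG signature must be
    followed by a 13-byte IHDR chunk. A real validator checks far more."""
    if len(data) < 26 or identify(data) != "PNG":
        return False
    (length,) = struct.unpack(">I", data[8:12])
    return length == 13 and data[12:16] == b"IHDR"


def characterize_png(data: bytes) -> dict:
    """Characterization: read a few format-specific properties from IHDR."""
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return {"width": width, "height": height,
            "bit_depth": bit_depth, "color_type": color_type}


# A hand-built PNG header (not a complete image) used as sample input.
sample = (b"\x89PNG\r\n\x1a\n" + struct.pack(">I", 13) + b"IHDR"
          + struct.pack(">IIBBBBB", 2, 3, 8, 2, 0, 0, 0))

if validate_png(sample):
    print(identify(sample), characterize_png(sample))
```

The point of the sketch is the separation of concerns: knowing what a file claims to be, checking that claim, and then recording the properties that matter for that format.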

JHOVE’s name comes from the partnership formed between JSTOR and the Harvard University Library in 2003 to create the software, and stands for JSTOR/Harvard Object Validation Environment.

The Digital Curation Centre did a case study in 2006 which provides a concise background of the JHOVE project.

JHOVE2

The September 2008 Library of Congress Digital Preservation Newsletter reports on the development of JHOVE2.  The users and creators of JHOVE decided to address what they saw as shortcomings and improve the tool for JHOVE2.  However, the project has moved to the guidance of the California Digital Library, Portico, and Stanford University.

The most notable change to come in JHOVE2 is the shuffling around of the original  three-step process outlined by JHOVE.  For JHOVE2, the whole process is now considered to characterize a digital object by identifying, validating, and reporting the inherent properties of the object that would be significant to its preservation.  Added to this process is an assessment feature that determines the digital object’s acceptability for an institution’s repository, based on locally-defined policies.
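
As a rough illustration of what an assessment step against locally-defined policies could look like, here is a short Python sketch.  It is not JHOVE2 code; the policy keys, property names, and messages are all hypothetical and simply stand in for whatever an institution decides to require.

```python
def assess(properties: dict, policy: dict) -> list:
    """Return a list of policy problems; an empty list means acceptable."""
    problems = []
    if properties["format"] not in policy["accepted_formats"]:
        problems.append("format not accepted: " + properties["format"])
    if policy["require_valid"] and not properties["valid"]:
        problems.append("object failed validation")
    if properties["size_bytes"] > policy["max_size_bytes"]:
        problems.append("object exceeds size limit")
    return problems


# A made-up local policy and a made-up characterization report.
local_policy = {
    "accepted_formats": {"PDF", "TIFF", "PNG"},
    "require_valid": True,
    "max_size_bytes": 500_000_000,
}
reported = {"format": "GIF", "valid": True, "size_bytes": 20_000}
print(assess(reported, local_policy))
```

Keeping the policy as plain data separate from the checking code reflects the idea that the rules are local to each institution while the tool is shared.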

A really exciting and valuable improvement in JHOVE2 will be the ability to address the characterization of digital objects that are comprised of more than one type of format.

One of the developers of JHOVE and JHOVE2 is Stephen Abrams of the California Digital Library, who has been designated as a Digital Preservation Pioneer by the Library of Congress.

JHOVE2 is expected to be released in early 2010, but the prototype code was made available last month for viewing and comments.  Here are the FAQ from the project’s site.  The completed product, like its predecessor, will also be available under an open source license.

There will be a JHOVE2 Workshop following iPRES 2009.

iPRES

iPRES is an annual international conference on the Preservation of Digital Objects.  Current research and projects are presented by authors of papers that have been selected by a comprehensive review process.  The papers tend to focus on technological research and on authors’ experiences in implementing and practicing different preservation strategies.  iPRES 2009 marks the sixth year of the conference, and it is taking place October 5-6 at the Mission Bay Conference Center in San Francisco, CA.  The California Digital Library is acting as this year’s host and is thus leading the internal conference planning and local preparations.

Last year’s conference was hosted by the British Library and was held in London.  Previous to that, iPRES 2007 was organized by the National Science and Technology Library of China and was held in Beijing.  More information about previous conferences can be found here.

iPRES 2009 has posted a two-track draft program, which reveals that David Kirsh and a panel of members of the Blue Ribbon Task Force on Sustainable Digital Preservation and Access will give the keynote addresses.

Also of interest to this year’s conference is the string of related events that follow it.  These events are taking place in San Francisco as well, and might make for exciting ways for iPRES attendees to tack on a couple of extra days to their stay in the city.

Digital Preservation Coalition (DPC)

The Digital Preservation Coalition was established in 2001.  It is a UK-based non-profit whose members share the goal of raising awareness and sharing knowledge about digital preservation.  I think their first success in achieving this goal was to create an international organization of members.

Membership is open to all parties, given that they are non-profit or collective.  There are different tiers of involvement in which members can participate, from being a funding supporter of a specific project to full membership, which costs 10,000 GB pounds/year.  A list of members can be found here.

Mission

Reading through the mission of the DPC is like looking at a hit-list of many of the key issues of digital preservation efforts.  Primarily, it is easy to appreciate that the DPC recognizes the necessity of collaboration in an effective digital preservation strategy by openly stating the very harrowing admission that no organization can “address all the challenges alone.”  Sharing progress and ideas is fundamental to this effort.  But the DPC also encourages individual projects done by members in order to promote more homegrown institutional and sector-level preservation practices and policies.

My favorite part of their mission is this:
“Instituting a concerted and co-ordinated effort to get digital preservation on the agenda of key stakeholders in terms that they will understand and find persuasive.”

I’m glad that someone has taken this part of the digital preservation process to the battlegrounds.  No matter how well-planned and coordinated any digital preservation project may be, they all need funding.  And funding will probably have to come from parties that have not thought of or even necessarily heard of digital preservation and its importance.  Explaining the process and need is really Step 1 in any successful attempt to secure support and funding.

In general, I would think that digital preservation efforts are at an advantage for getting funding because once their goals are understood, it would be difficult for a truly invested stakeholder to overlook their relevance.  In this part of its mission statement, the DPC has really addressed connecting the dots between the people involved in digital preservation projects and the people who need to support these efforts.

What the DPC Does

The DPC produces and shares information regarding research and practice within the digital preservation community.  They also work on promoting technology and standards, including coordinating recommendations for the 5-year review of the OAIS standard.  There is a clearly defined list of other goals and objectives here.

Their website is a comprehensive hub for their reports and activities, and also lists the projects of its members – arranged by type.  You will also find various training opportunities and a quarterly newsletter produced in concert with PADI.

The DPC also administers the international Digital Preservation Award.

What I think is probably their magnum opus up to this point is their Handbook.

The Handbook

This incredibly useful handbook is maintained by the DPC.  It goes far beyond the OAIS model guidelines by including more information and concepts, as well as information about selecting materials.  The handbook is meant to be “of interest to all those involved in the creation and management of digital materials,” and I think it really is.  A brief look at it will show you:

  • A who, what, why, how overview of digital preservation
  • A glossary of definitions and concepts
  • A run-down of media storage formats
  • Preservation strategies at the institutional level
  • Organizational, workflow, and institutional collaboration strategies
  • Acquisition and selection guidelines with an incredible supplementary flow chart for selecting materials

One final note I’d like to make regards the UK-centric view that the organization’s mission proclaims.  This is an organization made up of international members who are all making strides together in preserving global digital artifacts.  Even though the DPC is based in the UK and aims to place UK digital preservation strategies into an international context, I think we all stand to benefit from it as a resource and organization.  One shouldn’t be deterred from participating or from using what the DPC has to offer for this reason!

HathiTrust

“Hathi” is the Hindi word for elephant, and this project uses the elephant’s association with wisdom and memory in its name, HathiTrust.  HathiTrust is a shared digital repository whose content is composed largely of digitized books from the Google Books Library Project.  The idea to create the shared digital repository comes from the Committee on Institutional Cooperation and the University of California system.  The associated universities are all partners in HathiTrust, and partnership is open to other research libraries.

As far as preserving the content within the repository, HathiTrust touts that it “provides a no-worry, pain-free solution to archiving vast amounts of digital content. You can rely on the expertise of other librarians and information technologists who understand your needs and who will address the issues of servers, storage, migration, and long-term preservation.”

HathiTrust envisions itself as becoming a very credible and large digital library as well as being able to provide a viable preservation service for its partners.

The Content and Google

In the HathiTrust shared repository, outside users will mostly find content that is in the public domain.  Due to copyright and licensing regulations, most of the content that is currently in the repository cannot be viewed by non-owning parties.  In fact, only 16% of the volumes stored in HathiTrust were in the public domain as of August 8, 2009.  The FAQ also state that, “as it becomes possible to expand access to the materials through permissions or other agreements, other materials will be made available. HathiTrust has already been contacted by some rights holders wishing to provide broader access to their content.”
HathiTrust has made all of the public domain content available via search within the repository and through Google.  You can also browse a very cool visualization of the public domain content by the LC classification.  The search functionality of the full catalog is still under development in partnership with OCLC, but they offer this at the current time.

Most of the public domain content of this shared digital repository comes from books that have been digitized through the Google Books Library Project. Google partnered with institutional libraries to digitize books that are hard to find and may otherwise be unavailable.  HathiTrust stemmed from some of the libraries in this partnership banding together to create a way to preserve and share their copies of the digitized books…independently of Google.

Having two sources of the same digitized content might sound a little redundant.  The Google Books Library Project corresponds directly with Google’s mission to make the world’s information accessible.  However, when we look a little more closely at the Library Project’s goal, it seems that Google is really trying to create some sort of visual card catalog, and is less concerned with the actual content.  Google also does not have the historical role that research libraries do in creating consistent access to materials.  Additionally, HathiTrust is focused on the long-term retention and storage of the digital content.  We can’t be sure of what Google’s plan is.  So perhaps it’s good that there is another project attempting to serve researchers and the public good that has no corporate or commercial ties.

HathiTrust aims to include digitized content from printed materials (including journals) that extend beyond the scope of the Google digitization project.  They also hope to include born-digital materials and, eventually, items from institutional repositories.

Accessibility and Preservation Benefits

In my mind, one benefit of HathiTrust’s efforts that concerns a large number of potential researchers is the fact that it will provide an opportunity to remotely access items from libraries’ collections that would have otherwise required travel…or would have made side-by-side comparison an impossibility given the fragile and protected nature of many rare library items.
But why limit the benefit of accessibility to scholars?  One thing about digitization is that it can create a user audience where there once was no audience…so it is impossible to say how the general public can utilize newly available resources.

As for libraries, becoming a partner in the program will enable a library to store vast amounts of digital content, be it a special collection that has been digitized, endangered books (rare, brittle, etc.), purchased ebooks, and other digital items.  This will be a huge benefit to libraries who cannot start or handle a digital collection and storage project on their own.  The partner library will provide all the bibliographic data records for items to be ingested, and HathiTrust takes on all the hardware, staffing, and migration concerns.

Even non-partner libraries and their patrons will find an enormous benefit to utilizing HathiTrust: “HathiTrust is making bibliographic records for the public domain HathiTrust materials available so that institutions around the world can load them into their online catalogs, alerting users to the availability of these digitized volumes.”

Funding

HathiTrust is a non-profit, with no intention of becoming for-profit.  The original partners from the CIC and the University of California have apparently provided a large sum of the project’s existing funding.  Further partnerships from joining libraries will continue to provide funding.  The library partners will be charged once upon joining depending on the number of volumes they are adding to the repository using a per GB calculation system.

OAIS Reference Model Part II: The Model

Welcome to Part II of my OAIS Reference Model crash course!  By now you probably have noticed that I have refrained from including in this post any of the many diagrams that are in the OAIS reference model document.  This is because before I had a basic understanding of the model, these images seemed supremely complicated and confusing…kind of like PowerPoint slides with too many words.  I hope that what I provide here is a substantial enough understanding of the OAIS model to make the images less frightening when you do eventually encounter them.

Model Roles:

To start, it is important to recognize the three types of people that will be affiliated with a repository within the OAIS framework: the Producers of the repository’s content, the Managers of the content and repository, and the Consumers who use the content stored in the repository.  Each phase of the preservation process affects these three roles: the ingest, the processing and storage, and the accessing of digital objects.

The Model in Brief:

The document for the OAIS reference model has several key areas of content:

  • Terminology: An awesome vocabulary and glossary for the operations and information structures of repositories is located in Section 1.
  • Mandatory responsibilities: A list of the things that a repository must do in order to be considered an OAIS-type repository comprises Section 3.  One particular action that this section calls for is identifying a designated producer/consumer community and ensuring that the information within the repository (metadata, etc.) is independently understandable (and accessible) by this community.  This means that “the community should be able to understand the information without needing the assistance of the experts who produced the information.”  Read this for more detail about the other mandatory responsibilities.
  • A model for ingesting, storing, and providing access to stored items, including a very smart model for capturing each item’s metadata (Content Information) and preservation metadata (Preservation Description Information).  Together, this data makes up an item’s “information package,” which is delimited and identified by its “packaging information.”  It is intended to include information about an item’s context in order to fulfill one of an OAIS-type repository’s mandatory responsibilities.  This is all discussed in Section 2.
  • An outline for administrative management of the repository and the OAIS functions is presented in Section 4.  This discusses working with the creators of the digital objects and the objectives behind the day-to-day management of the repository. The administrative role also oversees the general planning and governance of the repository, and includes policy and preservation decisions.
  • Actual preservation methods: Preservation processes such as digital migration and emulation are examined in Section 5.  Preservation Planning is obviously a central part of any repository’s role.
  • Archive and repository interoperability: Concepts behind repository interoperability and federation are discussed and explained in Section 6.  Heavy cooperation between repositories is needed to develop common standards in order to make this a possibility.
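
For readers who think better in code, here is a minimal Python sketch of the information-bundling idea from the list above: an item’s Content Information kept together with its Preservation Description Information.  The class and field names are my own shorthand, and the four PDI fields (reference, provenance, context, fixity) reflect my reading of the model rather than anything the standard prescribes as a data format.  OAIS does not define code or a file layout, so treat this purely as an illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class PreservationDescriptionInformation:
    reference: str    # identifier for the content
    provenance: str   # origin and history of the content
    context: str      # how the content relates to its environment
    fixity: str       # SHA-256 digest used to detect later changes


@dataclass
class InformationPackage:
    content: bytes             # the preserved bits themselves
    representation_info: str   # how to interpret those bits
    pdi: PreservationDescriptionInformation


def build_package(content, representation_info, reference, provenance, context):
    """Bundle content with its descriptive information and a fresh digest."""
    digest = hashlib.sha256(content).hexdigest()
    pdi = PreservationDescriptionInformation(reference, provenance, context, digest)
    return InformationPackage(content, representation_info, pdi)


def fixity_ok(package):
    """Recompute the digest and compare it with the stored fixity value."""
    return hashlib.sha256(package.content).hexdigest() == package.pdi.fixity


package = build_package(
    b"Minutes of the 1998 board meeting",
    "plain text, UTF-8",
    "example-id-0001",
    "deposited by the records office",
    "part of the board minutes series",
)
print(fixity_ok(package))
```

The context field is there for exactly the reason given above: the package should carry enough information for the designated community to understand the item without asking the people who produced it.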

By following the OAIS model and the mandatory responsibilities which it entails, a repository will gain recognition as an OAIS-type archive or repository.  It is beneficial for a repository to be recognized as such because it means that the well-documented archival standards of the OAIS model will have been applied to help ensure the effective long-term storage, retrieval, and preservation of digital documents.  Another benefit is that communication with similarly-purposed OAIS repositories will be easy and fluid.

OAIS in Action:

DSpace and Fedora are two repository software platforms that have included OAIS-compliance capabilities in their product.  This helps pave the road for any repository that is built using either of these open source systems to follow procedures from the OAIS model.

What I would love to find or collect is a list of actual digital archives and repositories that are following the OAIS model either by the book or in some variation.  If anyone has a suggestion, please post a comment!

OAIS Reference Model Part I: Background and Influence

The OAIS model is an international standard that has been adopted for guiding the long-term preservation of digital data and documents.  In fact, the OAIS model is an ISO standard (ISO 14721:2003): it was developed by the Consultative Committee for Space Data Systems (CCSDS) in 2002, and was adopted as an ISO standard in 2003.  The document is freely available, despite the fact that most ISO documentation is usually sold as a service.  It’s a hefty 148 pages, available in PDF form here.

Photo by OliBac licensed under Creative Commons

The OAIS model is a standardized model describing a way that digital repositories intended for preservation purposes can be run.  Within this model, you will not find a standard for metadata.  It also does not endorse any particular repository platform, software, protocols or implementation procedure.  The OAIS model is simply a set of standardized guidelines intended to aid the people and systems behind a repository that has been designated with the responsibility of maintaining documents for archival purposes over a long period of time.

OAIS stands for Open Archival Information System, the word open referring to the open and public process under which this model was developed.  Participation in its initial development was encouraged by the CCSDS, and as an ISO standard, it will go under review every five years.

Because the OAIS model is a recognized standard, its users have formed a default sub-community within the digital preservation community.  But it has also been very beneficial to the digital preservation community at large and has helped promote progressive thinking and discussion.  Here are some key reasons why the OAIS model is so helpful to the digital preservation process and community:

  • It has standardized the terminology associated with digital preservation
  • It has outlined the duties and services of a preservation repository
  • It has outlined a way that information should be attributed and managed within a repository
  • It has mobilized community discussions about repository standards and certification
  • It has included preservation metadata as an important part of the preservation process
  • It focuses on long-term preservation, but lets “long-term” be defined by the repository managers
  • OAIS-type archives are committed to a set of defined responsibilities

As a final note, it is important to make it clear that the OAIS model is by no means a requirement for a digital repository; while it is a recognized way of running a repository, it is not the only way.  It may not fit some repositories, depending on their intended size, resources, and designated communities.  But admittedly, when a repository chooses not to follow the OAIS recommendations, it cannot fall under the umbrella of the most widely-used and understood digital archive standard.

————

Here are some resources that were incredibly useful for me while writing this post and the one to follow:

  • I really benefited from reading this post by John Mark Ockerbloom, the editor of the blog Everybody’s Libraries.  I almost considered forgoing my own entry and just directing readers directly to his!
  • And then I found this post and was blown away by how thorough it is.  It’s really well done and I’d encourage you to check it out.
  • This page is a brief run-down of OAIS from the JISC Standards Catalogue.

Continue on to Part II

ISO Standards

ISO is the commonly used name for the International Organization for Standardization. This is an international, non-governmental organization that creates standards based on a consensus of international committee members.

One ISO standard that is relevant to digital preservation practices is the OAIS model.

Additionally, there is a working group attempting to create an ISO standard for digital repository certification, which I think is an excellent idea. A wiki is maintained here with information related to their regular remote meetups and the documentation they are creating and collecting to assist in the process of writing a standard. A useful glossary of digital preservation terms can also be found on their wiki.

Original publication date: 7/20/09

Cloud Computing

Let’s talk about cloud computing.

At its simplest, things that are in the “cloud” are things that float around in a sort of digital airspace and don’t exist on your computer. They exist on remote servers which can be accessed from many computers.

Photo by mansikka under a Creative Commons license

For this reason, the cloud is a good metaphor for the Internet. For most of us, keeping things in a cloud results in a convenient and logical way to make life simple.

You can access things in the cloud from anywhere that is connected to the Internet…depending on the service and its security (private cloud or public cloud). It’s kind of like your email or Facebook account. You have lots of stuff stored in these accounts that is specific to you, but you can log in from anywhere. And it will always look the same and have all your stuff in it. Your stuff is always just…there.

Photo by AJC1 under a Creative Commons license

Getting a bit more technical, your stuff is actually physically stored somewhere as bits on servers that are run by whoever is providing the service. For example, some institutions have servers dedicated to an institutionally-based digital repository. These servers might live on the campus and will store everything that is added to the repository. But the whole repository will not exist on the specific machine that you might use to access documents stored there. Your computer will connect to the remote server to access the repository.

What makes this fun for digital preservationists is that cloud computing can really increase the scale and sharing of preservation duties. Maureen Pennock, the Web archive preservation Project Manager at the British Library, recognizes this in her blog: “This minimises costs for all concerned, addresses the skills shortage, and produces a more efficient, sustainable and reliable preservation infrastructure.”

In the future, and as we are already seeing with DuraCloud, all the tech work behind producing ways to store and retrieve data may be provided as part of a single repository product. (This type of service, by the way, is referred to as IaaS, or Infrastructure as a Service.) This would be excellent news for the many institutions that don’t have the means or skills to set up a repository themselves.

Cloud computing offers a huge potential for an off-site alternative to the out-of-the-box repository products that most institutions currently must use. Instead, external organizations will be able to do the tech work while the institutions will be able to focus on non-technical repository maintenance.

Images by Flickr users mansikka and AJC1

Original publication date: 7/15/09