JHOVE and JHOVE2

When curating digital files for storage in a digital repository, being certain of an object’s file format is very important for preservation and for future accessibility.  JHOVE is an open-source, Java-based framework that identifies, validates, and characterizes the formats of digital objects.  It can be integrated into an institution’s workflow for populating a digital repository.  If the repository is OAIS-compliant, that integration would occur when an information package is created and validated during the digital object’s ingestion.

JHOVE’s three steps of identification, validation, and characterization will result in us knowing a great deal about a digital object’s technical properties.  We’ll know the object’s format (identification), we’ll know that it is what it says it is (validation), and we’ll know about the significant format-specific properties of the object (characterization).
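To make the three steps a little more concrete, here is a minimal sketch in Python.  It is not JHOVE's code or API (JHOVE is written in Java and uses much richer format modules); every name in it (SIGNATURES, identify, validate_png, characterize_png, analyze, build_sample_png_header) is hypothetical and only illustrates the idea, using the first chunk of a PNG file as the example format.

```python
import struct
import zlib
from typing import Optional

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Hypothetical, tiny signature table; real tools know far more formats.
SIGNATURES = {
    PNG_SIGNATURE: "PNG",
    b"%PDF-": "PDF",
    b"GIF89a": "GIF",
}


def identify(data: bytes) -> Optional[str]:
    """Identification: guess the format from the leading magic bytes."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def validate_png(data: bytes) -> bool:
    """Validation (shallow): the first chunk must be a 13-byte IHDR with a correct CRC."""
    if len(data) < 33:
        return False
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if length != 13 or chunk_type != b"IHDR":
        return False
    stored_crc = struct.unpack(">I", data[29:33])[0]
    # The PNG CRC covers the chunk type and the chunk data.
    return stored_crc == (zlib.crc32(data[12:29]) & 0xFFFFFFFF)


def characterize_png(data: bytes) -> dict:
    """Characterization: report format-specific properties read from IHDR."""
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return {"width": width, "height": height,
            "bit_depth": bit_depth, "color_type": color_type}


def analyze(data: bytes) -> dict:
    """Run the three steps in order and collect the results in one report."""
    fmt = identify(data)
    report = {"format": fmt, "valid": None, "properties": {}}
    if fmt == "PNG":
        report["valid"] = validate_png(data)
        if report["valid"]:
            report["properties"] = characterize_png(data)
    return report


def build_sample_png_header(width: int, height: int) -> bytes:
    """Build the signature plus IHDR chunk of a PNG, for demonstration only."""
    chunk = b"IHDR" + struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    crc = zlib.crc32(chunk) & 0xFFFFFFFF
    return PNG_SIGNATURE + struct.pack(">I", 13) + chunk + struct.pack(">I", crc)


if __name__ == "__main__":
    print(analyze(build_sample_png_header(640, 480)))
```

The point of the sketch is the order of operations: we only validate once we have a candidate format, and we only report properties for an object that passed validation.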

JHOVE’s name stands for JSTOR/Harvard Object Validation Environment and comes from the partnership formed in 2003 between JSTOR and the Harvard University Library to create the software.

The Digital Curation Centre did a case study in 2006 which provides a concise background of the JHOVE project.

JHOVE2

The September 2008 Library of Congress Digital Preservation Newsletter reports on the development of JHOVE2.  The users and creators of JHOVE decided to address what they saw as shortcomings and improve the tool for JHOVE2.  The project, however, is now under the guidance of the California Digital Library, Portico, and Stanford University.

The most notable change to come in JHOVE2 is the reorganization of the original three-step process outlined by JHOVE.  For JHOVE2, the whole process is now considered characterization: a digital object is characterized by identifying, validating, and reporting the inherent properties of the object that would be significant to its preservation.  Added to this process is an assessment feature that determines the digital object’s acceptability for an institution’s repository, based on locally defined policies.
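As a rough illustration of what assessment against locally defined policies could look like, here is a small Python sketch.  It is not how JHOVE2 implements the feature; the Rule class, the LOCAL_POLICY list, the assess function, and the report keys (format, valid, size_bytes) are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rule:
    """One locally defined requirement (hypothetical structure)."""
    requirement: str
    check: Callable[[dict], bool]


# Hypothetical policy for an imaginary repository; the values are placeholders.
LOCAL_POLICY = [
    Rule("format is on the accepted list",
         lambda r: r.get("format") in {"PDF", "TIFF", "PNG"}),
    Rule("object passed validation",
         lambda r: r.get("valid") is True),
    Rule("file is smaller than 2 GB",
         lambda r: r.get("size_bytes", 0) < 2 * 1024 ** 3),
]


def assess(report: dict, policy: List[Rule]) -> Tuple[bool, List[str]]:
    """Return (acceptable, unmet requirements) for a characterization report."""
    unmet = [rule.requirement for rule in policy if not rule.check(report)]
    return (not unmet, unmet)


if __name__ == "__main__":
    report = {"format": "GIF", "valid": True, "size_bytes": 52_000}
    print(assess(report, LOCAL_POLICY))
```

Keeping the policy as data, separate from the characterization itself, is what would let each institution apply its own rules to the same report.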

A really exciting and valuable improvement in JHOVE2 will be the ability to characterize digital objects that are composed of more than one format.

One of the developers of JHOVE and JHOVE2 is Stephen Abrams of the California Digital Library, who has been designated as a Digital Preservation Pioneer by the Library of Congress.

JHOVE2 is expected to be released in early 2010, but the prototype code was made available last month for viewing and comments.  Here are the FAQ from the project’s site.  The completed product, like its predecessor, will be available under an open source license.

There will be a JHOVE2 Workshop following iPRES 2009.

HathiTrust

“Hathi” is the Hindi word for elephant, and this project uses the elephant’s association with wisdom and memory in its name, HathiTrust.  HathiTrust is a shared digital repository whose content is composed largely of digitized books from the Google Books Library Project.  The idea to create the shared digital repository comes from the Committee on Institutional Cooperation and the University of California system.  The associated universities are all partners in HathiTrust, and partnership is open to other research libraries.

As for preserving the content within the repository, HathiTrust touts that it “provides a no-worry, pain-free solution to archiving vast amounts of digital content. You can rely on the expertise of other librarians and information technologists who understand your needs and who will address the issues of servers, storage, migration, and long-term preservation.”

HathiTrust envisions itself as becoming a very credible and large digital library as well as being able to provide a viable preservation service for its partners.

The Content and Google

In the HathiTrust shared repository, outside users will mostly find content that is in the public domain.  Due to copyright and licensing regulations, most of the content that is currently in the repository cannot be viewed by non-owning parties.  In fact, only 16% of the volumes stored in HathiTrust were in the public domain as of August 8, 2009.  The FAQ also state that, “as it becomes possible to expand access to the materials through permissions or other agreements, other materials will be made available. HathiTrust has already been contacted by some rights holders wishing to provide broader access to their content.”

HathiTrust has made all of the public domain content available via search within the repository and through Google.  You can also browse a very cool visualization of the public domain content by LC classification.  The search functionality of the full catalog is still under development in partnership with OCLC, but they offer this at the current time.

Most of the public domain content of this shared digital repository comes from books that have been digitized through the Google Books Library Project. Google partnered with institutional libraries to digitize books that are hard to find and may otherwise be unavailable.  HathiTrust stemmed from some of the libraries in this partnership banding together to create a way to preserve and share their copies of the digitized books…independently of Google.

Having two sources of the same digitized content might sound a little redundant.  The Google Books Library Project corresponds directly with Google’s mission to make the world’s information accessible.  However, when we look a little more closely at the Library Project’s goal, it seems that Google is really trying to create some sort of visual card catalog, and is less concerned with the actual content.  Google also does not have the historical role that research libraries do in creating consistent access to materials.  HathiTrust, by contrast, is focused on the long-term retention and storage of the digital content.  We can’t be sure of what Google’s plan is.  So perhaps it’s good that there is another project attempting to serve researchers and the public good that has no corporate or commercial ties.

HathiTrust aims to include digitized content from printed materials (including journals) that extend beyond the scope of the Google digitization project.  They also hope to include born-digital materials and, eventually, items from institutional repositories.

Accessibility and Preservation Benefits

In my mind, one benefit of HathiTrust’s efforts that concerns a large number of potential researchers is the fact that it will provide an opportunity to remotely access items from libraries’ collections that would have otherwise required travel…or would have made side-by-side comparison an impossibility given the fragile and protected nature of many rare library items.

But why limit the benefit of accessibility to scholars?  One thing about digitization is that it can create a user audience where there once was no audience…so it is impossible to say how the general public will utilize newly available resources.

As for libraries, becoming a partner in the program will enable a library to store vast amounts of digital content, be it a special collection that has been digitized, endangered books (rare, brittle, etc.), purchased ebooks, or other digital items.  This will be a huge benefit to libraries that cannot start or handle a digital collection and storage project on their own.  The partner library will provide all the bibliographic data records for items to be ingested, and HathiTrust takes on all the hardware, staffing, and migration concerns.

Even non-partner libraries and their patrons will find an enormous benefit to utilizing HathiTrust: “HathiTrust is making bibliographic records for the public domain HathiTrust materials available so that institutions around the world can load them into their online catalogs, alerting users to the availability of these digitized volumes.”

Funding

HathiTrust is a non-profit, with no intention of becoming for-profit.  The original partners from the CIC and the University of California have apparently provided a large share of the project’s existing funding.  Further partnerships from joining libraries will continue to provide funding.  Library partners will be charged once upon joining, with the amount depending on the number of volumes they are adding to the repository and calculated on a per-GB basis.