Celebrating Digital Preservation Day: Why Scalable Digital Preservation Systems Should Be Everywhere
On Digital Preservation Day 2017, Euan Cochrane, Digital Preservation Manager at Yale University Library, discusses the challenges of scale associated with vast collections of born-digital content
Today marks the first annual ‘Digital Preservation Day,’ when preservation practitioners around the world can reflect on the importance of ensuring that critical digital records, whether of cultural or business importance, remain accessible, secure and future-proofed. It’s a day for us to consider not only the value of the collections we hold, but also the scale of many quickly growing born-digital archives.
The preservation and scalability challenges of using CD-ROMs to store unique digital collections bring into sharp focus the difficulty of managing both native and born-digital content. A few years ago at Yale University Library we began taking action to mitigate the risk of media degradation by actively preserving the born-digital content held on external media in our general collections.
We began by identifying the CD-ROMs and floppy disks residing in our collections, then systematically recalled the media and created disk images of them using BitCurator software and KryoFlux hardware.
After the first 200 CD-ROMs had been imaged, I analyzed them to produce some metrics on what we held, to help us understand their preservation needs, discuss their value and preservation context, report on progress, and explain them more generally. I hoped this would show us what we needed to do to best meet our preservation needs, and open up a discussion on the value of our digital assets. The first 200 successfully imaged CD-ROMs contained:
- 243,714 files
- 152 distinct “formats” identified by DROID (a file format identification tool) and 43 format “versions” amongst those
- 52,361 files whose formats couldn’t be identified
To communicate this volume and complexity to colleagues more familiar with analogue content, I thought it would be useful to convert these numbers to “paper equivalents”. The majority of the 243,714 files were PDFs (in 9 different versions) with an average size of 343KB per file, and a standard archival box holds a maximum of 2,500 sheets of paper.
With a conservative set of conversion assumptions, where each PDF file, if printed, consists of only one page and 2,500 pages fit in a box, the first 200 CD-ROMs hold the equivalent of 98 boxes of printed paper. At that rate, every 1,000 CD-ROMs holds, at a minimum, the equivalent of 487 boxes of printed paper, or approximately 487 shelf-feet at 1 foot per box. Using more realistic (but still potentially conservative) conversion assumptions of 1,500 pages per box and 211KB per page, we get at least 1,321 printed-paper box-equivalents per 1,000 CD-ROMs, or 1,321 shelf-feet of boxes.
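The arithmetic behind these figures can be sketched in a few lines of Python. The input numbers come from the text above; the rounding conventions are my assumption:

```python
# Paper-equivalent conversion sketch, using the figures reported in the text.
FILES = 243_714          # files on the first 200 CD-ROMs
CDS_IMAGED = 200
AVG_FILE_KB = 343        # average PDF size reported

files_per_1000_cds = FILES / CDS_IMAGED * 1000   # ~1.22 million files

# Conservative scenario: every file prints to a single page, 2,500 pages per box.
conservative_boxes = files_per_1000_cds / 2_500

# More realistic scenario: 211 KB per printed page, 1,500 pages per box.
pages = files_per_1000_cds * AVG_FILE_KB / 211
realistic_boxes = pages / 1_500

print(int(conservative_boxes), round(realistic_boxes))  # 487 1321
```

Both scenarios deliberately understate the totals: they count only the imaged files, ignore the 52,361 unidentified files' likely multi-page content, and treat every file as printable.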
Why is this important? This illustrates that, likely without intending to, many libraries have already built large collections of born-digital content. Collections that may well rival their analogue collections in volume, complexity and numbers of objects.
The challenge of scaling preservation
Do you know what you have? Do you know what in those collections is unique? At Yale University Library we have a huge range of born-digital content in our general collections: from companion exercises for textbooks, to conference proceedings that never made it to the internet, to dissertations and notes from professors and researchers, to video games, to copies of government and business datasets that I believe are now unavailable from any other source.
The volume of content digitized elsewhere, acquired digitally or digitized in-house has been growing dramatically over the past few decades. While there has been a huge amount of work and success in building and implementing institutional repositories for managing and sharing digital content, few of these repositories have been built with long-term digital preservation functionality in mind.
Research I led at Archives New Zealand showed that it took about 9 minutes to manually test a comprehensive preservation process on a single file. To do this for all the files in the first 200 CD-ROMs would take one person roughly four years of round-the-clock work. With library budgets under huge pressure throughout the western world, few institutions could employ enough staff to undertake this manually.
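The "four years" estimate follows directly from those two numbers, assuming continuous, uninterrupted work:

```python
# Time estimate behind the manual-processing claim, from the figures in the text.
MINUTES_PER_FILE = 9          # manual preservation test, per file
FILES = 243_714               # files on the first 200 CD-ROMs

total_minutes = FILES * MINUTES_PER_FILE
years_nonstop = total_minutes / (60 * 24 * 365)  # 24/7, no breaks

print(round(years_nonstop, 1))  # 4.2
```

Counted in ordinary working hours rather than around-the-clock time, the same workload would stretch across well over a decade for a single person.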
In other words, manually processing and preserving our digital collections is untenable. But identifying and implementing an effective solution is not necessarily easy. To address this huge challenge, organizations will need systems that can process large volumes of content automatically, preserving and protecting files while also scaling and connecting with other systems as needed.
Until next Digital Preservation Day
A big thanks should go to the Digital Preservation Coalition for all its effort in marking this first annual Digital Preservation Day. The hope is that the newfound awareness of digital preservation created by this annual event will inspire new debate, conversation and understanding around the complexities and challenges faced by digital managers today and in the future.
Find out more about how Yale University Library is using Preservica to maintain over 300 years of history.