by David Clipsham

Updating Preservica following a PRONOM release

December 19, 2023

November saw the release of the latest PRONOM update, so I thought this would be a great moment to describe how I assess a PRONOM release to understand its potential impact on existing Preservica processes, on our automated Active Digital Preservation features and on opportunities to enhance Preservica's capabilities.

What is PRONOM?

PRONOM is a registry of file format information curated by The National Archives in the United Kingdom. It currently contains data on over 2,300 file formats and provides mechanisms for accurately identifying most of the file formats it details.

PRONOM's data underpins the file format identification capabilities of tools such as DROID and Siegfried, and consequently plays a critical part in the application of preservation policy within Preservica.

PRONOM benefits from international contribution and Preservica is one of its most active contributors.

First steps

I begin with a thorough review of the PRONOM release notes. I check for any notable changes, for example any new formats that I expect will be important to Preservica users, or any additions that are likely to be supportable with our existing toolkit.

I also seek samples of any new formats added. Many contributions to PRONOM are submitted publicly through the PRONOM Research GitHub repository and some submissions include examples of formats that can be used for testing purposes.

Format tests

Within Preservica we maintain a collection of files, gathered over many years for testing purposes, that represents hundreds of file formats. My next step is to re-identify this collection, which tells me whether the identification of anything within our known data set has changed.

This is useful because it shows me whether any existing business rules, such as those used for migrating formats or rendering them within Preservica, may be affected by the updated format identification, so that any such rules can be updated accordingly.
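As a rough illustration of this kind of before-and-after comparison (not Preservica's actual tooling), the sketch below compares two sets of identification results and lists the files whose format identifier (PUID) has changed. The two-column CSV layout, the file paths and the PUID values are all made up for the example; a real identification report would need mapping into this shape first.

```python
import csv
import io


def load_ids(csv_text):
    """Map each file path to the PUID recorded for it (hypothetical CSV layout)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["path"]: row["puid"] for row in reader}


def changed_identifications(before, after):
    """Return {path: (old_puid, new_puid)} for files whose PUID differs."""
    return {
        path: (before.get(path), after.get(path))
        for path in sorted(set(before) | set(after))
        if before.get(path) != after.get(path)
    }


# Illustrative data only: the paths and PUIDs are placeholders.
OLD = """path,puid
samples/report.abc,x-fmt/9001
samples/scan.def,x-fmt/9002
samples/model.xyz,
"""
NEW = """path,puid
samples/report.abc,x-fmt/9001
samples/scan.def,x-fmt/9002
samples/model.xyz,x-fmt/9003
"""

changes = changed_identifications(load_ids(OLD), load_ids(NEW))
for path, (old, new) in changes.items():
    print(f"{path}: {old or 'unidentified'} -> {new or 'unidentified'}")
```

Any path that appears in the output is a candidate for checking against the business rules that mention its old or new identifier.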

Detailed analysis

I then run a set of analyses over the updated PRONOM data, as represented in the identification 'signature files' produced for processing through the DROID file format identification tool.

I methodically check for changes between the previous and new PRONOM releases with the assistance of a set of XSLT transformation scenarios.

XSLT is a technology for transforming XML data, such as the data created by PRONOM. In my case, I use it to filter PRONOM's data by name, extension, 'priorities' (the relationships between file formats), and identification signatures.

This allows me to easily see the exact changes to the data between the previous and new PRONOM update. I have details of all updates to PRONOM data going back to 2006.
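The comparison described here is done with XSLT. Purely as an illustration of the same idea in a different technology, the Python sketch below diffs two signature files by name, version, extension, signature reference and priority. It assumes the DROID signature file layout as I understand it: `FileFormat` elements carrying `Name`, `PUID` and `Version` attributes, with `Extension`, `InternalSignatureID` and `HasPriorityOverFileFormatID` children, in the namespace shown. The sample XML and its PUIDs are invented, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for DROID signature files; verify against a real file.
NS = {"sf": "http://www.nationalarchives.gov.uk/pronom/SignatureFile"}


def load_formats(xml_text):
    """Index FileFormat elements by PUID, keeping only the fields of interest."""
    root = ET.fromstring(xml_text)
    formats = {}
    for ff in root.iterfind(".//sf:FileFormat", NS):
        formats[ff.get("PUID")] = {
            "name": ff.get("Name"),
            "version": ff.get("Version"),
            "extensions": sorted(e.text for e in ff.iterfind("sf:Extension", NS)),
            "signatures": sorted(
                s.text for s in ff.iterfind("sf:InternalSignatureID", NS)
            ),
            "priority_over": sorted(
                p.text for p in ff.iterfind("sf:HasPriorityOverFileFormatID", NS)
            ),
        }
    return formats


def diff_formats(old, new):
    """Return (added PUIDs, removed PUIDs, {puid: {field: (old, new)}})."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = {}
    for puid in sorted(set(old) & set(new)):
        fields = {
            key: (old[puid][key], new[puid][key])
            for key in old[puid]
            if old[puid][key] != new[puid][key]
        }
        if fields:
            changed[puid] = fields
    return added, removed, changed


# Invented sample data standing in for two consecutive releases.
OLD_XML = """<FFSignatureFile xmlns="http://www.nationalarchives.gov.uk/pronom/SignatureFile">
  <FileFormatCollection>
    <FileFormat ID="1" Name="Example Format" PUID="x-fmt/9001" Version="1.0">
      <Extension>exa</Extension>
    </FileFormat>
  </FileFormatCollection>
</FFSignatureFile>"""

NEW_XML = """<FFSignatureFile xmlns="http://www.nationalarchives.gov.uk/pronom/SignatureFile">
  <FileFormatCollection>
    <FileFormat ID="1" Name="Example Format" PUID="x-fmt/9001" Version="1.0">
      <InternalSignatureID>10</InternalSignatureID>
      <Extension>exa</Extension>
      <HasPriorityOverFileFormatID>2</HasPriorityOverFileFormatID>
    </FileFormat>
    <FileFormat ID="2" Name="Example Container" PUID="x-fmt/9002">
      <Extension>exc</Extension>
    </FileFormat>
  </FileFormatCollection>
</FFSignatureFile>"""

added, removed, changed = diff_formats(load_formats(OLD_XML), load_formats(NEW_XML))
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

Note that this sketch only compares which signature IDs each format references. Detecting edits to the byte sequences inside a signature would need a similar comparison over the signature definitions themselves.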

This approach also helps to spot any issues with PRONOM's data itself, such as mis-assigned priorities, and enables me to feed back any such problems directly and promptly to the PRONOM team for correction.

Re-identification recommendations

I create a set of recommendations for file format re-identification. There are certain changes to PRONOM's data which prompt a recommendation to recharacterize specific file format instances. These include updates to format names or version numbers, the introduction or removal of a file format identification signature, the introduction of new format priority relationships, and updates to the identification signatures themselves.
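The triggers listed above can be thought of as a small rule table. The sketch below is one hypothetical way to express that, not a description of how Preservica implements it: the change-kind labels, the `recommend` function and the sample PUIDs are all invented. Extension-only changes are treated as non-triggering here simply because they are not among the triggers listed, which is an assumption of the example.

```python
# Change kinds that prompt a re-identification recommendation, with a
# human-readable reason for each. The labels are invented for this sketch.
TRIGGERS = {
    "name": "format name updated",
    "version": "version number updated",
    "signature_added": "identification signature introduced",
    "signature_removed": "identification signature removed",
    "priority_added": "new priority relationship introduced",
    "signature_changed": "identification signature updated",
}


def recommend(changes):
    """Given {puid: set of change kinds}, return {puid: [reasons]}.

    Formats whose changes include no triggering kind are left out.
    """
    recommendations = {}
    for puid, kinds in changes.items():
        reasons = [reason for kind, reason in TRIGGERS.items() if kind in kinds]
        if reasons:
            recommendations[puid] = reasons
    return recommendations


# Invented example: one format gains a signature, another only an extension.
release_changes = {
    "x-fmt/9001": {"signature_added", "extension_added"},
    "x-fmt/9002": {"extension_added"},
}

recs = recommend(release_changes)
for puid, reasons in recs.items():
    print(f"re-identify {puid}: {', '.join(reasons)}")
```

Keeping the reasons alongside each recommendation makes it easier to explain later why a given set of files was re-identified.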

Once created, these 'recommended processes' are applied to Preservica systems through our automated Active Digital Preservation technology, ensuring Preservica customers benefit from the most accurate file format identification available.

Tools and business rules

Finally, I update the Preservica Registry, ensuring any new file formats that are compatible with existing tools are correctly wired up to work with them. This means that Preservica users can take advantage of the latest format rendering capabilities, file format migration pathways, property extraction, and format validation.