Keeping PAR on course: creating a generic protocol for the community
The Preservation Action Registries (PAR) is a protocol that allows Digital Preservation practitioners, researchers and vendors to share information about preservation tools, actions, business rules and policies. Currently at the “Proof of Concept” stage, it is backed by vendors Preservica, Artefactual and Arkivum, practitioners JISC and digital preservation technology stewards OPF, and is looking for more community involvement. In this technical blog, Jack O’Sullivan explains how Preservica is already using PAR to support radical changes in their product, and how the lessons learned are helping to reshape PAR into a generic protocol useful for all.
Gaining momentum with PAR
Gaining momentum to make deep-rooted changes to the way things are done in a software application is difficult. Most of the time, “that’s just the way it works” and “it does what we need” are strong enough arguments against the inherent risk of significant change that the status quo is retained. Just occasionally, however, events align to create a brief window of possibility, where big changes are tolerable, necessary or maybe even inevitable. That window opened up at Preservica last summer, and so we used it to see what the PAR model could do, even in its infancy.
We had already committed to making changes to Preservica’s underlying data model, and as that work proceeded it became clear that a lot of the processes in the application, including the framework for performing characterisation and migrations, would need at least some level of work. At around the same time, having successfully created a proof-of-concept endpoint for the nascent PAR API, based on Preservica’s existing Registry technology, we were starting to turn our attention to how the PAR Business Rules and Preservation Actions should be mapped to the existing Migration Pathways and Tool priorities used in that same framework.
Learning by doing
As these two events converged, I realised that here was an opportunity to skip the mapping exercise and re-align Preservica around something that looked much more like the PAR model. As well as offering an opportunity to simplify the way Preservica’s behaviour is controlled, it would also give the PAR model some much needed stress testing.
Learning by doing is one of the best ways to uncover where a developing model works or doesn’t, where important but easy-to-miss details have been left out, and where ambiguities might make it difficult to use. Implementing a client to execute Preservation Actions based on PAR’s Business Rules certainly helped illuminate elements of the model and API that have not yet been fully fleshed out.
The discussions around the model and API had always centred on the idea that a PAR endpoint should be able to list entities, subject to date filtering. The fundamental question “which business rules have changed since I last read from this endpoint” was always at the core of the design. Our initial API specification included this feature, providing date filters on the endpoints. However, it soon became clear that useful filters would extend far beyond that: for example, being able to ask for a list of Business Rules filtered by the file format they can be applied to, and then again by the type of Action they implement. This has led to a proposed extension where the API endpoint for a given entity should be filterable by any other PAR entity that is relevant, so Business Rules can be filtered by formats and types, Preservation Actions can be filtered by type and tool, and so on.
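To make the filtering idea concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not the PAR API or Preservica's implementation: the `BusinessRule` fields, the `filter_rules` function, and the format and action type labels are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class BusinessRule:
    """Hypothetical, simplified stand-in for a PAR Business Rule."""
    rule_id: str
    formats: FrozenSet[str]   # placeholder format labels, not real identifiers
    action_type: str          # placeholder action type label
    modified: datetime


def filter_rules(
    rules: Iterable[BusinessRule],
    modified_after: Optional[datetime] = None,
    file_format: Optional[str] = None,
    action_type: Optional[str] = None,
) -> List[BusinessRule]:
    """Return the rules matching every filter that was supplied."""
    matches = []
    for rule in rules:
        # The original date filter: "what has changed since I last read?"
        if modified_after is not None and rule.modified <= modified_after:
            continue
        # The proposed extension: filter by related entities.
        if file_format is not None and file_format not in rule.formats:
            continue
        if action_type is not None and rule.action_type != action_type:
            continue
        matches.append(rule)
    return matches


# Made-up sample data for the example.
RULES = [
    BusinessRule("rule-1", frozenset({"wave"}), "characterise", datetime(2019, 1, 10)),
    BusinessRule("rule-2", frozenset({"wave", "aiff"}), "migrate", datetime(2019, 3, 5)),
    BusinessRule("rule-3", frozenset({"tiff"}), "migrate", datetime(2019, 6, 20)),
]

wave_migrations = filter_rules(RULES, file_format="wave", action_type="migrate")
print([rule.rule_id for rule in wave_migrations])
```

Filters left unset are simply ignored, so the same function answers both the date-only question and the combined format and type question.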
In the original model, the PAR Core entities did not include a “Format Family”, even though we had defined such a concept within the Business Rule. Being able to group related formats proved so widely useful that, in implementation, we decided to provide a top-level endpoint for these entities, effectively making them “Core”.
Executing Preservation Actions
In executing the Preservation Actions, and specifically the characterisation actions where reading the result of running a tool is required, we realised that the model asserts that the client must be told how to interpret the tool’s output, without offering any clear instructions on what that specification should look like. Fortunately, for now, all the tools we use produce XML output, so all we need to do is ensure our executor knows to interpret the instruction as an XPath expression. In future this will require some form of controlled vocabulary, which should probably be suggested, recommended or defined within the scope of PAR.
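The sketch below illustrates the idea of an executor that treats the interpretation instruction as an XPath expression over XML tool output. It is not Preservica's executor: the `read_property` function, the `interpretation` parameter and the sample XML are assumptions made up for the example, and Python's standard `xml.etree.ElementTree` supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Made-up XML standing in for the output of a characterisation tool.
SAMPLE_OUTPUT = """<toolOutput>
  <property name="sampleRate"><value>44100</value></property>
  <property name="channels"><value>2</value></property>
</toolOutput>"""


def read_property(tool_output, expression, interpretation="xpath"):
    """Extract one value from tool output.

    `interpretation` is a hypothetical stand-in for the controlled
    vocabulary discussed above; only "xpath" is understood here.
    """
    if interpretation != "xpath":
        raise ValueError("unsupported interpretation: " + interpretation)
    root = ET.fromstring(tool_output)
    # findtext returns the text of the first match, or None if nothing matches.
    return root.findtext(expression)


print(read_property(SAMPLE_OUTPUT, ".//property[@name='sampleRate']/value"))
```

Rejecting unknown interpretations explicitly, rather than guessing, is one way a client could stay safe once tools with non-XML output enter the picture.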
We also came up against the inevitable question of how to describe a situation where multiple steps should happen as part of the same overall action. For us, that occurred when trying to describe how to extract properties from WAVE Audio files. We knew that there were some properties we wanted to extract using JHOVE, but also knew that MediaInfo would give us some others that we wanted. The PAR model doesn’t yet have a definitive view on how to string together such actions. In the end, we decided we would describe our intention in a single Business Rule (Perform Property Extraction of WAVE files), listing multiple Preservation Actions that have equal priority as a means of accomplishing that. Our client interprets this specific condition not as an error, but as a means of indicating that it should perform multiple actions. Thankfully at this point, the order in which the actions occur is not important to us, but eventually we will hit the case where it is!
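A client-side reading of that convention might look like the following Python sketch. The `PrioritisedAction` type and `actions_to_run` function are hypothetical, the action identifiers are made up, and the assumption that a lower number means a higher priority is ours for the example rather than something the PAR model states.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PrioritisedAction:
    """Hypothetical pairing of a Preservation Action with its priority."""
    action_id: str
    priority: int  # assumption: lower number means higher priority


def actions_to_run(actions: List[PrioritisedAction]) -> List[PrioritisedAction]:
    """Return every action that shares the top priority.

    Several actions tied at the top are treated as "run all of these",
    not as a conflict. The order in which tied actions run is not defined.
    """
    if not actions:
        return []
    best = min(action.priority for action in actions)
    return [action for action in actions if action.priority == best]


# Made-up rule: extract properties from WAVE files with two tools.
wave_rule_actions = [
    PrioritisedAction("jhove-wave-properties", 1),
    PrioritisedAction("mediainfo-wave-properties", 1),
    PrioritisedAction("fallback-properties", 2),
]

print([action.action_id for action in actions_to_run(wave_rule_actions)])
```

With a single top-priority action the function returns a one-element list, so the ordinary case and the tie case go through the same code path.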
What is perhaps most surprising is how (relatively) minor the changes and extensions that we needed to make actually were. For the most part, specifying the intended behaviour of Preservica’s characterisation and migration frameworks in terms of PAR Business Rules, and specifying the implementation in terms of Preservation Actions, was straightforward. There are still gaps that we know, both as contributors to the PAR model and as developers of a digital preservation system, need to be plugged. In addition to the need to perform multiple actions in a set order, we will sometimes need to specify actions that require multiple input files, or describe a Business Rule that should apply to a complex object that isn’t just specified by a single file format.
Within the PAR team, we don’t want to let the perfect be the enemy of the good. Publishing concrete models and APIs, even where we know there is still work to be done and kinks to be ironed out, offers the best possibility of allowing users to start discovering what benefits might be possible, and what work should be prioritised to start making them tangible.