Recharacterizing files at scale to align with the latest tools and best practice
Digital Preservation is a continuous and dynamic set of activities. Ensuring that digital records remain available to use over the long term requires continual vigilance, re-evaluation of established policy, and technical understanding within a rapidly evolving digital landscape.
In this blog, David Clipsham, File Format Analyst at Preservica, explores why continuous improvements to the PRONOM registry and DROID identification tool mean that many files characterized in the past now need to be re-characterized in order to ensure they are accurately identified.
Based on David’s detailed analysis, this challenge affects hundreds of file formats and millions of files. It highlights the need for continuous appraisal of preserved content (post-ingest), and for ways to recharacterize automatically at scale, both as a one-off and on an ongoing basis.
Active management and appraisal of your preserved content
Digital Preservation is a continuous and dynamic set of activities. Keeping up with changing tool sets and best practice recommendations can be challenging for organizations of any size, which is why Preservica is beginning to roll out automated active digital preservation for our Cloud and Enterprise edition customers.
Automated active digital preservation draws on extensive community knowledge and best practice to apply your chosen preservation policy settings automatically, so you can stay ahead of at-risk file formats and free up time for other value-added tasks.
Among other actions, automated digital preservation streamlines the recharacterization and migration of assets to the latest accessible formats.
You may be aware that I joined Preservica in May as the File Format Analyst, having previously worked for The National Archives in the UK in a variety of roles. Crucial to my new role, I previously led on research for the PRONOM file format registry that underpins the Preservica format registry and many aspects of digital preservation policy and action planning.
File format characterization
My initial task at Preservica has been to create a list of recommended recharacterization actions that will drive initial automated digital preservation actions. File format characterization is a process that typically happens when digital files are ingested into a preservation system. It involves running tools against a set of files to uncover information about them, commonly in the form of technical metadata.
‘Recharacterization’ is re-running the characterization actions based on current tooling and capabilities, since the tools used at ingest may have improved, new tools may have been introduced, and the data that underpins those tools may have evolved.
The very first step of characterization is typically file format identification, as different characterization and metadata extraction tools specialize in working with specific file format types. For example, a tool such as MediaInfo is brilliant at uncovering technical detail, such as the underlying codecs used to compress the audio and video streams within a media file, but it isn’t designed to extract anything useful from word processor documents. It therefore only really makes sense to run MediaInfo against files that have been identified as formats MediaInfo can work with.
File format identification therefore helps to determine which additional characterization tools should be run against a given file but may also help to guide any other preservation steps, such as file format migration, to ensure a file remains accessible and available to use over time.
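As a rough illustration of identification driving tool selection, the sketch below routes a file to follow-on tools based on its PUID. The routing table and the document tool name are assumptions made for this example, not Preservica's actual configuration; the PUIDs themselves are ones mentioned in this post.

```python
from typing import List, Optional

# Illustrative routing table: PUID -> follow-on characterization tools.
# The tool assignments are assumptions for this sketch only.
TOOLS_BY_PUID = {
    "fmt/337": ["mediainfo"],                    # MJ2 (Motion JPEG 2000), a video format
    "fmt/412": ["document-metadata-extractor"],  # .docx; tool name is hypothetical
    "fmt/40": ["document-metadata-extractor"],   # .doc; tool name is hypothetical
}


def tools_for(puid: Optional[str]) -> List[str]:
    """Return the tools worth running for a PUID, or an empty list if none apply."""
    if puid is None:
        # Unidentified file: nothing can be routed yet.
        return []
    return TOOLS_BY_PUID.get(puid, [])


print(tools_for("fmt/337"))  # ['mediainfo']
print(tools_for("fmt/412"))  # ['document-metadata-extractor']
print(tools_for(None))       # []
```

If a file's PUID changes after recharacterization, a lookup like this would return a different set of tools, which is one reason an outdated identification matters.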
The role of DROID and PRONOM
Preservica uses the tool DROID to perform the initial file format identification task, and DROID relies on data from the PRONOM file format registry. Both DROID and PRONOM were originally developed by Preservica’s former parent company, Tessella, in partnership with The National Archives in the UK, who continue to maintain the tools. I won’t go into too much detail about the intricacies of how exactly DROID and PRONOM work as other blog posts have explored this in depth, for example: PRONOM: A database centenary — The National Archives blog.
PRONOM’s underlying data about file formats is the result of a collective, international research effort involving digital preservation practitioners from dozens of institutions, including Preservica. Each file format has an identifier, the PRONOM Unique Identifier, commonly referred to as the PUID, and each PUID has its own webpage of information.
For example, the .docx format produced by Microsoft Word since 2007 has an identifier of fmt/412, which is distinct from the .doc format produced by versions of Microsoft Word 97 – 2003, which has its own PUID of fmt/40. For the oldest versions of Microsoft Word, the formats were often subtly different from one another depending on which operating system the software was designed for, so PRONOM includes entries for formats like ‘Microsoft Word for MS-DOS Document 1.x‑4’ and ‘Microsoft Word for Windows Document 1.’
Usually, PRONOM research focuses on the specificities of a file that allow it to be accurately identified by a file format identification tool such as DROID. A researcher will examine the underlying byte code (the ones and zeros that make up any digital file) and will aim to determine a consistent pattern of data (similar to a fingerprint) within a given format, that can be reliably used to identify instances of that format.
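To make the idea of a byte-level "fingerprint" concrete, here is a minimal sketch of matching a fixed byte sequence at a fixed offset. It is a simplification and not DROID's actual matching engine; the `Signature` class and `matches` function are names invented for this example. The byte sequence shown is the 12-byte signature box that opens files in the JPEG 2000 family.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    name: str
    offset: int     # position counted from the beginning of the file
    pattern: bytes  # byte sequence expected at that position


# The 12-byte signature box at the start of JPEG 2000 family files.
JPEG2000_FAMILY = Signature(
    name="JPEG 2000 family",
    offset=0,
    pattern=bytes.fromhex("0000000C6A5020200D0A870A"),
)


def matches(data: bytes, sig: Signature) -> bool:
    """Return True if the signature's pattern appears at its offset in data."""
    end = sig.offset + len(sig.pattern)
    return data[sig.offset:end] == sig.pattern


sample = bytes.fromhex("0000000C6A5020200D0A870A") + b"\x00" * 8
print(matches(sample, JPEG2000_FAMILY))       # True
print(matches(b"%PDF-1.7", JPEG2000_FAMILY))  # False
```

Real signatures can be more involved than a single fixed sequence, but the principle is the same: a consistent pattern at a predictable place.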
But PRONOM data is not static, and instead is continually iterated upon. Sometimes the collective understanding of a particular format will improve, or the software vendors responsible for a format will release an updated version of a format. Occasionally mistakes or assumptions are made that require correction.
When changes to the underlying data occur, it can be crucial to recharacterize files to ensure their identification is up to date with current understanding, so that any subsequent actions, such as those driven by a file format migration policy, can take place.
But we do not want to recharacterize files inefficiently. It would be wasteful to attempt to recharacterize every file that has previously been ingested every time PRONOM is updated just in case something has changed. Instead, we should only recharacterize those files that have a reasonable chance of encountering a different identification outcome, and it is here where my recommendations will be of use.
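One way to picture this kind of targeted selection is as a small set of rules checked against each file's stored identification. The sketch below is hypothetical: the `Rule` and `FileRecord` structures, the field names, and the sample paths are assumptions for illustration, not the format of the published recommendations. The two rules use PUIDs and PRONOM versions from the examples later in this post.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    puid: str
    # Files identified with a registry version older than this are affected.
    # None means every file holding this PUID is affected.
    fixed_in: Optional[int]


@dataclass(frozen=True)
class FileRecord:
    path: str
    puid: str
    signature_version: int  # PRONOM version used at last identification


RULES = [
    Rule("x-fmt/392", 62),    # JPEG 2000 entry split into four PUIDs in v62
    Rule("x-fmt/403", None),  # deprecated entry: always recharacterize
]


def needs_recharacterization(record: FileRecord, rules: List[Rule] = RULES) -> bool:
    """Return True if any rule says this file may now identify differently."""
    for rule in rules:
        if record.puid != rule.puid:
            continue
        if rule.fixed_in is None or record.signature_version < rule.fixed_in:
            return True
    return False


records = [
    FileRecord("scan-001.jp2", "x-fmt/392", 45),  # identified before v62
    FileRecord("scan-002.jp2", "x-fmt/392", 95),  # identified after v62
    FileRecord("letter.sdw", "x-fmt/403", 70),    # deprecated PUID
    FileRecord("report.docx", "fmt/412", 45),     # no rule applies
]
print([r.path for r in records if needs_recharacterization(r)])
# ['scan-001.jp2', 'letter.sdw']
```

Only two of the four files are selected, which is the point: the rest are left alone because a newer registry would not change their outcome.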
Recommended recharacterization actions
I have been conducting thorough historical analyses of all PRONOM data since the registry was first made public back in 2005 in order to build a knowledge base of the evolution of PRONOM’s format data and to understand when any changes might warrant a recharacterization action.
I have published all of the recommendations made so far on a GitHub wiki: automated-preservation-recommendations Wiki. This will be a living list, so whenever a new PRONOM update is released, I will assess the changes and publish any new recommendations. The first release already shows approximately 500 formats that need recharacterizing.
For Preservica, these recommended recharacterization actions will be incorporated into our active digital preservation software to enable files to be continuously and automatically recharacterized at scale (post ingest). All customers who subscribe to our recommendations will have their files recharacterized automatically in the background and can rest in the knowledge that their preservation plans are based on current best practice.
Further recommended actions, such as additional and updated migration pathways, will follow.
Here are three illustrative examples:
JPEG 2000 formats
The PRONOM entry for JPEG 2000 (an image and video format often used for digitization) was previously a very general listing that encompassed all subtypes of the JPEG 2000 file format family, but in PRONOM update v62 on 28 August 2012, this was separated into four PUIDs:
- x‑fmt/392 — JP2 (JPEG 2000 part 1) – this is the main image format
- fmt/151 — JPX (JPEG 2000 part 2) – this is an extended version of the image format
- fmt/463 — JPM (JPEG 2000 part 6) – this is a compound, or ‘layered’ image format
- fmt/337 — MJ2 (Motion JPEG 2000) – this is a video format
This means that any x‑fmt/392 files most recently identified using a version of the registry older than the v62 PRONOM update should be recharacterized, so that each receives its specific identification outcome.
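For a sense of how the four subtypes can be told apart, the sketch below reads the "brand" field of the File Type box, which the JPEG 2000 standards place immediately after the 12-byte signature box. This illustrates the file structure only; it is an assumption that this resembles the logic in PRONOM's signatures, which may be defined differently. The function names are invented for this example, and the PUIDs are those listed above.

```python
from typing import Optional

SIGNATURE_BOX = bytes.fromhex("0000000C6A5020200D0A870A")

# Brand values from the JPEG 2000 family of standards, mapped to the PUIDs above.
PUID_BY_BRAND = {
    b"jp2 ": "x-fmt/392",  # JP2 (part 1)
    b"jpx ": "fmt/151",    # JPX (part 2)
    b"jpm ": "fmt/463",    # JPM (part 6)
    b"mjp2": "fmt/337",    # MJ2 (Motion JPEG 2000)
}


def jpeg2000_puid(header: bytes) -> Optional[str]:
    """Guess a PUID from the first 24 bytes of a JPEG 2000 family file."""
    if not header.startswith(SIGNATURE_BOX):
        return None
    # Bytes 12-15 hold the File Type box length, bytes 16-19 its type.
    if header[16:20] != b"ftyp":
        return None
    return PUID_BY_BRAND.get(header[20:24])


def make_header(brand: bytes) -> bytes:
    """Build a minimal synthetic header for demonstration purposes."""
    ftyp = (20).to_bytes(4, "big") + b"ftyp" + brand + (0).to_bytes(4, "big") + brand
    return SIGNATURE_BOX + ftyp


print(jpeg2000_puid(make_header(b"jp2 ")))  # x-fmt/392
print(jpeg2000_puid(make_header(b"mjp2")))  # fmt/337
print(jpeg2000_puid(b"not a jpeg 2000 file"))  # None
```

Under the older, general entry, all four of these synthetic headers would have shared a single identification.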
The Video Object Format (VOB) is a format based upon the MPEG‑2 video format and is most commonly used for DVD video. VOB is a subset of MPEG‑2, which means it contains specific features that are unique to VOB, over and above the base MPEG‑2 standard. We handle subsets within PRONOM by giving the more specific format priority over the less specific format. If we did not set a priority, a VOB file would identify as both VOB and MPEG‑2.
Unfortunately, when VOB was introduced, the priority was added the wrong way round, meaning any VOB files would only identify as MPEG‑2 and not the intended outcome. This means that any MPEG‑2 files that were identified based on PRONOM data older than the v75 update may benefit from recharacterization, as these may actually be VOB files.
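The effect of a priority, and of setting it the wrong way round, can be sketched in a few lines. This is a simplified model rather than DROID's implementation; the `resolve` function is a name invented for this example, and the format labels stand in for the real PUIDs.

```python
from typing import Iterable, List, Set, Tuple


def resolve(matched: Set[str], priorities: Iterable[Tuple[str, str]]) -> List[str]:
    """Drop any matched format that is outranked by another matched format.

    Each priority is a (preferred, suppressed) pair and only takes effect
    when both formats matched the file.
    """
    suppressed = {
        low for high, low in priorities if high in matched and low in matched
    }
    return sorted(matched - suppressed)


# A VOB file matches both signatures, because VOB builds on MPEG-2.
matched = {"VOB", "MPEG-2"}

print(resolve(matched, []))                   # ['MPEG-2', 'VOB']
print(resolve(matched, [("VOB", "MPEG-2")]))  # ['VOB']
print(resolve(matched, [("MPEG-2", "VOB")]))  # ['MPEG-2']
```

The last line reproduces the mistake described above: with the pair reversed, the more specific result is discarded and only MPEG-2 is reported.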
The PRONOM entry x‑fmt/403 for StarOffice Writer 5.2 was deprecated in PRONOM update v83 on 17 December 2015. Deprecation is effectively a marker to say ‘do not use this entry anymore’ but because PRONOM entries are supposed to be persistent, they should never be fully deleted. All files ingested that have a PUID of x‑fmt/403 should be recharacterized. These will most likely receive a new identification outcome of x‑fmt/400 — StarOffice Writer 5.x.
iPres 2022 – 12 – 15 September, Glasgow
I will be demonstrating some of the more interesting aspects of PRONOM’s evolving history during a lightning talk at iPres 2022 on Tuesday 13th September during the 4pm‑5:30pm session.
Automated active digital preservation, as well as other Preservica functionality, will also be demonstrated at iPres during the ‘Bake Off’ sessions throughout Wednesday 14th September.
I hope this serves as a useful introduction to me, my role, and Preservica’s automated active digital preservation.
I’m especially keen to help Preservica customers with their interesting file format challenges, so if you find that parts of your ingested collections are, for example, currently unrecognized, then please do get in touch via the automated preservation Group on the Preservica Community Hub.