Preserving content from closed systems part 3: Lotus Notes De-Commissioning and Preservation
In this blog series I have been exploring how the combination of applications and data management into a single platform makes the trustworthy extraction of information for preservation more complex. The preservation of legacy information from a Lotus Notes database, often during an application de-commissioning process, is one of the most complex examples, and compromises have to be made to manage the process smoothly.
A Lotus Notes database is in some ways similar to the Office 365 example discussed in an earlier blog. The good news is that Notes has a programmer interface and there is a range of tools that can extract the data from the database. However, Notes introduces a lot of complexities that make the preservation of its information harder and inherently lossy.
One approach often explored with Lotus Notes is to use an emulation framework to run a Domino server and Notes client, allowing users to interact with the data as if they were using the original system. This recreates the original user experience and is completely lossless, but it presents several challenges. First, the original user experience was unpopular and the skills to use the application will quickly diminish. Second, the original system lacked some functionality, for example the ability to search over multiple databases. Third, the attachments within the data may be subject to format obsolescence if they stay inside the original database. Fourth, Notes applications were often licenced, so re-running the old application can be very expensive. And finally, Notes security is practically uncrackable, so if the admin passwords or certificates are lost the data is lost with them.
So if we decide we need to migrate to a digital preservation system, we need to understand what is in a Notes database. At its heart it combines metadata fields with formatted information held in a rich text format, which may also contain attached files such as office documents or images. The Notes hierarchy is abstract: it is created on the fly from the metadata fields. This is very flexible when creating applications but does not map well onto most digital preservation systems, which have fixed hierarchies. Most extraction and transfer exercises have to pick one of the hierarchies (views) and map the records onto that.
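As a sketch of what "picking one view" means in practice, the categorisation columns of a single chosen view can be used to build a fixed folder path for each record. All field, view and record names below are hypothetical, and a real extraction would use the Notes API or a third-party tool rather than plain dictionaries:

```python
# Hypothetical sketch: flattening one Notes view into a fixed hierarchy.
# Field names and values are illustrative, not from a real database.

def map_record_to_path(record: dict, view_columns: list[str]) -> str:
    """Build a fixed folder path for a record from one view's
    categorisation columns, e.g. ["Department", "Year"]."""
    parts = [str(record.get(col, "Uncategorised")) for col in view_columns]
    return "/".join(parts)

record = {
    "Subject": "Q3 project review",
    "Department": "Engineering",
    "Year": 2001,
    "Body": "<rich text>",          # formatted content, rendered separately
    "Attachments": ["budget.xls"],  # embedded files, extracted separately
}

# Choosing a hypothetical "By Department" view as the single hierarchy:
path = map_record_to_path(record, ["Department", "Year"])
print(path)  # Engineering/2001
```

Records that lack a value for one of the chosen columns fall into a catch-all folder, which is one of the compromises this approach forces: every other view's organisation of the same records is lost.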
The good news is that, using third-party software, the record metadata fields can be extracted and mapped easily onto metadata in the preservation system to create indexed fields for faceted and fielded search. This can be even better than in Notes itself, and it is certainly a chance to revisit the indexing rules and improve them in the light of experience. So far so good.
The main challenge is to transfer the rich text in a form that appears the same as the original Notes record but can be preserved independently. The first step is to create a record header showing the metadata fields as they appeared in the original system. The second step is to extract the rich text, along with the metadata header, into a format that looks the same but can be read outside Notes. This can be HTML but is more usually PDF, which does a good, though not perfect, job of replicating the look of the original record.
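The two steps above can be sketched as a simple rendition function: prepend a metadata header to the record body and emit a self-contained document readable outside Notes. This is a minimal illustration only, assuming the rich text has already been converted to HTML by an extraction tool (real tools typically render straight to PDF):

```python
# Minimal sketch of the two rendition steps: a metadata header table
# followed by the record body, in one self-contained HTML document.
# Assumes the Notes rich text has already been converted to HTML.

def render_record(fields: dict, body_html: str) -> str:
    header_rows = "".join(
        f"<tr><th>{name}</th><td>{value}</td></tr>"
        for name, value in fields.items()
    )
    return (
        "<html><body>"
        f"<table class='metadata-header'>{header_rows}</table>"
        f"<div class='body'>{body_html}</div>"
        "</body></html>"
    )

html = render_record(
    {"Subject": "Q3 project review", "Author": "J. Smith"},
    "<p>Minutes of the review meeting…</p>",
)
assert "<th>Subject</th>" in html and "<td>J. Smith</td>" in html
```

The same header-plus-body structure is what a PDF rendition reproduces, just in a fixed-layout format better suited to long-term preservation.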
The next challenge is the attachments. These are usually embedded inside the Notes record and most extraction tools leave them as embedded files within the PDF document that is created. For digital preservation purposes they need to be extracted as separate files so they can have their own Preservation Actions applied. However, they must be permanently associated with the PDF and metadata, bringing us to the concept of the “multi-part asset”. As discussed in the last blog, this was introduced in Preservica v6 and combines all the metadata and files associated with a single piece of information into an atomic asset that must be handled as a whole.
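A hedged sketch of the multi-part asset idea: the PDF rendition, the extracted attachments and the descriptive metadata are grouped into one unit that is always handled as a whole. The structure below is purely illustrative; Preservica's actual asset model is richer than this:

```python
# Illustrative sketch of a "multi-part asset": one atomic unit combining
# the rendition, the extracted attachments and the record metadata.
# Names and structure are hypothetical, not Preservica's real data model.

from dataclasses import dataclass, field

@dataclass
class MultiPartAsset:
    metadata: dict                  # indexed fields from the Notes record
    rendition: str                  # the PDF replicating the record's look
    attachments: list = field(default_factory=list)  # extracted embedded files

    def parts(self) -> list:
        """All files that must be preserved and moved together."""
        return [self.rendition, *self.attachments]

asset = MultiPartAsset(
    metadata={"Subject": "Q3 project review"},
    rendition="record_0001.pdf",
    attachments=["budget.xls", "site-photo.jpg"],
)
print(asset.parts())  # ['record_0001.pdf', 'budget.xls', 'site-photo.jpg']
```

Extracting the attachments as separate parts is what allows each one to receive its own Preservation Actions (for example, migrating an obsolete spreadsheet format) without breaking its link to the record it came from.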
Preservica is familiar with this process from when it was created as an independent company. The information from all our projects within Tessella, going back to the last millennium, was held in several large Lotus Notes databases. With the help of specialist AD7 we extracted all the records and loaded them into a Preservica database reserved for internal use. More recently we worked with AD7 again to transfer 50 million records from 800 Notes databases in a UK Government Department into Preservica in the cloud.
The extraction process from Notes is much lossier than taking data from Office 365 or Twitter. It can, however, be automated, and records can be transferred at huge scale, which allows the Lotus Notes records to be searched, explored and consumed long after Lotus Notes is no longer in widespread use.
This blog series has aimed to show that the traditional view of a single piece of information in a file on a computer is breaking down. Information is getting more complex and is becoming entwined with the systems used to create and distribute it. As this is likely to be the norm in future, digital preservation has to evolve to keep up, adopting sophisticated data acquisition processes to ensure that information can be extracted, preserved and used in a way that future generations of information consumers can trust. The Innovation Team at Preservica is working on many of these challenges, and solutions are now being released into the product.
You can hear more about the successful decommissioning and transfer of 50 million records from 800 Notes databases in a UK Government Department into Preservica in the cloud at #WeMissiPRES on Day 1, Tuesday September 22nd, from 12.55pm CEST.
If you missed the previous blogs in this series you can catch up on them using the links below: