by Jon Tilbury

Preserving content from closed systems part 1: Office 365 & Google G‑Suite digital preservation challenges

August 18, 2020

Digital Preservation has always understood that information is complex, but most practical implementations have worked with the concept that information is held in files that can easily be extracted from where they are created and initially consumed, and preserved in isolation from other files. These files could often be accessed separately from the application that maintained them, for example on a file server, and had properties such as last-changed date, size and checksum that could be independently verified.
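Those independently verifiable properties can be captured with a few lines of standard-library Python. This is a minimal sketch of the idea, not tied to any particular preservation system:

```python
import hashlib
import os

def file_fixity(path: str) -> dict:
    """Record the independently verifiable properties of a file:
    size, last-modified time, and a SHA-256 checksum."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    stat = os.stat(path)
    return {
        "size": stat.st_size,
        "last_modified": stat.st_mtime,
        "sha256": h.hexdigest(),
    }
```

Recomputing the checksum later and comparing it to the recorded value is what lets a repository prove the bitstream has not changed since capture.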

As information becomes more complex and more intertwined with the application that creates it, this model is breaking. This presents a challenge to anyone implementing Digital Preservation technology and to those creating trustworthy information management systems. This blog series explores several popular systems to see how this impacts the preservation of the information they contain.

Let’s start by examining the Microsoft Office journey. Initially, information was held in individual files, for example a Word document, Excel spreadsheet or PowerPoint presentation. Each file was a digital asset with its own lifecycle. It could easily be extracted from the file system it was held on, and its format migrated as the versions of Office were updated. So long as the whole chain of files was retained, you had a complete record of that information’s digital preservation lifecycle.

The move to Office 365 meant the files were now held in a cloud-hosted information management system along with other information such as metadata and a version history. However, the information can still be extracted as a file, either manually or via a comprehensive API that enables automated transfer. In fact, the API also allows transfer of the metadata, increasing the quality of the preserved information. The files are held in a familiar fixed folder hierarchy which can be replicated automatically in the digital preservation system.
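As a sketch of what automated transfer looks like, the Microsoft Graph API exposes both an item's metadata and its raw content over HTTPS. The drive ID, item path and token below are placeholders, and obtaining the OAuth bearer token is out of scope here:

```python
import urllib.request

GRAPH = "https://graph.microsoft.com/v1.0"

def item_metadata_url(drive_id: str, item_path: str) -> str:
    # Metadata (name, size, lastModifiedDateTime, ...) for an item,
    # addressed by its path within the drive.
    return f"{GRAPH}/drives/{drive_id}/root:/{item_path}"

def item_content_url(drive_id: str, item_path: str) -> str:
    # The raw file bytes for the same item.
    return f"{GRAPH}/drives/{drive_id}/root:/{item_path}:/content"

def fetch(url: str, token: str) -> bytes:
    # token is an OAuth 2.0 bearer token obtained separately.
    req = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Example with placeholder identifiers (a real run needs a valid token):
# meta = fetch(item_metadata_url("b!abc123", "Reports/annual.docx"), token)
# data = fetch(item_content_url("b!abc123", "Reports/annual.docx"), token)
```

The key point for preservation is that the bytes returned by the content endpoint are the same bitstream the local Office applications read and write, so the export can be verified against the source.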

Of course, there is no guarantee that the information within Office 365 is identical to that delivered by export. The file is an artefact of the transfer process, but the same software can be used to interact with information held inside Office 365 and with information held on the local file system. This gives us a degree of confidence that the information we have exported is a very good, or identical, copy of that inside the system. Also, the file formats are externally published and can be read by third-party tools such as LibreOffice.

Google G-Suite has many similarities. The content (documents, spreadsheets, or presentations) is held within the system and presented via a drive for editing using the system’s own tools. The significant difference is that the applications are only ever online, and the content cannot be edited outside the system. There are no separate G-Suite or third-party editors, and the internal bitstream is not shared with anyone else.

The extraction process converts the object into a format for editing in someone else’s product, for example Microsoft Office, OpenOffice or PDF. Because the digital preservation of these objects relies on conversion and extraction into a separate system, it is difficult to trust that the object you extract is the same as the one inside the system.
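The conversion step above corresponds to the Google Drive API's `files.export` endpoint, which is the only way to get a native Google Doc out of the system. A minimal sketch of building the export request (the file ID is a placeholder):

```python
import urllib.parse

DRIVE = "https://www.googleapis.com/drive/v3"

# Export MIME types accepted by the Drive v3 files.export endpoint
# for a Google Docs document.
EXPORT_FORMATS = {
    "docx": "application/vnd.openxmlformats-officedocument"
            ".wordprocessingml.document",
    "odt": "application/vnd.oasis.opendocument.text",
    "pdf": "application/pdf",
}

def export_url(file_id: str, fmt: str) -> str:
    """Build the files.export URL that converts a native Google Doc
    into an external format. This conversion is exactly where fidelity
    to the internal representation can no longer be proven."""
    mime = EXPORT_FORMATS[fmt]
    return f"{DRIVE}/files/{file_id}/export?mimeType={urllib.parse.quote(mime, safe='')}"
```

Note that every format in the table is a conversion: unlike the Office 365 case, there is no endpoint that returns the internal bitstream itself.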

It must be remembered that users are constantly changing information but can never access the raw data they are maintaining. It is critical to know when a file changed, how it changed and who changed it, and the conversion process makes this difficult. And this brings us to further complications. G-Suite is fast-changing and evolving, and there have been cases of content being changed automatically so that it works with new system features. This makes it very difficult to know whether the content was changed by the system or by an individual. Since proving that content has not changed since a specific date is a key part of the digital preservation process, content created and extracted in this way is difficult to trust.
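The closest the Drive API gets to "when, how and who" is the revisions list, where each revision carries a `modifiedTime` and a `lastModifyingUser`. The sketch below builds that request and applies one illustrative heuristic, assumed here rather than documented anywhere: treating a revision with no recorded modifying user as a possible system-side change.

```python
DRIVE = "https://www.googleapis.com/drive/v3"

def revisions_url(file_id: str) -> str:
    # List a file's revision history, restricted to the fields that
    # matter for provenance: id, modifiedTime and lastModifyingUser.
    return (f"{DRIVE}/files/{file_id}/revisions"
            "?fields=revisions(id,modifiedTime,lastModifyingUser)")

def unattributed_revisions(revisions: list) -> list:
    """Flag revisions that carry no lastModifyingUser -- an assumed
    (not authoritative) signal of a change the system made itself."""
    return [r for r in revisions if not r.get("lastModifyingUser")]
```

Even with this history, the revisions describe the internal object, while preservation only ever receives a converted export, so the audit trail and the preserved bytes never quite meet.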

So the digital preservation of G-Suite is problematic. We cannot independently access the raw data and cannot independently interact with the content. Moreover, the raw content is prone to change, which makes digital proof of authenticity difficult.

It is quite possible that G-Suite provides a vision of the challenges for digital preservation in the future. As content and the application that creates it become more intertwined, especially in mobile apps, we may find many more cases like this to unpick in the future. My next blog will explore how this impacts the way we preserve social media such as Twitter and the complications this presents.

Jon

You can read Part 2 in the 'Preserving Content from Closed Systems' blog series here