Preserving content from closed systems part 2: Digital preservation of social media

In my last blog I explored how online content management systems are making it harder to preserve and trust digital content by wrapping its creation and storage into a single platform. Nowhere is this more obvious than in social media platforms, where there is no direct access to the content they publish. In this blog I will use Twitter as an example of how this information can be preserved and how well-supported APIs make this easier.

Tweets are now becoming a major part of the world's information dialogue, and retaining the information they contain is critical to the completeness of the historical record. Yet they are held internally in databases of unknown structure, and the information is presented on the fly in different ways depending on the application used. How do we preserve this with any degree of trust?

Many people try to archive tweets by taking snapshots of the website for a specific account, but this is inherently lossy – the tweet is presented differently on different platforms and not all the information is revealed. Also, the links between tweets, for example replies or retweets, are difficult to represent in this way. There must be a better way to get the information.


Well, the good news is that Twitter has a very comprehensive API that allows you to extract everything about a tweet in JSON format, including the text and plenty of metadata – everything you could possibly need for preservation. This goes into much more detail than what you see on screen and gives you great confidence that it is complete. It is also straightforward to build a render tool that replicates the tweet viewer found online or in the mobile application, so the consumer has the same experience as a current user of Twitter.
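As a sketch of what this looks like in practice, the snippet below parses a tweet's JSON into the core fields an archive would keep alongside the full payload. The payload shown is a simplified, hypothetical illustration in the style of Twitter's v1.1 JSON; a real capture contains many more fields and the exact names may differ.

```python
import json

# A simplified, illustrative example of the JSON Twitter's API returns for a
# tweet. Real payloads carry far more metadata (entities, counts, language,
# source application, and so on) - this sketch keeps only a few fields.
SAMPLE_TWEET_JSON = """
{
  "id_str": "1200000000000000001",
  "created_at": "Mon Nov 04 10:15:00 +0000 2019",
  "full_text": "Preserving tweets via the API rather than screenshots",
  "user": {"screen_name": "example_archive", "name": "Example Archive"}
}
"""

def extract_preservation_fields(raw_json: str) -> dict:
    """Pull out the core descriptive fields an archive would index,
    while the complete JSON is preserved unchanged alongside them."""
    tweet = json.loads(raw_json)
    return {
        "id": tweet["id_str"],
        "created_at": tweet["created_at"],
        "text": tweet["full_text"],
        "author": tweet["user"]["screen_name"],
    }

fields = extract_preservation_fields(SAMPLE_TWEET_JSON)
print(fields["id"], fields["author"])
```

The key point is that the JSON itself, not the derived fields, is the preservation copy; the extraction here only feeds indexing and rendering.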

However, the challenge is that a tweet often contains other information alongside the text and metadata – attached images or a video, for example. The JSON helpfully contains links to these objects at various qualities, so you can choose which to download (usually the best quality). The files are typically small, except where Twitter is used for live streaming, where the video can run to many gigabytes. These media files need to be permanently associated with the JSON and metadata, which brings us to the concept of the “multi-part asset”. Introduced in Preservica v6, this combines all the metadata and files associated with a single piece of information into an atomic asset that must be handled as a whole.
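Choosing the best quality can be sketched as follows. The media structure below is a hypothetical example modelled on the variant lists in Twitter's v1.1-style JSON, where each video variant carries a bitrate and a URL; the URLs are placeholders, not real endpoints.

```python
# Hypothetical media section of a tweet's JSON: the same video is offered at
# several bitrates, and an archive would normally keep the best-quality copy.
media_entry = {
    "type": "video",
    "video_info": {
        "variants": [
            {"bitrate": 320000, "content_type": "video/mp4",
             "url": "https://video.example/low.mp4"},
            {"bitrate": 2176000, "content_type": "video/mp4",
             "url": "https://video.example/high.mp4"},
            {"content_type": "application/x-mpegURL",
             "url": "https://video.example/stream.m3u8"},
        ]
    },
}

def best_variant_url(entry: dict) -> str:
    """Return the URL of the highest-bitrate downloadable variant.
    Streaming playlists carry no bitrate, so they are excluded."""
    variants = [v for v in entry["video_info"]["variants"] if "bitrate" in v]
    return max(variants, key=lambda v: v["bitrate"])["url"]

# The chosen file would then be downloaded and stored in the same
# multi-part asset as the tweet's JSON and descriptive metadata.
print(best_variant_url(media_entry))
```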

The next extraction challenge is where the tweet contains a URL link to an external web page. This is contained in the JSON and can be shown in the render tool, but it introduces the problem of link rot – how do we know that the linked information is the same as it was at the time of tweeting, or that the URL even still exists? It is possible to take a snapshot of the web page, either as an image, PDF or WARC file, and to add that to the multi-part asset, but what are the copyright issues relating to this? That question remains to be solved.
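Finding the links to snapshot is straightforward, because the JSON lists them explicitly. The structure below is a hypothetical illustration of the "entities" section in Twitter's v1.1-style JSON, which pairs the shortened t.co form with the expanded original URL.

```python
# Hypothetical "entities" section of a tweet's JSON: external links appear
# as URL objects giving both the shortened form and the expanded original.
tweet = {
    "full_text": "Worth reading: https://t.co/abc123",
    "entities": {
        "urls": [
            {"url": "https://t.co/abc123",
             "expanded_url": "https://example.org/article"}
        ]
    },
}

def urls_to_snapshot(tweet: dict) -> list:
    """List the expanded URLs a snapshot tool (image, PDF or WARC capture)
    would need to visit at ingest time, before link rot sets in."""
    return [u["expanded_url"] for u in tweet["entities"]["urls"]]

print(urls_to_snapshot(tweet))
```

Capturing at ingest matters: the snapshot records what the link pointed to at (or very near) the time of tweeting, which no later crawl can guarantee.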


At Preservica we have a proof of concept running that acquires tweets as multi-part assets and also creates links between tweets for conversation tracking, for example for retweets and quotes. Whilst extraction changes the form of the information, it can be done almost immediately after the tweet is posted, it is comprehensive, and the result appears identical to the original tweet.
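Conversation tracking of this kind can be sketched from fields already present in the JSON. The records and field names below are a hypothetical minimal example in the style of Twitter's v1.1 JSON (reply and quote references), not Preservica's actual implementation.

```python
# Hypothetical minimal records for three tweets: an original, a reply to it,
# and a quote tweet of that reply. Field names follow the v1.1-style JSON.
tweets = [
    {"id_str": "1", "in_reply_to_status_id_str": None,
     "quoted_status_id_str": None},
    {"id_str": "2", "in_reply_to_status_id_str": "1",
     "quoted_status_id_str": None},
    {"id_str": "3", "in_reply_to_status_id_str": None,
     "quoted_status_id_str": "2"},
]

def conversation_links(tweets: list) -> list:
    """Derive (source, relationship, target) triples between archived tweets
    so replies and quotes can be navigated as a conversation."""
    links = []
    for t in tweets:
        if t.get("in_reply_to_status_id_str"):
            links.append((t["id_str"], "reply_to",
                          t["in_reply_to_status_id_str"]))
        if t.get("quoted_status_id_str"):
            links.append((t["id_str"], "quotes",
                          t["quoted_status_id_str"]))
    return links

print(conversation_links(tweets))
```

Because the links are derived from the tweets' own identifiers, they can be recomputed at any time from the preserved JSON, which keeps the archive honest.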

Of course the APIs themselves are now a critical part of the process. They are licensed, often come with stringent yet ambiguous terms and conditions, and these can change with zero notice. At PASIG 2019, Amelia Acker of The University of Texas at Austin explored this in more detail and showed how the APIs themselves, and certainly their terms and conditions, should be preserved alongside the content extracted.

So Twitter preservation has introduced some interesting digital preservation concepts. It has shown that a good quality API can be very useful in exporting a comprehensive copy of the information held within the system so it can be re-used and trusted. It has also introduced the concept of a multi-part asset, which combines multiple files that together present a single indivisible piece of information.

In the next blog I will explore how older information management systems such as Lotus Notes present even more complex digital preservation challenges and show some of the approaches we have used to overcome them.


If you missed Part 1 in the 'Preserving Content from Closed Systems' blog series, you can catch up on it here.

Posted by
Jon Tilbury
Chief Innovation Officer
Information Management Today