by Rob Sharpe

The costs of active preservation

December 9, 2010

Some people may have seen that David Rosenthal from Stanford has posted on his blog (http://blog.dshr.org/2010/12/puzzling-post-from-rob-sharpe.html) commenting on my previous post here. He makes a number of interesting points that I thought I should follow up on.

Digital Preservation Architectures

First of all he states that his main argument is not that formats don't become obsolete but "basing the entire architecture of digital preservation systems on preparing for an event, format obsolescence, which is unlikely to happen to the vast majority of the content in the system in its entire lifetime is not good engineering".

My experience is that, when it comes to building software systems, the 80-20 rule often applies: you can produce the easy 80% of the functionality you want with 20% of the effort, and you need the other 80% of the effort to produce the remaining 20%. It is usually the hard requirements that drive the architecture of the system, because they are where you'll spend most of your effort and they are often difficult to "bolt on" to the wrong solution. Hence, systems need to be complex enough to do the job they need to do but, of course, no more complex than that.

Some migration is required for at least some of the content in digital preservation systems (albeit, as I said in my last post, it is more common to do this for presentation rather than preservation reasons). When migration is needed, it often needs to be done in high volumes. Thus we see it as a requirement (of our system at least) to be able to handle the use case of automated, policy-driven migration. This requirement does place some architectural demands on the system, in particular on the information structure required, and SDB deals with this. However, our functional model is workflow-based and thus quite flexible, so if a user doesn't want to do migration (or characterisation) they can choose not to.
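
To make this more concrete, here is a minimal sketch of what an automated, policy-driven migration step could look like. It is illustrative only, not SDB's actual design: the policy structure, the format identifiers and the migrate_file helper are all assumptions made for the example.

```python
# Illustrative sketch of policy-driven migration, not SDB's actual API.
# The policy maps (hypothetical) format identifiers to a target format and
# a migration tool; files whose format is not covered are left untouched.

from pathlib import Path

MIGRATION_POLICY = {
    "format-A-v1": {"target": "format-A-v3", "tool": "convert-a1-to-a3"},
    "format-B-v2": {"target": "format-B-v4", "tool": "convert-b2-to-b4"},
}

def migrate_file(path: Path, tool: str, target_format: str) -> Path:
    """Hypothetical wrapper that would invoke an external migration tool."""
    migrated = path.with_suffix(path.suffix + ".migrated")
    # ... call the tool here and write the result to `migrated` ...
    return migrated

def run_migration_workflow(files_with_formats):
    """Apply the policy to each (path, format_id) pair produced by characterisation."""
    for path, format_id in files_with_formats:
        rule = MIGRATION_POLICY.get(format_id)
        if rule is None:
            continue  # no policy for this format, so nothing to do
        migrate_file(path, rule["tool"], rule["target"])
```

The point of the workflow shape is that migration is just one optional step: leave a format out of the policy and nothing happens to it.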

Of course, a lot of people don't want to develop a system; they just want to procure one. When doing so, they have a clear choice that can be weighed against other factors: a system that has been engineered to cope with automated, verifiable migration (should it be needed), or one that hasn't.

Running costs

Another part of David's argument is: "The effect of this [Active Preservation] approach is to raise the cost per byte of preserving content, by investing resources in activities such as collecting and validating format metadata, that are unlikely to generate a return".

I think the first thing to point out here is that, in SDB at least, characterisation is a fully automated process driven by machine-readable policies that are established at the start of the system's operation and changed rarely. Hence, there are no significant on-going human costs.
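
By way of illustration, a fully automated characterisation pass can be as simple as walking the ingested content and recording technical metadata for each file. The sketch below is generic, not SDB's internal design: it uses only the Python standard library and records only basic properties, where a production system would delegate to dedicated characterisation tools.

```python
# Generic sketch of unattended characterisation over ingested content.
# Only basic technical metadata is gathered here; real systems delegate
# to dedicated characterisation tools for format identification and more.

import hashlib
from pathlib import Path

def characterise(path: Path) -> dict:
    """Record simple measured properties for one file."""
    data = path.read_bytes()
    return {
        "filename": path.name,
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "extension": path.suffix.lower(),
    }

def characterise_batch(root: Path) -> list:
    """Characterise every file under a directory with no human involvement."""
    return [characterise(p) for p in root.rglob("*") if p.is_file()]
```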

Of course, characterisation programs need to be created, but this is also a one-off cost that can be shared among many organisations. Indeed, most such tools are free to use, so there are usually no on-going licence costs either. There are maintenance costs to consider, insofar as they can be directly assigned to this additional functionality, but in my experience these are not significant either.

This means the only effective cost is that of actually running the programs on a server. I've heard a lot of discussion in digital preservation circles about the "cost" of such activities, but most systems I've seen have a huge amount of spare processing capacity. Hence, the incremental cost of using CPU resources to run automated characterisation (and/or migration) in such systems is effectively zero.

If large amounts of content are being processed, it is plausible that additional application hardware might be needed, but I think this is unusual. With full automation SDB, for example, can process ingests at a rate of up to 2TB/day on a single, fairly standard server, with about a third of the time being spent performing one form of characterisation or another. (The actual speed depends on the spread of file sizes and, to some extent, format). This meets most organisations' needs without any additional investment in hardware. Also, it is worth pointing out that in large-volume systems (which might need to invest in additional application hardware to cope with the processing burden of characterisation) this cost will be dwarfed by the storage costs associated with such systems.
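
As a back-of-envelope check on those figures (the 2TB/day rate and the roughly one-third share of time spent on characterisation are the numbers quoted above; everything else is simple arithmetic):

```python
# Back-of-envelope arithmetic using the figures quoted above.
TB_PER_DAY = 2.0                 # quoted ingest rate on a single standard server
CHARACTERISATION_SHARE = 1 / 3   # roughly a third of processing time

gb_per_hour = TB_PER_DAY * 1024 / 24                   # ~85 GB/hour overall throughput
characterisation_hours = 24 * CHARACTERISATION_SHARE   # ~8 hours/day of CPU time

print(f"Overall ingest rate: {gb_per_hour:.0f} GB/hour")
print(f"Characterisation CPU time: {characterisation_hours:.1f} hours/day")
```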

Of course, there is the cost of storing the additional metadata that characterisation generates. One thing we do in SDB is to allow administrators to choose to store only relevant measured properties (i.e. those that can conceivably contribute to the obsolescence risk or contribute to post-migration comparisons). This keeps the generated metadata down to an acceptable level, and certainly to a fraction of the size of the content for all but the smallest of files. Hence, this does not produce an appreciable cost either.
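
A minimal sketch of that kind of filtering is below; the property names and the retention list are invented for the example and are not SDB's actual configuration.

```python
# Illustrative only: keep just the measured properties an administrator has
# marked as relevant to obsolescence risk or post-migration comparison.
# Property names and the retention list are invented for this example.

RETAINED_PROPERTIES = {"format_id", "page_count", "image_width", "image_height"}

def filter_properties(measured: dict) -> dict:
    """Drop characterisation output not named in the retention policy."""
    return {k: v for k, v in measured.items() if k in RETAINED_PROPERTIES}

full_output = {
    "format_id": "example-format-v1",
    "page_count": 12,
    "compression_scheme": "none",
    "byte_order": "little-endian",
    "image_width": 2480,
    "image_height": 3508,
}
print(filter_properties(full_output))   # only the four retained properties remain
```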

So, all in all, the costs of characterisation are marginal compared to the rest of the running costs of a system. In my view they do not justify David's comment that if characterisation occurs "vastly more content will be lost because no-one can afford to preserve it".

Microsoft Project

The final point on David's post was related to my example of Microsoft Project. First of all, I apologise that in writing the previous post I got my versioning mixed up. I said we couldn’t read the Project 98 format but it is actually the previous version(s) we can’t read directly. What we can do (and why I got confused) is to use the Project 98 software (which is no longer supported) to convert it into a format that modern versions of Project can still read.

Nonetheless, the specifics are less important than David's question: do we eat our own "dog food" (i.e. if we can't read the format, why don't we apply "active preservation" to it)?

In my original post I said there is a "lack of [best] practice and lack of tools to deal with every format". In this case we have developed (some of) the best practice but not a tool. In doing such migrations we found that some of the aggregated conceptual properties of projects (e.g. total cost, total effort and duration) are quite sensitive to small changes and are thus good examples of 'significant properties' to check before and after a migration. What we have not done is produce automated tools for either characterisation or migration, since these require detailed knowledge of the format specification.
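
To show the shape of such a check (a sketch only: the tolerance and the example values are invented, and extracting these properties from project files automatically is exactly the tool we have not built):

```python
# Sketch of a pre/post-migration check on the aggregated project properties
# mentioned above. The tolerance and example values are invented; extracting
# these properties from project files automatically is the missing tool.

SIGNIFICANT_PROPERTIES = ("total_cost", "total_effort_hours", "duration_days")
RELATIVE_TOLERANCE = 0.001   # these aggregates are sensitive to small changes

def compare_significant_properties(before: dict, after: dict) -> list:
    """Return the properties that drifted beyond the tolerance after migration."""
    failures = []
    for prop in SIGNIFICANT_PROPERTIES:
        b, a = before[prop], after[prop]
        drifted = (a != 0) if b == 0 else abs(a - b) / abs(b) > RELATIVE_TOLERANCE
        if drifted:
            failures.append(f"{prop}: {b} -> {a}")
    return failures

# Invented example values, purely to show the check in action.
before = {"total_cost": 125000.0, "total_effort_hours": 1840.0, "duration_days": 210.0}
after  = {"total_cost": 125000.0, "total_effort_hours": 1840.0, "duration_days": 211.0}
print(compare_significant_properties(before, after))   # ['duration_days: 210.0 -> 211.0']
```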

At the risk of repeating my previous post, I believe the production of such tools (and the further development of best practice) remain legitimate areas for further research effort and investment.

Rob