digitising public collections – a long talk

This post is going to be a bit of a mishmash of thinking I’ve done in my classes this semester (I’m looking at digital archiving and organising information digitally), my previous work during my PhD (I was interested in how communities of users employ a range of digital tools to complement their face to face activities), my sessional teaching in courses on audience studies (I taught students how to do audience research, particularly online) and some of the reactions I’ve had to public chitchat about digitising public collections. All my knowledge about digital archiving in practice is theoretical, but for the stuff I’ve learnt during assignments. When I talk about the literature in the information management discipline, I’m talking about the literature as it’s been presented to me in my courses. So it’s a pretty subjective overview reflecting course requirements rather than the field as a whole. Which is a pretty interesting comment on the role of course syllabi in structuring and reflecting our understanding of a discipline, yeah?

I decided to write this really long post in response to a number of really annoying things I’ve read online. Firstly, Marcus Westbury made this comment on his blog and in his Age column:

THE perception at election time is that politicians can get ahead only by rolling out the pork barrel and spending up big. But while we’d all love more money for the arts, there are a lot of things that could make us all culturally richer for not much more than the cost of political will, taking the lead, changing some rules and tweaking a few settings.

So, in the spirit of trying to get some actual ideas into the vacuum that is this election campaign, I’ve knocked up a list of suggestions for pollies wanting to do — or be seen to be doing — something useful on the cheap.

He continues

3. Put our national cultural collections online

It’s a no-brainer. Our museums, galleries, orchestras, opera companies and state theatres are sitting on rich cultural archives that could be shared online tomorrow. But the confusion of rights makes it difficult, and they often err on the side of caution, meaning these vast resources are lost in red tape. We need the 21st-century equivalent of public lending rights on the national broadband network. It’s the cheapest education and most effective audience development opportunity there is. We’re spending a fortune on infrastructure, so we’d be mad not to do it.

This comment really bothered me. It reveals a lack of familiarity with Australian public institutions’ digital activities, not to mention the practical reality of digitising collections. I understand that Marcus is suggesting a solution to the challenges of managing IP online (which do slow down digital library activities) – “the equivalent of public lending rights on the national broadband network” – but I am also fairly sure that this doesn’t recognise the fact that information management specialists are already onto this challenge. I had a three hour session on it last week, and I don’t envy the people who are actually involved in unravelling IP for public institutions’ online and other digital activities.

Digitising national collections isn’t cheap, nor is it simple. I’m going to explain why a bit further on in this post.

I was also alerted to this UK Telegraph article by Colvinius on twitter, which is about a slightly different topic – archiving popular online culture. I can’t really address that in this post as it’s already too massive, but I’m not keen on the implication that public institutions aren’t paying attention to online culture. I mean the Library of Congress is kind of already onto that and they’re not alone – even the NLA is into that action.

Digital archiving? Digital preservation? Digital collections?

Firstly, when I say ‘digital archiving’ or ‘digital preservation’ and ‘digitising collections’ what am I talking about? This semester our work has made a fairly arbitrary distinction between digitising items for long term preservation as digital items, and digitising items for improving community access to these items (whether online at home, in schools or public libraries, on mobile devices or whatever). The differences between preservation and access occur at many points in an item’s journey from physical object to digital file, and then in its storage and access. In the simplest terms, though, a digital preservation project aims to digitise items and then store them for long periods of time and a digital library or gallery project digitises items so as to make them accessible now.

Users? Or, why are we doing this again?
In both cases an ‘audience’ is imagined for the collection, though the term ‘user’ is employed, and complex user personas and use scenarios are developed to aid the design and implementation of the digitisation project. In digital curation, however, the user is positioned as a ‘designated user community’ so as to manage the fact that we cannot actually address the needs of any and all potential users in the future. This is where I’ve found a sticking point: I’d like to design systems that are agile and flexible, able to respond to the needs of potential users. This, however, is in conflict with the fact that much digital preservation work is founded on projects or models developed for use within the scientific or engineering community, digitising their data.

The most important digital preservation model – the Open Archival Information System (OAIS) – was actually developed to manage the vast quantities of data produced by space research. So the user community – the designated user community – with this sort of data and research was an elite group of highly institutionalised research scientists and engineers dealing with a particular type of scientific data. No flexibility there. In fact, there are, of course, all sorts of positivist connotations about ‘data’ and ‘research’ here: data can be ‘captured’ objectively by skilled professionals working in laboratory-like environments and then ‘stored’, untouched by this process, for long periods of time until it is ‘retrieved’ at a future point in time. Problematic, much?

This raises all sorts of problems for people researching with real, live, actual hoomans (and the word ‘humans’ is actually used in much of this literature, in part to distinguish them from computer users, but still – it’s an amusing distinction, yes?). Particularly those of us working in the humanities area. We know that data cannot just be ‘collected’ as though it were little chunks of rock lying about in ‘the field’. We know that our ‘collecting’ is actually a series of complex interpersonal interactions shaped by our own lived experiences. We know there are power dynamics at work here, and that ‘data’ isn’t some sort of coherent unit of information, that it is instead a more dynamic map of human relationships and practices that just doesn’t hold still.

Information management literature likes to draw a distinction between ‘data’ and ‘information’. Data is – and this is stated explicitly – objective units of ‘fact’. Information is created in the human interaction with these facts. There is no allowance – despite some fringe ‘theory’ literature – for the idea that all data is actually ‘information’. This notion of data as an object to be stored has been important to library and museum management for a very, very long time. After all, if objects don’t have an intrinsic, objective value and meaning, why would we bother to store them so carefully for so long for some future generation we can’t ever know? This sounds like quite a long diversion from my point. And it is. But I think it’s important to introduce it here because it informs the whole digital library and digital preservation process. So let’s get back to that stuff.

What do I mean by ‘items’? Basically, I’m talking about the stuff in public (or private) collections: books, posters, costumes, china, maps, records, photos, reel-to-reel film and so on. The physical items in a collection. A collection can include ‘born-digital’ items like websites and digital audio files, but I don’t want to talk about them here. Which is a shame, as that’s the point of that second article I quoted up there at the beginning of this post. Ah well.

Items are digitised in a number of ways, including these few examples:

taking digital photos with a camera
scanning items with a digital scanner (whether flat bed or otherwise)
creating digital recordings of analogue sound recordings (ie turning the sound recorded on a vinyl record into a digital audio file, and then ‘capturing’ the physical recording media – the vinyl record itself – using photography)
creating digital versions of audio-visual files (ie turning a film recorded on reel-to-reel film into a digital file, and then capturing the physical items associated with that recording (can cover, etc) with other devices)

Whose user?
Who is the ‘user’, then? This is a tricky one, and even the literature in the field has trouble with it. While some of the authors in this area are approaching a cultural studies – a feminist, humanities-type – understanding of user which equates to our much-debated ideas about audiences, for the most part there is little rigorous attention to the way ‘users’ are imagined and then built into the design process by information architects. Who are the people who design and then build archiving systems (whether they be for the short or long term)? In all this year, I’ve not heard one comment from any of my lecturers or in any of the literature addressing the role of our own identities in our development of user personas. It’s not been addressed even when I and others in the class have suggested that any user persona we develop is in fact more revealing of our own identities than any ‘real’ user (no matter how well substantiated with research). I’ve found this incredibly difficult to deal with.

To my mind, the design of a website or a poster or a database reflects the ideas and values and ideology and physical experience of the people who made it, and it also reflects the way they imagine the people who will use that system or item. My ideas here are informed by my experiences with cultural studies, media studies, hell, even semiotics and the fundamentals of textual analysis. This concept is a very basic prerequisite for thinking and writing about culture in media, communications and cultural studies literature. I’ve taught it a million times to a zillion students. I can’t even begin to contemplate not believing this. It’s the foundation for much of my thinking about power and privilege and broader social relations. So I’ve found this semester quite frustrating. But let’s return to my original discussion.

Organising items or data systems and controlling language
This all means that a database is not a neutral or objective system or structure. It is not only a reflection of ideology (and culture), it is also continually remaking and restating this culture and ideology. We’ve demonstrated this in class when each of us has come up with quite different architecture for the same database proposal. I think this is why I find the discussions about library and museum management of Aboriginal and Torres Strait Islander culture so interesting. It’s in this literature (which I sought out myself) that I’ve found discussions about how databases and other library systems articulate race and class and power. The best example of this is the ATSILIRN alternative thesaurus.

In the simplest terms, indices or thesauri are controlled languages for describing items. When you enter information about an item into a database, you have to fill in fields like ‘title’, ‘author’, ‘year created’, etc. You could, potentially, put any old information in there. The other week when we were discussing indexing I heard my first in-class comment from a lecturer about how language changes over time and how a word used by me today mightn’t mean the same in 50 years’ time. It was almost a discussion about cultural specificity, but there was no discussion of power or any self-reflexivity. But most database designers recognise this (though they might perceive it as ‘error’ or as aberration) and so try to control what can be entered into the fields. You can do this by adding constraints like only allowing numbers to be entered, or only allowing one of a number of options in a drop down menu to be selected. But you can also use specific indices or thesauri to control the words that are entered into the database. If, then, there are only five words specifying race in your thesaurus, and all of them created by and reflecting 19th century Anglocentric discourse, you can see where Dodson is coming from when he writes

We have been referred to and catalogued as ‘savages’ or ‘primitive’ while Western industrial peoples are referred to as advanced and complex (Mick Dodson, 1993, quote from the ATSILIRN).
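To make the mechanics of a controlled language a bit more concrete, here is a minimal sketch of the two kinds of constraint described above: a numbers-only field and a field restricted to a fixed list of terms. The field names, the vocabulary and the function name are all made up for illustration; they don't come from any real thesaurus or cataloguing system.

```python
# Hypothetical controlled vocabulary for an 'item_type' field.
# These terms are invented for illustration only.
ITEM_TYPE_TERMS = {"book", "map", "photograph", "poster", "sound recording"}


def validate_record(record):
    """Return a list of problems found in a catalogue record (empty if none)."""
    problems = []

    # Constraint 1: 'year_created' must be a whole number.
    year = record.get("year_created")
    if not isinstance(year, int):
        problems.append("year_created must be a number")

    # Constraint 2: 'item_type' must be one of the controlled terms.
    item_type = record.get("item_type")
    if item_type not in ITEM_TYPE_TERMS:
        problems.append(f"item_type {item_type!r} is not in the controlled vocabulary")

    return problems


print(validate_record({"title": "Harbour chart", "year_created": "c. 1932", "item_type": "chart"}))
```

The thing to notice is that the list decides what can be said at all. A term that isn't in the vocabulary simply can't be recorded, which is exactly why it matters who wrote the list and when.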

Standardisation
I’ve wandered away from my initial point in quite a serious way, here. But it’s important to discuss this stuff, because all databases managing the records of library and museum and other collections employ controlled vocabularies of some sort. They might be in the indices and thesauri, but they are also in the metadata – the information about data – that is used to organise items within an archiving system. There are various types of metadata, and they gain their value from being standardised. Standardised metadata systems are really lists of fields which have been agreed on by various communities (usually in a formal sense, after intense negotiation, as in the case of Dublin Core).
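As a rough illustration of what an ‘agreed list of fields’ looks like, here is a sketch using the fifteen element names of the basic Dublin Core element set. The record itself, and the little helper that spots fields outside the agreed list, are my own invention for the purposes of the example.

```python
# The fifteen elements of the basic Dublin Core element set.
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def non_standard_fields(record):
    """List the fields in a record that are not Dublin Core elements."""
    return sorted(set(record) - set(DUBLIN_CORE_ELEMENTS))


# An invented record: most fields use the shared element names,
# but 'shelf' is a local field that another institution wouldn't recognise.
record = {
    "title": "Harbour view",
    "creator": "Unknown",
    "date": "1932",
    "type": "StillImage",
    "format": "image/tiff",
    "rights": "Copyright status not yet determined",
    "shelf": "B12",
}

print(non_standard_fields(record))
```

The standard doesn’t stop you keeping local fields like ‘shelf’, but only the agreed ones can be relied on when your records meet someone else’s system.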

I was super excited when I heard this – an enforced interpretative repertoire! (You can read about interpretative repertoires in Potter and Wetherell’s work, particularly the 1987 book Discourse and Social Psychology: Beyond Attitudes and Behaviour.) The collaborative meaning making Stuart Hall talked about was actually codified and institutionalised here! I’d been used to thinking about these sorts of systems as evidence of ideology – or as things that could be analysed for ideology. But we don’t talk about this in class at the moment, and I daren’t even raise it, simply because the concepts require quite a bit of background information. If anything, my frustration here is perhaps evidence of how difficult it is to articulate this sort of theoretical and critical work in everyday language. Irony, much?

So, to this point, I’ve discussed ‘items’, the way users are imagined, and controlled languages in information management. How does all this relate to digital libraries and digital preservation? And exactly what does digitisation of a collection involve?

The Canadian Heritage Information Network (CHIN) has a very handy guide to digitising small collections. You can see as you browse through this excellent guide that digitising a collection involves a little more than just digitising a bunch of things and then whacking them up on the internet. The digitising process – where you make digital files of each item – is about one third of the entire project. The larger part includes planning, budgeting and then organising the digital collection. Even a small collection includes thousands of digital items, all of which have to be organised within a (series of) databases. Even if your digital collection is not intended for public use on a regular basis, it will still require complex data management tools. Luckily, there is OAIS, which is standardised (it even has its own ISO standard). Unluckily, OAIS is complicated and not exactly simple to implement. Perhaps more usefully, there is the Digital Curation Centre’s Curation Lifecycle Model, which allows you to develop an overall plan for long term digitisation and preservation. It’s also quite complicated.

I wrote a paper this semester about the challenges of digital preservation for remote Aboriginal and Torres Strait Islander communities. The most pressing of these (which apply to all collections) are:

handling delicate or culturally sacred material items
securing permissions to handle and then digitise and store digital records of these items
creating an architecture that will organise these individual records and files
acquiring and then maintaining the hardware necessary for storing such large amounts of data (ie having good server centres with reliable electricity, cooling and physical facilities)
having really good internet access to facilitate introduction of items to the archive and then maintaining the collection and supporting use of the collection
having sufficient funding to cover all aspects of the digitising process
having access to suitable skilled personnel to do the digitising, organising and maintenance of the collection

All these things are challenging for remote communities facing other quite serious social issues. But they are also challenges for librarians and curators working with smaller collections in regional institutions or within larger institutions.

In ~~the most basic terms~~ actually fairly complicated terms, digitising a collection for preservation involves:

Digitisation of a collection: planning

planning and scoping the project can take a long time. You have to know what you have to digitise, what condition it’s all in, and what your resources are.
If you don’t have much money (and no institution in Australia has the money to digitise its entire collection), you have to start planning some serious fund raising. If your collection is attached to a large private enterprise or company, you’re in a much better place than a large public collection.
You have to start thinking about acquiring the technology and skills to do your own digitising, or you have to source a suitable external body to do it (which is something most collections do these days). The technology is super expensive, and the hours required for digitising are massive.
You have to discover and plan the appropriate standards for file formats for the digital records you produce. What file format will be useable ten years from now, let alone fifty? The questions at this stage continue and continue…

Digitisation of a collection: preparation

This is where you start ascertaining the state of your collection and beginning necessary conservatorial and preservation processes. Some items are physically very fragile and need to be handled by professionals in safe conditions. Their preparation for the physical act of digitising can take a long time. Some are damaged by the very act of digitising. Check out Pinknantucket’s blog for some super cool discussion of this stuff.
It’s also during this phase that you start checking out the permissions and intellectual property rights associated with items in your collection. Who owns them? Who can use them? There are some loop holes in Australian copyright law that allow public institutions like libraries and museums to digitise or copy items without permission. But you have to be very very sure you understand these laws or you can cost yourself a massive amount of money in legal fees later on.
You also have to find out whether you have permission to digitise all the items in your collection. Are you holding items on behalf of third parties? Do you have permission to digitise them, let alone make them available for future use by other parties? This is particularly relevant for items belonging to Aboriginal and Torres Strait Islanders. One of the largest challenges here is that many collections possess items for which they don’t have ownership details. The forcible removal of Aboriginal people from their lands and subsequent government policies of removal of children from families contributed to the mis-identifying or un-identifying of items in public and private collections. If you don’t have a suitably qualified person on your staff to aid the identification of your items, you can’t even begin to approach gaining permission for digitising or use. But this is also important for other community groups and individuals. Is it appropriate to digitise very personal, private stories without permission? Would artists approve the changing-of-form which digitising involves?
You should also be thinking about whether open access to everything in your collection is a good idea at all. I’m not convinced that complete access is a good idea. Michael Gurstein raises some interesting questions on this.

Digitisation of a collection: acquisition

This is where we actually start digitising. Digitisation is a complex, demanding, labour and resource intensive process. It’s not just a matter of taking a photo of an item with a digital camera. The equipment is very expensive and requires particular skills to use. The digital files produced must be of a very high standard, and of an appropriate format. They must have longevity (ie won’t be unusable in ten years) and they are usually very large files. It’s also at this point that we see the difference between digital archiving and digitising for access. One single photo, for example, must also be reproduced in a number of different formats for public access – a smaller file for thumbnails in a catalogue, a high resolution image for sales to the public, a high resolution image of the item accompanied by colour matching tools and measuring tools and so on. Each of these files must also be tested for errors and accuracy – the colours must be perfectly matched and the scale carefully noted.
Creating metadata. This is where the digitising process gets complicated. You can’t simply whack a digital item ‘up on the net’ or into a box for use or preservation. All the thousands and thousands of files must be organised – for preservation, for maintenance, for error checking, for migration (where files are transformed to new formats as technology and file formats are superseded, or as publicly accessible catalogues are produced) and so on.
This organisation is enabled by attaching metadata – information about the data object – to the digital item itself. This can’t just be one long list of details, it has to be an organised system of sets of information, all attached to one digital file. There are different sets of metadata – different sets of ‘rules’ or lists of fields for information about the item – and they can’t be applied randomly. It’s important to use standardised metadata (rather than just ones you make up yourself) for two main reasons: to allow interoperability (where your system can work with other systems – within your institution or with other institutions’) and access (either public access or access within your institution to facilitate maintenance).
OAIS outlines a complex system for managing this organisation of the digital files and attendant metadata. There are different types of metadata: submission information (where the digitisation process is recorded and attached to the file, noting things like file format, camera type, photographic angle, etc), descriptive information (which records all manner of things, but can also include descriptions of the original item from its author, historians or other people), technical information (regarding the file format, technology required to access the file – eg a Word .doc file will require a copy of Word to run it) and so on. Copies of the technology required to use the file may also be included as metadata. Most digital preservation systems favour open source software formats so as to avoid the IP challenges associated with this process – so using a Word .doc file is problematic.
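To give a feel for why the files are ‘usually very large’, and why one item turns into several files, here is some back-of-the-envelope arithmetic as a sketch. The pixel dimensions, bit depth and thumbnail size are assumptions I’ve picked for illustration, not figures from any standard or institution.

```python
def uncompressed_bytes(width, height, channels=3, bits_per_channel=16):
    """Size in bytes of an uncompressed raster image with these properties."""
    return width * height * channels * bits_per_channel // 8


def fit_within(width, height, max_side):
    """Scale dimensions down so the longest side is at most max_side pixels."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough, leave untouched
    return (
        max(1, round(width * max_side / longest)),
        max(1, round(height * max_side / longest)),
    )


# Assumed preservation master: 6000 x 4000 pixels, RGB, 16 bits per channel.
master = (6000, 4000)
size = uncompressed_bytes(*master)
print(f"master: {size / 1_000_000:.0f} MB uncompressed")

# Assumed access copy: a catalogue thumbnail no larger than 200 pixels a side.
print("thumbnail:", fit_within(*master, max_side=200))
```

Under those assumptions a single master image comes to about 144 MB before any compression, so a ‘small’ collection of a few thousand items is already hundreds of gigabytes, and that is before you make the access copies.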

Digitisation of a collection: ingest

This is where you actually start putting your digital files into your archiving system. At this point you create a whole new set of metadata to record this stage of the process. It’s really important to create metadata as it will not only help you find your item in the archive later, it’ll also help you open it and then share it.
Metadata records permissions, technical information and also descriptive information. The ingest process creates metadata that is about the administration of the item within the system. It’s actually the really big and most important part of a digital archiving process. This is where the big thinking, planning and skill part happens.
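Here is a minimal sketch of what ‘creating a whole new set of metadata’ at ingest might look like. The structure of the package and all the field names are hypothetical, and the checksum is my own addition: recording a hash of the file at ingest is one common way of checking later that the file hasn’t silently changed, but OAIS itself is far more elaborate than this.

```python
import hashlib
from datetime import datetime, timezone


def ingest(content, descriptive, technical, rights):
    """Bundle a file's existing metadata with new administrative metadata."""
    return {
        "descriptive": descriptive,
        "technical": technical,
        "rights": rights,
        # Administrative metadata is created by the act of ingest itself.
        "administrative": {
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "size_bytes": len(content),
            "sha256": hashlib.sha256(content).hexdigest(),
        },
    }


def is_intact(content, package):
    """Check a file against the checksum recorded when it was ingested."""
    return hashlib.sha256(content).hexdigest() == package["administrative"]["sha256"]


# Stand-in bytes; a real master file would be read from disk.
data = b"pretend this is a TIFF"
package = ingest(
    data,
    descriptive={"title": "Harbour view"},
    technical={"format": "image/tiff"},
    rights={"status": "permission not yet confirmed"},
)
print(is_intact(data, package))
```

Even in this toy version you can see the point made above: the metadata is what lets you find the item, open it, and know whether you’re allowed to share it.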

Digitisation of a collection: maintenance

So once you’ve got your digital archive, you need to maintain it.
You’re going to need clever, well researched disaster recovery programs (eg what will you do if your data centre is flooded? Where do you keep multiple redundant copies of your data? What if electricity becomes too expensive for your institution to handle in the future?).
You’re going to need to migrate your data regularly to make sure it’s accessible. This is a problem with data in the earlier space missions – it’s stored on technology that we no longer know how to use or have the hardware to use. You need to not only migrate your data across software, but also record this process so that you have records of when data changed form and how. More metadata.
You’re going to need ongoing funding. More of it. So you need to continually seek out funding for the maintenance of your collection.
You’re also going to need to manage the IP of your collection. Has copyright lapsed on an item? If so, do you still have the rights to store it or use it? Has the IP for an item changed hands, moving between generations? What are the wishes of the next generation? Has an item now become publicly usable – can you now make an item publicly available as per the wishes of its creator? If you’re managing ATSI items, how does sorry business (ie the management of items in regard to the passing away of associated persons), sacred and secret status and cultural significance affect the metadata and permissions associated with an item?
Repatriation and ‘returning’ items. If you do have ATSI items in your collection, how will you manage the repatriation of those items to the original owner? If you are asked to return an item to its traditional owners, how will you deal with the digital records of the item?
Disposal: some items must be disposed of at a certain date. This is the case with items under military or official secrets acts, personal records such as medical records and other issues. How will you handle data destruction? Do you have a schedule for these sorts of events?
Ongoing community access. How will your archive accommodate access to and use of the collection? Is your archive intended solely to preserve items for perpetuity? If so, why? Who are the end users you’re preserving these items for? How, when and if will you allow copies of your collection for general access? What are the terms upon which your digital collection can be used? If you are a public institution, what is the mandate of your organisation – who is your user community? This becomes an issue when you’re funded by public money and required to represent the interests of the community which funds you. Are you federally funded? State funded? Commonwealth funded? How should you make your archive available for use? What constitutes use?
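The migration point above (recording when data changed form and how) can be sketched as an append-only history attached to each item. The package layout, the field names and the converter name below are all hypothetical; this is an illustration of the idea, not any real preservation system’s record format.

```python
from datetime import date


def record_migration(package, new_format, tool, when):
    """Return a copy of the package with its format updated and the change logged."""
    event = {
        "event": "migration",
        "from_format": package["technical"]["format"],
        "to_format": new_format,
        "tool": tool,
        "date": when.isoformat(),
    }
    updated = dict(package)
    updated["technical"] = {**package["technical"], "format": new_format}
    # Earlier events are kept, so the full chain of changes stays visible.
    updated["history"] = package.get("history", []) + [event]
    return updated


original = {"technical": {"format": "image/tiff"}}
migrated = record_migration(
    original, "image/jp2", "hypothetical-converter 1.0", date(2030, 1, 1)
)
print(migrated["history"])
```

The original record is left untouched and each migration adds one more entry, so fifty years on someone can still trace what the file used to be.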

All these issues are very complicated. As you can see, it takes a lot of time, money and effort to create and maintain a digital collection. The types of metadata you use will depend on the goal of your work: are you preserving for the long term? Are you interested in broader community access? In my course a distinction is made between digital preservation and digital libraries for immediate access. This distinction affects metadata strategies. To my mind, a sound digital preservation project allows for interim use as well as long term preservation.

But the challenge here is that digitisation strategies are relatively new, requiring serious institutional changes and are being developed as we go along (though you can read about NSW State Library digitisation policies here). We had a talk from a representative of the NSW State Library this semester, and he noted that for the most part they were figuring out how to do things as they went along. Though there are international standards for a range of aspects of the process, they had to figure out how to do things like manage work flows and labour; how to handle the digitisation of physically fragile items; and how to prioritise digitisation of a massive collection. In the last case, his comment was that they digitised first the items or mini-collections which attracted the most public or private sponsorship. Simply put, the items they’d digitise first were those private beneficiaries and donors considered most important. We were all aware of the difficult power and taste issues at work there.

He also made an interesting point about the way digitising was shaping the organisational structure of the NSW State Library as a whole. The vast resources required for digitising were re-working work flows and allocation of resources throughout the entire library. Digitisation was affecting the culture and goals of the organisation as a whole.

Australia’s national and state institutions’ digital activities today
Australia has actually been very foresighted in its approach to digitising cultural heritage. The National Conservation and Preservation Policy for Movable Cultural Heritage was composed in 1995. The National Library has a well-respected and quite comprehensive digital preservation policy. The National Museum also has a digital preservation policy. In fact, most national and state institutions have digital preservation policies, in part motivated by policy documents like the National Conservation and Preservation Policy for Movable Cultural Heritage. I’d be curious to see the government policies for these things and to see how they’re being fulfilled (or not) today.

I’ve been really surprised by just how active our various national and state institutions are in terms of digitising and engagement with online communities and culture. The Collections Australia Network (CAN) is not only facilitating and documenting digital work by Australian cultural institutions both large and small, it is also engaged in collaborative projects overseas (with CHIN in particular). A whole range of Australian institutions are involved in digital online public access projects like the flickr commons project which works with a third party facilitator, but also with national collaborative projects like Trove.

Trove is a very interesting project because, while it is administered by the National Library of Australia, it draws on the collections of a great many Australian institutions and projects. You can use Trove to search hundreds of collections and catalogues. This interoperability is facilitated by badass metadata. Simply, Trove and other tools (including the wicked mashups with google maps around the place) are enabled by hardcore metadata. And controlled vocabularies. These standardisation tools provide common languages for collections, allowing this sort of exciting collaborative work. Which is yet another reason for following the time consuming, frustrating and expensive digital preservation models like OAIS.
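Here is a toy sketch of why shared metadata makes that kind of cross-collection search possible. To be clear, this is not how Trove is actually built: the two institutions, their field names and the mapping tables (sometimes called crosswalks) are all invented to illustrate the idea.

```python
# Hypothetical crosswalks: each institution's local field names
# mapped onto a shared set of names.
CROSSWALKS = {
    "library_a": {"name": "title", "maker": "creator", "made": "date"},
    "museum_b": {"object_title": "title", "artist": "creator", "year": "date"},
}


def to_shared(source, record):
    """Translate a local record into the shared field names, dropping the rest."""
    mapping = CROSSWALKS[source]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def search(records, field, term):
    """Case-insensitive substring search over one shared field."""
    return [r for r in records if term.lower() in str(r.get(field, "")).lower()]


combined = [
    to_shared("library_a", {"name": "Harbour Bridge opening", "maker": "Unknown", "made": "1932", "shelf": "B12"}),
    to_shared("museum_b", {"object_title": "Bridge study", "artist": "A. Painter", "year": "1930"}),
]

for hit in search(combined, "title", "bridge"):
    print(hit)
```

One search over ‘title’ finds items from both collections, even though neither institution called the field ‘title’ locally. Without the agreed names you would need a separate search for every catalogue.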

One of the impressive things about Australian collections online is their willingness to take up, experiment with and then move on from online projects. Picture Australia, for example, was an innovative gateway to the image collections of Australian institutions. But it will soon be dismantled in favour of (or integrated into) Trove as its technology and services are made redundant.

I’m particularly interested in the way a number of Australian digital collections employ ‘crowd sourcing’ for intellectual, technical and practical labour and community participation. Various institutions employ crowd sourced ‘tagging’ for items. No controlled vocabularies there. The Australian Newspapers Digitisation Program is particularly exciting. Basically, the NLA scans Australian newspapers then puts the files online. Here, volunteers edit the scans, correcting mistakes in the scanning. This is a truly amazing cooperative project, where a national collection works with people in the broader community to make their collection not only online but also usefully online.

Beyond these few examples, most large Australian institutions also maintain a lively online communicative space. I follow a range of organisations on twitter, from CAN to the Powerhouse Museum. Most also have blogs or regularly updated websites with comments and places for people to ask questions (there’s an interesting discussion about the Powerhouse Museum’s 80s exhibition social media strategy here). The National Museum is particularly accessible, not only making items available online, but also providing useful, skilled advice to complement the items themselves. And DINOSAURS!

I could go on and on. But I think it’s worth noting that it’s not simply a matter of whacking stuff online. Sometimes access to a collection means providing assistance in using online AND face to face services. This seems particularly important in regards to the National and State Archives which are very important institutions for individuals trying to find out about their families. They’re used not only by Aboriginal and Torres Strait Islander people tracing families, but also by migrant and refugee people. In these cases online access simply isn’t as useful on its own as it is in cooperation with skilled, culturally sensitive librarians and archivists – real, live people.

I think I’ll leave this with the comment that our museums, galleries, orchestras, opera companies and state theatres are NOT sitting on rich cultural archives that could be shared online tomorrow. No matter what the state of IP and copyright legislation.
