Machines Make Records: The Future of Archival Processing

When I was writing this presentation, I found myself remembering Depeche Mode’s 1984 single People are People, partly because midlife leaves me prone to nostalgia and partly because it seemed to act as a shorthand for a number of points that I wanted to make.

Firstly, I would like to remind us that if we define people as our users, our audiences or our stakeholders, we cannot avoid ‘othering’ them to some extent. We are now in a relationship with them, and we have also defined the terms of that relationship, of that engagement, reducing them to those of use, attention and stake. In this presentation I am starting instead from the position that archivists, records managers and conservators are all just people. Like all people, they are facing an increasingly uncertain and difficult environment, particularly, it seems, with respect to the ways in which technology is advancing and shaping a reality that must now be shared not just with other people, animals and inanimate objects, but also with a whole host of artificially intelligent systems.

Returning to the idea of People are People then, a question that is increasingly being asked, and no longer just within science fiction, is: are machines people too? At what point could or should an A.I. (artificial intelligence) be allowed ‘personhood’, legally, ethically or socially speaking? ‘The robots are coming’, scream the headlines, and scandals such as the recent Facebook/Cambridge Analytica debacle demonstrate the truth of what Terry Cook, speaking in an entirely different context over 20 years ago, highlighted when he pointed out that “Our mindsets and solutions come from and reflect generations of practice in a paper-based world […] This older world is no longer holding.”[1] I would venture that many of us, not as archivists or records managers or conservators, but just as people, are nervous, scared even, about how the newer world is shaping up. We start to worry that our basic literacy skills are lacking as coding is taught in schools and the bright young tech giants run rings around government and citizen alike.

That we are not alone in our anxiety should be of some comfort, and yet for those in the recordkeeping community in particular it is actually just another worry, because it means we also worry that we are not fulfilling what we have long seen as our role and raison d’être: the maintenance of everyone else’s trust and confidence in and through access to a transparent, interpretable, reliable record of events, identities and transactions. Today I would like to outline the position I am currently taking in the face of all this worry and to offer a view of the future of archival processing.

Just so we are clear on what we mean by archival processing, here is a definition from the Society of American Archivists: “The arrangement, description, and housing of archival materials for storage and use by patrons”. Does this sort of processing have a future? Yes. We may no longer put stuff in boxes to put on shelves in a physical repository, but we are quite happy to put stuff in bags or packages in a digital one. Once in the repository we have to check on those packages now and again, and occasionally we will feel the need to rehouse them, most probably into a nice new file format. Clearly I am being deliberately provocative here, but it seems to me that, conceptually speaking, this is just business as usual, albeit on a much accelerated timetable. Even checksums, one of the shiny new playthings that we like to hold up to show how digital we are, are conceptually no different from the weighing in and out of material we have on occasion carried out in our reading rooms.
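To make the weighing comparison concrete, here is a minimal sketch of a fixity check in Python, using only the standard library’s hashlib. The function names and the sample bytes are my own illustrative assumptions and are not taken from any particular repository system.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of some bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def fixity_ok(data: bytes, recorded_digest: str) -> bool:
    """Compare a freshly computed digest with the one recorded earlier."""
    return sha256_of(data) == recorded_digest


# 'Weighing in': record the digest when the material is accessioned.
accessioned = b"Minutes of the committee, 12 March"  # hypothetical content
recorded = sha256_of(accessioned)

# 'Weighing out': recompute later and compare with the recorded value.
print(fixity_ok(accessioned, recorded))          # content unchanged
print(fixity_ok(accessioned + b"!", recorded))   # content altered
```

As with the scales in the reading room, a mismatch tells us that something has changed, but not what or why.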

On the bright side, the digital does bring a reduction in the amount of physical labour we have to do, not just in terms of lugging cardboard boxes around, but also in terms of not being data input clerks and typing in all the metadata accompanying the material. For the digital archivist, receiving a large accession is no longer a legitimate excuse not to go to the gym, but apart from that, what other differences are there really?

Another real difference that we must perhaps allow is scale: there’s a lot more stuff in the world than there used to be. Well yes, fair enough, but the fact that there is a lot more stuff in the world is not unrelated to the fact that we have discovered a new way of encoding and recording that stuff so that it can be reproduced and processed many, many times more quickly and efficiently than has ever previously been possible. And so, if the fact that we are now receiving stuff in born digital form is a problem, it is also the solution, because the big plus, which we are perhaps only just beginning to fully realise, is that material being born digital means that we now have access to new tools and techniques that can help and support us in our processing of it. These tools and techniques allow us the possibility of dealing with it at the sort of scale that is required.

The questions we should perhaps be asking ourselves are therefore things like: how can we make the most of this possibility? In what way should we seek to realise it? Can we identify and overcome the barriers that might prevent us from achieving what we want to achieve? Writing back in the mid-1990s, David Bearman identified a couple of problems for archives as they tried to engage with the emerging technology market. These were firstly that ‘the market is too small and diffuse’ and secondly ‘the absence of an agreed to model of what archives do or how they do it’.[2] Both these problems are still relevant today, as I will discuss shortly, but they are also perhaps particularly acute, because the technology is finally coming close to being able to do the sort of processing that we need it to do. I have found references to artificial intelligence in the archival literature as far back as the 1980s, including this bit of future gazing from Peter Hirtle:

“Recent developments in the field of artificial intelligence (AI) may change this picture. In particular, with the development of expert systems, it has now become possible to foresee a time when archival automation may actively assist in the processing of collections and even meet some of the reference needs of the users.”[3]

The phrase ‘expert systems’ dates this quotation to a previous hype cycle in the field of artificial intelligence, which, soon after this article was published, perished in the storm of what became known as the AI winter. In the years following, work on AI has been carried out under many different labels, and hard work has been done so that we are getting to the stage where it does actually work, and does so in a number of real-life and practical contexts, e.g. speech recognition, machine translation, security, health, finance and so on. Research and development is an expensive business, however, and so, funnily enough, it is the markets and problems of security, health, finance, etc. that tend to attract the attention of cutting edge researchers looking for funding for their work.

We return to the fact that the archives market is too small and diffuse, and we could add poor. But before we all get too depressed, things are starting to look up. Within our community we are starting to undertake more digital research and development in more ways than ever before. Any one of you who has ever tried to work out what to do about the born digital material in your institution, perhaps by looking at the blogs of others doing the same, or by trying out new tools or techniques, has been doing research and development. Then again, we have also not been slow at reaching out to adapt tools and techniques developed by others, such as the digital forensics community in the case of BitCurator, or eDiscovery tools in the case of some recent work by The National Archives. We have leveraged our existing allies and natural partners, e.g. working with humanities researchers on the problem of machine reading old handwriting, and with government on questions such as sensitivity review. Some of us are even starting to collaborate on research that seeks to direct new technologies, such as blockchain, as, fundamentally, recordkeeping technologies.

All positive stuff, but one problem, particularly with research and development at the more blue sky and cutting edge, is that it is just that: cutting edge stuff concerned with extending knowledge and the realm of what is possible, and not so much with the application and use of that knowledge on a daily basis. If we are happy that a subset of our community are turning their attention to more fundamental research and development on our behalf, how do we see that R&D being put into production, and who are we happy should do it? Do we also want to do that for ourselves, on the sort of consortium basis used by BitCurator or ArchivesSpace? Do we want a subset of our community to become software stewards like the Open Preservation Foundation, managing an ongoing road map of development and maintenance? Alternatively, do we want to rely on commercial companies such as Artefactual or Preservica taking the lead? I offer no opinion one way or the other, but I do think it is something we need to have a considered discussion about. Small and diffuse the archive market may be, but we can still explore different ways to define it. After all, we are the market, so why can’t we decide not to be at its mercy?

So much then for Bearman’s first point, but what of his second, ‘the absence of an agreed to model of what archives do or how they do it’? With this comment, Bearman is talking in a specific context, and that is the context of the particular meaning that technology has forced upon the word ‘system’ as a software package. Just as we like to package stuff up to make it easier to deal with and move through time, the technology industry likes to package stuff up so as to make it easier to deal with and sometimes to sell. We may now be exercising some doubts about this approach in terms of Electronic Records Management Systems, but we are still seemingly happy to buy into the idea of Archive Collections Management Systems, Digital Preservation Systems or even Open Archival Information Systems. In this context, we have spent a lot of our time trying to define and communicate our ‘requirements’ and to produce descriptions of systems we already follow. These sorts of descriptions are not going to help us if we want to engage with artificial intelligence and communicate what we need of it. Rather, what we need is an explanation and description of the forms of intellectual and informational processing, and above all the reasoning, that we are undertaking when we undertake archival processing.

Accounts of the sort of reasoning that archivists carry out are few and far between. One notable exception was a study carried out a few years ago by Vicki Lemieux which aimed to study archivists’ cognitive tasks when carrying out arrangement. Amongst other things she found the following:

  • the importance of form […] inferring information about function from the form of the records
  • The archivists […] were using the actual materiality of the records to create mental ‘space’ to think about more abstract concepts of series and the intellectual order of the archival documents[4]

It is specification at this sort of level that is going to be needed if we are going to communicate what we need from artificial intelligence. Lemieux, for example, took from her study the fact that there was a need for a clustering algorithm for form, but what do we mean by form? And what about this link to function? In what way is form aiding our reasoning about function? How does it act as a cue to reason at an even higher level of abstraction about something else entirely? If this is pattern recognition, what are the patterns we are recognising?

Another study that I find very interesting is one undertaken by K Chandler, who passed metadata extracted from a department’s SharePoint site through two different community detection algorithms and then passed the resulting visualisations to the staff of that department so that they could say which one best represented their department.[5] The cues in use here were things like creator name and modified date, and from these the algorithms reasoned different representations of the activity of a department. Can we see immediately how this machine processing might fit into our existing workflows? No, but it does make me wonder whether part of the reason for the reasoning we carry out during archival arrangement is community detection. Indeed, taking this train of thought further, could it not be that once that community has been detected, we take steps through arrangement to reinforce it, in the same way that an amplifier is needed to extend the transmission of signals over increasingly long distances?
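For anyone wondering what reasoning from such cues might look like in practice, here is a deliberately crude sketch in Python. It does not reproduce the algorithms used in the study. It simply links people who have touched the same documents and treats each connected cluster as a ‘community’, which is a much blunter instrument than a true community detection algorithm. The metadata rows, the names and the `min_shared` threshold are all hypothetical, and the modified date cue is left out for brevity.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical metadata rows: (document, person who created or modified it).
records = [
    ("budget.xlsx", "ana"), ("budget.xlsx", "ben"),
    ("forecast.docx", "ana"), ("forecast.docx", "ben"),
    ("rota.docx", "cy"), ("rota.docx", "dee"),
    ("leave.xlsx", "cy"), ("leave.xlsx", "dee"),
    ("newsletter.docx", "ben"), ("newsletter.docx", "cy"),
]


def co_editing_weights(rows):
    """Count, for each pair of people, how many documents they share."""
    people_by_doc = defaultdict(set)
    for doc, person in rows:
        people_by_doc[doc].add(person)
    weights = defaultdict(int)
    for people in people_by_doc.values():
        for a, b in combinations(sorted(people), 2):
            weights[(a, b)] += 1
    return weights


def groups(rows, min_shared=2):
    """Group people linked by at least `min_shared` shared documents.

    This finds connected components of the thresholded graph, a crude
    stand-in for community detection (an assumption for illustration).
    """
    neighbours = defaultdict(set)
    for (a, b), weight in co_editing_weights(rows).items():
        if weight >= min_shared:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, result = set(), []
    for start in sorted({person for _, person in rows}):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        result.append(sorted(component))
    return result


print(groups(records))                # stronger ties only
print(groups(records, min_shared=1))  # any shared document counts
```

Changing the threshold changes the picture of the department that comes out, which is rather the point: the representation depends on choices that somebody has to make and be able to explain.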

Another lesson that I take from these two studies, and the one reason why I do fear slightly for the future of archival processing, is that they remind me firstly of the importance in our reasoning of using the actual materiality of records, and secondly of the degree to which we are increasingly distanced from that materiality. The actual materiality of records is now magnetic fluctuations, bits and bit streams, Unicode, file formats, physical, conceptual and logical data models, relational and non-relational databases, information architectures and so on. All different, and most definitely not the materiality we are used to.

In this light, it is perhaps also worth considering what is not going on in this picture. These students may be parsing handwritten script and undertaking some form of natural language processing but they are not doing that in the same way as a machine. They are certainly not reading every single word, mapping that into some kind of vector space, computing millions of similarity scores and deriving conclusions from the results of the same. Perhaps, after all, machines are better suited to reasoning with the new materiality, but we will still need to direct the reasons for and in such reasoning.
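To give a flavour of what reading like a machine involves, here is a minimal sketch in Python, standard library only. It turns each text into a vector of word counts and computes a cosine similarity score between pairs. The sample texts are invented for illustration, and real systems use far richer representations than raw word counts.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Map a text to a bag-of-words vector: every word and its count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical item descriptions, invented for this example.
docs = {
    "a": "Minutes of the finance committee meeting",
    "b": "Agenda for the finance committee meeting",
    "c": "Photograph of the harbour at dusk",
}
vectors = {name: to_vector(text) for name, text in docs.items()}

# Score every pair; at scale this is where the millions of scores come from.
for x in sorted(vectors):
    for y in sorted(vectors):
        if x < y:
            print(x, y, round(cosine(vectors[x], vectors[y]), 3))
```

The machine reads every word and compares everything with everything else, yet it has no idea why the comparison matters. Supplying that reason remains our job.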

If you came to this talk hoping to be told what the future of archival processing would be, then I am sorry, I left my crystal ball at home, but if you want to know what it is, then the answer is us. We are the future of archival processing and if we want it to have a future then we are going to need to start considering a number of things in a much more co-ordinated and strategic way.

  • Firstly we (or at least some of us) need to improve our understanding of the current state of research across a fragmented artificial intelligence landscape in order to target the limited resources we have for such work in the right place and in the right partnerships.
  • Secondly we (or at least some of us) need to come up with ways to put relevant research and development into ‘production’ and with sustainable models for software/tool stewardship over time.
  • Thirdly we (or at least some of us) need to re-engage with the new materiality of records, learn how to reason with it and better articulate what sort of reasoning that is.

[1] Terry Cook (1994), “Electronic Records, Paper Minds: The Revolution in Information Management and Archives in the Post-Custodial and Post-Modernist Era”, Archives and Manuscripts 22: 300-328.

[2] Directory of software for archives and museums 1994-95, compiled by Belinda Wright with an essay by David Bearman, Pittsburgh: Archives & Museum Informatics, c1994.

[3] Hirtle, Peter B. (1987), “Artificial Intelligence, Expert Systems, and Archival Automation,” Provenance, Journal of the Society of Georgia Archivists 5 no. 1.

[4] Vicki Lemieux (2015) “Visual analytics, cognition and archival arrangement and description: studying archivists’ cognitive tasks to leverage visual thinking for a sustainable archival future”, Archival Science 15. https://link.springer.com/article/10.1007/s10502-013-9212-y

[5] K S Chandler (2017) “Investigating original order with cybernetics and community detection algorithms”, Archival Science 17 (3). https://link.springer.com/article/10.1007/s10502-017-9276-1
