I’ve been looking at Apollo, the Cambridge University institutional repository. It holds an impressive range of content, with over 155,000 items in the collection. Clearly the repository is widely used.

But the more closely I looked, the more I started scratching my head. If ever anyone wanted a clear demonstration of the limitations of current metadata principles, here is an example. While the principles on which the repository was founded seem to have been sound, the resulting collection doesn’t seem to be very searchable. Apollo may have been the god of prophecy, but anyone trying to discover items in this collection may well be in need of some prophecy.

Let’s take content types – over 20 of these are listed. No matter that one of those content types is “other”, or that “book”, “book chapter”, and “book or book chapter” comprise three of those types. There is also “article” alongside “journal article”. It doesn’t look as though there is much rigour when adding new content to the repository.
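To make the point concrete, here is a minimal sketch of how overlapping content types could be folded into one controlled list at the point of deposit. The mapping is my own assumption for illustration (for instance, that “article” means “journal article”); it is not how Apollo works, and the names `CANONICAL_TYPES` and `canonical_type` are hypothetical.

```python
# Hypothetical controlled vocabulary: each label a depositor might type
# is mapped to one canonical content type. The choices are assumptions.
CANONICAL_TYPES = {
    "article": "journal article",
    "journal article": "journal article",
    "book": "book",
    "book chapter": "book chapter",
}


def canonical_type(label: str) -> str:
    """Return the canonical type for a label, or flag it for a human to check."""
    # Collapse stray whitespace and ignore case before looking the label up.
    key = " ".join(label.split()).casefold()
    # Ambiguous labels such as "book or book chapter" or "other" are not
    # guessed at; they are sent for review instead.
    return CANONICAL_TYPES.get(key, "needs review")


for label in ["Article", " Book  Chapter ", "book or book chapter", "other"]:
    print(repr(label), "->", canonical_type(label))
```

The design choice worth noting is the fallback: anything that cannot be mapped with confidence is queued for review rather than silently filed under “other”.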

Well, if it’s difficult to locate content by type, surely there must be some useful keywords? There were no fewer than 65,335 keywords on the day I looked at the repository – there may well be more by now. This vast number of keywords, equivalent to around two fifths of the total number of objects in the collection, is surprising – after all, keywords are supposed to be a concise set of terms that saves you the need to look through all the items individually.

Why so many keywords? The majority of the keywords have only one occurrence, and seem to have been author-generated, which means that whatever the author added as a keyword has been used. So keywords include “F. Clooney” (rather than just the surname “Clooney”), and “fabrication” is indexed separately from “Fabrication”. In other words, this index has not been normalized in any way. Moreover, the alphabetization used to display the keywords is almost unusable: all terms in italic are placed before terms in roman, so Xenopus appears before Aachen.
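Basic normalization of this kind is not difficult. The sketch below assumes, purely for illustration, that italic keywords are stored with HTML `<i>` or `<em>` tags (I don't know how Apollo actually stores them); the function names are hypothetical. It strips the markup, collapses whitespace, and ignores case, so that variants of the same keyword are counted together and sorted in one alphabetical sequence.

```python
import re
from collections import Counter


def normalize_keyword(raw: str) -> str:
    """Strip simple italic markup, collapse whitespace, and fold case."""
    # Assumption: italics appear as <i>...</i> or <em>...</em> tags.
    text = re.sub(r"</?(?:i|em)>", "", raw, flags=re.IGNORECASE)
    text = re.sub(r"\s+", " ", text).strip()
    return text.casefold()


def merge_keywords(raw_keywords):
    """Count how often each normalized keyword occurs."""
    return Counter(normalize_keyword(k) for k in raw_keywords)


raw = ["fabrication", "Fabrication", "<i>Xenopus</i>", "Aachen", " fabrication "]
counts = merge_keywords(raw)

# Sorting the normalized forms puts italic and roman terms in one sequence.
for keyword in sorted(counts):
    print(keyword, counts[keyword])
```

With this sample, the three spellings of “fabrication” collapse into a single entry, and “aachen” sorts ahead of “xenopus” regardless of how either was formatted. Personal names such as “F. Clooney” would need a separate step, since no amount of case-folding will decide whether the initial belongs in the index.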

Still, nobody uses indexes these days. What about simply linking one content object to another? That will give a good link.

Sadly, it seems this is not the case. Germaine Greer’s doctoral thesis, to take an example, is entitled The ethic of love and marriage in Shakespeare’s early comedies, and has links to five other documents. One of them is “political theologies in late colonial Buganda”, and another is the Apollo Annual Report 2008-2009. I can’t see any reason why these two documents should be related to a thesis on early Shakespearean comedy.

Is there anything that can be recommended? Well, good marks for the repository for indicating that there are insufficient records in that subject, in cases where the number of related records is too small for the system to make a good-quality recommendation.

But if the system is able to decide this, why does it link seemingly unrelated content as shown above?

My recommendation would be to use an AI-based tool to reduce many of the above problems. It would index all the content consistently, and so remove the need for authors to tag their own content – the resulting hotchpotch is all too visible here. It’s a great example of well-meant individual actions resulting in collective confusion. Or perhaps the real reason is that the repository doesn’t really exist to help anyone find anything – it is there to fulfil the requirement that academic content generated by members of the university should be available (which doesn’t mean findable).