I’ve been looking at the impressive Cambridge University Institutional Repository, called Apollo. The repository has a wide range of content across several subjects, with over 155,000 items in the collection overall. Clearly the repository is widely used to hold content. But is it used to find content?

The more closely I looked, the more I started scratching my head. If ever anyone wanted a clear demonstration of the limitations of current metadata principles, here is an example. While the principles on which the repository was founded seem to have been sound, the resulting collection doesn’t seem (at least to me) to be very searchable. Apollo may have been the god of prophecy, but anyone trying to discover items in this collection may well be in need of some prophetic vision, because the normal search and discovery tools seem to be less than adequate.

Let’s take content types – over 20 of these are listed. No matter that one of those content types is “other”, or that “book”, “book chapter”, and “book or book chapter” are three separate content types, which means you have to look in all three to find book content. There are also content types “article” and “journal article” – aren’t these the same thing? It doesn’t look as though much rigour is applied when new content types are added to the repository.
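To make the point concrete, here is a minimal sketch of how overlapping content types could be folded into one canonical label each. It is purely illustrative: the mapping table and the function name `canonical_type` are my own invention, not anything Apollo provides, and it assumes that “article” and “journal article” really are the same thing, which a repository manager would need to confirm.

```python
# Hypothetical mapping from overlapping labels to one canonical type.
# The groupings are assumptions for illustration, not Apollo's own scheme.
CANONICAL_TYPE = {
    "book": "book or book chapter",
    "book chapter": "book or book chapter",
    "book or book chapter": "book or book chapter",
    "article": "article",
    "journal article": "article",
}


def canonical_type(label: str) -> str:
    """Return the canonical content type for a label.

    Case and extra whitespace are ignored; labels that are not in the
    mapping are returned in their cleaned-up form.
    """
    key = " ".join(label.split()).casefold()
    return CANONICAL_TYPE.get(key, key)
```

With a table like this in place, a search for book content would only need to look in one place rather than three.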

Well, if it’s difficult to locate content by type, surely there must be some useful keywords? On the day I looked at the repository there were no fewer than 65,335 keywords, and there may well be more by now. This vast number of keywords, equivalent to more than a third of the total number of objects in the collection, is surprising – after all, keywords are supposed to be a concise set of terms that saves you the need to look through all the items individually.

Why so many keywords? The majority of the keywords have only one occurrence, and seem to have been author-generated, which means that whatever the author added as a keyword has been used. So keywords include “F. Clooney” (rather than just the surname “Clooney”); the keyword “fabrication” is indexed separately from “Fabrication”. In other words, this index has not been normalized in any way. Moreover, the alphabetization used to show the keywords is almost unusable. All terms in italic are placed before terms in roman, so “Xenopus” appears in the index of keywords before “Aachen”.
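Even a very basic normalization step would deal with the two problems just described. The sketch below is a rough illustration only: it assumes that italic keywords are stored with HTML-like tags such as `<i>…</i>` (I don't know how Apollo actually stores them), and the function names `sort_key` and `normalize` are hypothetical.

```python
import re
from collections import Counter

# Assumption: italics are stored as HTML-like tags, e.g. <i>Xenopus</i>.
TAG = re.compile(r"<[^>]+>")


def sort_key(keyword: str) -> str:
    """Strip markup, collapse whitespace and fold case for comparison."""
    plain = TAG.sub("", keyword)
    return " ".join(plain.split()).casefold()


def normalize(keywords):
    """Merge keywords that differ only by case, markup or spacing.

    Returns (display form, count) pairs in alphabetical order, where the
    display form is the first spelling seen, with any markup removed.
    """
    counts = Counter()
    display = {}
    for kw in keywords:
        key = sort_key(kw)
        counts[key] += 1
        display.setdefault(key, TAG.sub("", kw).strip())
    # Sorting the normalized keys gives the alphabetical order directly.
    return [(display[key], counts[key]) for key in sorted(counts)]


raw = ["<i>Xenopus</i>", "Aachen", "fabrication", "Fabrication"]
print(normalize(raw))
# [('Aachen', 1), ('fabrication', 2), ('Xenopus', 1)]
```

Here “fabrication” and “Fabrication” collapse into a single entry, and “Xenopus” sorts after “Aachen” whether or not it is italicized. Reconciling “F. Clooney” with “Clooney” is a harder problem that simple case folding will not solve.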

Still, nobody uses indexes these days. What about simply linking one content object to another? That would give a good link.

Sadly, it seems this is not the case. Germaine Greer’s doctoral thesis, to take an example, is entitled The ethic of love and marriage in Shakespeare’s early comedies, and has links to five other documents. One of them is “Political theologies in late colonial Buganda”, and another is the Apollo Annual Report 2008-2009. I can’t see any reason why these two documents should be related to a thesis on early Shakespearean comedy.

Is there anything that can be recommended? Well, good marks to the repository for saying so when a subject has too few related records for the system to make a good-quality recommendation. But if the system is able to decide this, why does it link seemingly unrelated content as shown above?

Since the problem is that the repository managers cannot control the quality, or do much about the consistency, of the metadata supplied with new content coming into the repository, my recommendation would be to consider using an AI-based tool to reduce many of the above problems. It would at least index all the content consistently, and so remove the need for authors to tag their own content – the resulting hotchpotch is all too visible here. Each author adds tags to the best of their ability, but Apollo is a great example of well-meant individual actions resulting in collective confusion.

Or perhaps, to be cynical, the real reason for the poor discovery is that the repository doesn’t really exist to find anything – it is there to fulfil the requirement that academic content generated by members of the university should be available (which doesn’t mean findable).