
I was interested to see a new report, Machine Learning + Libraries, commissioned by the Library of Congress. Strictly speaking, the report is from LC Labs, a team in the Library Digital Strategy Directorate, so it looks fairly indicative of where the Library of Congress envisages libraries moving in terms of machine learning. The report was partly based on a 16-week collaboration with the University of Nebraska-Lincoln, followed by a “Machine Learning + Libraries Summit”, as well as a series of “innovator in residence” projects, so there is clearly an extensive background to what is discussed. Note that the title is “machine learning and libraries” (actually machine learning + libraries – the report is very fond of the plus sign), not “machine learning and the Library of Congress”, a rather different subject. In other words, this document is a recommendation for all libraries.

The author of the 97-page report, Ryan Cordell, is an Associate Professor of English at Northeastern University, Boston; his website states he is founding director of a letterpress studio for Northeastern, and specialises in 19th-century newspapers. This doesn’t immediately seem to be relevant to a report on AI, but the report looks to be reasonably aware of current developments in machine learning. The report is structured in five parts, with the fifth part containing over 20 pages of recommendations.

What is the background to the report? Kate Zwaard, Director of Digital Strategy at Library of Congress, stated this report was “to help realize the vision of our first ever Digital Strategy” (Did the LoC not have one before now?) and so represents quite a substantial investment in time and resources.

The report starts well with the undeniable premise:

Human time, attention, and labour will always be severely limited in proportion to the enormous collections we might wish to describe and catalog

Machine Learning + Libraries, page one

The other key starting point is that

our current digitized collections … comprise only a small subset of the analog collections held by libraries.

Machine Learning + Libraries, page one

I can’t help thinking that the most effective answer to the title of the report, Machine Learning and Libraries, is to recognise that libraries have an opportunity via ML to get closer to their fundamental goal of discovery. It’s very simple:

  1. The goal of libraries is to enable users to discover content.  
  2. Digital content is far more discoverable than print content.
  3. So libraries should first digitise their content, and second, use full-text indexing and AI tools to make that content discoverable.

What is remarkable is that libraries do not currently offer full-text indexing of their own content. Surely this is the primary goal, and everything else is secondary, yet the report does not seem to state these key goals.

Instead, the report’s actual recommendations seem to me something of a missed opportunity. The recommendations first appear on page 42 of the report, but as early as page one, before we find out what machine learning might be able to do for us, we are reminded sternly of the “dangers of the Silicon Valley ideology”. I am reminded of R H Tawney, who states in his book Equality (1931) that the abuse of the welfare system did not mean that governments should not provide welfare programs: abuse of a principle does not mean the principle is wrong. Similarly, the fact that Facebook may or may not use digital content ethically does not mean that digital content should not exist. As it happens, there is a digital copy of Tawney’s book Equality held in the Internet Archive, so I am able to find a reference in the full text – something I could not do from most library catalogues, certainly not from WorldCat or from most university library catalogues I have used. And the chances are that Google will find that statement, or some version of it, when I search there.

The report’s actual recommendations are:

  1. Cultivate Responsible ML in Libraries
  2. Increase access to data for ML (note this doesn’t mean “digitise more content”, but “make existing data more available to ML tools”, in my opinion a secondary goal)
  3. Develop ML + Libraries infrastructure, which means to “clearly establish divisions of labor and expectations for collaborators, while defining specific goals and outcomes for all partners on ML projects”.
  4. Support for ML + Library Projects
  5. ML Expertise in Libraries, which emphasises the role of the library in cultivating literacy about ML

Each major goal is then divided into subgoals, such as “Commit to honest reporting”, or “Adapt model statement of values”. All these recommendations are admirable, but most of them are self-evident and uncontentious: nobody would object to them. What they appear to have missed are the more fundamental recommendations above.

When content was mostly print, libraries made their content discoverable via human-generated metadata, such as library classification systems and keywords. The attempt to classify the world’s analogue knowledge, as found in print books and journals, was ambitious from the start, and resembled the labour of Sisyphus: a never-ending task. No sooner had cataloguers completed classifying one chunk of existing content than new content was published. No major library has yet completed cataloguing its analogue collections. The Bodleian and the British Library certainly have not. Even when catalogued, the entries frequently turn out to be less than adequate. When Microsoft digitised some 50,000 19th-century books from the BL collections back in the 2000s, they found that many of the newly digitised books had no adequate metadata from the old card catalogue.

Manual subject indexing did not take into account the innovation of full-text indexing that became possible in the 1960s (for example, BRS/Search, one of the first full-text indexing tools, was launched in 1968, and became commercially available in 1977) and which is today ubiquitous in internet search engines such as Google. Strangely, libraries seem not to have implemented full-text search and retrieval of their owned or licensed content. Full-text search requires the digitised text of the content, and frequently libraries do not hold this digitised text. They outsource indexing to external vendors of library systems, so that increasingly, the library does not hold content but simply provides signposts and authorisations to content held elsewhere. So the library is reliant on what the third party does (or does not) choose to index.
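To make the difference between catalogue metadata and full-text indexing concrete, here is a minimal sketch of an inverted index in Python. It is an illustration only, not a description of how any library system or search engine is built. The sample records and the function names (`tokenize`, `build_index`, `search`) are invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Invented sample data: a short catalogue record versus the text of the same book.
catalogue = {"book-1": "Printing history. 19th century. Newspapers."}
full_text = {
    "book-1": "The compositor set each line of type by hand before the press run.",
    "book-2": "Subscription lists show how widely the weekly paper circulated.",
}

catalogue_index = build_index(catalogue)
full_text_index = build_index(full_text)

print(search(catalogue_index, "compositor type"))  # nothing: the words are not in the record
print(search(full_text_index, "compositor type"))  # finds book-1
```

The catalogue record can only answer queries that use the cataloguer's own words. The full-text index answers queries on anything the book actually says, provided the digitised text exists.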

Even where full-text repositories exist, they are often inadequate for discovery and manipulation; Google Search has achieved its current pre-eminence because it provides a whole range of string- and AI-based tools to supplement simple string matching to identify words in a text. Nowadays we all expect the search engine to use these tools to help interpret users’ searching. Libraries have failed to take on board this technology, and they have not made use of the latest developments in AI, which use corpus analysis to automatically identify concepts from content and which work at truly monumental scale. Current computing power makes it possible to continuously index the 90+ million journal articles in existence, so why can it not be used to index all book chapters, and to keep the indexes up to date on a rolling basis?

You could say the academic library has failed in its fundamental mission, to enable discovery using the best available tools. By emphasising human-based tools such as manual identification of metadata, and by indexing books but not their content, libraries have started to become incidental to the discovery process. The biggest opportunity for academic libraries is to provide discovery tools similar to those in Google and other major search engines (though applied slightly differently), based on the full text of content. Failure to provide these means that, as many surveys of academic literature point out, library systems are used for only a minority of the content searches carried out by academic researchers. For example, the STM 2018 Report states that library online catalogues are less used than Google, Google Scholar, or a specialist A&I database such as Web of Science (which typically does not hold the full text of items catalogued).
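As a small illustration of what "supplementing simple string matching" can mean, the sketch below ranks documents by a basic TF-IDF score, so that a chapter that is largely about a term comes ahead of one that mentions it in passing. This is a textbook weighting scheme chosen for brevity, not a description of what Google or any library vendor does. The sample chapters and the `rank` function are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(documents, query):
    """Return (doc_id, score) pairs, best first; documents scoring zero are omitted."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    n_docs = len(documents)
    scores = {}
    for term in set(tokenize(query)):
        # Number of documents that contain the term at least once.
        doc_freq = sum(1 for counts in term_counts.values() if term in counts)
        if doc_freq == 0:
            continue
        # Terms that occur in fewer documents get a higher weight.
        idf = math.log(1 + n_docs / doc_freq)
        for doc_id, counts in term_counts.items():
            if counts[term]:
                # Share of the document's words taken up by this term.
                tf = counts[term] / sum(counts.values())
                scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented sample chapters.
chapters = {
    "ch-1": "Welfare policy and the welfare state in Britain",
    "ch-2": "A history of the printing press",
    "ch-3": "Welfare is mentioned once in this long chapter about newspapers and printing",
}

for doc_id, score in rank(chapters, "welfare"):
    print(doc_id, round(score, 3))
```

Plain string matching would return ch-1 and ch-3 as equal hits. The ranking puts ch-1 first because the term makes up a larger share of that chapter.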

Surely the over-riding goal of the report should be twofold:

  1. Get content digitised.
  2. Use AI and ML tools, in conjunction with full-text search engines, to enable discovery of this content.

All other criteria, such as examining bias, access to all, and so on, however important they may be, must come after these two primary goals.