Reading Time: 6 minutes
This is a sparse matrix – most of the elements are zero. Simple, really.

An account of the Search Solutions Conference practical session, 22 November 2022

The annual Search Solutions Conference is not just a great way to keep up with the world of discovery tools, but also provides an excellent collection of practical sessions for users to extend their knowledge. I joined the above one-day session, and found it fascinating.

By way of introduction, here is the shortest history of search I could manage. The course presenters assumed you would have this background, so I have included it for my (and perhaps your) benefit. If you know all this already, please skip a paragraph or two.

The shortest history of search

By way of introduction, this is as short a history of search as I could manage.

The world of search has changed radically during the last few years. For some decades, searching was based on keywords, using the principle of the inverted index, a simple tool for finding content in a text document more quickly than simply searching through the text from the beginning. Google Search, incidentally, still uses an inverted index (along with many other tools, some of which are described below).

Then, in the 1970s, researchers began to think about how often a word appeared in a document (the “term frequency”). In 1972, this idea was elaborated by considering inverse document frequency, a term invented by Karen Spärck Jones, who formulated the simple but effective idea that the significance of a word in a document tended to be inversely proportional to the number of occurrences in that document, hence the “inverse document frequency”. The two were combined as the well-known “TF-IDF” tool.

Bag of Words

During the 1970s and 80s, probabilistic models were introduced, and these were combined with the TF-IDF model, resulting in the widely used BM25 ranking function, another tool for estimating the relevance of a document to a search query. The limitation of BM25 was that it simply treated a document as a bag of words (BoW), that is, a collection of words, without taking word order and proximity into account. Nonetheless, it is very useful for such tasks as spam detection in emails, since some words in spam messages occur more frequently than in standard emails.


But it was the arrival in about 2018 of transformers that truly, well, transformed the world of search; notably, the transformer called BERT (Bidirectional Encoder Representations from Transformers). The “bidirectional” refers to the way that BERT looks not just forward through a document, but forward and backwards, that is, bidirectionally. It creates a scoring for each word based on the terms before and after it.

The practical tutorial

The above mini-history of search was an essential preamble to the tutorial itself, which, as the title suggested, described BERT but then went beyond it, concentrating on developments of the last few years. Since the development of transformers, a multiplicity of models has been created, and the tutorial described some of those models. What made the event more than just a presentation was the hands-on experience of being able to try the tools for ourselves, via a laptop. The lively questioning meant that the presenters, from the Universities of Glasgow and Pisa, had plenty to keep them occupied.

Introducing PyTerrier

There were three sessions, all of them using a framework called PyTerrier, a tool that reduces, if not quite eliminates, the need to write code. It certainly simplifies any coding, since for the most part, all you have to do is select from a set of prepared modules the components you want to use, such as the datasets (there are over 128 available) and the various filters. Steve Zimmerman, an alumnus of the University of Glasgow (and thanks to Steve for reading the first draft of this post), explained PyTerrier as “a much more human friendly Python interface that links to the Terrier libraries written in Java, which are very complex”. From my point of view, I wasn’t aware I was writing in Python or any other language, because it was so simple to change the parameters. It meant that I, a non-programmer, could play with the tools and try out some what ifs? What if we change the parameters or the dataset, for example? Using PyTerrier was simplicity itself – as a user, you just specified the IR tool and the dataset (PyTerrier has 128 datasets available), and watched the results unfold via your laptop. It was a bit like building a website using WordPress, without any need to know HTML or CSS: the full tools are available if you wish, but for many (most?) websites there is no need to get immersed in such detail. We have come a long way since the need to program using machine code.

The three presenters (one of whom had travelled from Pisa just to teach us in this session) led us through a day structured somewhat like the ascent to Parnassus. What was described as the “traditional” approach to information retrieval was around the time of BERT, just four years ago. Sean MacAvaney described how PyTerrier was built on the earlier Terrier, which turns out to be built by the University of Glasgow as well. Why use such tools for investigating algorithms? Simply because, explained MacAvaney, you want to concentrate on the results rather than struggling to make them happen. He promised us no FOR loops, and I didn’t see any.

Part Two: Neural Re-ranking

Re-ranking turned out to be less challenging than might be imagined. Re-ranking is an example of how the latest techniques do not replace earlier tools such as the inverted index, but build on it and improve it. If a search engine such as Google really identifies over a million hits from a search query, then we still have a problem: clearly these results have to be arranged in some kind of relevance ranking to be of any use. One standard method for information retrieval today is to use some additional technique to the initial search engine result, to reduce the number of hits still further – in other words, re-ranking the results. This re-ranking can be simple or sophisticated; the simplest is the wonderfully named “one-hot encoding”, which simply represents each word in a sequence with “1”, and then a “0” for every other word in the vocabulary – no word scores more than any other. More sophisticated re-rankers take multiple words into account, and better still, we reach BERT, looking at words in context. BERT scores each term based on the scores of all the words around it. Using BERT, even though it has no understanding of the text or even the language in which it is working, we can identify terms that are similar in some way. This presents many opportunities for analysing the text. These “context-aware embeddings” can, for example, then enable us to disambiguate polysemous words, such as “crane” the bird, from “crane” the lifting device. Each of these senses appears with similar words around them, and those surrounding words can give you a fair indication of which sense is intended. Other potential uses can be summarisation and identifying further reading. Incidentally, some of these new techniques are highly processor-intensive. BERT takes around 45 times longer to compute than BM25, so there is clearly a significant overhead in processing time. This raises the question, not covered in this session: is it worth all the computing power required? A question that is frequently asked about Blockchain tools.

Sparse and dense retrieval

In the final session, Nicola Tonellotto began by mentioning the inverted index, the basis of information retrieval for many years. In fact he talked so fondly about the inverted index that we almost had a few moments’ silence to remember its passing. But in truth, the inverted index is still around, just enhanced by new tools such as ColBERT and ANCE (pronounced “An-si”), and introduced to the concept of dense retrieval. This turned out to be less forbidding than its title suggests. If you create a graph of all the words in a text, and then annotate it with “1” for each time a word is used, and otherwise “0”, you will have a graph with a lot of zeros, hence a “sparse” model. In contrast, a dense model has fewer zeros. The difference between sparse and dense sounds trivial, yet it merits a Wikipedia entry for it.

One way to make a model more dense is to augment the index, which simply meant adding more terms. If there are any terms in the document that are thought to be significant, these can be repeated in the document, so the system makes more of them. Doc2Query does what its name suggests, generates questions out of a document. It is an example of a dense-embedding model. The move from searching for individual terms to asking natural-language questions is perhaps the most exciting aspect of search today.


All in all, the tutorial was a unique opportunity to get hands-on experience at practical information retrieval from leaders in the field. By the end I was prepared to talk about BM25 and TF-IDF as if they were old hat. But, of course,  these “traditional” tools are still very much in use today; they have just been supplemented by the additional capabilities provided by BERT and similar techniques. So I’ll still be able to talk about inverted indexes at parties (in addition to BERT, of course), and not feel I am entirely superannuated.