Amazon’s “frequently bought together” tool doesn’t inspire confidence: here Amazon recommends The Handmaid’s Tale and A Streetcar Named Desire to anyone buying Hamlet.

Recommender systems are the El Dorado of the 21st century. Retailers, manufacturers, and content aggregators such as Netflix and Spotify all see recommendations as a way of making money, for various (and sometimes disparate) reasons. After all, in a world with too many options, how are we to make a selection? To keep up with Hollywood’s output you would need to watch some 600 films a year; and that output is dwarfed by the movie industry in India, responsible for some 1,800 films every year [figures from Wikipedia]. Clearly, you have to make a choice, and reading reviews, or seeing what has been recommended, is a logical place to start.

Academic publishing is similar, in that most subject areas produce more content than any human could reasonably keep up with manually: around 3,000 new science articles are published every day. So you can see the need for a recommender system to help researchers find relevant content.

However, just as it is possible to question the criteria used by Netflix and Spotify, it is reasonable to ask questions about academic recommender systems. For example, how do you measure the impact of a recommender system? Many articles on this topic begin with big assumptions. Most of them are based on the “more like this” model, technically known as collaborative filtering: recommending items that similar users chose. This model has inherent drawbacks. First, other users have to recommend something. The culture of recommending is relatively recent, and for many digital collections there is no data on what people recommended in the past. In any case, academics don’t recommend articles, although they may cite them (whether for positive or negative reasons is not always clear; most impact-assessment tools don’t make any distinction between the two).
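To make the “more like this” model concrete, here is a minimal sketch of item-to-item collaborative filtering on implicit feedback (which articles each reader accessed). All names and the toy interaction data are hypothetical, and real systems work on vastly larger, sparser matrices:

```python
from math import sqrt

# reader -> set of article ids they interacted with (hypothetical toy data)
interactions = {
    "reader_a": {"art1", "art2", "art3"},
    "reader_b": {"art1", "art2"},
    "reader_c": {"art2", "art3", "art4"},
}

def readers_of(article):
    return {r for r, arts in interactions.items() if article in arts}

def similarity(a, b):
    """Cosine similarity between two articles over their reader sets."""
    ra, rb = readers_of(a), readers_of(b)
    if not ra or not rb:
        return 0.0
    return len(ra & rb) / sqrt(len(ra) * len(rb))

def more_like_this(article, k=2):
    """Rank all other articles by how similar their readerships are."""
    others = {a for arts in interactions.values() for a in arts} - {article}
    return sorted(others, key=lambda a: similarity(article, a), reverse=True)[:k]

print(more_like_this("art1"))  # → ['art2', 'art3']
```

Note that the method says nothing about the *content* of the articles; it relies entirely on overlapping reader behaviour, which is exactly why it struggles when, as in academic publishing, explicit recommendations are scarce.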

Computing-based approaches to recommender systems tend to get involved in mathematics at the expense of looking at the bigger picture. A typical paper, by Claire Longo, turns out to be less than helpful. First of all, it is based entirely on the “more like this” model. Ms Longo then compares three systems:

  1. Random recommender
  2. Popularity recommender
  3. Collaborative filter.
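The three systems Ms Longo compares can be sketched in a few lines each. This is my own illustrative reconstruction, not her code; the interaction log and all names are hypothetical:

```python
import random
from collections import Counter

# toy interaction log of (user, item) pairs -- hypothetical data
log = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "c"), ("u3", "a")]
items = sorted({item for _, item in log})

def random_recommender(user, k=2, seed=0):
    """1. Recommend k items uniformly at random."""
    return random.Random(seed).sample(items, k)

def popularity_recommender(user, k=2):
    """2. Recommend the globally most popular items to everyone."""
    counts = Counter(item for _, item in log)
    return [item for item, _ in counts.most_common(k)]

def collaborative_recommender(user, k=2):
    """3. Recommend unseen items favoured by users with overlapping history."""
    seen = {i for u, i in log if u == user}
    neighbours = {u for u, i in log if i in seen and u != user}
    scores = Counter(i for u, i in log if u in neighbours and i not in seen)
    return [item for item, _ in scores.most_common(k)]

print(popularity_recommender("u3"))     # item "a" tops the list
print(collaborative_recommender("u3"))  # items u3's neighbours chose
```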

Option (1) appears to be similar to the placebo in clinical trials, to see if doing nothing provides any benefit. A better post is by Baptiste Rocca. However, even he proposes a rather arbitrary division of recommender systems into just two categories: collaborative filters and what he calls “content-based” recommendations, which are actually based on information about the user, such as their age or gender. Academic sites do not gather such data, so these tools are useless in this context. In any case, the conclusion to his article states:

Recommender systems are difficult to evaluate: if some classical metrics such that MSE, accuracy, recall or precision can be used, one should keep in mind that some desired properties such as diversity (serendipity) and Explainability can’t be assessed this way; real conditions evaluation (like A/B testing or sample testing) is finally the only real way to evaluate a new recommender system but requires a certain confidence in the model.

That doesn’t inspire confidence! A refreshingly honest article comes from Tomas Rehorek, one of the founders of recommender-system vendor Recombee: Evaluating Recommender Systems: choosing the best one for your business. But this post is mainly based on users rating content, and for academic purposes, researchers don’t rate content. If your users don’t log their approval (or condemnation), how can we make use of their activity on the site? As you would expect from someone providing tools with e-commerce as the ultimate goal, Mr Rehorek reveals a few interesting conclusions:

In some cases (not always, depends on your business!), it’s a fair strategy to recommend only the globally most popular items (a.k.a. bestsellers) to achieve reasonable recall.
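The “reasonable recall” the quote refers to is usually measured as recall@k: of the items a user actually went on to interact with, what fraction appeared in the top k recommendations? A minimal sketch, with hypothetical data:

```python
def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant)

# A static bestseller list can score surprisingly well, simply because
# popular items are relevant to many users at once:
bestsellers = ["a", "b", "c", "d"]
print(recall_at_k(bestsellers, relevant=["a", "c", "x"], k=3))  # → 2/3
```

This is why a popularity baseline is hard to beat on recall alone, even though it personalises nothing.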

What methods does he recommend? Click-through rate (CTR) or conversion rate (CR). Click-through rate looks the most relevant: if I suggest ten recommended articles, the system counts whether any of the recommendations was clicked on. But even here, the measure is not very reliable. In the end, he comes down to customer lifetime value (CLV). You want to keep your customers happy, so you try to provide “nice recommendations with high empirical quality and a reasonably positive ROI”. It’s all a bit vague.
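For what it’s worth, CTR and CR are easy to compute once you log impressions; the field names and the toy log below are hypothetical, and one common convention (clicks per item shown, conversions per click) is assumed:

```python
# Hypothetical impression log: each entry records how many recommendations
# were shown, how many were clicked, and how many led to a download.
impressions = [
    {"shown": 10, "clicked": 1, "downloaded": 0},
    {"shown": 10, "clicked": 3, "downloaded": 1},
    {"shown": 10, "clicked": 0, "downloaded": 0},
]

shown = sum(e["shown"] for e in impressions)
clicks = sum(e["clicked"] for e in impressions)
conversions = sum(e["downloaded"] for e in impressions)

ctr = clicks / shown   # click-through rate: clicks per recommendation shown
cr = conversions / clicks  # conversion rate: conversions per click
print(f"CTR = {ctr:.1%}, CR = {cr:.1%}")
```

The arithmetic is trivial; the hard part, as the article concedes, is deciding whether a click actually signals a good recommendation.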