The US Supreme Court Building (photo by Joe Ravi, CC BY-SA 3.0)

The peer review system is a mess, but there is no single cause of the problem. A very interesting paper from Anna Rogers and Isabelle Augenstein reads at times like a cri de coeur – it sounds as though their first-hand involvement in the peer review process has exposed them to some of its injustice and inequality.

Their article is about submitting papers in NLP and AI, and draws on the authors’ experience of submission and review in that field, including at a related conference, but their comments and suggestions seem to me to be universal.

As Rogers and Augenstein describe it, peer review is an “annotation” task: the reviewer makes a number of comments on the paper. The authors actually say “scores and comments”, which surprised me, because peer review does not, to the best of my knowledge, involve any kind of overt scoring, and I think perhaps it should. Humans tend to resist scoring, as they feel it belittles their judgement, but scoring brings one essential quality to the process: replicability, or at least justifiability.

I remember talking with an admissions tutor from an Oxford college, whose job included turning down 75% of the applicants to study medicine each year. Wasn’t this a rather challenging exercise, I asked? Not now that we have made the process more robust and more defensible, he replied. By “defensible”, he explained, he meant being able to give coherent reasons why one candidate was accepted over another. In the case of admissions to medical school, those reasons had to be solid enough to stand up to a complaining parent.

Peer review is in a somewhat similar position, although scholarly publishing is not usually a zero-sum game in which one paper wins and all the others lose. Part of the problem with peer review, as Rogers and Augenstein describe, is the lack of a clear decision boundary: a few papers are clearly publishable and a few are obviously unacceptable, but what about all the papers in the middle? Rogers and Augenstein cite a study showing 57% disagreement between two sets of peer reviewers over already-accepted papers (that is, papers that someone else had already judged publishable). That is a damning comment on human judgement. Is there a way of improving it?

The authors note that reviewers, faced with a challenging situation, often adopt a “review-to-reject” attitude: is there any way I can quickly and simply find a criterion by which to reject this paper? While that observation is true, I think it points to a more fundamental problem created by our education system. We are all peer reviewers, in one way or another. As part of our liberal education, we have all been trained to argue a case and been judged on our argument. However, that training frequently takes the form of what I would call the debating-society approach. As long as you make a sufficiently telling point that appeals to the audience (in this case, your teacher or examiner), that point will be taken as a proxy for the entire argument. “The use of English in this paper is of indifferent quality … so let’s reject the paper, despite the argument being sound.”

Can we improve on such inadequate reasoning? Is there any way we can make reviews more structured? One obvious way is to turn peer review into a set of questions, with different questions for each type of paper. That way, reviewers would be forced to examine submissions in a more comparable and measured way. Of course, reviewers should still have the opportunity to provide some free text, but some aspects of a paper remain fundamental for all submissions, for example:

  1. Is the aim of the paper stated clearly and succinctly in the title and/or abstract?
  2. Is there a literature review?
  3. If this article describes a trial, does the trial use a statistically valid methodology?
  4. Is each statement in the article supported by a citation?

I am not the best-placed person to draft the specific questions, but I don’t think it would be difficult for a group to come up with 10 or 15 questions of this kind. The approach is rather similar to the frameworks suggested for carrying out a systematic review, such as PICO and PRISMA in the medical sector. Having seen some examples of peer review, I was very surprised that an acceptable review could be anything from a hundred words to a thousand words. Some reviewers write a lot and others don’t need to write so much, it is claimed; I find that difficult to believe.
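
To make the idea concrete, here is a minimal sketch of how such a structured review form might be represented in code. It is purely illustrative: the questions, the 1–5 scoring scale and all the names are my own assumptions, not anything proposed by Rogers and Augenstein.

```python
# A minimal, purely illustrative sketch of a structured review form.
# The questions, the 1-5 scale and the field names are my own assumptions,
# not taken from Rogers and Augenstein.
from dataclasses import dataclass, field

QUESTIONS = [
    "Is the aim of the paper stated clearly in the title and/or abstract?",
    "Is there an adequate literature review?",
    "If the paper describes a trial, is the methodology statistically valid?",
    "Is each statement supported by a citation?",
]

@dataclass
class StructuredReview:
    paper_id: str
    # One score per question, on an assumed 1 (poor) to 5 (excellent) scale.
    scores: dict = field(default_factory=dict)
    # Free text remains available, but alongside the scores, not instead of them.
    comments: str = ""

    def answer(self, question: str, score: int) -> None:
        if question not in QUESTIONS:
            raise ValueError(f"Unknown question: {question}")
        if not 1 <= score <= 5:
            raise ValueError("Score must be between 1 and 5")
        self.scores[question] = score

    def is_complete(self) -> bool:
        # A review only counts once every question has been answered.
        return all(q in self.scores for q in QUESTIONS)

# Example usage
review = StructuredReview(paper_id="submission-042")
for q in QUESTIONS:
    review.answer(q, 4)
review.comments = "Sound argument; the English could be tightened."
print(review.is_complete())  # True
```

The point of the sketch is simply that every reviewer answers the same questions, and the free-text box supplements, rather than replaces, the structured answers.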

A further benefit of structured reviews is that they are easier for the reviewer to carry out. If I have to answer ten questions on a submission, it takes me less time, and organises my thoughts better, than starting with a blank piece of paper. A peer review is not a piece of journalism or an essay to be marked. For an opinion piece, a blank sheet may be fine as a starting point. But a peer review requires a more formal assessment, one which, whether or not the review is seen by others, can be justified as a fair and measured comment.

By having to answer a number of structured questions, the reviewer is forced to provide a more objective assessment, rather than simply seizing on one or two distractions such as the quality of the English. Otherwise, we risk reducing peer review to the level of a parliamentary debate, where a single objection is enough to bring down an entire argument. And, as we know from debates, if you can make the house laugh, you will win the argument, whatever the merits of the proposal. That is no way to carry out peer review.