Hamlet, Second Quarto edition

A long-running exchange of letters in the TLS (most recently May 22 2020) makes it clear that literary scholars have become aware of the use of AI tools to attribute authorship. This technique, always controversial, is not new. The debate reflects the ambivalence with which literary scholars have approached quantitative methods, at times adopting them with enthusiasm while remaining very suspicious of them.

If it were only a discussion about authorship I would not be so concerned, but some of the scholars in question appear to misunderstand how statistical AI tools are used to “understand” texts. I have to agree with Warren Chernaik when he writes:

the question is not whether quantitative analysis, including the counting of function words, is an acceptable method of assigning authorship of particular works, but whether the method is used competently.

Perhaps I misunderstand what the scholars are doing, but the text analytics widely used in academic science publishing is employed in a very different context. First, the texts used must be long. A typical use would be to compare a 5,000-word article against a corpus of 100,000 documents. With this kind of background, the system can see words in context and draw inferences about such things as the subject area of the text, significant phrases, and so on.

Secondly, the words and phrases identified are not the most common ones. Function words such as “the” are useless for identifying anything about a text because they are so common.
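To make the point about common words concrete, here is a minimal sketch of one widely used weighting scheme, TF-IDF, which scores a word by how often it appears in a document and how rare it is across a background corpus. The three-sentence corpus, the sample article, the function names and the particular smoothing formula are all my own illustrative assumptions; commercial text-analytics systems are more elaborate and may work quite differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(document, corpus):
    """Score each word in `document` against a background `corpus`.

    Assumption: a smoothed inverse document frequency,
    log((1 + N) / (1 + df)), chosen for this sketch so that a word
    found in every corpus document scores exactly zero.
    """
    doc_tokens = tokenize(document)
    term_counts = Counter(doc_tokens)
    corpus_vocab = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus)
    scores = {}
    for word, count in term_counts.items():
        doc_freq = sum(1 for vocab in corpus_vocab if word in vocab)
        idf = math.log((1 + n_docs) / (1 + doc_freq))
        scores[word] = (count / len(doc_tokens)) * idf
    return scores


# Toy background corpus and article, invented purely for illustration.
corpus = [
    "the enzyme binds the substrate in the active site",
    "the court ruled that the contract was void",
    "the telescope recorded the spectrum of the star",
]
article = "the enzyme changes shape when the substrate binds the enzyme"

scores = tf_idf(article, corpus)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {score:.3f}")
```

In this toy example “the” is the most frequent word in the article, yet it receives a weight of zero because it occurs in every background document, while subject words such as “enzyme” and “substrate” receive positive weights.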

Yet in the example described, the scholars appear to be determining authorship from short passages of two hundred words or fewer, and to be relying on function words, in the hope that differences in their use might distinguish two authors.
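One way to see why passage length matters is to look at how uncertain a word rate becomes in a small sample. The sketch below counts a few function words in a passage and estimates the standard error of a rate using a simple binomial model. The word list, the function names and the 5% rate are illustrative assumptions of mine, not figures from the TLS exchange, and the binomial model treats words as independent draws, which real prose is not.

```python
import math
import re
from collections import Counter

# Illustrative list only; attribution studies use their own, longer lists.
FUNCTION_WORDS = ("the", "and", "of", "to", "in")


def function_word_rates(text, words=FUNCTION_WORDS):
    """Return each function word's share of all tokens, plus the token count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in words}, 0
    counts = Counter(tokens)
    return {w: counts[w] / total for w in words}, total


def standard_error(rate, n_words):
    """Standard error of a proportion under a simple binomial assumption."""
    return math.sqrt(rate * (1 - rate) / n_words)


# Hypothetical rate: suppose a function word makes up 5% of an author's words.
assumed_rate = 0.05
se_short = standard_error(assumed_rate, 200)    # a 200-word passage
se_long = standard_error(assumed_rate, 5000)    # a 5,000-word article
print(f"200 words:   {assumed_rate:.3f} +/- {se_short:.4f}")
print(f"5,000 words: {assumed_rate:.3f} +/- {se_long:.4f}")
```

Under these assumptions the uncertainty in the 200-word passage is five times that in the 5,000-word article, so a modest difference in rates between two authors could easily be lost in the noise of a short sample.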

One of the participants in this debate is Professor Brian Vickers. Unfortunately, his venom and sarcasm make the discussion more like parliamentary question time than a considered debate. This makes it difficult to establish exactly what he is saying, but he certainly appears to defend the use of function-word counts for attribution purposes: “my observation that many attribution scholars claim that function words are used unconsciously and are therefore better authorship markers”. The reference is elliptical, so we cannot be sure what Prof Vickers himself thinks on the matter.

Whether or not Professor Vickers believes in counting function words for attribution purposes, such reasoning is completely at odds with the orthodox view in text analytics textbooks. How attribution studies descended into such an embarrassing, mudslinging debate I cannot understand. I can only imagine that literary scholars working in the area of attribution receive a training that encourages imaginative and unusual interpretations of evidence (typically from a literary text) but provides almost no understanding of basic statistical methods or of the use of automated statistical tools on a corpus. University departments appear not as peaceful havens of learning but as a furious, acerbic cacophony of disagreement that has little connection with established practice in other disciplines; not a good advert for the humanities.