From David De Roure’s presentation: the differing popularity of the terms “artificial intelligence” and “machine learning” over the past seventy or so years

Conferences and presentations about AI are everywhere these days, but this half-day event, part of the AI UK Fringe 2024 workshop, which looked specifically at AI (both machine learning and generative AI) and its interaction with scholarly research, proved particularly worthwhile. Perhaps what made it so valuable was that the presenters were all people directly involved in the creation and evaluation of AI-based services, which gave their contributions much more substance.

The day was introduced by David De Roure, a digital humanities professor at Oxford, who provided a fascinating historical overview of how AI has interacted with the scholarly infrastructure over the last fifty years or so. As a veteran of symbolic AI, he showed how the term “AI” existed independently of “machine learning” until around 2010, after which the two terms increasingly co-existed. He made a persuasive case for describing the present day as “the AI era”, just as the 1990s could be thought of as the “Web era”. He also pointed out that tools such as ELIZA were already giving the impression of intelligence back in the 1960s, when they were nothing of the kind.

He introduced some fascinating concepts, such as “social machines” (processes in which people do the creative work and the machine does the administration) and “collective truth”, arising from systems such as Wikipedia, where results are arrived at collectively rather than individually. What is striking is that, for all Wikipedia’s faults (I have described some of them here and written more about Wikipedia here), it is today an indispensable resource. One example of collective work that De Roure didn’t mention is the UK Portable Antiquities Scheme, run by the British Museum, which takes advantage of the mass use of metal detectors to find archaeological remains far faster than professional researchers ever could.

By no means were all the presentations stories of success. One revealing presentation, by Mike Thelwall of the University of Sheffield, considered using AI to assess research papers. The challenge is enormous: for the UK Research Excellence Framework (REF) 2021, no fewer than 185,594 articles were assessed by a team of 1,120 human experts, with each article graded between 1* and 4* for research quality. As a trial for the next REF, an AI tool (in fact, a range of tools) was tested and the results presented to a focus panel of experts, who didn’t like them. The conclusion was that academics weren’t happy having their papers assessed by machine. There were many aspects of the comparison that deserve closer attention, but the message I took away concerned the difficulty of comparing human evaluation with AI assessment in an objective way. Humans often disagree: indeed, in the research assessment described above, every article was assessed not once but twice by humans. Unfortunately no record was kept of how far the two human assessors agreed or disagreed. Without such statistics, we are unlikely to have much faith in what AI tells us. We all remain convinced, in some irrational way, that our own driving is better than any machine’s could be (even though humans have an accident rate more than double that of autonomous drivers).
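To make concrete the kind of statistic that was missing, here is a minimal sketch (in Python) of Cohen’s kappa, one common measure of inter-rater agreement; the ratings below are invented purely for illustration, not REF data.

```python
# Illustrative only: Cohen's kappa for two human assessors giving 1*-4* grades.
# The ratings are hypothetical, not taken from any real assessment exercise.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical ratings."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: proportion of items where the two raters match
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's marginal distribution
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical grades for ten papers from two assessors
rater_a = [4, 3, 3, 2, 4, 1, 3, 2, 4, 3]
rater_b = [4, 3, 2, 2, 4, 2, 3, 3, 4, 3]
print(f"Cohen's kappa: {cohens_kappa(rater_a, rater_b):.2f}")
```

With agreement figures like this recorded for the human assessors, an AI tool’s agreement with humans could at least be judged against how well humans agree with each other.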

Andre French outlined how AI was able to enlarge a training set by auto-generating images of plants, enabling the system to make better judgements. This was clever, but I don’t think the lesson for text is that we should generate a comparable number of fake papers that closely resemble real ones.
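For readers unfamiliar with the general idea of expanding a training set with derived images, here is a toy sketch using simple flips and rotations; the generative, AI-created plant images described in the talk are far more sophisticated than this, and the file names below are hypothetical.

```python
# Illustrative only: enlarge a tiny image training set with derived variants.
# This uses basic geometric transforms as a stand-in for the generative
# approach described in the presentation. Requires Pillow; file names are made up.
from PIL import Image

def augment(path, n_rotations=3):
    """Yield simple transformed copies of one training image."""
    img = Image.open(path)
    yield img.transpose(Image.FLIP_LEFT_RIGHT)      # mirrored copy
    for k in range(1, n_rotations + 1):
        yield img.rotate(90 * k, expand=True)        # rotated copies

for i, variant in enumerate(augment("plant_0001.png")):
    variant.save(f"plant_0001_aug{i}.png")
```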

For me, one of the most powerful messages from the event (full disclosure: I work for CORE) came from Petr Knoth, who made the case that preparing and curating the corpus on which an LLM is trained is just as important as whatever tools are run over that corpus; yet the corpus has often been taken for granted in the rush to try out generative AI tools. One reason for the hallucinations produced by ChatGPT and other tools is simply the corpus used to generate the results. The presentation by Digital Science of a wide range of AI-based tools made very clear what can be done with a well-curated corpus.

The closing panel looked at how stakeholders can take advantage of these new technologies, but the topics were so wide-ranging that a list of simple recommended best practices was never likely to emerge. Nonetheless, I think every attendee went away with a clearer understanding of the opportunities and challenges presented by generative AI than they could have gained from social media alone.