Creating new knowledge with AI: an illustration by Paul Cleverley of a text-embedding tool that plots two variables against each other, enabling new knowledge in geoscience to be generated.

The twelve months up to November 2023 have been an exceptional year for the search industry. Generative AI, in several incarnations, has upset the apple cart for many approaches to discovery. Will it replace or augment traditional search? Will Google be superseded?

There were some clear answers at Search Solutions 2023, the annual conference of the BCS Information Retrieval group. A mixture of academic and professional speakers had, unusually, quite a unified response to generative AI. Tools like ChatGPT are here to stay, and they can be used to augment and to improve search using a technique known as RAG:

What is RAG? In full, retrieval-augmented generation, RAG originated in a 2020 paper by Patrick Lewis and members of the Facebook AI team. RAG looks like a solution to some of the shortcomings of generative AI. Specifically, LLMs are typically based on a closed corpus and so are not up to date, whereas search engines like Google are continually updated. RAG provides additional repositories of knowledge that supplement the underlying LLM of a generative AI tool like ChatGPT. The other benefit of RAG is that it makes it possible to provide a response that specifies the source of the information, something the generative AI tools typically do not do.
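As the speakers sketched it, the R and the G are separable steps: retrieve the passages most relevant to a query, then hand them to the generator along with the question. Below is a minimal sketch of that loop; the TF-IDF retriever, the toy corpus, and the call_llm stub are my own illustrative assumptions, not the conference's (or the original paper's) implementation, which used dense neural retrieval.

```python
# A minimal sketch of the retrieve-then-generate (RAG) loop, assuming a
# TF-IDF retriever and a stubbed-out LLM call for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "RAG combines a retriever with a generative language model.",
    "LLMs are trained on a closed corpus and can fall out of date.",
    "Search engines such as Google are continually updated.",
]

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    ranked = sorted(zip(scores, documents), reverse=True)
    return [doc for _, doc in ranked[:k]]

def call_llm(prompt: str) -> str:
    # Placeholder: a real system would call a generative model here.
    return f"(generated answer, grounded in the prompt)\n{prompt}"

def answer(query: str) -> str:
    passages = retrieve(query, corpus)
    # Numbering the passages lets the response cite its sources, the
    # second benefit of RAG described above.
    sources = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    prompt = f"Answer using only these sources:\n{sources}\n\nQuestion: {query}"
    return call_llm(prompt)

print(answer("Why are LLMs not up to date?"))
```

Swapping the TF-IDF retriever for a modern search engine or a vector database changes nothing structural about the loop, which is precisely why RAG keeps the retrieval specialist in the picture.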

So, for anyone like me who came to Search Solutions to see how search has changed in light of the new AI tools, here was my answer. Several of the speakers mentioned RAG. And, most importantly for some in the audience, the technique does require some knowledge of the R (for “retrieval”). In other words, the search professional’s job hasn’t quite disappeared yet.

But the day provided not just a concise snapshot of latest techniques, but a very welcome overview of how we got here. Julie Weeds of the University of Sussex gave a very accessible account of the last few years of machine learning and AI tools, and came up with the memorable description of ChatGPT as similar to the man in the corner of the pub. He (it’s almost always he) always has an opinion, is utterly confident of that opinion, but you have no idea if what he says is correct or utter rubbish.

Hong Zhou of Wiley Partner Solutions supplemented this overview by showing how it applied to academic publishing, examining such topics as the literature review, suggesting relevant journals for an article, and finding peer reviewers. It seemed clear that the existing tools provided by Wiley for authors have not yet been superseded. ChatGPT, for example, provided non-existent peer reviewers. Nonetheless, Hong sounded broadly optimistic for the future, suggesting that academic search will move from search for results to generating answers. Hong’s advice, to use generative AI with caution, was another common recommendation from several participants.

The panel session on combining academic and professional explorations of search was to an extent sidetracked by the continuing discussion of generative AI; not mistakenly, in my opinion, as the astonishing events at OpenAI in the preceding week made it clear that governance and exploitation of the new AI will not be simple to manage together. Given that ChatGPT presents its results with such confidence, will people be convinced it is correct? I am reassured that few people take the man in the corner of the pub seriously, so perhaps we have some in-built mechanism for spotting persuasive-sounding phrases.

Charlie Hull started the afternoon session with a highly concentrated presentation on the application of generative AI to commercial search. He also had a good way to characterise tools such as ChatGPT: “the weird world of LLMs”, which don’t come with a manual. His message, reassuring to an audience of search professionals, was that search hasn’t gone away and there is still a need for traditional search expertise; he ended with the practical tip that if you are looking for funding, don’t call what you do “search”, call it AI instead.

Grace Lee of Thomson Reuters pointed out that for legal applications, trust and explainability are even more important than in most other subject domains. Like Charlie Hull, she was optimistic about the impact of generative AI on information retrieval professionals like her: “I feel our time is coming back”, meaning she had felt behind the curve when everyone was talking about NLP and downplaying information retrieval. Using tools such as RAG means the information specialist still has a role. She even had a word of encouragement for linguists: she suggested that they, rather than computer scientists, are better suited to prompt engineering (identifying the right questions to ask of AI tools).

Peter Winstanley of Semantic Arts had perhaps the most traditionally aligned approach to generative AI. As you might expect from an ontologist, he described the use of ontologies and their role in disambiguation, and compared them to the way that species are differentiated in nature: “No naturalist would question the reality of the species he may find in his garden” (quoting the evolutionary biologist Ernst Mayr). I’m always a bit suspicious of the analogy with natural history: species turn out in practice to be less distinct than we once thought, and distinguishing them is no longer the province of the naturalist; in other words, the boundaries are no longer so easy to discern.

For me, the remarkable thing about generative AI is how it comes up with results without being fed only carefully curated content with well-tagged metadata. Like Google, it appears to tolerate the ambiguities and messiness of natural language and proceeds blithely to deliver results in any case. The corpus behind a large language model (LLM) is not disambiguated and tagged in advance, the way a dictionary definition would be. As a former dictionary editor, I can say with some confidence that there is no single dictionary-based description of all the words in a language: lexicographers disagree on sense distinctions. The attempt to capture all knowledge using formal statements was one of the earlier approaches to AI, and it didn’t seem to get very far.
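To make the contrast concrete, here is a minimal sketch of the kind of disambiguation an ontology makes possible: every sense of a word gets its own identifier and its own place in a hierarchy. The toy two-concept ontology and the ambiguous word “bank” are my own illustrative assumptions, not examples from the talk; a real ontology would use a standard such as OWL or SKOS.

```python
# A hand-rolled sketch of ontology-style disambiguation. Each concept
# has an unambiguous identifier and a broader class; the surface word
# "bank" maps to more than one concept.
ONTOLOGY = {
    "ex:RiverBank": {"label": "bank", "broader": "ex:Landform"},
    "ex:FinancialBank": {"label": "bank", "broader": "ex:Organisation"},
}

def disambiguate(term: str, context_class: str) -> list[str]:
    """Return the concept IDs whose label matches the term and whose
    broader class matches the context, resolving the ambiguity."""
    return [
        concept_id
        for concept_id, concept in ONTOLOGY.items()
        if concept["label"] == term and concept["broader"] == context_class
    ]

print(disambiguate("bank", "ex:Landform"))      # ['ex:RiverBank']
print(disambiguate("bank", "ex:Organisation"))  # ['ex:FinancialBank']
```

Curating explicit identifiers like these for every term in a corpus is exactly the kind of careful tagging that, remarkably, generative AI manages without.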

Finally, Paul Cleverley gave a fascinating outline of how ChatGPT was being integrated (by the GeoScienceWorld cooperative) into OpenGeoSci, a geographical tool for researchers. He described “Eötvösite”, a totally mythical (yet plausible-sounding) mineral invented by generative AI, and contrasted it with how data could be used effectively if it followed three basic principles of data management (I didn’t catch where these were derived from):

  1. Data must be managed as an asset
  2. The provenance of the data should be known
  3. Data can be exploited by AI to generate new derivatives, as long as this does not conflict with the first two principles.

I think we can all agree on these. Paul’s talk echoed the other speakers in concluding that generative AI can be a force for good, as long as it is used with caution. His geoscience examples showed that we have only just begun to discover the ways in which it can most effectively be deployed.

Who knows what new developments will be presented at next year’s conference?