Are we getting any closer to learning how to use generative AI tools? The debate seems to be moving in the right direction, on the basis of a recent exchange of views on Scholarly Kitchen.
First, Hong Zhou of Wiley reported the results of trialling various LLM tools: he input a discovery question to each and compared the answers. He began by comparing ChatGPT with Google Bard, a similar tool based on a large language model (LLM). Among the questions he put to multiple tools was: “I am a research scientist in NLP area. I am currently interested in large language models like GPT-3. Can you recommend some relative papers about this area?”
Hong Zhou’s post was rapidly supplemented by a further post a few days later. This time it was Dustin Smith, president of Hum, who pointed out a fundamental difference when comparing LLMs: ChatGPT is not linked to the internet, and only covers content up to 2021. Smith used the same query sentence, but extended the trial to cover Bard (from Google) and Bing AI Chat (both also based on content up to 2021, but able to search the internet for more up-to-date material). Based on his results, he made some further suggestions:
- Choose the right tool for the job
- Prompt well
More specifically, he suggests recasting the search to provide more information to the system about identity and context, in other words, making the task more explicit, as well as suggesting the output format. All these are sensible ideas, and reveal a shift in approach, but I don’t think they go far enough. I would suggest a shift in focus when using generative AI, from the answer to the question, from the output to the input. Instead of treating the tool like magic, we need to provide enough information for the system to do its work.
Let’s think about our experience of using a search engine such as Google. In the two decades or so that Google has been around, we have learned by trial and error what works and what doesn’t when searching. Finding the nearest pizzeria is something Google does very well indeed; but I wouldn’t ask Google a question as vague as “relative papers about this area”; if I did, I’d deserve an unhelpful response.
So I would focus much more on the input. I would add to Dustin Smith’s proposed parameters such information as:
- Level: do I want introductory or leading-edge articles?
- Audience: is this for specialists or for beginners? Even if I am a research scientist, I may not be an expert in this area.
- Context: where do I want the system to find my results? If I look in the daily press I will get a very different set of results from those in a collection of academic papers.
- Date: am I looking for the most recent papers, or for the fundamental articles in this area?
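As a rough sketch of what folding these parameters into a query might look like, here is a small Python helper. Today’s chat interfaces have no separate fields for level, audience, corpus, or date, so the only option is to state them explicitly in the prompt itself; every name here is illustrative, not any tool’s real API.

```python
# Sketch only: build a discovery prompt that makes level, audience,
# corpus, and date range explicit, rather than leaving the system to guess.
# All parameter names are hypothetical illustrations.

def build_query(topic, level, audience, corpus, date_range):
    """Assemble a discovery prompt that states its context explicitly."""
    return (
        f"Recommend papers on {topic}. "
        f"Level: {level}. Audience: {audience}. "
        f"Search only {corpus}, published {date_range}. "
        "List title, authors, year, and a one-line note on relevance."
    )

prompt = build_query(
    topic="large language models such as GPT-3",
    level="leading-edge",
    audience="NLP specialists",
    corpus="peer-reviewed academic papers",
    date_range="2020 or later",
)
print(prompt)
```

The point is not the code but the habit: each parameter that is left out is a decision handed back to the system’s guesswork.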
Of course, one thing we have all learned from using search interfaces is that users are very poor at articulating this kind of thing. Much better might be to provide, as part of my query, two or three papers I have read and found interesting. This would give the system plenty of context to work with, without the need for me to specify level and audience.
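The example-seeded approach can be sketched just as briefly. The paper titles below are real, well-known NLP papers, used purely as illustrations of what a reader might supply; nothing else here is a real tool’s interface.

```python
# Sketch only: seed the query with papers already read, letting the
# system infer level and audience from the examples themselves.

seed_papers = [
    "Attention Is All You Need (Vaswani et al., 2017)",
    "BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2019)",
    "Language Models are Few-Shot Learners (Brown et al., 2020)",
]

prompt = (
    "I recently read and enjoyed these papers:\n- "
    + "\n- ".join(seed_papers)
    + "\nRecommend five more papers at a similar level on large language models."
)
print(prompt)
```

Three concrete examples carry more information about level, recency, and sub-field than any amount of self-description.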
Given this approach, let’s look again at the search question: “I am a research scientist in NLP area. I am currently interested in large language models like GPT-3. Can you recommend some relative papers about this area?”
Was the question actually about “related” papers, rather than “relative” papers? I assume the questioner meant “related”, but we need to build in a stage where the system interprets the question before further processing (“Did you mean related?”).
Secondly, and more fundamentally, let’s think about context. Is this a likely question? Any researcher inputting such a bald statement is unlikely to get the response they are looking for. Would a research scientist in NLP, in other words a specialist, ask to see papers about large language models? It is a very improbable question; they would already know plenty (but they have not specified sufficiently what they already know).
This appears to be an ongoing problem with the use of generative AI tools: the assumption that they know what you are thinking, when you have not provided sufficient input for them to create a valuable response.
In this case, all that has been provided is “research scientist”, suggesting the academic level of interest, and three subject terms: “NLP”, “large language models”, and “GPT-3”. This gives the system very little indication of how specific a response to provide. If I add these three terms to a Google search, I get 162,000 hits, starting with a Wikipedia entry. That entry was not found by an LLM; its position in the results was most likely determined by Google engineers deciding that an introductory article on a concept in the search should be prioritized.
Richard van Noorden, in a comment on the Smith article, points out that there is more than just an LLM at work when we use these tools for discovery. When we ask an LLM a question, as with Hong Zhou’s example query above, we expect the system to behave like a search engine with an LLM added on. Search engines like Google are the result of AI, certainly, but also of a lot of human tuning to provide the kind of answers the engine thinks might be helpful given the query.
Now, the average query length in Google search is as little as three words. This means the information provided is almost always insufficient for a truly appropriate response, so Google has to do a lot of guesswork in providing a concise answer. If I ask Google the very question Hong asked, the first hit is the Scholarly Kitchen article where the question appeared; in other words, Google searches first for the string you input. Google didn’t find an answer to the question, it found the question! The hits that follow prioritise simple explanations of the subject, on the assumption that you probably want an introduction.
Providing an introduction isn’t good or bad: it’s simply the engineers at Google tuning a system to give the best possible response to insufficient input. If you want a search engine to produce a very specific response, you have to build a search query with dozens of terms and Boolean operators all over the place, which is how a systematic review search is created, with a lot of patience. It will eventually give you the result you are looking for, but it should be possible to provide sufficient context without a 50-term search query.
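For comparison, here is a small invented fragment of the kind of Boolean query a systematic review search relies on; a real review protocol would run to many more lines of synonyms and exclusions.

```
("large language model*" OR "LLM" OR "GPT-3" OR transformer)
AND ("natural language processing" OR NLP)
AND (survey OR review OR evaluation)
NOT (patent OR news)
```

Every clause is the searcher doing by hand what we would like the system to infer from context.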
Despite all the advances provided by LLMs, we should not expect miracles. Expecting a machine to infer unstated context and assumptions is not the best way to use such systems. You have to provide the right level of detail to get a reasonable response, and one vital piece of context is the corpus from which the answer will be retrieved. So if you really are an academic asking an academic question, it makes sense to have your answer derived from a corpus of academic content. If, on the other hand, you want a pizza, go to Google.