 One of the most famous articles published this century, the announcement of the human genome sequence (2001). The abstract is 399 words, which means the article would almost certainly not be accepted for publication today.

Aaron Tay (together with Bianca Kramer and Ludo Waltman) has written an excellent article entitled “Why openly available abstracts are important”, which makes the case for abstracts being made available. I don’t disagree with this principle, but why the insistence on abstracts in preference to full text? The article states “In many cases, however, it is a deliberate choice to work with abstracts rather than full texts, even when the full text of articles is accessible”. They expand this idea later:

As the above examples illustrate, abstracts have many uses. This is true even given the increasing number of articles for which the full text is openly accessible… In many cases, however, it is a deliberate choice to work with abstracts rather than full texts, even when the full text of articles is accessible. This can be for technical reasons, but also because abstracts provide a more focused description of the underlying research. For instance, a recent analysis of the CORD-19 dataset of COVID-related articles showed that only 24% of the CORD-19 articles available in Web of Science include COVID-related terms in their title, abstract, or keywords – restricting to this subset may give more targeted results than using the full dataset.

I am mystified why anyone should choose to work with abstracts rather than full text. This view seems to me to be a hangover from the era of full-text searching in the 1960s and 70s, when full-text indexes identified every mention of a term but without giving any indication of the relative importance of that term. So researchers had to wade through many papers that were not very relevant.

Choosing to work with abstracts only reminds me of the systematic review specialist who was showing me how she carried out a literature search. Her example was new articles on diabetes. “Diabetes is such a common topic”, she said, “that I restrict my search to title only, not the abstract or full text”. I keep my fingers crossed that the resulting systematic review did not omit articles relevant to diabetes that did not include the term “diabetes” in the title. But I fear her methodology was fatally flawed. There will be many articles that contain content relevant to diabetes without mentioning it in the title or abstract.

For better or worse, the unit of meaning that information specialists have to deal with is the article. Any subset of the article, such as title, abstract, or keywords, is a human construct, created by a non-expert: the author him- or herself. Researchers are notoriously bad at identifying keywords for their own papers.

Moreover, article titles are frequently limited by journals imposing arbitrary length restrictions. For example, Nature Physics has a limit of 15 words, Nature Climate Change a limit of 90 characters, and Nature Microbiology a limit of 150 characters (why this variation?). The benefit of shorter titles is not clear. Some scholarly articles claim there is evidence that longer titles are associated with higher citation rates (Habibzadeh and Yadollahie, 2010). This finding was contradicted by a Royal Society paper (Letchford, Moat and Preis, 2015), which examined 20,000 papers. However, its authors point out that the reason may be that high-impact journals restrict the length of article titles: in order to submit to those journals, researchers are forced to cut the number of words in the title. So the result is not convincing evidence for shorter titles. Such a restriction may make it easier for the journal to publish a table of contents, but does nothing for the researcher looking for a specific reference. Abstract length is restricted in the same way: one of the most famous academic papers published this century has an abstract around double the length that would be accepted today.

Present-day concept extraction and relevance tools (UNSILO is an example, but there are others) are a generation beyond full-text searching. They combine concept extraction with relevance ranking, where relevance is calculated from the importance of concepts to the paper overall, compared with other articles in the corpus. In other words, the machine determines what a paper is “about” using statistical tools, rather than the researcher’s intuition or the journal’s house rules. A good AI tool will provide hits sorted by relevance, so the user can take a view on the most appropriate cut-off point.
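To make the idea of corpus-relative relevance concrete, here is a minimal sketch in Python. It uses plain TF-IDF weighting over single words as a stand-in; this is an assumption for illustration only, not a description of how UNSILO or any other commercial tool works, and real concept extraction deals with multi-word concepts rather than single tokens. The function names (`tokenize`, `tfidf_scores`, `rank`) and the three toy documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_scores(corpus):
    """Return one {term: weight} dict per document in the corpus.

    Weight = term frequency in the document * smoothed inverse
    document frequency across the whole corpus.
    """
    tokenized = [tokenize(doc) for doc in corpus]
    n_docs = len(tokenized)

    # Count how many documents each term appears in.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty docs
        scores.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return scores


def rank(corpus, term):
    """Return (document index, weight) pairs for documents that mention
    the term, most relevant first."""
    term = term.lower()
    hits = [
        (index, weights[term])
        for index, weights in enumerate(tfidf_scores(corpus))
        if term in weights
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy corpus (invented for this example).
docs = [
    "diabetes diabetes insulin therapy in diabetes patients",
    "kidney failure outcomes a few patients had diabetes",
    "genome sequencing methods",
]

for index, weight in rank(docs, "diabetes"):
    print(index, round(weight, 3))
```

Both of the first two documents mention “diabetes”, so a plain full-text index would return both with equal standing. Here the first document ranks higher because the term makes up a larger share of its text, and the reader can decide where to cut off the list.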

The concept extraction technology is corpus-based. The bigger the corpus, the better the results, as there is more context for the system to determine what a piece of text is about. It seems strange for information specialists to be arguing for the continued use of the abstract when the full text can provide much better results.