
A medical diagnosis chatbot and a scholarly analysis of clinical patient records don’t appear to be in any way related. Yet there are similarities between the two, as became apparent at this London Text Analytics Meetup (23 February).

Entitled, rather mystifyingly, a “life sciences special”, the event was focused entirely on diagnosis. Although there were four speakers, three of them described various aspects of the Babylon Healthcare chatbot; the fourth talk was by Richard Jackson, an academic researcher.

Perhaps “chatbot” is not the right word. On hearing it, I imagined a laid-back, social-media-based group. Babylon was certainly not that. It is perhaps not quite a startup (one of the Babylon staffers I spoke to answered almost apologetically, “Well, I suppose we are”), but it came across as a highly focused company, even though the environment was startup-like, with bright orange desks, a meeting area carpeted with artificial grass, and huge plants and hanging baskets everywhere.

Babylon does indeed produce chatbots, but in a specific domain: medical diagnosis, interacting with users directly to ascertain their condition and to advise them. In other words, this is a narrowly targeted chatbot, a long way from Amazon’s Alexa or Microsoft’s Zo. One of the speakers pointed out that a general-purpose assistant such as Siri frequently falls back on a Google search in response to general questions, which of course would not be appropriate for a medical diagnosis tool such as Babylon.

What did we learn from the experience of Babylon setting up machine-based interaction with users describing their symptoms? In text analytics terms, they have discovered that a hybrid solution works best, comprising a mixture of machine learning, rules, and a knowledge base. But what we also learned, less formally, were some of the challenges of having to match formal and informal language. In Ireland, we were told, you may in common speech “take a heart attack”, although in a hospital you are more likely to hear that someone has “experienced a myocardial infarction”. Either way, it’s the same thing – and the underlying system needs to identify what that same thing is. This is relevant not only for understanding, but also for the language used in Babylon’s responses to the patient, a subject called NLG (natural language generation).
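To make the formal-versus-informal problem concrete, here is a minimal sketch of mapping different phrasings onto one concept. Everything in it is an assumption for illustration: the concept identifiers, the tiny knowledge base, and the filler-word rule are invented, and the fuzzy match from Python's standard `difflib` merely stands in for whatever learned component a real system would use. It is not a description of Babylon's implementation.

```python
import difflib
import re

# Hypothetical miniature knowledge base: concept id -> known surface forms.
# The ids are invented placeholders, not real terminology codes.
KNOWLEDGE_BASE = {
    "CONCEPT_MI": ["heart attack", "myocardial infarction"],
    "CONCEPT_FEVER": ["fever", "high temperature", "temperature"],
}

# Reverse index: surface form -> concept id.
SURFACE_TO_CONCEPT = {
    form: concept_id
    for concept_id, forms in KNOWLEDGE_BASE.items()
    for form in forms
}

# Rule layer: words that vary by region or register but carry no clinical meaning.
FILLER = re.compile(
    r"\b(i|he|she|they|took|take|had|have|has|experienced|a|an|the)\b"
)


def normalise(text):
    """Lowercase, drop punctuation and filler words, collapse whitespace."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    text = FILLER.sub(" ", text)
    return " ".join(text.split())


def map_to_concept(text):
    """Return a concept id for the utterance, or None if nothing matches."""
    phrase = normalise(text)
    # 1. Knowledge-base lookup on the rule-normalised phrase.
    if phrase in SURFACE_TO_CONCEPT:
        return SURFACE_TO_CONCEPT[phrase]
    # 2. Fuzzy fallback (a stand-in for a learned model) to absorb misspellings.
    close = difflib.get_close_matches(
        phrase, list(SURFACE_TO_CONCEPT), n=1, cutoff=0.8
    )
    return SURFACE_TO_CONCEPT[close[0]] if close else None


print(map_to_concept("I took a heart attack"))
print(map_to_concept("He experienced a myocardial infarction"))
```

Both calls resolve to the same placeholder concept, which is the point: the colloquial and the clinical phrasing have to end up in the same place before any diagnosis logic can run.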

Another curious insight is that when interacting with a human, patients will state what they don’t have (“I don’t have a temperature”), but when interacting with a machine, they tend by default not to mention it. So the machine cannot simply replicate the format of a human-to-human conversation with a GP.
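One way to picture the consequence: a system has to distinguish a symptom the patient has denied from one the patient simply never mentioned, and ask about the latter. The sketch below is a deliberately crude, rule-based illustration of that three-way split. The symptom checklist, the negation cues, and the three-word window are all assumptions of mine, not anything the speakers described.

```python
import re

# Assumed negation cues and an invented symptom checklist.
NEGATION_CUES = r"(?:no|not|don't|do not|haven't|have not|without)"
SYMPTOMS = ["temperature", "cough", "headache"]


def symptom_status(message, symptom):
    """Classify a symptom as 'present', 'absent' or 'unknown' in a message."""
    text = message.lower().replace("’", "'")
    if symptom not in text:
        return "unknown"  # never mentioned: the machine has to ask
    # A negation cue followed by at most three words, then the symptom.
    pattern = (
        rf"\b{NEGATION_CUES}\b(?:\s+\w+){{0,3}}?\s+{re.escape(symptom)}\b"
    )
    return "absent" if re.search(pattern, text) else "present"


def follow_up_questions(message):
    """Ask explicitly about every checklist symptom that was not mentioned."""
    return [
        f"Do you have a {symptom}?"
        for symptom in SYMPTOMS
        if symptom_status(message, symptom) == "unknown"
    ]


print(symptom_status("I don't have a temperature", "temperature"))
print(follow_up_questions("I have a cough but no headache"))
```

A fixed word window like this will misfire on longer sentences; it is only meant to show why "not mentioned" cannot be treated as "not present".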

Richard Jackson, of the Biomedical Research Centre (BRC), Maudsley Hospital, King’s College London, described how he had analysed 11 million clinical records, collected over a ten-year period from 20,000 patients with serious mental illness (SMI). He used text analytics tools (mapping the words of the records using vector-space models) to identify patterns of clinical language that described symptoms.
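For readers unfamiliar with vector-space models, the idea is that each word is represented by the company it keeps, so words used in similar contexts end up with similar vectors. The talk did not specify which model was used, so the following is only a toy co-occurrence version of the general technique, run on three sentences I have invented for the purpose (they are not real clinical notes).

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Build a sparse context vector for each word from its neighbours."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Invented example sentences, for illustration only.
notes = [
    "patient reports low mood today",
    "patient reports anhedonia today",
    "family attended the ward round",
]
vectors = cooccurrence_vectors(notes)

# Words that appear in similar contexts score higher than unrelated ones.
print(cosine(vectors["mood"], vectors["anhedonia"]))
print(cosine(vectors["mood"], vectors["ward"]))
```

At the scale of millions of records, the same principle lets symptom-like terms cluster together even when nobody has listed them in advance.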


Essentially, both presentations, Babylon and the BRC, considered the same area: how to describe medical conditions so that others can interpret them. Babylon is concerned with capturing patient responses and guiding patients to a diagnosis, while the BRC work looked at pre-existing, clinician-created records. But both are looking at the relationship between a taxonomy (the set of terms used to describe medical conditions) and natural language (whether written by the patients themselves or by a trained practitioner).

The challenge is mapping an input (either a patient-submitted input, in the case of Babylon, or a clinician-created patient record, in the King’s College example) to a knowledge base. The knowledge base used in the BRC study is SNOMED-CT, described by Wikipedia as “a computer-processable collection of medical terms … used in clinical documentation and reporting”. Significantly, SNOMED-CT was designed to provide “the core general terminology for electronic health records”. It contains over 300,000 medical “concepts”.

The Babylon team did not reveal which knowledge base they map to, but it is unlikely to differ in principle from SNOMED-CT. Essentially, the process is this:

  1. A human complains of some health problem.
  2. An expert diagnoses their problem and maps it to an ontology of known clinical problems, such as measles, or food poisoning.

In the case of Babylon, this process was not explicitly described, but it is presumably similar to that used for the BRC clinical records: a training set of documents is analysed to identify all the existing medical problems (no doubt drastically simplified in the case of Babylon, quite reasonably, in the interests of obtaining an initial diagnosis against a known set of medical conditions). Babylon also needs to include a route by which the system can respond with “call your GP” or “go to hospital immediately”, something that is not relevant when analysing records after the event.

There are, naturally, permutations and variations to this process. One of the most interesting aspects of Jackson’s BRC research was that clinicians often describe symptoms that are not included in the “official” SNOMED-CT list of terms. Presumably the Babylon team ignore such conditions; the Babylon process is, I assume, a reductive one, identifying a number of symptoms that lead to the diagnosis of a specific condition.
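The basic bookkeeping behind spotting such unlisted terms can be sketched very simply: collect the symptom-like phrases found in the notes, remove anything already in the terminology, and keep what recurs. The function name, the threshold, and the sample phrases below are all hypothetical, and the `terminology` list is an invented stand-in rather than real SNOMED-CT content. How the candidate phrases are extracted in the first place is the hard part, and is not shown here.

```python
from collections import Counter


def unlisted_terms(candidate_phrases, terminology, min_count=2):
    """Return recurring phrases that do not appear in the terminology.

    Results are (phrase, count) pairs, most frequent first.
    """
    known = {term.lower() for term in terminology}
    counts = Counter(phrase.lower().strip() for phrase in candidate_phrases)
    return [
        (phrase, count)
        for phrase, count in counts.most_common()
        if phrase not in known and count >= min_count
    ]


# Invented inputs, for illustration only.
phrases = [
    "low mood", "Thought blocking", "thought blocking",
    "low mood", "word salad", "thought blocking",
]
terminology = ["Low mood"]

print(unlisted_terms(phrases, terminology))
```

The `min_count` threshold is there because a phrase seen once may be a typo or a one-off, whereas a phrase many clinicians keep writing is a candidate gap in the official list.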

Another fascinating discovery from Richard Jackson’s research was that clinicians prefer to create full-text patient records rather than use a software tool for recording notes. When asked why this might be, Jackson responded, interestingly, that clinicians prefer to express their ideas as full text, perhaps because it enables them to “hedge”. Clinicians don’t want to be too precise about saying that the patient has X or Y, but prefer to qualify their ideas, and natural language enables them to state degrees of certainty in a way that a machine does not. Alternatively, he continued, they may just be lazy and find it easier to dictate notes for their secretary to capture later.

But most fascinating was the discovery that the things revealed by practitioners frequently did not correspond with the vast ontology that is SNOMED-CT – over 300,000 terms, and yet clinicians are recording still more terms! How can this be?

In passing, what Jackson revealed about the underlying taxonomy of conditions in SNOMED-CT was also striking. In the area of mental health alone, there seems to be little agreement about quite fundamental terms. He used the phrase “serious mental illness”, which he described as a collective term for three better-known mental illnesses, including schizophrenia, but there is little clinical agreement about the precise boundaries of these three terms.

In conclusion, a fascinating presentation with several interesting discoveries about the interaction of natural language and clinical conditions.