Judea Pearl’s The Book of Why (2018), co-authored with science journalist Dana Mackenzie, is a tantalising read. As for the initial premiss of the book, I am convinced by Pearl’s description of the limitations of current tools, particularly in statistics. He describes three levels of causation:
- Looking for regularities in past behaviour (that is, only identifying correlation)
- Asking what would happen based on possible interventions. Statistics doesn’t answer “what would happen if we double the price?” (this level would provide a measurement of cause and effect)
- Counterfactual questions: What would the world be like if a different path had been taken? (not something we can observe from the actual state of the world)
It’s a sobering thought to read that, until now, machine learning has only got to level one. Yet, Pearl argues, “causal questions can never be answered from data alone”. You need more than just correlations to identify causality.
The problem is (and this is where I get lost in the argument) how Pearl resolves it. He presents many diagrams that, in his opinion, enable us to infer causality. Perhaps I just don’t understand, but I don’t get the feeling that he has identified a machine-based way of establishing causality. If we go back to one of Pearl’s many examples, the relationship between smoking and lung cancer, he points out that while a correlation was detected many years ago, it was only thirty or forty years later that cause was inferred. Current AI tools simply count. They don’t provide any inference. They can count the number of times (for example) smokers and cases of lung cancer co-occur, but they don’t provide any way of inferring causality.
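To make "simply counting" concrete, here is a minimal sketch in Python. The records are invented purely for illustration (they are not real epidemiological figures), and the helper `conditional_frequency` is a hypothetical name of my own. The point is that the same table of counts yields frequencies in either direction, and nothing in the counts says which variable, if either, is the cause.

```python
from collections import Counter

# Invented records, for illustration only: (is_smoker, has_lung_cancer)
records = (
    [(True, True)] * 30
    + [(True, False)] * 170
    + [(False, True)] * 10
    + [(False, False)] * 790
)

# "Counting" is all that happens here: one tally per combination.
counts = Counter(records)


def conditional_frequency(counts, given_index, target_index):
    """Share of records with the target flag set, among records with the given flag set."""
    given = sum(n for key, n in counts.items() if key[given_index])
    both = sum(n for key, n in counts.items() if key[given_index] and key[target_index])
    return both / given


# The same counts can be read in either direction.
cancer_among_smokers = conditional_frequency(counts, 0, 1)   # 30 / 200
smokers_among_cancer = conditional_frequency(counts, 1, 0)   # 30 / 40

print(f"cancer among smokers: {cancer_among_smokers:.2f}")
print(f"smokers among cancer cases: {smokers_among_cancer:.2f}")
```

Both numbers are just ratios of co-occurrence counts. Deciding that smoking causes cancer, rather than the reverse or some third factor causing both, needs something the table does not contain.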
So how do we arrive at causality? A simple example of correlation not implying causality is the relationship between children’s reading ability and their shoe size. Children with larger shoe sizes are better at reading. Of course we know that concluding that big feet cause good reading is ridiculous – the underlying relationship is between the age of the child and the reading ability. But who is to determine which is the fundamental relationship with children’s reading, age or shoe size (or height, or inside leg, or any of the many other variables that co-occur with age)?
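The shoe-size example can be sketched as a small simulation. Everything here is an assumption made for illustration: the coefficients and noise levels are invented, and, crucially, the code builds in the belief that age drives both shoe size and reading. Under that assumption, shoe size and reading are strongly correlated, but the correlation disappears once age is accounted for (a partial correlation, computed here by correlating the residuals left after fitting a straight line on age).

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a least-squares straight line on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed causal story (invented numbers): age drives both other variables.
ages = [rng.uniform(5, 11) for _ in range(2000)]
shoe_sizes = [0.8 * age + rng.gauss(0, 0.7) for age in ages]
reading_scores = [10 * age + rng.gauss(0, 8) for age in ages]

raw = pearson(shoe_sizes, reading_scores)
controlled = pearson(residuals(shoe_sizes, ages), residuals(reading_scores, ages))

print(f"shoe size vs reading: {raw:.2f}")
print(f"same, with age removed: {controlled:.2f}")
```

Note what the sketch does not do: it does not discover that age is the variable to control for. I chose it, using prior knowledge about children, which is exactly the step the data cannot supply on its own.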
A good example of the limitations of current AI is the way that one system, when asked to identify what an article was “about”, selected the terms “pain management” and “placebo”. The article was about managing pain after an operation, and it described a trial where one group was given ibuprofen, one group morphine, and one group nothing (placebo). To a machine, “placebo” is as common a concept in the context of pain management as “ibuprofen” is, since the corpus here consists largely of academic research papers that measure the use of painkillers against a placebo. The machine sees that “placebo” appears alongside other types of pain management; or, who knows, perhaps the researchers are interested in studying the principle of placebo rather than trialling different drugs. But our common-sense knowledge enables us to guess that “placebo”, while an essential part of many medical trials, is not the focus of attention. Just because the word “placebo” is mentioned does not mean the article is about the placebo effect. There is a similar problem with the words “vaccine” and “vaccinated”. Many medical articles mention these terms, but the article is usually not about the efficacy or practice of vaccination.
Pedro Domingos, in his The Master Algorithm (2015), mentions something similar at the basis of Bayesian reasoning:
For most statisticians, the only legitimate way to estimate probabilities is by counting how often the corresponding events occur … the “frequentist” interpretation of probability… Bayesians’ answer is that a probability is not a frequency but a subjective degree of belief. Therefore it’s up to you what you make it, and all that Bayesian inference lets you do is update your prior beliefs with new evidence. (Domingos, 2015, p148)
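The updating step Domingos describes is Bayes' rule, and a minimal sketch may help. The function name `bayes_update` and all the probabilities below are my own invented assumptions, chosen only to show the mechanics: the same evidence moves two people with different priors to different conclusions.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Return the posterior belief in a hypothesis after seeing one piece of evidence.

    Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E)
    """
    numerator = p_evidence_if_true * prior
    evidence = numerator + p_evidence_if_false * (1 - prior)
    return numerator / evidence


def update_repeatedly(prior, observations, p_if_true=0.8, p_if_false=0.3):
    """Apply the same update once per observation (invented likelihoods)."""
    belief = prior
    for _ in range(observations):
        belief = bayes_update(belief, p_if_true, p_if_false)
    return belief


# Two people see the same three observations but start from different priors.
open_minded = update_repeatedly(prior=0.5, observations=3)  # about 0.95
sceptical = update_repeatedly(prior=0.1, observations=3)    # about 0.68

print(f"open-minded posterior: {open_minded:.2f}")
print(f"sceptical posterior: {sceptical:.2f}")
```

The arithmetic is mechanical. Where the prior comes from, and which evidence is considered relevant in the first place, is not something the formula decides.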
I don’t think Pearl gets to the bottom of why we select the variables we do (or at least, I don’t understand his explanation). Why do we dismiss shoe size as a relevant cause for reading ability? We make an assumption from our prior knowledge of the real world. Machines provide evidence, but human brains are required to identify which are the relevant items of evidence to compare and then to draw conclusions.