I was describing my professional activity to a friend in the industry the other day, and they said “what you do is data science!” That was news to me, as I hadn’t heard the term applied to text analytics before. My suspicion was confirmed when I read the Wikipedia definition for data science:

Data science is an interdisciplinary field focused on extracting knowledge from data sets, which are typically large. … Many statisticians, including Nate Silver, have argued that data science is not a new field, but rather another name for statistics. [Wikipedia, “data science” 5 March 2021]

Well, data science may be used to mean statistics, but textbooks on data science, interestingly, assume a working knowledge of statistics (the first textbook I looked at uses phrases such as “one standard deviation lower” without any explanation).

A typical data science textbook, Data Science for Business (Provost & Fawcett, 2013), suggests something of the background to data science. In the preface, the authors state that the book originated as a course for MBA students, but then proved popular with machine-learning students. In other words, it was conceived as a guide for non-technical users who did not (and were never going to) code for themselves.

This would appear to answer the question: Is “data science” the same thing as coding? Clearly, there is a distinction. You could describe it as the distinction between being able to build a spreadsheet, and the ability to use a spreadsheet.

Many people have vast expertise in using Excel to manipulate numerical data. Despite its limitations, Excel can deliver impressive results. At the same time, we are all aware of the risk of over-using simple tools like Excel, which are simply not designed for managing big data.

Looking in more detail at the Provost and Fawcett textbook, they define data science as “a set of fundamental principles that guide the extraction of knowledge from data … the goal of data science is decision making.” That’s pretty clear, and much more targeted than the Wikipedia definition above. They go on to emphasise their view of data science as leading to “DDD”, or “data-driven decision making”, which they define as “the practice of basing decisions on the analysis of data, rather than just intuition.”

I don’t disagree with that definition, but I would hope to add something more. Intuition may be just what is required. Somewhere between the coder, who crafts the algorithm, and the user, who acts on the results, there needs to be someone who thinks about the real-world implications of the data manipulation, and who asks whether the results are meaningful. To give an example: using Google, search for “most tennis grand slam wins”:

Google search 10 May 2021

But if you search for “most women tennis grand slam wins” you get a different answer:

Google search 10 May 2021

Serena Williams has won more grand slams than Roger Federer. The reason for the discrepancy is of course a limitation both of the algorithm and of the corpus data. References to males in the media are often unmarked for gender, and Google’s algorithm is not capable of understanding that “most women grand slams” should be considered under the heading “most grand slams”. Now, no doubt a data scientist crafted the tools that convert a natural-language query into a tabulated response; but whose job is it to identify that the result is wrong? Is that part of data science?

Here is another example. I presented to a publisher a tool that automatically identifies peer reviewers for a submitted manuscript. She carried out a trial with her journal staff, and they rejected the tool because of gender bias. “We are very sharp on this aspect”, she stated. I tried a few examples, and sure enough, the reviewer finder was recommending more male than female reviewers. Why was that? Was it the fault of the algorithm?

In this case the algorithm had no means of identifying the gender of the reviewer; it only matched the content of the submission against the content of the articles authored by the would-be reviewer. A recent estimate of the proportion of women to men in research science is 30% female to 70% male. So even if all the scientific articles ever published had been written yesterday, there would be an imbalance of male authors over female, and of course that female/male imbalance will have been considerably greater further back in time. So I’m not surprised that a trawl of science researchers finds more males than females. The question is, how should we deal with this limitation?
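
The base-rate point can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the reviewer finder itself: the pool size, the random relevance scores, and the function name `simulate_reviewer_picks` are all hypothetical, and only the 30/70 split comes from the estimate quoted above. The matcher here never looks at gender, yet its recommendations should roughly mirror the imbalance in the pool.

```python
import random


def simulate_reviewer_picks(pool_size=10_000, female_share=0.30, picks=1_000, seed=0):
    """Pick reviewers by a gender-blind score; return the female share of the picks."""
    rng = random.Random(seed)
    # Hypothetical pool: each researcher gets a gender label and a relevance score.
    # The score is drawn independently of gender, so the matcher cannot "see" gender.
    pool = [
        ("F" if rng.random() < female_share else "M", rng.random())
        for _ in range(pool_size)
    ]
    # Content-only matching (assumed): simply take the highest-scoring researchers.
    top = sorted(pool, key=lambda researcher: researcher[1], reverse=True)[:picks]
    return sum(1 for gender, _ in top if gender == "F") / picks


share = simulate_reviewer_picks()
print(f"Female share among recommended reviewers: {share:.0%}")
```

In this sketch the imbalance comes entirely from the make-up of the pool, not from anything the matching step does, which is the situation described above.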

Does it mean we should abandon AI tools? Certainly not. We know that Google is often wrong. It means, however, that users should treat these tools with caution, or, if you will, with common sense. One day, perhaps all research scientists will be tagged by gender. In the meantime, if you wish to select by gender, you can for the most part do so from the names of the proposed reviewers. Perhaps common sense should form part of data science.