I don’t think the title “Deep Text” does this book any favours – a more accurate description might be “Text Analytics within the Enterprise” – less catchy, but certainly more intelligible, and more indicative of what this book covers. From the title, you might think this is yet another business book inventing a catchphrase and spinning the idea out to 220 pages. In reality this is a detailed and thoughtful overview of the use of text analytics for content-based organizations, written by a highly experienced practitioner.
Who is this book for? It is aimed at an audience that is involved in making business decisions (which means investment decisions) but that also needs to understand something about the technology involved. Large organisations will have senior management who would not open a book of this kind; very small organisations will not have the resources to build anything. Somewhere in the middle is the organisation Reamy is aiming at: one trying to make sense of new technology without being able to turn to the resources of a research or semantic team. The book targets a business market, but there are references throughout to good IT practices, such as lean development and build to fail models.
What makes the book readable is that Tom Reamy isn’t afraid to speak his mind. While most consultants have spent years learning to bite their tongue and provide the advice that the client asks for, Reamy states in no uncertain terms what he thinks has worked – and what hasn’t. For example, he is clear that “most metadata projects – particularly asking authors to add keywords to documents as they publish them into content management systems – have been failures.” That’s a bit of an indictment of a process that has been undertaken by many publishers, but it is today quite widely agreed that the result is no more than a folksonomy. But, Reamy continues, “the other component that was supposed to improve search is adding taxonomies to the mix. I have to admit that I used to believe that this was the best answer, and spent a few years developing taxonomies for organizations which, while they helped somewhat, were rarely worth the effort and time.” You have to admire the author’s honesty, as he goes on to clarify: “The basic problem was not with the taxonomy, but with trying to apply the taxonomy to documents, in other words, manual tagging with all its well-known problems.”
The tone throughout the book, then, is one of experience and candour: practical advice drawn from long practice, from an author who is not afraid to admit he has changed his mind at times.
At its simplest, then, the book has straightforward practical advice. On creating a taxonomy, for example, Reamy recommends compiling something simple and usable, ideally six to eight items per level, and around 200-500 nodes overall. For information gathering, Reamy provides templates, such as one for the information interview, which you can adapt to your own needs. For me, however, the most interesting conclusion of the book goes beyond these practical details; it is something that informs the author’s entire approach: the recommendation to use pragmatic, hybrid solutions that combine machine and human skills (see, e.g., chapter 8). As the author states, “Taxonomies by themselves are not enough” [ch 15].
Given the innovative nature of this book, it’s not surprising that I (or any other reader with some familiarity with the subject) will disagree with some of the conclusions. For one thing, the technology is moving too fast for a book like this to keep up. For example, the latest generation of text-analytics tools, which work without any pre-existing taxonomy, suggests ways of working that differ substantially from those of earlier tools.
One of the problems Tom Reamy faces is the lack of standardization of terminology. As he states, there is little agreement over definitions of “text mining” and “text analytics”, although we can all agree that both topics cover the use of machines to interpret text. Since there is so little agreement, is there any point (in chapter one) trying to draw a distinction? For him, text mining is counting words and discovering patterns, while text analytics is the use of software models to analyse text. But one of the challenges (or delights) of natural language is how common usage takes a term and starts to apply it in other areas, until the original meaning starts to get lost (such as the word “semantic”, as Reamy points out).
As for “ontology” and “taxonomy”, Reamy makes a valiant attempt to distinguish them, stating that taxonomies are hierarchical while ontologies are based around multiple relationships. Unfortunately, the two terms are now so widely used as synonyms that this distinction, too, is probably past preserving.
More fundamentally, text analytics is moving so fast that some of the assumptions in the book are increasingly being side-stepped. For example, much of Deep Text assumes a rule-based approach to classification, and even gives examples of it in chapter 15. Certainly, a limitation of rule-based classification is that the rules are still created and managed by humans, so any evaluation will be against a human measure. In contrast, much of the “bag of words” approach to text analytics reduces the need for any detailed linguistic analysis.
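The contrast between the two approaches can be made concrete with a minimal Python sketch. Nothing here is taken from Deep Text: the `rule_based_label` rule, the category name and the sample sentences are all hypothetical, and cosine similarity is just one common way of comparing word counts. The point is only that a hand-written rule encodes a human judgement, while a bag of words discards word order and grammar and compares raw counts.

```python
import math
import re
from collections import Counter


def rule_based_label(text):
    """Hypothetical hand-written rule: a person decided these words mean 'finance'."""
    lowered = text.lower()
    if "invoice" in lowered and "overdue" in lowered:
        return "finance"
    return "unclassified"


def bag_of_words(text):
    """Reduce a text to word counts, ignoring order and grammar."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(bag_a, bag_b):
    """Compare two bags of words by the angle between their count vectors."""
    shared = bag_a.keys() & bag_b.keys()
    dot = sum(bag_a[word] * bag_b[word] for word in shared)
    norm_a = math.sqrt(sum(count * count for count in bag_a.values()))
    norm_b = math.sqrt(sum(count * count for count in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(rule_based_label("Your invoice is overdue"))
    # Word order is lost: both sentences produce the same bag.
    print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))
    print(cosine_similarity(bag_of_words("dog bites man"),
                            bag_of_words("cat chases mouse")))
```

The second check shows both the appeal and the cost of the approach: no linguistic rules are needed, but two sentences with opposite meanings become indistinguishable.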
One major development in taxonomy work is the trend towards turnkey solutions. Deep Text includes instructions for how an organization can develop its own entity rules, for example, but increasingly all this will be managed by the vendor or by outsourced teams. Another example is chapter 8, where Reamy describes how in one project his team created an entire set of positive and negative sentiment terms – the kind of project for which one would hope an off-the-shelf collection of terms would now be available.
At some points in the text, the need to describe leads to what I think can be an over-generalisation. For example, “taxonomies … are based on a higher level of abstraction, while clusters are more tied to specific sets of content. Also, taxonomies can normally be developed to more granular levels than clusters”. Taxonomies, being human constructs, can be developed to any granularity the users feel is appropriate. MeSH, one of the most widely used scientific terminologies, has around 220,000 terms, while some automatic concept extraction tools create (roughly speaking) around one concept (what Reamy calls a cluster) for every ten words in the corpus, which means there may be millions of concepts in a large corpus of text.
The book could have benefited from some better copy-editing. There are five mentions of how much text in businesses is unstructured. The most detailed reference is in chapter 13: “unstructured content contains 90% of all the important business information. That’s up from the traditional 80% – mostly because of social media content.” To use specific percentages in this way and to repeat them draws attention to what should be just an indicative measure.
There’s also quite a bit of repetition. Sentiment analysis is done (as Reamy reminds us) by counting the number of positive and negative terms in a document. For example:
- “Sentiment analysis software was originally based on fairly simple text analytics capabilities, mostly extraction and the use of dictionaries of positive and negative terms.” (ch6)
- “early sentiment analysis applications, which just took dictionaries of positive and negative words and counted them up.”
- “the field has matured lately with more advanced rule-based analysis that replaced simple positive and negative vocabulary dictionaries”
- “Sentiment analysis started with very simple rules which tried to score the positive and negative polarity expressed in text. The early applications simply used generic dictionaries of positive and negative terms.”
All of these statements are, I’m sure, true, but we don’t need to read them this many times.
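For readers who have not seen it, the dictionary-counting approach described in those quotations is simple enough to sketch in a few lines of Python. This is an illustration under stated assumptions, not Reamy's code: the two term lists are invented for the example, and `sentiment_score` is a hypothetical name.

```python
import re

# Hypothetical mini-dictionaries, invented for illustration only.
POSITIVE_TERMS = {"good", "great", "excellent", "love", "reliable"}
NEGATIVE_TERMS = {"bad", "poor", "terrible", "hate", "broken"}


def sentiment_score(text):
    """Return (positive count, negative count, positive minus negative)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(1 for token in tokens if token in POSITIVE_TERMS)
    negatives = sum(1 for token in tokens if token in NEGATIVE_TERMS)
    return positives, negatives, positives - negatives


if __name__ == "__main__":
    review = "Great screen, but the battery is terrible and the hinge is broken."
    print(sentiment_score(review))
    # Negation is invisible to a plain dictionary count.
    print(sentiment_score("This is not good"))
```

The second call scores "This is not good" as positive, which is the kind of weakness that would explain the move, noted in the quotations above, from simple dictionaries to more advanced rule-based analysis.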
But let that not detract from the success of this book. Deep Text is a clear leader in a field of one (I don’t know any other book with this scope), meeting a great need in businesses to understand and to deploy a highly valuable strategic tool. Tom Reamy deserves credit for attempting to make sense of the fast-moving and bewildering world of text mining and text analytics – not just technical sense, nor even just business sense, but pulling the two together.