“Match things, not strings” was one of the admirable principles enunciated by Anthony Groves in his recent Haystack Conference webinar on how to deliver relevant results from a content-rich website, which I wrote about here. Yet, as I will show, sometimes it’s not such a bad idea to match strings rather than things.

After his presentation, I spent some time looking at the O’Reilly Online Learning site, and the routes into the content looked very clear and inviting: I could choose from expert playlists, emerging trends, case studies, trending (meaning very popular) titles, and others. Each of these appeared to correspond to a clear use case.

Perhaps I’m not a typical user, but given the obvious care that had gone into the discovery process, I expected a state-of-the-art search interface. When I tried actual searches, I got rather unexpected results.

I entered “taxonomy” in the search box. By the time I had finished keying “taxonomy” I had seen (but could no longer see) a title “Taxonomies A-Z”. Clearly there is an autocomplete feature, but tantalisingly not a semantic search. The title that I wanted to see was no longer in the list of suggested titles! So I cheated a little, and carefully removed the last “y”, and then found the title that looked interesting:

However, the other results in this list (presumably sorted by relevance) seemed to have little or nothing to do with taxonomies (or with a taxonomy). What does “Perspective on Data Science for Software Engineering” have to do with taxonomy? I entered “taxonomies” into the search box and found over 4,500 hits, so clearly there are plenty of results with the word “taxonomies” in the title.
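The autocomplete behaviour described above is what you would expect from plain prefix matching on strings. The following is a minimal sketch, not the site's actual implementation: the function names `prefix_suggest`, `normalise` and `stem_suggest` are hypothetical, and the plural folding is deliberately crude. It shows why typing the final “y” makes “Taxonomies A-Z” disappear, and how even a small amount of normalisation would keep it in the list.

```python
# Two titles mentioned in this post, used as a tiny stand-in catalogue.
TITLES = [
    "Taxonomies A-Z",
    "Perspective on Data Science for Software Engineering",
]


def prefix_suggest(query: str, titles: list[str]) -> list[str]:
    """Suggest titles containing a word that starts with the raw query string."""
    q = query.lower()
    return [t for t in titles if any(w.startswith(q) for w in t.lower().split())]


def normalise(word: str) -> str:
    """Crude singular/plural folding (an assumption, not a real stemmer)."""
    word = word.lower()
    if word.endswith("ies"):
        return word[:-3]
    if word.endswith("y") or word.endswith("s"):
        return word[:-1]
    return word


def stem_suggest(query: str, titles: list[str]) -> list[str]:
    """Suggest titles after folding both the query and the title words."""
    q = normalise(query)
    return [
        t for t in titles
        if any(normalise(w).startswith(q) for w in t.lower().split())
    ]


print(prefix_suggest("taxonom", TITLES))   # the title is suggested
print(prefix_suggest("taxonomy", TITLES))  # the final "y" removes it
print(stem_suggest("taxonomy", TITLES))    # folding brings it back
```

A production system would use a proper stemmer or lemmatiser rather than three suffix rules, but the principle is the same: the user's “taxonomy” and the title's “Taxonomies” need to meet at a common form.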

Perhaps I had misunderstood and I should be searching for “taxonomy” as a topic. I remember Anthony Groves talking about topics as an important means of getting to relevant content (as in “Python” or “JavaScript”). So I browsed through the list of topics, only to find it is a long list with over a hundred terms, and, it would seem, in order of number of hits, not in alphabetical order. That makes sense to the engineer, perhaps, but not to the user trying to find a specific topic:

“Taxonomy” is not included in this list as a topic, as far as I could see. Not only is the list of topics ordered by number of hits rather than alphabetically, but the topics appear to be rather arbitrary. Why “Microsoft SharePoint” and not just “SharePoint”? “Ontology” is not a topic. I clicked on the first topic, “FinTech”, and got 23 search results – but these were about neither fintech nor “taxonomies”. I’m not quite sure what this list was based on. The first book was “mobile learning”, which was tagged “fintech” but seemed to be about neither fintech nor taxonomies. Perhaps I should have cleared my search in the search box before selecting topics. I was lost.

Perhaps I was looking for a subject that was too peripheral to the O’Reilly content. So I chose something very mainstream, “web design”, and this produced a good set of very relevant results, with what looked like very relevant hits in the book title:

But as soon as I tried to drill down to something more specific, I started to get less relevant or even completely irrelevant results. I assumed the headings across the top of the results screen were filters that would enable me to refine my results, so I selected the drop-down “publishers” and chose Dorling Kindersley, to see what they have published on web design – but it appears to be nothing at all. Instead, I got a set of travel guides, published ten years ago and of no relevance to web design or to anything else on this platform. What are these books doing here?

One implication of this very odd result is that the site appears to be configured to show results even when there are none. For a site like O’Reilly Online Learning I would expect as a user to be told if there are no hits. This is not a site such as Netflix, where the goal is to show a film result at all costs, even if you search for a specific film and Netflix doesn’t have it (as I showed here when I searched for Citizen Kane and got hundreds of irrelevant hits).
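To make the difference concrete, here is a small sketch of the two behaviours. It is only a guess at what might be happening, not a description of how the site is built: the `Book` type, the catalogue entries, the publisher names and both function names are invented for illustration. `strict_search` tells the user there is nothing to show; `lenient_search` quietly drops the query and returns whatever the publisher filter matches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Book:
    title: str
    publisher: str


# Hypothetical catalogue: titles and publishers are made up.
CATALOGUE = [
    Book("Web Design Basics", "Publisher A"),
    Book("City Travel Guide", "Publisher B"),
]


def strict_search(query: str, publisher: str, catalogue: list[Book]) -> list[Book]:
    """Return only books that match BOTH the query and the publisher filter."""
    q = query.lower()
    return [b for b in catalogue if q in b.title.lower() and b.publisher == publisher]


def lenient_search(query: str, publisher: str, catalogue: list[Book]) -> list[Book]:
    """Never return an empty page: if nothing matches, ignore the query."""
    hits = strict_search(query, publisher, catalogue)
    if hits:
        return hits
    return [b for b in catalogue if b.publisher == publisher]


results = strict_search("web design", "Publisher B", CATALOGUE)
print(results if results else "No results for this query and filter.")
print(lenient_search("web design", "Publisher B", CATALOGUE))
```

The lenient version returns a travel guide for a “web design” search, which is the kind of result I saw. For a learning platform, the strict version with an honest “no results” message seems the better default.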

How good is OOL at finding detailed content?

Here is another example where the site produces some surprising results. There are two major open-source search and discovery tools, Solr and Elasticsearch. It would seem a relevant search on O’Reilly Online Learning to look for comparisons between the two. I am familiar with Enterprise Search, a book by Martin White, that is available on the OOL site. So I searched for “Solr v Elasticsearch”, and explored the hits. I know that Martin White has a section in his book on just that, so I expected to find it in the hits. In fact it appeared only fourth in the results. Each result includes the keyword found in context, which is very revealing:

  1. Solr in Action (2014), highlighting a sentence with “Jetty vs. Tomcat We recommend staying with Jetty when first learning Solr.” – not relevant for this search
  2. Apache Solr: A Practical Approach to Enterprise Search – this looks relevant, with a sentence “Solr vx. Other Options”
  3. Relevant Search: with applications for Solr and Elasticsearch (2016) – the book is relevant, but the keyword in context is “Star Trek V: the Final Frontier”.
  4. Enterprise Search, 2nd Edition – and it shows a relevant sentence: “If the requirement is to support text searching and analytical queries, then Elastic could be the best option.”

In other words, two out of four of the most relevant hits are identifying the “v” (which is expanded to “vs” or “vx” by the engine) rather than the more important terms “Solr” and “Elasticsearch”. In this case, surely searching for a string would be better than searching for the thing, if the thing is not identified correctly?
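One simple way to stop the “v” from dominating is to treat it as a connector and leave it out of the match. The sketch below is an illustration under stated assumptions, not how O’Reilly's engine works: the `CONNECTORS` set, the `tokens` and `score` helpers and the term-count scoring are all invented here, and the third snippet is a made-up sentence rather than a quotation from any book.

```python
import re

# Query words that only join the things being compared (an assumed list).
CONNECTORS = {"v", "vs", "versus"}


def tokens(text: str) -> list[str]:
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query: str, text: str, ignore_connectors: bool = True) -> int:
    """Count how many query terms occur in the text."""
    terms = [
        t for t in tokens(query)
        if not (ignore_connectors and t in CONNECTORS)
    ]
    found = set(tokens(text))
    return sum(1 for t in terms if t in found)


QUERY = "Solr v Elasticsearch"
SNIPPETS = [
    "Jetty vs. Tomcat We recommend staying with Jetty when first learning Solr.",
    "Star Trek V: the Final Frontier",
    "Solr and Elasticsearch are both built on Lucene.",  # made-up example
]

for snippet in SNIPPETS:
    naive = score(QUERY, snippet, ignore_connectors=False)
    better = score(QUERY, snippet)
    print(f"naive={naive} ignoring_connectors={better} | {snippet}")
```

With the connector ignored, the Star Trek snippet no longer matches the query at all, while a sentence that actually mentions both Solr and Elasticsearch keeps its full score. Real engines usually get the same effect with stop-word lists or by down-weighting very common terms.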

The gold standard is, of course, Google, and it is instructive to compare the results of any site with those of Google. When I enter “Solr v Elasticsearch” in Google, I get plenty of perfectly relevant hits at the top of the results. Google understands me (I skipped the sponsored results at the top, to show only the relevance-ranked hits):

I don’t understand why the site should be so counter-intuitive. Despite all the hard work, and the excellent principles (“Match things, not strings!”), there are some very unexpected results to be found when searching for content.

Perhaps it is unfair to complain about the presence of a few irrelevant content items in a collection of hundreds of thousands of items, but why should a discovery service show me things that are irrelevant? Perhaps every content collection has some dusty corners where nobody ever looks; but I don’t want to be shown them.