This is the engine we all use for general searching

In his introduction to Karen Blakeman’s talk about Google search tools (“Google’s Family of Databases”), Martin White suggested that nobody knows more about Google search than Karen Blakeman – not even Google. From this talk, I’m inclined to agree.

I’ve heard Karen Blakeman’s presentations about how Google search works over several years, and I’ve always found them fascinating. Unlike many topics for talks, this one is never definitive, because, of course, Google keeps adding (and taking away) whole sites, as well as modifying site performance on a regular basis, usually without telling the user. Karen has spent a large part of her career describing how Google works, notably in her series of workshops for librarians and researchers.

Google is probably the most widely used search engine of all time. We all make use of it, several times a day, without ever thinking about what it does, in detail, and why. As Karen Blakeman pointed out at the beginning, Google search is primarily designed as a revenue generation tool for use by consumers. Its goal is to build its sites with minimal human intervention; in the Q&A after the talk, Karen Blakeman said she had heard that Google Scholar originated as a research project to see if it was possible to build a site that extracted author and title metadata from academic papers automatically. That approach has continued, which is why a search for, say, “2012” may retrieve a publication year of 2012 or a page number of 2012.

Her talk described all the major Google databases, including one I had never heard of: Google Arts and Culture, which, she warned us, you should only open if you have plenty of time on your hands – it is wonderful for browsing.

As always, I learned things about Google I never knew. The main Google search engine has a truncation limit of 32 words for any search: anything longer than this is simply ignored. For whatever reason, the limit on Google Scholar is 256 characters. Why the inconsistency? The main Google engine will search for synonyms without informing you, while Google Scholar does not. While Google does not use Boolean operators, it tends to look at strings of words as if they have OR between them – except for some strangely unexpected results, as you get when you key “banana”, “bananas”, and then “banana or bananas” in Google Scholar.
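To make those two limits concrete, here is a minimal Python sketch of what they would mean for a long query. The figures (32 words, 256 characters) are the ones reported in the talk; the function names are my own, and the sketch assumes that both engines simply cut off the excess, which I have not verified for Google Scholar.

```python
# Limits as reported in the talk; not taken from any Google documentation.
MAIN_WORD_LIMIT = 32       # main Google search: words
SCHOLAR_CHAR_LIMIT = 256   # Google Scholar: characters


def effective_main_query(query: str) -> str:
    """Return the part of the query that falls within the 32-word limit."""
    return " ".join(query.split()[:MAIN_WORD_LIMIT])


def ignored_words(query: str) -> list[str]:
    """Return the words beyond the 32-word limit, which would be ignored."""
    return query.split()[MAIN_WORD_LIMIT:]


def effective_scholar_query(query: str) -> str:
    """Return the first 256 characters (assumes plain truncation)."""
    return query[:SCHOLAR_CHAR_LIMIT]
```

Running `ignored_words` over a long pasted query before searching is a quick way to check whether the terms you care about most are sitting past the cut-off.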

For academic searching, the key database is of course Google Scholar, and the question on everyone’s mind is why Google continues to provide this site, which has no advertising and which appears to have no revenue generation associated with it. It is the most widely used search engine in academic publishing and yet nobody pays for it.

What did I learn this time around? That Google uses your IP address when you search to determine your location, so if, for example, you go to Google Norway, you see the same results as you would from your home country. That means that searches for places will find local placenames first, based on your IP address: a search for “Cambridge” will favour Cambridge, Massachusetts or Cambridge, England, depending on where you are.

But while specific details are fascinating, the main take-away is that Google attempts to provide in one interface (and Google’s strategy is to have a single interface for desktop and for mobile searching) a general solution to all the queries brought by users. Clearly, that is an impossible goal, and all the more so because Google does not do what we think it does:

  • It does not index the whole of the Web, only much of the freely accessible part of it.
  • It might find millions of results, but it frequently only displays a few hundred or even fewer results.
  • The main Google search engine tends to prioritise more consumer-focused hits in its results. So the display above, the result for a search for the term “pizza”, finds pizza companies, restaurants, and recommendations before getting to the Wikipedia article for pizza.
  • The various Google tools are not equally comprehensive. Google finds different content in the different services: for example, Google Scholar has access to the full text of documents behind the subscription firewall, unlike the main Google.
  • Google may substitute a different title for the site title, if it believes there is a more relevant title elsewhere on the page.

I haven’t even touched on collections such as Google Dataset Search, the latest Google database, formally launched in January 2020, or Google Trends.

Everyone who uses search as part of their work should attend this course.