The speaker is currently Director of Engineering, Commerce and Research at Google Switzerland in Zurich, but until five years ago he was an academic researcher in machine learning, and then an entrepreneur who founded his own company. He saw this lecture as a "talk about Google", of which he identified seven aspects:
The lecture covered all of these, but with most emphasis on the first two.
Google's mission is to organise the world's information and make it universally accessible and useful. Currently there are some two billion Internet users, 30% of the world's population. They are of course unevenly spread, with the highest penetration in North America, Europe and Australia, but the rest of the world is catching up fast. In fact c. 44% of all users are in Asia.
In the past, an information search would often need the assistance of a skilled librarian, but this is rapidly being replaced by the Web, and if you ask people how they have found websites they have visited, the commonest answer is "via a search engine".
There are currently around a trillion URLs (web addresses); Google handles about a billion searches a day; but spammers generate around a million pages per hour, which a search engine has to be able to reject. This task is not trivial. A key to solving it is the "PageRank" algorithm, which exploits the linked structure of the Web to assess the authority of each page: a page is judged authoritative if many other authoritative pages link to it. Also, although Google accepts advertisements to generate an income, it never accepts payment to bias its selection or prioritising of pages in any search.
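The link-based ranking idea can be illustrated with a short sketch (not Google's actual implementation; the damping factor of 0.85 and the toy three-page graph are illustrative assumptions):

```python
# Minimal PageRank sketch: a page's rank is fed by the ranks of pages
# linking to it, plus a small "teleport" share (1 - damping) / n.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:          # distribute rank along out-links
                    new[q] += share
            else:                       # dangling page: spread rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Page C is linked to by both A and B, so it ends up with the highest rank.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
```

The ranks always sum to 1, so they can be read as the probability that a "random surfer" following links lands on each page.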
It is not possible to do a search in real time - it has already been done off-line. For each search term, a file index has already been made, an ordered list of all the documents in which the term occurs. This index is stored over numerous computers, with several identical replicas of it, to guard against faults. The documents themselves, to which the index refers, are stored in a separate set of computers, again with replicas. The first response to any search request is to see if the same thing has been asked for recently. If so, use the results again. With 1000 computers doing the work in parallel, a typical search time is 200 ms.
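The precomputed index, and the reuse of results for recently repeated queries, can be sketched in miniature (a toy illustration only; the real index is sharded and replicated across many machines, as described above):

```python
# Toy inverted index: for each term, a precomputed sorted list of the
# documents containing it, plus a cache of recent query results.
from collections import defaultdict

def build_index(docs):
    """docs: dict of doc_id -> text. Returns term -> sorted list of doc IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

cache = {}

def search(index, query):
    """AND-search: documents containing every query term, with caching."""
    if query in cache:                  # same thing asked recently? reuse it
        return cache[query]
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    cache[query] = sorted(result)
    return cache[query]

index = build_index({1: "oxford engineering department",
                     2: "google search engineering"})
hits = search(index, "engineering")
```

A multi-term query simply intersects the per-term lists, which is why the lists can be built once off-line and then served from memory.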
When the searcher has typed the first few letters of his search term, the system starts to guess, and offer, various ways in which he might continue. For example, typing "Oxf.." could lead to offerings of Oxford, Oxfam, Oxford dictionary, Oxford Brookes, Oxford Mail. This takes around 20-30 ms. The searcher can then opt for one of these, or continue typing. This facility speeds up the process, especially for someone typing on a mobile phone, but it has to be designed to filter out guesses that might be offensive to some users.
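The prefix-guessing facility can be sketched as a lookup in a sorted list of past queries (illustrative only; the ranking of suggestions and the offensive-term filtering mentioned above are omitted):

```python
import bisect

# Toy prefix completion over a sorted list of stored queries.
QUERIES = sorted(["oxfam", "oxford", "oxford brookes",
                  "oxford dictionary", "oxford mail"])

def complete(prefix, limit=5):
    """Return up to `limit` stored queries beginning with `prefix`."""
    start = bisect.bisect_left(QUERIES, prefix)  # first candidate position
    out = []
    for q in QUERIES[start:]:
        if not q.startswith(prefix) or len(out) == limit:
            break
        out.append(q)
    return out

suggestions = complete("oxf")
```

Because the list is sorted, all completions of a prefix sit in one contiguous run, so a binary search finds them quickly, which matters at the 20-30 ms budget quoted above.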
Initially searchers could be directed only to websites. More recently a "real-time search" has been incorporated enabling one to find current news items. For example, by typing "Federer" on 5 June, one could get the current score in the Wimbledon final. Ideally it should be possible to implement a "universal search" of images, books, news and video, as well as the Web, but this will require an enormous expansion of storage capacity.
The activities of spammers, often very clever people if with rather dubious economic motives, could, if successful, render any search engine ineffective. An example quoted was the money that could be made by influencing Stock Exchange prices, even for only a few hours. Google is perpetually improving its anti-spam algorithms, but these are naturally secret.
Google employees are encouraged to develop their own ideas, trying them initially on particular cases in what Hofmann called a "sandbox". The changes in system performance that they generated were evaluated by an external panel, and if they showed promise, a small sample of the real traffic was exposed to them. Depending on how this worked, the idea might be adopted for the whole system, or "sent back to the drawing board". In 2010, of about 23,000 ideas, 516 were finally adopted!
As already stated, Google handles about one billion searches a day, or approximately 10,000/s, each dealt with in about 200 ms. The computing power needed to handle this is enormous. The amount of data stored is around 70 PB (70×10¹⁵ bytes), in numerous data centres around the world, and about 10 million operations are carried out per second.
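The quoted rate is easy to verify with back-of-envelope arithmetic (figures taken from the talk):

```python
# Sanity check: one billion searches a day is roughly 10,000 per second.
searches_per_day = 1_000_000_000      # "about one billion searches a day"
seconds_per_day = 24 * 60 * 60        # 86,400 s
searches_per_second = searches_per_day / seconds_per_day  # roughly 11,600/s
```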
The hardware is made up basically of PC-class motherboards, each with its own CPU, DRAM and disk. 40-80 of these server boards are stacked in a "rack", and 30 racks form a "cluster". A cluster thus has some 30 TB of DRAM and 4.8 PB of disk storage, both of which can be accessed at 10 MB/s. Failures obviously occur in the hardware, but everything is stored in multiple places, so software can cope with any likely failure. Storage spread around the world helps to compensate for the finite speed of light, which as the lecturer said "is not fast enough for Google!"
All this computing clearly consumes lots of energy, around half of which goes on cooling, but this is being improved by the use of evaporative cooling, on the same lines as power-station cooling towers. The energy used per search is modest, around one joule, about that needed to propel a car for two metres.
Google's employees enjoy plenty of relaxation opportunities, and 20% of one's time can be spent on ideas of one's own, to which end there are white-boards everywhere! Groups of around ten people are empowered to make their own suggestions.
Google gets its income from the advertisements which appear on the top and right-hand side of the search page. Other adverts can be shown alongside any web page displayed to the searcher. The world-wide income is currently of the order of $20 billion per annum. Part of the computation task is to work out which adverts are most appropriate for any particular search query, and advertisers are allowed to bid in an auction to have their adverts associated with particular key-words. It has been estimated that the economic activity generated by Google in the United States in 2009 amounted to $54 billion.
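The talk does not say which auction rules are used; search advertising is commonly described as a second-price auction, sketched here with hypothetical advertisers and bids:

```python
# Illustrative second-price keyword auction: the highest bidder wins
# the advertising slot but pays only the runner-up's bid.
def second_price(bids):
    """bids: dict of advertiser -> bid. Returns (winner, price paid)."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1] if len(ranked) > 1 else ranked[0][1]
    return winner, price

winner, price = second_price({"adA": 2.50, "adB": 1.75, "adC": 0.90})
```

Charging the runner-up's bid rather than the winner's own is what makes truthful bidding attractive in such auctions.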
Google's "Product Search" can look for where any product can be bought most cheaply, and an extension of this now being developed will allow one to inspect the stock inventory of any particular store, to see if they have what is wanted.
They have built a very large machine-translation system, now covering 60 languages (and growing), with much-improved translation quality.
And finally, their Art Project is bringing some of the world's pictorial art on-line. Pictures are being scanned from 17 of the world's principal museums, some of them to unusually fine detail, several giga-pixels. The lecturer showed as an example The Ambassadors painted by Hans Holbein the Younger in 1533, now in the National Gallery.
Low on a shelf, and easily overlooked in the full picture, is a terrestrial globe. By successively zooming in on it one eventually sees a (rather inexact) map of England and Scotland, with the cracks in the paintwork clearly visible.
The lecture was followed by a question-and-answer session. The lecture, and the lecturer's slides, may be viewed on the Department's website, www.eng.ox.ac.uk.
Copyright © 2011 Society of Oxford University Engineers