Saturday, March 12, 2005

How Google works?

http://www.internetnews.com/xSP/article.php/3487041

March 2, 2005 Peeking Into Google By Susan Kuchinskas

BURLINGAME, Calif. -- The key to the speed and reliability of Google (Quote, Chart) search is cutting up data into chunks, its top engineer said.

Urs Hoelzle, Google vice president of operations and vice president of engineering, offered a rare behind-the-scenes tour of Google's architecture on Wednesday. Hoelzle spoke here at EclipseCon 2005, a conference on the open source, extensible platform for software tools.
To deal with the more than 10 billion Web pages and tens of terabytes of information on Google's servers, the company combines cheap machines with plenty of redundancy, Hoelzle said. Its commodity servers cost around $1,000 apiece, and Google's architecture places them into interconnected nodes.

All machines run on a stripped-down Linux kernel. The distribution is Red Hat (Quote, Chart), but Hoelzle said Google doesn't use much of the distro. Moreover, Google has created its own patches for things that haven't been fixed in the original kernel.

"The downside to cheap machines is, you have to make them work together reliably," Hoelzle said. "These things are cheap and easy to put together. The problem is, these things break."
In fact, at Google, many will fail every day. So, Google has automated methods of dealing with machine failures, allowing it to build a fast, highly reliable service with cheap hardware.

Google replicates the Web pages it caches by splitting them up into pieces it calls "shards." The shards are small enough that several can fit on one machine. And they're replicated on several machines, so that if one breaks, another can serve up the information. The master index is also split up among several servers, and that set also is replicated several times. The engineers call these "chunk servers."

As a search query comes into the system, it hits a Web server, then is split into chunks of service. One set of index servers contains the index; one set of machines contains one full index. To actually answer a query, Google has to use one complete set of servers. Since that set is replicated as a fail-safe, it also increases throughput, because if one set is busy, a new query can be routed to the next set, which drives down search time per box.

In parallel, clusters of document servers contain copies of Web pages that Google has cached. Hoelzle said that the refresh rate is from one to seven days, with an average of two days. That's mostly dependent on the needs of the Web publishers.

"One surprising limitation is we can't crawl as fast as we would like, because [smaller] webmasters complain," he said.

Each set of document servers contains one copy of the Web. These machines are responsible for delivering the content snippets that show searchers relevant text from the page.

"When we have your top 10 results, they get sent to the document servers, which load the 10 result pages into memory," Hoelzle said. "Then you parse through them and find the best snippet that contains all the query words."

Google uses three software systems built in-house to route queries, balance server loads and make programming easier.

The Google File System was written specifically to deal with the cheap machines that will fail.
"We take our files and chunk them up, then you randomly distribute the chunks across different machines, making sure each chunk has at least two copies that are not physically adjacent -- not on same power strip or same switch," Hoelzle said. "We try to make sure that even if one copy goes away, another copy is still here." Chunks typically are 64 megabytes and are replicated three times.

All this replication makes it easier to make changes, Hoelzle said. Google simply takes one replica at a time offline, updates it, then plugs the machines back in.

Because these chunks are randomly distributed all over, Google needs a master containing metadata to keep track of where the chunks are. When a query comes into the system, the file system master tells it which chunk server has the data. "From there on, you just talk to the chunk servers," he said.

Client machines are responsible for dealing with fault tolerance. If a client requests a file from the specified chunk server and gets no response within the designated time period, it uses the meta information to locate another chunk server, while sending the file master a hint that the first chunk server might have died. If the master confirms the chunk went out, it will replicate the chunks that were on it to another server, making sure that the information is replicated at least the minimum number of times.

"You were vulnerable for only a very brief period," he said.

To enable Google programmers to write applications to run in parallel on 1,000 machines, engineers created the Map/Reduce Framework in 2004.

"The Map/Reduce Framework provides automatic and efficient parallelization and distribution," Hoelzle said. "It's fault tolerant and it does the I/O scheduling, being a little bit smart about where the data lives."

Programmers write two simple functions, map and reduce, to create a long list of key/value pairs. Then, the mapping function produces other key/value pairs. "You just map one pair to another pair," he said.

For example, if an application is needed to count URLs on one host, the programmer would take the URL and the contents and map them into the pair consisting of hostname and 1. "This produces an intermediate set of key/value pairs with different values."
Next, a reduction operation takes all the outputs that have the same key and combines them to produce a single output.

"Map/Reduce is simplified large-scale data processing," Hoelzle said, "a very simple abstraction that makes it possible to write programs that run over these terabytes of data with little effort."
The third homegrown application is Google's Global Work Queue, which is for scheduling.
Global Work Queue works like old-time batch processing. It schedules queries into batch jobs and places them on pools of machines. The setup is optimized for running random computations over tons of data.

"Mostly, you want to split the huge task into lots of small chunks, which provides even load balancing across machines," Hoelzle said. The idea is to have more tasks than machines so machines are never idle.

Hoelzle also demonstrated how Google uses its massive architecture to learn from data. It analyzes the most common misspellings of queries, and uses that information to power the function that suggests alternate spellings for queries.

The company also is applying machine learning to its system to give better results. Theoretically, he said, if someone searches for "Bay Area cooking class," the system should know that "Berkeley courses: vegetarian cuisine" is a good match even though it contains none of the query words.

To do this, the system tries to cluster concepts into "reasonably coherent" subclusters that seem related. These clusters, some tiny and some huge, are named automatically. Then, when a query comes in, the system produces a probability score for the various clusters. This kind of machine learning has had little success in academic trials, Hoelzle said, because they didn't have enough data. "If you have enough data, you get reasonably good answers out of it."

In addition to improving query results, Google uses this learning to better deliver contextual ads for its AdSense service to Web publishers, as well as to more accurately cluster news stories within Google News.

Google's redundancy theory works on a meta level, as well, according to Hoelzle. One literal meltdown -- a fire at a datacenter in an undisclosed location -- brought out six fire trucks but didn't crash the system.

"You don't have just one data center," he said, "you have multiples."
-----------------------------------------------------------------------------------------------
Data taken from How Stuff Works:
http://www.howstuffworks.com/search-engine.htm

Google.com began as an academic search engine. In the paper that describes how the system was built, Sergey Brin and Lawrence Page give an example of how quickly their spiders can work. They built their initial system to use multiple spiders, usually three at one time. Each spider could keep about 300 connections to Web pages open at a time. At its peak performance, using four spiders, their system could crawl over 100 pages per second, generating around 600 kilobytes of data each second.

Keeping everything running quickly meant building a system to feed necessary information to the spiders. The early Google system had a server dedicated to providing URLs to the spiders. Rather than depending on an Internet service provider for the domain name server (DNS) that translates a server's name into an address, Google had its own DNS, in order to keep delays to a minimum.

When the Google spider looked at an HTML page, it took note of two things:
The words within the page
Where the words were found

Words occurring in the title, subtitles, meta tags and other positions of relative importance were noted for special consideration during a subsequent user search. The Google spider was built to index every significant word on a page, leaving out the articles "a," "an" and "the." Other spiders take different approaches.

These different approaches usually attempt to make the spider operate faster, allow users to search more efficiently, or both. For example, some spiders will keep track of the words in the title, sub-headings and links, along with the 100 most frequently used words on the page and each word in the first 20 lines of text. Lycos is said to use this approach to spidering the Web.
Other systems, such as AltaVista, go in the other direction, indexing every single word on a page, including "a," "an," "the" and other "insignificant" words. The push to completeness in this approach is matched by other systems in the attention given to the unseen portion of the Web page, the meta tags.

Meta Tags
Meta tags allow the owner of a page to specify key words and concepts under which the page will be indexed. This can be helpful, especially in cases in which the words on the page might have double or triple meanings -- the meta tags can guide the search engine in choosing which of the several possible meanings for these words is correct. There is, however, a danger in over-reliance on meta tags, because a careless or unscrupulous page owner might add meta tags that fit very popular topics but have nothing to do with the actual contents of the page. To protect against this, spiders will correlate meta tags with page content, rejecting the meta tags that don't match the words on the page.

All of this assumes that the owner of a page actually wants it to be included in the results of a search engine's activities. Many times, the page's owner doesn't want it showing up on a major search engine, or doesn't want the activity of a spider accessing the page. Consider, for example, a game that builds new, active pages each time sections of the page are displayed or new links are followed. If a Web spider accesses one of these pages, and begins following all of the links for new pages, the game could mistake the activity for a high-speed human player and spin out of control. To avoid situations like this, the robot exclusion protocol was developed. This protocol, implemented in the meta-tag section at the beginning of a Web page, tells a spider to leave the page alone -- to neither index the words on the page nor try to follow its links.

Building the Index
Once the spiders have completed the task of finding information on Web pages (and we should note that this is a task that is never actually completed -- the constantly changing nature of the Web means that the spiders are always crawling), the search engine must store the information in a way that makes it useful. There are two key components involved in making the gathered data accessible to users:

The information stored with the data
The method by which the information is indexed

In the simplest case, a search engine could just store the word and the URL where it was found. In reality, this would make for an engine of limited use, since there would be no way of telling whether the word was used in an important or a trivial way on the page, whether the word was used once or many times or whether the page contained links to other pages containing the word. In other words, there would be no way of building the ranking list that tries to present the most useful pages at the top of the list of search results.

To make for more useful results, most search engines store more than just the word and URL. An engine might store the number of times that the word appears on a page. The engine might assign a weight to each entry, with increasing values assigned to words as they appear near the top of the document, in sub-headings, in links, in the meta tags or in the title of the page. Each commercial search engine has a different formula for assigning weight to the words in its index. This is one of the reasons that a search for the same word on different search engines will produce different lists, with the pages presented in different orders.

Regardless of the precise combination of additional pieces of information stored by a search engine, the data will be encoded to save storage space. For example, the original Google paper describes using 2 bytes, of 8 bits each, to store information on weighting -- whether the word was capitalized, its font size, position, and other information to help in ranking the hit. Each factor might take up 2 or 3 bits within the 2-byte grouping (8 bits = 1 byte). As a result, a great deal of information can be stored in a very compact form. After the information is compacted, it's ready for indexing.

An index has a single purpose: It allows information to be found as quickly as possible. There are quite a few ways for an index to be built, but one of the most effective ways is to build a hash table. In hashing, a formula is applied to attach a numerical value to each word. The formula is designed to evenly distribute the entries across a predetermined number of divisions. This numerical distribution is different from the distribution of words across the alphabet, and that is the key to a hash table's effectiveness.

In English, there are some letters that begin many words, while others begin fewer. You'll find, for example, that the "M" section of the dictionary is much thicker than the "X" section. This inequity means that finding a word beginning with a very "popular" letter could take much longer than finding a word that begins with a less popular one. Hashing evens out the difference, and reduces the average time it takes to find an entry. It also separates the index from the actual entry. The hash table contains the hashed number along with a pointer to the actual data, which can be sorted in whichever way allows it to be stored most efficiently. The combination of efficient indexing and effective storage makes it possible to get results quickly, even when the user creates a complicated search.

Building a Search
Searching through an index involves a user building a query and submitting it through the search engine. The query can be quite simple, a single word at minimum. Building a more complex query requires the use of Boolean operators that allow you to refine and extend the terms of the search.

The Boolean operators most often seen are:

AND - All the terms joined by "AND" must appear in the pages or documents. Some search engines substitute the operator "+" for the word AND.
OR - At least one of the terms joined by "OR" must appear in the pages or documents.
NOT - The term or terms following "NOT" must not appear in the pages or documents. Some search engines substitute the operator "-" for the word NOT.
FOLLOWED BY - One of the terms must be directly followed by the other.
NEAR - One of the terms must be within a specified number of words of the other.
Quotation Marks - The words between the quotation marks are treated as a phrase, and that phrase must be found within the document or file.

Future Search
The searches defined by Boolean operators are literal searches -- the engine looks for the words or phrases exactly as they are entered. This can be a problem when the entered words have multiple meanings. "Bed," for example, can be a place to sleep, a place where flowers are planted, the storage space of a truck or a place where fish lay their eggs. If you're interested in only one of these meanings, you might not want to see pages featuring all of the others. You can build a literal search that tries to eliminate unwanted meanings, but it's nice if the search engine itself can help out.

One of the areas of search engine research is concept-based searching. Some of this research involves using statistical analysis on pages containing the words or phrases you search for, in order to find other pages you might be interested in. Obviously, the information stored about each page is greater for a concept-based search engine, and far more processing is required for each search. Still, many groups are working to improve both results and performance of this type of search engine. Others have moved on to another area of research, called natural-language queries.

The idea behind natural-language queries is that you can type a question in the same way you would ask it to a human sitting beside you -- no need to keep track of Boolean operators or complex query structures. The most popular natural language query site today is AskJeeves.com, which parses the query for keywords that it then applies to the index of sites it has built. It only works with simple queries; but competition is heavy to develop a natural-language query engine that can accept a query of great complexity.

For more information on search engines and related topics, check out the links on the next page.

Comments: Post a Comment

Subscribe to Post Comments [Atom]





<< Home

This page is powered by Blogger. Isn't yours?

Subscribe to Posts [Atom]