Wednesday, March 7, 2007

Google: introduction


The World Wide Web is a vast source of information that is growing at an exponential rate. More than 7 million pages are added to the Web each day, according to a study by Cyveillance (Washington). Search engines must therefore adapt continuously to keep up with this relentless pace and avoid falling behind in indexing all these new pages. They also need effective criteria to pick out the best data from this vast mass and to present search results in order of relevance.

If Google has overtaken yesterday's stars (AltaVista and Yahoo!), it is because it has met these two challenges:

* indexing as many documents as possible, quickly
* presenting sorted results, with the most relevant first.

Will it manage to keep up this incredible pace? Will you manage to get your website into the top rankings? The purpose of this site is to help you do so!

Presentation of Google

Does it still need an introduction? With nearly 50% of the traffic generated by all search engines and directories combined (in France), Google cannot be ignored. What began as a PhD research project by two American academics (Larry Page and Sergey Brin) has become a company of international standing.

The success of this engine comes partly from the algorithm devised by its two founders, and partly from the application of an elementary principle: the simplest things are sometimes the most effective. Google chose a very stripped-down interface, free of advertising, and concentrated its services on searching Web pages and nothing else. The engine is also very fast at querying its database.

Besides search results that many users consider relevant, Google has managed to index a very large number of pages: its index is now one of the largest in the world (if not the largest), with approximately 2 billion pages. Recently, new document types have been indexed in addition to traditional HTML: Word, Excel, Acrobat, PowerPoint, WordPad, etc.

The algorithm is based on two systems: the indexing of the Web by Googlebot, and the ranking of pages by PageRank.

How Google Indexes the Web

Google has built crawler software named Googlebot. It is a robot that indexes Web pages (and now other document types). Its principle is simple (though its implementation is not!): when it reads a page, it adds every page linked from that page to its list of pages to visit.

In theory, it should therefore be able to discover most of the pages on the Web, i.e. all those that are not orphans (a page is called an orphan if no other page links to it). Because the volume of data to process is large, this robot is a program distributed across hundreds of servers.
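The crawling principle described above can be illustrated with a minimal sketch. It is not Googlebot's actual implementation: the link graph `LINKS`, the page names and the `crawl` function are all hypothetical, and a real crawler would fetch pages over the network and run on many servers. The sketch only shows how a list of pages to visit grows from the links found on each page, and why an orphan page is never reached.

```python
from collections import deque

# Hypothetical toy link graph: each page maps to the pages it links to.
LINKS = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "contact"],
    "contact": [],
    "orphan": ["home"],  # links out, but no other page links to it
}


def crawl(start, links):
    """Return the pages in the order a breadth-first crawl discovers them."""
    to_visit = deque([start])  # the list of pages still to visit
    seen = {start}             # pages already queued, so none is queued twice
    order = []
    while to_visit:
        page = to_visit.popleft()
        order.append(page)
        # "Reading" the page: add every page it links to, if not yet known.
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                to_visit.append(target)
    return order


discovered = crawl("home", LINKS)
orphans = sorted(set(LINKS) - set(discovered))
print(discovered)  # pages reachable from "home"
print(orphans)     # pages that were never discovered
```

Starting from `"home"`, the crawl reaches every page except `"orphan"`, because no link points to it.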

Besides knowing about as many pages as possible, Google also wants to re-index them regularly, because many pages are updated from time to time. How often Googlebot visits a Web page depends on its PageRank: the higher the PageRank, the more often the page is indexed. From one visit to the next, Googlebot can also detect that a page no longer exists ("error 404").
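As a rough illustration of these two ideas, the sketch below uses an invented rule in which the delay between visits shrinks as PageRank grows, and drops a page from a toy index when a visit returns a 404. The formula, the `base_days` parameter and both function names are assumptions made for the example; the rule Google really applies is not described here.

```python
def revisit_interval_days(pagerank, base_days=30.0):
    """Hypothetical rule: the higher the PageRank, the shorter the delay
    between two visits. The formula is an assumption for illustration."""
    return base_days / (1.0 + pagerank)


def process_visit(url, status_code, index):
    """Update a toy index (a set of URLs) after visiting a page.

    A 404 status means the page no longer exists, so it is removed;
    any other status keeps (or adds) the page in the index.
    """
    if status_code == 404:
        index.discard(url)
    else:
        index.add(url)
    return index


# Hypothetical URLs and PageRank values.
index = {"example.com/a", "example.com/b"}
process_visit("example.com/b", 404, index)  # the page has disappeared
process_visit("example.com/c", 200, index)  # a newly discovered page

print(revisit_interval_days(1))  # low PageRank: visited less often
print(revisit_interval_days(9))  # high PageRank: visited more often
print(sorted(index))
```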

Google analyzes this colossal mass of information in full detail. Each word or phrase is associated with a type, based on the HTML tags around it. A word in the title is thus considered more significant than the same word in the body text. These types can be ranked by importance (page title, headings H1 to H6, bold, italic, etc.). This preprocessing, combined with other criteria including PageRank, makes it possible to list the most relevant results first.
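The idea of weighting a word by the HTML tag that contains it can be sketched with Python's standard `html.parser` module. The weights in `TAG_WEIGHTS` are invented for the example (the values Google uses are not public), and `WeightedWordParser` and `score_words` are hypothetical names. The sketch only shows the principle: the same word counts for more in the title or a heading than in plain body text.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical weights per tag; the real values are not public.
TAG_WEIGHTS = {
    "title": 10,
    "h1": 6, "h2": 5, "h3": 4, "h4": 3, "h5": 3, "h6": 3,
    "b": 2, "strong": 2, "i": 2, "em": 2,
}
DEFAULT_WEIGHT = 1  # plain body text


class WeightedWordParser(HTMLParser):
    """Accumulate a score per word, weighted by the enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []             # tags currently open
        self.scores = defaultdict(int)  # word -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recently opened tag with this name, if any.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        # A word takes the highest weight among the tags that enclose it.
        weight = max(
            (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags),
            default=DEFAULT_WEIGHT,
        )
        for word in data.lower().split():
            self.scores[word] += weight


def score_words(html):
    """Return a dict mapping each word of the page to its weighted score."""
    parser = WeightedWordParser()
    parser.feed(html)
    parser.close()
    return dict(parser.scores)


page = (
    "<html><head><title>Google ranking</title></head>"
    "<body><h1>Ranking</h1><p>Google is <b>fast</b></p></body></html>"
)
print(score_words(page))
```

On this sample page, "ranking" scores highest because it appears in both the title and an H1 heading, while "is" appears only in body text and gets the default weight.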

The Google Toolbar


To make searching the Web even easier and faster for users, and of course to increase the number of its own users, Google developed a small program called the Google Toolbar, which adds a toolbar to Internet Explorer. You can type your keywords directly into it and get the results in the browser, just as if you had gone to the Google homepage to run your search.

In addition to quick access to search, this bar also shows information about the current page: its PageRank and the related category in the Google Directory.

PageRank is displayed as a horizontal green bar of varying width. Hovering the mouse over it shows a number between 0 and 10. This number is only a rough approximation of the actual PageRank value, which is a much more precise number (either a large integer or a decimal number).