Understanding Crawling and Page Ranking
Posted On November 8, 2006 by Ramdas S filed under Internet
In this article, you will get to know the technology of crawling and page ranking. When you enter a search string, a lot happens before you get the results for your query. The behind-the-scenes work is performed by “Googlebot”, Google’s spider, which explores the web and crawls websites to fetch documents.
A crawler works by requesting a specified web page from a web server. It then scans the returned page for hyperlinks, which point to new documents that are fetched in the same way. “Googlebot” assigns each fetched page a number so that it can refer to the pages it has fetched. At this point “Googlebot” has produced an enormous set of documents, but without an index these documents are not yet searchable.
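The fetch, scan and number loop described above can be sketched in a few lines of Python. This is only an illustration, not how Googlebot is actually built: the `FAKE_WEB` dictionary, the `crawl` function and the example URLs are all made up for this sketch, and an in-memory dictionary stands in for real web servers so the code runs without a network connection.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web" (URL -> HTML). A real crawler would fetch over HTTP.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">B</a> <a href="http://c.example/">C</a>',
    "http://b.example/": '<a href="http://c.example/">C</a>',
    "http://c.example/": '<a href="http://a.example/">A</a>',
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch):
    """Breadth-first crawl from seed. Returns {url: document number}."""
    doc_ids = {}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in doc_ids:
            continue  # already fetched, do not fetch again
        html = fetch(url)
        if html is None:
            continue  # page could not be fetched
        doc_ids[url] = len(doc_ids) + 1  # number pages in the order fetched
        parser = LinkExtractor()
        parser.feed(html)
        queue.extend(parser.links)  # newly found links are fetched the same way
    return doc_ids


doc_ids = crawl("http://a.example/", FAKE_WEB.get)
print(doc_ids)
```

Starting from one seed page, the sketch discovers the other two pages through their links and gives each one a number, which is all that later steps need in order to refer to a document.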
For example, if you searched for the term “Jesus Prayers”, the Google server would have to read the text of every document each time you searched.
What does it do next? It builds an index. To do this, Google “inverts” the crawl data: instead of having to scan every word in every document, it reorganizes the data so that, for each word, it lists every document that contains that word.
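A minimal sketch of this inversion, assuming the crawl data is simply a dictionary from document number to text. The function name `build_inverted_index` and the three sample documents are invented for illustration, and words are split on whitespace only.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Turn {doc_id: text} into {word: sorted list of doc_ids}."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        # set() so that a word repeated in one document is listed only once
        for word in set(documents[doc_id].lower().split()):
            index[word].append(doc_id)
    return dict(index)


# Hypothetical sample documents, numbered as a crawler might number them
documents = {
    1: "Jesus taught in the temple",
    2: "evening prayers",
    3: "prayers of Jesus",
}
index = build_inverted_index(documents)
print(index["jesus"])    # [1, 3]
print(index["prayers"])  # [2, 3]
```

Because the documents are visited in ascending order, each word's list of documents comes out sorted, which matters later when two lists have to be compared.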
For example, the word “Jesus” might occur in documents 3, 7, 14, 21, 55, 67 and 91, while the word “Prayers” might occur in documents 2, 7, 10, 14, 21, 67 and 90. Once the index is built, Google is ready to rank the documents and determine how relevant they are. Generally, Google does two things to answer a query:
1. Find the set of pages that contain the user’s query somewhere; and
2. Rank the matching pages in order of relevance.
Are you wondering how this is possible in such a short time, typically a fraction of a second? Google speeds up the first step by not storing the entire index on one very powerful computer. Instead, it uses hundreds of computers, dividing the task among the machines so that results can be found much faster.
In figure 1, a user sends a query to the Google server. Google receives it and divides the task among many machines. The documents being searched were collected from the Internet by Googlebot, which requests specific pages from various web servers.
Let us think of a different scenario. Suppose a book’s index is 20 pages long. If one person had to search the index for several pieces of information, each search would take at least several seconds. But if you gave one page to each of 20 people, they could search their portions of the index much more quickly than one person could search the entire index alone. This is similar to what happens when Google divides its data among many machines to find matching documents.
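The book-index analogy can be sketched as follows. This is a toy illustration under stated assumptions: three small dictionaries stand in for three machines, each holding the index for its own portion of the documents, and threads stand in for machines working at the same time. The names `search_shard` and `search_all` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, word):
    """Look the word up in one machine's portion of the index."""
    return shard.get(word, [])


def search_all(shards, word):
    """Ask every shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(search_shard, shards, [word] * len(shards)))
    return sorted(doc for part in partials for doc in part)


# Hypothetical split of the index: each shard covers a range of documents
shards = [
    {"jesus": [3, 7, 14], "prayers": [2, 7, 10, 14]},
    {"jesus": [21, 55], "prayers": [21]},
    {"jesus": [67, 91], "prayers": [67, 90]},
]
print(search_all(shards, "jesus"))  # [3, 7, 14, 21, 55, 67, 91]
```

Each shard only has to look through its own small portion, just as each of the 20 people reads only one page of the book's index.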
Now the question is: how does it rank pages for a user’s query? Let us follow our “Jesus Prayers” example. The word “Jesus” was in documents 3, 7, 14, 21, 55, 67 and 91, while “Prayers” was in documents 2, 7, 10, 14, 21, 67 and 90.
| Jesus   | 3 | 7 | 14 | 21 | 55 | 67 | 91 |
| Prayers | 2 | 7 | 10 | 14 | 21 | 67 | 90 |
Fig. 2
By arranging the documents as shown in figure 2, it becomes clear that the words “Jesus” and “Prayers” appear together in four documents (7, 14, 21 and 67). Refer to figure 3.
| Jesus      | 3 | 7 |    | 14 | 21 | 55 | 67 |    | 91 |
| Prayers    | 2 | 7 | 10 | 14 | 21 |    | 67 | 90 |    |
| Both Words |   | 7 |    | 14 | 21 |    | 67 |    |    |
Fig. 3
The list of documents that contain a word is called a “posting list”, and looking for documents that contain both words is called “intersecting posting lists”.
Note: The fast way to intersect two posting lists is to walk down both at the same time. If one list skips from 21 to 67, you can skip ahead to document 67 on the other list as well.
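Walking down both lists at the same time can be sketched like this. It assumes both posting lists are sorted in ascending order of document number; the function name `intersect` is chosen for this example.

```python
def intersect(list_a, list_b):
    """Return the document numbers that appear in both sorted posting lists."""
    i = j = 0
    both = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            both.append(list_a[i])  # document contains both words
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1  # list_a is behind, move it forward
        else:
            j += 1  # list_b is behind, move it forward
    return both


jesus = [3, 7, 14, 21, 55, 67, 91]
prayers = [2, 7, 10, 14, 21, 67, 90]
print(intersect(jesus, prayers))  # [7, 14, 21, 67]
```

Each list is read only once from start to end, so the work grows with the combined length of the two lists rather than with their product.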
Google ranking result
Now we have only the pages that contain both words from the user’s query, and it is time to rank them by relevance. Google’s algorithms use many factors in ranking. Of these, the PageRank algorithm is the best known.
Note: On April 1, 2002, Google published a detailed explanation of “PigeonRank” as an April Fools’ Day joke. The real algorithm is called PageRank.
PageRank evaluates two things: how many links there are to a web page from other pages, and the quality of the linking sites. A page with five or six high-quality links from websites such as www.nytimes.com, www.rediff.com, etc. would be valued much more highly than a page with twice as many links from less reputable or less established sites.
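To make the idea concrete, here is a small sketch of the simplified, publicly described form of the PageRank iteration. It is not Google's production system: the damping factor of 0.85, the number of iterations and the four-page link graph are assumptions chosen for illustration, and every page that is linked to must also appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal rank
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # a page passes its rank on, split evenly among its links
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # a page with no links spreads its rank over all pages
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: A, B and D all link to C; C links back to A
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph C ends up ranked highest because three pages link to it, and A comes second even though it has only one incoming link, because that link comes from the highly ranked C. This is the "quality of the linking sites" effect described above.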
However, Google uses many other factors besides PageRank. For example, if a document contains the words “Jesus” and “Prayers” right next to each other, it might be more relevant than a document about prayers that merely happens to use the word “Jesus” somewhere else on the page. Also, if a page includes the words “Jesus Prayers” in its title, that is a hint that it might be more relevant than a document with the title “Son of God”. In the same way, if the words “Jesus Prayers” appear several times throughout the page, that page is more likely to be about “Jesus Prayers” than if the words appeared only once.
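The three hints mentioned above (words next to each other, words in the title, words repeated on the page) can be combined into a toy score. The function `relevance_score` and its weights of 1, 2 and 5 are arbitrary assumptions made for this sketch and say nothing about how Google actually weighs these signals. Text is split on whitespace only, so the sample pages avoid punctuation.

```python
def relevance_score(title, body, query):
    """Toy relevance score built from frequency, adjacency and title hints."""
    terms = query.lower().split()
    title_words = title.lower().split()
    body_words = body.lower().split()
    n = len(terms)

    # Hint 1: how often the query words appear on the page (weight 1 each)
    score = float(sum(body_words.count(term) for term in terms))

    # Hint 2: the query words appear right next to each other (weight 2 each)
    phrase_hits = sum(
        1 for i in range(len(body_words) - n + 1) if body_words[i:i + n] == terms
    )
    score += 2.0 * phrase_hits

    # Hint 3: all the query words appear in the title (weight 5)
    if all(term in title_words for term in terms):
        score += 5.0
    return score


# Hypothetical pages for the running example
focused = relevance_score(
    "Jesus Prayers",
    "jesus prayers for every morning and jesus prayers for the evening",
    "Jesus Prayers",
)
passing = relevance_score(
    "Son of God",
    "a history that mentions jesus once and lists some prayers later",
    "Jesus Prayers",
)
print(focused, passing)  # 13.0 2.0
```

The page that is really about the query scores far higher than the page that only mentions the two words separately, which matches the intuition described in the paragraph above.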
Google tries to find pages that are both reputable and relevant. If two pages appear to provide roughly the same amount of information, Google generally picks the page that trusted sites have linked to. However, Google may also elevate a page with fewer links or a lower PageRank if other signals suggest that the page is entirely dedicated to the search query. For example, a page devoted entirely to the “Jesus Prayer” is often more useful than an article that mentions the Jesus Prayer in passing, even if the article is part of a reputable site like about.com.
I hope you now understand how to make and optimize your web pages for search engines. Every effort counts, no matter whether you are a programmer, designer or content writer. Nowadays, search engines use a lot of computing resources to show you documents relevant to your search terms; there may be over 500 computers working together to find accurate and relevant documents. And all this happens in a fraction of a second!
The author is a software engineer with eTechnoverb Pvt. Ltd. He is available at: sashikanta@yahoo.com.
