The Life and Lies of Google the Wizard

Lakshit Dabas, Makkunda Sharma, Shikhin Sethi, Shrey Gupta


Post on 22-Oct-2015



Contents

1. Introduction: A brief look at how search technology is relevant to current times, and how our project plans to impact it.

2. A Little about Google: A short history of how Google came into existence, and an insight into its current activities.

3. Mathematical Principles: An explanation of the mathematical tools used to analyze the predicament presented by current search engines.

4. The Algorithm: An investigation of how pages are sorted and ranked in the order of their importance.

5. Problems: A scrutiny of the inability of search engines to present results in a context-aware manner.

6. Optimizations: Our bit to the solution! Yayy!

7. Conclusion: How our research helps pave a path to a more context-aware search engine.


Introduction

Imagine a library containing 25 billion documents but with no centralized organization and no librarians. In addition, anyone may add a document at any time without telling anyone. You may feel sure that one of the documents contained in the collection has a piece of information that is vitally important to you, and, being impatient like most of us, you'd like to find it in a matter of seconds. Of course, alongside that vital piece of information would lie a great deal of similar, yet totally unrelated, content.

With the recent developments in search technology, the increasing usage of existing engines, and the information overload our society is facing, context-awareness is more important than ever. Our project aims to analyze Google’s ingenious algorithm, and to explore how particular improvements could be applied for a more context-aware experience.

A Little about Google

"They are Google. They are Legion. They do not forgive. They do not forget. Expect them."

Google is a multinational corporation specializing in Internet-related services and products, including search, cloud computing, software, and online advertising technologies.

Google was founded by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University. Its mission statement from the outset was "to organize the world's information and make it universally accessible and useful", and its unofficial slogan has always been "Don't be evil".

The corporation has been estimated to run more than one million servers in data centers around the world, and to process over one billion search requests and about 24 petabytes of user-generated data each day.


Mathematical Principles

Matrices

A matrix is a rectangular array of elements arranged in rows and columns, where each individual item in the matrix is known as its entry.

A square matrix can be used to represent the links between pages: the entry in row i and column j describes the link from page j to page i.

Stochastic Matrices

If a matrix H exhibits the following properties,

- All of its entries are non-negative.
- The sum of all the entries in each column is unity.

then the matrix H can be called stochastic.
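The two conditions above can be checked mechanically. The following is a minimal sketch, assuming the matrix is stored as a list of rows; the function name `is_column_stochastic` and the small example matrices are illustrative and do not come from the report.

```python
def is_column_stochastic(matrix, tol=1e-9):
    """Return True if every entry is non-negative and every column sums to 1."""
    if not matrix or any(len(row) != len(matrix[0]) for row in matrix):
        return False  # empty or ragged input is not a matrix
    if any(entry < 0 for row in matrix for entry in row):
        return False
    # zip(*matrix) transposes the rows into columns.
    column_sums = [sum(column) for column in zip(*matrix)]
    return all(abs(total - 1.0) <= tol for total in column_sums)


# Hypothetical 3-by-3 example: each column sums to one.
example = [
    [0.5, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
print(is_column_stochastic(example))           # True
print(is_column_stochastic([[1, 1], [1, 0]]))  # False: the first column sums to 2
```

A tolerance is used because column sums of floating-point fractions such as 1/3 are rarely exactly 1.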

Eigenvectors

An eigenvector of a square matrix $A$ is a non-zero vector $v$ that, when the matrix is multiplied by $v$, yields a constant multiple of $v$, that is, $Av = \lambda v$. The multiplier is commonly denoted by $\lambda$, and is known as the eigenvalue of the eigenvector. For a stochastic matrix, an eigenvector with eigenvalue 1 is also known as a stationary vector of the matrix.

Figure 1. An m-by-n matrix.

Figure 2. An example of a stochastic matrix.

Figure 3. After shear mapping the image of Mona Lisa, the blue vector retains its direction. Hence, for the shear mapping transformation, the blue vector is an eigenvector.
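The definition of an eigenvector can be tested numerically. The sketch below assumes a horizontal shear matrix as a stand-in for the shear mapping in the Mona Lisa figure; the matrix, the vectors and the helper names `mat_vec` and `is_eigenvector` are illustrative assumptions, not taken from the report.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector (list)."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def is_eigenvector(matrix, vector, eigenvalue, tol=1e-9):
    """Check that vector is non-zero and matrix * vector == eigenvalue * vector."""
    if all(abs(x) <= tol for x in vector):
        return False  # the zero vector is excluded by definition
    image = mat_vec(matrix, vector)
    return all(abs(y - eigenvalue * x) <= tol for x, y in zip(vector, image))


# Assumed horizontal shear: points are pushed sideways in proportion to their height.
shear = [
    [1.0, 1.0],
    [0.0, 1.0],
]
print(is_eigenvector(shear, [1.0, 0.0], 1.0))  # True: horizontal vectors keep their direction
print(is_eigenvector(shear, [0.0, 1.0], 1.0))  # False: vertical vectors are tilted
```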


The Algorithm

To display the best possible results for all search terms, Google uses an amalgam of factors to sort the pages relevant to the keywords entered. Although the algorithm consists of two hundred factors, it revolves around the single concept of PageRank to sort all relevant pages on the basis of their relative importance.

Google starts out by crawling through the web formed by the multitude of web pages, using hyperlinks between pages to ‘traverse’ the web. It then indexes all these web pages, also forming a database of all the common words used on each page. Googlebot, Google’s automated web spider, counts links and gathers other information on web pages.

As soon as the user enters the search term, Google queries its index to figure out which web pages are related to the keywords entered. Once the ‘relevant’ list is created, several factors, including PageRank, are used to arrange the pages in the order of their importance.
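The index-then-query step described above can be illustrated with a toy inverted index. This is only a sketch under strong assumptions: the pages, their addresses and the functions `build_index` and `query` are invented for illustration, and nothing here reflects how Google's index is actually built.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map every word to the set of page addresses on which it occurs."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def query(index, search_terms):
    """Return the pages that contain every word of the search term."""
    words = re.findall(r"[a-z0-9]+", search_terms.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical pages standing in for crawled documents.
pages = {
    "a.example": "Owls deliver the post",
    "b.example": "The post office is closed",
    "c.example": "Owls sleep by day",
}
index = build_index(pages)
print(sorted(query(index, "owls post")))  # ['a.example']
print(sorted(query(index, "the post")))   # ['a.example', 'b.example']
```

The result of `query` is the unordered ‘relevant’ list; ordering it is the job of the ranking factors discussed next.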

PageRank

PageRank is an algorithm used by the Google web search engine to rank websites in its search results. Named after Larry Page, one of the founders of Google, PageRank is a way of measuring the importance of website pages:

“PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.”

—Facts about Google and Competition

PageRank works on the basic principle that good content is appreciated. The model is borrowed from the way the scientific community rates the importance of theses. For a thesis to be regarded as credible, it must have a number of citations from reliable sources. The two factors that hence determine the relative importance of any thesis are:

- The number of citations.
- The quality of each individual citation.

Furthermore, if a particular thesis cites dozens of other articles, the importance it imparts to each of them is relatively small, and vice versa.

In simple terms, a parallel between this model of citations and that of PageRank can be drawn.

While PageRank is not the only algorithm used by Google to order search engine results, it is the first algorithm that was used by the company, and it is the best known.

Calculating PageRank


To determine the PageRank of any page, we look at the number of links coming in to the particular page, and the importance of the pages linking to it.

Let’s assign each web page $P$ an importance $I(P)$, called its (relative) PageRank. Suppose that page $P_j$ has $l_j$ links. If one of those links is to page $P_i$, then $P_j$ will pass on $1/l_j$ of its importance to $P_i$. The importance ranking of $P_i$ is then the sum of all the contributions made by pages linking to it. That is, if we denote the set of pages linking to $P_i$ by $B_i$, then

$$
I(P_i) = \sum_{P_j \in B_i} \frac{I(P_j)}{l_j}
$$

That raises the classic chicken-and-egg issue: to calculate the importance of any page, you have to know the importance of the pages linking to it, and so forth. To solve this issue, we will create a matrix describing the interlinking of pages, known as the hyperlink matrix. The hyperlink matrix, $H = [H_{ij}]$, is the matrix in which the entry in the $i$th row and $j$th column is

$$
H_{ij} =
\begin{cases}
1/l_j & \text{if } P_j \in B_i \\
0 & \text{otherwise}
\end{cases}
$$

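Let’s take a small web of eight web pages, with the pages interlinking as follows:

[Figure: link graph of the eight-page web.]

We then proceed by generating the hyperlink matrix of this small web (henceforth called the springweb). Rows 1 to 6 are the coefficients of the product $HI$ written out later in this section; rows 7 and 8 are the entries consistent with every column summing to one and with the iteration table.

$$
H = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & \tfrac{1}{3} & 0 \\
\tfrac{1}{2} & 0 & \tfrac{1}{2} & \tfrac{1}{3} & 0 & 0 & 0 & 0 \\
\tfrac{1}{2} & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & \tfrac{1}{2} & \tfrac{1}{3} & 0 & 0 & \tfrac{1}{3} & 0 \\
0 & 0 & 0 & \tfrac{1}{3} & \tfrac{1}{3} & 0 & 0 & \tfrac{1}{2} \\
0 & 0 & 0 & 0 & \tfrac{1}{3} & 0 & 0 & \tfrac{1}{2} \\
0 & 0 & 0 & 0 & \tfrac{1}{3} & 1 & \tfrac{1}{3} & 0
\end{pmatrix}
$$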


The above matrix, H, can easily be shown to satisfy the definition of a stochastic matrix: all of its entries are non-negative, and the entries in each column sum to unity.
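The construction of the hyperlink matrix and the stochastic check can be sketched in a few lines. The link lists below are an assumption: links into pages 1 to 6 are read off the surviving rows of the product of H and I, while links into pages 7 and 8 are inferred from the column sums and the iteration table. The names `springweb_links` and `hyperlink_matrix` are illustrative. Exact fractions are used so that the column sums can be compared with 1 without rounding error.

```python
from fractions import Fraction

# Assumed outgoing links of the springweb: page j -> pages it links to.
# Links into pages 7 and 8 are inferred, not stated in the text.
springweb_links = {
    1: [2, 3],
    2: [4],
    3: [2, 5],
    4: [2, 5, 6],
    5: [6, 7, 8],
    6: [8],
    7: [1, 5, 8],
    8: [6, 7],
}


def hyperlink_matrix(links):
    """Build H with H[i][j] = 1/l_j if page j links to page i, else 0.

    Assumes every page has at least one outgoing link and lists no target twice.
    """
    pages = sorted(links)
    position = {page: k for k, page in enumerate(pages)}
    size = len(pages)
    matrix = [[Fraction(0)] * size for _ in range(size)]
    for source, targets in links.items():
        for target in targets:
            matrix[position[target]][position[source]] = Fraction(1, len(targets))
    return matrix


H = hyperlink_matrix(springweb_links)
print([str(entry) for entry in H[1]])      # row for page 2
print([sum(column) for column in zip(*H)])  # every column sum should equal 1
```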

To begin with, let’s define a vector $I$ whose elements are the PageRanks (importances) of the pages, $I_i = I(P_i)$. If we multiply the vector $I$ by the matrix $H$, we obtain the following vector:

$$
HI = \begin{pmatrix}
\tfrac{1}{3} I_7 \\
\tfrac{1}{2} I_1 + \tfrac{1}{2} I_3 + \tfrac{1}{3} I_4 \\
\tfrac{1}{2} I_1 \\
I_2 \\
\tfrac{1}{2} I_3 + \tfrac{1}{3} I_4 + \tfrac{1}{3} I_7 \\
\tfrac{1}{3} I_4 + \tfrac{1}{3} I_5 + \tfrac{1}{2} I_8 \\
\tfrac{1}{3} I_5 + \tfrac{1}{2} I_8 \\
\tfrac{1}{3} I_5 + I_6 + \tfrac{1}{3} I_7
\end{pmatrix}
= \begin{pmatrix}
I_1 \\ I_2 \\ I_3 \\ I_4 \\ I_5 \\ I_6 \\ I_7 \\ I_8
\end{pmatrix}
$$

(The last two rows are the ones consistent with the column sums of $H$ and with the iteration table below.)

Each row is exactly the sum that defines the importance of the corresponding page, so the vector multiplied by the matrix results in the vector itself, $HI = I$; thus, the vector $I$ is an eigenvector of the matrix $H$ with eigenvalue unity.

To calculate this eigenvector efficiently, we exploit the following property: starting from an initial vector $I^0$, we repeatedly compute

$$
I^{k+1} = H I^k
$$

and the series will converge to the stationary vector $I$.

Illustrating the same with our springweb,

Page   I^0   I^1   I^2    I^3      I^4      ...   I^60     I^61
1      1     0     0      0        0.0278   ...   0.06     0.06
2      0     0.5   0.25   0.1667   0.0833   ...   0.0675   0.0675
3      0     0.5   0      0        0        ...   0.03     0.03
4      0     0     0.5    0.25     0.1667   ...   0.0675   0.0675
5      0     0     0.25   0.1667   0.1111   ...   0.0975   0.0975
6      0     0     0      0.25     0.1806   ...   0.2025   0.2025
7      0     0     0      0.0833   0.0972   ...   0.18     0.18
8      0     0     0      0.0833   0.3333   ...   0.295    0.295

The series converges to the relative PageRank of each page, and it is observable that page 8 has the highest relative importance.
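The repeated multiplication above can be sketched as follows. The matrix is typed in by hand under the same assumption as before (rows 7 and 8 are inferred from the column sums and the table), and the function name `power_iteration` is illustrative.

```python
# Hyperlink matrix of the springweb; rows 7 and 8 are inferred, not given in the text.
H = [
    [0,   0, 0,   0,   0,   0, 1/3, 0],
    [1/2, 0, 1/2, 1/3, 0,   0, 0,   0],
    [1/2, 0, 0,   0,   0,   0, 0,   0],
    [0,   1, 0,   0,   0,   0, 0,   0],
    [0,   0, 1/2, 1/3, 0,   0, 1/3, 0],
    [0,   0, 0,   1/3, 1/3, 0, 0,   1/2],
    [0,   0, 0,   0,   1/3, 0, 0,   1/2],
    [0,   0, 0,   0,   1/3, 1, 1/3, 0],
]


def power_iteration(matrix, start, steps):
    """Apply vector <- matrix * vector the given number of times."""
    vector = list(start)
    for _ in range(steps):
        vector = [sum(a * x for a, x in zip(row, vector)) for row in matrix]
    return vector


I0 = [1, 0, 0, 0, 0, 0, 0, 0]  # all importance starts on page 1
I4 = power_iteration(H, I0, 4)
I61 = power_iteration(H, I0, 61)

print([round(x, 4) for x in I4])   # compare with the I^4 column of the table
print([round(x, 4) for x in I61])  # compare with the I^61 column of the table
print("most important page:", I61.index(max(I61)) + 1)
```

Because every column of H sums to one, each step only redistributes importance, so the entries of every iterate still add up to 1.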


Problems

On closer inspection, several faults can be found in the original approach adopted by Page and Brin. Some of these are:

- Current search engines find pages relevant to the keywords entered only on the basis of the occurrence of the keywords in the article. The problem with this approach is that ‘spammers’ can pack a web page with the keyword at random intervals, and achieve a better ranking in the search results.

- Since a better ranking can be obtained simply by increasing the frequency of the entered keywords, the quality of web pages trying to achieve a better ranking can take a hit.
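The weakness described above is easy to reproduce with a toy relevance score that only counts keyword occurrences. The scoring function and both page texts below are invented for illustration; they are not how any real search engine scores relevance.

```python
import re


def keyword_score(text, keyword):
    """Naive relevance: how many times the keyword occurs on the page."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return words.count(keyword.lower())


# Two hypothetical pages: one informative, one stuffed with the keyword.
honest = "A broom is a flying device. This guide compares broom models by speed."
stuffed = "broom broom cheap broom best broom buy broom broom"

print(keyword_score(honest, "broom"))   # 2
print(keyword_score(stuffed, "broom"))  # 6
```

Under this score the stuffed page outranks the informative one, even though it offers the reader nothing.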

Optimization

Since PageRank rightly places its emphasis on the pages ‘citing’ a particular target page to determine its importance, further factors that depend on the content of every page linking to the target page can be used. One of these factors is the correlation between the frequency of occurrence of active words (words that add content) on the linking pages and on the target page.

We hope to model the Internet with a small collection of web pages, and to develop a tool to calculate the PageRank of each page. We further hope to statistically analyze the frequency of word occurrence on the pages, and to derive a method of sorting the pages on the basis of their content.
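One possible way to measure how closely the active words of a linking page match those of the target page is sketched below. Everything in it is an assumption made for illustration: the report does not specify a correlation measure, so cosine similarity of word counts is swapped in, and the stop-word list, the function names and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical list of words treated as adding no content.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for", "by", "with"}


def active_word_counts(text):
    """Count the words of a page that are not stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in STOP_WORDS)


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two word-count vectors (0 = nothing shared)."""
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


target = active_word_counts("Owls carry letters and parcels to wizards")
related = active_word_counts("Wizards send letters by owl post; owls carry the parcels")
unrelated = active_word_counts("Cheap flights and hotel deals for the summer")

print(cosine_similarity(related, target) > cosine_similarity(unrelated, target))  # True
```

A link from the related page would then count for more than a link from the unrelated one. How such a score should be combined with PageRank is left open here, as it is in the text.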


Conclusion

Never since the dawn of civilization have humans held more power for creation in their hands than in the Internet age. To avoid probable information overload, organizing this bucket of information should be one of the major priorities of the human race. A study of Google is but a primer into the vast field of search technology.
