
How Google search engine algorithm works

Prepared by: Viral Shah (120570107014)
Guided by: Prof. Sahista Machhar, MEFGI

What is a SEARCH ENGINE?

It is a program that searches for and identifies items in a database that correspond to keywords or characters specified by the user, used especially for finding particular sites on the World Wide Web.
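As a toy illustration of this definition, the Python sketch below matches user keywords against a small in-memory "database" of pages; the pages and the scoring rule are invented for the example, not how any real engine works.

# Minimal keyword-matching "search engine" over a toy in-memory database.
# The pages and the scoring rule are illustrative assumptions.

pages = {
    "https://example.com/python": "Python is a programming language for the web and more",
    "https://example.com/search": "How a search engine crawls and indexes the web",
    "https://example.com/cats":   "Cats are popular pets found all over the world",
}

def search(query):
    terms = query.lower().split()
    results = []
    for url, text in pages.items():
        words = text.lower().split()
        # Score a page by how many times it contains the query terms.
        score = sum(words.count(term) for term in terms)
        if score > 0:
            results.append((score, url))
    # Best-matching pages first.
    return [url for score, url in sorted(results, reverse=True)]

if __name__ == "__main__":
    print(search("search engine web"))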

There are 759 million websites on the Web, with some 60 trillion web pages across these websites.

AND IT’S CONSTANTLY GROWING!

GOOGLE navigates the WEB by crawling.

CRAWLING

To find information on the hundreds of millions of Web pages that exist, a search engine employs special software robots, called SPIDERS, to build lists of the words found on Web sites. When a spider is building its lists, the process is called Web crawling.

How does any spider start its journey over the Web?

The usual starting points are lists of heavily used servers and very popular pages. The spider will begin with a popular site, indexing the words on its pages and following every link found within the site. In this way, the spidering system quickly begins to travel, spreading out across the most widely used portions of the Web.
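A very rough Python sketch of this idea: start from a few seed URLs standing in for "heavily used servers and very popular pages", fetch each page, and follow its links breadth-first. The seed list, page limit and naive link extraction are assumptions for illustration, not Google's actual crawler.

# Toy breadth-first crawler: starts from "popular" seed pages and follows links.
# Seeds, limits, and the naive link extraction are assumptions for illustration.
import re
import urllib.request
from collections import deque

SEEDS = ["https://example.com/"]   # hypothetical heavily used starting points

def crawl(seeds, max_pages=10):
    seen = set(seeds)
    queue = deque(seeds)
    crawled = 0
    while queue and crawled < max_pages:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue                # skip pages that fail to load
        crawled += 1
        print("crawled:", url)
        # Very naive link extraction; a real spider parses HTML properly.
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)

if __name__ == "__main__":
    crawl(SEEDS)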

When the Google spider looked at an HTML page, it took note of the following things:

Words occurring in the title, subtitles, meta tags and other positions of relative importance were noted for special consideration during a subsequent user search. The Google spider was built to index every significant word on a page, leaving out the articles “a”, “an” and “the”. Other spiders take different approaches.

For example, some spiders will keep track of the words in the title, sub-headings and links, along with the 100 most frequently used words on the page and each word in the first 20 lines of text. Lycos is said to use this approach to spidering the Web.
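As a rough sketch of this word-collection step, the Python snippet below pulls words from an HTML page, skips the articles “a”, “an” and “the”, and gives extra weight to title words; the weighting number is invented for illustration and is not how Google actually scores words.

# Toy word extraction: count every significant word on the page, skipping the
# articles "a", "an" and "the", and give title words an assumed extra boost.
import re

STOP_WORDS = {"a", "an", "the"}

def extract_words(html):
    title_match = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
    title = title_match.group(1) if title_match else ""
    body = re.sub(r"<[^>]+>", " ", html)           # crude tag stripping
    words = {}
    for word in re.findall(r"[a-z0-9]+", body.lower()):
        if word not in STOP_WORDS:
            words[word] = words.get(word, 0) + 1   # plain occurrence count
    for word in re.findall(r"[a-z0-9]+", title.lower()):
        if word not in STOP_WORDS:
            words[word] = words.get(word, 0) + 10  # assumed boost for title words
    return words

if __name__ == "__main__":
    page = "<html><title>The Web Crawler</title><body>A crawler visits the web.</body></html>"
    print(extract_words(page))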

Google built its initial system to use multiple spiders, usually three at one time. Each spider could keep about 300 connections to Web pages open at a time.

Google’s spider is named Googlebot.

Googlebot is the search bot software used by Google, which collects documents from the web to build a searchable index for the Google Search engine.
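As a rough illustration of one spider keeping many connections open at once, here is a small Python sketch using a thread pool; the URL list and the worker count of 300 are only stand-ins for the figures mentioned above, not Googlebot's real design.

# Sketch of one "spider" fetching many pages concurrently, in the spirit of
# the ~300 simultaneous connections described above. URLs are placeholders.
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URLS = ["https://example.com/"] * 5        # placeholder pages to fetch

def fetch(url):
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return url, len(resp.read())
    except OSError:
        return url, None                   # page could not be fetched

if __name__ == "__main__":
    # max_workers plays the role of "open connections per spider".
    with ThreadPoolExecutor(max_workers=300) as pool:
        for url, size in pool.map(fetch, URLS):
            print(url, size)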

Continue…

INDEXING

By following web pages, the INDEX is prepared. The index includes text from millions of books from several libraries and other partners.

That means Google follows links from page to page. It also sorts pages by their content and other factors.

All of these activities are tracked in the INDEX. Google continuously updates the index, which is stored across large servers.

Currently, Google’s index size is over 100 million gigabytes.
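In spirit, this is an inverted index: for every word, a list of the pages that contain it. A minimal Python sketch with invented example pages:

# Minimal inverted index: maps each word to the set of pages containing it.
from collections import defaultdict

docs = {
    "page1": "google crawls the web",
    "page2": "the web keeps growing",
}

index = defaultdict(set)
for page, text in docs.items():
    for word in text.lower().split():
        index[word].add(page)

print(sorted(index["web"]))   # pages that contain the word "web"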

Continue…

Site owners choose whether their sites are crawled.

To prevent most search engine web crawlers from indexing a page on your site, place the following meta tag into the <head> section of your page:

<meta name="robots" content="noindex">

To prevent only Google web crawlers from indexing a page:

<meta name="googlebot" content="noindex">

ALGORITHMS

1) AUTOCOMPLETE: Predicts what you might be searching for. This includes understanding terms with more than one meaning. (A toy sketch follows this list.)

2) SYNONYMS: Recognizes words with similar meanings.

3) QUERY UNDERSTANDING: Gets to the deeper meaning of the words you type.

4) GOOGLE INSTANT: Displays immediate results as you type.

5) SPELLING: Identifies and corrects possible spelling errors and provides alternatives.
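As a toy illustration of the AUTOCOMPLETE idea above, here is a minimal Python sketch that suggests past queries matching what the user has typed so far; the query list is invented and this is not how Google actually implements autocomplete.

# Toy autocomplete: suggest past queries that start with what the user typed.
# The list of past queries is an invented example.
PAST_QUERIES = [
    "weather today", "weather tomorrow", "web crawler",
    "python tutorial", "pyramids of giza",
]

def autocomplete(prefix, limit=3):
    prefix = prefix.lower()
    return [q for q in PAST_QUERIES if q.startswith(prefix)][:limit]

if __name__ == "__main__":
    print(autocomplete("we"))   # ['weather today', 'weather tomorrow', 'web crawler']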

Continue…

Based on all the above factors, Google picks some web pages from the index.

Then, Google ranks the results based on various factors.

1) Site & Page Quality:- It is checked by how you are writing key-words.

2) Freshness: How fresh the content is and how regularly it is updated.

3) Safe-Search: Google tries to find out how safe the page is and whether it contains spam.

Along with these, there are 200+ factors used by Google to rank any particular web page.
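There is no public formula for this, but as a rough illustration of combining several signals into one ranking score, here is a toy Python sketch; the signal names, values and weights are invented, not Google's actual factors.

# Toy ranking: combine a few page signals into one score and sort descending.
# Signal names and weights are invented for illustration.
pages = [
    {"url": "a.com", "quality": 0.9, "freshness": 0.5, "safe": 1.0},
    {"url": "b.com", "quality": 0.6, "freshness": 0.9, "safe": 1.0},
    {"url": "c.com", "quality": 0.8, "freshness": 0.7, "safe": 0.0},
]

def score(page):
    # Unsafe pages are pushed to the bottom regardless of other signals.
    return (0.6 * page["quality"] + 0.4 * page["freshness"]) * page["safe"]

for page in sorted(pages, key=score, reverse=True):
    print(page["url"], round(score(page), 2))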

Continue…

After all these operations, you get the desired result, and all of this happens in a fraction of a second!

SPAM…

Google fights spam every second to give true and relevant results.

The majority of spam removal is automatic. Google examines other questionable documents by hand, and if it finds spam, it takes manual action.

Types of SPAM

1) PURE SPAM: Site appears to use aggressive spam techniques such as automatically generated gibberish, cloaking, scraping content from other websites, and/or repeated or egregious violations of Google's Webmaster Guidelines.

2) HIDDEN TEXT AND/OR KEYWORD STUFFING: Some of the pages may contain hidden text and/or keyword stuffing.

3) USER-GENERATED SPAM: Site appears to contain spammy user-generated content. The problematic content may appear on forum pages, guestbook pages, or user profiles.

4) PARKED DOMAINS: Parked domains are placeholder sites with little unique content, so Google doesn't typically include them in search results.

5) THIN CONTENT WITH LITTLE OR NO ADDED VALUE: Site appears to consist of low-quality or shallow pages which do not provide users with much added value (such as thin affiliate pages, doorway pages, cookie-cutter sites, automatically generated content, or copied content).

6) UNNATURAL LINKS TO A SITE: Google has detected a pattern of unnatural, artificial, deceptive, or manipulative links pointing to the site. These may be the result of buying links that pass PageRank or participating in link schemes.

Besides all these, Google uses thousands of other factors to detect spam and decides the PageRank of a web page accordingly. These factors are constantly updated, and in the end Google keeps only trusted documents in its index.
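As one small example of what an automatic spam check might look like, the Python sketch below flags keyword stuffing when a single word makes up an unusually large share of a page's text; the 20% threshold is an invented illustration, not Google's rule.

# Toy keyword-stuffing check: flag a page if any single word dominates the text.
# The 20% threshold is an invented illustration.
from collections import Counter

def looks_stuffed(text, threshold=0.20):
    words = text.lower().split()
    if not words:
        return False
    _, top_count = Counter(words).most_common(1)[0]
    return top_count / len(words) > threshold

if __name__ == "__main__":
    spammy = "cheap shoes cheap shoes cheap shoes buy cheap shoes cheap"
    normal = "this article explains how a search engine ranks web pages"
    print(looks_stuffed(spammy), looks_stuffed(normal))   # True False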

And the point of interest is that to make this presentation on Google, I used Google itself!

Behind your simple page of results is a complex system, carefully crafted and tested, to support more than one hundred billion searches each month!