controlling crawler for better indexation and ranking

Controlling Search Engine Crawlers For Better Indexation and Rankings - Inspired from MOZ

Upload: rajesh-magar

Post on 15-Aug-2015


Category:

Marketing



Controlling Search Engine Crawlers for Better Indexation and Rankings - Inspired from MOZ

Controlling Crawling and Indexing using

1. Robots.txt
2. Meta Robots tags
   • NoFollow tag
3. X-Robots-Tag HTTP header

1. Some sample robots.txt files

Allow crawling of all content

User-agent: *
Disallow:

or

User-agent: *
Allow: /

Disallow crawling of the whole website

User-agent: *
Disallow: /

Disallow crawling of certain parts of the website

User-agent: *
Disallow: /calendar/
Disallow: /junk/

Allowing access to a single crawler

User-agent: Googlebot-news
Disallow:

User-agent: *
Disallow: /

Allowing access to all but a single crawler

User-agent: Unnecessarybot
Disallow: /

User-agent: *
Disallow:
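One way to sanity-check rules like the samples above before deploying them is Python's standard-library urllib.robotparser. The sketch below is only an illustration: the helper name allowed, the ExampleBot user agent, and the example.com URLs are assumptions, and Python's parser applies the first matching rule, so it may not agree with every search engine on files that mix Allow and Disallow lines.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "Disallow crawling of certain parts of the website"
PARTIAL_BLOCK = """\
User-agent: *
Disallow: /calendar/
Disallow: /junk/
"""

# "Allowing access to a single crawler"
SINGLE_CRAWLER = """\
User-agent: Googlebot-news
Disallow:

User-agent: *
Disallow: /
"""

if __name__ == "__main__":
    # ExampleBot and example.com are hypothetical names used for illustration.
    for label, rules, agent, url in [
        ("calendar page", PARTIAL_BLOCK, "ExampleBot", "https://example.com/calendar/2015/"),
        ("about page", PARTIAL_BLOCK, "ExampleBot", "https://example.com/about.html"),
        ("news bot", SINGLE_CRAWLER, "Googlebot-news", "https://example.com/story.html"),
        ("other bot", SINGLE_CRAWLER, "ExampleBot", "https://example.com/story.html"),
    ]:
        print(label, "->", "allowed" if allowed(rules, agent, url) else "blocked")
```

An empty Disallow: line is treated as "nothing is disallowed", which is why the Googlebot-news group in the second sample grants full access.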

2. Using the robots meta tag

A page-specific approach to controlling how an individual page should be indexed and served to users in search results.

<meta name="robots" content="all">
<meta name="robots" content="noindex, nofollow">
<meta name="robots" content="index, nofollow">
<meta name="robots" content="noindex, follow">

<meta name="GOOGLEBOT" content="all">
<meta name="GOOGLEBOT" content="noindex, nofollow">
<meta name="GOOGLEBOT" content="index, nofollow">
<meta name="GOOGLEBOT" content="noindex, follow">

Several other directives can be used to control indexing and serving with the robots meta tag.
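To audit which directives a page actually carries, the robots meta tags can be pulled out of its HTML. The following is a minimal sketch using Python's standard html.parser; the class and function names are hypothetical, and it only looks at the two tag names shown above (robots and googlebot).

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> / <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values keep their case.
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").strip().lower()
        if name in ("robots", "googlebot"):
            tokens = [t.strip().lower() for t in attr.get("content", "").split(",") if t.strip()]
            self.directives.setdefault(name, []).extend(tokens)


def robots_directives(html: str) -> dict:
    """Return a mapping such as {"robots": ["noindex", "follow"]} for the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<head><meta name="robots" content="noindex, follow"></head>'
    print(robots_directives(page))
```

Directive names are lowercased here because crawlers treat them case-insensitively, so GOOGLEBOT and googlebot end up under the same key.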

3. X-Robots-Tag HTTP header

HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
(…)
X-Robots-Tag: noindex
(…)

_________________________________

HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
(…)
X-Robots-Tag: noarchive
X-Robots-Tag: unavailable_after: 25 Jun 2010 15:00:00 PST
(…)

_________________________________

HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
(…)
X-Robots-Tag: googlebot: nofollow
X-Robots-Tag: otherbot: noindex, nofollow
(…)

Practical implementation of X-Robots-Tag with Apache

<Files ~ "\.(png|jpe?g|gif)$">
  Header set X-Robots-Tag "noindex"
</Files>

<Files ~ "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</Files>
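When checking a server's responses, it can help to group the X-Robots-Tag values seen in this section by the crawler they target. The sketch below is a rough illustration, not a full parser: the function name is hypothetical, "*" is used here as a made-up key meaning "all crawlers", and unavailable_after is the only directive it treats as carrying its own value.

```python
from collections import defaultdict

# Directives written as "name: value" (assumption: only this one is handled).
VALUED_DIRECTIVES = {"unavailable_after"}


def parse_x_robots_tag(header_values):
    """Group X-Robots-Tag directives by user agent ("*" stands for every crawler)."""
    rules = defaultdict(list)
    for value in header_values:
        agent, rest = "*", value.strip()
        prefix, sep, remainder = rest.partition(":")
        # A leading "name:" that is not a valued directive is read as a user agent.
        if sep and "," not in prefix and prefix.strip().lower() not in VALUED_DIRECTIVES:
            agent, rest = prefix.strip().lower(), remainder.strip()
        if rest.partition(":")[0].strip().lower() in VALUED_DIRECTIVES:
            # Keep "unavailable_after: <date>" whole, since the date contains colons.
            rules[agent].append(rest)
        else:
            rules[agent].extend(d.strip().lower() for d in rest.split(",") if d.strip())
    return dict(rules)


if __name__ == "__main__":
    print(parse_x_robots_tag(["googlebot: nofollow", "otherbot: noindex, nofollow"]))
    print(parse_x_robots_tag(["noarchive", "unavailable_after: 25 Jun 2010 15:00:00 PST"]))
```

The input is a list because a response may repeat the X-Robots-Tag header, as the second and third examples do.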

Best Practices

1. Content that isn't ready yet
   A. Large quantity? Use robots.txt
   B. Small quantity? Use the meta robots tag

2. Dealing with duplicate or thin content
   A. Probably use rel=canonical
   B. Disallow in the robots.txt file if crawl budget is an issue

3. Passing link equity without appearing in search results
   A. Meta robots: noindex, follow
   B. Do not disallow in robots.txt

4. Search results-type pages
   A. Make the most common/popular ones into category-style or landing pages with unique value
   B. Disallow in robots.txt (if you are sure about the post-action)
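Best practice 3 matters because a crawler has to fetch a page before it can see a noindex, follow tag; if robots.txt blocks the page, the tag is never read. A quick check for that conflict might look like the sketch below, which assumes Python's standard urllib.robotparser; the function name, the ExampleBot user agent, and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocked_noindex_pages(robots_txt, noindex_urls, user_agent="ExampleBot"):
    """Return the URLs that carry noindex but cannot be crawled under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    robots_txt = "User-agent: *\nDisallow: /search/\n"
    # Pages assumed to carry <meta name="robots" content="noindex, follow">.
    pages = [
        "https://example.com/search/shoes",
        "https://example.com/tags/shoes",
    ]
    for url in blocked_noindex_pages(robots_txt, pages):
        print("noindex tag will not be seen:", url)
```

Any URL the function returns is a page where either the Disallow rule or the meta tag should be dropped, depending on which behavior is actually wanted.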

Which brought us to the Information overload of

Source: http://www.google.com/insidesearch/howsearchworks/thestory/index.html

THAT'S ALL…

But wait…

What if I say that's just 5% of the entire WORLD WIDE WEB?

YES, INDEED…

"The internet is much, much bigger than people think,"

Still, 95% of the web is completely invisible to your beloved Google, Yahoo, and Bing, and in fact to any search engine available on the planet.

Sounds hard to believe… but that's what it is!

Source: http://www.cbsnews.com/news/new-search-engine-exposes-the-dark-web/

Friendly References:

1. https://developers.google.com/webmasters/control-crawl-index/
2. https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
3. https://support.google.com/webmasters/answer/1061943
4. https://moz.com/blog/controlling-search-engine-crawlers-for-better-indexation-and-rankings-whiteboard-friday

And most important

5. http://www.cbsnews.com/news/new-search-engine-exposes-the-dark-web/

Big Thanks: https://moz.com/blog/controlling-search-engine-crawlers-for-better-indexation-and-rankings-whiteboard-friday