controlling crawler for better indexation and ranking

Controlling Search Engine Crawlers for Better Indexation and Rankings

Inspired by Moz

https://moz.com/blog/controlling-search-engine-crawlers-for-better-indexation-and-rankings-whiteboard-friday

Controlling Crawling and Indexing using

1. Robots.txt
2. Meta Robots tags
   • NoFollow Tag
3. X-Robots-Tag HTTP header

1. Some sample robots.txt files

Allow crawling of all content

    User-agent: *
    Disallow:

or

    User-agent: *
    Allow: /

Disallow crawling of the whole website

    User-agent: *
    Disallow: /

Disallow crawling of certain parts of the website

    User-agent: *
    Disallow: /calendar/
    Disallow: /junk/

Allowing access to a single crawler

    User-agent: Googlebot-news
    Disallow:

    User-agent: *
    Disallow: /

Allowing access to all but a single crawler

    User-agent: Unnecessarybot
    Disallow: /

    User-agent: *
    Disallow:
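One way to sanity-check rules like these before deploying them is to feed them to Python's standard-library robots.txt parser. The sketch below is only an illustration: the helper name `build_parser`, the bot name `SomeOtherBot`, and the example.com URLs are made up for the demo, and the standard-library parser may not match every search engine's matching rules exactly.

```python
from urllib.robotparser import RobotFileParser

# Two of the sample files from the slides, written one directive per line.
SINGLE_CRAWLER = """\
User-agent: Googlebot-news
Disallow:

User-agent: *
Disallow: /
"""

CERTAIN_PARTS = """\
User-agent: *
Disallow: /calendar/
Disallow: /junk/
"""


def build_parser(robots_txt):
    """Parse robots.txt text held in memory (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


single = build_parser(SINGLE_CRAWLER)
parts = build_parser(CERTAIN_PARTS)

# Only Googlebot-news may crawl under the first file.
print(single.can_fetch("Googlebot-news", "https://example.com/story.html"))
print(single.can_fetch("SomeOtherBot", "https://example.com/story.html"))

# Under the second file, /calendar/ and /junk/ are off limits to everyone.
print(parts.can_fetch("SomeOtherBot", "https://example.com/calendar/2024"))
print(parts.can_fetch("SomeOtherBot", "https://example.com/about.html"))
```

Running the same check against a list of important URLs can catch an accidental "Disallow: /" before a crawler does.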

2. Using the robots meta tag

A page-specific approach to controlling how an individual page should be indexed and served to users in search results.

    <meta name="robots" content="all">
    <meta name="robots" content="noindex, nofollow">
    <meta name="robots" content="index, nofollow">
    <meta name="robots" content="noindex, follow">

    <meta name="GOOGLEBOT" content="all">
    <meta name="GOOGLEBOT" content="noindex, nofollow">
    <meta name="GOOGLEBOT" content="index, nofollow">
    <meta name="GOOGLEBOT" content="noindex, follow">

Several other directives can be used to control indexing and serving with the robots meta tag.
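To audit which directives a page actually carries, the meta tags can be read with Python's built-in HTML parser. This is a minimal sketch, not a full crawler: the class `RobotsMetaParser` and the function `robots_directives` are hypothetical names, and the code simply pools the directives from the generic "robots" tag and one crawler-specific tag without resolving conflicts between them.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> style tags."""

    def __init__(self, names):
        super().__init__()
        self.names = {n.lower() for n in names}
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values keep their case.
        attr_map = {key: (value or "") for key, value in attrs}
        if attr_map.get("name", "").lower() not in self.names:
            return
        for item in attr_map.get("content", "").split(","):
            item = item.strip().lower()
            if item:
                self.directives.add(item)


def robots_directives(html, crawler="googlebot"):
    """Return the set of directives aimed at all robots or at `crawler`."""
    parser = RobotsMetaParser({"robots", crawler})
    parser.feed(html)
    parser.close()
    return parser.directives


page = '<head><meta name="GOOGLEBOT" CONTENT="noindex, follow"></head>'
found = robots_directives(page)
print(sorted(found))           # ['follow', 'noindex']
print("noindex" in found)      # True
```

Matching on lowercase names means the uppercase GOOGLEBOT form used on the slide is picked up the same way as "googlebot".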

3. X-Robots-Tag HTTP header

    HTTP/1.1 200 OK
    Date: Tue, 25 May 2010 21:42:43 GMT
    (…)
    X-Robots-Tag: noindex
    (…)

_________________________________

    HTTP/1.1 200 OK
    Date: Tue, 25 May 2010 21:42:43 GMT
    (…)
    X-Robots-Tag: noarchive
    X-Robots-Tag: unavailable_after: 25 Jun 2010 15:00:00 PST
    (…)

_________________________________

    HTTP/1.1 200 OK
    Date: Tue, 25 May 2010 21:42:43 GMT
    (…)
    X-Robots-Tag: googlebot: nofollow
    X-Robots-Tag: otherbot: noindex, nofollow
    (…)

Practical implementation of X-Robots-Tag with Apache

    <Files ~ "\.(png|jpe?g|gif)$">
      Header set X-Robots-Tag "noindex"
    </Files>

    <Files ~ "\.pdf$">
      Header set X-Robots-Tag "noindex, nofollow"
    </Files>
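The header forms above (a bare directive, a directive with a value, and a directive prefixed by a crawler name) can be told apart with a few lines of Python. The sketch below is an assumption-laden illustration: `parse_x_robots_tag` and `is_noindex` are hypothetical helpers, "*" is used here as a stand-in key for "all crawlers", and only `unavailable_after` is treated as a directive that carries its own colon.

```python
# Directives whose value follows a colon, so the colon is not a crawler prefix.
VALUE_DIRECTIVES = {"unavailable_after"}


def parse_x_robots_tag(values):
    """Group X-Robots-Tag header values by crawler name ('*' = all crawlers)."""
    rules = {}
    for value in values:
        agent, rest = "*", value.strip()
        head, sep, tail = rest.partition(":")
        head = head.strip()
        # "googlebot: nofollow" names a crawler; "unavailable_after: ..." does not.
        is_agent = sep and "," not in head and head.lower() not in VALUE_DIRECTIVES
        if is_agent:
            agent, rest = head.lower(), tail.strip()
        for item in rest.split(","):
            item = item.strip()
            if item:
                rules.setdefault(agent, []).append(item)
    return rules


def is_noindex(rules, crawler):
    """True if a noindex directive applies to `crawler`."""
    applicable = rules.get("*", []) + rules.get(crawler.lower(), [])
    return any(item.lower() == "noindex" for item in applicable)


per_bot = parse_x_robots_tag(["googlebot: nofollow", "otherbot: noindex, nofollow"])
dated = parse_x_robots_tag(
    ["noarchive", "unavailable_after: 25 Jun 2010 15:00:00 PST"]
)

print(per_bot)
print(dated)
print(is_noindex(per_bot, "otherbot"), is_noindex(per_bot, "googlebot"))
```

The input is a list because a response may repeat the X-Robots-Tag header, as in the second and third samples.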

Best Practices

1. Content that isn't ready yet
   A. Large quantity? Use robots.txt
   B. Small quantity? Use the meta robots tag

2. Dealing with duplicate or thin content
   A. Probably use rel=canonical
   B. Disallow in robots.txt if crawl budget is an issue

3. Passing link equity without appearing in search results
   A. Meta robots: noindex, follow
   B. Do not disallow in robots.txt

4. Search results-type pages
   A. Make the most common/popular ones into category-style or landing pages with unique value
   B. Disallow in robots.txt (if you are sure about the post-action)
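Practice 3 is easy to break by accident: a page marked "noindex, follow" that is also disallowed in robots.txt is never fetched, so the crawler cannot read the meta tag on it. A rough consistency check, using the standard-library parser and a hypothetical helper `blocked_noindex_pages`, might look like this (the URLs and the default user agent are placeholders):

```python
from urllib.robotparser import RobotFileParser


def blocked_noindex_pages(robots_txt, noindex_urls, user_agent="Googlebot"):
    """Return the noindex pages that robots.txt prevents from being crawled."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls if not parser.can_fetch(user_agent, url)]


robots_txt = """\
User-agent: *
Disallow: /junk/
"""

# Pages assumed to carry <meta name="robots" content="noindex, follow">.
noindex_urls = [
    "https://example.com/junk/old-offer.html",
    "https://example.com/tags/seo.html",
]

conflicts = blocked_noindex_pages(robots_txt, noindex_urls)
print(conflicts)  # ['https://example.com/junk/old-offer.html']
```

Any URL in the result needs either its Disallow rule lifted or a different approach to keeping it out of search results.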

Which brought us to the Information overload of

Source: http://www.google.com/insidesearch/howsearchworks/thestory/index.html


That's all…

But wait…

What if I say it's just 5% of the entire World Wide Web?

Yes, indeed.

"The internet is much, much bigger than people think,"

Still, 95% of the web is completely invisible to your beloved Google, Yahoo, and Bing; in fact, to any search engine available on the planet.

Sounds hard to believe, but that's what it is!

Source: http://www.cbsnews.com/news/new-search-engine-exposes-the-dark-web/



Friendly References:

1. https://developers.google.com/webmasters/control-crawl-index/
2. https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
3. https://support.google.com/webmasters/answer/1061943
4. https://moz.com/blog/controlling-search-engine-crawlers-for-better-indexation-and-rankings-whiteboard-friday

And most important

5. http://www.cbsnews.com/news/new-search-engine-exposes-the-dark-web/

Big Thanks: https://moz.com/blog/controlling-search-engine-crawlers-for-better-indexation-and-rankings-whiteboard-friday


Thank You….

@RMmagar

https://twitter.com/@RMMagar
