
USING 'PAGE IMPORTANCE' AND FAMILY VOTES IN ONGOING CONVERSATION WITH GOOGLEBOT TO GET MORE THAN YOUR ALLOCATED CRAWL BUDGET & 'WIN' IN THE BATTLE FOR 'IMPORTANCE EMPHASIS'

BRINGING IN THE FAMILY DURING CRAWLING
Dawn Anderson @ dawnieando

http://webpromo.expert/google-qa-duplicate-content/

Thanks for the mention, Mr Mu :)

https://youtu.be/KxCAVmXfVyI?t=3074

1994 - 1998

"THE GOOGLE INDEX IN 1998 HAD 60 MILLION PAGES" (GOOGLE)

(Source: Wikipedia.org)

2000

"INDEXED PAGES REACHES THE ONE BILLION MARK" (GOOGLE)

"IN OVER 17 MILLION WEBSITES" (INTERNETLIVESTATS.COM)

2001 ONWARDS - ENTER WORDPRESS, DRUPAL CMS', PHP-DRIVEN CMS', ECOMMERCE PLATFORMS, DYNAMIC SITES, AJAX

WHICH CAN GENERATE 10,000S OR 100,000S OR 1,000,000S OF DYNAMIC URLS ON THE FLY WITH DATABASE 'FIELD BASED' CONTENT

DYNAMIC  CONTENT  CREATION  GROWS

ENTER FACETED NAVIGATION (WITH MANY # PATHS TO SAME CONTENT)

2003 – WE'RE AT 40 MILLION WEBSITES

2003 ONWARDS – USERS BEGIN TO JUMP ON THE CONTENT GENERATION BANDWAGON

LOTS OF CONTENT – IN MANY FORMS

"WE KNEW THE WEB WAS BIG…" (GOOGLE, 2008)

https://googleblog.blogspot.co.uk/2008/07/we-knew-web-was-big.html

"1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!" (Jesse Alpert on Google's Official Blog, 2008)

2008 – EVEN GOOGLE ENGINEERS STOPPED IN AWE

2010 – USER GENERATED CONTENT GROWS

"Let me repeat that: we create as much information in two days now as we did from the dawn of man through 2003"

"The real issue is user-generated content." (Eric Schmidt, 2010 – Techonomy Conference Panel)

SOURCE: http://techcrunch.com/2010/08/04/schmidt-data/

Indexed Web contains at least 4.73 billion pages (13/11/2015)

CONTENT KEEPS GROWING

[Chart: total number of websites, 2000-2014, rising towards 1,000,000,000]

THE NUMBER OF WEBSITES DOUBLED IN SIZE BETWEEN 2011 AND 2012, AND AGAIN BY 1/3 IN 2014

EVEN SIR TIM BERNERS-LEE (Inventor of www) TWEETED

2014 – WE PASS A BILLION INDIVIDUAL WEBSITES ONLINE

2014 – WE ARE ALL PUBLISHERS

SOURCE: http://wordpress/activity/posting

"Bloody brands becoming bloody publishers… Grumble grumble content marketing grumble." (Jono Alderson, Twitter)

EVEN  WETHERSPOONS

Grab your copy of the Wetherspoon, Smarties or Greggs News today

"Big lols" ;pppppp

WHO KNEW?

"Grab your copy of the Wetherspoon News today." (Wetherspoons, Twitter)

WETHERSPOON NEWS

"Big lols" ;pppppp

ALL THE FACTS AND VITAL OPINION

YUP - WE ALL 'LOVE CONTENT' – A LOT

http://www.internetlivestats.com/total-number-of-websites/

"As of the end of 2003, the WWW is believed to include well in excess of 10 billion distinct documents or web pages, while a search engine may have a crawling capacity that is less than half as many documents" (MANY GOOGLE PATENTS)

EVERYTHING HAS A FINITE LIMIT – CAPACITY LIMITATIONS – EVEN FOR SEARCH ENGINES

Source: Scheduler for search engine crawler, Google Patent US 8042112 B1 (Zhu et al)

"So how many unique pages does the web really contain? We don't know; we don't have time to look at them all! :-)"

(Jesse Alpert, Google, 2008)

Source: https://googleblog.blogspot.co.uk/2008/07/we-knew-web-was-big.html

NOT  ENOUGH  TIME

SOME  THINGS  MUST  BE  FILTERED

A  LOT  OF  THE  CONTENT  IS  ‘KIND  OF  THE  SAME’

“There’s  a  needle  in  here  somewhere”

“It’s  an  important  needle  too”

How have search engines responded to the capacity limits on Google's crawling system?

• By prioritising URLs for crawling
• By assigning crawl period intervals to URLs
• By creating work 'schedules' for Googlebots

WHAT IS THE SOLUTION?

"To keep within the capacity limits of the crawler, automated selection mechanisms are needed to determine not only which web pages to crawl, but which web pages to avoid crawling". – Scheduler for search engine crawler (Zhu et al)

GOOGLE CRAWL SCHEDULER PATENTS include:

• 'Managing items in a crawl schedule'
• 'Scheduling a recrawl'
• 'Web crawler scheduler that utilizes sitemaps from websites'
• 'Document reuse in a search engine crawler'
• 'Minimizing visibility of stale content in web searching including revising web crawl intervals of documents'
• 'Scheduler for search engine'

EFFICIENCY  IS  NECESSARY

CRAWL  BUDGET

1. Crawl Budget – "An allocation of crawl frequency visits to a host (IP LEVEL)"

2. Roughly proportionate to PageRank and host load / speed / host capacity

3. Pages with a lot of links get crawled more

4. The vast majority of URLs on the web don't get a lot of budget allocated to them (low to 0 PageRank URLs).

https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/

BUT…  MAYBE  THINGS  HAVE  CHANGED?

CRAWL BUDGET / CRAWL FREQUENCY IS NOT JUST ABOUT HOST-LOAD AND PAGERANK ANY MORE

STOP THINKING IT'S JUST ABOUT 'PAGERANK'

http://www.youtube.com/watch?v=GVKcMU7YNOQ&t=4m45s

"You keep focusing on PageRank"…

"There's a shit-ton of other stuff going on" (Illyes, G, Google - 2016)

THERE'S A LOT OF OTHER THINGS AFFECTING 'CRAWLING'

Transcript: https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/

WEB PROMOS Q & A WITH GOOGLE'S ANDREY LIPATTSEV

WHY? BECAUSE…

THE WEB GOT 'MAHOOOOOSIVE'

AND CONTINUES TO GET 'MAHOOOOOOSIVER'

SITES GOT MORE DYNAMIC, COMPLEX, AUTO-GENERATED, MULTI-FACETED, DUPLICATED, INTERNATIONALISED, BIGGER, BECAME PAGINATED AND SORTED

WE NEED MORE WAYS TO GET MORE EFFICIENT AND FILTER OUT TIME-WASTING CRAWLING SO WE CAN FIND IMPORTANT CHANGES QUICKLY

GOOGLEBOT'S TO-DO LIST GOT REALLY BIG

FURTHER IMPROVED CRAWLING EFFICIENCY SOLUTIONS NEEDED

• Hard and soft crawl limits
• Importance thresholds
• Min and max hints & 'hint ranges'
• Importance crawl periods
• Scheduling prioritization
• Tiered crawling buckets ('Real Time', 'Daily', 'Base Layer')

SEVERAL PATENTS UPDATED

• 'Managing URLs' (Alpert et al, 2013) (PAGE IMPORTANCE DETERMINING SOFT AND HARD LIMITS ON CRAWLING)

• 'Managing Items in a Crawl Schedule' (Alpert, 2014)

• 'Scheduling a Recrawl' (Auerbach, Alpert, 2013) (PREDICTING CHANGE FREQUENCY IN ORDER TO SCHEDULE THE NEXT VISIT, EMPLOYING HINTS (Min & Max))

• 'Minimizing visibility of stale content in web searching including revising web crawl intervals of documents' (INCLUDES EMPLOYING HINTS TO DETECT PAGES 'NOT' TO CRAWL)

(THESE SEEM TO WORK TOGETHER)

MANAGING ITEMS IN A CRAWL SCHEDULE (GOOGLE PATENT) – 3 TIERED SCHEDULING FOR GOOGLEBOTS

3 layers / tiers / buckets for scheduling; URLs are moved in and out of layers based on past visits data:

• Real Time Crawl – crawled multiple times daily
• Daily Crawl – crawled daily or bi-daily
• Base Layer Crawl (most unimportant) – crawled least; split into segments on random rotation, with only the 'active' segment crawled on a 'round robin' basis
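A minimal sketch (Python) of the tiering idea above, not Google's implementation: each URL record carries an importance score and observed change data and is bucketed into a real-time, daily or base-layer crawl. The thresholds and field names are invented for illustration.

# Toy illustration only - NOT Google's code. It sketches the idea from
# 'Managing items in a crawl schedule': URLs sit in one of three layers
# based on an importance score and observed change data. Thresholds and
# field names here are invented for the example.

from dataclasses import dataclass

@dataclass
class UrlRecord:
    url: str
    importance: float    # assumed 0-1 composite importance score
    change_rate: float   # assumed critical-change events per day

def assign_layer(record: UrlRecord) -> str:
    """Assign a URL to a crawl layer (hypothetical thresholds)."""
    if record.importance > 0.8 and record.change_rate >= 1.0:
        return "real-time"    # crawled multiple times daily
    if record.importance > 0.5:
        return "daily"        # crawled daily or bi-daily
    return "base-layer"       # segmented, round-robin crawling

if __name__ == "__main__":
    for r in [UrlRecord("https://example.com/", 0.9, 3.0),
              UrlRecord("https://example.com/category/widgets", 0.6, 0.2),
              UrlRecord("https://example.com/about-us", 0.2, 0.01)]:
        print(r.url, "->", assign_layer(r))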

CAN WE ESCAPE THE 'BASE LAYER' CRAWL BUCKET RESERVED FOR 'UNIMPORTANT' URLS?

10 types of Googlebot

SOME OF THE MAJOR SEARCH ENGINE CHARACTERS

History  Logs  /  History  Server

The  URL  Scheduler  /  Crawl  Manager

HISTORY LOGS / HISTORY SERVERS

HISTORY LOGS / HISTORY SERVER - Builds a picture of historical data and past behaviour of the URL and its 'importance' score to predict and plan for future crawl scheduling

• Last crawled date
• Next crawl due
• Last server response
• Page importance score
• Collaborates with link logs
• Collaborates with anchor logs
• Contributes info to scheduling
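A hypothetical sketch of what one history-log record could hold, mirroring the fields listed above; the schema and field names are assumptions for illustration, not the patent's actual structure.

# Hypothetical history-log entry mirroring the fields listed above
# (last crawl, next crawl due, last response, importance score).
# Field names are assumptions, not the patent's schema.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class HistoryLogEntry:
    url: str
    last_crawled: datetime
    next_crawl_due: datetime
    last_server_response: int     # e.g. 200, 301, 404, 500
    page_importance_score: float  # query-independent importance
    content_checksum: str         # used to detect change between visits

entry = HistoryLogEntry(
    url="https://example.com/widgets/",
    last_crawled=datetime(2017, 1, 20, 9, 30),
    next_crawl_due=datetime(2017, 1, 27, 9, 30),
    last_server_response=200,
    page_importance_score=0.62,
    content_checksum="9f86d081884c7d65",
)
print(entry)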

'BOSS' - URL SCHEDULER / URL MANAGER - JOBS

Think of it as Google's line manager or 'air traffic controller' for Googlebots in the web crawling system

• Schedules Googlebot visits to URLs
• Decides which URLs to 'feed' to Googlebot
• Uses data from the history logs about past visits (change rate and importance)
• Calculates the importance crawl threshold
• Assigns visit regularity of Googlebot to URLs
• Drops 'max and min hints' to Googlebot to guide on types of content NOT to crawl, or to crawl as exceptions
• Excludes some URLs from schedules
• Assigns URLs to 'layers / tiers' for crawling schedules
• Checks URLs for 'importance', 'boost factor' candidacy and 'probability of modification'
• Budgets are allocated to IPs and shared amongst the domains there

GOOGLEBOT - CRAWLER - JOBS

• 'Ranks nothing at all'
• Takes a list of URLs to crawl from the URL Scheduler
• Runs errands & makes deliveries for the URL server, indexer / ranking engine and logs
• Makes notes of outbound linked pages and additional links for future crawling
• Follows directives (robots) and takes 'hints' when crawling
• Tells tales of URL accessibility status and server response codes, notes relationships between links and collects content checksums (binary data equivalent of web content) for comparison with past visits by the history and link logs
• Will go beyond the crawl schedule if it finds something more important than the URLs scheduled

WHAT MAKES THE DIFFERENCE BETWEEN BASE LAYER AND 'REAL TIME' SCHEDULE ALLOCATION?

CONTRIBUTING FACTORS

1. Page importance (which may include PageRank)

2. Hints (max and min)

3. Soft limits and hard crawl limits

4. Host load capability & past site performance (speed and access) (IP level and domain level within)

5. Probability / predictability of 'CRITICAL MATERIAL' change + importance crawl period

1 - PAGE IMPORTANCE - Page importance is the importance of a page independent of a query

• Location in site (e.g. home page more important than parameter 3 level output)
• PageRank
• Page type / file type
• Internal PageRank
• Internal backlinks (IBP)
• In-site anchor text consistency
• Relevance (content, anchors and elements) to a topic (ONTOLOGY) (Similarity Importance)
• Directives from in-page robots and robots.txt management
• Parent quality brushes off on child page quality - IMPORTANT PARENTS LIKELY SEEN TO HAVE IMPORTANT CHILD PAGES
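Page importance, then, is best read as a composite of signals like these. A toy Python blend follows purely to make that concrete; the weights, scaling and signal names are invented and are not Google's formula.

# Toy illustration only: 'importance' as a weighted blend of assumed signals.
# Weights and scaling are invented; Google's actual formula is not public.

def toy_importance(pagerank, internal_backlinks, anchor_consistency,
                   topical_relevance, depth):
    """Blend assumed signals into a rough 0-1 'importance' score."""
    link_signal = min(internal_backlinks / 100.0, 1.0)  # saturate at 100 links
    depth_signal = 1.0 / (1 + depth)                    # home page (depth 0) = 1.0
    return round(0.35 * pagerank
                 + 0.25 * link_signal
                 + 0.15 * anchor_consistency
                 + 0.15 * topical_relevance
                 + 0.10 * depth_signal, 3)

print(toy_importance(0.7, 40, 0.8, 0.9, 1))  # a well-linked category page
print(toy_importance(0.1, 2, 0.3, 0.4, 4))   # a deep, weakly linked URL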

2 - HINTS - 'MIN' HINTS & 'MAX' HINTS

MIN HINT / MIN HINT RANGES
• e.g. Programmatically generated content which changes the content checksum on load
• Unimportant duplicate parameter URLs
• Canonicals
• Rel=next, rel=prev
• Hreflang
• Duplicate content
• Spammy URLs?
• Objectionable content

MAX HINT / MAX HINT RANGES
• Change considered 'CRITICAL MATERIAL CHANGE' (useful to users, e.g. availability, price) and / or improved site sections, or change to IMPORTANT but infrequently changing content
• Important pages / page range updates

E.G. rel="prev" and rel="next" act as hints to Google, not absolute directives

https://support.google.com/webmasters/answer/1663744?hl=en&ref_topic=4617741

3 - HARD AND SOFT LIMITS ON CRAWLING

If URLs are discovered during crawling that are more important than those scheduled to be crawled, then Googlebot can go beyond its schedule to include these, up to a hard crawl limit

'Soft' crawl limit is set (the original schedule)

'Hard' crawl limit is set (E.G. 130% of schedule)

FOR IMPORTANT FINDINGS
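A small worked example of that soft/hard limit arithmetic, using the patent's illustrative 130% figure (the URL counts are made up):

# Worked example: if the scheduled (soft) crawl for a host is 1,000 URLs and
# the hard limit is e.g. 130% of schedule, Googlebot may fetch up to 300 extra
# URLs it judges important enough. Figures are illustrative only.

soft_limit = 1000                   # URLs in the original schedule
hard_limit = int(soft_limit * 1.3)  # e.g. 130% of schedule

print("Soft limit:", soft_limit, "URLs")
print("Hard limit:", hard_limit, "URLs")
print("Room for important unscheduled finds:", hard_limit - soft_limit, "URLs")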

4 – HOST LOAD CAPACITY / PAST SITE PERFORMANCE

Googlebot has a list of URLs to crawl

Naturally, if your site is fast that list can be crawled quicker

If Googlebot experiences 500s, for example, she will retreat & 'past performance' is noted

If Googlebot doesn't get 'round the list' you may end up with 'overdue' URLs to crawl
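A rough, illustrative calculation of why speed matters here: with a fixed time window for a host (the window and timings below are assumed figures), slower responses mean fewer of the scheduled URLs get fetched and more end up 'overdue'.

# Illustrative only: average response time caps how many scheduled URLs fit
# into a fixed crawl window for the host. All numbers are invented.

schedule = 5000   # URLs Googlebot planned to visit
budget_s = 600.0  # seconds assumed available for this host

for avg_response_s in (0.2, 0.5, 1.5):
    crawled = int(budget_s / avg_response_s)
    overdue = max(schedule - crawled, 0)
    print(f"{avg_response_s:.1f}s per URL -> {crawled} crawled, {overdue} overdue")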

SO  WHAT?

5 - CHANGE

Not  all  change  is  considered  equal

5 - CHANGE

WHAT MATTERS IS 'CRITICAL MATERIAL CHANGE'

Features are weighted for change importance to the user (e.g. price > colour)

5 - CHANGE

What  is  the  ‘importance  crawl  period’  set  for  your  URL?

5 - CHANGE - SO WHAT?

Is your URL's 'change rate' much higher than your 'importance crawl period'?

5 - CHANGE

Random  shuffling  is  useless  if  your  URL  is  unimportant

"shuffle($variable), rand($variable)" === FAIL on 'CRITICAL MATERIAL CHANGE'

It's the same content, just in a different order. MEH.

Your URL may even trip 'hints'…

And get visited less
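The point is easy to see with a content checksum: reshuffling the same items changes the checksum on every load, yet nothing material has changed, whereas a price edit is a genuine material change. A small Python illustration (the product list is invented):

# Shuffled output changes the checksum but not the content; a price edit is
# the kind of 'critical material change' that actually matters.

import hashlib

def checksum(items):
    return hashlib.md5("".join(items).encode()).hexdigest()[:12]

products = ["blue widget £9.99", "red widget £9.99", "green widget £9.99"]

reordered = list(reversed(products))             # same items, different order
repriced = ["blue widget £7.99"] + products[1:]  # a material change

print("original: ", checksum(products))
print("reordered:", checksum(reordered))  # differs, but nothing has changed
print("repriced: ", checksum(repriced))   # differs for a reason users care about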

5 - CHANGE “I  know  your  game  buddy”

5 - CHANGE - GUESS WHAT? CHANGE ON THE CNN HOME PAGE IS KIND OF MORE IMPORTANT THAN YOUR 'ABOUT US' PAGE

#WHOKNEW?

Hence – a 'Real Time API' for 'news sites', to avoid 'The Embarrassment Factor'

5 - CHANGE

• There are many dynamic sites with low importance pages changing frequently – SO WHAT
• Constantly changing your page just to get Googlebot back won't work if the page is low importance (crawl importance period < change rate) - POINTLESS
• Hints are employed to determine pages which simply change the content checksum with every visit
• Don't just try to randomise things to catch Googlebot's eye
• That counter or clock you added probably isn't going to help you get more attention, nor will random or shuffle
• Change on some types of pages is more important than on other pages (e.g. CNN home page > SME about us page)

FACTORS AFFECTING HIGHER GOOGLEBOT VISIT FREQUENCY

• Current capacity of the web crawling system is high
• Your URL has a high 'importance score'
• Your URL is in the real time (HIGH IMPORTANCE), daily crawl (LESS IMPORTANT) or 'active' base layer segment (UNIMPORTANT BUT SELECTED)
• Your URL changes a lot with CRITICAL MATERIAL CONTENT change (AND IS IMPORTANT)
• Probability and predictability of CRITICAL MATERIAL CONTENT change is high for your URL (AND THE URL IS IMPORTANT)
• Your website speed is fast and Googlebot gets the time to visit your URL on its bucket list of scheduled URLs for that visit
• Your URL has been 'upgraded' to a daily or real time crawl layer as its importance is detected as raised
• History logs and the URL Scheduler 'learn' together

FACTORS AFFECTING LOWER GOOGLEBOT VISIT FREQUENCY

• Current capacity of the web crawling system is low
• Your URL has been detected as a 'spam' URL
• Your URL is in an 'inactive' base layer segment (UNIMPORTANT)
• Your URLs are 'tripping hints' built into the system to detect non-critical change dynamic content
• Probability and predictability of critical material content change is low for your URL
• Your website speed is slow and Googlebot doesn't get the time to visit your URL
• Your URL has been 'downgraded' to an 'inactive' base layer (UNIMPORTANT) segment
• Your URL has returned an 'unreachable' server response code recently
• In-page robots management or robots.txt send the wrong signals

GET  MORE  CRAWL  BY  ‘TURNING  GOOGLEBOT’S  HEAD’  – MAKE  YOUR  URLs  MORE  IMPORTANT  AND  ‘EMPHASISE’ IMPORTANCE

GOOGLEBOT DOES AS SHE'S TOLD – WITH A FEW EXCEPTIONS

• Hard limits and soft limits
• Follows 'min' and 'max' hints
• If she finds something important she will go beyond a scheduled crawl (SOFT LIMIT) to seek out importance (TO HARD LIMIT)
• You need to IMPRESS Googlebot
• If you 'bore' Googlebot she will return to boring URLs less (e.g. pages that are all the same (duplicate content) or dynamically generated low usefulness content)
• If you 'delight' Googlebot she will return to delightful URLs more (they became more important or they changed with 'CRITICAL MATERIAL CHANGE')
• If she doesn't get her crawl completed you will end up with an 'overdue' list of URLs to crawl

GETTING MORE CRAWL BY IMPROVING PAGE IMPORTANCE

• Your URL became more important and achieved a higher 'importance score' via increased PageRank
• Your URL became more important via increased IB(P) (INTERNAL BACKLINKS IN YOUR OWN SITE) relative to other URLs within your site (you emphasised importance)
• You made the URL content more relevant to a topic and improved the importance score
• The parent of your URL became more important (E.G. IMPROVED TOPIC RELEVANCE (SIMILARITY), PageRank OR local (in-site) importance metric)
• THE 'IMPORTANCE SCORE' OF SOME OF YOUR URLS EXCEEDED THE 'IMPORTANCE SOFT LIMIT THRESHOLD', SO THEY ARE INCLUDED FOR CRAWLING AND VISITED UP TO THE 'HARD LIMIT' (E.G. 130% OF SCHEDULED CRAWLING)

HOW  DO  WE  DO  THIS?

INCREASE  URL  ‘IMPORTANCE’

AS BASE LAYER URLS BECOME MORE IMPORTANT THEY WILL BE CRAWLED MORE… AND GOOD THINGS HAPPEN

THEY ARE PROMOTED TO THE 'DAILY' OR 'REAL TIME' CRAWL LAYER

40,000+ towns, cities and villages across the UK multiplied by X site categories (THAT'S A LOT OF LONG TAIL QUERY VOLUME)

TO DO - FIND GOOGLEBOT - AUTOMATE SERVER LOG RETRIEVAL VIA A CRON JOB

grep Googlebot access_log > googlebot_access.txt

ANALYSE THE LOGS
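A minimal sketch of analysing that googlebot_access.txt file: count Googlebot hits per URL. It assumes a common/combined Apache or Nginx log format ("METHOD /path HTTP/1.1" in the request field), so adjust the pattern for your server.

# Count Googlebot hits per URL from the grep output above.
# Assumes common/combined log format; tweak the regex for your server.

import re
from collections import Counter

hits = Counter()
pattern = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

with open("googlebot_access.txt") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            hits[match.group(1)] += 1

# Most-crawled URLs: are these the ones you want Googlebot spending time on?
for url, count in hits.most_common(20):
    print(f"{count:6d}  {url}")

Compare the most- and least-crawled URLs against the pages you actually care about.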

LOOK THROUGH SPIDER-EYES - PREPARE TO BE HORRIFIED

• Incorrect URL header response codes
• 301 redirect chains
• Old files or XML sitemaps left on the server from years ago
• Infinite / endless loops (circular dependency)
• On parameter driven sites, URLs crawled which produce the same output
• AJAX content fragments pulled in alone
• URLs generated by spammers
• Dead image files being visited
• Old CSS files still being crawled and loading EVERYTHING
• You may even see 'mini' abandoned projects within the site
• Legacy URLs generated by long forgotten .htaccess regex pattern matching
• Googlebot hanging around in your 'ever-changing' blog but nowhere else

URL CRAWL FREQUENCY 'CLOCKING'

Spreadsheet provided by @johnmu during a Webmaster Hangout - https://goo.gl/1pToL8

Identify your 'real time', 'daily' and 'base layer' URLs - ARE THEY THE ONES YOU WANT THERE? WHAT IS BEING SEEN AS UNIMPORTANT?

NOTE GOOGLEBOT

Do you recognise all the URLs and URL ranges that are appearing? If not… why not?
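One rough way to do that 'clocking' from your own logs: work out Googlebot visits per day for each URL over a known date range and bucket them. The thresholds below are invented proxies, not Google's actual layer boundaries.

# Bucket URLs by observed Googlebot visit frequency as a rough proxy for
# 'real time' / 'daily' / 'base layer' treatment. Thresholds are invented.

def crawl_bucket(visits_per_day):
    if visits_per_day >= 2:
        return "real-time-ish"
    if visits_per_day >= 0.5:   # roughly daily or bi-daily
        return "daily-ish"
    return "base-layer-ish"

observed = {                    # visits/day from your own log analysis
    "/": 6.0,
    "/blog/": 1.2,
    "/category/widgets?sort=price&page=7": 0.03,
}
for url, rate in observed.items():
    print(f"{crawl_bucket(rate):15s} {url}")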

IMPROVE & EMPHASISE PAGE IMPORTANCE

• Cross modular internal linking
• Canonicalization
• Important URLs in XML sitemaps
• Anchor text target consistency (but not spammy repetition of anchors everywhere (it's still output))
• Internal links in the right descending order – emphasise IMPORTANCE
• Reduce boiler plate content and improve relevance of content and elements to the specific topic (if category) / product (if product page) / subcategory (if subcategory)
• Reduce duplicate content parts of the page to allow primary targets to take 'IMPORTANCE'
• Improve parent pages to raise the IMPORTANCE reputation of the children rather than over-optimising the child pages and cannibalising the parent
• Improve content as more 'relevant' to a topic to increase 'IMPORTANCE' and get reassigned to a different crawl layer
• Flatten 'architectures'
• Avoid content cannibalisation
• Link relevant content to relevant content
• Build strong highly relevant 'hub' pages to tie together strength & IMPORTANCE

LOCAL  ‘IMPORTANCE’  (IBP)

LOCAL IMPORTANCE IN DESCENDING ORDER (ROUGHLY)

https://support.google.com/webmasters/answer/138752?hl=en

Most Important Page 1

Most  Important  Page  2

Most  Important  Page  3

IS THIS YOUR BLOG?? HOPE NOT

#BIGSITEPROBLEMS – INTERNAL BACKLINKS SKEWED

IMPORTANCE DISTORTED BY DISPROPORTIONATE INTERNAL LINKING -LOCAL IB (P) – INTERNAL BACKLINKS

THE PARENTS REPUTATION BRUSHES OFF ON THE KIDS

[Site architecture diagram: Root → Category → Subcategory → Product pages]

MAKE  CATEGORY  AND  SUBCATEGORY  PARENTS  AWESOME

PRODUCT  PAGES  FROM  AWESOME  PARENT  CATEGORIES  BECOME  MORE  IMPORTANT

OR MAKE AN AWESOME ‘FAMILY GATHERING’ OF HIGHLY RELATED ‘NEEDS MET’ CONTENT IN A ‘HUB’

[Hub diagram: a 'HELP HUB' connecting FAQ, Guides, Support Team, Tutorials, Find a Live Class and Get Started pages]

MAKE AWESOME HUB PAGES – MAKE AWESOME 'BRIDGES' TO SIGNAL IMPORTANCE

IDENTIFY 'NEEDS' AND TARGET A STARTING 'HUB' PAGE TO CONNECT RELATED 'BROTHERS, SISTERS, AUNTIES, UNCLES & GRANNY' URLS

AWESOMENESS ON CATEGORY PAGES IS NOT JUST REWRITING COMPETITOR CONTENT

[Site architecture diagram: Root → Category → Subcategory → Product pages]

PRODUCT  PAGES  FROM  AWESOME  PARENT  CATEGORIES  BECOME  MORE  IMPORTANT

’ADD  VALUE’

WHERE  IS  THE  ‘CRITICAL  MATERIAL  DIFFERENCE’??

ADD 'CRITICAL MATERIAL VALUE' - WHAT IS MISSING?

ADD  ‘CRITICAL  MATERIAL  DIFFERENCE’

HELP HUB HERO

What  more  can  you  add  to  the  existing  offerings  out  there?

What  is  the  user  seeking  now?

Answer  questions Engage  community Wow  transactional

EMPHASISE IMPORTANCE VIA SIBLING VOTES

[Site architecture diagram: sibling pages under the same Root → Category → Subcategory → Product structure linking to one another]

TRIP  ‘MAX  HINTS’  NOT  ‘MIN  HINTS’

“Hold  the  diary…  I  found  some  unexpected  stuff  which  is  more  important  than  I  planned  to  see  today…  I’ll  be  here  a  while  longer”

BUT…  BE  CAREFUL

WRONG  TARGET  RANKING

SKEWED  AWESOMENESS

ADDRESS SKEWED INTERNAL LINKING VIA 'AUNTIE & UNCLE' INTERNAL LINKING

[Site architecture diagram: Root → Category → Subcategory → Product pages, with 'auntie & uncle' cross-links at a template level carrying most internal links]

USE COMPOUNDING 'HELP', 'HUB', 'HERO' FAMILY MEMBERS

[Diagram: a root theme with compounding 'Hero', 'Hub' and 'Help' family members – Hero intent: sell product (convince), entertain / inspire (transactional & brand hero subs); Hub intent: help; Help intent: inform (answer questions)]

STRONG LOCAL IMPORTANCE

EMPHASISE IMPORTANCE WISELY

USE CUSTOM XML SITEMAPS

E.G. XML UNLIMITED SITEMAP GENERATOR

PUT IMPORTANT URLS IN HERE

IF EVERYTHING IS IMPORTANT THEN IMPORTANCE IS NOT DIFFERENTIATED

KEEP CUSTOM SITEMAPS 'CURRENT' AUTOMATICALLY

AUTOMATE UPDATES WITH CRON JOBS OR WEB CRON JOBS

IT'S NOT AS TECHNICAL AS YOU MAY THINK – USE WEB CRON JOBS

BE 'PICKY' ABOUT WHAT YOU INCLUDE IN XML SITEMAPS

EXCLUDE AND INCLUDE CRAWL PATHS IN XML SITEMAPS TO EMPHASISE IMPORTANCE
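A minimal sketch of generating such a custom 'important URLs only' sitemap, which a cron or web cron job could regenerate on a schedule; the URL list is a placeholder for whatever your CMS or database says is important.

# Write a custom XML sitemap containing only the URLs you want to emphasise.
# A (web) cron job could rerun this daily. The URL list is a placeholder.

from datetime import date
from xml.sax.saxutils import escape

important_urls = [
    "https://www.example.com/",
    "https://www.example.com/category/widgets/",
    "https://www.example.com/help-hub/",
]

today = date.today().isoformat()
lines = ['<?xml version="1.0" encoding="UTF-8"?>',
         '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
for url in important_urls:
    lines.append(f"  <url><loc>{escape(url)}</loc><lastmod>{today}</lastmod></url>")
lines.append("</urlset>")

with open("sitemap-important.xml", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))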

IF YOU CAN'T IMPROVE - EXCLUDE (VIA NOINDEX) FOR NOW

• You're out for now
• When you improve you can come back in
• Tell Googlebot quickly that you're out (via temporary XML sitemap inclusion)
• But 'follow', because there will be some relevance within these URLs
• Include again when you've improved
• Don't try to canonicalize me to something in the index

OR REMOVE – 410 GONE (IF IT'S NEVER COMING BACK)

http://faxfromthefuture.bandcamp.com/track/410-gone-acoustic-demo

EMBRACE THE '410 GONE'

There's even a song about it

#BIGSITEPROBLEMS – LOSE THE INDEX BLOAT

LOSE THE BLOAT TO INCREASE THE CRAWL - The number of unimportant URLs indexed extends far beyond the available importance crawl threshold allocation

Tags: I, must, tag, this, blog, post, with, every, possible, word, that, pops, into, my, head, when, I, look, at, it, and, dilute, all, relevance, from, it, to, a, pile, of, mush, cow, shoes, sheep, the, and, me, of, it

Image Credit: Buzzfeed

Creating 'thin' content and even more URLs to crawl

#BIGSITEPROBLEMS - LOSE THE CRAZY TAG MAN

Most Important Page 1

Most  Important  Page  2

Most  Important  Page  3

IS THIS YOUR BLOG?? HOPE NOT

#BIGSITEPROBLEMS – INTERNAL BACKLINKS SKEWED

IMPORTANCE DISTORTED BY DISPROPORTIONATE INTERNAL LINKING -LOCAL IB (P) – INTERNAL BACKLINKS

Optimize Everything: I must optimize ALL the pages across a category's descendants for the same terms as my primary target category page, so that each of them is of almost equal relevance to the target page and confuses crawlers as to which is the important one. I'll put them all in a sitemap as standard too, just for good measure.

Image Credit: Buzzfeed

HOW CAN SEARCH ENGINES KNOW WHICH IS MOST IMPORTANT TO A TOPIC IF 'EVERYTHING' IS IMPORTANT??

#BIGSITEPROBLEMS - WARNING SIGNS – LOSE THE ‘MISTER OVER-OPTIMIZER’

‘OPTIMIZE  ALL  THE  THINGS’

Duplicate Everything: I must have a massive boiler plate area in the footer, identical sidebars and a massive mega menu with all the same output in it sitewide. I'll put very little unique content into the page body and it will also look very much like its parents and grandparents too. From time to time I'll outrank my parent and grandparent pages but 'Meh'…

Image Credit: Buzzfeed

HOW CAN SEARCH ENGINES KNOW WHICH IS THE MOST IMPORTANT PAGE IF ALL ITS CHILDREN AND GRANDCHILDREN ARE NEARLY THE SAME??

#BIGSITEPROBLEMS - WARNING SIGNS – LOSE THE ‘MISTER DUPLICATER’

‘DUPLICATE  ALL  THE  THINGS’

IMPROVE SITE PERFORMANCE - HELP GOOGLEBOT GET THROUGH THE ‘BUCKET LIST’ – GET FAST AND RELIABLE

Avoid  wasting  time  on  ‘overdue-­‐URL’  crawling  (E.G.  Send  correct  response  codes,  speed  up  your  site,  etc)

US 8,666,964 B1 ('Managing items in a crawl schedule')

Added to the Cloudflare CDN: ½ the time, > 2 x page crawls per day

Watch out for CDNs though – it's a shared IP (shared budget / capacity??)

GOOGLEBOT  GOES  WHERE  THE  ACTION  IS

USE  ‘ACTION’  WISELY

DON’T  TRY  TO  TRICK  GOOGLEBOT  BY  FAKING  ‘FRESHNESS’  ON  LOW  IMPORTANCE  PAGES  – GOOGLEBOT  WILL  REALISE

UPDATE  IMPORTANT  PAGES  OFTEN

NURTURE  SEASONAL  URLs  TO  GROW  IMPORTANCE  WITH  FRESHNESS  (regular  updates)  &  MATURITY  (HISTORY)

DON’T  TURN  GOOGLEBOT’S  HEAD  INTO  THE  WRONG  PLACES

Image  Credit:  Buzzfeed

’GET FRESH’ AND STAY ‘FRESH’

‘BUT  DON’T  TRY  TO  FAKE  FRESH  &  USE  FRESH  WISELY’

IMPROVE TO GET THE HARD LIMITS ON CRAWLING

By improving your URL importance on an ongoing basis via increased PageRank, content improvements (e.g. quality hub pages), internal link strategies, IB(P) and restructuring, you can get to the 'hard limit' or simply get visited more generally

CAN IMPROVING YOUR SITE HELP TO ‘OVERRIDE’ SOFT LIMIT CRAWL PERIODS SET?

YOU THINK IT DOESN’T MATTER… RIGHT?

YOU SAY…

"GOOGLE WILL WORK IT OUT"

"LET'S JUST MAKE MORE CONTENT"

WRONG  – ‘CRAWL  TANK’  IS  UGLY

WRONG  – CRAWL  TANK  CAN  LOOK  LIKE  THIS

SITE SEO DEATH BY TOO MANY URLS AND INSUFFICIENT CRAWL BUDGET TO SUPPORT THEM (EITHER DUMPING A NEW 'THIN' PARAMETER INTO A SITE, OR AN INFINITE LOOP (CODING ERROR) (SPIDER TRAP))

WHAT’S  WORSE  THAN  AN  INFINITE  LOOP?

‘A  LOGICAL  INFINITE  LOOP’

IMPORTANCE DISTORTED BY BADLY CODED PARAMETERS GENERATING ‘JUNK’ OR EVEN WORSE PULLING LOGIC TO CRAWLERS BUT NOT HUMANS

WRONG – SITE DROWNED IN ITS OWN SEA OF UNIMPORTANT URLS

VIA 'EXPONENTIAL URL UNIMPORTANCE' - Your URLs are exponentially confirmed unimportant with each iterative crawl visit to other similar or duplicate content checksum URLs. Fewer and fewer internal links and 'thinner and thinner' relevant content.

MULTIPLE RANDOM URLs competing for the same query confirm the irrelevance of all the competing in-site URLs, with no dominant single relevant IMPORTANT URL

WRONG  – ‘SENDING  WRONG  SIGNALS  TO  GOOGLEBOT’  COSTS  DEARLY

(Source:Sistrix)

"2015 was the year where website owners managed to be mostly at fault, all by themselves" (Sistrix 2015 Organic Search Review, 2016)

WRONG - NO-ONE IS EXEMPT

(Source:Sistrix)

“It  doesn’t  matter  how  big  your  brand  is  if  you  ‘talk  to  the  spider’  (Googlebot)  wrong  ”  – You  can  still  ‘tank’

WRONG  – GOOGLE  THINKS  SEOS  SHOULD  UNDERSTAND  CRAWL  BUDGET

"EMPHASISE IMPORTANCE" - "Make sure the right URLs get on Googlebot's menu and increase URL importance to build Googlebot's appetite for your site more"

Dawn  Anderson  @  dawnieando

SORT OUT CRAWLING

THANK YOU - Dawn Anderson @ dawnieando
TWITTER - @dawnieando
GOOGLE+ - +DawnAnderson888
LINKEDIN - msdawnanderson

REFERENCES

Efficient Crawling Through URL Ordering (Page et al) - http://oak.cs.ucla.edu/~cho/papers/cho-order.pdf
Crawl Optimisation (Blind Five Year Old – A J Kohn - @ajkohn) - http://www.blindfiveyearold.com/crawl-optimization
Scheduling a recrawl (Auerbach) - http://www.google.co.uk/patents/US8386459
Scheduler for search engine crawler (Zhu et al) - http://www.google.co.uk/patents/US8042112
Google Explains Why The Search Console Reporting Is Not Real Time (SERoundtable) - https://www.seroundtable.com/google-explains-why-the-search-console-has-reporting-delays-21688.html
Crawl Data Aggregation Propagation (Mueller) - https://goo.gl/1pToL8
Matt Cutts Interviewed By Eric Enge - https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
Web Promo Q and A with Google's Andrey Lipattsev - https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/
Google Number 1 SEO Advice – Be Consistent - https://www.seroundtable.com/google-number-one-seo-advice-be-consistent-21196.html

REFERENCES

Internet Live Stats - http://www.internetlivestats.com/total-number-of-websites/
Scheduler for search engine crawler, Google Patent US 8042112 B1 (Zhu et al) - https://www.google.com/patents/US8707313
Managing items in crawl schedule – Google Patent (Alpert) - http://www.google.ch/patents/US8666964
Document reuse in a search engine crawler - Google Patent (Zhu et al) - https://www.google.com/patents/US8707312
Web crawler scheduler that utilizes sitemaps (Brawer et al) - http://www.google.com/patents/US8037054
Distributed crawling of hyperlinked documents (Dean et al) - http://www.google.co.uk/patents/US7305610
Minimizing visibility of stale content (Carver) - http://www.google.ch/patents/US20130226897

REFERENCES

https://www.sistrix.com/blog/how-nordstrom-bested-zappos-on-google/
https://www.xml-sitemaps.com/generator-demo/