
SEO (Search Engine Optimization)

Dragiša Miljković
Department of electrical engineering and computer science
Faculty of technical sciences
University of Priština

What will we do today?

This talk discusses just the most important and interesting ideas in Search Engine Optimization. SEO is the process of improving the visibility of a website within search results, so that it becomes easier to find, more relevant to search engines, and more accessible to search engine crawlers. In short, it is the process of improving a website's search engine rank position.

Who is this guy?

M.Sc. Eng. Dragiša Miljković
Teaching fellow at the Department of electrical engineering and computer science

Faculty of technical sciences
University of Priština
Serbia

Basic terminology

A search engine is a software system used for searching for information on the WWW. The user enters key phrases of interest into the search field, and the search engine returns web content results in a list of so-called search engine results pages (SERPs). These results can be a mixture of different web pages, images, videos, and other file types. The web is perpetually changing, so search engines must maintain near real-time indexing. This is done by constantly running a web crawler.

SERP

A web search engine processes a keyword query submitted by a “searcher” and, in response, presents SERPs. A SERP consists of the list of results (usually ordered by relevance to the query) returned by the search engine, but it may also contain other results, such as advertisements. There are two main categories of results: organic search (returned by the search engine's algorithm) and sponsored search (i.e. advertisements).
– A result is displayed with a title, a hyperlink to that web page, and a brief description (which shows how that result matches the query).

http://www.unibo.it/it SERP

Context matters

The Internet is a curious place

There are parts of the web that are not accessed by search engines:
– deep web, hidden from conventional search engines (e.g. behind a login form or by encryption);

– dark web, intentionally hidden from search engines; it uses masked IP addresses and is accessible only with a special web browser.
• Notice that the dark web is part of the deep web.

Search engines – how do they work?

A search engine consists of three processes:

– Crawler,

– Indexer,

– Query engine.

This is the technical part of SEO

Types of JavaScript

JavaScript may be used to enhance HTML & CSS
– It can be used to improve user experience and to add some functionality,

– This affects SEO very little.

JavaScript may be used to replace the content of a web page, i.e. its HTML and CSS
– In this way, web pages become web applications,
– This causes trouble for search engines.

Crawler

Also known as a (web) spider. This is an internet bot that browses the WWW for the purpose of web indexing. There are billions of web pages, so the crawler needs to run constantly on a large number of computers. The crawler takes the URLs from previous crawls and from XML sitemaps; it then tries to find new or updated pages and add them to the index, e.g. Google's (it detects SRC and HREF links).

– It extracts hyperlinks and adds them to the queue for crawling,
– It only retrieves a page if it is new or has changed, and removes the dead ones.
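The queue-based behaviour described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `crawl` function, the `FAKE_WEB` dictionary, and its paths are made up for the example, and a real crawler would fetch pages over HTTP, respect robots.txt, and re-check pages for changes.

```python
from collections import deque

# Hypothetical in-memory "web": each path maps to the links found on that page.
FAKE_WEB = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}


def crawl(seed_urls, fetch_links):
    """Breadth-first crawl: visit each URL once, queueing newly found links."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited_order = []
    while queue:
        url = queue.popleft()
        visited_order.append(url)
        for link in fetch_links(url):
            # Only URLs that have not been seen before are added to the queue.
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited_order


def fake_fetch_links(url):
    """Stand-in for downloading a page and extracting its hyperlinks."""
    return FAKE_WEB.get(url, [])
```

Calling `crawl(["/"], fake_fetch_links)` visits every reachable page exactly once, starting from the seed URL.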

Crawler

Criteria that matter to the crawler:
– How long does it take to load a web page?
– How important is that URL?
– Are there more hyperlinks on a given web page?

Crawler parses HTML

Some text
Some more text
– Some nested text
– Second nested element
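As a rough illustration of how a crawler might pull HREF and SRC links out of parsed HTML, here is a minimal Python sketch built on the standard library's `html.parser`. The `LinkExtractor` class, the sample markup, and the `www.example.com` paths are assumptions made for this example; they do not describe any real crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs found in href and src attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page: two ordinary links, one image, one link built in JavaScript.
SAMPLE_HTML = """
<p>Some text <a href="/it/ateneo">Some more text</a></p>
<ul>
  <li><a href="contatti.html">Some nested text</a></li>
  <li><img src="/img/logo.png" alt="Second nested element"></li>
</ul>
<button onclick="location.href='/hidden'">Link built in JavaScript</button>
"""
```

`extract_links(SAMPLE_HTML, "http://www.example.com/it/")` returns the three URLs from the `href` and `src` attributes. The `/hidden` target is not found, because it exists only inside a script string.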

Search engine cares only about URLs

A search engine does not care (too) much about the content of a web page as such; rather,
– it crawls, indexes, and ranks only URLs.
– A general rule: one piece of content should be associated with one URL!

URLs that the crawler returns are added to the search engine's index. When a user enters a query in a search engine, relevant results are returned based on the search engine's algorithm.

Robots exclusion protocol

Known simply as robots.txt. It must be in the top-level directory of the server. This is a standard which websites use to regulate which areas of the website web crawlers and other web robots are allowed to process and categorize. This is solely a standard, and not an enforced rule, so not all robots will comply.
– E-mail harvesters, malware, and spambots are even likely to start at the areas of the website that should be omitted.

Unibo robots.txt

http://www.unibo.it/robots.txt

User-agent: *
Disallow: /NR/exeres
Disallow: /NR/rdonlyres
Disallow: /it/allegati/allegati-non-indicizzati
Disallow: /en/attachments/unindexed-attachments
Disallow: /modello
Disallow: /modello-a
Disallow: /modello-b
Disallow: /uniboweb/sites/UniboSearch/results.aspx
Disallow: /UniboWeb/UniboSearch/results.aspx
Disallow: /_layouts
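To show how a well-behaved robot could interpret these rules, the following Python sketch feeds the file above into the standard library's `urllib.robotparser`. The `ExampleBot` user agent and the tested URLs are assumptions chosen for illustration. The rules are parsed from a string, so no network request is made.

```python
from urllib.robotparser import RobotFileParser

# The rules shown above, copied into a string so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /NR/exeres
Disallow: /NR/rdonlyres
Disallow: /it/allegati/allegati-non-indicizzati
Disallow: /en/attachments/unindexed-attachments
Disallow: /modello
Disallow: /modello-a
Disallow: /modello-b
Disallow: /uniboweb/sites/UniboSearch/results.aspx
Disallow: /UniboWeb/UniboSearch/results.aspx
Disallow: /_layouts
"""


def build_parser(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="ExampleBot"):
    """Return True if the (hypothetical) user agent may fetch the URL."""
    return parser.can_fetch(user_agent, url)


robots = build_parser(ROBOTS_TXT)
```

With these rules, `is_allowed(robots, "http://www.unibo.it/it")` is true, while any URL whose path starts with a disallowed prefix, such as `http://www.unibo.it/_layouts/settings.aspx`, is rejected. Compliance is voluntary: the check only helps robots that choose to run it.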

Sitemaps inclusion protocol

Sitemaps are a URL inclusion protocol. A webmaster can use sitemaps to provide search engines with the URLs on a website that are available for crawling. A sitemap is just an XML file which lists all the URLs on one website. To enable search engines to crawl web pages more efficiently, some additional metadata can be included:
– How important a given URL is (in relation to other URLs on the same website),

– When that URL was last updated,
– How frequently it changes, etc.
– It can also include information about specific types of content on the website (e.g. images, videos).
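The metadata listed above maps onto the `priority`, `lastmod`, and `changefreq` elements of the sitemaps.org XML format. The Python sketch below builds such a file with `xml.etree.ElementTree`. The `build_sitemap` helper and the entries passed to it are assumptions for illustration, not data from a real website.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap XML string from a list of dicts.

    Each dict needs a "loc" key; "lastmod", "changefreq" and "priority"
    are optional and are written only when present.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical entries; the URLs and the date are made up.
EXAMPLE_ENTRIES = [
    {
        "loc": "http://www.example.com/",
        "lastmod": "2018-03-05",
        "changefreq": "weekly",
        "priority": 1.0,
    },
    {"loc": "http://www.example.com/about"},
]

sitemap_xml = build_sitemap(EXAMPLE_ENTRIES)
```

In the sitemap format, `priority` ranges from 0.0 to 1.0 and is relative to other URLs on the same site, which matches the first bullet above.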

Sitemaps – Unibo website

JavaScript affects crawling

JavaScript frameworks are used to create interactive web pages and to control the behaviour of the elements on the page. In order for a search engine to access web page content, the page needs to be rendered! This is not a job for a crawler, but for an indexer.
– In the case of Google, its indexer, called Caffeine, renders web pages, and Googlebot does not execute JavaScript at all.

– Links that are embedded in JavaScript are not visible to the crawler.

Earlier workaround – AJAX-crawling

In October 2009 Google came out with “A proposal for making AJAX crawlable”. This initiative was introduced to make JavaScript-based web pages accessible to Googlebot. Googlebot sends the server the URL of the JS web page it needs, and the server responds with that web page fully rendered into an HTML snapshot (the result of executing the JavaScript in a headless browser, i.e. a browser with no GUI), which is then returned to the crawler. Nowadays this method is deprecated, as modern Googlebot has far more advanced JavaScript rendering capabilities.

AJAX-crawling scheme

This scheme accepts a URL containing a "#!", or a page carrying the "fragment meta tag" (<meta name="fragment" content="!">). This URL is called the pretty URL. The crawler then requests the content of that page from the server, but it modifies the URL: the "#!" is replaced with "?_escaped_fragment_=" (for pages with the meta tag, "?_escaped_fragment_=" is appended). This URL is called the ugly URL.
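The pretty-to-ugly mapping can be sketched as a small Python function. This is an approximation of the deprecated scheme, not a reference implementation: the `pretty_to_ugly` name and its `has_fragment_meta_tag` parameter are assumptions, and the sketch percent-encodes the fragment more conservatively than the original specification requires.

```python
from urllib.parse import quote

ESCAPED_FRAGMENT = "_escaped_fragment_"


def pretty_to_ugly(url, has_fragment_meta_tag=False):
    """Map a pretty AJAX URL to the ugly URL a crawler would request.

    has_fragment_meta_tag is a hypothetical flag meaning the page declared
    <meta name="fragment" content="!"> in its HTML.
    """
    if "#!" in url:
        base, fragment = url.split("#!", 1)
    elif has_fragment_meta_tag:
        # No hashbang: drop any plain fragment and send an empty value.
        base, fragment = url.split("#", 1)[0], ""
    else:
        return url  # Not part of the scheme; leave the URL untouched.
    # Append to an existing query string if there is one.
    separator = "&" if "?" in base else "?"
    # Escape characters such as "&" so the fragment stays a single parameter.
    return f"{base}{separator}{ESCAPED_FRAGMENT}={quote(fragment, safe='=/')}"
```

For example, `pretty_to_ugly("http://www.example.com/ajax.html#!key=value")` yields `http://www.example.com/ajax.html?_escaped_fragment_=key=value`. The server was expected to answer that request with the HTML snapshot.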

AJAX-crawling scheme

Since 2015, Googlebot has been able (at least to some extent) to render “#!” URLs directly, so providing it with a rendered version of the web page became obsolete. (This does not apply to other search engines, though.) In the second quarter of 2018, Google switched completely to rendering JavaScript pages on Google's side, and it no longer requires that websites do this themselves. However, AJAX-crawling scheme URLs are still supported in Google's search results.

Indexer does the rendering

When the crawler processes a page, it searches for hyperlinks. It then sends them to the indexer, which renders the pages and executes JavaScript; this often results in finding new URLs, which are then sent back to the crawler. The process stops when the crawler cannot crawl any further. But this does not always work ideally: the bottom line is that no search engine can understand and process JavaScript at the level modern browsers can.

Indexer

The most advanced indexer is Google's Caffeine
Every indexer analyses the following things:
– Content,
– Links,
– Layout.

Robots meta tags (the page-level counterpart of robots.txt) can be used to control the indexer's behaviour.

Indexer

An indexer provides three services:
– Canonicalization

• It finds the canonical URL,

– WRS (web rendering service)
• It renders a web page (like a browser),

– PageRanker
• The indexer calculates the rank of a given web page.

Canonicalization

Canonicalization
– The indexer finds the canonical URL (the master copy of a page)
• E.g. all of the following are different pages to the crawler (even though the content is the same):
– http://www.example.com
– https://www.example.com
– http://example.com
– http://example.com/index.php
– http://example.com/index.php?r...
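One way to picture canonicalization is as a function that maps all such variants to a single URL. The Python sketch below does this for one site under stated assumptions: the preferred scheme, the `HOST_ALIASES` table, the list of default documents, and the ignored `utm_source` parameter are all hypothetical choices, and real search engines use many more signals to pick the canonical URL.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical site policy: which host is preferred, which file names are
# treated as the directory index, and which query parameters are ignored.
HOST_ALIASES = {"example.com": "www.example.com"}
DEFAULT_DOCUMENTS = ("index.php", "index.html")
IGNORED_PARAMS = ("utm_source",)


def canonicalize(url, scheme="https"):
    """Map equivalent URL variants of one site to a single canonical form."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    host = HOST_ALIASES.get(host, host)
    path = parts.path or "/"
    # Treat /index.php and / as the same resource.
    head, _, last = path.rpartition("/")
    if last in DEFAULT_DOCUMENTS:
        path = head + "/"
    # Keep only the query parameters that actually change the content.
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS]
    # The fragment is dropped; it is never sent to the server anyway.
    return urlunsplit((scheme, host, path, urlencode(kept), ""))
```

With this policy, `http://example.com/index.php` and `https://www.example.com` both map to `https://www.example.com/`, while a URL with a content-changing parameter such as `?page=2` keeps it.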

<link rel="canonical" href="http://www.unibo.it/it"></head>

Web rendering service

Google's JavaScript indexing capabilities are unprecedented; it uses Chrome 41 for its web rendering service (WRS).
– Chrome 41 was released on the 3rd of March 2015 (so there are some modern features it doesn't support),

– WRS is stateless; it doesn't store cookies or session data,

– If JavaScript requires any user action, the content behind that action won't be rendered.

PageRanker

PageRank is a mathematical algorithm used to determine the importance of a page; it is essentially based on assessing the quantity and quality of the links leading to that web page. PageRanker is used to calculate the rank of a given web page
– What matters are the hyperlinks, internal and external,
– Damping factor – the probability that the user will continue clicking, rather than leaving that website.

PageRanker sends the results to the crawler. Pages with higher importance are crawled with higher priority!
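To make the role of the links and the damping factor concrete, here is a minimal Python sketch of the textbook PageRank power iteration. It is not Google's production algorithm: the `pagerank` function, the tiny link graph, the damping value of 0.85, and the iteration count are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    damping is the probability that the user keeps clicking.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small base share from users who stop clicking
        # and start again at a random page.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page site: every page links back to "home".
EXAMPLE_LINKS = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home"],
    "orphan": ["home"],
}

ranks = pagerank(EXAMPLE_LINKS)
```

In this graph `home` ends up with the highest rank because three pages link to it, while `orphan`, which no page links to, keeps only the base share.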

Unibo pagerank


How is JavaScript indexed?

The golden rule is: if a user action is required in order to load some content, that content won't be indexed. Also, anything that requires user consent (e.g. access to the camera) is blocked as well. Links that are embedded in JavaScript are extracted upon execution

– Caffeine does the processing,
– Newly discovered URLs are sent to Googlebot.

What if a button should be clicked?

When a button has to be clicked, Googlebot will render the content if it resides on the same web page, but it will not index the content as part of the web page if it is loaded from another web page through some action that the user must perform.



Googlebot

Google's web crawling bot; it discovers new and updated pages that are to be added to Google's index. It is designed to be distributed across several machines in order to improve performance and to scale as the web grows. Googlebot uses an algorithmic process: computer programs determine which sites to crawl, how often, and how many pages to fetch from each site.

Fetch and render (as Googlebot)


Googlebot – JavaScript crawling and indexing

Source: https://www.elephate.com/blog/javascript-seo-experiment/ (updated on the 5th of March 2018)

Search Engines – JavaScript crawling and indexing

Source: https://moz.com/blog/search-engines-ready-for-javascript-crawling (published on the 29th of August 2017)

JavaScript frameworks are a must!

Web application development technologies, such as React, Angular, Vue, Backbone, etc., are spreading throughout both frontend and backend web development. Having at least a basic understanding of these technologies is one of the requirements for efficient SEO.

Googlebot can render JavaScript

Googlebot is able to render JavaScript pages (if it is not blocked, say by a robots.txt file, from accessing the required resources: JavaScript files/frameworks, CSS files, server responses, 3rd-party APIs, etc.). Through the render process, Googlebot extracts titles, description tags, structured data, and other metadata, much like any modern web browser. If resources are blocked or temporarily unavailable, client-side code should be written in such a way that it fails gracefully. Web page content should be available even through browsers that are not compatible with the JavaScript implementations used for that website.

Avoid AJAX-crawling

Google recommends avoiding AJAX-crawling on new websites, and migrating old sites that still use this scheme.
– When migrating, “meta fragment” tags should be removed.

– The “meta fragment” tag should not be used if the “escaped fragment” URL doesn't serve fully rendered content.

Not every bot is Google

Other web bots are far less capable of rendering dynamic sites; some of them might not even support JavaScript at all, and just expect plain HTML. So a down-level experience should be implemented, so that bots are not prevented from crawling through the navigation or seeing the content embedded in a web page. One very efficient and flexible way to enable all search engines to access web page content is server-side pre-rendering.
– Major JavaScript frameworks support this feature natively.

Progressive enhancement

A website should be built using the “progressive enhancement” technique, so that the content is available to all users, regardless of the browser they use. One technique that should be avoided is redirecting users to an “unsupported browser” page. Where needed, a polyfill (or any other fallback) should be used!

JavaScript redirects

The best practice is to perform redirection on the server side, but using JavaScript is a legitimate practice (and sometimes the only possible option). Googlebot is not very patient, so client-side redirection with JavaScript should happen as quickly as possible. A 301 redirect is the best option when moving a web page to a different address (this preserves the rankings as much as possible), and it should be used rather than JavaScript redirection.
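A server-side 301 redirect can be sketched with Python's built-in `http.server`. This is only an illustration under stated assumptions: the `MOVED` table, the `/old-page` and `/new-page` paths, the placeholder page body, and the port are made up, and a production site would normally configure redirects in its web server or framework.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical table of pages that have moved permanently.
MOVED = {"/old-page": "/new-page"}


def resolve_redirect(path, moved=MOVED):
    """Return (status, location); location is None when no redirect applies."""
    target = moved.get(path)
    if target is not None:
        return 301, target
    return 200, None


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Note: self.path includes any query string sent by the client.
        status, location = resolve_redirect(self.path)
        self.send_response(status)
        if location is not None:
            # The Location header tells browsers and crawlers the new address.
            self.send_header("Location", location)
            self.end_headers()
            return
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(b"<h1>Current page</h1>")


def run(port=8000):
    """Start the demo server on localhost; call this manually to try it out."""
    HTTPServer(("127.0.0.1", port), RedirectHandler).serve_forever()
```

Because the redirect is part of the HTTP response itself, a crawler learns the new address without having to execute any JavaScript.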

Mobile-first indexing

By default, Googlebot always renders the page, so it can crawl links added by JavaScript on mobile pages. Googlebot won't see hyperlinks that exist on desktop web pages but not on mobile web pages.
– Mobile sites and desktop sites need to be equivalent!

Mobile-first policy

Test if Google can render your pages properly

Google Search Console has a Fetch and Render tool which can be used for previewing how Googlebot sees a given web page.
– This tool does not support “#!” and “#” URLs.
– Generally, URLs with “#” (that is, “#” without the “!”) should be avoided, because Googlebot rarely indexes those URLs.

SEO for JavaScript websites

JavaScript frameworks help create a modern experience for users, but JavaScript presents a challenge for search engines. Luckily, most JS frameworks support both frontend and backend rendering. Things to keep in mind:
– URLs should look like static URLs

• Avoid hashtags in URLs

– Use standard <a href> links in HTML along with “onclick” events
• This ensures discoverability by search engines

– Server-side JavaScript

Server-side JavaScript

This ensures that there is plain HTML for search engines to use. Though Googlebot can manage crawling and rendering client-side JavaScript, other search engines cannot. Things to keep in mind:
– If you have a lot of users using other search engines, you will lose traffic.

– Even Google has trouble with heavy client-side rendering
• This is also much slower

– Google might misinterpret the content
• Even the slightest errors are dangerous

– Even correct crawling and rendering can result in strange behaviour

www.unibo.it

