SEO (Search Engine Optimization)
Dragiša Miljković
Department of electrical engineering and computer science
Faculty of technical sciences
University of Priština
What will we do today?
This talk discusses just the most important and interesting ideas in Search Engine Optimization. SEO is the process of improving the visibility of a website within search results, so that it becomes easier to find, more relevant to search engines, and more accessible to search engine crawlers. In short, it is the process of improving a website's search engine rank position.
Who is this guy?
M.Sc.Eng. Dragiša Miljković
Teaching fellow at the Department of electrical engineering and computer science
Faculty of technical sciences
University of Priština
Serbia
Basic terminology
A search engine is a software system used for searching for information on the WWW. The user enters key phrases of interest into the search field, and the search engine returns web content results in a list of so-called search engine results pages (SERPs). These results can be a mixture of web pages, images, videos, and other file types. The web is perpetually changing, so search engines must maintain near real-time indexing. This is done by constantly running an algorithm on a web crawler.
SERP
A web search engine processes a keyword query submitted by a “searcher” and, in response, presents SERPs. A SERP is comprised of the list of results (usually ordered by relevance to the query) returned by the search engine, but it may also contain other results, such as advertisements. There are two main categories of results: organic search (returned by the algorithm of a search engine) and sponsored search (i.e. advertisements).
– A result is displayed with a title, a hyperlink to that web page, and a brief description (it shows how that result matches the query).
Internet is a curious place
There are parts of the web that are not accessed by search engines:
– deep web, hidden from conventional search engines (e.g. by encryption);
– dark web, intentionally hidden from search engines; it uses masked IP addresses and is accessible only with a special web browser.
• Notice that the dark web is part of the deep web.
Search engines – how do they work?
A search engine is comprised of three processes:
– Crawler,
– Indexer,
– Queryengine.
This is the technical part of SEO
Types of JavaScript
JavaScript may be used to enhance HTML & CSS
– It can be used to improve user experience and to add some functionality,
– This affects SEO very little.
JavaScript may be used to replace the content of a web page – its HTML and CSS
– In this way, web pages become web applications,
– This causes trouble for search engines.
Crawler
Also known as a (web) spider. This is an internet bot that browses the WWW for the purpose of web indexing. There are billions of web pages, so the crawler needs to be constantly executed on a large number of computers. The crawler takes the URLs from previous crawls and from XML sitemaps; it then tries to find new or updated pages and add them to the Google index (it detects SRC and HREF links).
– It extracts hyperlinks and adds them to the queue for crawling,
– It only retrieves a page if it is new or if it has changed, and removes the dead ones.
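The crawl loop described above can be sketched roughly as follows. This is a minimal illustration, not how any real search engine crawler is built: the names LinkExtractor and crawl, and the caller-supplied fetch function, are assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href and src attribute values, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(urljoin(self.base_url, value))


def crawl(start_url, fetch, limit=100):
    """Breadth-first crawl. `fetch` (hypothetical) maps a URL to HTML, or None if dead."""
    queue = deque([start_url])   # URLs waiting to be crawled
    seen = {start_url}           # URLs already queued, so each is fetched once
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:         # dead page: drop it and move on
            continue
        visited.append(url)
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

In a test, fetch can simply be the get method of a dictionary that maps URLs to HTML strings, which keeps the sketch free of network access.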
Crawler
Criteria that matter to the crawler:
– How long does it take to load a web page?
– How important is that URL?
– Are there more hyperlinks on a given web page?
Search engine cares only about URLs
A search engine does not care (too) much about the content of a web page; rather
– it crawls, indexes, and ranks only URLs.
– A general rule: one piece of content should be associated with one URL!
URLs that the crawler returns are added to the search engine's index. When a user enters a query in a search engine, relevant results are returned based on the search engine's algorithm.
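As a toy illustration of an index keyed by URL, the sketch below builds an inverted index (term to set of URLs) and answers a query by intersecting the sets. The function names and the whitespace tokenization are assumptions for the example; real engines add ranking on top of this lookup.

```python
from collections import defaultdict


def build_index(pages):
    """Map every lower-cased term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every term of the query, sorted alphabetically."""
    terms = query.lower().split()
    if not terms:
        return []
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return sorted(results)
```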
Robots exclusion protocol
Known simply as robots.txt
It must be in the top directory of the server. This is a standard which websites use to regulate which areas of the website web crawlers and other web robots are allowed to process and categorize. This is solely a standard, and not an enforced rule, so not all robots will comply.
– E-mail harvesters, malware, and spambots are even likely to start at the areas of the website that should be omitted.
Unibo robots.txt
http://www.unibo.it/robots.txt
User-agent: *
Disallow: /NR/exeres
Disallow: /NR/rdonlyres
Disallow: /it/allegati/allegati-non-indicizzati
Disallow: /en/attachments/unindexed-attachments
Disallow: /modello
Disallow: /modello-a
Disallow: /modello-b
Disallow: /uniboweb/sites/UniboSearch/results.aspx
Disallow: /UniboWeb/UniboSearch/results.aspx
Disallow: /_layouts
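A well-behaved robot could check these rules before fetching a page. The sketch below uses Python's standard urllib.robotparser on a copy of the rules shown above; the helper name is_allowed is an assumption for the example, and the rules are parsed from a string rather than downloaded.

```python
from urllib.robotparser import RobotFileParser

# Copy of the rules from the slide, kept inline so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /NR/exeres
Disallow: /NR/rdonlyres
Disallow: /it/allegati/allegati-non-indicizzati
Disallow: /en/attachments/unindexed-attachments
Disallow: /modello
Disallow: /modello-a
Disallow: /modello-b
Disallow: /uniboweb/sites/UniboSearch/results.aspx
Disallow: /UniboWeb/UniboSearch/results.aspx
Disallow: /_layouts
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(path, user_agent="*"):
    """True if a compliant robot with this user agent may fetch the given path."""
    return parser.can_fetch(user_agent, "http://www.unibo.it" + path)
```

Since compliance is voluntary, a check like this only describes what a polite crawler would do; it does not protect the listed paths.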
Sitemaps inclusion protocol
Sitemaps are a URL inclusion protocol. A webmaster can use sitemaps to provide search engines with the URLs on a website that are available for crawling. A sitemap is just an XML file which lists all the URLs on one website. In order to enable search engines to crawl web pages more efficiently, some additional metadata can be included:
– How important a given URL is (in relation to other URLs on the same website),
– When that URL was last updated,
– How frequent the changes are, etc.
– It can also include information about specific types of content on the website (e.g. images, videos).
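A small sketch of how such a file could be generated with Python's standard library follows. The element names loc, lastmod, changefreq, and priority and the namespace come from the sitemaps.org protocol; the function name build_sitemap and the dictionary layout of the entries are assumptions for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap XML string from dicts with 'loc' and optional metadata keys."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # Emit the tags in a fixed order and skip metadata that was not supplied.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")
```

Here priority carries the relative importance of a URL within the site, lastmod the date of the last update, and changefreq the expected frequency of changes.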
JavaScript affects crawling
JavaScript frameworks are used to create interactive web pages and to control the behaviour of the elements on the page. In order for a search engine to access web page content, it needs to render it! This is not a job for a crawler, but for an indexer
– In the case of Google, its indexer, called Caffeine, renders web pages, and Googlebot does not execute JavaScript at all.
– Links that are embedded into JavaScript are not visible to the crawler.
Earlier workaround – AJAX-crawling
In October 2009 Google came out with “A proposal for making AJAX crawlable”. This initiative was introduced to make JavaScript-based web pages accessible to Googlebot. Googlebot sends the server the URL of the JS web page that it needs, and the server responds with a web page that is fully rendered into an HTML snapshot (the result of executing the JavaScript in a headless browser, i.e. a browser with no GUI), which is then returned to the crawler. Nowadays, this is a deprecated method, as modern Googlebot has truly advanced JavaScript rendering capabilities.
AJAX-crawling scheme
This scheme accepts a URL containing either a “#!” or a “fragment meta tag” (<meta name="fragment" content="!">). This URL is called the pretty URL. The crawler then requests the content of that page from the server, but it modifies the URL by replacing the #! or meta tag with "?_escaped_fragment_=". This URL is called the ugly URL.
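The pretty-to-ugly mapping can be sketched in a few lines. This is an illustration of the (now deprecated) scheme, with assumptions labeled: the function name to_ugly_url is made up for the example, and the set of characters left unescaped follows my reading of Google's original specification, which percent-escapes spaces, "#", "%", "&", "+", and non-ASCII bytes.

```python
from urllib.parse import quote

# Printable ASCII characters that stay unescaped in the fragment value
# (assumption based on the original AJAX-crawling specification).
_SAFE = "".join(chr(c) for c in range(0x21, 0x7F) if chr(c) not in "#%&+")


def to_ugly_url(pretty_url):
    """Convert a '#!' (pretty) URL into its '_escaped_fragment_' (ugly) form."""
    base, marker, fragment = pretty_url.partition("#!")
    if not marker:               # no hash-bang: nothing to rewrite
        return pretty_url
    # Append to an existing query string with '&', otherwise start one with '?'.
    joiner = "&" if "?" in base else "?"
    return base + joiner + "_escaped_fragment_=" + quote(fragment, safe=_SAFE)
```

For instance, under these assumptions http://example.com/page#!key=value maps to http://example.com/page?_escaped_fragment_=key=value.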
AJAX-crawling scheme
Ever since 2015, Googlebot has been able (at least to some extent) to render the “#!” URLs directly, so providing it with a rendered version of the web page became obsolete. (Though this does not apply to other search engines.) In the second quarter of 2018, Google completely switched to rendering JavaScript pages on Google's side, and it no longer requires that websites do this by themselves. However, AJAX-crawling scheme URLs are still supported in Google's search results.
Indexer does the rendering
When the crawler processes a page, it searches for hyperlinks. It then sends them to the indexer, which renders them and executes JavaScript; this often results in finding new URLs, which are then sent back to the crawler. The process stops when the crawler cannot crawl any further. But this does not always work ideally; the bottom line is that there is no search engine that can understand and process JavaScript at the level modern browsers can.
Indexer
The most advanced indexer is Google’s Caffeine
Every indexer analyses the following things
– Content,
– Links,
– Layout.
Robots meta tags (the page-level counterpart of robots.txt) can be used to control the indexer’s execution.
Indexer
An indexer provides three services
– Canonicalization
• It finds the canonical URL,
– WRS (web rendering service)
• It renders a web page (like a browser),
– PageRanker
• Indexer calculates the rank of a given web page.
Canonicalization
Canonicalization
– Indexer finds the canonical URL (the master copy of a page)
• E.g. all of the following web pages are different pages to the crawler (even though the content is the same):
– http://www.example.com
– https://www.example.com
– http://example.com
– http://example.com/index.php
– http://example.com/index.php?r...
<link rel="canonical" href="http://www.unibo.it/it"></head>
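To see which canonical URL a page declares, the link element can be read out of the HTML. The sketch below does this with Python's standard html.parser; the names CanonicalFinder and find_canonical are assumptions for the example, and it only reads the declared tag rather than reproducing how an indexer chooses a canonical URL.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens; compare case-insensitively.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the declared canonical URL, or None if the page declares none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```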
Web rendering service
Google's JavaScript indexing capabilities are without precedent; it uses Chrome 41 for its web rendering service (WRS).
– Chrome 41 was released on 3rd of March 2015 (so there are some modern features it doesn’t support),
– WRS is stateless; it doesn’t store cookies or session data,
– If JavaScript requires any user action, that web page won’t be rendered.
PageRanker
PageRank is a mathematical algorithm used to determine the importance of a page; it is essentially based on assessing the quantity and quality of links leading to that web page. PageRanker is used to calculate the rank of a given web page
– What matters are the hyperlinks, internal and external,
– Damping factor – the probability that the user will continue clicking, rather than leaving that website.
PageRanker sends the results to the crawler. Pages with higher importance are crawled with higher priority!
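The idea can be illustrated with the textbook power-iteration form of PageRank. This is a sketch under stated assumptions, not Google's production ranking: the function name, the default damping factor of 0.85 (a commonly quoted value), the iteration count, and the even redistribution of rank from pages without outgoing links are all choices made for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Compute PageRank for a graph given as {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its damped rank on, split across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

With links = {"a": ["c"], "b": ["c"], "c": ["a"]}, page c collects links from both a and b and ends up with the highest rank, while b, which nothing links to, ends up with the lowest.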
How is JavaScript indexed?
The golden rule is: if a user action is required in order to load some content, that content won’t be indexed. Also, anything that requires user consent (e.g. access to the camera) is blocked as well. Links that are embedded into JavaScript are extracted upon execution
– Caffeine does the processing,
– Newly discovered URLs are sent to Googlebot.
What if a button should be clicked?
When a button should be clicked, Googlebot will render that content if the content resides on the same web page, but it will not index the content as a part of the web page if it is called from another web page using some sort of action that the user must perform.
Googlebot
Google's web crawling bot; it discovers new and updated pages that are to be added to Google’s index. It is designed to be distributed across several machines in order to improve performance and to scale as the web grows. Googlebot uses an algorithmic process: computer programs determine which sites to crawl, how often, and how many pages to fetch from each site.
Googlebot – JavaScript crawling and indexing
Source: https://www.elephate.com/blog/javascript-seo-experiment/ (updated on 5th of March 2018)
Search engines – JavaScript crawling and indexing
Source: https://moz.com/blog/search-engines-ready-for-javascript-crawling (published on 29th of August 2017)
JavaScript frameworks are a must!
Web application development technologies, such as React, Angular, Vue, Backbone, etc., are spreading throughout both front-end and back-end web development. Having at least a basic understanding of these technologies is one of the requirements for efficient SEO.
Googlebot can render JavaScript
Googlebot is able to render JavaScript pages (if it is not blocked, say, by a robots.txt file, from accessing the required resources – JavaScript files/frameworks, CSS files, server responses, 3rd-party APIs, etc.). Through the render process, Googlebot extracts titles, description tags, structured data, and other metadata, much like any modern web browser. If resources are blocked or temporarily unavailable, client-side code should be written in such a way that it fails gracefully. Web page content should be available even through browsers that are not compatible with the JavaScript implementations used for that website.
Avoid AJAX-crawling
Google recommends avoiding AJAX-crawling on new websites and migrating old sites that still use this scheme.
– When migrating, “meta fragment” tags should be removed.
– A “meta fragment” tag should not be used if the “escaped fragment” URL doesn’t serve fully rendered content.
Not every bot is Google
Other web bots are far less capable of rendering dynamic sites; some of them might not even support JavaScript at all, and just expect plain HTML. So sites should implement a down-level experience, so that bots are not prevented from crawling through navigation or from seeing the content embedded in a web page. One very efficient and flexible way to enable all search engines to access web page content is server-side pre-rendering.
– Major JavaScript frameworks support this feature natively.
Progressive enhancement
A website should be built using the “progressive enhancement” technique, so that the content is made available to all users, regardless of the browser they use. One technique that should be avoided is redirecting users to an “unsupported browser” page. Where needed, a polyfill (or any other fallback) should be used!
JavaScript redirects
The best practice is to perform redirection on the server side, but it is a legitimate practice (and sometimes the only possible option) to use JavaScript. Googlebot is not very patient when waiting, so client-side redirection with JavaScript should be done as quickly as possible. 301 redirects are the best option when moving a web page to a different address (this preserves the rankings as much as possible), and they should be used rather than JavaScript redirection.
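A server-side 301 can be as small as the WSGI sketch below. The paths in MOVED and the name app are hypothetical, chosen only for the example; any web framework or server configuration that answers with status 301 and a Location header achieves the same effect.

```python
# Hypothetical mapping from old paths to their new, permanent addresses.
MOVED = {
    "/old-page": "/new-page",
}


def app(environ, start_response):
    """Minimal WSGI application that answers moved paths with a 301 redirect."""
    path = environ.get("PATH_INFO", "/")
    target = MOVED.get(path)
    if target is not None:
        # The redirect happens before any HTML or JavaScript is sent to the client.
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<h1>Hello</h1>"]
```

For local experiments the application could be served with wsgiref.simple_server.make_server from the standard library.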
Mobile-first indexing
Googlebot, by default, will always render the page, and it can crawl links included by JavaScript on mobile pages. Googlebot won’t see hyperlinks that exist on desktop web pages and not on mobile web pages.
– Those mobile sites and desktop sites need to be equivalent!
Test if Google can render your pages properly
Google Search Console has a Fetch and Render tool which can be used for previewing how Googlebot sees a given web page.
– This tool does not support “#!” and “#” URLs.
– Generally, URLs with “#” (that is, “#” without a following “!”) should be avoided, because Googlebot rarely indexes those URLs.
SEO for JavaScript websites
JavaScript frameworks help create a modern experience for users, but JavaScript presents a challenge for search engines. Luckily, most JS frameworks support both front-end and back-end rendering. Things to keep in mind:
– URLs should look like static URLs
• Avoid hashtags in URLs
– Use standard <a href> links in HTML along with “onclick” events
• This ensures discoverability by search engines
– Server-side JavaScript
Server-side JavaScript
This ensures that there is plain HTML for search engines to use. Though Googlebot can manage crawling and rendering client-side JavaScript, other search engines cannot. Things to keep in mind:
– If you have a lot of users using other search engines, you will lose traffic.
– Even Google has trouble with heavy client-side rendering
• This is also much slower
– Google might misinterpret the content
• Even the slightest errors are dangerous
– Even correct crawling and rendering can result in strange behaviour
www.unibo.it
Dragiša Miljković
Department of electrical engineering and computer science
Faculty of technical sciences
University of Priština