
Behind the Scenes at a Search Engine

William Denton <[email protected]>

Web Librarian, York University Libraries

20 March 2008

http://www.library.yorku.ca/binaries/frontiers/20080320-denton-search-engine.ppt

Denton: Search Engines / 20 March 2008 / York 2

To be covered

The three basic parts of a web search

Search engine optimization

Advertising, and how to avoid it

Library databases and the deep web

Ask questions any time.


Google, Yahoo, Ask, Live, A9

www.google.com: australopithecus

search.yahoo.com: australopithecus

www.ask.com: australopithecus

search.live.com: australopithecus

a9.com: australopithecus


The computing power, bandwidth, and electricity use are mind-blowing

David F. Carr, How Google Works

Urs Hölzle talk on Google’s Linux cluster (2002)

Ginger Strand, Keyword: Evil (Harper’s, March 2008)


What happens when you search?

You enter some words

You get good links

And sometimes other good stuff


Three things

1. How does it know about everything?


Three things

2. How does it decide what’s relevant?


Three things

3. How does it serve you the results?


1. How does it know about everything?

Crawlers are continually moving around the web, looking for whatever they can find.

Different search engines crawl different numbers of pages, but all of them crawl billions.


A visit from a Googlebot

66.249.73.229 - - [09/Mar/2008:04:15:47 -0400] "GET /ccm/jsp/homepage.jsp HTTP/1.1" 200 13670 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html) - -"
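A log line like this can be split into its fields with a short script. The following is a minimal sketch, assuming the server writes an Apache-style "combined" log; the pattern and the function names are illustrative, not taken from any particular tool. Note that a user-agent string only claims to be Googlebot; anyone can send the same string.

```python
import re

# Pattern for the fields in the log line above (assumed "combined" style).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def looks_like_googlebot(fields):
    """True if the user-agent string claims to be Googlebot."""
    return "Googlebot" in fields["agent"]


# The visit shown on the slide, joined back into one line.
line = (
    '66.249.73.229 - - [09/Mar/2008:04:15:47 -0400] '
    '"GET /ccm/jsp/homepage.jsp HTTP/1.1" 200 13670 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; '
    'http://www.google.com/bot.html) - -"'
)

fields = parse_log_line(line)
print(fields["ip"], fields["request"], fields["status"])
print("Claims to be Googlebot:", looks_like_googlebot(fields))
```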


robots.txt

Polite web crawlers first check the robots.txt file on a web site and obey its rules.

Wikipedia’s robots.txt

York’s robots.txt

www.robotstxt.org
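Here is a small sketch of how a polite crawler might check the rules before fetching a page, using the robots.txt parser in Python's standard library. The robots.txt content, the crawler name ExampleBot, and the example.org URLs are all made up for illustration; a real crawler would download the file from the site it is visiting.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt; real sites publish their own rules.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""


def build_parser(robots_txt):
    """Load robots.txt rules from a string instead of the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(EXAMPLE_ROBOTS_TXT)

for url in (
    "http://www.example.org/index.html",
    "http://www.example.org/private/notes.html",
):
    allowed = parser.can_fetch("ExampleBot", url)
    print(url, "allowed" if allowed else "disallowed")
```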


Crawl, download, scan, repeat

Crawlers crawl and crawl and download all the pages (and Word files, PDFs, and spreadsheets) they come across. They scan each page for links and add them to their list of pages to crawl, ad infinitum.
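The loop can be sketched in a few lines. This is a toy version, assuming a tiny in-memory "web" (the TINY_WEB dictionary and the fetch function are stand-ins for real HTTP downloads), with none of the politeness, scheduling, or scale of a real crawler.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical four-page web, used instead of real downloads.
TINY_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="/b.html">B</a>',
    "http://example.org/b.html": '<a href="/">Home</a> <a href="/c.html">C</a>',
    "http://example.org/c.html": "No links here.",
}


def fetch(url):
    """Stand-in for an HTTP download; returns None for unknown pages."""
    return TINY_WEB.get(url)


def crawl(start_url, max_pages=100):
    """Crawl, download, scan, repeat: return a dict of url -> page content."""
    to_crawl = deque([start_url])   # the list of pages still to visit
    seen = {start_url}              # never queue the same page twice
    downloaded = {}
    while to_crawl and len(downloaded) < max_pages:
        url = to_crawl.popleft()
        html = fetch(url)
        if html is None:
            continue
        downloaded[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                to_crawl.append(link)
    return downloaded


pages = crawl("http://example.org/")
print(list(pages))
```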


2. How does it decide what’s relevant?

What’s on the page

Inbound and outbound links (PageRank)

Frequency of updates

The kind of site it is

Clickthroughs and usage analysis

People tweaking the rules

Secret other stuff


Term frequency

How many times a word appears in a document


Inverse document frequency

Number of documents / number of documents containing the term

(Actually the logarithm of this.)


TF-IDF

TF-IDF of a keyword in a page = TF * IDF


Example

100 web pages. Keyword: naillie

#1 has 8 mentions. TF = 8.

#2, 17, 19, 76 have 4 mentions. TF = 4.

20 pages have 1 mention. TF = 1.

That makes 1 + 4 + 20 = 25 pages containing the keyword.

IDF = log2 (100 / 25) = 2

TF-IDF of naillie in #1 = 8 * 2 = 16 (High!)

TF-IDF of naillie in #2, 17, 19, 76 = 4 * 2 = 8 (Not so high)

TF-IDF of naillie in 20 others = 1 * 2 = 2 (Small)

TF-IDF of naillie in all the rest = 0 * 2 = 0 (Irrelevant)
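The arithmetic in this example can be checked with a few lines of code. This sketch uses the simple definitions from the slides (raw count for TF, base-2 logarithm for IDF); the function names are chosen here for illustration, and real search engines use many variations on these formulas.

```python
import math


def idf(total_documents, documents_with_term):
    """Inverse document frequency: log2 of (all documents / documents with the term)."""
    return math.log2(total_documents / documents_with_term)


def tf_idf(term_count, total_documents, documents_with_term):
    """TF-IDF of a keyword in one page = TF * IDF."""
    return term_count * idf(total_documents, documents_with_term)


TOTAL_PAGES = 100
PAGES_WITH_TERM = 1 + 4 + 20  # pages that mention "naillie" at least once

examples = [
    ("page #1", 8),
    ("pages #2, #17, #19, #76", 4),
    ("20 other pages", 1),
    ("all the rest", 0),
]

print("IDF =", idf(TOTAL_PAGES, PAGES_WITH_TERM))
for label, mentions in examples:
    print(label, "TF-IDF =", tf_idf(mentions, TOTAL_PAGES, PAGES_WITH_TERM))
```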


HTML helps a lot

<title>The title of the page</title>

<h1>The most important heading</h1>

<h2>Lesser headings</h2>

<a href="http://www.yorku.ca/">Hyperlinks</a>

Text at the top of the page
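One way to picture this: a keyword found in the title or a heading counts for more than the same keyword in ordinary text. The sketch below is only an illustration of that idea. The weights are invented (search engines do not publish theirs), the sample page is made up, and it ignores position on the page.

```python
import re
from html.parser import HTMLParser

# Hypothetical weights, invented for this example.
FIELD_WEIGHTS = {"title": 5, "h1": 4, "h2": 3, "a": 2, "body": 1}


class FieldCollector(HTMLParser):
    """Sort the words on a page into title, heading, link, and body text."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.fields = {name: [] for name in FIELD_WEIGHTS}

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the most recent matching tag and anything left open inside it.
            index = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[index:]

    def handle_data(self, data):
        words = re.findall(r"\w+", data.lower())
        # Use the innermost enclosing tag that has a weight; default to body text.
        field = next(
            (tag for tag in reversed(self.open_tags) if tag in FIELD_WEIGHTS),
            "body",
        )
        self.fields[field].extend(words)


def weighted_score(html, keyword):
    """Count keyword mentions, multiplying each by the weight of its field."""
    collector = FieldCollector()
    collector.feed(html)
    keyword = keyword.lower()
    return sum(
        FIELD_WEIGHTS[field] * words.count(keyword)
        for field, words in collector.fields.items()
    )


PAGE = """<html><head><title>Shoes in Toronto</title></head>
<body><h1>Toronto shoes</h1>
<p>We sell shoes. <a href="http://www.yorku.ca/">York shoes</a></p>
</body></html>"""

print(weighted_score(PAGE, "shoes"))
```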


Semantic markup

Wrong: say how it should look

<b><font size="26">

<i>Upon the Distinction Between the Ashes of the Various Tobaccos</i>

</font></b>

Right: say what it is (then apply a look)

<h1>Upon the Distinction Between the Ashes of the Various Tobaccos</h1>


PageRank!

Algorithm designed by Larry Page and Sergey Brin when they were at Stanford

One of the things Google uses in deciding how important a web page is

US Patent 6,285,999

The Anatomy of a Large-Scale Hypertextual Web Search Engine (Brin, Page, 1998)


http://en.wikipedia.org/wiki/PageRank

Wikipedia has a good explanation of PageRank
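The basic idea can be sketched in code: a page is important if important pages link to it. What follows is a simplified textbook version, not what Google actually runs. The four-page link graph is made up, and the damping factor of 0.85 is the value usually quoted from the Brin and Page paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links.

    links maps each page to the list of pages it links to; every linked
    page must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: C has the most inbound links, D has none.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

ranks = pagerank(LINKS)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph the page with the most inbound links ends up on top, and the page nobody links to ends up at the bottom.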


Other weightings

Frequency of updates

The kind of site it is: blog? wiki? institutional repository?

Who runs the site?

Search engine companies tweak the rules

They don’t give away their secrets, but lots of people try to reverse engineer the algorithms


Google’s explanation

“Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant to your search. Google goes far beyond the number of times a term appears on a page and examines dozens of aspects of the page's content (and the content of the pages linking to it) to determine if it's a good match for your query.”

http://www.google.com/technology/


Live’s explanation

“Live Search website ranking is completely automated. The Live Search ranking algorithm analyzes factors such as web page content, the number and quality of websites that link to your pages, and the relevance of your website’s content to keywords. The algorithm is complex and is never human-mediated. You can't pay to boost your website’s relevance ranking; however, we do offer advertising options for website owners.”

http://help.live.com/help.aspx?mkt=en-us&project=wl_webmasters


What doesn’t it look at?

Crawlers don’t see inside fancy things like Flash plug-ins, so if your home page is a Flash intro, the search engines may go no further

But they can infer what’s in an image

Most search engines ignore <meta> tags and other metadata


3. How does it serve you the results?

You enter terms,

it looks them up in a reverse index (usually called an inverted index),

then it formats the results on the way out.


Reverse index

naillie 1, 2, 17, 19, 76, etc.

partridge 1, 2, 35, 76, 8, 65

(Weights, rankings, etc. are not shown.)
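A sketch of how such an index could be built and searched follows. The page numbers and page texts are invented for the example, and, like the slide, it leaves out weights and rankings: it only answers "which pages contain all of these words?"

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to the sorted list of page numbers that contain it."""
    index = defaultdict(set)
    for number, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(number)
    return {word: sorted(numbers) for word, numbers in index.items()}


def search(index, query):
    """Return the pages that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


# Hypothetical pages, keyed by page number.
PAGES = {
    1: "The naillie and the partridge",
    2: "A partridge met a naillie",
    17: "Naillie news",
    35: "Partridge recipes",
}

index = build_index(PAGES)
print("naillie:", index["naillie"])
print("partridge:", index["partridge"])
print("naillie partridge:", search(index, "naillie partridge"))
```

The lookup never rereads the pages themselves, which is why a search over billions of pages can come back in a fraction of a second.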


Formatting on the way out

Look at the cached copy;

get the title, keywords in context, etc.

Google: toronto shoes
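A rough sketch of the "keywords in context" step: find the search term in the cached text and show a few characters on either side. The function names, the sample page, and its URL are hypothetical, and real engines choose their snippets with much more care.

```python
import re


def keyword_in_context(text, keyword, width=30):
    """Return the first occurrence of keyword with some surrounding text, or None."""
    match = re.search(re.escape(keyword), text, flags=re.IGNORECASE)
    if match is None:
        return None
    start = max(match.start() - width, 0)
    end = min(match.end() + width, len(text))
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix


def format_result(title, url, cached_text, query_terms):
    """Build a plain-text result: title, URL, then one snippet per matching term."""
    snippets = []
    for term in query_terms:
        snippet = keyword_in_context(cached_text, term)
        if snippet is not None:
            snippets.append(snippet)
    return "\n".join([title, url, " ".join(snippets)])


# Hypothetical cached copy of a page.
CACHED_TEXT = (
    "Welcome to our store. We have sold comfortable shoes in downtown "
    "Toronto for many years, and we repair boots as well."
)

print(format_result(
    "A Shoe Store",
    "http://www.example.org/",
    CACHED_TEXT,
    ["toronto", "shoes"],
))
```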


Search engine optimization

SEO means bringing in visitors by designing your web site so that search engines list it high on their results pages when people search for certain keywords, without buying ads.


White hat SEO

Improving your organic search results by making your site understandable to search engine crawlers, and by getting people to link to you.


White hat SEO

Have good content

Semantic markup again: <title>Good Titles Matter</title> <h1>So do headings</h1>

Use text

Host on a reliable, trustworthy site


Black hat SEO

Pulling tricks to make search engines think your site is more popular than it is or to mislead them about the content.


SEO browser extensions

Google Toolbar

Search Firefox Add-ons for “pagerank” and “seo”, but mind the privacy issues: check what is being reported, and to whom, about the pages you view


Advertising

Search engines sell ads. You can pay to get your web page listed at the top of the results page for desired keywords.

Search engines are advertising companies.


Google: AdWords and AdSense

AdWords is Google’s program for selling ads on its site in results pages. Try their Keyword Tool.

AdSense puts little boxes of context-relevant ads on web pages of people who want to make a little money. (Or a lot.)


Yahoo

Video explaining how “sponsored search” works

See how much it would cost to buy some keywords


Avoiding advertising

The Firefox extensions Adblock and Customize Google will make your web browsing ad-free and much more pleasant.


Invisible web or deep web

Search engines miss a lot of web content:

It’s behind a login

It’s dynamically generated

It’s embedded in Flash or Java applets


2/3 of the web goes unseen

He, Patel, Zhang, Chang, Accessing the Deep Web: A Survey (Communications of the ACM, 50: 5, May 2007)


Library databases

Library databases (JSTOR, PsycInfo, Scholars Portal) are part of the deep web. They have enormous amounts of information that’s hidden from the public … except when it’s not, as through Google Scholar.


Library databases

Differences in rankings, algorithms

Full use of metadata

Usually sorted by date

PageRank is based on citation analysis, but these databases don’t use citation analysis to rank relevant papers

Scholars Portal: search Natural Sciences for australopithecus


Final note on privacy

Everything you do at a search engine is logged. They track you by IP address and with cookies and logins.

Assume all that information is stored forever.

See http://blog.searchenginewatch.com/blog/060206-150030


Further reading: online

searchenginewatch.com

John Battelle’s Searchblog

Online: Exploring Technology and Resources for Information Professionals

Wikipedia (usually quite strong on technical articles)

The library’s Computer Science Research Guide


Further reading: books

Battelle, John. 2006. The Search: The Inside Story of How Google and Its Rivals Changed Everything (Portfolio)

Berry, Michael W., and Murray Browne. 2005. Understanding Search Engines: Mathematical Modelling and Text Retrieval (SIAM)

Levene, Mark. 2006. An Introduction to Search Engines and Web Navigation (Addison-Wesley)
