behind the scenes at a search engine william denton web librarian, york university libraries 20...
TRANSCRIPT
Behind the Scenes at a Search Engine
William Denton <[email protected]>
Web Librarian, York University Libraries
20 March 2008http://www.library.yorku.ca/binaries/frontiers/20080320-denton-search-engine.ppt
Denton: Search Engines / 20 March 2008 / York 2
To be covered The three basic parts of a web search Search engine optimization Advertising, and how to avoid it Library databases and the deep web
Ask questions any time.
Denton: Search Engines / 20 March 2008 / York 3
Google, Yahoo, Ask, Live, A9
www.google.com: australopithecussearch.yahoo.com: australopithecuswww.ask.com: australopithecussearch.live.com: australopithecusa9.com: australopithecus
Denton: Search Engines / 20 March 2008 / York 4
The computing power, bandwidth, and electricity use is mind-blowing
David F. Carr, How Google Works
Urs Hölze talk on Google’s Linux cluster (2002)
Ginger Strand, Keyword: Evil (Harper’s, March 2008)
Denton: Search Engines / 20 March 2008 / York 5
What happens when you search?
You enter in some words
You get good links
And sometimes other good stuff
Denton: Search Engines / 20 March 2008 / York 6
Three things
1. How does it know about everything?
Denton: Search Engines / 20 March 2008 / York 7
Three things
2. How does it decide what’s relevant?
Denton: Search Engines / 20 March 2008 / York 8
Three things
3. How does it serve you the results?
Denton: Search Engines / 20 March 2008 / York 9
1. How does it know about everything?
Crawlers are continually moving around the web, looking for whatever they can find.
Different search engines crawl different numbers of pages, but they all do in the billions.
Denton: Search Engines / 20 March 2008 / York 10
A visit from a Googlebot
66.249.73.229 - - [09/Mar/2008:04:15:47 -0400] "GET /ccm/jsp/homepage.jsp HTTP/1.1" 200 13670 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; http://
www.google.com/bot.html) - -"
Denton: Search Engines / 20 March 2008 / York 11
robots.txt
Polite web crawlers first check the robots.txt file on a web site and obey its rules.
Wikipedia’s robots.txt York’s robots.txt www.robotstxt.org
Denton: Search Engines / 20 March 2008 / York 12
Crawl, download, scan, repeat
Crawlers crawl and crawl and download all the pages (and Word files and PDFs, and spreadsheets) they come across. They scan the page for links and add them to their list of pages to crawl, ad infinitum.
Denton: Search Engines / 20 March 2008 / York 13
2. How does it decide what’s relevant? What’s on the page Inbound and outbound links (PageRank) Frequency of updates The kind of site it is Clickthroughs and usage analysis People tweaking the rules Secret other stuff
Denton: Search Engines / 20 March 2008 / York 14
Term frequency
How many times a word appears in a document
Denton: Search Engines / 20 March 2008 / York 15
Inverse document frequency
Number of documents /
number of documents containing the term
(Actually the logarithm of this.)
Denton: Search Engines / 20 March 2008 / York 16
TF-IDF
TF-IDF of a keyword in a page = TF * IDF
Denton: Search Engines / 20 March 2008 / York 17
Example100 web pages. Keyword: naillie#1 has 8 mentions. TF = 8.#2, 17, 19, 76 have 4 mentions. TF = 4.20 pages have 1 mention. TF = 1.
IDF = log2 (100 / 25) = 2
TF-IDF of naillie in #1 = 8 * 2 = 16 High!TF-IDF of naillie in #2, 17, 19, 76 = 4 * 2 = 8 Not so highTF-IDF of naillie in 20 others = 1 * 2 = 2 SmallTF-IDF of naillie in all the rest = 0 * 2 = 0 Irrelevant
Denton: Search Engines / 20 March 2008 / York 18
HTML helps a lot
<title>The title of the page</title>
<h1>The most important heading</h1>
<h2>Lesser headings</h2>
<a href=“http://www.yorku.ca/”>Hyperlinks</a>
Text at the top of the page
Denton: Search Engines / 20 March 2008 / York 19
Semantic markupWrong: say how it
should look
<b><font size=“26”>
<i>Upon the Distinction Between the Ashes of the Various Tobaccos</i>
</font></b>
Right: say what it is (then apply a look)
<h1>Upon the Distinction Between the Ashes of the Various Tobaccos</h1>
Denton: Search Engines / 20 March 2008 / York 20
PageRank! Algorithm designed by Larry Page and
Sergey Brin when they were at Stanford One of the things Google uses in deciding
how important a web page is US Patent 6,285,999 The Anatomy of a Large-Scale Hypertextual
Web Search Engine (Brin, Page, 1998)
Denton: Search Engines / 20 March 2008 / York 21
Denton: Search Engines / 20 March 2008 / York 22
http://en.wikipedia.org/wiki/PageRank
Wikipedia has a good explanation of PageRank
Denton: Search Engines / 20 March 2008 / York 23
Other weightings Frequency of updates The kind of site it is: blog? wiki? institutional
repository? Who runs the site? Search engine companies tweak the rules They don’t give away their secrets, but lots
of people try to reverse engineer the algorithms
Denton: Search Engines / 20 March 2008 / York 24
Google’s explanation“Google combines PageRank with sophisticated
text-matching techniques to find pages that are both important and relevant to your search. Google goes far beyond the number of times a term appears on a page and examines dozens of aspects of the page's content (and the content of the pages linking to it) to determine if it's a good match for your query.”
http://www.google.com/technology/
Denton: Search Engines / 20 March 2008 / York 25
Live’s explanation“Live Search website ranking is completely
automated. The Live Search ranking algorithm analyzes factors such as web page content, the number and quality of websites that link to your pages, and the relevance of your website’s content to keywords. The algorithm is complex and is never human-mediated. You can't pay to boost your website’s relevance ranking; however, we do offer advertising options for website owners.”
http://help.live.com/help.aspx?mkt=en-us&project=wl_webmasters
Denton: Search Engines / 20 March 2008 / York 26
What doesn’t it look at? Crawlers don’t see inside fancy things like
Flash plug-ins, so if your home page is a Flash intro, the search engines may go no further
But they can infer what’s in an image Most search engines ignore <meta> tags
and other metadata
Denton: Search Engines / 20 March 2008 / York 27
3. How does it serve you the results?
You enter terms,
it looks them up in a reverse index,
then it formats the results on the way out.
Denton: Search Engines / 20 March 2008 / York 28
Reverse index
naillie 1, 2, 17, 19, 76, etc.
partridge 1, 2, 35, 76, 8, 65
Not showing weights and rankings etc.
Denton: Search Engines / 20 March 2008 / York 29
Formatting on the way out
Look at the cached copy;
get the title, keywords in
context, etc.
Google: toronto shoes
Denton: Search Engines / 20 March 2008 / York 30
Search engine optimization
SEO is bringing in people by designing your web site so that search engines will list you high on results pages when people search for certain keywords—without buying ads.
Denton: Search Engines / 20 March 2008 / York 31
White hat SEO
Improving your organic search results by making your site understandable by search engine crawlers, and by getting people to link to you.
Denton: Search Engines / 20 March 2008 / York 32
White hat SEO
Have good content
Semantic markup again: <title>Good Titles Matter</title> <h1>So do headings</h1>
Use text
Host on a reliable, trustworthy site
Denton: Search Engines / 20 March 2008 / York 33
Black hat SEO
Pulling tricks to make search engines think your site is more popular than it is or to mislead them about the content.
Denton: Search Engines / 20 March 2008 / York 34
SEO browser extensions
Google Toolbar Search Firefox Add-ons for “pagerank” and
“seo”, but mind the privacy issues on what’s being reported where about the pages you view
Denton: Search Engines / 20 March 2008 / York 35
Advertising
Search engines sell ads. You can pay to get your web page listed at the top of the results page for desired keywords.
Search engines are advertising companies.
Denton: Search Engines / 20 March 2008 / York 36
Google: AdWords and AdSense
AdWords is Google’s program for selling ads on its site in results pages. Try their Keyword Tool.
AdSense puts little boxes of context-relevant ads on web pages of people who want to make a little money. (Or a lot.)
Denton: Search Engines / 20 March 2008 / York 37
Yahoo
Video explaining how “sponsored search” works
See how much it would cost to buy some keywords
Denton: Search Engines / 20 March 2008 / York 38
Avoiding advertising
The Firefox extensions Adblock and Customize Google will make your web browsing ad-free and much more pleasant.
Denton: Search Engines / 20 March 2008 / York 39
Denton: Search Engines / 20 March 2008 / York 40
Denton: Search Engines / 20 March 2008 / York 41
Invisible web or deep web
Search engines miss a lot of web content: It’s behind a login It’s dynamically generated It’s embedded in Flash or Java applets
Denton: Search Engines / 20 March 2008 / York 42
2/3 of the web goes unseen
He, Patel, Zhang, Chang, Accessing the Deep Web: A Survey (Communications of the ACM, 50: 5, May 2007)
Denton: Search Engines / 20 March 2008 / York 43
Library databases
Library databases (JSTOR, PsycInfo, Scholars Portal) are part of the deep web. They have enormous amounts of information that’s hidden from the public … except when it’s not, as through Google Scholar.
Denton: Search Engines / 20 March 2008 / York 44
Library databases
Differences in rankings, algorithmsFull use of metadataUsually sorted by datePageRank is based on citation analysis, but
these databases don’t use citation analysis to rank relevant papers
Scholars Portal: search Natural Sciences for australopithecus
Denton: Search Engines / 20 March 2008 / York 45
Final note on privacy
Everything you do at a search engine is logged. They track you by IP address and with cookies and logins.
Assume all that information is stored forever.
See http://blog.searchenginewatch.com/blog/060206-150030
Denton: Search Engines / 20 March 2008 / York 46
Further reading: online
searchenginewatch.com John Battelle’s Searchblog Online: Exploring Technology and Resource
s for Information Professionals
Wikipedia (usually quite strong on technical articles)
The library’s Computer Science Research Guide
Denton: Search Engines / 20 March 2008 / York 47
Further reading: books
Battelle, John. 2006. The Search: The Inside Story of How Google and Its Rivals Changed Everything (Portfolio)
Berry, Michael W., and Murray Browne. 2005. Understanding Search Engines: Mathematical Modelling and Text Retrieval (SIAM)
Levene, Mark. 2006. An Introduction to Search Engines and Web Navigation (Addison-Wesley)
Denton: Search Engines / 20 March 2008 / York 48