
CS120: Lecture 18

MP Johnson

Hunter

[email protected]

Agenda

• Websearch
  – Crawling
  – Ordering
  – PageRank
  – Ads

Next topic: Websearch

• Create a search engine for searching the web

• DBMS queries use tables and (optionally) indices

• First thing to understand about websearch:
  – we never run queries on the web
  – Way too expensive, for several reasons

• Instead:
  – Build an index of the web
  – Search the index
  – Return the results

Crawling

• To obtain the data for the index, we crawl the web
  – Automated web-surfing
  – Conceptually very simple
  – But difficult to do robustly
• First, must get pages (see the sketch after this list)
  – Put start page in a queue
  – Repeat {
      remove and store first element;
      insert all its links into queue
    } until …
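A minimal Java sketch of the crawl loop above. fetch, store, and extractLinks are hypothetical placeholders standing in for an HTTP client, a page store, and an HTML link extractor; the robustness issues on the next slides (DNS, robots.txt, bad HTML, duplicates) are deliberately omitted.

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Minimal sketch of the queue-based crawl loop from the slide.
public class CrawlerSketch {

    static void crawl(String startUrl, int maxPages) {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();   // avoid re-queueing the same URL
        frontier.add(startUrl);
        seen.add(startUrl);

        while (!frontier.isEmpty() && seen.size() < maxPages) {
            String url = frontier.remove();   // remove first element
            String html = fetch(url);         // download the page
            store(url, html);                 // store it for indexing
            for (String link : extractLinks(html)) {  // insert all its links
                if (seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
    }

    // Hypothetical stand-ins; a real crawler would use an HTTP client,
    // an HTML parser, robots.txt checks, politeness delays, etc.
    static String fetch(String url) { return ""; }
    static void store(String url, String html) { }
    static List<String> extractLinks(String html) { return List.of(); }
}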

Crawling issues in practice

• DNS bottleneck
  – to view page by text link, must get address
  – BP claim: 87% of crawling time ~ DNS look-up
• Search strategy?
• Refresh strategy?
• Primary key for webpages
  – Use artificial IDs, not URLs
  – more popular pages get shorter DocIDs (why?)

Crawling issues in practice

• Content-seen test
  – compute fingerprint/hash (again!) of page content (see the sketch below)
• robots.txt
  – http://www.robotstxt.org/wc/robots.html
• Bad HTML
  – Tolerant parsing
• Non-responsive servers
• Spurious text
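A minimal sketch of the content-seen test, assuming Java's built-in MessageDigest as the fingerprint function (the slides don't specify the real crawler's fingerprinting scheme, so this is only illustrative):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

// Content-seen test sketch: hash each page's text and skip pages whose
// fingerprint we have already stored, even if the URL is new.
public class ContentSeen {
    private final Set<String> fingerprints = new HashSet<>();

    // Returns true if this content was already crawled under some other URL.
    public boolean alreadySeen(String pageText) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(pageText.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return !fingerprints.add(hex.toString());
    }
}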

Inverted indices

• What's needed to answer queries:
  – Create inverted index mapping words to pages
• First, think of each webpage as a tuple
  – One column for each possible word
  – True means the word appears on the page
  – Index on all columns
• Now can search: john bolton

select * from T where john=T and bolton=T

Inverted indices

• Can simplify somewhat:
  1. For each field index, delete False entries
  2. True entries for each index become a bucket
• Create an inverted index:
  – One entry for each search word
    • the lexicon
  – Search word entry points to corresponding bucket
  – Bucket points to pages with its word
    • the postings file
• Final intuition: the inverted index doesn't map URLs to words
  – It maps words to URLs (see the sketch below)
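A toy Java sketch of the lexicon/postings idea above: the map's key set plays the role of the lexicon, and each sorted set of docIDs is that word's bucket in the postings file. Class and method names are illustrative, not Google's.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Toy inverted index: words map to buckets of docIDs instead of
// each document listing its words.
public class InvertedIndexSketch {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    public void addDocument(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            postings.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
        }
    }

    // Multi-word query: intersect the buckets of all query words.
    public List<Integer> query(String... words) {
        List<Integer> result = new ArrayList<>();
        TreeSet<Integer> first = postings.get(words[0].toLowerCase());
        if (first == null) return result;
        outer:
        for (int docId : first) {
            for (int i = 1; i < words.length; i++) {
                TreeSet<Integer> bucket = postings.get(words[i].toLowerCase());
                if (bucket == null || !bucket.contains(docId)) continue outer;
            }
            result.add(docId);
        }
        return result;
    }
}

For example, after addDocument(1, "john bolton speech") and addDocument(2, "john smith"), query("john", "bolton") returns only doc 1.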

Inverted Indices

• What’s stored?

• For each word W, for each doc D
  – relevance of D to W
  – #/% occurrences of W in D
  – meta-data/context: bold, font size, title, etc.
• In addition to page importance, keep in mind:
  – this info is used to determine relevance of particular words appearing on the page

Google-like infrastructure

• Very large distributed system
  – File sizes routinely in GBs → Google File System
    • Block size = 64MB (not KB)!
  – 100k+ low-quality Linux boxes → system failures are the rule, not the exception
• Divide index up by words into many barrels
  – lexicon maps word ids to word's barrel
  – also, do RAID-like strategy → two-D matrix of servers
    • many commodity machines → frequent crashes
  – Draw picture
  – May have more duplication for popular pages…

Google-like infrastructure

• To respond to single-word query Q(w):
  – send to the barrel column for word w
    • pick random server in that column
  – return (some) sorted results
• To respond to multi-word query Q(w1…wn):
  – for each word wi, send to the barrel column for wi
    • pick random server in that column
  – for all words in parallel, merge and prune (see the sketch after this list)
    • step through until find doc containing all words, add to results
    • index ordered on word;docID, so linear time
  – return (some) sorted results
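A minimal sketch of the linear-time step-through, assuming each word's postings are already sorted by docID as the slide says; plain int arrays stand in for the real barrel structures.

import java.util.ArrayList;
import java.util.List;

// Linear-time intersection of sorted posting lists: advance a cursor in
// each list, and emit a docID only when all cursors agree on it.
public class PostingsMerge {
    public static List<Integer> intersect(int[][] postings) {
        List<Integer> result = new ArrayList<>();
        int[] cursor = new int[postings.length];
        outer:
        while (true) {
            if (cursor[0] >= postings[0].length) break;
            int candidate = postings[0][cursor[0]];   // current docID in first list
            for (int i = 1; i < postings.length; i++) {
                // Advance list i until it reaches or passes the candidate.
                while (cursor[i] < postings[i].length && postings[i][cursor[i]] < candidate) {
                    cursor[i]++;
                }
                if (cursor[i] >= postings[i].length) break outer;  // a list is exhausted
                if (postings[i][cursor[i]] > candidate) {          // candidate missing here
                    cursor[0]++;
                    continue outer;
                }
            }
            result.add(candidate);  // every list contains candidate
            cursor[0]++;
        }
        return result;
    }
}

Each cursor only moves forward, so the total work is linear in the combined length of the posting lists.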

New topic: Sorting Results

• How to respond to Q(w1,w2,…,wn)?
  – Search index for pages with w1,w2,…,wn
  – Return in sorted order (how?)
• Soln 1: current order
  – Return 100,000 (mostly) useless results
  – Sturgeon's Law: "Ninety percent of everything is crud."
• Soln 2: sort by relevance
  – Use techniques from Information Retrieval Theory
  – library science + CS = IR

Simple IR-style approach

• for each word W in a doc D, compute
  – # occurrences of W in D / total # word occurrences in D
• Each document becomes a point in a space
  – one dimension for every possible word
    • Like k-NN and k-means
  – value in that dim is the ratio from above (maybe weighted, etc.)
  – Choose pages with high values for query words
• A little more precisely: each doc becomes a vector in space
  – Values same as above
  – But: think of the query itself as a document vector
  – Similarity between query and doc = dot product / cos (see the sketch below)
  – Draw picture
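A hedged Java sketch of the vector-space idea: plain term-frequency ratios as vector values (no weighting), with cosine similarity between query and document. All names are illustrative.

import java.util.HashMap;
import java.util.Map;

// Each document (and the query) becomes a map from word to its frequency
// ratio; similarity is the cosine of the angle between the two vectors.
public class CosineSimilarity {

    // Term-frequency vector: count(word) / total words in the text.
    static Map<String, Double> tfVector(String text) {
        Map<String, Double> tf = new HashMap<>();
        String[] words = text.toLowerCase().split("\\W+");
        for (String w : words) {
            if (!w.isEmpty()) tf.merge(w, 1.0, Double::sum);
        }
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            e.setValue(e.getValue() / words.length);
        }
        return tf;
    }

    // cos(query, doc) = dot product / (|query| * |doc|)
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) normB += v * v;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}

Ranking is then just sorting documents by cosine(queryVector, docVector).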

Information Retrieval Theory

• With some extensions, this works well for relatively small sets of quality documents

• But the web has 600 billion docs
  – Prob: based just on percentages, very short pages with query words score very high
  – BP: query a "major search engine" for "bill clinton" → "Bill Clinton Sucks" page

Soln 3: sort by rel. and “quality”

• What do you mean by quality?

• Hire readers to rate my webpage (early Yahoo)

• Problem: doesn’t scale well– more webpages than Yahoo employees…

Soln 4: count # citations (links)

• Idea: you don’t have to hire webpage raters

• The rest of the web has already voted on the quality of my webpage

• 1 link to my page = 1 vote

• Similar to counting academic citations
  – Peer review

Soln 5: Google’s PageRank

• Count citations, but not equally – weighted sum
• Motiv: we said we believe that some pages are better than others → those pages' votes should count for more
• A page can get a high PageRank many ways
• Two cases at ends of a continuum:
  – many pages link to you
  – yahoo.com links to you

• Capitalist, not democratic

PageRank

• More precisely, let P be a page;

• for each page Li that links to P,

• let C(Li) be the number of pages Li links to.

• Then PR0(P) = SUM( PR0(Li)/C(Li) )
• Motiv: each page votes with its quality;
  – its quality is divided among the pages it votes for
  – Extensions: bold/large type/etc. links may get larger proportions…

Understanding PageRank (skip?)

• Analogy 1: Friendster/Orkut
  – someone "good" invites you in
  – someone else "good" invited that person in, etc.
• Analogy 2: PKE certificates
  – my cert authenticated by your cert
  – your cert endorsed by someone else's…
• Both cases here: eventually reach a foundation
• Analogy 3: job/school recommendations
  – three people recommend you
  – why should anyone believe them?
    • three other people recommended them, etc.
    • eventually, we take a leap of faith

Understanding PageRank

• Analogy 4: Random Surfer Model

• Idealized web surfer:
  – First, start at some page
  – Then, at each page, pick a random link…
• Turns out: after a long time surfing,
  – Pr(we're at some page P right now) = PR0(P)
  – PRs are normalized

Computing PageRank

• For each page P, we want:
  – PR(P) = SUM( PR(Li)/C(Li) )
• But it's circular – how to compute?
• Meth 1: for n pages, we've got n linear eqs and n unknowns
  – can solve for all PR(P)s, but too hard
  – see your linear algebra course…
• Meth 2: iteratively (see the sketch after this list)
  – start with PR0(P) set to E for each P
  – iterate until no more significant change
  – BP: O(50) iterations for O(30M) pages/O(300M) links
    • #iters required grows only with log of web size
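A minimal Java sketch of Meth 2 on a tiny hypothetical three-page graph (this is not the PageRank.java linked on the later slides): every page starts at E, and the sum rule is reapplied until the values stop changing. Dead ends and rank sinks are deliberately not handled, which is exactly what the next slides address.

// Minimal sketch of Meth 2: repeatedly apply PR(P) = SUM(PR(Li)/C(Li))
// over a tiny hard-coded graph until the values stop changing much.
public class IterativePageRank {
    public static void main(String[] args) {
        // links[i] = pages that page i links to (hypothetical 3-page graph)
        int[][] links = { {1, 2}, {0}, {0, 1} };
        int n = links.length;
        double e = 1.0;                 // initial value E for every page
        double[] pr = new double[n];
        java.util.Arrays.fill(pr, e);

        for (int iter = 0; iter < 100; iter++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double share = pr[i] / links[i].length;  // quality divided among out-links
                for (int j : links[i]) next[j] += share;
            }
            double change = 0;
            for (int i = 0; i < n; i++) change += Math.abs(next[i] - pr[i]);
            pr = next;
            if (change < 1e-6) break;   // no more significant change
        }
        System.out.println(java.util.Arrays.toString(pr));
    }
}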

Problems with PageRank

• Example (from Ullman):
  – A points to Y, M;
  – Y points to self, A;
  – M points nowhere (draw picture)
• Start A,Y,M at 1:
  – http://pages.stern.nyu.edu/~mjohnson/dbms/archive/spring05/eg/PageRank.java
• (1,1,1) → (0,0,0)
  – The rank dissipates
• Soln: add (implicit) self link to any dead-end

C:\ java PageRank

Problems with PageRank

• Example (from Ullman):
  – A points to Y, M;
  – Y points to self, A;
  – M points to self
• Start A,Y,M at 1:
  – http://pages.stern.nyu.edu/~mjohnson/dbms/archive/spring05/eg/PageRank2.java
• (1,1,1) → (0,0,3)
  – Now M becomes a rank sink
  – RSM interp: we eventually end up at M and then get stuck
• Soln: add "inherent quality" E to each page

C:\ java PageRank2

Modified PageRank

• Apart from inherited quality, each page also has inherent quality E:
  – PR(P) = E + SUM( PR(Li)/C(Li) )
• More precisely, have weighted sum of the two terms:
  – PR(P) = .15*E + .85*SUM( PR(Li)/C(Li) )
  – http://pages.stern.nyu.edu/~mjohnson/dbms/archive/spring05/eg/PageRank3.java
• Leads to a modified random surfer model (see the sketch below)

C:\ java PageRank3
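A hedged sketch of the weighted-sum update on the A/Y/M rank-sink graph above; this is only illustrative and is not the PageRank3.java linked on the slide.

// Modified PageRank sketch: each update is the weighted sum
// PR(P) = .15*E + .85*SUM(PR(Li)/C(Li)).
public class ModifiedPageRank {
    public static void main(String[] args) {
        int[][] links = { {1, 2}, {1, 0}, {2} };  // A->{Y,M}, Y->{Y,A}, M->{M}
        int n = links.length;
        double e = 1.0;
        double[] pr = new double[n];
        java.util.Arrays.fill(pr, e);

        for (int iter = 0; iter < 100; iter++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) next[i] = 0.15 * e;            // inherent quality
            for (int i = 0; i < n; i++) {
                double share = 0.85 * pr[i] / links[i].length;          // inherited quality
                for (int j : links[i]) next[j] += share;
            }
            pr = next;
        }
        // M no longer swallows all the rank, because .15*E keeps flowing to A and Y.
        System.out.println(java.util.Arrays.toString(pr));
    }
}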

Random Surfer Model’

• Motiv: if we (qua random surfer) end up at page M, we don't really stay there forever
  – We type in a new URL
• Idealized web surfer:
  – First, start at some page
  – Then, at each page, pick a random link
  – But occasionally, we get bored and jump to a random new page
• Turns out: after a long time surfing,
  – Pr(we're at some page P right now) = PR(P)
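One way to see the claim is to simulate the bored surfer and count visits; a sketch, assuming a 15% chance of a random jump to match the .15 weight above, where the visit frequencies approximate the normalized PR values.

import java.util.Random;

// Random surfer simulation: follow a random out-link 85% of the time,
// jump to a uniformly random page 15% of the time, and count visits.
public class RandomSurfer {
    public static void main(String[] args) {
        int[][] links = { {1, 2}, {1, 0}, {2} };  // same toy graph: A, Y, M
        int n = links.length;
        long[] visits = new long[n];
        Random rnd = new Random(42);
        int page = 0;

        long steps = 1_000_000;
        for (long t = 0; t < steps; t++) {
            visits[page]++;
            if (rnd.nextDouble() < 0.15 || links[page].length == 0) {
                page = rnd.nextInt(n);                                 // bored: jump anywhere
            } else {
                page = links[page][rnd.nextInt(links[page].length)];   // follow a random link
            }
        }
        for (int i = 0; i < n; i++) {
            System.out.printf("page %d: %.3f%n", i, visits[i] / (double) steps);
        }
    }
}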

Understanding PageRank

• One more interp: hydraulic model
  – picture the web graph again
  – imagine each link as a tube between two nodes
  – imagine quality as fluid
  – each node is a reservoir initialized with amount E of fluid
• Now let flow…
• Steady state: each node P holds PR(P) amount of fluid
  – PR(P) of fluid eventually settles in node P
  – equilibrium

Supervenience in PR & elsewhere (skip?)

• Sornette: "Why Stock Markets Crash"
  – Si(t+1) = sign(ei + SUM(Sj(t)))
  – trader buys/sells based on
    1. his inclination and
    2. what his associates are saying
• direction of magnet determined by
  1. its old direction and
  2. directions of neighbors
• activation of neuron determined by
  1. its properties and
  2. activation of neighbors connected by synapses
• PR of P based on
  1. its inherent value and
  2. PR of its in-links

Non-uniform Es (skip?)

• So far, assumed E was constant for all pages
• But can make E a function E(P)
  – vary by page
• How do we choose E(P)?
• Idea 1: set high for pages with high PR from earlier iterations
• Idea 2: set high for pages I like
  – BP paper gave high E to John McCarthy's homepage → pages he links to get high PR, etc.
  – Result: his own personalized search engine
  – Q: How would google.com get your prefs?

Next: Tricking search engines

• PR assumes linking is "honest"
  – Just as the Stable Marriage alg assumes honesty
• "Search Engine Optimization"
• Challenge: include on your page lots of words you think people will query on
  – maybe hidden with same color as background
• Response: popularity ranking
  – the pages doing this probably aren't linked to that much
  – but…

Tricking search engines

• Goal: make my page look popular to Google
• Challenge: create a page with 1000 links to my page
• Response: those links don't matter
• Challenge: Create 1000 other pages linking to it
• Response: limit the weight a single domain can give to itself
• Challenge: buy a second domain and put the 1000 pages there
• Response: limit the weight from any single domain…

Another good idea: Use anchor text

• Motiv: pages may not give best descriptions of themselves
  – most search engines' own pages don't contain "search engine"
  – BP claim: only 1 of 4 "top search engines" could find themselves on the query "search engine"
• Anchor text also describes the page:
  – many pages link to google.com
  – many of them likely say "search engine" in/near the link
  – → Treat anchor text words as part of the page
• Search for "US West" or for "g++"

Tricking search engines

• This provides a new way to trick Google
• Use of anchor text is a big part of result quality
  – but has potential for abuse
  – Lets you influence when other people's pages appear
• Google Bombs
  – put up lots of pages linking to my page, using some particular phrase in the anchor text
  – result: a search for the words you chose produces my page
  – Examples: "talentless hack", "miserable failure", "waffles", the last name of a prominent US senator…

Next: Ads

• Google had two really great ideas:
  1. PageRank
  2. Bidding for ads
• Fundamental difficulty with mass ads:
  – Most of the audience doesn't want it
  – Most people don't want what you're selling
  – Think of car commercials on TV
• But some of them do!

Bidding for ads

• If you're selling widgets, how do you know who wants them?
  – Hard question, so answer its inversion
• If someone is searching for widgets, what should you try to sell them?
  – Easy – widgets!
  – Or widget cases, etc…
• Whatever the user searches for, display ads relevant to that query

Bidding for ads

• Q: How to choose correspondences?
• A: Create a market, and let it decide
• Each company places the bid it's willing to pay for an ad responding to a particular query
• Ad auction "takes place" at query-time (see the sketch after this list)
  – Relevant ads displayed in descending bid order
  – Company pays only if user clicks
• AdSense: place ads on external webpages, auction based on page content instead of query
• Huge huge huge business
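A minimal Java sketch of the query-time selection described above: keep the ads whose keyword matches the query and show them in descending bid order, charging only on click. The Ad record and all names and values are hypothetical; real ad auctions (e.g. second-price rules) are more involved.

import java.util.Comparator;
import java.util.List;

// Query-time ad auction sketch: match ads to the query word, sort by bid.
public class AdAuctionSketch {
    record Ad(String advertiser, String keyword, double bidPerClick) {}

    static List<Ad> selectAds(List<Ad> ads, String queryWord, int slots) {
        return ads.stream()
                .filter(ad -> ad.keyword().equalsIgnoreCase(queryWord))
                .sorted(Comparator.comparingDouble(Ad::bidPerClick).reversed())
                .limit(slots)
                .toList();
    }

    public static void main(String[] args) {
        List<Ad> ads = List.of(
                new Ad("WidgetCo", "widgets", 0.50),
                new Ad("WidgetWorld", "widgets", 0.75),
                new Ad("GadgetInc", "gadgets", 1.00));
        // Two ad slots for the query "widgets": WidgetWorld first, then WidgetCo.
        System.out.println(selectAds(ads, "widgets", 2));
    }
}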

For more info

• See sources drawn upon here:
• Prof. Davis (NYU/CS) search engines course
  – http://www.cs.nyu.edu/courses/fall02/G22.3033-008/
• Original research papers by Page & Brin:
  – The PageRank Citation Ranking: Bringing Order to the Web
  – The Anatomy of a Large-Scale Hypertextual Web Search Engine
  – Links on class page
  – Interesting and very accessible
• Google Labs: http://labs.google.com