



UNIT IV WEB SEARCH – LINK ANALYSIS AND SPECIALIZED SEARCH

Link Analysis – hubs and authorities – Page Rank and HITS algorithms -

Searching and Ranking – Relevance Scoring and ranking for Web – Similarity - Hadoop & Map Reduce - Evaluation - Personalized search - Collaborative filtering and content-based recommendation of documents and products – handling “invisible” Web - Snippet generation, Summarization, Question Answering, Cross-Lingual Retrieval.

4.1 Link Analysis:

The analysis of hyperlinks and the graph structure of the web have been

instrumental in the development of web search. Such link analysis is one of many

factors considered by web search engines in computing a composite score for a web

page on any given query.

Link analysis for web search has intellectual antecedents in the field of citation

analysis, aspects of which overlap with an area known as bibliometrics. These

disciplines seek to quantify the influence of scholarly articles by analyzing the pattern of

citations amongst them. Much as citations represent the conferral of authority from a

scholarly article to others, link analysis on the Web treats hyperlinks from a web page to

another as a conferral of authority. Clearly, not every citation or hyperlink implies such

authority conferral; for this reason, simply measuring the quality of a web page by the

number of in-links (citations from other pages) is not robust enough. For instance, one

may contrive to set up multiple web pages pointing to a target web page, with the intent

of artificially boosting the latter’s tally of in-links. This phenomenon is referred to as link

spam. Nevertheless, the phenomenon of citation is prevalent and dependable enough

that it is feasible for web search engines to derive useful signals for ranking from more

sophisticated link analysis. Link analysis also proves to be a useful indicator of what

page(s) to crawl next while crawling the web; this is done by using link analysis to guide

the priority assignment in the front queues.

The Web as a graph: (Refer Unit –I)


Anchor text and the web graph:

The following fragment of HTML code from a web page shows a hyperlink

pointing to the home page of the Journal of the ACM:

<a href="http://www.acm.org/jacm/">Journal of the ACM.</a>

In this case, the link points to the page http://www.acm.org/jacm/ and the anchor

text is Journal of the ACM. Clearly, in this example the anchor is descriptive of the

target page. But then the target page (B = http://www.acm.org/jacm/) itself contains the

same description as well as considerable additional information on the journal. So what

use is the anchor text? The Web is full of instances where the page B does not provide

an accurate description of itself. In many cases this is a matter of how the publishers

of page B choose to present themselves; this is especially common with corporate web

pages, where a web presence is a marketing statement. For example, at the time of the

writing of this book the home page of the IBM corporation (http://www.ibm.com) did not

contain the term computer anywhere in its HTML code, despite the fact that IBM is

widely viewed as the world’s largest computer maker. Similarly, the HTML code for the

home page of Yahoo! (http://www.yahoo.com) does not at this time contain the word

portal. The fact that the anchors of many hyperlinks pointing to http://www.ibm.com include the word computer can be exploited by web search engines. For instance, the anchor text terms can be included as terms under which to index the target web page. Thus, the postings for the term computer would include the document http://www.ibm.com and that for the term portal would include the document http://www.yahoo.com, using a special indicator to show that these terms occur as anchor (rather than in-page) text. The use of anchor text has some interesting side-effects. Searching for big blue on most web search engines returns the home page of the IBM corporation as the top hit; this is consistent with the popular nickname that many people use to refer to IBM. On the other hand,

there have been (and continue to be) many instances where derogatory anchor text

such as evil empire leads to somewhat unexpected results on querying for these terms

on web search engines.


The window of text surrounding anchor text (sometimes referred to as extended

anchor text) is often usable in the same manner as anchor text itself; consider for

instance the fragment of web text there is good discussion of vedic scripture

<a>here</a>. This has been considered in a number of settings and the useful width of

this window has been studied;

PAGE RANK:

Our first technique for link analysis assigns to every node in the web graph a

numerical score between 0 and 1, known as its PageRank. The PageRank of a node

will depend on the link structure of the web graph. Given a query, a web search engine

computes a composite score for each web page that combines hundreds of features

such as cosine similarity and term proximity, together with the PageRank score.

Consider a random surfer who begins at a web page (a node of the web graph) and

executes a random walk on the Web as follows. At each time step, the surfer proceeds

from his current page A to a randomly chosen web page that A hyperlinks to. Figure

shows the surfer at a node A, out of which there are three hyperlinks to nodes B, C and

D; the surfer proceeds at the next time step to one of these three nodes, with equal

probabilities 1/3.

As the surfer proceeds in this random walk from node to node, he visits some

nodes more often than others; intuitively, these are nodes with many links coming in

from other frequently visited nodes. The idea behind Page- Rank is that pages visited

more often in this walk are more important. What if the current location of the surfer, the

node A, has no out-links? To address this we introduce an additional operation for our

random surfer: the teleport operation. In the teleport operation the surfer jumps from a

node to any other node in the web graph. This could happen because he types an

address into the URL bar of his browser. The destination of a teleport operation is

modeled as being chosen uniformly at random from all web pages. In other words, if N

is the total number of nodes in the web graph, the teleport operation takes the surfer to

each node with probability 1/N. The surfer would also teleport to his present position

with probability 1/N. In assigning a PageRank score to each node of the web graph, we

use the teleport operation in two ways: (1) When at a node with no out-links, the surfer


invokes the teleport operation. (2) At any node that has outgoing links, the surfer invokes

the teleport operation with probability 0 < α < 1 and the standard random walk with

probability 1 − α, where α is a fixed parameter chosen in advance. Typically, α might be

0.1. Below, we use the theory of Markov chains to argue that when the

surfer follows this combined process (random walk plus teleport) he visits each node v

of the web graph a fixed fraction of the time π(v) that depends on (1) the structure of the

web graph and (2) the value of α. We call this value π(v) the PageRank of v

Markov chains:

A Markov chain is a discrete-time stochastic process: a process that occurs in a series

of time-steps in each of which a random choice is made. A Markov chain consists of N

states. Each web page will correspond to a state in the Markov chain we will formulate.

A Markov chain is characterized by an N × N transition probability matrix P, each of

whose entries is in the interval [0, 1]; the entries in each row of P add up to 1. The

Markov chain can be in one of the N states at any given timestep; then, the entry Pij

tells us the probability that the state at the next timestep is j, conditioned on the current

state being i. Each entry Pij is known as a transition probability and depends only on the

current state i; this is known as the Markov property. Thus, by the Markov property, every entry satisfies Pij ∈ [0, 1], and the entries of each row sum to one:

Σj=1..N Pij = 1, for every i.

A matrix with non-negative entries that satisfies this condition is known as a

stochastic matrix. A key property of a stochastic matrix is that it has a principal left

eigenvector corresponding to its largest eigenvalue, which is 1.


In a Markov chain, the probability distribution of next states for a Markov chain

depends only on the current state, and not on how the Markov chain arrived at the

current state. Above figure shows a simple Markov chain with three states. From the

middle state A, we proceed with (equal) probabilities of 0.5 to either B or C. From either

B or C, we proceed with probability 1 to A. The transition probability matrix of this

Markov chain is then (with rows and columns ordered A, B, C):

        A     B     C
A      0.0   0.5   0.5
B      1.0   0.0   0.0
C      1.0   0.0   0.0

A Markov chain’s probability distribution over its states may be viewed as a

probability vector: a vector all of whose entries are in the interval [0, 1], and the entries

add up to 1. An N-dimensional probability vector each of whose components

corresponds to one of the N states of a Markov chain can be viewed as a probability

distribution over its states. For our simple Markov chain of above figure the probability

vector would have 3 components that sum to 1.

We can view a random surfer on the web graph as a Markov chain, with one

state for each web page, and each transition probability representing the probability of

moving from one web page to another. The teleport operation contributes to these

transition probabilities. The adjacency matrix A of the web graph is defined as follows: if

there is a hyperlink from page i to page j, then Aij = 1, otherwise Aij = 0. We can readily

derive the transition probability matrix P for our Markov chain from the N × N matrix A:

1. If a row of A has no 1’s, then replace each element by 1/N. For all other rows

proceed as follows.

2. Divide each 1 in A by the number of 1’s in its row. Thus, if there is a row with

three 1’s, then each of them is replaced by 1/3.

3. Multiply the resulting matrix by 1 − α.

4. Add α/N to every entry of the resulting matrix, to obtain P.
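The four steps above translate directly into code. The following is a minimal Python sketch (using numpy; the function name transition_matrix is our own, not a library call):

import numpy as np

def transition_matrix(A: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Derive the transition probability matrix P from the adjacency matrix A,
    following the four steps listed above."""
    A = A.astype(float)
    N = A.shape[0]
    P = np.zeros_like(A)
    for i in range(N):
        row_sum = A[i].sum()
        if row_sum == 0:
            P[i] = 1.0 / N            # 1. a row with no 1's: replace each element by 1/N
        else:
            P[i] = A[i] / row_sum     # 2. divide each 1 by the number of 1's in its row
    P = (1.0 - alpha) * P             # 3. multiply the resulting matrix by 1 - alpha
    P = P + alpha / N                 # 4. add alpha/N to every entry to obtain P
    return P

# Example: three pages where page 1 links to page 2, page 2 links to pages 1 and 3,
# and page 3 links to page 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
print(transition_matrix(A, alpha=0.5))   # every row of the result sums to 1

With α = 0.5 this produces the same teleportation matrix that appears in the power-iteration example later in this section.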


We can depict the probability distribution of the surfer’s position at any time by a

probability vector x. At t = 0 the surfer may begin at a state whose corresponding entry

in x is 1 while all others are zero. By definition, the surfer’s distribution at t = 1 is given

by the probability vector xP; at t = 2 by (xP)P = xP^2, and so on. We can thus

compute the surfer’s distribution over the states at any time, given only the initial

distribution and the transition probability matrix P.
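As a small illustration, the sketch below steps the three-state A/B/C chain described above through a few time steps (numpy-based, written only for this example):

import numpy as np

# Transition matrix of the three-state chain above; rows/columns ordered A, B, C.
P = np.array([
    [0.0, 0.5, 0.5],   # from A: equal probability to B or C
    [1.0, 0.0, 0.0],   # from B: back to A with probability 1
    [1.0, 0.0, 0.0],   # from C: back to A with probability 1
])
assert np.allclose(P.sum(axis=1), 1.0)   # each row sums to 1 (stochastic matrix)

x = np.array([1.0, 0.0, 0.0])            # surfer starts in state A
for t in range(1, 5):
    x = x @ P                            # distribution at time t is x P^t
    print(f"t={t}: {x}")

# Note: this particular chain is periodic (it alternates between A and {B, C}),
# so x P^t oscillates rather than converging. The teleport operation used in
# PageRank makes the chain ergodic, which guarantees a steady state.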

If a Markov chain is allowed to run for many time steps, each state is visited at a

(different) frequency that depends on the structure of the Markov chain. In our running

analogy, the surfer visits certain web pages (say, popular news home pages) more

often than other pages. We now make this intuition precise, establishing conditions

under which the visit frequency converges to a fixed, steady-state quantity.

Following this, we set the Page-Rank of each node v to this steady-state visit frequency

and show how it can be computed.

Definition: A Markov chain is said to be ergodic if there exists a positive integer T0

such that for all pairs of states i, j in the Markov chain, if it is started at time 0 in state i

then for all t > T0, the probability of being in state j at time t is greater than 0.

For a Markov chain to be ergodic, two technical conditions are required of its states

and the non-zero transition probabilities; these conditions are known as irreducibility

and aperiodicity. Informally, the first ensures that there is a sequence of transitions of

non-zero probability from any state to any other, while the latter ensures that the states

are not partitioned into sets such that all state transitions occur cyclically from one set to

another.

The PageRank computation:

The left eigenvectors of the transition probability matrix P are N-vectors π such that πP = λπ.

The N entries in the principal eigenvector π are the steady-state probabilities of the

random walk with teleporting, and thus the PageRank values for the corresponding web

pages.


If π is the probability distribution of the surfer across the web pages, he remains in the

steady-state distribution π. Given that π is the steady-state distribution, we have that

πP = 1π, so 1 is an eigenvalue of P. Thus if we were to compute the principal left

eigenvector of the matrix P—the one with eigenvalue 1—we would have computed the

PageRank values.

There are many algorithms available for computing left eigenvectors; we give here a

rather elementary method, sometimes known as power iteration. If x is the initial

distribution over the states, then the distribution at time t is xP^t. As t grows large, we

would expect that the distribution xP^t is very similar to the distribution xP^(t+1), since

for large t we would expect the Markov chain to attain its steady state. This is

independent of the initial distribution ~x. The power iteration method simulates the

surfer’s walk: begin at a state and run the walk for a large number of steps t, keeping

track of the visit frequencies for each of the states. After a large number of steps t,

these frequencies “settle down” so that the variation in the computed frequencies is

below some predetermined threshold. We declare these tabulated frequencies to be the

PageRank values. As an example, consider a small web graph with three nodes 1, 2 and 3, where node 1 links only to node 2, node 2 links to nodes 1 and 3, and node 3 links only to node 2, and take α = 0.5. The transition probability matrix of the surfer’s walk with teleportation is then

P = ( 1/6    2/3    1/6
      5/12   1/6    5/12
      1/6    2/3    1/6 )

Imagine that the surfer starts in state 1, corresponding to the initial probability distribution vector x0 = (1 0 0). Then, after one step the distribution is x0P = (1/6 2/3 1/6), and continuing the iteration the distribution settles to the steady state π = (5/18 4/9 5/18) ≈ (0.28 0.44 0.28); these are the PageRank values of the three pages.
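A minimal power-iteration sketch for this three-node example (the helper name pagerank_power_iteration is our own choice, not a library function):

import numpy as np

def pagerank_power_iteration(P: np.ndarray, x0: np.ndarray,
                             tol: float = 1e-8, max_iter: int = 1000) -> np.ndarray:
    """Repeatedly apply x <- x P until the distribution stops changing."""
    x = x0.astype(float)
    for _ in range(max_iter):
        x_next = x @ P
        if np.abs(x_next - x).sum() < tol:   # variation below the threshold
            return x_next
        x = x_next
    return x

# Transition matrix with teleportation (alpha = 0.5) for the three-node graph above.
P = np.array([[1/6,  2/3,  1/6],
              [5/12, 1/6,  5/12],
              [1/6,  2/3,  1/6]])
x0 = np.array([1.0, 0.0, 0.0])      # surfer starts in state 1
pi = pagerank_power_iteration(P, x0)
print(pi)   # converges to roughly (0.278, 0.444, 0.278), i.e. (5/18, 4/9, 5/18)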

Cross-Lingual Retrieval (cross-language search and multilingual search):

By translating queries for one or more monolingual search engines covering

different languages, it is possible to do cross-language search. A cross-language


search engine receives a query in one language (e.g., English) and retrieves documents in a variety of other languages (e.g., French and Chinese). Users

typically will not be familiar with a wide range of languages, so a cross language search

system must do the query translation automatically. Since the system also retrieves

documents in multiple languages, some systems also translate these for the user.

Fig: Cross Language Search

The most obvious approach to automatic translation would be to use a large bilingual dictionary that contained the translation of a word in the source language

(e.g., English) to the target language (e.g., French). Sentences would then be translated

by looking up each word in the dictionary. The main issue is how to deal with ambiguity, since many words have multiple translations. Simple dictionary-based

translations are generally poor, but a number of techniques have been developed, such

as query expansion, that reduce ambiguity and raise the ranking effectiveness of

a cross-language system to a level comparable to that of a monolingual system.
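As a rough illustration of the dictionary look-up approach, a word-by-word query translation might look like the sketch below; the tiny bilingual dictionary is invented purely for the example.

# A toy bilingual dictionary (English -> French); the entries are illustrative only.
bilingual_dict = {
    "fisherman": ["pecheur"],
    "france": ["france"],
    "bank": ["banque", "rive"],   # ambiguous: financial bank vs. river bank
}

def translate_query(query: str) -> list[str]:
    """Translate an English query word by word, keeping every candidate
    translation; ambiguity simply multiplies the candidates."""
    translated = []
    for word in query.lower().split():
        translated.extend(bilingual_dict.get(word, [word]))  # keep unknown words as-is
    return translated

print(translate_query("fisherman france"))   # ['pecheur', 'france']
print(translate_query("bank of france"))     # ['banque', 'rive', 'of', 'france']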

The most effective and general methods for automatic translation are based on

statistical machine translation models (Manning & Schütze, 1999). When translating

a document or a web page, in contrast to a query, not only is ambiguity a problem, but

the translated sentences should also be grammatically correct. Words can change

order, disappear, or become multiple words when a sentence is translated. Statistical


translation models represent each of these changes with a probability. This means that

the model describes the probability that a word is translated into another word, the

probability that words change order, and the probability that words disappear or become

multiple words. These probabilities are used to calculate the most likely translation for a

sentence

Although a model that is based on word-to-word translation probabilities has

some similarities to a dictionary-based approach, if the translation probabilities are

accurate, they can make a large difference to the quality of the translation. Unusual

translations for an ambiguous word can then be easily distinguished from more typical

translations. More recent versions of these models, called phrase-based translation models, further improve the use of context in the translation by calculating the probabilities of translating sequences of words, rather than just individual words.

The probabilities in statistical machine translation models are estimated primarily

by using parallel corpora. These are collections of documents in one language

together with the translations into one or more other languages. The corpora are

obtained primarily from government organizations (such as the Govt of India), news

organizations, and by mining the Web, since there are hundreds of thousands of

translated pages. The sentences in the parallel corpora are aligned either manually or

automatically, which means that sentences are paired with their translations. The

aligned sentences are then used for training the translation model.
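The sketch below is a deliberately crude illustration of how sentence-aligned pairs can yield word translation estimates from co-occurrence counts; real statistical translation models are trained with far more sophisticated procedures (e.g. EM), and the miniature corpus here is invented.

from collections import Counter, defaultdict

# A tiny sentence-aligned parallel corpus (English, French); the pairs are invented.
aligned_pairs = [
    ("the fisherman", "le pecheur"),
    ("a fisherman", "un pecheur"),
    ("the house", "la maison"),
    ("the fisherman of france", "le pecheur de france"),
]

# Count how often each (English word, French word) pair co-occurs in aligned sentences.
cooc = defaultdict(Counter)
for en, fr in aligned_pairs:
    for e in en.split():
        for f in fr.split():
            cooc[e][f] += 1

# Crude translation probabilities P(f | e) from relative co-occurrence counts.
def translation_probs(e: str) -> dict[str, float]:
    total = sum(cooc[e].values())
    return {f: c / total for f, c in cooc[e].items()}

print(translation_probs("fisherman"))  # 'pecheur' gets the highest estimate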

Special attention has to be paid to the translation of unusual words, especially

proper nouns such as people’s names. For these words in particular, the Web is a rich

resource. Automatic transliteration techniques are also used to address the problem of people’s names. Proper names are not usually translated into another

language, but instead are transliterated, meaning that the name is written in the characters of another language according to certain rules or based on similar sounds. This can lead to many alternative spellings for the same name.

For example, the Libyan leader Muammar Qaddafi’s name can be found in many

different transliterated variants on web pages, such as Qathafi, Kaddafi, Qadafi, Gadafi,


Gaddafi, Kathafi, Kadhafi, Qadhafi, Qazzafi, Kazafi, Qaddafy, Qadafy, Quadhaffi,

Gadhdhafi, al-Qaddafi, Al-Qaddafi, and Al Qaddafi. Similarly, there are a number of

variants of “Bill Clinton” on Arabic web pages. Although they are not generally regarded

as cross-language search systems, web search engines can often retrieve pages in a

variety of languages. For that reason, many search engines have made translation

available on the result pages. The following figure shows an example of a page

retrieved for the query “pecheur france”, where the translation option is shown as a

hyperlink. Clicking on this link produces a translation of the page (not the snippet),

which makes it clear that the page contains links to archives of the sports magazine Le pêcheur de France, which is translated as “The fisherman of France”. Although the

translation provided is not perfect, it typically provides enough information for someone

to understand the contents and relevance of the page. These translations are generated

automatically using machine translation techniques, since any human intervention

would be prohibitively expensive.

Le pêcheur de France archives @ peche poissons - [ Translate this page ]

Le pêcheur de France Les média Revues de pêche Revue de presse Archives de la

revue

Le pêcheur de France janvier 2003 n°234 Le pêcheur de France mars 2003 ...

Figure. A French web page in the results list for the query “pecheur france”

Two Methods are used to solve the problem:

1. Query Translation

2. Document Translation

SNIPPET GENERATION, SUMMARIZATION:

Snippet: a short summary of the document, which is designed so as to allow the user to decide its relevance. Typically, the snippet consists of the document title and a short summary, which is automatically extracted. The two

basic kinds of summaries are static and dynamic.

STATIC SUMMARY: which is always the same regardless of the query.


DYNAMIC SUMMARY (or query-dependent): which is customized according to the

user’s information need as deduced from a query.

Dynamic summaries attempt to explain why a particular document was retrieved

for the query at hand. A static summary is generally composed of a

subset of the document, metadata associated with the document, or both. The simplest form

of summary takes the first two sentences or 50 words of a document, or extracts

particular zones of a document, such as the title and author. Instead of zones of a

document, the summary can instead use metadata associated with the document.

This may be an alternative way to provide an author or date, or may include

elements which are designed to give a summary, such as the description metadata

which can appear in the meta element of a web HTML page. This summary is typically

extracted and cached at indexing time, in such a way that it can be retrieved and

presented quickly when displaying search results, whereas having to access the actual

document content might be a relatively expensive operation.

KEYWORD-IN-CONTEXT (Dynamic summaries):

A dynamic summary is usually built from short windows of text extracted from the document. Usually these windows contain one or several of the query terms, and so

are often referred to as keyword-in-context (KWIC) snippets, though sometimes

they may still be pieces of the text such as the title that are selected for their query

independent information value just as in the case of static summarization. Dynamic

summaries are generated in conjunction with scoring.

If the query is found as a phrase, occurrences of the phrase in the document will

be shown as the summary. If not, windows within the document that contain multiple

query terms will be selected. Commonly these windows may just stretch some number

of words to the left and right of the query terms. This is a place where NLP techniques

can usefully be employed: users prefer snippets that read well because they contain

complete phrases.

Dynamic summaries are generally regarded as greatly improving the usability of

IR systems, but they present a complication for IR system design. A dynamic summary

cannot be precomputed, but, on the other hand, if a system has only a positional index, then it cannot easily reconstruct the context surrounding search engine hits in


order to generate such a dynamic summary. This is one reason for using static

summaries. The standard solution to this in a world of large and cheap disk drives is to

locally cache all the documents at index time.

Then, a system can simply scan a document which is about to appear in a

displayed results list to find snippets containing the query words. Beyond simple access

to the text, producing a good KWIC snippet requires some care.

Given a variety of keyword occurrences in a document, the goal is to choose

fragments which are:

(i) maximally informative about the discussion of those terms in the document,

(ii) self-contained enough to be easy to read, and

(iii) short enough to fit within the normally strict constraints on the space available

for summaries.
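A minimal sketch of this kind of snippet selection, assuming a cached document prefix and a simple word-window policy (the function name and parameter values are our own illustrative choices):

import re

def kwic_snippet(document: str, query_terms: list[str],
                 window: int = 20, prefix_chars: int = 10_000) -> str:
    """Pick a window of words around the first query-term hit in the cached
    prefix of the document, as a crude keyword-in-context snippet."""
    words = re.findall(r"\w+", document[:prefix_chars])   # only the cached prefix is scanned
    terms = {t.lower() for t in query_terms}
    for i, w in enumerate(words):
        if w.lower() in terms:
            start = max(0, i - window // 2)
            return "... " + " ".join(words[start:start + window]) + " ..."
    return " ".join(words[:window])   # fall back to the start of the document

doc = ("Tropical fish are popular aquarium pets. One of the U.K.'s leading "
       "suppliers of tropical, coldwater and marine fish offers next day "
       "fish delivery to most addresses.")
print(kwic_snippet(doc, ["delivery", "fish"]))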

Generating snippets must be fast since the system is typically generating many

snippets for each query that it handles. Rather than caching an entire document, it is

common to cache only a generous but fixed size prefix of the document, such as

perhaps 10,000 characters. For most common, short documents, the entire document

is thus cached, but huge amounts of local storage will not be wasted on potentially vast

documents. Summaries of documents whose length exceeds the prefix size will be

based on material in the prefix only, which is in general a useful zone in which to look

for a document summary anyway.

If a document has been updated since it was last processed by a crawler and

indexer, these changes will be neither in the cache nor in the index. In these

circumstances, neither the index nor the summary will accurately reflect the current

contents of the document, but it is the differences between the summary and the actual

document content that will be more glaringly obvious to the end user.

A document summary for a web search typically contains the title and URL of the web page, links to live and cached versions of the page, and, most importantly, a short text summary, or snippet, that is used to convey the content of the page. In addition, most result pages contain advertisements consisting of short

descriptions and links. Query words that occur in the title, URL, snippet, or


advertisements are highlighted to make them easier to identify, usually by displaying

them in a bold font.

Figure 1 gives an example of a document summary from a result page for a web

search. In this case, the snippet consists of two partial sentences. Figure 2 gives more

examples of snippets that are sometimes full sentences, but often text fragments,

extracted from the web page. Some of the snippets do not even contain the query

words.

Tropical Fish

One of the U.K.’s leading suppliers of Tropical, Coldwater, Marine Fish and

Invertebrates plus ...

next day fish delivery service ...

www.tropicalfish.org.uk/tropical_fish.htm  Cached page

Fig 1: Typical document summary for a web search

Fig 2: Snippet Generation


HANDLING “INVISIBLE” WEB:

Not all parts of the Web are easy for a crawler to navigate. Sites that are difficult

for a crawler to find are collectively referred to as the deep Web (also called the hidden

Web or invisible web). Some studies have estimated that the deep Web is over a

hundred times larger than the traditionally indexed Web, although it is very difficult to

measure this accurately. Most sites that are a part of the deep Web fall into three broad

categories:

• Private sites are intentionally private. They may have no incoming links, or may require

you to log in with a valid account before using the rest of the site. These sites generally

want to block access from crawlers, although some news publishers may still want their

content indexed by major search engines.

• Form results are sites that can be reached only after entering some data into a form.

For example, websites selling airline tickets typically ask for trip information on the site’s

entry page. You are shown flight information only after submitting this trip information.

Even though you might want to use a search engine to find flight timetables, most

crawlers will not be able to get through this form to get to the timetable information.

• Scripted pages are pages that use JavaScript, Flash, or another client-side language

in the web page. If a link is not in the raw HTML source of the web page, but is instead

generated by JavaScript code running on the browser, the crawler will need to execute

the JavaScript on the page in order to find the link. Although this is technically possible,

executing JavaScript can slow down the crawler significantly and adds complexity to the

system.

Sometimes people make a distinction between static pages and dynamic pages.

Static pages are files stored on a web server and displayed in a web browser

unmodified, whereas dynamic pages may be the result of code executing on the web

server or the client. Typically it is assumed that static pages are easy to crawl, while

dynamic pages are hard. This is not quite true, however. Many websites have


dynamically generated web pages that are easy to crawl; wikis are a good example of

this. Other websites have static pages that are impossible to crawl because they can be

accessed only through web forms. Web administrators of sites with form results and

scripted pages often want their sites to be indexed, unlike the owners of private sites. Of

these two categories, scripted pages are easiest to deal with. The site owner can

usually modify the pages slightly so that links are generated by code on the server

instead of by code in the browser. The crawler can also run page JavaScript, or perhaps

Flash as well, although these can take a lot of time.

The most difficult problems come with form results. Usually these sites are

repositories of changing data, and the form submits a query to a database system. In

the case where the database contains millions of records, the site would need to expose

millions of links to a search engine’s crawler. Adding a million links to the front page of

such a site is clearly infeasible. Another option is to let the crawler guess what to enter

into forms, but it is difficult to choose good form input. Even with good guesses, this

approach is unlikely to expose all of the hidden data.

PERSONALIZED SEARCH:

A major deficiency of current search tools is their lack of adaptation to the user’s

preferences. Although the quality of search has improved dramatically in the last few

years and as a result user satisfaction has risen, search engines fall short of

understanding an individual user’s need and, accordingly, ranking the results for that

individual. The first ingredient of personalization, namely the collection of personal search data, is already

present, and search engines such as Google have been hard at work to gain our trust

so that they can collect this personal data without raising too many privacy concerns.

We benefit by getting more powerful tools and the search engine benefits from the

increased internet traffic through their site.

When surfers use the search engine, cookies can be used to store their past interaction with the search service, and the inference mechanism can then personalize their query results. For example, if a searcher can be identified as a man,

a query such as “shoes” may be narrowed down to “men’s shoes”.

Two approaches to search engine personalization based on search engine log

data may be useful. In a click-based approach, the user’s query and click pairs are


used for personalization. The idea is simple. When a user repeats queries over time, he

or she will prefer certain pages, that is, those that were more frequently clicked. The

downside of this approach is that if a search engine presents the same old pages to the

user each time a query is repeated it does not encourage the user to discover new

pages. On the other hand, this type of historical information may be quite useful to the

user. This approach can be refined by using content similarity to include similar queries

and web pages in the personalized results.
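A minimal sketch of the click-based idea, assuming the user’s per-query click counts and the engine’s result list are available (the data structures and names below are illustrative only):

from collections import defaultdict

# Past (query, url) click counts for one user; the entries are illustrative.
click_history = defaultdict(int)
click_history[("shoes", "http://example.com/mens-shoes")] = 5
click_history[("shoes", "http://example.com/kids-shoes")] = 1

def personalize(query: str, results: list[str]) -> list[str]:
    """Re-rank the engine's results so that pages this user clicked more
    often for the same query move up; ties keep the original order."""
    return sorted(results, key=lambda url: -click_history[(query, url)])

results = ["http://example.com/kids-shoes",
           "http://example.com/mens-shoes",
           "http://example.com/shoe-history"]
print(personalize("shoes", results))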

In a topic-based approach, a topical ontology is used to identify a user’s

interests. The ontology should include general topics that are of interest to web surfers

such as the top-level topics from the Open Directory. Then a classification technique,

such as naive Bayes, needs to be chosen in order to be able to classify the queries that

users submit and the pages that they visit;

The next step is to identify the user’s preferences based on their searches, and

finally these preferences can be used to personalize their results, for example, by

ranking them according to the learned preferences.

A dynamic and adaptive approach to personalization must be capable of

monitoring the users’ activity over time and to infer their interests and preferences as

their behavior changes over time. To implement dynamic user profiles, machine

learning techniques, such as Bayesian or neural networks, provide a sound basis for

improving the machine’s understanding of the human behind the machine.

Personalization versus Customization:

Customization involves the layout of the user interface, for example, the color

scheme to be used, the content displayed on the personalized web page, and various

other settings.

Personalized Results Tool:

The Personalized Results Tool (PResTo!) is implemented as a plug-in to the

browser rather than being server based. This is a unique feature that bypasses some of

the privacy and security issues, which are becoming increasingly important to users,

since in the case of PResTo!, the ownership of the software and the personal data

generated from searches are in the user’s hands. A client-side approach is also more


efficient for the search engine, since it does not have to manage the user profiles, and

thus scalability will not be an issue.

A downside of the client-side approach from the users’ point of view is that the

profile is less portable, but a partial solution to this problem may be to store the profile

on a local trusted server, which would enable remote access. A downside from the

search engines’ point of view is that a client-side approach can, in principle, be used to

personalize results from any search engine that the user interacts with, using a single

profile applicable to all searching. Personalization proceeds as follows:

Suppose that the user issues a query to his or her favorite search engine. The

personalization plug-in detects this and sends the query results, which have been

returned to the user’s browser, to the personalization engine (on the user’s machine),

which then reranks the results according to the user’s profile and makes its

recommendations to the user in a separate window within the browser, alongside the

results returned by the search engine.

Personalized PageRank:

PageRank values are personalized to the interests of an individual user, or

biased toward a particular topic such as sports or business. The optimization step is of

prime importance because each personalized PageRank vector will need to be

computed separately, and for web search companies such as Google, scalability of their

operation is a crucial ongoing concern. We refer to this special case of personalized

PageRank when the surfer is always teleported to a single page, as the individual

PageRank for that page. A more realistic preference may be to jump to a page that the

user has bookmarked or to a page from the user’s history list, with the probability being

proportional to the number of times the user visited the page in the past.

An important result, called the linearity theorem , simplifies the computation of

personalized PageRank vectors. It states that any personalized PageRank vector can

be expressed as a linear combination of individual PageRank vectors. In particular, one

application of this is that the global PageRank vector can be expressed as the average

of the linear combination of all possible individual PageRank vectors, one for each page

in the Web. This can simplify the computation of personalized PageRank vectors by

precomputing individual PageRank vectors and then combining them on demand,


depending on the preferred web pages in a personalization instance. PageRank can

also be computed via a Monte Carlo simulation that samples

many random walks from each web page. The PageRank of a given page is then

computed as the proportion of random walks that end at that page. Looking at it from a

personalized perspective we can compute individual PageRank vectors by looking only

at the samples that start at the single web page being personalized, as suggested by

Fogaras et al. The individual PageRank vectors can then be combined in an arbitrary

way, according to the linearity theorem, to obtain the required personalized PageRank

vector. An interesting variation of PageRank is topic sensitive. This version of

PageRank is biased according to some representative set of topics, based on

categories chosen, say, from the Open Directory. Another variation of PageRank, called

BlockRank, computes local PageRank values on a host basis, and then weights these

local PageRank values according to the global importance of the host.
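A minimal sketch of the linearity theorem in use, assuming the individual PageRank vectors (one per page) have already been precomputed as described; the numbers below are illustrative placeholders, not real PageRank values, and the function name is our own.

import numpy as np

# individual[i] stands in for the PageRank vector obtained when the surfer
# always teleports to page i; the values are placeholders for illustration.
individual = np.array([
    [0.60, 0.25, 0.15],
    [0.20, 0.55, 0.25],
    [0.15, 0.30, 0.55],
])

def personalized_pagerank(preference: np.ndarray) -> np.ndarray:
    """By the linearity theorem, a personalized PageRank vector is a weighted
    combination of individual PageRank vectors, with weights given by the
    user's teleport preference over pages."""
    preference = preference / preference.sum()   # normalize to a probability vector
    return preference @ individual

# A user who visits page 0 twice as often as page 2 and never visits page 1.
print(personalized_pagerank(np.array([2.0, 0.0, 1.0])))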

Outride’s Personalized Search:

Link analysis based on the evaluation of the authority of web sites is biased

against relevance, as determined by individual users. For example, when you submit

the query “java” to Google, you get many pages on the programming language Java,

rather than the place in Indonesia or the well-known coffee from Java. Popularity or

usage-based ranking adds to the link-based approach, by capturing the flavor of the day

and how relevance is changing over time for the user base of the search engine. In both

these approaches, relevance is measured for the population of users and not for the

individual user. Outride set out to build a model of the user, based on the context of the

activity of the user, and individual user characteristics such as prior knowledge and

history of search. The Outride system set out to integrate these features into the user

interface as follows. Once a user (call him Archie) submits his query, its context is determined and the

query is augmented with related terms. After it is processed, it is individualized, based

on demographic information and the past user history. A feature called “Have Seen,

Have Not Seen” allows the user to distinguish between old and new information. The

Outride user interface is integrated into the browser as a side bar that can be opened

and closed much like the favorites and history lists. According to the authors of the

research paper, searches were faster and easier to complete using Outride. It remains


to see what Google will do with this thought-provoking technology. Jeff Heer, who is

acquainted with Outride’s former employees, said in his weblog that “Their technology

was quite impressive, building off a number of PARC innovations, but they were in the

right place at the wrong time”. It is worth mentioning that Google has released a tool

enabling users to search the Web from any application within the Windows operating

system. Another significant feature of the tool is that its query results can be displayed

in a separate window rather than in the user’s browser.

HADOOP & MAP REDUCE:

Hadoop is a framework that allows processing and storing huge data sets.

Basically, Hadoop can be divided into two parts: processing and storage. So,

MapReduce is a programming model which allows you to process huge data stored in

Hadoop. When you install Hadoop in a cluster, you get MapReduce as a service where

you can write programs to perform computations in data in parallel and distributed

fashion.

What is MapReduce?

MapReduce is a processing technique and a programming model for distributed

computing based on Java. The MapReduce algorithm contains two important tasks,

namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The Reduce task then takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the sequence of the

name MapReduce implies, the reduce task is always performed after the map job.

The major advantage of MapReduce is that it is easy to scale data processing

over multiple computing nodes. Under the MapReduce model, the data processing

primitives are called Mappers and Reducers. Decomposing a data processing

application into mappers and reducers is sometimes nontrivial. But, once we write an

application in the MapReduce form, scaling the application to run over hundreds,

thousands, or even tens of thousands of machines in a cluster is merely a configuration

change. This simple scalability is what has attracted many programmers to use the

MapReduce model.


The Algorithm:

Generally, the MapReduce paradigm is based on sending the computation to where the

data resides. A MapReduce program executes in three stages, namely map stage,

shuffle stage, and reduce stage.

Map stage : The map or mapper’s job is to process the input data. Generally the

input data is in the form of file or directory and is stored in the Hadoop file system

(HDFS). The input file is passed to the mapper function line by line. The mapper

processes the data and creates several small chunks of data.

Reduce stage : This stage is the combination of the Shuffle stage and

the Reduce stage. The Reducer’s job is to process the data that comes from the

mapper. After processing, it produces a new set of output, which will be stored in

the HDFS.

During a MapReduce job, Hadoop sends the Map and Reduce tasks to the

appropriate servers in the cluster.

The framework manages all the details of data-passing such as issuing tasks,

verifying task completion, and copying data around the cluster between the

nodes.

Most of the computing takes place on nodes with data on local disks, which reduces

the network traffic.

After completion of the given tasks, the cluster collects and reduces the data to

form an appropriate result, and sends it back to the Hadoop server.

Inputs and Outputs:


The MapReduce framework operates on <key, value> pairs, that is, the

framework views the input to the job as a set of <key, value> pairs and produces a set

of <key, value> pairs as the output of the job, conceivably of different types.

The key and value classes have to be serializable by the framework

and hence need to implement the Writable interface. Additionally, the key classes have

to implement the WritableComparable interface to facilitate sorting by the framework.

Input and Output types of a MapReduce job: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).

          Input              Output
Map       <k1, v1>           list(<k2, v2>)
Reduce    <k2, list(v2)>     list(<k3, v3>)
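As an illustration, the classic word-count job can be written as a mapper and a reducer. The sketch below uses Python in the Hadoop Streaming style (reading from standard input and writing tab-separated key/value pairs to standard output); the file names are our own and the exact submission command depends on the installation.

#!/usr/bin/env python3
# mapper.py -- emit <word, 1> for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- sum the counts for each word; Hadoop Streaming delivers the
# mapper output sorted by key, so identical words arrive on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Such scripts are typically submitted with the Hadoop Streaming jar, passing the input and output HDFS paths together with -mapper mapper.py and -reducer reducer.py; the exact jar path and options vary between Hadoop versions.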

Terminology:

PayLoad - Applications implement the Map and the Reduce functions, and form

the core of the job.

Mapper - Mapper maps the input key/value pairs to a set of intermediate

key/value pair.

NamedNode - Node that manages the Hadoop Distributed File System (HDFS).

DataNode - Node where data is presented in advance before any processing

takes place.

MasterNode - Node where JobTracker runs and which accepts job requests

from clients.

SlaveNode - Node where the Map and Reduce programs run.

JobTracker - Schedules jobs and tracks the assigned jobs on the TaskTracker.

Task Tracker - Tracks the task and reports status to JobTracker.

Job - A program is an execution of a Mapper and Reducer across a dataset.

Task - An execution of a Mapper or a Reducer on a slice of data.

Task Attempt - A particular instance of an attempt to execute a task on a

SlaveNode.


COLLABORATIVE FILTERING:

“The process in which the purchaser of a product or service tells friends, family,

neighbors, and associates about its virtues, especially when this happens in advance of

media advertising.”

As an example, suppose that I read a book, I like it, and recommend it to my

friends. Those of my friends who have a similar taste in books to mine may decide to

read the book and then recommend it to their friends. This is CF at work, through the

power of social networking.

In an e-commerce site, this process may be automated as follows. When I buy a book, this in itself is an implicit recommendation, but the site could ask me for an explicit rating of the book, say on a scale of 1 to 10. When my friend logs

onto the site, the CF system will be able to deduce that his taste in books is similar to

mine, since we have purchased similar items in the past. The system will also notice

that he has not yet bought the book that I have rated highly, and then recommend this

book to my friend. This is the essence of collaborative filtering. In practice, the system

will collect as many recommendations as it can and score them according to their

overall popularity before presenting the top recommendations to the user.

User-Based Collaborative Filtering:

Consider the user–item matrix shown in the table below (the items shown are data mining, search engines, databases and XML):

              data mining   search engines   databases   XML   · · ·
Alex               1                              5         4   · · ·
George             2               3              4             · · ·
Mark               4               5                        2   · · ·
Peter                                             4         5   · · ·

Each row represents a user, and each column represents an item. A number in the ith

row and jth column is the rating that user i assigned to item j; an empty cell indicates


that the user did not rate that item. The ellipsis (· · ·) at the end of each row indicates

that we have shown only a small fraction of the items. In a typical e-commerce scenario,

a user would normally rate (or purchase) only a few products, say 30, out of the millions

that may be available, so that the user–item matrix is very sparse.

This sparsity problem has a negative effect on recommendation systems, since

there may not be enough data for the system to make a reliable prediction. In order to find like-minded users, that is, users with similar tastes, there needs to be sufficient overlap in their buying habits (in case of an e-commerce site) or page

views (in case of an e-learning or e-content site), for the system to have a statistically

significant assessment of their similarity. Another related problem is the first-rater problem: when an item has not been rated yet, how can it be

recommended? An e-commerce site may still want to promote items having no rating,

and in this case a content-based approach is necessary. The ratings for an item can

be collected explicitly or implicitly. Explicit rating requires the user to give feedback to

the system on the quality of the item; it is normally a number between 1 and 10, with low numbers providing negative feedback and high numbers providing positive feedback. Implicit feedback is collected without any special user intervention; the

system observes the user behavior and constructs a rating for the item based on the information it has. The best indicator of positive feedback in an e-commerce

setting is when users buy the item; in other settings, such as e-learning, the amount of

time users spend and/or the number of mouse operations they carry out when viewing

the content is normally used to measure their interest in the content.

A CF algorithm takes the user–item matrix as input and produces user

recommendations for the active user as output. For each user, an item vector is

constructed, where 0 implies that the item is unrated. For example, the item vector for

Alex is <1, 0, 5, 4>, for George it is <2, 3, 4, 0>, for Mark it is <4, 5, 0, 2>, and for Peter

it is <0, 0, 4, 5>. Assume that Alex is the active user.

One measure of similarity that can be computed between two vectors is their dot product. This is called vector similarity and is computed by

multiplying the ratings in the two vectors item by item and summing up the results. (The

result may be normalized so that it is a number between 0 and 1.) For example, the


vector similarity between Alex and Peter is 40, between Alex and George it is 22 and

between Alex and Mark it is 12.
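These dot products can be checked with a few lines of Python; the rating vectors are copied from the text above, with 0 standing for an unrated item.

# Item vectors as given above; the order of items follows the user-item table.
ratings = {
    "Alex":   [1, 0, 5, 4],
    "George": [2, 3, 4, 0],
    "Mark":   [4, 5, 0, 2],
    "Peter":  [0, 0, 4, 5],
}

def vector_similarity(u: list[int], v: list[int]) -> int:
    """Dot product of two rating vectors (unnormalized vector similarity)."""
    return sum(a * b for a, b in zip(u, v))

for other in ["Peter", "George", "Mark"]:
    print(other, vector_similarity(ratings["Alex"], ratings[other]))
# Prints 40 for Peter, 22 for George and 12 for Mark, matching the text.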

Another measure of similarity between two rows in the user–item matrix is to

compute the Pearson correlation between them, taking into account only the overlapping nonzero items; that is, items that were rated by both users. Correlation

measures only linear relationships between users, giving a number between −1 and 1;

more complex nonlinear relationships cannot be measured with this method. Both these

similarity measures suffer from problems related to the sparsity of the user–item matrix.

First, the similarity may be based on a few observations and therefore may not be accurate. In the extreme case of only two items in common, the Pearson correlation will

always return either 1 or −1. The second problem is the case when there is no overlap in the users’ nonzero-rated items.

In this case, both approaches cannot detect any similarity and a content-based

approach must be used instead. The users who have positive similarity to the active user are called its neighbors. In the next step of the CF process, the predicted score

for the active user on an item he or she has not rated is computed using the k-nearest neighbors to the active users; that is, the k users who are most similar to the active

user.

More specifically, the predicted score is computed by adding to the active user’s

average score the weighted average of the deviations of the k nearest neighbors’ ratings from

their own average ratings; the weight of each neighbor is given according to his or her

similarity to the active user.

The predicted rating for search engines for Alex is computed as follows. The

nearest neighbors to Alex who have rated search engines are George and Mark.

George’s average rating is 3 and Mark’s is 3.33. The deviation of George’s average

rating from his score for search engines is zero, while the deviation from Mark’s score is

5 − 3.33 = 1.67. Weighting this deviation by Mark’s similarity and dividing by the sum of

similarities of the nearest neighbors, 22 + 12 = 34, we get 1.67(12/34) = 0.59. Finally,

adding Alex’s average, we get the prediction of 3.33 + 0.59 = 3.92 for the item search

engines.


We note that when the ratings are binary, that is, 0 for no rating and 1 for a

positive rating, then the average rating of rated items is always 1, and so the deviation

of a rated item from the average will always be 0. In this case, the predicted rating for

an item the active user did not see will always be 1, independent of the weighting of its

neighbors, as long as there is at least one other user having positive similarity to the

active user.

To summarize, the user-based CF method has the following steps:

1. users rate items either explicitly or implicitly;

2. similarity between like-minded users is computed;

3. predictions are made for items that the active user has not rated, and the

nearest neighbors’ ratings are used for scoring the recommendations.

The formal statement of the prediction made by user-based CF for the rating of a new

item by the active user is presented in the equation below, where

1. pa,i is the prediction for the active user, a, for item, i ;

2. k is the number of nearest neighbors of a used for prediction;

3. wa,u is the similarity between a and a neighbor, u of a;

4. ru,i is the rating that user u gave to item i, and ra is the average rating of a.

User-Based CF:

pa,i = r̄a + ( Σu wa,u (ru,i − r̄u) ) / Σu |wa,u|

where the sums range over the k nearest neighbors u of the active user a, and r̄a and r̄u denote the average ratings of a and of neighbor u.
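A minimal Python sketch of this prediction formula (the function and argument names are our own; it simply evaluates the expression above):

def predict_user_based(active_avg: float,
                       neighbours: list[tuple[float, float, float]]) -> float:
    """Prediction p_{a,i} from the formula above.

    Each neighbour is a triple (w_au, r_ui, r_u_avg): the similarity to the
    active user, the neighbour's rating of item i, and the neighbour's
    average rating."""
    numerator = sum(w * (r - avg) for w, r, avg in neighbours)
    denominator = sum(abs(w) for w, _, _ in neighbours)
    if denominator == 0:
        return active_avg                 # no usable neighbours: fall back to the average
    return active_avg + numerator / denominator

# Reproducing the worked example above: Alex's average is 3.33; George has
# similarity 22, rating 3, average 3; Mark has similarity 12, rating 5, average 3.33.
print(predict_user_based(3.33, [(22, 3, 3.0), (12, 5, 3.33)]))   # about 3.92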

Item-Based Collaborative Filtering:

Item-to-item recommendation systems try to match similar items that have been

co-rated by different users, rather than similar users or customers that have overlapping

interests in terms of their rated items. With regards to the user–item matrix, item-to-item

CF looks at column similarity rather than row similarity, and, as in user-based methods,

vector similarity can be used. For the matrix shown in Table, the vector similarity

between data mining and search engines is 26, between data mining and databases it

is 13, and between data mining and XML it is 12.


In order to predict a rating, p_{a,i}, for the active user, a, for an item i, all items, say j, that are similar to i and were rated by a are taken into account. For each such j, the similarity between items i and j, denoted by s_{i,j}, is computed and then weighted by the rating, r_{a,j}, that a gave to j. These values are summed and normalized to give the prediction. The formal statement for the prediction made by item-based CF for the rating of a new item by the active user is presented in the equation below.

Item-Based CF:

p_{a,i} = \frac{\sum_{j} s_{i,j} \, r_{a,j}}{\sum_{j} |s_{i,j}|}, where j ranges over the items similar to i that were rated by a.

In item-to-item algorithms, the number of items to be recommended is often

limited by a constant, say n, so that only the top-n predicted ratings of items similar to

the items rated by the active user are returned. Experiments comparing the item-to-item

algorithm to the user-based algorithm, described above, have shown consistently that

the item-to-item algorithm is not only much faster but also produces better quality

predictions.

The predicted rating for data mining for Peter is computed as follows. The

normalized weight of the similarity between data mining and databases is 13/(13 + 12) = 13/25 = 0.52, and between data mining and XML it is 12/25 = 0.48.

Adding up these weights multiplied by Peter’s ratings gives a predicted rating of

0.52 × 4 + 0.48 × 5 = 4.48 for data mining.
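A corresponding sketch of the item-based prediction, using the item similarities quoted in the text and Peter’s ratings of 4 for databases and 5 for XML from the worked example (the data structure itself is illustrative):

# Item-based CF prediction: weighted average of the active user's own ratings
# on the items similar to the target item.

def predict_item_based(item_similarity, user_ratings, item):
    """p_{a,i} = sum_j s_{i,j} * r_{a,j} / sum_j |s_{i,j}| over items j rated by the user."""
    rated_similar = [(j, s) for j, s in item_similarity[item].items() if j in user_ratings]
    if not rated_similar:
        return None                              # nothing to base a prediction on
    num = sum(s * user_ratings[j] for j, s in rated_similar)
    den = sum(abs(s) for _, s in rated_similar)
    return num / den

# Similarities quoted in the text for the item "data mining".
item_similarity = {"data mining": {"search engines": 26, "databases": 13, "XML": 12}}
peter = {"databases": 4, "XML": 5}               # Peter has not rated search engines
print(round(predict_item_based(item_similarity, peter, "data mining"), 2))  # 4.48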

Model-Based Collaborative Filtering:

Apart from the algorithms we have presented, there have been several other proposals, notably model-based methods, which use machine learning techniques to build a statistical model of the user–item matrix that is then used to make predictions. One such technique trains a neural network for each user, which learns to predict the user’s rating for a new item. Another technique builds association rules such as “90% of users who like items i and j also like item k, and 30% of all users like all these items.”

The rules are generally of the form X ⇒ Y, where X is a set of items and Y is another item, as in user-based algorithms. In this case, the rule is {i, j} ⇒ {k}. The 30%


in the rule refers to its support; that is, out of all the users in the user–item matrix, 30%

like all three items (this includes the items in both X and Y ). The 90% refers to the

confidence of the rule; that is, it is the proportion of users who like all three items (the items in X and Y together) out of the proportion of users who like i and j (the items in X). For prediction purposes, we are interested in rules

such that all the items in the left-hand side of these rules were rated by the active user

but the item on their right-hand side was not. Setting the support and confidence to the

minimum desired levels, the rules can be ranked according to their confidence, for those

whose support is above the desired minimum.
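As a small illustration of support and confidence, the sketch below computes both quantities for a rule {i, j} ⇒ {k} over a toy set of binary “liked” profiles (the data here is made up, so the resulting percentages differ from the 30%/90% used in the text):

# Support and confidence of an association rule X => Y over binary "liked" data.
# Each basket is the set of items one user liked.

def support(baskets, items):
    """Fraction of all users who like every item in `items`."""
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(baskets, lhs, rhs):
    """support(lhs union rhs) / support(lhs)."""
    return support(baskets, lhs | rhs) / support(baskets, lhs)

baskets = [{"i", "j", "k"}, {"i", "j", "k"}, {"i", "j", "k"}, {"i", "j"},
           {"i"}, {"j", "k"}, {"k"}, {"m"}, {"i", "m"}, {"j"}]
lhs, rhs = {"i", "j"}, {"k"}
print(support(baskets, lhs | rhs))               # 0.3  -> 30% support
print(round(confidence(baskets, lhs, rhs), 2))   # 0.75 -> 75% confidence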

Yet another technique uses the naive Bayes classifier. The basic idea is as

follows, with the user–item matrix being the input. For the purpose of this algorithm, we

consider items to be rated as “liked” or “disliked,” or to be unrated. The problem is to

compute the probability that an item will be liked or disliked by the active user given

ratings of other users. The naive Bayes assumption states, in this case, that, given the active user’s rating of an item, the probability that one user (other than the active user) likes the item is independent of whether any other user likes it. This allows us to assess the probability that an

item is liked by the active user, given other user ratings, as being proportional to the

product of the probabilities of each user liking an item given that the active user likes an

item.

It remains to compute the probability that a user, say j , likes an item given that

the active user likes an item. This probability measures the similarity between user j and

the active user. For this we make use only of the items that both j and the active user have

rated. Suppose that there are n items, which both user j and the active user rated, and

out of these the active user liked m items. Moreover, suppose that k out of these m items were also liked by user j. Then the probability that j will like an item given that the active

user likes an item is k/m. Thus the estimation of the probability that the active user will

like an item, say i, that user j has liked but the active user has not rated is also k/m.

Multiplying all these probabilities together for all other users that like item i gives us an

estimate of the probability that the active user will like i . Preliminary experiments with

this method have shown it to be more accurate than the standard user-based algorithm.
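The k/m estimate described above can be sketched as follows, restricted to binary liked/unrated/disliked data; the user names and rating sets here are hypothetical:

# Naive Bayes style scoring: the active user's probability of liking an item is taken
# to be proportional to the product of k/m over all other users who liked that item.
# liked[u]: items user u liked; rated[u]: items user u rated (liked or disliked).

def p_likes_given_active(liked, rated, user, active):
    """k/m: of the co-rated items the active user liked (m), the fraction `user` also liked (k)."""
    co_rated = rated[user] & rated[active]
    m_items = co_rated & liked[active]
    if not m_items:
        return None                              # no evidence from this user
    k = len(m_items & liked[user])
    return k / len(m_items)

def score_item(liked, rated, active, item):
    """Unnormalized probability that the active user will like `item`."""
    score = 1.0
    for user in liked:
        if user == active or item not in liked[user]:
            continue
        p = p_likes_given_active(liked, rated, user, active)
        if p is not None:
            score *= p
    return score

liked = {"active": {"a", "b"}, "u1": {"a", "b", "c"}, "u2": {"b", "c"}}
rated = {"active": {"a", "b", "d"}, "u1": {"a", "b", "c", "d"}, "u2": {"a", "b", "c"}}
print(score_item(liked, rated, "active", "c"))   # 1.0 * 0.5 = 0.5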


CONTENT-BASED RECOMMENDATION SYSTEMS:

In order to deal with the sparsity problem (where few if any users have rated any

items that the active user has rated) and the first-rater problem (where no users have

rated an item), a content-based approach to recommendation needs to be deployed.

Content-based approaches are not collaborative, since they involve only the active user

and the items they interact with.

For content-based systems to work, the system must be able to build a profile of

the user’s interests, which can be done explicitly or implicitly. The user’s interests

include the categories he/she prefers in relation to the application; for example, does

the user prefer fiction to nonfiction books, and pop music to classical music. Once the

system has a user profile, it can check similarity of the item (or content) a user is

viewing to the profile, and according to the degree of similarity create a rating for the

item (or content). This is much like the search process, where, in this case, the profile

acts as a query and the items presented to the user act as the query results. The

higher the item is rated, the higher is its ranking when presented to the user.

Content-based and CF systems can be combined as follows, assuming we wish

to make a prediction for item i , and that we are measuring the similarity between the

active user and another user, say j . The item vectors for the active user and user j are

normally sparse, so we make use of content-based filtering to fill in pseudoratings for

items that were rated by one but not the other user, ensuring that the range of

pseudoratings is the same as for other user ratings.

After this stage, both vectors have a larger overlap, alleviating the sparsity

problem of CF methods. The content-based predictions can be weighted according to

the number of ratings the user had, since their accuracy depends on this number.

The algorithm can now continue much as before, making a prediction for item i using the k-nearest neighbor method.
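A minimal sketch of the hybrid step described above, assuming some content-based predictor is available; content_score below is a hypothetical stand-in for such a predictor, returning values on the same scale as the real ratings:

# Fill in content-based pseudoratings so that two users' rating vectors overlap more,
# then compute their similarity as usual.

def fill_pseudoratings(ratings, user, all_items, content_score):
    """Return the user's ratings plus pseudoratings for the items he or she has not rated."""
    filled = dict(ratings[user])
    for item in all_items:
        if item not in filled:
            filled[item] = content_score(user, item)   # hypothetical content-based predictor
    return filled

def dot_similarity(r1, r2):
    """Vector (dot-product) similarity over the items covered by both vectors."""
    common = set(r1) & set(r2)
    return sum(r1[i] * r2[i] for i in common)

# Usage sketch:
#   a_full = fill_pseudoratings(ratings, "Alex", all_items, content_score)
#   j_full = fill_pseudoratings(ratings, "Mark", all_items, content_score)
#   w = dot_similarity(a_full, j_full)   # then proceed with the k-nearest neighbor method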

Another aspect of CF algorithms is that of serendipity, defined in the Oxford

dictionary as “The occurrence and development of events by chance in a happy or beneficial way.”


Although users like to get recommendations that they are familiar with, they also

like to see novel recommendations that they did not expect but are interesting to them.

It is especially pleasing to get a recommendation of something that the user did not already know about.

CF has an advantage over content-based methods in this respect, since the

recommendations are not based on the content but rather on how it is rated. This factor

can be boosted by giving preference to similar but “nontypical” users, and by not always

recommending the most popular items. For example, every customer of an online

supermarket will buy the standard items such as milk and apples, so there is not much

point in recommending these items.

A notable content-based recommender system for music is Pandora (www.pandora.com), founded by Tim Westergren in 2000 on the back

of the music genome project. The way it works is that each song is represented by a

vector of up to about 400 features, called genes, each assigned a number between 1

and 5 in half integer increments. For example, there are genes for the instrument type,

for the music style, and for the type of lyrics. The song vectors are constructed by

experts, each song taking about 20–30 mins to construct. As of mid-2006, the music

genome library contained over 400,000 songs from 20,000 contemporary artists. In

addition, according to the FAQ on Pandora’s site, about 15,000 new song vectors are

added to the library every month. When a user listens to a song, a list of similar songs

can be constructed using a similarity measure such as standard vector similarity.

Content-based recommender systems inevitably have the effect of reinforcing what the

user listens to, rather than surprising the user in the way that CF systems can. However, one

advantage of Pandora’s approach is that its listeners have access to music in the long

tail, as the experts can construct vectors for less popular songs, for example, very new

songs by musicians who may not be well known, or old songs that have fallen out of fashion. On

the other hand, this approach does not scale to the degree that, say, CF does, owing to the time-consuming human effort involved in constructing the song vectors. In order to tune its

recommendations, Pandora also collects user ratings to allow its algorithms to adjust

the feature weights and personalize future suggestions. Another interesting content-

based approach that is proving to be competitive is to analyze the signal waveform of

songs and to make automated recommendations based on musical similarity.
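The “standard vector similarity” mentioned above can be illustrated with cosine similarity over two gene vectors; the gene values below are made up purely for illustration:

# Cosine similarity between two song "gene" vectors (values between 1 and 5).
import math

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm

song_a = [3.0, 4.5, 2.0, 5.0]   # e.g. genes for instrument, style, lyrics, tempo
song_b = [3.5, 4.0, 2.5, 4.5]
print(round(cosine(song_a, song_b), 3))   # close to 1.0 for similar songs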


Evaluation of Collaborative Filtering Systems:

The most common metric used to measure the distance between the predicted

and true ratings is the mean absolute error (MAE). This is simply the sum of the

absolute values of the differences between the predicted and true ratings divided by the

number of predictions made. The MAE is less appropriate when we wish the accuracy of the top-rated items to be higher than that of the low-rated items, or when we are only interested in a binary rating; that is, whether the item is “good” or “bad.”
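The MAE computation itself is straightforward; a short sketch over hypothetical (predicted, true) rating pairs:

# Mean absolute error over (predicted, true) rating pairs.
def mae(pairs):
    return sum(abs(predicted - true) for predicted, true in pairs) / len(pairs)

print(round(mae([(3.92, 4), (4.48, 5), (2.5, 2)]), 2))   # average absolute deviation, here 0.37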

Scalability of Collaborative Filtering Systems:

In the first stage, the user–item matrix is preprocessed offline into an item-to-item

matrix. This offline stage, which is computationally intensive, calculates a similarity

measure between co-rated items as in item-to-item recommendation systems. The

computation, although extremely time intensive, is manageable since the user–item

matrix is sparse. However, it can be made more efficient for very popular items by

sampling users who have rated these items. It is also possible to discard users with very

few rated items, and to discard extremely popular or unpopular items.

In the second stage, the item-to-item matrix output from the first stage is used to deliver recommendations for the active user in real time, via a computation that is independent of the size of the original user–item matrix and depends only on the number of items the active user has rated.
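A sketch of the offline stage, building an item-to-item similarity table from a sparse user–item matrix; only pairs of items that have actually been co-rated by some user are ever touched, which is what keeps the computation manageable:

# Offline stage: accumulate dot-product similarities between co-rated items.
from collections import defaultdict
from itertools import combinations

def build_item_to_item(ratings):
    """ratings: {user: {item: score}} -> {(item_i, item_j): similarity}."""
    sim = defaultdict(float)
    for user_ratings in ratings.values():
        # Each user contributes only to the pairs of items he or she has co-rated.
        for i, j in combinations(sorted(user_ratings), 2):
            sim[(i, j)] += user_ratings[i] * user_ratings[j]
    return dict(sim)

# The online stage then consults only the rows for the items the active user has rated.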

Question Answering:

The task of question answering involves providing a specific answer to a

user’s query, rather than a ranked list of documents. This task has a long history in

the fields of natural language processing and artificial intelligence. Early question

answering systems relied on detailed representations in logic of small, very specific

domains such as baseball, lunar rocks, or toy blocks. More recently, the focus has

shifted to an information retrieval perspective where the task involves identifying or

extracting answers found in large corpora of text.


The figure above shows the typical components of a question answering system that

retrieves answers from a text corpus. The range of questions that is handled by such a

system is usually limited to fact-based questions with simple, short answers, such as

who, where, and when questions that have people’s names, organization names,

places, and dates as answers. The following questions are a sample from the TREC

question answering (QA) track:

Who invented the paper clip?

Where is the Valley of the Kings?

When was the last major eruption of Mt. St. Helens?

There are, of course, other types of fact-based questions that could be asked, and they

can be asked in many different ways. The task of the question analysis and

classification component of the system is to classify a question by the type of answer

that is expected. For the TREC QA questions, one classification that is frequently used

has 31 different major categories, many of which correspond to named entities that can

be automatically identified in text. Following Table gives an example of a TREC

question for each of these categories. Question classification is a moderately difficult

task, given the large variation in question formats. The question word what, for example,

can be used for many different types of questions.

The information derived from question analysis and classification is used by the

answer selection component to identify answers in candidate text passages, which are

usually sentences. The candidate text passages are provided by the passage retrieval

component based on a query generated from the question. Text passages are retrieved

from a specific corpus or the Web. In TREC QA experiments, candidate answer

passages were retrieved from TREC news corpora, and the Web was often used as an

additional resource. The passage retrieval component of many question answering systems simply finds passages containing all the non-stopwords in the question.

In general, however, passage retrieval is similar to other types of search, in that

features associated with good passages can be combined to produce effective

rankings. Many of these features will be based on the question analysis. Text passages

containing named entities of the type associated with the question category as well as

all the important question words should obviously be ranked higher.


For example, with the question “where is the valley of the kings”, sentences containing

text tagged as a location and the words “valley” and “kings” would be preferred. Some

systems identify text patterns associated with likely answers for the question category,

using either text mining techniques with the Web or predefined rules. Patterns such as

<question-location> in <location>, where question-location is “valley of the kings” in this case, may often be found in answer passages. The presence of such a pattern should

improve the ranking of a text passage.
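A simple sketch of this kind of passage scoring, combining a question-term overlap feature with an answer-type feature; the stopword list, the weights, and the entity_types tagger are all hypothetical stand-ins (a real system would use a trained named-entity recognizer and learned feature weights):

# Score a candidate sentence by (1) how many non-stopword question terms it contains
# and (2) whether it contains an entity of the expected answer type.

STOPWORDS = {"the", "of", "is", "a", "an", "in", "where", "who", "when", "what"}

def score_passage(question, sentence, expected_type, entity_types):
    q_terms = {w for w in question.lower().split() if w not in STOPWORDS}
    s_terms = set(sentence.lower().split())
    term_score = len(q_terms & s_terms) / len(q_terms) if q_terms else 0.0
    type_score = 1.0 if expected_type in entity_types(sentence) else 0.0
    # The weights here are arbitrary; a feature-based retrieval model would learn them.
    return 0.6 * term_score + 0.4 * type_score

question = "where is the valley of the kings"
sentence = ("The Valley of the Kings is located on the West Bank "
            "of the Nile near Luxor in Egypt.")
print(score_passage(question, sentence, "location", lambda s: {"location"}))  # 1.0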

Another feature that has been shown to be useful for ranking passages is related

words from a thesaurus such as WordNet. For example, using WordNet relations, words

such as “fabricates”, “constructs”, and “makes” can be related to “manufactures” when

considering passages for the question “who manufactures magic chef appliances”. A

linear feature-based retrieval model provides the appropriate framework for combining

features associated with answer passages and learning effective weights. The final

selection of an answer from a text passage can potentially involve more linguistic

analysis and inference than is used to rank the text passages. In most cases, however,

users of a question answering system will want to see the context of an answer, or even

multiple answers, in order to verify that it appears to be correct or possibly to make a

decision about which is the best answer. For example, a system might return “Egypt” as

the answer to the Valley of the Kings question, but it would generally be more useful to

return the passage “The Valley of the Kings is located on the West Bank of the Nile near

Luxor in Egypt.”

From this perspective, we could view search engines as providing a spectrum of

responses for different types of queries, from focused text passages to entire

documents. Longer, more precise questions should produce more accurate, focused

responses, and in the case of fact-oriented questions such as those shown in the table below,

this will generally be true. The techniques used in question answering systems show

how syntactic and semantic features can be used to obtain more accurate results for

some queries, but they do not solve the more difficult challenges of information retrieval.

A TREC query such as “Where have dams been removed and what has been the

environmental impact?” looks similar to a fact-based question, but the answers need to

be more comprehensive than a list of locations or a ranked list of sentences. On the


other hand, using question answering techniques to identify the different text

expressions for dam removal should be helpful in ranking answer passages or

documents. Similarly, a TREC query such as “What is being done to increase mass

transit use?”, while clearly not a fact-based question, should also benefit from

techniques that could recognize discussions about the use of mass transit. These

potential benefits, however, have yet to be demonstrated in retrieval experiments, which

indicates that there are significant technical issues involved in applying these

techniques to large numbers of queries. Search engines currently rely on users

learning, based on their experience, to submit queries such as “mass transit” instead of

the more precise question.

Example Question                                               Question Category
What do you call a group of geese?                             Animal
Who was Monet?                                                 Biography
How many types of lemurs are there?                            Cardinal
What is the effect of acid rain?                               Cause/Effect
What is the street address of the White House?                 Contact Info
Boxing Day is celebrated on what day?                          Date
What is sake?                                                  Definition
What is another name for nearsightedness?                      Disease
What was the famous battle in 1836 between Texas and Mexico?   Event
What is the tallest building in Japan?                         Facility
What type of bridge is the Golden Gate Bridge?                 Facility Description
What is the most popular sport in Japan?                       Game
What is the capital of Sri Lanka?                              Geo-Political Entity
Name a Gaelic language.                                        Language
What is the world’s highest peak?                              Location

Example TREC QA questions and their corresponding question categories
