UNIT IV WEB SEARCH – LINK ANALYSIS AND SPECIALIZED SEARCH
Link Analysis – Hubs and Authorities – PageRank and HITS algorithms – Searching and Ranking – Relevance Scoring and Ranking for the Web – Similarity – Hadoop & MapReduce – Evaluation – Personalized Search – Collaborative Filtering and Content-Based Recommendation of Documents and Products – Handling the “Invisible” Web – Snippet Generation, Summarization, Question Answering, Cross-Lingual Retrieval.

4.1 Link Analysis:
The analysis of hyperlinks and the graph structure of the web have been
instrumental in the development of web search. Such link analysis is one of many
factors considered by web search engines in computing a composite score for a web
page on any given query.
Link analysis for web search has intellectual antecedents in the field of citation
analysis, aspects of which overlap with an area known as bibliometrics. These
disciplines seek to quantify the influence of scholarly articles by analyzing the pattern of
citations amongst them. Much as citations represent the conferral of authority from a
scholarly article to others, link analysis on the Web treats hyperlinks from a web page to
another as a conferral of authority. Clearly, not every citation or hyperlink implies such
authority conferral; for this reason, simply measuring the quality of a web page by the
number of in-links (citations from other pages) is not robust enough. For instance, one
may contrive to set up multiple web pages pointing to a target web page, with the intent
of artificially boosting the latter’s tally of in-links. This phenomenon is referred to as link
spam. Nevertheless, the phenomenon of citation is prevalent and dependable enough
that it is feasible for web search engines to derive useful signals for ranking from more
sophisticated link analysis. Link analysis also proves to be a useful indicator of what
page(s) to crawl next while crawling the web; this is done by using link analysis to guide
the priority assignment in the front queues.
The Web as a graph: (Refer Unit –I)
Unit- IV -1
Anchor text and the web graph:
The following fragment of HTML code from a web page shows a hyperlink
pointing to the home page of the Journal of the ACM:
<a href="http://www.acm.org/jacm/">Journal of the ACM.</a>
In this case, the link points to the page http://www.acm.org/jacm/ and the anchor
text is Journal of the ACM. Clearly, in this example the anchor is descriptive of the
target page. But then the target page (B = http://www.acm.org/jacm/) itself contains the
same description as well as considerable additional information on the journal. So what
use is the anchor text? The Web is full of instances where the page B does not provide
an accurate description of itself. In many cases this is a matter of how the publishers
of page B choose to present themselves; this is especially common with corporate web
pages, where a web presence is a marketing statement. For example, at the time of the
writing of this book the home page of the IBM corporation (http://www.ibm.com) did not
contain the term computer anywhere in its HTML code, despite the fact that IBM is
widely viewed as the world’s largest computer maker. Similarly, the HTML code for the
home page of Yahoo! (http://www.yahoo.com) does not at this time contain the word
portal. The fact that the anchors of many hyperlinks pointing to http://www.ibm.com include the word computer can be exploited by web search engines. For instance, the anchor text terms can be included as terms under which to index the target web page. Thus, the postings for the term computer would include the document http://www.ibm.com and those for the term portal would include the document http://www.yahoo.com, using a special indicator to show that these terms occur as anchor (rather than in-page) text. The use of anchor text has some interesting side effects. Searching for big blue on most web search engines returns the home page of the IBM corporation as the top hit; this is consistent with the popular nickname that many people use to refer to IBM. On the other hand, there have been (and continue to be) many instances where derogatory anchor text such as evil empire leads to somewhat unexpected results on querying for these terms on web search engines.
The window of text surrounding anchor text (sometimes referred to as extended
anchor text) is often usable in the same manner as anchor text itself; consider for
instance the fragment of web text there is good discussion of vedic scripture
<a>here</a>. This has been considered in a number of settings, and the useful width of this window has been studied.
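The idea of indexing anchor-text terms under the target page can be sketched as follows. This is a minimal illustration only: the postings structure, the page contents, and the "anchor"/"page" flag are all invented for this example.

```python
# A sketch of indexing anchor text under the link *target* (all URLs,
# terms, and the "anchor"/"page" flag are invented for illustration).
from collections import defaultdict

# postings[term] -> set of (url, source) pairs; source marks whether the
# term occurred in the page body or in anchor text pointing at the page.
postings = defaultdict(set)

def index_page(url, body_terms):
    for term in body_terms:
        postings[term].add((url, "page"))

def index_link(target_url, anchor_terms):
    # Anchor terms are credited to the target of the hyperlink.
    for term in anchor_terms:
        postings[term].add((target_url, "anchor"))

index_page("http://www.ibm.com", ["ibm", "servers", "services"])
index_link("http://www.ibm.com", ["computer", "maker"])
index_link("http://www.ibm.com", ["big", "blue"])

# "computer" never occurs on the IBM page itself, yet the page is now
# retrievable under that term via its incoming anchor text.
print(sorted(postings["computer"]))
```

The "anchor"/"page" flag mirrors the special indicator mentioned above, so a ranking function can weight anchor-text occurrences differently from in-page occurrences.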
PAGE RANK:
Our first technique for link analysis assigns to every node in the web graph a numerical score between 0 and 1, known as its PageRank. The PageRank of a node will depend on the link structure of the web graph. Given a query, a web search engine computes a composite score for each web page that combines hundreds of features such as cosine similarity and term proximity, together with the PageRank score.
Consider a random surfer who begins at a web page (a node of the web graph) and
executes a random walk on the Web as follows. At each time step, the surfer proceeds
from his current page A to a randomly chosen web page that A hyperlinks to. Figure
shows the surfer at a node A, out of which there are three hyperlinks to nodes B, C and
D; the surfer proceeds at the next time step to one of these three nodes, with equal
probabilities 1/3.
As the surfer proceeds in this random walk from node to node, he visits some
nodes more often than others; intuitively, these are nodes with many links coming in
from other frequently visited nodes. The idea behind PageRank is that pages visited more often in this walk are more important. What if the current location of the surfer, the
node A, has no out-links? To address this we introduce an additional operation for our
random surfer: the teleport operation. In the teleport operation the surfer jumps from a
node to any other node in the web graph. This could happen because he types an
address into the URL bar of his browser. The destination of a teleport operation is
modeled as being chosen uniformly at random from all web pages. In other words, if N
is the total number of nodes in the web graph, the teleport operation takes the surfer to
each node with probability 1/N. The surfer would also teleport to his present position
with probability 1/N. In assigning a PageRank score to each node of the web graph, we
use the teleport operation in two ways: (1) When at a node with no out-links, the surfer
invokes the teleport operation. (2)At any node that has outgoing links, the surfer invokes
the teleport operation with probability 0 < α < 1 and the standard random walk with
probability 1 − α, where α is a fixed parameter chosen in advance. Typically, α might be 0.1. Using the theory of Markov chains, it can be shown that when the surfer follows this combined process (random walk plus teleport) he visits each node v of the web graph a fixed fraction of the time π(v) that depends on (1) the structure of the web graph and (2) the value of α. We call this value π(v) the PageRank of v.
Markov chains:
A Markov chain is a discrete-time stochastic process: a process that occurs in a series
of time-steps in each of which a random choice is made. A Markov chain consists of N
states. Each web page will correspond to a state in the Markov chain we will formulate.
A Markov chain is characterized by an N × N transition probability matrix P, each of whose entries is in the interval [0, 1]; the entries in each row of P add up to 1. The Markov chain can be in one of the N states at any given time step; then, the entry Pij tells us the probability that the state at the next time step is j, conditioned on the current state being i. Each entry Pij is known as a transition probability and depends only on the current state i; this is known as the Markov property. Thus, by the Markov property, for all i, j:

    Pij ∈ [0, 1]   and   Σj Pij = 1.

A matrix with non-negative entries that satisfies this equation is known as a
stochastic matrix. A key property of a stochastic matrix is that it has a principal left
eigenvector corresponding to its largest eigenvalue, which is 1.
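This property is easy to check numerically. The following sketch (using NumPy; the matrix values are illustrative) computes the left eigenvectors of a small stochastic matrix and confirms that the principal eigenvalue is 1:

```python
# A quick numerical check (a sketch using NumPy): a stochastic matrix
# has 1 as its principal eigenvalue, and the corresponding left
# eigenvector gives the steady-state probabilities. The matrix values
# are illustrative.
import numpy as np

P = np.array([[0.2, 0.5, 0.3],      # rows sum to 1
              [0.6, 0.1, 0.3],
              [0.4, 0.4, 0.2]])

# Left eigenvectors of P are right eigenvectors of P transposed.
eigvals, eigvecs = np.linalg.eig(P.T)
principal = np.argmax(eigvals.real)
print(round(eigvals[principal].real, 6))   # 1.0

pi = eigvecs[:, principal].real
pi = pi / pi.sum()                  # normalize so the entries sum to 1
print(pi)                           # steady-state distribution: pi P = pi
```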
In a Markov chain, the probability distribution of next states depends only on the current state, and not on how the Markov chain arrived at the current state. The figure above shows a simple Markov chain with three states. From the middle state A, we proceed with (equal) probabilities of 0.5 to either B or C. From either B or C, we proceed with probability 1 to A. The transition probability matrix of this Markov chain, with states ordered A, B, C, is then

         A    B    C
    A    0    0.5  0.5
    B    1    0    0
    C    1    0    0
A Markov chain’s probability distribution over its states may be viewed as a
probability vector: a vector all of whose entries are in the interval [0, 1], and the entries
add up to 1. An N-dimensional probability vector each of whose components
corresponds to one of the N states of a Markov chain can be viewed as a probability
distribution over its states. For the simple Markov chain of the figure above, the probability vector would have 3 components that sum to 1.
We can view a random surfer on the web graph as a Markov chain, with one
state for each web page, and each transition probability representing the probability of
moving from one web page to another. The teleport operation contributes to these
transition probabilities. The adjacency matrix A of the web graph is defined as follows: if
there is a hyperlink from page i to page j, then Aij = 1, otherwise Aij = 0. We can readily
derive the transition probability matrix P for our Markov chain from the N × N matrix A:
1. If a row of A has no 1’s, then replace each element by 1/N. For all other rows
proceed as follows.
2. Divide each 1 in A by the number of 1’s in its row. Thus, if there is a row with
three 1’s, then each of them is replaced by 1/3.
3. Multiply the resulting matrix by 1 − α.
4. Add α/N to every entry of the resulting matrix, to obtain P.
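The four steps above translate almost directly into code. In the following sketch the three-page adjacency matrix is invented for illustration; note that applying steps 3 and 4 to a dangling row already set to 1/N leaves it at 1/N, so they can safely be applied to every row.

```python
# A direct transcription of the four steps above (a sketch; the
# three-page graph is invented for illustration). A is the adjacency
# matrix and alpha the teleport probability.
def transition_matrix(A, alpha):
    N = len(A)
    P = []
    for row in A:
        out_links = sum(row)
        if out_links == 0:
            p_row = [1.0 / N] * N                  # step 1: dangling node
        else:
            p_row = [x / out_links for x in row]   # step 2
        # steps 3 and 4: scale by (1 - alpha), then add alpha/N everywhere
        P.append([(1 - alpha) * p + alpha / N for p in p_row])
    return P

# Page 0 links to pages 1 and 2; page 1 links back to 0; page 2 dangles.
A = [[0, 1, 1],
     [1, 0, 0],
     [0, 0, 0]]
P = transition_matrix(A, alpha=0.5)
for row in P:
    print([round(p, 4) for p in row])   # every row sums to 1
```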
We can depict the probability distribution of the surfer’s position at any time by a probability vector x. At t = 0 the surfer may begin at a state whose corresponding entry in x is 1 while all others are zero. By definition, the surfer’s distribution at t = 1 is given by the probability vector xP; at t = 2 by (xP)P = xP^2, and so on. We can thus compute the surfer’s distribution over the states at any time, given only the initial distribution and the transition probability matrix P.
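A short sketch in plain Python evolves the surfer's distribution for the three-state chain described earlier (states ordered A, B, C):

```python
# Evolving the surfer's distribution for the three-state chain described
# earlier (states ordered A, B, C); a sketch in plain Python.
def step(x, P):
    N = len(x)
    return [sum(x[i] * P[i][j] for i in range(N)) for j in range(N)]

P = [[0.0, 0.5, 0.5],   # from A: equally likely to B or C
     [1.0, 0.0, 0.0],   # from B: always back to A
     [1.0, 0.0, 0.0]]   # from C: always back to A

x = [1.0, 0.0, 0.0]     # at t = 0 the surfer is at A
for t in range(1, 5):
    x = step(x, P)
    print(t, x)
```

Note that for this particular chain the distribution oscillates between (0, 0.5, 0.5) and (1, 0, 0) and never settles down: the chain is periodic, which is exactly the kind of behavior the aperiodicity condition below rules out.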
If a Markov chain is allowed to run for many time steps, each state is visited at a
(different) frequency that depends on the structure of the Markov chain. In our running
analogy, the surfer visits certain web pages (say, popular news home pages) more
often than other pages. We now make this intuition precise, establishing conditions
under which this visit frequency converges to a fixed, steady-state quantity. Following this, we set the PageRank of each node v to this steady-state visit frequency and show how it can be computed.
Definition: A Markov chain is said to be ergodic if there exists a positive integer T0
such that for all pairs of states i, j in the Markov chain, if it is started at time 0 in state i
then for all t > T0, the probability of being in state j at time t is greater than 0.
For a Markov chain to be ergodic, two technical conditions are required of its states
and the non-zero transition probabilities; these conditions are known as irreducibility
and aperiodicity. Informally, the first ensures that there is a sequence of transitions of
non-zero probability from any state to any other, while the latter ensures that the states
are not partitioned into sets such that all state transitions occur cyclically from one set to
another.
The PageRank computation:
The left eigenvectors of the transition probability matrix P are N-vectors π such that

    πP = λπ.

The N entries in the principal left eigenvector π are the steady-state probabilities of the random walk with teleporting, and thus the PageRank values for the corresponding web pages: if π is the probability distribution of the surfer across the web pages, he remains in the steady-state distribution π. Given that π is the steady-state distribution, we have that πP = 1π, so 1 is an eigenvalue of P. Thus if we were to compute the principal left eigenvector of the matrix P (the one with eigenvalue 1), we would have computed the PageRank values.
There are many algorithms available for computing left eigenvectors; we give here a rather elementary method, sometimes known as power iteration. If x is the initial distribution over the states, then the distribution at time t is xP^t. As t grows large, we would expect that the distribution xP^t is very similar to the distribution xP^(t+1), since for large t we would expect the Markov chain to attain its steady state. This is independent of the initial distribution x. The power iteration method simulates the surfer’s walk: begin at a state and run the walk for a large number of steps t, keeping track of the visit frequencies for each of the states. After a large number of steps t, these frequencies “settle down” so that the variation in the computed frequencies is below some predetermined threshold. We declare these tabulated frequencies to be the
PageRank values. As an example, consider a web graph with three nodes 1, 2 and 3, where node 1 links to node 2, node 2 links to nodes 1 and 3, and node 3 links to node 2; take α = 0.5. Applying the four steps above, the transition probability matrix of the surfer’s walk with teleportation is then

    P =   1/6   2/3   1/6
          5/12  1/6   5/12
          1/6   2/3   1/6

Imagine that the surfer starts in state 1, corresponding to the initial probability distribution vector x0 = (1 0 0). Then, after one step the distribution is x0P = (1/6 2/3 1/6), and as the iteration continues the distribution converges to the steady state π = (5/18 4/9 5/18) ≈ (0.28 0.44 0.28), the PageRank values of the three pages.
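Power iteration itself can be sketched in a few lines. The matrix below is the teleport-smoothed transition matrix for a small three-node chain graph with α = 0.5, as in the example above; the tolerance and iteration cap are illustrative choices.

```python
# Power iteration (a sketch). P is the teleport-smoothed transition
# matrix for a small three-node chain graph with alpha = 0.5.
def power_iteration(P, tol=1e-10, max_iter=1000):
    N = len(P)
    x = [1.0 / N] * N                  # any starting distribution works
    for _ in range(max_iter):
        nxt = [sum(x[i] * P[i][j] for i in range(N)) for j in range(N)]
        if max(abs(a - b) for a, b in zip(x, nxt)) < tol:
            return nxt
        x = nxt
    return x

P = [[1/6, 2/3, 1/6],
     [5/12, 1/6, 5/12],
     [1/6, 2/3, 1/6]]

pi = power_iteration(P)
print([round(p, 4) for p in pi])   # ≈ [0.2778, 0.4444, 0.2778]
```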
CROSS-LINGUAL RETRIEVAL (also called cross-language search or multilingual search):
By translating queries for one or more monolingual search engines covering different languages, it is possible to do cross-language search. A cross-language search engine receives a query in one language (e.g., English) and retrieves documents in a variety of other languages (e.g., French and Chinese). Users typically will not be familiar with a wide range of languages, so a cross-language search system must do the query translation automatically. Since the system also retrieves documents in multiple languages, some systems also translate these for the user.
Fig: Cross Language Search
The most obvious approach to automatic translation would be to use a large bilingual dictionary that contains the translation of a word in the source language (e.g., English) into the target language (e.g., French). Sentences would then be translated by looking up each word in the dictionary. The main issue is how to deal with ambiguity, since many words have multiple translations. Simple dictionary-based translations are generally poor, but a number of techniques have been developed, such as query expansion, that reduce ambiguity and raise the ranking effectiveness of a cross-language system to a level comparable to a monolingual system.
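A dictionary-based translation of a query can be sketched as follows. The bilingual dictionary entries here are a toy example (real dictionaries contain many thousands of entries); the sketch simply exposes the ambiguity problem described above, since each source word may carry several candidate translations.

```python
# A minimal sketch of dictionary-based query translation; all dictionary
# entries are invented for illustration. Ambiguity is the key problem:
# a source word may have several target-language translations.
bilingual_dict = {
    "fish":   ["poisson"],
    "bank":   ["banque", "rive"],                 # financial vs. river bank
    "spring": ["printemps", "ressort", "source"], # season, coil, water source
}

def translate_query(query):
    """Return every candidate translation for each query word."""
    return {word: bilingual_dict.get(word, [word]) for word in query.split()}

print(translate_query("bank fish"))
# {'bank': ['banque', 'rive'], 'fish': ['poisson']}
```

A real system would then apply disambiguation (for example, query expansion or co-occurrence statistics) to choose among the candidates rather than submitting them all.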
The most effective and general methods for automatic translation are based on
statistical machine translation models (Manning & Schütze, 1999). When translating
a document or a web page, in contrast to a query, not only is ambiguity a problem, but
the translated sentences should also be grammatically correct. Words can change
order, disappear, or become multiple words when a sentence is translated. Statistical
translation models represent each of these changes with a probability. This means that
the model describes the probability that a word is translated into another word, the
probability that words change order, and the probability that words disappear or become
multiple words. These probabilities are used to calculate the most likely translation for a
sentence.
Although a model that is based on word-to-word translation probabilities has
some similarities to a dictionary-based approach, if the translation probabilities are
accurate, they can make a large difference to the quality of the translation. Unusual
translations for an ambiguous word can then be easily distinguished from more typical
translations. More recent versions of these models, called phrase-based translation models, further improve the use of context in the translation by calculating the probabilities of translating sequences of words, rather than just individual words.
The probabilities in statistical machine translation models are estimated primarily
by using parallel corpora. These are collections of documents in one language
together with the translations into one or more other languages. The corpora are
obtained primarily from government organizations (such as the Government of India), news
organizations, and by mining the Web, since there are hundreds of thousands of
translated pages. The sentences in the parallel corpora are aligned either manually or
automatically, which means that sentences are paired with their translations. The
aligned sentences are then used for training the translation model.
Special attention has to be paid to the translation of unusual words, especially proper nouns such as people’s names. For these words in particular, the Web is a rich resource. Automatic transliteration techniques are also used to address the problem of people’s names. Proper names are not usually translated into another language, but are instead transliterated, meaning that the name is written in the characters of another language according to certain rules or based on similar sounds. This can lead to many alternative spellings for the same name.
For example, the Libyan leader Muammar Qaddafi’s name can be found in many different transliterated variants on web pages, such as Qathafi, Kaddafi, Qadafi, Gadafi, Gaddafi, Kathafi, Kadhafi, Qadhafi, Qazzafi, Kazafi, Qaddafy, Qadafy, Quadhaffi, Gadhdhafi, al-Qaddafi, Al-Qaddafi, and Al Qaddafi. Similarly, there are a number of
variants of “Bill Clinton” on Arabic web pages. Although they are not generally regarded
as cross-language search systems, web search engines can often retrieve pages in a
variety of languages. For that reason, many search engines have made translation
available on the result pages. The following figure shows an example of a page
retrieved for the query “pecheur france”, where the translation option is shown as a
hyperlink. Clicking on this link produces a translation of the page (not the snippet),
which makes it clear that the page contains links to archives of the sports magazine Le pêcheur de France, which is translated as “The fisherman of France”. Although the
translation provided is not perfect, it typically provides enough information for someone
to understand the contents and relevance of the page. These translations are generated
automatically using machine translation techniques, since any human intervention
would be prohibitively expensive.
Le pêcheur de France archives @ peche poissons - [ Translate this page ]
Le pêcheur de France Les média Revues de pêche Revue de presse Archives de la
revue
Le pêcheur de France janvier 2003 n°234 Le pêcheur de France mars 2003 ...
Figure. A French web page in the results list for the query “pecheur france”
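Returning to transliteration: name variants such as those of Qaddafi's name above can be grouped by simple string similarity. The following sketch uses Python's standard difflib; the 0.6 threshold is an illustrative choice, not a standard value, and some distant variants will be missed.

```python
# Grouping transliterated name variants by string similarity (a sketch
# using Python's standard difflib; the 0.6 threshold is illustrative).
import difflib

variants = ["Qathafi", "Kaddafi", "Qadafi", "Gadafi", "Gaddafi",
            "Kathafi", "Kadhafi", "Qadhafi", "Qaddafy", "Al-Qaddafi"]

def similar(a, b, threshold=0.6):
    ratio = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold

query = "Qaddafi"
matches = [v for v in variants if similar(query, v)]
print(matches)   # most, though not all, variants are recognized
```

Production systems typically use phonetic matching (sounds-based rules) rather than raw edit similarity, but the principle of collapsing variants to one canonical name is the same.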
Two Methods are used to solve the problem:
1. Query Translation
2. Document Translation
SNIPPET GENERATION, SUMMARIZATION:
Snippet: a short summary of the document, designed to allow the user to decide its relevance. Typically, the snippet consists of the document title and a short summary, which is automatically extracted. The two basic kinds of summaries are static and dynamic.
STATIC SUMMARY: a summary that is always the same, regardless of the query.
DYNAMIC SUMMARY (or query-dependent): a summary customized according to the user’s information need as deduced from a query.
Dynamic summaries attempt to explain why a particular document was retrieved
for the query at hand. A static summary is generally comprised of either or both a
subset of the document and metadata associated with the document. The simplest form
of summary takes the first two sentences or 50 words of a document, or extracts
particular zones of a document, such as the title and author. Instead of zones of a
document, the summary can instead use metadata associated with the document.
This may be an alternative way to provide an author or date, or may include
elements which are designed to give a summary, such as the description metadata
which can appear in the meta element of a web HTML page. This summary is typically
extracted and cached at indexing time, in such a way that it can be retrieved and
presented quickly when displaying search results, whereas having to access the actual
document content might be a relatively expensive operation.
KEYWORD-IN-CONTEXT (dynamic summaries):
A dynamic summary displays one or more “windows” on the document, chosen to present the pieces that are most useful to the user in judging the document against their information need. Usually these windows contain one or several of the query terms, and so are often referred to as keyword-in-context (KWIC) snippets, though sometimes they may still be pieces of the text, such as the title, that are selected for their query-independent information value, just as in the case of static summarization. Dynamic summaries are generated in conjunction with scoring.
If the query is found as a phrase, occurrences of the phrase in the document will
be shown as the summary. If not, windows within the document that contain multiple
query terms will be selected. Commonly these windows may just stretch some number
of words to the left and right of the query terms. This is a place where NLP techniques
can usefully be employed: users prefer snippets that read well because they contain
complete phrases.
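A minimal KWIC window extractor might look as follows. This is a sketch: the window width and the fallback to a document prefix are illustrative choices, and a production system would prefer windows aligned to sentence boundaries, as noted above.

```python
# A sketch of keyword-in-context (KWIC) snippet extraction: take a
# fixed-width window of words around the first query-term occurrence.
def kwic_snippet(text, query_terms, width=5):
    words = text.split()
    for i, w in enumerate(words):
        if w.lower().strip(".,") in query_terms:
            start = max(0, i - width)
            return " ".join(words[start:i + width + 1])
    return " ".join(words[:2 * width + 1])   # fallback: document prefix

doc = ("One of the leading suppliers of tropical, coldwater and marine "
       "fish, with a next day fish delivery service across the country.")
print(kwic_snippet(doc, {"fish"}))
```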
Dynamic summaries are generally regarded as greatly improving the usability of
IR systems, but they present a complication for IR system design. A dynamic summary
cannot be precomputed, but, on the other hand, if a system has only a positional index, then it cannot easily reconstruct the context surrounding search engine hits in
order to generate such a dynamic summary. This is one reason for using static
summaries. The standard solution to this in a world of large and cheap disk drives is to
locally cache all the documents at index time.
Then, a system can simply scan a document which is about to appear in a displayed results list to find snippets containing the query words. Beyond simple access to the text, producing a good KWIC snippet requires some care.
Given a variety of keyword occurrences in a document, the goal is to choose
fragments which are:
(i) maximally informative about the discussion of those terms in the document,
(ii) self-contained enough to be easy to read, and
(iii) short enough to fit within the normally strict constraints on the space available
for summaries.
Generating snippets must be fast since the system is typically generating many
snippets for each query that it handles. Rather than caching an entire document, it is
common to cache only a generous but fixed size prefix of the document, such as
perhaps 10,000 characters. For most common, short documents, the entire document
is thus cached, but huge amounts of local storage will not be wasted on potentially vast
documents. Summaries of documents whose length exceeds the prefix size will be
based on material in the prefix only, which is in general a useful zone in which to look
for a document summary anyway.
If a document has been updated since it was last processed by a crawler and
indexer, these changes will be neither in the cache nor in the index. In these
circumstances, neither the index nor the summary will accurately reflect the current
contents of the document, but it is the differences between the summary and the actual
document content that will be more glaringly obvious to the end user.
A document summary for a web search typically contains the title and URL of the web page, links to live and cached versions of the page, and, most importantly, a short text summary, or snippet, that is used to convey the content of the page. In addition, most result pages contain advertisements consisting of short
descriptions and links. Query words that occur in the title, URL, snippet, or
advertisements are highlighted to make them easier to identify, usually by displaying
them in a bold font.
Figure 1 gives an example of a document summary from a result page for a web
search. In this case, the snippet consists of two partial sentences. Figure 2 gives more
examples of snippets that are sometimes full sentences, but often text fragments,
extracted from the web page. Some of the snippets do not even contain the query
words.
Tropical Fish
One of the U.K.’s leading suppliers of Tropical, Coldwater, Marine Fish and Invertebrates plus ...
next day fish delivery service ...
www.tropicalfish.org.uk/tropical_fish.htm Cached page
Fig 1: Typical document summary for a web search
Fig 2: Snippet Generation
HANDLING “INVISIBLE” WEB:
Not all parts of the Web are easy for a crawler to navigate. Sites that are difficult
for a crawler to find are collectively referred to as the deep Web (also called the hidden
Web or invisible web). Some studies have estimated that the deep Web is over a
hundred times larger than the traditionally indexed Web, although it is very difficult to
measure this accurately. Most sites that are a part of the deep Web fall into three broad
categories:
• Private sites are intentionally private. They may have no incoming links, or may require
you to log in with a valid account before using the rest of the site. These sites generally
want to block access from crawlers, although some news publishers may still want their
content indexed by major search engines.
• Form results are sites that can be reached only after entering some data into a form.
For example, websites selling airline tickets typically ask for trip information on the site’s
entry page. You are shown flight information only after submitting this trip information.
Even though you might want to use a search engine to find flight timetables, most
crawlers will not be able to get through this form to get to the timetable information.
• Scripted pages are pages that use JavaScript, Flash, or another client-side language
in the web page. If a link is not in the raw HTML source of the web page, but is instead
generated by JavaScript code running on the browser, the crawler will need to execute
the JavaScript on the page in order to find the link. Although this is technically possible,
executing JavaScript can slow down the crawler significantly and adds complexity to the
system.
Sometimes people make a distinction between static pages and dynamic pages.
Static pages are files stored on a web server and displayed in a web browser
unmodified, whereas dynamic pages may be the result of code executing on the web
server or the client. Typically it is assumed that static pages are easy to crawl, while
dynamic pages are hard. This is not quite true, however. Many websites have
dynamically generated web pages that are easy to crawl; wikis are a good example of
this. Other websites have static pages that are impossible to crawl because they can be
accessed only through web forms. Web administrators of sites with form results and
scripted pages often want their sites to be indexed, unlike the owners of private sites. Of
these two categories, scripted pages are easiest to deal with. The site owner can
usually modify the pages slightly so that links are generated by code on the server
instead of by code in the browser. The crawler can also run page JavaScript, or perhaps
Flash as well, although these can take a lot of time.
The most difficult problems come with form results. Usually these sites are
repositories of changing data, and the form submits a query to a database system. In
the case where the database contains millions of records, the site would need to expose
millions of links to a search engine’s crawler. Adding a million links to the front page of
such a site is clearly infeasible. Another option is to let the crawler guess what to enter
into forms, but it is difficult to choose good form input. Even with good guesses, this
approach is unlikely to expose all of the hidden data.
PERSONALIZED SEARCH:
A major deficiency of current search tools is their lack of adaptation to the user’s
preferences. Although the quality of search has improved dramatically in the last few
years and as a result user satisfaction has risen, search engines fall short of
understanding an individual user’s need and, accordingly, ranking the results for that
individual. The first ingredient, that is, the collection of personal search data, is already
present, and search engines such as Google have been hard at work to gain our trust
so that they can collect this personal data without raising too many privacy concerns.
We benefit by getting more powerful tools and the search engine benefits from the
increased internet traffic through their site.
When surfers use the search engine, cookies can be used to store their past interactions with the search service, and the inference mechanism can then personalize their query results. For example, if a searcher can be identified as a man, a query such as “shoes” may be narrowed down to “men’s shoes”.
Two approaches to search engine personalization based on search engine log
data may be useful. In a click-based approach, the user’s query and click pairs are
used for personalization. The idea is simple. When a user repeats queries over time, he
or she will prefer certain pages, that is, those that were more frequently clicked. The
downside of this approach is that if a search engine presents the same old pages to the
user each time a query is repeated it does not encourage the user to discover new
pages. On the other hand, this type of historical information may be quite useful to the
user. This approach can be refined by using content similarity to include similar queries
and web pages in the personalized results.
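The click-based approach can be sketched as a simple re-ranking of results by past click counts. All queries, URLs, and counts below are invented for illustration.

```python
# A sketch of click-based re-ranking: results the user clicked more
# often for the same query in the past are promoted. All data invented.
from collections import Counter

click_log = Counter({
    ("jaguar", "wildlife.example.org"): 7,
    ("jaguar", "cars.example.com"): 1,
})

def personalize(query, results):
    # Stable sort: unclicked results keep the engine's original order.
    return sorted(results, key=lambda url: -click_log[(query, url)])

results = ["cars.example.com", "parts.example.net", "wildlife.example.org"]
print(personalize("jaguar", results))
# ['wildlife.example.org', 'cars.example.com', 'parts.example.net']
```

Because the sort is stable, pages with no click history retain the search engine's ranking, which partially mitigates the "same old pages" problem noted above.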
In a topic-based approach, a topical ontology is used to identify a user’s
interests. The ontology should include general topics that are of interest to web surfers
such as the top-level topics from the Open Directory. Then a classification technique,
such as naive Bayes, needs to be chosen in order to be able to classify the queries that
users submit and the pages that they visit;
The next step is to identify the user’s preferences based on their searches, and
finally these preferences can be used to personalize their results, for example, by
ranking them according to the learned preferences.
A dynamic and adaptive approach to personalization must be capable of
monitoring the users’ activity over time and to infer their interests and preferences as
their behavior changes over time. To implement dynamic user profiles, machine
learning techniques, such as Bayesian or neural networks, provide a sound basis for
improving the machine’s understanding of the human behind the machine.
Personalization versus Customization:
Customization involves the layout of the user interface, for example the color scheme to be used, the content displayed on the personalized web page, and various other settings.
Personalized Results Tool:
The Personalized Results Tool (PResTo!) is implemented as a plug-in to the
browser rather than being server based. This is a unique feature that bypasses some of
the privacy and security issues, which are becoming increasingly important to users,
since in the case of PResTo!, the ownership of the software and the personal data
generated from searches are in the user’s hands. A client-side approach is also more
efficient for the search engine, since it does not have to manage the user profiles, and
thus scalability will not be an issue.
A downside of the client-side approach from the users’ point of view is that the
profile is less portable, but a partial solution to this problem may be to store the profile
on a local trusted server, which would enable remote access. A downside from the
search engines’ point of view is that a client-side approach can, in principle, be used to
personalize results from any search engine that the user interacts with, using a single
profile applicable to all searching. Personalization proceeds as follows:
suppose that the user issues a query to his or her favorite search engine. The
personalization plug-in detects this and sends the query results, which have been
returned to the user’s browser, to the personalization engine (on the user’s machine),
which then reranks the results according to the user’s profile and makes its
recommendations to the user in a separate window within the browser, alongside the
results returned by the search engine.
Personalized PageRank:
PageRank values are personalized to the interests of an individual user, or
biased toward a particular topic such as sports or business. The optimization step is of
prime importance because each personalized PageRank vector will need to be
computed separately, and for web search companies such as Google, scalability of their
operation is a crucial ongoing concern. We refer to this special case of personalized
PageRank when the surfer is always teleported to a single page, as the individual
PageRank for that page. A more realistic preference may be to jump to a page that the
user has bookmarked or to a page from the user’s history list, with the probability being
proportional to the number of times the user visited the page in the past.
An important result, called the linearity theorem, simplifies the computation of
personalized PageRank vectors. It states that any personalized PageRank vector can
be expressed as a linear combination of individual PageRank vectors. In particular, the
global PageRank vector is the average of all individual PageRank vectors, one for each
page on the Web. This simplifies the computation of personalized PageRank vectors:
individual PageRank vectors can be precomputed and then combined on demand,
depending on the preferred web pages in a personalization instance. As we have seen
in Section 5.2.6, we can compute PageRank via a Monte Carlo simulation that samples
many random walks from each web page. The PageRank of a given page is then
computed as the proportion of random walks that end at that page. Looking at it from a
personalized perspective, we can compute individual PageRank vectors by looking only
at the samples that start at the single web page being personalized, as suggested by
Fogaras et al. The individual PageRank vectors can then be combined in an arbitrary
way, according to the linearity theorem, to obtain the required personalized PageRank
vector. An interesting variation of PageRank is topic-sensitive PageRank, which is
biased according to some representative set of topics, based on
categories chosen, say, from the Open Directory. Another variation of PageRank, called
BlockRank, computes local PageRank values on a host basis, and then weights these
local PageRank values according to the global importance of the host.
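The linearity theorem can be illustrated with a small numerical sketch (a toy four-page graph with no dangling nodes; the function and variable names below are illustrative): individual PageRank vectors are precomputed, and mixing them with a user's preference weights matches a directly personalized computation.

```python
import numpy as np

def pagerank(adj, teleport, d=0.85, iters=200):
    # Power iteration with a custom teleportation distribution.
    # Assumes every page has at least one out-link (no dangling nodes).
    n = len(adj)
    M = np.zeros((n, n))
    for j, out in enumerate(adj):
        for i in out:
            M[i, j] = 1.0 / len(out)   # column-stochastic link matrix
    v = np.full(n, 1.0 / n)
    for _ in range(iters):
        v = d * (M @ v) + (1 - d) * teleport
    return v

# Toy 4-page graph: adj[j] lists the pages that page j links to.
adj = [[1, 2], [2], [0], [0, 2]]

# Individual PageRank vectors: teleportation always to one single page.
individual = [pagerank(adj, np.eye(4)[p]) for p in range(4)]

# Preference: teleport to bookmarked pages 0 and 2 with equal probability.
prefs = np.array([0.5, 0.0, 0.5, 0.0])
direct = pagerank(adj, prefs)

# Linearity theorem: the personalized vector is the same linear
# combination of the precomputed individual vectors.
combined = sum(w * v for w, v in zip(prefs, individual))
print(np.allclose(direct, combined))  # True
```

Precomputing the individual vectors once and mixing them on demand is exactly the optimization that the linearity theorem enables.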
Outride’s Personalized Search:
Link analysis based on the evaluation of the authority of web sites is biased
against relevance, as determined by individual users. For example, when you submit
the query “java” to Google, you get many pages on the programming language Java,
rather than the island in Indonesia or the well-known coffee from Java. Popularity or
usage-based ranking adds to the link-based approach, by capturing the flavor of the day
and how relevance is changing over time for the user base of the search engine. In both
these approaches, relevance is measured for the population of users and not for the
individual user. Outride set out to build a model of the user, based on the context of the
activity of the user, and individual user characteristics such as prior knowledge and
history of search. The Outride system set out to integrate these features into the user
interface as follows. Once a user, say Archie, submits his query, its context is determined and the
query is augmented with related terms. After it is processed, it is individualized, based
on demographic information and the past user history. A feature called “Have Seen,
Have Not Seen” allows the user to distinguish between old and new information. The
Outride user interface is integrated into the browser as a side bar that can be opened
and closed much like the favorites and history lists. According to the authors of the
research paper, searches were faster and easier to complete using Outride. It remains
to see what Google will do with this thought-provoking technology. Jeff Heer, who is
acquainted with Outride’s former employees, said in his weblog that “Their technology
was quite impressive, building off a number of PARC innovations, but they were in the
right place at the wrong time”. It is worth mentioning that Google has released a tool
enabling users to search the Web from any application within the Windows operating
system. Another significant feature of the tool is that its query results can be displayed
in a separate window rather than in the user’s browser.
HADOOP & MAP REDUCE:
Hadoop is a framework that allows processing and storing huge data sets.
Basically, Hadoop can be divided into two parts: processing and storage. So,
MapReduce is a programming model which allows you to process huge data stored in
Hadoop. When you install Hadoop in a cluster, you get MapReduce as a service where
you can write programs to perform computations on data in a parallel and distributed
fashion.
What is MapReduce?
MapReduce is a processing technique and a programming model for distributed
computing based on Java. The MapReduce algorithm contains two important tasks,
namely Map and Reduce. Map takes a set of data and converts it into another set of
data, where individual elements are broken down into tuples (key/value pairs). The
Reduce task then takes the output from a map as its input and combines those data
tuples into a smaller set of tuples. As the sequence of the name MapReduce implies,
the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing
over multiple computing nodes. Under the MapReduce model, the data processing
primitives are called Mappers and Reducers. Decomposing a data processing
application into mappers and reducers is sometimes nontrivial. But, once we write an
application in the MapReduce form, scaling the application to run over hundreds,
thousands, or even tens of thousands of machines in a cluster is merely a configuration
change. This simple scalability is what has attracted many programmers to use the
MapReduce model.
The Algorithm:
Generally, the MapReduce paradigm is based on sending the computation to where the
data resides. A MapReduce program executes in three stages, namely the map stage,
the shuffle stage, and the reduce stage.
Map stage: The map or mapper’s job is to process the input data. Generally the
input data is in the form of file or directory and is stored in the Hadoop file system
(HDFS). The input file is passed to the mapper function line by line. The mapper
processes the data and creates several small chunks of data.
Reduce stage: This stage is the combination of the Shuffle stage and
the Reduce stage. The Reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which will be stored in
the HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
The framework manages all the details of data-passing such as issuing tasks,
verifying task completion, and copying data around the cluster between the
nodes.
Most of the computing takes place on nodes with data on local disks that reduces
the network traffic.
After completion of the given tasks, the cluster collects and reduces the data to
form an appropriate result, and sends it back to the Hadoop server.
Inputs and Outputs:
The MapReduce framework operates on <key, value> pairs, that is, the
framework views the input to the job as a set of <key, value> pairs and produces a set
of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need
to implement the Writable interface. Additionally, the key classes have to implement
the WritableComparable interface to facilitate sorting by the framework.
Input and Output types of a MapReduce job: (Input) <k1, v1> -> map -> <k2, v2>->
reduce -> <k3, v3>(Output).
          Input              Output
Map       <k1, v1>           list(<k2, v2>)
Reduce    <k2, list(v2)>     list(<k3, v3>)
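The map -> shuffle -> reduce flow can be sketched in-memory with the classic word-count example (Python is used for illustration; these names are not the Hadoop Java API):

```python
from collections import defaultdict

def mapper(line):
    # Map: (k1, v1) -> list of (k2, v2); here, (word, 1) per word.
    for word in line.split():
        yield word.lower(), 1

def reducer(key, values):
    # Reduce: (k2, list(v2)) -> (k3, v3); here, (word, total count).
    return key, sum(values)

def map_reduce(lines):
    # Shuffle stage: group all intermediate values by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))

counts = map_reduce(["the map stage", "the reduce stage"])
print(counts)  # {'map': 1, 'reduce': 1, 'stage': 2, 'the': 2}
```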
Terminology:
PayLoad - Applications implement the Map and the Reduce functions, and form
the core of the job.
Mapper - Mapper maps the input key/value pairs to a set of intermediate
key/value pairs.
NameNode - Node that manages the Hadoop Distributed File System (HDFS).
DataNode - Node where data is presented in advance before any processing
takes place.
MasterNode - Node where JobTracker runs and which accepts job requests
from clients.
SlaveNode - Node where Map and Reduce program runs.
JobTracker - Schedules jobs and tracks the assigned jobs with the Task Tracker.
Task Tracker - Tracks the tasks and reports status to the JobTracker.
Job - An execution of a Mapper and Reducer across a dataset.
Task - An execution of a Mapper or a Reducer on a slice of data.
Task Attempt - A particular instance of an attempt to execute a task on a
SlaveNode.
COLLABORATIVE FILTERING:
Collaborative filtering (CF) automates word-of-mouth recommendation: “The process
in which the purchaser of a product or service tells friends, family, neighbors, and
associates about its virtues, especially when this happens in advance of media
advertising.”
As an example, suppose that I read a book, I like it, and recommend it to my
friends. Those of my friends who have a similar taste in books to mine may decide to
read the book and then recommend it to their friends. This is CF at work, through the
power of social networking.
In an e-commerce site, this process may be automated as follows. When I buy a
book, this in itself is an implicit recommendation, but the site could ask me for an
explicit rating of the book, say on a scale of 1 to 10. When my friend logs
onto the site, the CF system will be able to deduce that his taste in books is similar to
mine, since we have purchased similar items in the past. The system will also notice
that he has not yet bought the book that I have rated highly, and then recommend this
book to my friend. This is the essence of collaborative filtering. In practice, the system
will collect as many recommendations as it can and score them according to their
overall popularity before presenting the top recommendations to the user.
User-Based Collaborative Filtering:
Consider the user-item matrix shown in Table. Each row represents a user, and each
column represents an item. A number in the i-th row and j-th column is the rating that
user i assigned to item j; an empty cell indicates
that the user did not rate that item. The ellipsis (· · ·) at the end of each row indicates
that we have shown only a small fraction of the items. In a typical e-commerce scenario,
a user would normally rate (or purchase) only a few products, say 30, out of the millions
that may be available, so that the user–item matrix is very sparse.
This sparsity problem has a negative effect on recommendation systems, since
there may not be enough data for the system to make a reliable prediction. In order
to find like-minded users, that is, users with similar tastes, there needs to be
sufficient overlap in their buying habits (in the case of an e-commerce site) or page
views (in case of an e-learning or e-content site), for the system to have a statistically
significant assessment of their similarity. Another related problem is the first-rater
problem: how can an item that has not yet been rated be recommended? An
e-commerce site may still want to promote items having no rating,
and in this case a content-based approach is necessary. The ratings for an item can
be collected explicitly or implicitly. Explicit rating requires the user to give feedback to
the system on the quality of the item; it is normally a number between 1 and 10, with
low numbers providing negative feedback and high numbers providing positive
feedback. Implicit feedback is collected without any special user intervention; the
system observes the user’s behavior and constructs a rating for the item based on the
information it has. The best indicator of positive feedback in an e-commerce
setting is when users buy the item; in other settings, such as e-learning, the amount of
time users spend and/or the number of mouse operations they carry out when viewing
the content is normally used to measure their interest in the content.
A CF algorithm takes the user–item matrix as input and produces user
recommendations for the active user as output. For each user, an item vector is
constructed, where 0 implies that the item is unrated. For example, the item vector for
Alex is <1, 0, 5, 4>, for George it is <2, 3, 4, 0>, for Mark it is <4, 5, 0, 2>, and for Peter
it is <0, 0, 4, 5>. Assume that Alex is the active user.
One measure of similarity between two vectors is their dot product. This is called
vector similarity and is computed by
multiplying the ratings in the two vectors item by item and summing up the results. (The
result may be normalized so that it is a number between 0 and 1.) For example, the
vector similarity between Alex and Peter is 40, between Alex and George it is 22 and
between Alex and Mark it is 12.
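The dot products quoted above can be checked directly; below is a minimal sketch using the item vectors as given (the helper name is illustrative):

```python
# Item vectors from the example above (0 = unrated).
ratings = {
    "Alex":   [1, 0, 5, 4],
    "George": [2, 3, 4, 0],
    "Mark":   [4, 5, 0, 2],
    "Peter":  [0, 0, 4, 5],
}

def vector_similarity(u, v):
    # Multiply the ratings item by item and sum the results.
    return sum(a * b for a, b in zip(u, v))

print(vector_similarity(ratings["Alex"], ratings["Peter"]))   # 40
print(vector_similarity(ratings["Alex"], ratings["George"]))  # 22
print(vector_similarity(ratings["Alex"], ratings["Mark"]))    # 12
```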
Another measure of similarity between two rows in the user–item matrix is to
compute the Pearson correlation between them, taking into account only the overlapping nonzero items; that is, items that were rated by both users. Correlation
measures only linear relationships between users, giving a number between −1 and 1;
more complex nonlinear relationships cannot be measured with this method. Both these
similarity measures suffer from problems related to the sparsity of the user–item matrix.
First, the similarity may be based on only a few observations and therefore may not be
accurate; in the extreme case of only two items in common, the Pearson correlation will
always return either 1 or −1. The second problem arises when there is no overlap
between the users’ nonzero-rated items.
In this case, both approaches cannot detect any similarity and a content-based
approach must be used instead. The users who have positive similarity to the active
user are called its neighbors. In the next step of the CF process, the predicted score
for the active user on an item he or she has not rated is computed using the k-nearest
neighbors of the active user; that is, the k users who are most similar to the active
user.
More specifically, the predicted score is computed by adding to the active user’s
average score the weighted average of the deviations of the k-nearest neighbors’
ratings from their own average ratings; the weight of each neighbor is given by his or
her similarity to the active user.
The predicted rating for search engines for Alex is computed as follows. The
nearest neighbors to Alex who have rated search engines are George and Mark.
George’s average rating is 3 and Mark’s is 11/3 ≈ 3.67. The deviation of George’s
score for search engines from his average rating is zero, while the deviation of Mark’s
score is 5 − 3.67 = 1.33. Weighting this deviation by Mark’s similarity and dividing by
the sum of the similarities of the nearest neighbors, 22 + 12 = 34, we get
1.33 × (12/34) = 0.47. Finally, adding Alex’s average, we get the prediction of
3.33 + 0.47 = 3.80 for the item search engines.
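A minimal sketch recomputing the prediction directly from the item vectors above (helper names are illustrative; note that Mark's average over his rated items <4, 5, 2> is 11/3 ≈ 3.67):

```python
# User-based CF prediction recomputed from the example item vectors
# (0 denotes "unrated"; item 1 is "search engines").
ratings = {
    "Alex":   [1, 0, 5, 4],
    "George": [2, 3, 4, 0],
    "Mark":   [4, 5, 0, 2],
}

def mean(v):
    rated = [r for r in v if r > 0]
    return sum(rated) / len(rated)

def sim(u, v):
    # Dot-product (vector) similarity.
    return sum(a * b for a, b in zip(u, v))

def predict(active, item, neighbours):
    a = ratings[active]
    num = den = 0.0
    for n in neighbours:
        v = ratings[n]
        if v[item] > 0:                     # neighbour rated the item
            w = sim(a, v)
            num += w * (v[item] - mean(v))  # deviation from own average
            den += w
    return mean(a) + num / den

p = predict("Alex", 1, ["George", "Mark"])  # p ≈ 3.80
```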
We note that when the ratings are binary, that is, 0 for no rating and 1 for a
positive rating, then the average rating of rated items is always 1, and so the deviation
of a rated item from the average will always be 0. In this case, the predicted rating for
an item the active user did not see will always be 1, independent of the weighting of its
neighbors, as long as there is at least one other user having positive similarity to the
active user.
To summarize, the user-based CF method has the following steps:
1. users rate items either explicitly or implicitly;
2. similarity between like-minded users is computed;
3. predictions are made for items that the active user has not rated, and the
nearest neighbors’ ratings are used for scoring the recommendations.
The formal statement of the prediction made by user-based CF for the rating of a new
item by the active user is presented in Equation, where
1. pa,i is the prediction for the active user, a, for item, i ;
2. k is the number of nearest neighbors of a used for prediction;
3. wa,u is the similarity between a and a neighbor, u of a;
4. ru,i is the rating that user u gave to item i, and ra is the average rating of a.
User-Based CF:
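A standard formulation consistent with the variables listed above, where \bar{r}_u denotes the average rating of neighbor u, is:

```latex
p_{a,i} = \bar{r}_a + \frac{\sum_{u=1}^{k} w_{a,u}\,\left(r_{u,i} - \bar{r}_u\right)}{\sum_{u=1}^{k} \lvert w_{a,u} \rvert}
```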
Item-Based Collaborative Filtering:
Item-to-item recommendation systems try to match similar items that have been
co-rated by different users, rather than similar users or customers that have overlapping
interests in terms of their rated items. With regards to the user–item matrix, item-to-item
CF looks at column similarity rather than row similarity, and, as in user-based methods,
vector similarity can be used. For the matrix shown in Table, the vector similarity
between data mining and search engines is 26, between data mining and databases it
is 13, and between data mining and XML it is 12.
In order to predict a rating, pa,i, for the active user, a, for an item i, all items, say j,
that are similar to i and were rated by a, are taken into account. For each such j, the
similarity between items i and j, denoted by si,j, is computed and then weighted by the
rating, ra,j, that a gave to j. These values are summed and normalized to give the
prediction. The formal statement for the prediction made by item-based CF for the rating
of a new item by the active user is presented in Equation.
Item-Based CF:
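A standard formulation, with the sums ranging over the items j similar to i that were rated by the active user a, is:

```latex
p_{a,i} = \frac{\sum_{j} s_{i,j}\, r_{a,j}}{\sum_{j} \lvert s_{i,j} \rvert}
```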
In item-to-item algorithms, the number of items to be recommended is often
limited by a constant, say n, so that only the top-n predicted ratings of items similar to
the items rated by the active user are returned. Experiments comparing the item-to-item
algorithm to the user-based algorithm, described above, have shown consistently that
the item-to-item algorithm is not only much faster but also produces better quality
predictions.
The predicted rating for data mining for Peter is computed as follows. The
normalized weight of the similarity between data mining and databases is
13/(13 + 12) = 13/25 = 0.52, and between data mining and XML it is 12/25 = 0.48.
Adding up these weights multiplied by Peter’s ratings gives a predicted rating of
0.52 × 4 + 0.48 × 5 = 4.48 for data mining.
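A minimal sketch of this computation over the columns of the example matrix (function names are illustrative):

```python
# Columns of the user-item matrix (rows: Alex, George, Mark, Peter).
items = {
    "data mining":    [1, 2, 4, 0],
    "search engines": [0, 3, 5, 0],
    "databases":      [5, 4, 0, 4],
    "xml":            [4, 0, 2, 5],
}

def sim(u, v):
    # Dot-product (vector) similarity between item columns.
    return sum(a * b for a, b in zip(u, v))

def predict(user, target, rated):
    # Similarity-weighted average of the user's ratings for items
    # similar to `target` that the user has rated.
    weights = {j: sim(items[target], items[j]) for j in rated}
    total = sum(weights.values())
    return sum(w / total * items[j][user] for j, w in weights.items())

# Peter (row 3) rated databases (4) and XML (5).
print(round(predict(3, "data mining", ["databases", "xml"]), 2))  # 4.48
```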
Model-Based Collaborative Filtering:
Apart from the algorithms we have presented, there have been several other
proposals, notably methods that use machine learning techniques to build a statistical
model of the user–item matrix, which is then used to make predictions. One such
technique trains a neural network for each user, which learns to predict the user’s rating
for a new item. Another technique builds association rules such as “90% of users who
like items i and j also like item k; 30% of all users like all these items.”
The rules are generally of the form X ⇒ Y, where X is a set of items and Y is
another item, as in user-based algorithms. In this case, the rule is {i, j} ⇒ {k}. The 30%
in the rule refers to its support; that is, out of all the users in the user–item matrix, 30%
like all three items (this includes the items in both X and Y ). The 90% refers to the
confidence of the rule; that is, the proportion of users who like all three items (the
items in both X and Y) out of the users who like i and j (the items in X). For
prediction purposes, we are interested in rules
such that all the items in the left-hand side of these rules were rated by the active user
but the item on their right-hand side was not. Setting the support and confidence to the
minimum desired levels, the rules can be ranked according to their confidence, for those
whose support is above the desired minimum.
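A small sketch of support and confidence over a toy binary "likes" relation (illustrative data; the numbers here are not those of the rule quoted above):

```python
# Support and confidence for a rule X => Y, here {i, j} => {k}.
# Each user is represented by the set of items he or she likes.
users = [
    {"i", "j", "k"},
    {"i", "j", "k"},
    {"i", "j"},
    {"i", "k"},
    {"j"},
]

def support(itemset):
    # Fraction of all users who like every item in `itemset`.
    return sum(itemset <= u for u in users) / len(users)

def confidence(x, y):
    # Among users who like all of X, the fraction who also like all of Y.
    return support(x | y) / support(x)

print(support({"i", "j", "k"}))                 # 0.4
print(round(confidence({"i", "j"}, {"k"}), 2))  # 0.67
```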
Yet another technique uses the naive Bayes classifier. The basic idea is as
follows, with the user–item matrix being the input. For the purpose of this algorithm, we
consider items to be rated as “liked” or “disliked,” or to be unrated. The problem is to
compute the probability that an item will be liked or disliked by the active user given
ratings of other users. The naive Bayes assumption states, in this case, that the
probability that a user (other than the active user) likes an item, given that the active
user likes an item, is independent of the probability that yet another user likes an item
given that the active user likes an item. This allows us to assess the probability that an
item is liked by the active user, given other user ratings, as being proportional to the
product of the probabilities of each user liking an item given that the active user likes an
item.
It remains to compute the probability that a user, say j , likes an item given that
the active user likes an item. This probability measures the similarity between user j and
the active user. For this we make use only of the items that both j and active user have
rated. Suppose that there are n items, which both user j and the active user rated, and
out of these the active user liked m items. Moreover, suppose that k out of the m items
were also liked by user j . Then the probability that j will like an item given that the active
user likes an item is k/m. Thus the estimation of the probability that the active user will
like an item, say i, that user j has liked but the active user has not rated is also k/m.
Multiplying all these probabilities together for all other users that like item i gives us an
estimate of the probability that the active user will like i . Preliminary experiments with
this method have shown it to be more accurate than the standard user-based algorithm.
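A sketch of this estimate on toy data (names are illustrative; the product is proportional to, not equal to, the desired probability):

```python
# Naive Bayes CF estimate: for each user j who liked the candidate
# item "i", estimate P(j likes an item | active user likes it) = k/m
# from their co-rated items, then multiply the estimates together.
active_liked = {"a", "b", "c"}        # items the active user liked
active_rated = {"a", "b", "c", "d"}   # items the active user rated
others = {
    # user: (items rated, items liked)
    "j1": ({"a", "b", "d", "i"}, {"a", "b", "i"}),
    "j2": ({"a", "c", "i"},      {"a", "i"}),
}

def like_given_like(rated, liked):
    common = active_rated & rated           # items both users rated
    m = len(common & active_liked)          # ...that the active user liked
    k = len(common & active_liked & liked)  # ...that user j liked as well
    return k / m

# Unnormalized estimate that the active user will like item "i".
p = 1.0
for rated, liked in others.values():
    if "i" in liked:                        # only users who liked item i
        p *= like_given_like(rated, liked)
print(p)  # 0.5
```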
CONTENT-BASED RECOMMENDATION SYSTEMS:
In order to deal with the sparsity problem (where few if any users have rated any
items that the active user has rated) and the first-rater problem (where no users have
rated an item), a content-based approach to recommendation needs to be deployed.
Content-based approaches are not collaborative, since they involve only the active user
and the items they interact with.
For content-based systems to work, the system must be able to build a profile of
the user’s interests, which can be done explicitly or implicitly. The user’s interests
include the categories he/she prefers in relation to the application; for example, does
the user prefer fiction to nonfiction books, and pop music to classical music. Once the
system has a user profile, it can check similarity of the item (or content) a user is
viewing to the profile, and according to the degree of similarity create a rating for the
item (or content). This is much like the search process, where, in this case, the profile
acts as a query and the items presented to the user act as the query results. The
higher the item is rated, the higher is its ranking when presented to the user.
Content-based and CF systems can be combined as follows, assuming we wish
to make a prediction for item i , and that we are measuring the similarity between the
active user and another user, say j . The item vectors for the active user and user j are
normally sparse, so we make use of content-based filtering to fill in pseudoratings for
items that were rated by one but not the other user, ensuring that the range of
pseudoratings is the same as for other user ratings.
After this stage, both vectors have a larger overlap, alleviating the sparsity
problem of CF methods. The content-based predictions can be weighted according to
the number of ratings the user had, since its accuracy depends on this number.
The algorithm can now continue much as before, making a prediction for item i
using the k-nearest neighbor method.
Another aspect of CF algorithms is that of serendipity, defined in the Oxford
dictionary as “The occurrence and development of events by chance in a happy or beneficial way.”
Although users like to get recommendations that they are familiar with, they also
like to see novel recommendations that they did not expect but are interesting to them.
It is especially pleasing to get a recommendation of something interesting that one
was not previously aware of.
CF has an advantage over content-based methods in this respect, since the
recommendations are not based on the content but rather on how it is rated. This factor
can be boosted by giving preference to similar but “nontypical” users, and by not always
recommending the most popular items. For example, every customer of an online
supermarket will buy the standard items such as milk and apples, so there is not much
point in recommending these items. A notable content-based recommender system for
music is Pandora (www.pandora.com), founded by Tim Westergren in 2000 on the back
of the music genome project. The way it works is that each song is represented by a
vector of up to about 400 features, called genes, each assigned a number between 1
and 5 in half integer increments. For example, there are genes for the instrument type,
for the music style, and for the type of lyrics. The song vectors are constructed by
experts, each song taking about 20–30 minutes to construct. As of mid-2006, the music
genome library contained over 400,000 songs from 20,000 contemporary artists. In
addition, according to the FAQ on Pandora’s site, about 15,000 new song vectors are
added to the library every month. When a user listens to a song, a list of similar songs
can be constructed using a similarity measure such as standard vector similarity.
Content-based recommender systems inevitably have the effect of reinforcing what the
user listens to rather than being unexpected as are CF systems. However, one
advantage of Pandora’s approach is that its listeners have access to music in the long
tail, as the experts can construct vectors for less popular songs, for example, very new
songs of musicians who may not be well known, or old songs that have fallen out of fashion. On
the other hand, this approach does not scale to the degree that, say, CF does due to
the time consuming human effort in constructing the song vectors. In order to tune its
recommendations, Pandora also collects user ratings to allow its algorithms to adjust
the feature weights and personalize future suggestions. Another interesting content-
based approach that is proving to be competitive is to analyze the signal waveform of
songs and to make automated recommendations based on musical similarity.
Evaluation of Collaborative Filtering Systems:
The most common metric used to measure the distance between the predicted
and true ratings is the mean absolute error (MAE). This is simply the sum of the
absolute values of the differences between the predicted and true ratings divided by the
number of predictions made. The MAE is less appropriate when we wish the accuracy
of the top-rated items to be higher than that of the low-rated items, or when we are only
interested in a binary rating; that is, is the item “good” or is it “bad”?
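A minimal sketch of the MAE computation:

```python
# Mean absolute error between predicted and true ratings.
def mae(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

print(round(mae([3.8, 4.5, 2.0], [4, 5, 2]), 2))  # 0.23
```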
Scalability of Collaborative Filtering Systems:
In the first stage, the user–item matrix is preprocessed offline into an item-to-item
matrix. This offline stage, which is computationally intensive, calculates a similarity
measure between co-rated items as in item-to-item recommendation systems. The
computation, although extremely time intensive, is manageable since the user–item
matrix is sparse. However, it can be made more efficient for very popular items by
sampling users who have rated these items. It is also possible to discard users with very
few rated items, and to discard extremely popular or unpopular items.
In the second stage, the recommendation algorithm uses the item-to-item matrix output
from the first stage to deliver recommendations for the active user in real time, via a
computation, which is independent of the size of the original user–item matrix, and
depends only on the number of items the active user has rated.
Question Answering:
The task of question answering involves providing a specific answer to a
user’s query, rather than a ranked list of documents. This task has a long history in
the fields of natural language processing and artificial intelligence. Early question
answering systems relied on detailed representations in logic of small, very specific
domains such as baseball, lunar rocks, or toy blocks. More recently, the focus has
shifted to an information retrieval perspective where the task involves identifying or
extracting answers found in large corpora of text.
The figure above shows the typical components of a question answering system that
retrieves answers from a text corpus. The range of questions that is handled by such a
system is usually limited to fact-based questions with simple, short answers, such as
who, where, and when questions that have people’s names, organization names,
places, and dates as answers. The following questions are a sample from the TREC
question answering (QA) track:
Who invented the paper clip?
Where is the Valley of the Kings?
When was the last major eruption of Mt. St. Helens?
There are, of course, other types of fact-based questions that could be asked, and they
can be asked in many different ways. The task of the question analysis and
classification component of the system is to classify a question by the type of answer
that is expected. For the TREC QA questions, one classification that is frequently used
has 31 different major categories, many of which correspond to named entities that can
be automatically identified in text. Following Table gives an example of a TREC
question for each of these categories. Question classification is a moderately difficult
task, given the large variation in question formats. The question word what, for example,
can be used for many different types of questions.
The information derived from question analysis and classification is used by the
answer selection component to identify answers in candidate text passages, which are
usually sentences. The candidate text passages are provided by the passage retrieval
component based on a query generated from the question. Text passages are retrieved
from a specific corpus or the Web. In TREC QA experiments, candidate answer
passages were retrieved from TREC news corpora, and the Web was often used as an
additional resource. The passage retrieval component of many question answering
systems simply finds passages containing all the non-stopwords in the question.
In general, however, passage retrieval is similar to other types of search, in that
features associated with good passages can be combined to produce effective
rankings. Many of these features will be based on the question analysis. Text passages
containing named entities of the type associated with the question category as well as
all the important question words should obviously be ranked higher.
For example, with the question “where is the valley of the kings”, sentences containing
text tagged as a location and the words “valley” and “kings” would be preferred. Some
systems identify text patterns associated with likely answers for the question category,
using either text mining techniques with the Web or predefined rules. Patterns such as
<question-location> in <location>, where question-location is “valley of the kings” in this
case, may often be found in answer passages. The presence of such a pattern should
improve the ranking of a text passage.
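The pattern idea can be sketched with a regular expression, assuming for illustration that a location entity is approximated by a capitalized word (a stand-in for a real named-entity tagger):

```python
import re

# Sketch of the answer-pattern "<question-location> in <location>".
# A capitalized word after "in" approximates a location entity here;
# a real system would use named-entity tags instead.

def find_location_answers(question_location, passage):
    pattern = re.escape(question_location) + r"\b.*?\bin ([A-Z][a-z]+)"
    return re.findall(pattern, passage)

passage = ("The Valley of the Kings is located on the West Bank "
           "of the Nile near Luxor in Egypt.")
print(find_location_answers("Valley of the Kings", passage))  # ['Egypt']
```

A passage matching such a pattern would then receive a boost in the passage ranking.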
Another feature that has been shown to be useful for ranking passages is related
words from a thesaurus such as WordNet. For example, using WordNet relations, words
such as “fabricates”, “constructs”, and “makes” can be related to “manufactures” when
considering passages for the question “who manufactures magic chef appliances”. A
linear feature-based retrieval model provides the appropriate framework for combining
features associated with answer passages and learning effective weights. The final
selection of an answer from a text passage can potentially involve more linguistic
analysis and inference than is used to rank the text passages. In most cases, however,
users of a question answering system will want to see the context of an answer, or even
multiple answers, in order to verify that it appears to be correct or possibly to make a
decision about which is the best answer. For example, a system might return “Egypt” as
the answer to the Valley of the Kings question, but it would generally be more useful to
return the passage “The Valley of the Kings is located on the West Bank of the Nile near
Luxor in Egypt.”
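The linear feature-based ranking described above can be sketched as a weighted sum of passage features; the feature names and weight values here are illustrative assumptions, and in practice the weights would be learned from training data:

```python
# Sketch of linear feature-based passage ranking: each candidate
# passage is scored as a weighted sum of its features. Feature names
# and weights are illustrative; real weights are learned.

def score_passage(features, weights):
    return sum(weights[name] * value for name, value in features.items())

weights = {"term_overlap": 1.0, "entity_type_match": 2.0, "pattern_match": 3.0}

candidates = [
    {"term_overlap": 1.0, "entity_type_match": 1.0, "pattern_match": 1.0},
    {"term_overlap": 0.5, "entity_type_match": 0.0, "pattern_match": 0.0},
]
ranked = sorted(candidates, key=lambda f: score_passage(f, weights),
                reverse=True)
print(score_passage(candidates[0], weights))  # 6.0
```

The passage with the entity-type and pattern matches outranks the one with only partial term overlap, reflecting the feature preferences discussed above.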
From this perspective, we could view search engines as providing a spectrum of
responses for different types of queries, from focused text passages to entire
documents. Longer, more precise questions should produce more accurate, focused
responses, and in the case of fact-oriented questions such as those shown in the table,
this will generally be true. The techniques used in question answering systems show
how syntactic and semantic features can be used to obtain more accurate results for
some queries, but they do not solve the more difficult challenges of information retrieval.
A TREC query such as “Where have dams been removed and what has been the
environmental impact?” looks similar to a fact-based question, but the answers need to
be more comprehensive than a list of locations or a ranked list of sentences. On the
other hand, using question answering techniques to identify the different text
expressions for dam removal should be helpful in ranking answer passages or
documents. Similarly, a TREC query such as “What is being done to increase mass
transit use?”, while clearly not a fact-based question, should also benefit from
techniques that could recognize discussions about the use of mass transit. These
potential benefits, however, have yet to be demonstrated in retrieval experiments, which
indicates that there are significant technical issues involved in applying these
techniques to large numbers of queries. Search engines currently rely on users
learning, based on their experience, to submit queries such as “mass transit” instead of
the more precise question.
Example Question                                               Question Category
What do you call a group of geese?                             Animal
Who was Monet?                                                 Biography
How many types of lemurs are there?                            Cardinal
What is the effect of acid rain?                               Cause/Effect
What is the street address of the White House?                 Contact Info
Boxing Day is celebrated on what day?                          Date
What is sake?                                                  Definition
What is another name for nearsightedness?                      Disease
What was the famous battle in 1836 between Texas and Mexico?   Event
What is the tallest building in Japan?                         Facility
What type of bridge is the Golden Gate Bridge?                 Facility Description
What is the most popular sport in Japan?                       Game
What is the capital of Sri Lanka?                              Geo-Political Entity
Name a Gaelic language.                                        Language
What is the world’s highest peak?                              Location
Example TREC QA questions and their corresponding question categories