CS344: Introduction to Artificial Intelligence. Pushpak Bhattacharyya, CSE Dept., IIT Bombay. Lecture 32: Information Retrieval: Basic Concepts and Model


  • CS344: Introduction to Artificial Intelligence

    Pushpak Bhattacharyya, CSE Dept., IIT Bombay

    Lecture 32: Information Retrieval: Basic Concepts and Model

  • The elusive user satisfaction

    Ranking

    Correctness of Query Processing

    Coverage

    Indexing

    Crawling

    NER

    Stemming

    MWE

  • Query: Indian Tribes in Latin America

  • Google

    Indians of Latin America: an exhibition of materials in the Lilly ...
    Lilly Library: Latin American mss. Brazil. A large map in colors, this locates the course of rivers, towns, mountain ranges, and Indian tribes. ...
    www.indiana.edu/~liblilly/etexts/ila/ - 241k

    Indigenous peoples of the Americas - Wikipedia, the free encyclopedia
    American Indian creation legends tell of a variety of originations of ..... that it had confirmed the presence of 67 different uncontacted tribes in Brazil, ...
    en.wikipedia.org/wiki/Indigenous_peoples_of_the_Americas - 178k

    Cognition :: Giving Technologies New Meaning
    The volumes that Farabee produced from his travels include Indian Tribes of Eastern Peru ... motor vehicles that are lemons Indian tribes of Latin America ...
    wikipedia.cognition.com/?num=10&from_val...Indian%20tribes%20of%20Latin%20Ame... - 54k

    Top 25 American Indian Tribes for the United States: 1990 and 1980--Con. ...
    16028 73.0 Canadian and Latin American... 19375 248.3 Chickasaw. ...
    www.census.gov/population/socdemo/race/indian/ailang1.txt - 6k

    Ten Largest American Indian Tribes, 2000 Infoplease.com
    Latin American Indian, 180940. Choctaw, 158774. Sioux, 153360 ... American Indian and Alaska Native Population by Selected Tribes, Census 2000 ...
    www.infoplease.com/ipa/A0767349.html - 29k

    The Indian Tribes of North America by John R. Swanton at Questia ...
    Read the complete book The Indian Tribes of North America by becoming a ..... Sao Paulo recently elected its ... must cope with demands by Latin America for ...
    www.questia.com/library/book/the-indian-tribes-of-north-america-by-john-r-swanton.jsp

  • Yahoo

    different indian tribes of latin america, More...

    WEB RESULTS

    South America Daily
    Indian Pepper Photos Prices Spices. The Times of India ... Archaeologists unearth ancient tribe members sacri London ... Iran and the left in Latin America ...
    www.wn.com/LatinAmerica - 192k

    Native American Indian Cultures - Mexico, South America
    Also, many of the Yanomamo tribe are losing their members and culture by ... of Amazon Indian tribal art in the world, with over 75 tribes represented. ...
    indian-cultures.com

    Native American Indian Cultures - links
    North American Tribes. rednation.org - RedNation of the Cherokee. Meso and Latin American Indians ... Human Rights in Latin America...
    www.indian-cultures.com/Cultures/Links.html

    Indigenous peoples of the Americas - Wikipedia, the free encyclopedia
    ... in America, particularly with regards to native Indians. ... Uncontacted Indian tribe found in Brazil's Amazon. The Peopling of the American Continents ...
    en.wikipedia.org/wiki/Indigenous_peoples_of_the_Americas - 179k

    Native American Images - American Indian North America Tribe Map
    American Indian North America Tribe Map. Click here to view more images ... Medal | History Hotline | Iraqi War | Korean War | Latin Americans | Medal of ...
    www.nativeamericans.com/NativeAmericanImages6.htm

    Resources for
    Numbers of Native Americans or Indians in Latin America: 39,442,000 million ... Indian Tribes in Latin America - Latin American Indian Population - Up date ...
    www.xmission.com/~amauta/population.htm

    Indian tribe found in Brazil's Amazon - Boston.com
    Latin America/Caribbean. Indian tribe found in Brazil's Amazon ... Uncontacted tribes are usually discovered when loggers and ranchers encroach on ...
    boston.com/news/world/latinamerica/articles/2007/06/01/.../ News

  • AltaVista

    Latin America
    Compare airfare prices from over 120 top websites and save up to 70%.
    Flights.SideStep.com

    Regional Telecom Statistics & Forecasts
    Fixed, mobile, Internet, broadband telecom statistics and forecasts.
    www.hottelecom.com

    AltaVista found 4,520,000 results

    South America Daily
    Indian Pepper Photos Prices Spices. The Times of India ... Archaeologists unearth ancient tribe members sacri London ... Iran and the left in Latin America...
    www.wn.com/LatinAmerica

    Native American Indian Cultures - Mexico, South America
    Also, many of the Yanomamo tribe are losing their members and culture by ... of Amazon Indian tribal art in the world, with over 75 tribes represented. ...
    indian-cultures.com

    Indian tribes in Suriname cross borders - Boston.com
    Days of rain near Suriname's southern border have deluged Amerindian farmland, ... Latin America/Caribbean. Indian tribes in Suriname cross borders ...
    www.boston.com/news/world/latinamerica/articles/2006/05/12...in_suriname_cross_borders

    Indigenous peoples of the Americas - Wikipedia, the free encyclopedia
    ... in America, particularly with regards to native Indians. ... Uncontacted Indian tribe found in Brazil's Amazon. The Peopling of the American Continents ...
    en.wikipedia.org/wiki/Indigenous_peoples_of_the_Americas

    Native American Images - American Indian North America Tribe Map
    American Indian North America Tribe Map. Click here to view more images ... Medal | History Hotline | Iraqi War | Korean War | Latin Americans | Medal of ...
    www.nativeamericans.com/NativeAmericanImages6.htm

  • MSN

    Native American Images - American Indian North America Tribe Map
    Click here to view more images ... History Hotline | Iraqi War | Korean War | Latin Americans ...
    www.nativeamericans.com/NativeAmericanImages6.htm

    Resources for
    ... 152t.), Panama (126t.), Paraguay (67t.), Surinam (10t.), and Venezuela (331t.) (t.=thousand). - Indian Tribes in Latin America
    www.xmission.com/~amauta/tribes.htm

    Latin America Community Assistance Foundation - LACA
    The Tarahumara Indians are the most primitive of all Indian tribes in North America, and are the least touched by modern society.
    www.lacafoundation.org/?page_id=58

    Latin America Tour Set for Curtis Photos of North America Tribes
    28 September 2005. Latin America Tour Set for Curtis Photos of North America Tribes. Famed photographer recorded Indian tribal life in 19th, early 20th century
    www.america.gov/st/washfile-english/2005/September/20050928134700GLnesnoM0.2225763.html

    Latin America // Current
    Current TV Latin America category, discover popular Latin America stories, news and ... of the Amazon jungle, a land conflict between rice farmers and a handful of Indian tribes ...
    current.com/topics/75844112_latin_america

    Bloomberg.com: Latin America
    May 30 (Bloomberg) -- Brazil's National Indian Foundation has discovered an Indian tribe in the Amazon that hasn't had contact with civilization in a rare sighting of the few ...
    www.bloomberg.com/apps/news?pid=20601086&sid=aSrj5wfHW.CQ&refer=latin_america

  • Personalized focused search (wikipedia.cognition)

    Indian Latin-America tribe: 249 files

    William Curtis Farabee
    The volumes that Farabee produced from his travels include Indian Tribes of Eastern Peru based on his first trip in 1906-1908 (Obituary, 1925).

    Mexican Texas
    Settlers were empowered to create their own militias to help control hostile Indian tribes. Texas faced raids from both the Apache and Comanche tribes, [...]

    Temecula, California
    The Luiseño and Cahuilla tribes were involved, rather bloodily, in the local battles of the Mexican-American War during the following years.

    Kaweah Indian Nation
    Recently, scam artists have sold purported citizenships in the non-recognized tribe, particularly to Mexican nationals who have entered the US illegally.1 [...]

    Flag of Puerto Rico
    The tribal nation flag of the Jatibonicu Taino Indians of Borikén, represents the Jatibonicu Taino tribe's original pre-Columbian territories of [...]

    Maina Indians
    The Maina Indians are a group of tribes constituting a distinct linguistic stock, the [...] along the north bank of the Marañón River in South America

    Erie (tribe)
    ^ Ebooks by Google: "Handbook of American Indians North of Mexico" By Frederick Webb Hodge http://books.google.com/books?

    Miccosukee
    [1] Other members went on to form the Miccosukee Tribe of Indians of Florida, which was not recognized by Fidel Castro's Cuban government in 1959. The [...]

    New Tribes Mission
    In Paraguay in 1979 and 1986, New Tribes Mission was accused of assisting in the forcible contact of nomadic Ayoreo Indians.

  • Example: Semantically precise search for relations/events

    Query: afghans destroying opium poppies

  • India Wide Cross Lingual Information Access (CLIA) Endeavour

  • Motivation

    English still the most dominant language on the web

    Contributes 72% of the content

    Number of non-English users steadily rising all over the world

    English penetration in India

    Estimated to be around 3-4%

    Mostly the urban educated class

    Need to enable access to above information through local languages

  • Cross Language Information Retrieval (CLIR)

    [Figure: CLIR pipeline. Labels: Hindi Query; CLIR Engine; Language Resources; Target Language Index (Crawled and Indexed Web Pages in English); Target Information in English; Ranked List of Results; Result Snippets in Hindi]

  • Challenges involved in CLIA

    Indexing, retrieval and ranking of multilingual documents

    Web data is not clean and regular

    Different font encodings, some of them proprietary

    Spelling variations very common

    Different document encodings

    Language identification needed to invoke appropriate language analyzers

    Involves a number of fundamental NLP research problems like query disambiguation, machine transliteration, named-entity recognition, multi-word recognition

  • Cross Language Information Access (CLIA) Consortia Project

    Indian Language CLIR Engine under development

    Input: Six Indian Languages (Hindi, Bengali, Telugu, Tamil, Marathi and Punjabi)

    Output: Hindi, English and Input Language of Query

    Domains: Tourism (Current Release)

    Involves 10 academic institutes all over the country: IITs, Indian Statistical Institute, CDAC, Anna University, Jadavpur University

    IIT Bombay: Overall co-ordinator; responsible for Hindi, Marathi language verticals

    Includes full-fledged search features: Snippet translation, Summary generation, Information Extraction

  • Portal

    Public portal released at http://www.clia.iitb.ac.in/clia-beta-ext/ in September 2009. (Outside IITB)

    Public portal released at http://www.clia.iitb.ac.in:8080/clia-beta-ext/ in September 2009. (Inside IITB)

  • Recent Press Coverage

  • Hindustan Times

  • IR Basics

    (mainly from R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, Wokingham, UK, 1999,

    and Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.)

  • Definition of IR Model

    An IR model is a quadruple $[D, Q, F, R(q_i, d_j)]$

    where

    $D$: documents

    $Q$: queries

    $F$: framework for modeling documents, queries and their relationships

    $R(\cdot,\cdot)$: ranking function returning a real number expressing the relevance of $d_j$ to $q_i$

  • Index Terms

    Keywords representing a document

    Semantics of the word helps remember the main theme of the document

    Generally nouns

    Assign numerical weights to index terms to indicate their importance

  • Introduction

    [Figure: retrieval overview. Labels: Docs; Index Terms; doc; Information Need; query; match; Ranking]

  • Classic IR Models - Basic Concepts

    The importance of the index terms is represented by weights associated to them

    Let

    $t$ be the number of index terms in the system

    $K = \{k_1, k_2, k_3, \ldots, k_t\}$ be the set of all index terms

    $k_i$ be an index term

    $d_j$ be a document

    $w_{ij}$ be a weight associated with $(k_i, d_j)$

    $w_{ij} = 0$ indicates that the term does not belong to the doc

    $\vec{d_j} = (w_{1j}, w_{2j}, \ldots, w_{tj})$ is a weighted vector associated with the document $d_j$

    $g_i(\vec{d_j}) = w_{ij}$ is a function which returns the weight associated with the pair $(k_i, d_j)$

  • The Boolean Model

    Simple model based on set theory

    Only AND, OR and NOT are used

    Queries specified as Boolean expressions: precise semantics, neat formalism, e.g. $q = k_a \wedge (k_b \vee \neg k_c)$

    Terms are either present or absent. Thus, $w_{ij} \in \{0, 1\}$

    Consider $q = k_a \wedge (k_b \vee \neg k_c)$

    $\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$

    $\vec{q}_{cc} = (1,1,0)$ is a conjunctive component

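  • The Boolean Model

    [Figure: Venn diagram of $K_a$, $K_b$, $K_c$ with the regions (1,1,1), (1,1,0) and (1,0,0) marked for $q = k_a \wedge (k_b \vee \neg k_c)$]

    $sim(q, d_j) = 1$ if $\exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\ g_i(\vec{d_j}) = g_i(\vec{q}_{cc}))$

    $sim(q, d_j) = 0$ otherwise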
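
    The matching rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `dnf_components` and `boolean_sim` are hypothetical, the query is passed as an ordinary Python predicate, and the DNF is found by brute-force enumeration, which is practical only for a handful of terms.

```python
from itertools import product

def dnf_components(query, t):
    """Return every binary term vector of length t that satisfies the query.

    Each returned tuple is one conjunctive component of the query's DNF.
    """
    return {bits for bits in product((0, 1), repeat=t) if query(*bits)}

def boolean_sim(doc_vec, qdnf):
    """Return 1 if the document's binary vector equals some component, else 0."""
    doc_bits = tuple(1 if w > 0 else 0 for w in doc_vec)
    return 1 if doc_bits in qdnf else 0

# q = ka AND (kb OR NOT kc), the query used on the slides
q = lambda ka, kb, kc: bool(ka and (kb or not kc))
qdnf = dnf_components(q, 3)

print(sorted(qdnf, reverse=True))    # [(1, 1, 1), (1, 1, 0), (1, 0, 0)]
print(boolean_sim((1, 1, 0), qdnf))  # 1: the document matches a component
print(boolean_sim((0, 1, 1), qdnf))  # 0: ka is absent, so no match
```

    The all-or-nothing return value is exactly what the next slide criticizes: a document either matches a component completely or is not retrieved at all.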

  • Drawbacks of the Boolean Model

    Retrieval based on binary decision criteria with no notion of partial matching

    No ranking of the documents is provided (absence of a grading scale)

    Information need has to be translated into a Boolean expression, which most users find awkward

    The Boolean queries formulated by the users are most often too simplistic

    As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query

  • The Vector Model

    Use of binary weights is too limiting

    Non-binary weights provide consideration for partial matches

    These term weights are used to compute a degree of similarity between a query and each document

    Ranked set of documents provides for better matching

  • The Vector Model

    Define:

    $w_{ij} > 0$ whenever $k_i \in d_j$

    $w_{iq} \ge 0$ associated with the pair $(k_i, q)$

    $\vec{d_j} = (w_{1j}, w_{2j}, \ldots, w_{tj})$

    $\vec{q} = (w_{1q}, w_{2q}, \ldots, w_{tq})$

    In this space, queries and documents are represented as weighted vectors

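  • The Vector Model

    [Figure: vectors $\vec{d_j}$ and $\vec{q}$ drawn against axes $i$ and $j$, separated by the angle $\theta$]

    $sim(q, d_j) = \cos\theta = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{ij} \, w_{iq}}{|\vec{d_j}| \, |\vec{q}|}$

    Since $w_{ij} \ge 0$ and $w_{iq} \ge 0$, $0 \le sim(q, d_j) \le 1$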
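
    A minimal Python sketch of the cosine formula, assuming documents and queries are already given as equal-length sequences of non-negative weights. The function name `cosine_sim` and the choice to return 0.0 for an all-zero vector are assumptions made for this illustration, not part of the lecture.

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(w_d * w_q for w_d, w_q in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumption: an empty vector matches nothing
    return dot / (norm_d * norm_q)

print(cosine_sim((1, 0, 1), (1, 1, 1)))  # partial match: strictly between 0 and 1
print(cosine_sim((1, 1, 1), (2, 2, 2)))  # same direction: approximately 1
```

    Because the score depends only on the angle, scaling every weight of a document by the same factor leaves its similarity to the query unchanged.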

  • The Vector Model

    $sim(q, d_j) = \frac{\sum_{i} w_{ij} \, w_{iq}}{|\vec{d_j}| \, |\vec{q}|}$

    How to compute the weights $w_{ij}$ and $w_{iq}$?

    A good weight must take into account two effects:

    quantification of intra-document contents (similarity): the tf factor, the term frequency within a document

    quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency

    $w_{ij} = tf(i,j) \cdot idf(i)$

  • The Vector Model

    Let

    $N$ be the total number of docs in the collection

    $n_i$ be the number of docs which contain $k_i$

    $freq(i,j)$ be the raw frequency of $k_i$ within $d_j$

    A normalized tf factor is given by $f(i,j) = freq(i,j) / \max_l freq(l,j)$, where the maximum is computed over all terms which occur within the document $d_j$

    The idf factor is computed as $idf(i) = \log(N / n_i)$. The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term $k_i$.
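
    The two factors can be computed directly from tokenized documents. The sketch below is illustrative only: the three-document toy collection, the function names `normalized_tf` and `idf`, and the use of the natural logarithm are assumptions (the slides do not fix the base of the log).

```python
import math
from collections import Counter

def normalized_tf(doc_tokens):
    """f(i,j) = freq(i,j) / max_l freq(l,j) for every term in one document."""
    freq = Counter(doc_tokens)
    if not freq:
        return {}
    max_freq = max(freq.values())
    return {term: f / max_freq for term, f in freq.items()}

def idf(docs):
    """idf(i) = log(N / n_i), where n_i counts the documents containing term i."""
    N = len(docs)
    n = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(N / n_i) for term, n_i in n.items()}

# Hypothetical toy collection, already tokenized
docs = [
    ["tribe", "amazon", "tribe"],
    ["amazon", "river"],
    ["tribe", "map"],
]

print(normalized_tf(docs[0]))  # {'tribe': 1.0, 'amazon': 0.5}
idf_values = idf(docs)
print(idf_values["river"] > idf_values["tribe"])  # True: the rarer term gets the larger idf
```

    A term that occurs in every document gets idf = log(1) = 0, so it contributes nothing to the ranking.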

  • The Vector Model

    The best term-weighting schemes use weights which are given by $w_{ij} = f(i,j) \cdot \log(N / n_i)$

    This strategy is called a tf-idf weighting scheme

    For the query term weights, a suggestion is $w_{iq} = \left(0.5 + \frac{0.5 \, freq(i,q)}{\max_l freq(l,q)}\right) \cdot \log(N / n_i)$

    The vector model with tf-idf weights is a good ranking strategy with general collections

    The vector model is usually as good as the known ranking alternatives. It is also simple and fast to compute.
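
    As a rough sketch of the two weighting formulas on this slide, the functions below take a tokenized document or query plus the collection statistics $N$ and $n_i$. The names `doc_weights` and `query_weights`, the sample statistics, the natural logarithm, and the decision to skip query terms that never occur in the collection (where $n_i = 0$ would make the idf undefined) are all assumptions of this example.

```python
import math
from collections import Counter

def doc_weights(doc_tokens, N, n):
    """w_ij = f(i,j) * log(N / n_i), with f the max-normalized term frequency."""
    freq = Counter(doc_tokens)
    max_freq = max(freq.values())
    return {k: (f / max_freq) * math.log(N / n[k]) for k, f in freq.items()}

def query_weights(query_tokens, N, n):
    """w_iq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / n_i)."""
    freq = Counter(query_tokens)
    max_freq = max(freq.values())
    return {
        k: (0.5 + 0.5 * f / max_freq) * math.log(N / n[k])
        for k, f in freq.items()
        if k in n  # assumption: terms unseen in the collection are ignored
    }

# Hypothetical collection statistics: 4 documents, n[k] = documents containing k
N = 4
n = {"tribe": 2, "amazon": 1, "map": 4}

w_d = doc_weights(["tribe", "tribe", "amazon"], N, n)
w_q = query_weights(["tribe", "map", "unknown"], N, n)
print(w_d)
print(w_q)  # "map" occurs in every document, so its weight is 0.0
```

    The 0.5 offset in the query formula keeps every query term at no less than half of its idf, so a term mentioned once is not swamped by a term the user happened to repeat.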

  • The Vector Model

    Advantages:

    term-weighting improves quality of the answer set

    partial matching allows retrieval of docs that approximate the query conditions

    cosine ranking formula sorts documents according to degree of similarity to the query

    Disadvantages:

    assumes independence of index terms (??); not clear that this is bad though

  • The Vector Model: Example I

    [Figure: Venn diagram of index terms k1, k2, k3 with documents d1 to d7 placed inside it]

          k1   k2   k3   q·dj
    d1     1    0    1     2
    d2     1    0    0     1
    d3     0    1    1     2
    d4     1    0    0     1
    d5     1    1    1     3
    d6     1    1    0     2
    d7     0    1    0     1

    q      1    1    1

  • The Vector Model: Example II

    [Figure: Venn diagram of index terms k1, k2, k3 with documents d1 to d7 placed inside it]

          k1   k2   k3   q·dj
    d1     1    0    1     4
    d2     1    0    0     1
    d3     0    1    1     5
    d4     1    0    0     1
    d5     1    1    1     6
    d6     1    1    0     3
    d7     0    1    0     2

    q      1    2    3

  • The Vector Model: Example III

    [Figure: Venn diagram of index terms k1, k2, k3 with documents d1 to d7 placed inside it]

          k1   k2   k3   q·dj
    d1     2    0    1     5
    d2     1    0    0     1
    d3     0    1    3    11
    d4     2    0    0     2
    d5     1    2    4    17
    d6     1    2    0     5
    d7     0    5    0    10

    q      1    2    3
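
    The q·dj columns of the three examples can be checked with a short script. The weights below are copied from the example tables; the names `QUERIES`, `DOCS`, `dot_scores` and `cosine_ranking` are hypothetical, and ranking by cosine (rather than by the raw dot product shown on the slides) is an addition made here to tie the examples back to the cosine formula.

```python
import math

BINARY_DOCS = {
    "d1": (1, 0, 1), "d2": (1, 0, 0), "d3": (0, 1, 1), "d4": (1, 0, 0),
    "d5": (1, 1, 1), "d6": (1, 1, 0), "d7": (0, 1, 0),
}
WEIGHTED_DOCS = {
    "d1": (2, 0, 1), "d2": (1, 0, 0), "d3": (0, 1, 3), "d4": (2, 0, 0),
    "d5": (1, 2, 4), "d6": (1, 2, 0), "d7": (0, 5, 0),
}
QUERIES = {"I": (1, 1, 1), "II": (1, 2, 3), "III": (1, 2, 3)}
DOCS = {"I": BINARY_DOCS, "II": BINARY_DOCS, "III": WEIGHTED_DOCS}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dot_scores(example):
    """Raw inner product q . dj for every document of one example."""
    q = QUERIES[example]
    return {name: dot(q, d) for name, d in DOCS[example].items()}

def cosine_ranking(example):
    """Document names sorted by decreasing cosine similarity to the query."""
    q = QUERIES[example]
    cos = {
        name: dot(q, d) / (math.sqrt(dot(d, d)) * math.sqrt(dot(q, q)))
        for name, d in DOCS[example].items()
    }
    return sorted(cos, key=cos.get, reverse=True), cos

for example in ("I", "II", "III"):
    print(example, dot_scores(example))

ranking, cos = cosine_ranking("III")
print(ranking)
```

    In Example III, d7 has the third-largest dot product (10), yet after length normalization it falls below d1 and d6, because its single heavy weight on k2 points away from the direction of the query.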