
  • Automatic Acquisition Of

    Bilingual Comparable Corpus

    Thesis Submitted In Partial Fulfillment of the Requirement for the Degree In

    Master of Technology

    In

    Computer Science & Engineering

    Submitted by

    Mayank Gupta (06CS6015)

    Under the guidance of

    Prof. Sudeshna Sarkar Department Of Computer Science and Engineering

    IIT Kharagpur

    DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, IIT KHARAGPUR

    Indian Institute of Technology

    Kharagpur, West Bengal – 721302

    9 May 2008


    Department of Computer Science and Engineering

    CERTIFICATE

    This is to certify that the thesis entitled “Automatic Acquisition Of

    Bilingual Comparable Corpora” is a record of bona fide work carried out

    by Mr. Mayank Gupta (06CS6015), under my supervision and guidance,

    during the academic session 2007-2008, in partial fulfillment of the

    requirement for the degree of Master of Technology in Computer Science and

    Engineering, Department of Computer Science & Engineering, Indian

    Institute of Technology, Kharagpur. The results presented in this thesis have

    not been submitted elsewhere for the award of any other degree or diploma.

    Dr. Sudeshna Sarkar

    Department of Computer Science & Engineering,

    Indian Institute of Technology Kharagpur- 721302.

    Date: 30 April 2008

    Place: Kharagpur



    ACKNOWLEDGEMENTS

    I wish to extend my sincere thanks to my supervisor, Prof. Sudeshna Sarkar, for keeping faith in me and for being a driving force all along the way. The project would not have been so smooth and so interesting without her encouragement. I am indebted to the Department of Computer Science & Engineering and IIT Kharagpur for providing me with all the facilities required to carry out my work in a congenial environment.

    I also wish to thank Mr. Debasis Mandal and Mr. Sujan Saha for providing me with valuable resources for my project, and for adding a lot to my expertise through their invaluable suggestions. I extend my gratitude to the CSE department lab staff for providing the needful from time to time whenever requested.

    Above all, I am grateful to my parents, friends and well-wishers for their patience and their continuous supply of inspiration and suggestions for my ever-growing performance. Last but not least, I thank the Almighty for making me a part of the world.

    MAYANK GUPTA 06CS6015



    Dedicated to…

    My mom, who loves me more than I love myself;

    And to all those who love her…



    Contents

    Abstract...................................................................................................................... x

    Introduction......................................................................................................... 1–12

    1.1 Cross-Language Information Retrieval ........................................1–13

    1.2 Multilingual nature of web .............................................................1–14

    1.3 Motivation ........................................................................................1–15

    1.4 Problem Statement ..........................................................................1–15

    1.5 Organization of thesis .....................................................................1–16

    Background ......................................................................................................... 2–19

    2.1 Notion of corpora.............................................................................2–20

    2.2 Comparable Corpora ......................................................................2–20

    2.3 Classification of CC .........................................................................2–22

    2.4 Some Examples of Comparable Corpora .....................................2–22

    2.5 Applications of Comparable Corpora ...........................................2–24

    2.6 Limitations of CC ............................................................................2–24

    Related Work ...................................................................................................... 3–26

    3.1 Effect of poor dictionary .................................................................3–28

    Work Done........................................................................................................... 4–33

    4.1 Newspaper Sites: Open Source of CC ...........................................4–33



    4.2 Methodology.....................................................................................4–33

    4.3 Details of the implementation.........................................................4–34

    4.3.1 Gathering News .........................................................................4–35

    4.3.2 Named Entity Recognition (NER) ...........................................4–35

    4.3.3 Preprocessing.............................................................................4–35

    4.3.4 Translation.................................................................................4–36

    4.3.4.1 Dictionary Lookup (DL).......................................................4–36

    4.3.4.2 Gazetteer List Lookup (GL) .................................................4–36

    4.3.4.3 Abbreviation List Lookup (AL) ...........................................4–36

    4.3.4.4 Transliteration Similarity (TS) .............................................4–37

    4.5 Filtering ............................................................................................4–39

    4.6 Conflict Resolution ..........................................................................4–39

    4.7 Example ............................................................................................4–40

    Results .................................................................................................................. 5–43

    5.1 Sample data ......................................................................................5–44

    5.2 Evaluations .......................................................................................5–44

    5.3 Analysis.............................................................................................5–47

    5.3.1 Positive results ...........................................................................5–47

    5.3.2 Negative results .........................................................................5–48

    5.3.2.1 Pair unable to cross the thresholds .......................................5–48

    5.3.2.2 Absence of corresponding news article in target language..5–49

    5.3.2.3 Effect of similar names found in dissimilar stories..............5–50

    5.3.2.4 Effect of similar dictionary words in stories depicting news on

    similar topic, yet describing different events ...................................5–53

    5.3.2.5 Effect of wrongly identified pair of names ..........................5–54

    5.3.2.6 Effect of poor dictionary ......................................................5–55

    5.4 Improvements ..................................................................................5–56



    Conclusion .......................................................................................................... 6–59

    6.1 Future work .....................................................................................6–60

    Bibliography ........................................................................................................ 6–62



    List of Figures

    1. Difference between the volumes occupied by English-speaking users on the web, as against that occupied by non-English speaking people ................1–14
    2. Frequency distribution of number of translations in Hindi bilingual dictionary ................3–28
    3. Recall plotted versus Precision for the six runs for Bengali and Hindi to English CLIR Evaluation experiment ................3–30
    4. Precision versus evaluation of system on ten random points for a sample data of stories collected over a period of one month ................5–44
    5. Average precision values for the evaluations for one month sample data ................5–45
    6. Number of documents retrieved versus precision obtained for one month test data ................5–46
    7. Number of documents retrieved versus precision obtained on improved system ................5–57



    List of Tables

    1. Cross language runs submitted in CLEF 2007 ................3–30
    2. Summary of bilingual runs of the CLIR evaluation experiment ................3–30
    3. Phonetic rules for recognizing similar sounding names ................4–37
    4. Number of retrieved documents versus the Precision values for top 70 pairs ................5–47
    5. Number of retrieved documents versus the Precision values for top 70 pairs for improved case ................5–57



    Abstract

    Corpora are the main knowledge foundation for progress in the field of information

    retrieval [Arora]. Processing of multilingual corpora helps in the construction of efficient

    language-specific resources and in cross-lingual information access [Peters] [Mandal].

    This report presents the work aimed at constructing a comparable corpus automatically in

    English, and the most common Indian language, Hindi, by collecting similar news stories

    from newspaper websites. Our system identifies comparable news articles by recognizing

    intersecting proper names, and content overlap using a medium coverage dictionary,

    transliteration similarity, temporal closeness and filtering mechanisms at various levels.

    The system scored 67.4% and 38.4% precision for the best and worst case respectively,

    evaluated randomly on our sample set of Hindi news articles of 30 consecutive days. We

    adopted some enhancements in the present system after in-depth scrutiny and achieved

    substantial improvements in the results.

    Based on the achieved results, we draw a conclusion in accordance with the general perception: the first half of a news article furnishes the largest part of the essential information conveyed. However, it is consideration of the whole story that results in better identification of similar stories. The consistent performance obtained when we consider whole stories, against the fluctuating precision values when we examine only the first half of each story, leads us to this conclusion. Detecting the association between a source-target pair through intersecting proper names is indispensable, but it should not be the only criterion, especially for languages like Hindi that are highly undersupplied with valuable cross-language assets.


Chapter 1

    Introduction

    The first step towards advancement in research in various emerging areas of natural

    language processing and interconnected fields is the availability of huge collections of

    easily accessible text [Arora] [Resnik]. Such corpora are useful in the extraction of

    knowledge for the participating languages, and for the construction of many efficient

    resources for these languages. Resnik and Smith [Resnik] list some of the uses of parallel corpora, and elucidate their algorithm for building one through mining the web (STRAND).

    Corpora can be monolingual or multilingual. A monolingual corpus, as the name implies, involves only one language. A multilingual corpus links two or more languages, and is further classified as parallel or comparable. Parallel corpora are near-exact translations of text in the participating languages, whereas comparable corpora are collections of texts in different languages gathered according to some criterion of similarity.

    Examples of parallel corpora include translations of a text into different languages, such as user manuals, educational books or proceedings of some event, either written independently by different persons or translated from a single source. In the case of comparable data, the texts may be similar in content (comparable content) or along some other dimension, such as time or domain. Newspaper articles in multiple languages from independent press agencies, or descriptions of a product or occasion by different people, are examples of comparable text.

    Aligning comparable text at the sentence and word level can help automatically create efficient resources for languages presently deficient in these assets. Such a corpus acts as the starting step of cross-lingual studies. It also helps in multilingual summarization, the automatic procedure for extracting information from multiple texts written about the same topic in multiple languages. It offers a wide scope of


    applications for research in Discourse analysis (analysing written, spoken or signed language use), Pragmatics (the ability to understand the intended meaning of a speaker, i.e. pragmatic competence), Information Retrieval, etc.

    1.1 Cross-Language Information Retrieval

    Cross-Language Information Retrieval (CLIR) is an enormous and ever-growing field, not limited to one particular subject or discipline but multidisciplinary. People from interconnected fields like Information Retrieval, NLP, Machine Translation and Speech Processing come together for information access. There are various information processing issues (cross-lingual information access, speech processing) and a strong need for language resources (dictionaries, thesauri, corpora, test collections).

    The escalating demand for CLIR is due to various reasons. Growing internationalization has made many developed and developing countries multilingual, with no fewer speakers of non-native languages than of native ones. The United States, Canada and even India are among the countries where no single language dominates, and people from all parts of the world, with their own culture and language, share the boundaries. Globalization of the economy has reduced the problem of localization of employment, and multinational companies now have employees and customers from all parts of the world working under one roof, speaking and using multiple languages. The global information society has dissolved the physical boundaries of the world, shrinking the space between people seeking education and entertainment.

    Of course, at such places the desire to achieve excellence in some common language, in addition to the native speech, often arises for the sake of better communication. For many this is not a setback, but for those with no background in other languages, communicating becomes a real difficulty. The availability of resources like translators, multilingual dictionaries and other cross-lingual packages has made information access in multiple languages much easier than ever. For the ease of customers, and of producers too, official documents and manuals are no longer restricted to a single language. Distance learning, digital libraries and other resources provided to students worldwide by top educational institutions are no


    longer constraining students to follow the path to success in a language they are not comfortable in.

    1.2 Multilingual nature of web

    A survey shows that the Internet is no longer monolingual and that non-English content is growing rapidly [CNNIC]. Figure 1 shows the difference between the volumes occupied by English-speaking users on the web, as against that occupied by non-English speaking people. This is where the need for MLIA arises. Carol Peters [Carol] estimated that 78% of Internet users would be non-English speaking by 2005. The latest figures, as of 30 June 2007, put the non-English group at 70% of the online population. Even though the number of such corpus collections is increasing, the number of languages in which such collections are available is still limited [Somers]. The coverage of such collections for a particular need is also not always satisfactory.

    Figure 1

    Difference between the volumes occupied by English-speaking users on the web, as against that occupied by non-English speaking people

    Though English still tops the chart, more and more people with diverse backgrounds are connecting to the web. In addition to English, the most extensively exploited language, work in European languages such as French, German and Portuguese, and Asian languages such as Chinese, Korean and Japanese, has also escalated in recent times [McEnery]. With the


    advancement in technology, more and more languages are becoming a part of the efforts in

    the field of Natural Language Processing and associated areas.

    1.3 Motivation

    With 22 regional languages listed in the eighth schedule of the Constitution1, India has been a multilingual country with a wealthy unexplored reserve of knowledge [Yale] [Languages]. According to the latest survey, Hindi is the fifth and Bengali the seventh most spoken language in the world, with Chinese and English ranked first and third respectively [Ethnologue]. Most of the inhabitants of India are bilingual, and are in general exposed to English or Hindi (or both) in addition to their mother tongue [Mandal]. They, as well as those who are not, seek information from different domains (like news) and often face trouble in doing so. The motivation of this venture is easing the troubles in information access between Hindi and English. We selected Hindi to maximize the effect of the project.

    However, merely collecting texts from different sources does not constitute a corpus. Inferring knowledge from a corpus is as important as the selection of the corpus and the development of tools. In addition to collecting pairs of similar news stories in the participating languages, we develop, as by-products of our project, a list of English named entities collected from English news stories, and an automatically generated gazetteer list that contains English names for the Hindi names that the system recognizes as proper nouns.

    1.4 Problem Statement

    Our work is an attempt in the direction of making a comparable corpus in the most common Indian language, that is, Hindi, by collecting similar news articles automatically from online news websites. The objective of this project is to ease information access between Hindi and English, and for the collection to act as the initiator of much research in language technology. We selected Hindi to maximize the effect of the project.

    1 http://languages.iloveindia.com/


    The process extracts proper names from the news stories in each language. For languages other than English, we use the limited-coverage bilingual lexicon available with us to translate some key words. The similarity between two stories in different languages is a function of the extent of correspondence between the identified names and the translations. We use various phonetic substitutions to identify names with the same pronunciation in the two languages. The pair closest in temporal terms, in addition to similarity in names and translations, is tagged as the best match.
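The matching idea described above can be sketched roughly as follows. The phonetic substitutions, weights, time window and scoring formula below are illustrative stand-ins, not the actual rules of our system (the real phonetic rules appear in Table 3 of Chapter 4):

```python
# Illustrative sketch of the story-matching procedure. All rules,
# weights and thresholds here are hypothetical.

from datetime import date

# Toy substitutions for matching romanized names that sound alike.
PHONETIC_RULES = [("ee", "i"), ("oo", "u"), ("aa", "a"), ("ph", "f"), ("w", "v")]

def normalize(name):
    """Reduce a romanized name to a crude phonetic key."""
    key = name.lower()
    for old, new in PHONETIC_RULES:
        key = key.replace(old, new)
    return key

def similarity(src_names, tgt_names, src_translations, tgt_words,
               src_date, tgt_date, window_days=3):
    """Score a source-target story pair by overlap of phonetically
    normalized names and dictionary-translated content words, subject
    to temporal closeness."""
    if abs((src_date - tgt_date).days) > window_days:
        return 0.0  # too far apart in time to describe the same event
    src_keys = {normalize(n) for n in src_names}
    tgt_keys = {normalize(n) for n in tgt_names}
    name_overlap = len(src_keys & tgt_keys) / max(len(src_keys | tgt_keys), 1)
    content_overlap = (len(set(src_translations) & set(tgt_words))
                       / max(len(set(src_translations)), 1))
    return 0.6 * name_overlap + 0.4 * content_overlap  # hypothetical weights
```

With these toy rules, differently romanized spellings such as "Seengh" and "Singh" collapse to the same key, so a name extracted from a Hindi story can still match its spelling variant in an English story.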

    1.5 Organization of thesis

    We present the thesis in the following manner.

    Chapter 1 introduces the concept of Cross-Language Information Retrieval, highlighting the escalating call for proficient cross-language resources. The motivation of this project, enhancing the resources for Indian languages, is also explained along with the problem statement.

    The next chapter, Chapter 2, discusses the background of the area of natural language processing, particularly what a corpus is. This chapter familiarizes the reader with the concept of a comparable corpus, its classifications and its applications. A section describing some of the available corpora is also present.

    Chapter 3 describes the efforts worldwide towards the creation of text corpora. The section includes our own effort to understand the need for a corpus to enrich the deficient and undersupplied resources of the Indian languages Hindi and Bengali. We highlight the need for an effective bilingual lexicon for noticeable Cross-Language Information Retrieval, in addition to other language-specific needs like a Named Entity Recognizer and a Feedback System.

    Chapter 4 explains the algorithm we adopted for identifying similar stories among the news stories crawled from websites in the participating languages, English and Hindi.


    Chapter 5 analyzes the results of our test run on a sample data set. The set was a collection of more than 1700 stories in the source language, Hindi, collected over a period of 30 days. The target collection contained English news articles collected over the same period. We analyzed the causes of failure of the present work and evaluated the system again after adopting some improvement measures. The section discusses the achieved performance and how the accomplished results compare with expectations.

    We conclude the thesis with Chapter 6, where we discuss the future work we wish to undertake in this area. The references follow the scope and limitations of our present work.


    Chapter 2

    Background

    In corpus linguistics, a corpus is a self-contained compilation of texts, spoken and/or written, accumulated and assembled according to a set of clearly defined criteria. A corpus is generally collected to serve a particular purpose for the person gathering it, and for others working in the languages it covers, through its exploitation for various resources and language studies. ICAME (International Computer Archive of Modern English) is a centre that aims to organize and assist the sharing of computer-based corpora. Some examples of available English corpora are listed on the corpus linguistics website2.

    Some of the corpora in English are as follows. For British English, there is the BNC, a corpus of written and spoken British English used extensively by researchers and by the Oxford University Press, Chambers and Longman publishing houses. CANCODE (Cambridge Nottingham Corpus of the Discourse of English) is a corpus of spoken British English, used at length by researchers and Cambridge University Press. In addition, there is ICE (International Corpus of English), which includes international varieties of spoken and written English. The corpus has a major drawback: most of it is not yet available.

    The Brown University Corpus and the LOB (Lancaster-Oslo-Bergen) Corpus are parallel corpora of written texts, but are now rather outdated. The Bank of English is a compilation of written and spoken English, an important resource for researchers and for the COBUILD series of English language books. The London-Lund Corpus (Survey of English Usage) is a collection of spoken British English, but it is now quite old. The Santa Barbara Corpus collects spoken American English texts. This corpus has a drawback similar to the ICE corpus: most of it is not yet accessible for use. The Hong Kong Corpus of Spoken English is still under compilation.

    2 http://www.engl.polyu.edu.hk/corpuslinguist/corpus.htm#Definition%20of%20a%20corpus.


    2.1 Notion of corpora

    There are two approaches to multilingual corpora: parallel and comparable. A parallel corpus is a compilation of texts, each translated into at least one language other than the original, and thus clubs together perfectly aligned (parallel) translated text. The simplest case exists where only two languages participate: one corpus is an exact transformation of the other, with the direction of translation virtually irrelevant. Examples of parallel corpora and some efforts worldwide can be found in [Arora].

    In order to analyze a parallel or comparable text, some kind of text alignment is essential, which identifies equivalent text segments like sentences or words. One example of a parallel corpus is the European Parliament corpus, a corpus of pairwise-aligned files created by Philipp Koehn. The corpus is available in Danish-English, German-English, Greek-English, Spanish-English, Finnish-English, French-English, Italian-English, Dutch-English, Portuguese-English and Swedish-English. Each corpus is about 100 MB [Athel].

    TRIPTIC, the TRIlingual Parallel Text Information Corpus, forms part of the empirical data used for research on the contrastive analysis of prepositions. Developed in English, French and Dutch, the corpus investigates the way in which languages converge and diverge in the semantic structure of so-called function words. According to [Athel], the paragraph-aligned corpus consists of two million words, one million each of fiction and non-fiction data. The corpus offers automatic selection of the n-th paragraph in each of the three languages.

    Parallel corpora are objects of curiosity because of the prospect they offer of aligning original and translation and gaining insights into the nature of translation. Tools to aid translation can be formulated. In addition, probabilistic machine translation systems can be trained on such a collection of parallel text.

    2.2 Comparable Corpora

    A comparable corpus collects alike texts in more than one language or variety, based on some criterion of similarity. The sub-corpora are not exact translations of each other, but


    are collected using either the same sampling frame or some measure of comparability. In simpler words, comparable corpora are corpora chosen under “...similar circumstances of communication”. There is no strict agreement on the nature of the similarity, and there are very few examples of comparable corpora [Resnik]. One such example is ICE, the International Corpus of English.

    One example of comparable content could be descriptions of a new product written independently by different people, each in the language they are comfortable in. The style of writing and the presentation would vary a lot. In addition, one writer may highlight more features depending upon his or her perception of the product, and may even include comments and suggestions for further improvement. An author may also include feedback from other users known to the writer. Even with such variability in the accounts of the same product, there is a high chance of sentences in different languages speaking of the same feature of the product. We can identify and extract such sentences, and exploit a huge collection of such accounts of different products to build an enormous collection of highly valuable multilingual corpora.
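As a rough illustration of how such sentences might be identified, the sketch below pairs sentences across languages by overlap of dictionary-translated vocabulary. The toy romanized-Hindi-to-English lexicon, the tokenization and the 0.5 threshold are all invented for the example:

```python
# A minimal sketch of extracting comparable sentence pairs from two
# independent descriptions of the same product. The tiny lexicon and
# the overlap threshold are purely illustrative.

TOY_LEXICON = {"kamera": "camera", "achha": "good", "hai": "is",
               "batri": "battery", "kharab": "bad"}

def tokens(sentence):
    """Lowercase a sentence and strip simple punctuation."""
    return [w.lower().strip(".,") for w in sentence.split()]

def comparable_pairs(src_sents, tgt_sents, threshold=0.5):
    """Pair each source sentence with any target sentence that shares
    enough dictionary-translated vocabulary with it."""
    pairs = []
    for s in src_sents:
        s_words = {TOY_LEXICON.get(w, w) for w in tokens(s)}
        for t in tgt_sents:
            t_words = set(tokens(t))
            overlap = len(s_words & t_words) / max(len(s_words), 1)
            if overlap >= threshold:
                pairs.append((s, t))
    return pairs
```

With this toy data, an (invented) source sentence "kamera achha hai" would pair with "The camera is good." but not with "Delivery was slow.", since only the former shares enough translated vocabulary.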

    Such corpora would be an example of corpora generated based on content. If there exists a time bound on the selection of descriptions, assigning more closeness to descriptions within a specified time window, then the corpus thus chosen is a concurrent corpus. Newspaper reports are an example of such corpora. The next subsection describes the major classifications of comparable corpora based on the similarity condition used.

    Comparable corpora enjoy many advantages over parallel corpora in terms of availability, versatility, extensibility and accessibility. Moreover, parallel corpora work on the assumption that the amount of variation in the texts under consideration is limited. The procedures for acquiring comparable corpora largely relax this limitation. Mountains of comparable text exist online in the form of news reports, and in print in the form of legal texts, socially conventional texts like marriage announcements and advertisements, books, magazines, etc. Academic and scientific text written in accordance with local conventions is a high-quality source of related data. Comparable text benefits from being easily extensible, with negligible data acquisition issues in most cases, something parallel corpora are deficient in.


    2.3 Classification of CC

    Based on the similarity measure used, text can be treated as comparable on four bases. The first is the form of the data, that is, the size of files, number of words, sentences or paragraphs, or even the length of the texts; it can also be the file format (.txt, .doc, .html, .xml, etc.). Second, content can be compared to find similar documents; the corpus can be in general language or cover specialised domains, and newspaper articles, reports of war or politics, interviews and discussions, views and reviews all fall under this category. Third, the structure of the documents can be considered, where the text may be formal, carefully constructed text like legal texts, or informal, loosely organized discourse like transcriptions of conversation. Mode is the fourth category, where the similarity measure is based on whether the text is spoken (e.g. speech, formal dialogue, conversation) or written (e.g. book, essay, instruction manual). A very large corpus can also be treated as comparable to another corpus similar in size, constructed according to the same criteria of quantity and quality of text types. Many more measures are possible beyond those described above for categorizing collected data as comparable.

    2.4 Some Examples of Comparable Corpora

    This section describes three major global efforts to acquire comparable corpora.

    2.4.1 ICE (International Corpus of English)

ICE is a corpus of around 1 million words in each of many varieties of English around the world. It began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Fifteen research teams around the world are preparing electronic corpora of their own national or regional variety of English.

    Each ICE corpus consists of 1 million words of spoken and written English produced after

    1989. To ensure compatibility among the constituent corpora, each team is following a

    common corpus design, as well as a common scheme for grammatical annotation. More

    information is available on the website http://www.ucl.ac.uk/english-usage/ice/index.htm.


    2.4.2 The Brown Corpus (American English)

The Brown Corpus of Standard American English is the first of the modern, computer-readable general corpora, compiled by W.N. Francis and H. Kucera at Brown University, Providence, RI. It has 1 million words of American English texts printed in 1961, sampled from fifteen different text categories to make the corpus a good standard reference. The corpus is small and somewhat dated, but it is still used, and still imitated by other corpus compilers. The LOB Corpus (British English) and the Kolhapur Corpus (Indian English) are two examples of corpora made to match the Brown Corpus. These corpora make it easy to compare different varieties of the same language, such as English.

    2.4.3 LOB Corpus (British English)

    Researchers in Lancaster, Oslo and Bergen compiled the Lancaster-Oslo-Bergen Corpus.

    It has 1 million words of British English texts from 1961 sampled from fifteen different

    text categories. Each text is just over 2,000 words long (cut at the first sentence boundary

    after 2,000 words for longer texts) and the number of texts in each category varies. The

    corpus has been grammatically tagged (all words have been given a word-class label). The

    tagged and untagged versions of the corpus are available through ICAME. This corpus is

    the British counterpart of the Brown Corpus of American English, which contains texts

    printed in the same year for comparison between both varieties.

    2.4.4 Kolhapur Corpus (Indian English)

This corpus is comparable to the Brown and the LOB corpora. The motive behind its construction was to serve as source material for comparative studies of American, British and Indian English. The material is drawn from texts printed and published in 1961 for the Brown and LOB corpora, and in 1978 for the Indian corpus. It consists of 500 texts sampled from 15 different text categories, each consisting of just over 2,000 words, drawn mainly from government documents, foundation reports, industry reports, college catalogues, fiction, religion, press editorials, etc.


    2.5 Applications of Comparable Corpora

Comparable texts in different languages can be used to extract multilingual lexicons, paraphrases and other language-specific resources, and to enrich existing resources in the same languages. This is particularly helpful in creating effective bilingual dictionaries for language pairs for which either no dictionary exists or the available ones are very ineffective. Comparable corpora also help in multilingual summarization, the automatic extraction of information from multiple texts written about the same topic in multiple languages. They offer wide scope for research in discourse analysis (analyzing written, spoken or signed language use), pragmatics (the ability to understand the intended meaning of a speaker), information retrieval, etc.

2.6 Limitations of CC

The disadvantages of a comparable corpus lie in the difficulty of managing the corpus for more delicate analysis. Also, comparable data is not applicable to all areas of language technology, and is unnecessary for certain types of research.


Chapter 3

Related Work

Karunesh Arora et al., CDAC, Noida [Arora] have built parallel corpora for 12 Indian languages (Hindi, Punjabi, Marathi, Bengali, Oriya, Gujarati, Telugu, Tamil, Malayalam, Assamese, Kannada), including Nepali, aligned at the paragraph level. The corpus was OCRed from various books in different Indian languages from publishers such as National Book Trust India, Sahitya Akademi, and Navjivan Publications. They found candidate pairs for paragraph alignment based on the size of the content. The paper also describes some of the major efforts towards the acquisition of corpora in many languages all over the world.

Harold Somers [Somers], England deals with techniques for building parallel corpora from the web through "tricks", like filename markers (.fr, .en) or "anchors" in the text. They find candidate pairs based on content, that is, the amount of text available between each anchor. In addition, they highlight the issues related to the alignment of the parallel data thus obtained. The techniques discussed include identifying anchor points and evaluating the extent of match of the text between these anchors, as well as other features such as the use of machine-readable dictionaries and other language-specific resources.

Almeida and Alberto [Almeida] grab parallel corpora from the web by getting file pairs from a list of URLs, from user-supplied file pairs, from the results of queries on search engines, from a web site, etc. They apply in-depth validation techniques, such as file size comparison, string normalization and edit distance. For example, they normalize both "index_pt.html" (for Portuguese) and "index_en.html" (for English) to "index". Shinmaya and Sekine [Shinmaya], 2003 took Japanese newspaper stories involving deaths, and compared sentences on the extent of match of the Named Entities present in them.
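The filename normalization used by Almeida and Alberto can be sketched roughly as follows; the marker list and function name here are illustrative assumptions, not taken from [Almeida]:

```python
import re

# Illustrative language markers commonly embedded in file names.
LANG_MARKERS = ("en", "pt", "fr", "hi")

def normalize_filename(name):
    """Strip the extension and a trailing language marker: 'index_pt.html' -> 'index'."""
    stem = re.sub(r"\.[A-Za-z]+$", "", name)              # drop the extension
    marker = r"[._-](?:%s)$" % "|".join(LANG_MARKERS)     # e.g. '_pt', '-en'
    return re.sub(marker, "", stem)

# Two files whose normalized names coincide become a candidate pair.
print(normalize_filename("index_pt.html"))  # index
print(normalize_filename("index_en.html"))  # index
```

Any two crawled files whose normalized names coincide can then be proposed as a candidate translation pair for the deeper validation steps (size comparison, edit distance).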


Resnik and Smith [Resnik] utilize the Internet Archive for grabbing parallel data, using their STRAND (Structural Translation Recognition, Acquiring Natural Data) web-mining architecture to identify pages that might be translations of each other in different languages. Their algorithm first locates pages that might have parallel text in multiple languages, generates candidate pairs that could be translations through a URL-matching algorithm, and applies structural filtering to throw out negative pairs. They use their algorithm to build an English-Arabic corpus with 2910 translation pairs.
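The URL-matching step can be sketched as below; the substitution table and function name are illustrative assumptions, and STRAND's actual matching rules are richer:

```python
# Hypothetical substitution rules mapping a URL to its possible counterpart
# in the other language.
SUBSTITUTIONS = [("/en/", "/ar/"), ("_en.", "_ar.")]

def candidate_pairs(urls):
    """Pair each crawled URL with a counterpart obtained by swapping language markers."""
    seen = set(urls)
    pairs = []
    for url in urls:
        for src, tgt in SUBSTITUTIONS:
            if src in url and url.replace(src, tgt) in seen:
                pairs.append((url, url.replace(src, tgt)))
    return pairs

urls = ["http://site/en/news1.html", "http://site/ar/news1.html",
        "http://site/en/only-english.html"]
print(candidate_pairs(urls))
```

Pairs produced this way are only candidates; structural filtering then discards those whose page structures do not correspond.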

The Parallel Text Miner (PTMiner) of Chen and Nie [Chen] uses existing search engines to locate parallel data through a query in a specific language. The search engine returns links to pages, which they first verify for the language of interest using a length filter and automatic language identification. Once they locate candidate multilingual sites, these are crawled deeply for data. By this method, Chen and Nie accumulate a 118MB/135MB English-French corpus with 95% precision, and a 137MB/117MB English-Chinese corpus with 90% precision [Chen] [Resnik].

Ma and Liberman [Ma] in 1999 proposed BITS (Bilingual Internet Text Search), wherein multilingual pages from a pre-specified list of domains are identified using language identification. The recognized pages are crawled exhaustively, compared based on content, and filtered on a threshold.

To highlight the need for a bilingual lexicon for effective CLIR, we developed a system to retrieve relevant documents from an English target collection in response to queries in Hindi and Bengali using a Machine Translation approach. The system returned its results in ranked order. The best MAP (Mean Average Precision) values for Bengali and Hindi CLIR in our experiment were 7.26% and 4.77%, which were 20% and 13% of our best monolingual retrieval, respectively. This system was part of our first participation in the Cross-Language Evaluation Forum (CLEF) 2007 [Mandal].

We followed the dictionary-based Machine Translation approach to generate the corresponding English query from the Indian language (Hindi, Bengali) topics. Our main challenge was to work with a limited-coverage dictionary (coverage ~20%) for Hindi-English, and a virtually non-existent dictionary for Bengali-English. Therefore, we depended mostly on a phonetic transliteration system to overcome this. We had access to a Hindi-English bilingual lexicon of approximately 26000 Hindi words, a Bengali bio-chemical lexicon of around 9000 Bengali words, a Bengali morphological analyzer and a Hindi stemmer. In order to achieve successful retrieval with this limited set of resources, we adopted the following strategies: structured query translation, phoneme-based followed by list-based named-entity transliteration, and no relevance judgment. Finally, we fed the English query into the Lucene search engine, which follows the Vector Space Model (VSM) of Information Retrieval, and retrieved the documents along with their normalized scores.

    3.1 Effect of poor dictionary

We emphasize in this section the effect of an impoverished bilingual lexicon on the overall results.

    Figure 2

    Frequency distribution of number of translations in Hindi bilingual dictionary

The above graph (Figure 2) shows the frequency distribution of the number of translations for Hindi words in the Hindi bilingual dictionary we used. With the increase of lexical entries


and Structured Query Translation (SQT), more and more "noisy words" were incorporated into the final query in the absence of any translation disambiguation algorithm, thus bringing down the overall performance. The average number of English translations per Hindi word in the lexicon was 1.29, with 14.89% of Hindi words having two or more translations. For example, the Hindi word ‘रोकना’ (to stop) had 20 translations in the dictionary, making it highly susceptible to noise. The process followed for CLIR was as follows:

[Flow diagram] Query generation: stopword removal; structured query translation through bilingual lexicon lookup (all translations used); transliteration (ITRANS), with an edit-distance algorithm used for named-entity matching. Corpus processing (1.35 lakh documents, ~433 MB): language-specific stopword removal; stemming (morphological analyzer for Bengali, stemmer for Hindi, Lucene stemmer for English; all possible stems); indexing with Lucene. Document retrieval: the generated query is run through Lucene to produce the results.

    The topics were in the Indian languages Hindi and Bengali, with a target collection of

    English documents. The topics consisted of three fields namely Title, Description and

    Narration. Table 1 shows the various Cross language runs submitted in CLEF 2007. Table

    2 shows the summary of bilingual runs of the CLIR evaluation experiment.


    Table 1

    Cross language runs submitted in CLEF 2007

    Table 2

    Summary of bilingual runs of the CLIR evaluation experiment

    Figure 3

    Recall plotted versus Precision for the six runs for Bengali and Hindi to English CLIR Evaluation


The above graph (Figure 3) shows Recall plotted versus Precision for the six runs on which we evaluated the system, two for each of English (monolingual), Hindi and Bengali. We found that the difference in performance is due to missing specialized vocabulary, missing general terms, wrong translation due to ambiguity, and correct identical translation. There is a strong need for effective translation, memory and processing capacity, an effective bilingual lexicon, and the availability of a parallel corpus to build a statistical lexicon.

Translation disambiguation during query generation explains the anomalous behavior of Hindi. Query-wise score breakup revealed that queries with more named entities always provided better results than those lacking them. The poorer performance of our system with respect to other resource-rich participants clearly pointed out the necessity of a rich bilingual lexicon, a good transliteration system, and effective Named Entity recognition. We found that a trilingual comparable corpus in English, Hindi and Bengali could prove to be the key to constructing the needed resources. Since at present no such corpus exists that could serve our purpose, we decided to construct one in English and Hindi. The next phase of the project is towards fulfilling these preliminary requirements, without which no CLIR in these two languages can be of much help.


Chapter 4

    Work Done

    4.1 Newspaper Sites: Open Source of CC

For this project, the source of comparable data is freely available news articles from the online newspapers Navbharat Times (NBT) for Hindi and Times of India (TOI) for English. We chose online news articles because news items offer high variety and easy availability, with no legal issues of data acquisition.

In the case of a comparable corpus, there is no complete information overlap, particularly for languages like English and Hindi with very different foundations. Two texts written by different press agencies describing the same incident or event can vary widely in presentation and content, depending on when the event was covered and on the author. One may contain additional facts or comments, and the comparable information may appear in a different order, adding to the "noise". We assume beforehand that the amount of insertion and deletion between similar stories is limited. Where this assumption does not hold, the chances of false positives cropping up in the final pair list escalate.

    4.2 Methodology

Different authors writing about the same event in their respective newspapers produce numerous versions of similar content on the websites. However, certain kinds of noun phrases, such as names, dates and numbers, behave as "anchors" that are shared by similar articles. Our key idea is to identify these anchors in comparable articles and compute a similarity score; this way we can extract stories that convey similar information.


4.3 Details of the implementation

[System diagram] News articles in Hindi and English are crawled from the WWW and accumulated. Preprocessing and translation follow, using named entity recognition, dictionary lookup, transliteration similarity, gazetteer list lookup and abbreviations list lookup. Candidate pairs then pass through temporal filtering, size filtering, similarity finding, threshold filtering and conflict resolution to produce the comparable corpus.


    4.3.1 Gathering News

The first step was to gather news items from the mentioned websites automatically. On a daily basis, we collected the crawled news stories and extracted the actual news article along with the title, date and place of publication. We also saved special reports with no place defined, such as letters to the editor, discussions and reviews; this helped in matching such special reports across the two newspapers.

    4.3.2 Named Entity Recognition (NER)

For identifying proper names in the English stories, we used the freely available LingPipe Named Entity demo code (http://alias-i.com/lingpipe/web/demo-ne.html), which identifies proper names present in English files and tags them in three classes: Person, Organization and Location. However, the number of names identified by the system was small and included some false positives. To enhance the collection of identified names, we built a "crude" Named Entity recognizer for English that considers any capitalized word a proper name, including the first word of a sentence. It treats multi-word names like "Prime Minister" as a single name. To suppress false positives, we constructed a new stopword list containing common sentence initiators such as "Currently" and "Today", in addition to common stopwords such as "The".
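A rough sketch of such a crude recognizer is given below; the stopword lists here are illustrative assumptions, not the actual lists used in the system:

```python
import re

# Illustrative lists; the actual system used longer ones.
COMMON_STOPWORDS = {"The", "A", "An"}
SENTENCE_INITIATORS = {"Currently", "Today", "However"}
SKIP = COMMON_STOPWORDS | SENTENCE_INITIATORS

def crude_ner(text):
    """Treat runs of capitalized words as single (possibly multi-word) names."""
    names = []
    for run in re.findall(r"(?:[A-Z][a-z]+\s?)+", text):
        # Drop stopwords and common sentence initiators from the run.
        words = [w for w in run.split() if w not in SKIP]
        if words:
            names.append(" ".join(words))
    return names

print(crude_ner("Today the Prime Minister met Sonia Gandhi in New Delhi."))
# ['Prime Minister', 'Sonia Gandhi', 'New Delhi']
```

Note how "Prime Minister" survives as one multi-word name, while the sentence initiator "Today" is filtered out.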

For identifying names in the Hindi stories, we used a Named Entity recognizer (courtesy Mr. Sujan Saha, Communication Empowerment Laboratory, Department of Computer Science and Engineering, IIT Kharagpur) and collected the Hindi proper names it identified. This recognizer processes the Hindi file, extracts the proper names present in it, and tags them in four classes: Person, Location, Organization and Date. For example, the system tags "Manmohan Singh" as PERSON and "20 June 2008" as DATE.

    4.3.3 Preprocessing

Stopword removal and processing of special cases for both languages were undertaken. Special cases include normalizing currencies like $20 to 20 dollars, mathematical quantities like 35.5% to 35.5 percent, breaking ranges of years like 2007-09 into 2007 2009,

and others. To handle the domain-dependence of stopwords, we constructed news-specific stopword lists for both languages. For instance, words like "report", "article" and "incident", though meaningful, add little to the semantic content of a news text. We appended a list of such words to the existing stopword list. In addition, common abbreviations in English articles were expanded using an abbreviations list (AL). For preprocessing articles in languages other than English, we used the techniques of dictionary lookup (DL), gazetteer list lookup (GL), abbreviation list lookup (AL) and Transliteration Similarity (TS).
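The special-case rewrites above can be sketched with regular expressions; the exact rules of the system are not listed in the thesis, so these patterns are illustrative:

```python
import re

def expand_year_range(m):
    """Expand a short year range: 2007-09 -> 2007 2009."""
    start = m.group(1)
    return start + " " + start[:2] + m.group(2)

def normalize(text):
    text = re.sub(r"\$(\d+(?:\.\d+)?)", r"\1 dollars", text)   # $20 -> 20 dollars
    text = re.sub(r"(\d+(?:\.\d+)?)%", r"\1 percent", text)    # 35.5% -> 35.5 percent
    text = re.sub(r"\b(\d{4})-(\d{2})\b", expand_year_range, text)
    return text

print(normalize("A $20 deal grew 35.5% over 2007-09."))
# A 20 dollars deal grew 35.5 percent over 2007 2009.
```

Applying the same rewrites to both sides of a candidate pair ensures that, for example, "$20" in English and "20 डॉलर" translated as "20 dollars" can match.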

    4.3.4 Translation

    4.3.4.1 Dictionary Lookup (DL)

The translation step at DL used a Hindi-English dictionary with 24824 Hindi words. The mapping was one-to-many, that is, in many cases more than one English translation was present for a Hindi word. This dictionary was different from the one we used for our earlier results. To suppress the effect of noise, we considered at most the first two translations for each Hindi word found.

    4.3.4.2 Gazetteer List Lookup (GL)

We manually constructed a list of the corresponding English names for regularly occurring Hindi proper names. The list was incremental in nature: at the end of each run, the system automatically appended to it, for future use, the proper names it had identified whose corresponding English names had been found by TS.

    4.3.4.3 Abbreviation List Lookup (AL)

We used a list of commonly found acronyms and their expansions. In addition, the print media tends to adopt many compression techniques and abbreviations, generally to shorten titles in a bid to pack more significant terms into them. For instance, "President" is frequently written as "Prez", and "Prime Minister" and "Chief Minister" as "PM" and "CM" respectively. Such abbreviations were also included in the gazetteer list to facilitate better detection of correspondence between titles and even news stories.

    4.3.4.4 Transliteration Similarity (TS)

For identifying names, we exploited the phonetic correspondence of letters and sub-strings in English and Hindi. For example, "ph" and "f" both map to the sound of "फ" (f). Likewise, "sha" in Hindi (as in Roshan) and "tio" in English (as in ration) sound similar. Before executing the content-based similarity-finding algorithm, we applied the TS approach to the list of English Named Entities identified by the system. Using the edit-distance method, we collected the mappings identified as valid name matches and appended them to the existing gazetteer list for future use.

For each case, we calculated the similarity between the identified proper names for each source language and target language pair. The system considered the pair with maximum similarity above a certain threshold a valid pair. To facilitate similarity matching based on phonetics, we wrote language-specific rules for both languages. Some of them are listed below in Table 3:

    Table 3

ष -> श (sh)
ट / ठ / त / थ -> ट (t)
ढ / ड / द / ध / ङ -> द (d)
ज / झ -> ज (j)
ब / भ / व -> व (v)
श / प / फ -> प (p)
ग / घ -> ग (g)
न / ण / ञ -> न (n)
ख / क -> क (k)

    Phonetic rules for recognizing similar sounding names
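A minimal sketch of the TS step for romanized names: fold similar-sounding substrings to one canonical form, then compare with edit distance. The folding table and threshold below are illustrative assumptions, not the full rule set of Table 3:

```python
# Illustrative folding of similar-sounding substrings in romanized names.
FOLD = [("ph", "f"), ("sh", "s"), ("w", "v"), ("z", "j")]

def fold(name):
    name = name.lower()
    for src, tgt in FOLD:
        name = name.replace(src, tgt)
    return name

def edit_distance(a, b):
    """Standard Levenshtein distance with a one-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def name_match(a, b, threshold=0.8):
    """Accept a name pair whose folded forms are close enough in edit distance."""
    a, b = fold(a), fold(b)
    sim = 1 - edit_distance(a, b) / max(len(a), len(b))
    return sim >= threshold

print(name_match("Pharid", "Farid"))   # True: both fold to 'farid'
print(name_match("Sharma", "Verma"))   # False
```

Valid matches found this way are what get appended to the gazetteer list for future runs.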


    4.4 Similarity Calculation

The final similarity between two stories is the sum of the similarity of the names found in them and the matches between words appearing in the title as well as the actual news story. The prime focus was on boosting, through several techniques, the similarity value of true positive pairs, and on heavily penalizing sure mismatches, thus pushing the truly matching pairs above all others.

Extent of Match = N / √(SOURCE × TARGET)    (1)

where
N: number of intersecting words in the source string and the target string
SOURCE: number of words in the source language string
TARGET: number of words in the target language string

We calculated the similarity between the source and target strings by computing the cosine of the angle between the two string vectors. Matching proper names were assigned the highest weight (6.0), followed by title matches (3.0) and matches in the story (2.0), each weight being multiplied by the corresponding extent value. The total similarity value was the sum of the similarity values obtained from the correspondence in the three cases, viz. proper names, title and actual story. We normalized the total value to the range 0 to 100, depicting no match at all and perfect match respectively.
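Under these weights, the combined score could be computed roughly as follows. This is a sketch: the function names and the bag-of-words cosine helper are our own simplification of the scheme described above:

```python
import math
from collections import Counter

def cosine(src_words, tgt_words):
    """Cosine of the angle between two bag-of-words vectors."""
    a, b = Counter(src_words), Counter(tgt_words)
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Weights from the text: proper names 6.0, title 3.0, story 2.0.
def total_similarity(names_pair, title_pair, story_pair):
    raw = (6.0 * cosine(*names_pair)
           + 3.0 * cosine(*title_pair)
           + 2.0 * cosine(*story_pair))
    return 100 * raw / (6.0 + 3.0 + 2.0)   # normalized to the range 0..100

score = total_similarity((["sanjay", "pune"], ["sanjay", "pune"]),
                         (["released"], ["released"]),
                         (["jail"], ["jail"]))
print(score)  # 100.0 for a perfect match
```

A pair with identical names, title and story words thus scores 100, and a pair with nothing in common scores 0.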

A match in titles increases the chances of a pair being comparable, because in reality related stories rarely have highly similar titles: newspapers often present titles in sensational ways to attract readers, deviating from the "real" content.

Based on the closeness of the stories, we put heavy penalties on pairs with different dates of publication and/or different places. The idea is that even if a matching pair appears on the websites with a wide time gap, its higher similarity value will make up for the penalty. This step reduces the chances of mismatched stories attaining a high similarity value due to accidental similarities in content.

    4.5 Filtering

Rejecting pairs of stories that cannot be similar under any circumstances at an early stage can lead to a substantial saving in the time and effort required to get the results. To reject obvious mismatches, we used a combination of temporal and size filtering. We permitted a tolerance of seven days on either side between the dates of publication of the source and target stories. The tolerance was adjustable, but we kept it at a considerable value to reduce the running time of the program without losing any true pair.

We also used length filtering to discard pairs with a large size variation. For a very low match between titles combined with a high difference between the lengths of the stories, we rejected the current target-language text and continued with the next. As already stated, we assume that the amount of insertion and/or deletion between the two sides of a matching pair is limited. After a careful analysis of news stories, we kept the maximum ratio of sizes at four. This step reduced the total number of target files considered for each source story, avoiding the comparison of obviously mismatched stories.
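The two filters can be sketched as a single predicate applied before any similarity computation; the parameter names are illustrative, while the thresholds (seven days, ratio four) come from the text:

```python
from datetime import date

MAX_DAY_GAP = 7     # tolerance of seven days on either side
MAX_SIZE_RATIO = 4  # reject pairs whose lengths differ by more than 4x

def passes_filters(src_date, tgt_date, src_len, tgt_len):
    """Cheap pre-check: temporal filtering, then size filtering."""
    if abs((src_date - tgt_date).days) > MAX_DAY_GAP:
        return False
    longer, shorter = max(src_len, tgt_len), min(src_len, tgt_len)
    return longer <= MAX_SIZE_RATIO * shorter

print(passes_filters(date(2008, 4, 1), date(2008, 4, 5), 400, 900))   # True
print(passes_filters(date(2008, 4, 1), date(2008, 4, 20), 400, 900))  # False
```

Only pairs passing this predicate go on to the (much more expensive) similarity calculation.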

    4.6 Conflict Resolution

At each point of time, for one source story, we find the similarity value for the corresponding target story. If the calculated similarity is above a certain pre-specified threshold, and more than the maximum similarity found so far, we save the corresponding pair.

It is possible that at a certain point the total similarity value attained for a source story is the same for two or more target stories. In this case, we select the appropriate target story through conflict resolution. The process followed when two or more pairs have the same total similarity value for the same source story is as follows. First, we check the temporal closeness of both target stories with the source news. Temporal closeness measures the closeness of two stories based on the variance in date and place of publication. A pair with the same date as well as place of publication is closest. Next in closeness comes the case where the date of publication is one day on either side and the place of publication is the same; this takes care of instances where two different press agencies publish the same story on adjacent dates for various reasons. Closeness reduces further when the date of publication is the same but the place of publication is different, which helps differentiate dissimilar articles published on the same date. The final discrete case is where the place is different and the date is one day on either side. Beyond that, closeness is a function of the difference between the dates of publication of the source-target pair, and goes on reducing as the difference grows.

When two pairs of news items have the same maximum similarity for one source file: choose the pair with maximum temporal closeness; if that is the same, choose the pair with more similarity from the story; if that too is the same, add both pairs to the final set.
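One way to sketch the temporal-closeness ordering used for tie-breaking; the numeric ranks here are an illustrative encoding (lower means closer), not values from the thesis:

```python
from datetime import date

def temporal_closeness(src_date, src_place, tgt_date, tgt_place):
    """Rank a candidate pair: lower rank means temporally closer."""
    gap = abs((src_date - tgt_date).days)
    same_place = src_place == tgt_place
    if gap == 0 and same_place:
        return 0          # same date, same place: closest
    if gap <= 1 and same_place:
        return 1          # one day on either side, same place
    if gap == 0:
        return 2          # same date, different place
    if gap <= 1:
        return 3          # one day on either side, different place
    return 3 + gap        # otherwise closeness decays with the date gap

a = temporal_closeness(date(2007, 8, 25), "Pune", date(2007, 8, 25), "Pune")
b = temporal_closeness(date(2007, 8, 25), "Pune", date(2007, 8, 26), "Delhi")
print(a < b)  # True: the same-date, same-place pair wins the tie
```

Ties between candidates are then broken by picking the candidate with the lowest rank.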

    4.7 Example

This hypothetical situation demonstrates the series of events for a source story (at the center) surrounded by a set of target stories. Some of the target stories are not comparable to the source, some articles describe the same happening, and there is one story (or more) in the target language that matches the source. The source story is in Hindi, translated in the figure for convenience. The target stories are in English. The breadth of the arrow between a pair depicts the closeness of a story to the source story at the center. For the stories on the left-hand side, the dotted arrows indicate that the stories are far from the source story, while the ones on the right are closer to it. In each bubble, the first line shows the title, followed by the date of publication, and the text marked between hyphens (- ABC -) shows the possible reason for its selection or rejection.


[Figure] Source story (Hindi): "Sanjay Dutt released from Pune jail", August 25, 2007. Candidate target stories:
- "PM talks to Sonia Gandhi", 08/25/2007 (uncommon names)
- "Salman Khan released from jail", Aug 25, 2007 (names uncommon)
- "Sanjay Dutt's movie attracting crowds", Aug 25, 2007 (total similarity below threshold)
- "3 killed in Nandigram", Aug 25, 2007 (NE)
- "Sanjay arrested", Aug 21, 2007 (total similarity)
- "Sanjay Dutt released", Aug 25, 2006 (temporal filtering)
- "Munnabhai's story till now – '93–'07", Aug 25, 2007 (size)
- "Sanjay celebrates freedom", Aug 26, 2007 (total similarity)
- "Dutt to be released", Aug 24, 2007 (closeness)
- "Sanjay's case hearing today", Aug 24, 2007 (same date)
- "Sanjay's verdict deferred", Aug 23, 2007 (total similarity)
- "Sanjay released from Pune", Aug 25, 2007 (maximum similarity)
- "Bollywood relieved, Sanjay back home", Aug 25, 2007 (same maximum similarity)


Chapter 5

Results

Precision is the ratio of true positives to total retrieved pairs, whereas recall is the ratio of relevant documents retrieved to the total number of relevant documents. We evaluated our system on precision, that is, the number of pairs actually comparable divided by the number of pairs tagged positive by the system. The aim is a high-precision system, though a system high on recall along with precision is difficult to attain.
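As a concrete illustration of the two measures (the numbers below are made up, not from our experiments):

```python
def precision(true_positives, retrieved):
    """Fraction of retrieved pairs that are actually comparable."""
    return true_positives / retrieved if retrieved else 0.0

def recall(true_positives, relevant):
    """Fraction of all truly comparable pairs that were retrieved."""
    return true_positives / relevant if relevant else 0.0

# E.g. 67 correct pairs among 100 retrieved, out of 150 truly comparable pairs:
print(precision(67, 100))          # 0.67
print(round(recall(67, 150), 3))   # 0.447
```

Note that recall requires knowing the total number of truly comparable pairs, which is why our evaluation below is restricted to precision on sampled pairs.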

Out of the total pairs retrieved, we picked some random pairs and evaluated the system on them. For manual evaluation of the returned results, we considered a pair a valid match if at least one of the three conditions described below held good.

A news story pair is obviously comparable if it talks of the same event; news in both newspapers describing the Indian cricket team winning a Twenty-20 World Cup match in one particular year is an example of such a case. When both stories describe the same event, with related contents and/or actions, we also treat them as comparable; for instance, a report in one language on the Indian team winning a Twenty-20 cricket World Cup match, and an interview of the captain of the winning team for the same competition in the other. Finally, two stories describing similar events, with at least one comparable sentence present in the pair, are also comparable, as they serve our purpose of getting similar news and comparable sentences; an example is a report on the Indian team winning a Twenty-20 cricket World Cup match in one language, and a report in the other on past World Cup matches won by India against the same team.

For each source-target pair, we compared the proper names (N) present in the whole stories for the calculation of similarity. In addition, depending on the amount of content taken from the news article, we calculated similarity for the following four cases:


1. N+T: title only (T); the actual news story was NOT considered.
2. N+T+HS: title plus the first half of the news story (HS) (up to a maximum of the first 10 lines).
3. N+T+FS: title and the whole news story (FS).
4. N: only names; no title or news story.

    5.1 Sample data

The sample data set consisted of 1711 different stories in the source language Hindi, collected over a period of 30 days (29 March 2008 to 27 April 2008). The target language (English) side had 3500+ stories collected from 27 March 2008 to 1 May 2008, around the dates of the Hindi collection, to take care of the time gap between the uploading of comparable items on the websites. We are in the process of evaluating the system on a larger data set. For the improvements, we used a slightly bigger data set consisting of 2400+ source language files collected over 35 days, and corresponding English data from a similar sampling period.

    5.2 Evaluations

Figure 4
Precision versus evaluation of the system on ten random points, for a sample of stories collected over a period of one month (cases: N+T, N+T+HS, N+T+FS, N)


The graph (Figure 4) shows Precision on the Y-axis plotted against ten random evaluation points on the X-axis. The next graph (Figure 5) shows the average precision values of the evaluations for the four cases considered.

Figure 5
Average precision values for the evaluations for the one-month sample data

The overall system behaved satisfactorily: the average precision for the evaluations turned out to be 67.4 percent for the best case and 37.4 percent for the worst case (Figure 4). We infer that comparing news articles based only on their proper names is not a good choice, and in general leads to false positives. Comparing the titles along with the named entities is better in some places, as similar words in the title boost the similarity value of similar news stories. However, as stated earlier, similar news stories may have mismatching titles, twisted and compressed in a bid to make them more striking to the reader. Two dissimilar stories can also share common words in their titles, for various reasons (analyzed next).

The graph clearly shows Case 2 outperforming Case 3 at most points. In fact, the highest precision attained at any point is 78%, achieved at two points (points 4 and 10), which is higher than the highest precision obtained by Case 3 (76%, at point 4). We conclude that for the majority of stories, the first half of the news carries the largest part of the vital information. This is in accordance with the usual scenario in which the first few sentences highlight the actual current event, while the trailing sentences provide additional related information or comments, often repeating data from preceding incidents.

However, on investigation we deduce that for appropriate document-level alignment of news articles, we need to consider the key features, such as names and dictionary words, in the whole story. The average precision graph (Figure 5) shows Case 3 (N+T+FS) outperforming Case 2 (N+T+HS) over all the random points taken together.

Figure 6 shows the number of documents retrieved versus the precision obtained. For alignment of stories at the document level, considering only the first half of the story is not a good choice, as the average precision curve depicts: Case 3 performs consistently, whereas the precision of Case 2 fluctuates in the preceding graph. In any case, considering only the titles is not a good choice.

Figure 6
Number of documents retrieved versus precision obtained for the top 140 results (curves: N+T, N+T+HS, N+T+FS, N)

Table 4 shows the number of retrieved documents versus the precision values for the top 70 pairs.


Table 4

CASE     P@10   P@20   P@30    P@40   P@50   P@60    P@70
N+T      100    100    96.66   92.5   88     86.67   85.71
N+T+HS   100    100    90      85     86     85      84.29
N+T+FS   90     95     90      87.5   86     83.33   81.43
N        80     70     73.33   75     70     63.33   65.71

Number of retrieved documents versus precision values for top 70 pairs

    5.3 Analysis

    5.3.1 Positive results


A pair is termed legitimate if it conforms to the specifications laid out at the start of this chapter, irrespective of the dates and places of publication of the paired articles. The image on the previous page shows a comparable story pair marked positive by the system.

The system identifies the pair correctly even though both the dates and places of publication differ. The result can be attributed to the high frequency of the common proper name "मनु" (Manu) found in both stories (frequency 6 in Hindi and 10 in English). As expected, even after a high penalty for the difference in publication date and the place mismatch, the high score from proper names kept the total similarity value above all others. We tuned the penalties to ensure that the similarity values of pairs like these remain above the threshold level after the penalty is applied, and thus become part of the final set if no other target story achieves a higher similarity.
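The interplay between penalties and a strong proper-name match described above can be sketched as follows; the component scores, penalty values, and threshold are illustrative placeholders, not the tuned values used in the system.

```python
def penalized_similarity(name_score, title_score, story_score,
                         date_gap_days, place_match,
                         date_penalty=0.05, place_penalty=0.1):
    """Combine component similarities, then subtract penalties for
    date gaps and place mismatches (illustrative values)."""
    base = name_score + title_score + story_score
    penalty = date_penalty * date_gap_days
    if not place_match:
        penalty += place_penalty
    return max(base - penalty, 0.0)

# A pair with a strong proper-name match survives the penalties,
# while a weak pair falls below an illustrative threshold of 0.5.
strong = penalized_similarity(0.9, 0.2, 0.3, date_gap_days=3, place_match=False)
weak = penalized_similarity(0.2, 0.1, 0.1, date_gap_days=3, place_match=False)
```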

    5.3.2 Negative results

As explained earlier, in some places the method did not behave as intended. We identified the following points of failure in the present arrangement that lead to wrong results.

    5.3.2.1 Pair unable to cross the thresholds

With fewer matches in names, combined with a lower number of dictionary words translated, the total similarity value of a correct pair might not be adequate to cross the threshold boundary. The example below shows such a case.


The similarity of the shown pair exceeded the threshold when considering only proper names, and the pair made it to the final list of comparable documents. Nevertheless, due to the absence of an adequate number of translations, the total similarity value was reduced in the other cases. To boost the similarity value for such pairs, we enhanced the effect of matches in names to the extent that the effect of named entities is substantial without overshadowing the effect of the other parameters. Tuning the weight associated with the names, we set it to twice the weight for the title and three times the weight for the story.
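The weight relation described above (name weight twice the title weight and three times the story weight) can be sketched as a normalized weighted sum; the concrete weights 6, 3, and 2 are just one assignment satisfying that relation.

```python
# One assignment satisfying w_name = 2 * w_title = 3 * w_story.
W_NAME, W_TITLE, W_STORY = 6, 3, 2

def total_similarity(name_sim, title_sim, story_sim):
    """Normalized weighted combination of the component similarities."""
    weighted = W_NAME * name_sim + W_TITLE * title_sim + W_STORY * story_sim
    return weighted / (W_NAME + W_TITLE + W_STORY)
```

With these weights, a perfect name match alone contributes 6/11 of the maximum total, so strong name evidence dominates without completely masking the title and story components.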

    5.3.2.2 Absence of corresponding news article in target language

With no corresponding story to match, accidental matches in a dissimilar target-language story can cross the thresholds and be counted as a "positive" pair. Accidental matches can occur in proper names and in dictionary words; sometimes numeric figures also match, further increasing the similarity value. Although we set a threshold value to suppress the effect of such "accidental" matches on the overall results, and thus on precision, we cannot rule out the occurrence of such cases.

    5.3.2.3 Effect of similar names found in dissimilar stories

When similar names occur in dissimilar stories, there is a chance that the match in names results in a high similarity value for that pair. Two stories, one on the population of India and the other on some world event with India as a participant, with the name India occurring frequently in both, can obtain a high similarity score from names. Since the total similarity depends on both the source and target file, even when a comparable target story is present, a different target story with a higher density of such names might generate a higher correspondence.

The next example shows a situation where a highly mismatching pair is tagged positive by the system due to the occurrence of the same name "राहुल" (Rahul) in both news items.


While the Hindi story speaks of cricket and contains the name "Rahul Dravid", the target story describes the latest fluctuations in politics, with "Rahul Gandhi" occurring frequently. On analysis, we found that no corresponding news was present in the target collection, due to which the system paired the above target story with the source story at the end of the run.

There was also a pair in our sample data set where the system identified the correct target story for cases 1 to 3, but in case 4 paired the same source story with some other mismatching story because of a larger number of intersecting names. The images below show the two pairs. The first is the correct pair, identified when some content from the story is compared: the title in case 1, the first half of the story in case 2, and the full story in case 3. In case 4, however, we compare only names when calculating the total similarity. The next figure depicts such a case.


The number of proper-name matches in the true pair was six ("इरान" (Iran) occurring six times in the source story and nine times in the target), whereas in the mismatching pair "इरान" (Iran) occurred six times in each story and "तेहरान" (Tehran) occurred twice, making the total number of proper names common to both eight. When the comparison is based on names only, the pair with the higher number of common names achieves a higher total.

Topic identification could be one possible way to reduce such false positives. Classifying news stories into independent fields by identifying their context can group similar news together. In that case, the above two stories would belong to separate clusters, and penalizing matches across dissimilar clusters could reduce false positives like this to a great degree.
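The proposed cluster-based penalty could be sketched as follows; the cluster labels and penalty factor are hypothetical, since topic identification was not implemented in the present system.

```python
def cluster_adjusted(similarity, source_cluster, target_cluster,
                     penalty_factor=0.5):
    """Scale down the similarity of pairs whose stories fall in
    different topic clusters (penalty factor is illustrative)."""
    if source_cluster != target_cluster:
        return similarity * penalty_factor
    return similarity
```

Under this sketch, a sports story and a politics story that happen to share a frequent name would have their similarity halved, while same-cluster pairs are unaffected.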


In addition, we identified many pairs where matching proper names had a large effect in mismatching stories. We discovered that most of them had a negligible similarity value in their story text. To suppress pairs whose total similarity value results from similarity in proper names alone, we rejected pairs with very low similarity in the story. In this way, a pair needs some correspondence in all the measures applied in order to enter the final list.
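The story-similarity filter described above can be sketched as follows; the floor and threshold values are illustrative.

```python
# Illustrative floor: pairs whose story similarity is below this are
# rejected regardless of how strong the proper-name match is.
STORY_FLOOR = 0.1

def accept_pair(name_sim, story_sim, total_sim, threshold=0.5):
    """Require some correspondence in the story itself, not just in
    proper names, before accepting a pair."""
    if story_sim < STORY_FLOOR:
        return False
    return total_sim >= threshold
```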

5.3.2.4 Effect of similar dictionary words in stories on a similar topic yet describing different events

This happened when the names either all mismatched or their effect was negligible, while a considerable number of common dictionary translations appeared in both stories. For instance, separate accidents occurring at different places, with coincidental matches of dictionary words and even names, resulted in a higher similarity value from the story and/or title.

Analysis of some of the pairs revealed that the dictionary too was responsible for erroneous matches: a translation found in the dictionary was actually a stopword, for the news domain or in general, yet was included in the final similarity calculation. For example, the DL step translated the Hindi word "अब" to "now", and a story pair containing this common word got a boost in the correspondence value. In particular, our test data included a pair with "now" in the title itself, which helped the pair cross the minimum similarity levels.

Below is a snapshot of a case where the words "train" and "rail" were prominent in both stories, which describe different events. The pair is not comparable, yet it was marked positive by the system.


We tuned the weights associated with the similarities to minimize the effect of matches in dictionary words alone. In titles particularly, small coincidental matches led to a large boost in similarity. To suppress this effect, at matching time we rechecked the target-language stopword list to filter out any such words, which reduced such accidental matches in the final evaluations.
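The stopword re-check can be sketched as follows; the stopword list here is a tiny illustrative sample, not the list used in the system.

```python
# Tiny illustrative sample of a target-language stopword list.
STOPWORDS_EN = {"now", "the", "a", "is", "of"}

def filter_translations(translations):
    """Drop dictionary translations that are target-language
    stopwords before they enter the similarity calculation."""
    return [t for t in translations if t.lower() not in STOPWORDS_EN]
```

For instance, the translation "now" (from Hindi "अब") would be removed, while content words like "train" and "rail" are kept.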

    5.3.2.5 Effect of wrongly identified pair of names

This was the result of errors in the TS stage, with some mismatching pair crossing the threshold and marked "true" by the system. It occurred when a proper name identified by the system in the target language matched some source-language word. Considering the target word a transliteration of the source word, the system included it in the final list of identified names. For instance, "बु कग" (booking) in Hindi was transformed into the name "Viking" after passing the TS step; the corresponding word booking, not being a proper name, was not present in the English named-entity list. The example shows a case where the Hindi word "ली" (to take) gets mapped to "Lee", and comes closer to the name "Lee" present in the target story.

To suppress such mismatches, we assigned higher threshold levels for similarity in TS. We tuned the thresholds on the length of the compared strings, requiring higher thresholds for longer strings.
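A length-dependent threshold for transliteration similarity (TS) might look like the following sketch; the similarity measure (difflib's ratio) and the threshold values are assumptions, not the measure used in the thesis.

```python
import difflib

def ts_accept(source_roman, target_name):
    """Accept a romanized source word as matching a target name only
    if their string similarity clears a length-dependent threshold:
    longer strings must match more closely."""
    sim = difflib.SequenceMatcher(None, source_roman.lower(),
                                  target_name.lower()).ratio()
    n = max(len(source_roman), len(target_name))
    threshold = 0.7 if n <= 4 else 0.8 if n <= 8 else 0.9
    return sim >= threshold
```

Under this sketch, "booking" versus "Viking" scores well below the 0.8 required for seven-character strings and is rejected, while an exact short match like "lee" versus "Lee" is accepted.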

    5.3.2.6 Effect of poor dictionary

The bilingual dictionary used for this experiment is still under development; at the time of our experiments it contained 24,824 Hindi words. Inclusion of more source-language words will lead to more translations being identified in the DL step, escalating the effect of similarity in dictionary words and thus resulting in the identification of more similar pairs.

    5.4 Improvements

To achieve a further improvement in the results, we adopted some additional techniques. First, we adopted a three-level weight system for the English names found in the stories: the highest weight for proper names identified by both the LingPipe code and our custom-made NER, followed by names identified only by LingPipe, followed by names found only by our recognizer. This ensured the maximum effect for high-confidence names identified by both systems for English. When finding intersecting names, we checked the weight of the identified Hindi name in the English name collection, and the total similarity was a function of that weight.
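The three-level weight system can be sketched as follows; the concrete weights 3, 2, and 1 are illustrative.

```python
def name_weight(found_by_lingpipe, found_by_custom):
    """Three-level weight for an English proper name, highest when
    both LingPipe and the custom NER agree (weights illustrative)."""
    if found_by_lingpipe and found_by_custom:
        return 3  # high-confidence: both recognizers agree
    if found_by_lingpipe:
        return 2  # identified by LingPipe only
    if found_by_custom:
        return 1  # identified by the custom recognizer only
    return 0
```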

In addition, a deeper analysis of mismatching pairs revealed a common trend: stories came closer due to high similarity in names but had significantly low correspondence in the actual news story. We reduced the effect of such mismatches by rejecting any pair with a low similarity value for the news story. As expected, the system responded positively, and we obtained much better results than the previous run on the same sample set.

Figure 7 shows the results obtained with the improved version of our system on a larger collection: the number of top retrieved documents, taken 10 at a time, versus precision. This collection included Hindi news for a period of 35 days with a supporting target collection in English. The graph met our expectations and shows better precision for the case where the whole story is included for document-level alignment, contrary to Figure 6, where including only the title with the proper names for similarity calculation gave better precision. The precision values for the top 70 retrieved pairs are given in Table 5.


Figure 7
Number of documents retrieved versus precision obtained for the improved system (curves: N+T, N+T+HS, N+T+FS, N)

Table 5

CASE     P@10   P@20   P@30    P@40   P@50   P@60    P@70
N+T      100    100    96.67   97.5   92     90      87.14
N+T+HS   100    100    96.67   97.5   96     90      87.14
N+T+FS   100    100    100     100    96     90      85.71
N        100    95     96.67   90     82     73.33   70

Number of retrieved documents versus precision values for top 70 pairs for the improved case

Thus, based on the improved results, we restate our observation in accordance with the general perception that the former half of a news article furnishes the largest part of the essential information conveyed. However, it is consideration of the whole story that results in better identification of similar stories. Detecting association in a source-target pair through intersecting proper names is indispensable, but it should not be the only criterion, especially for languages like Hindi that are highly undersupplied with valuable cross-language resources.


Chapter 6

    Conclusion

The size of the corpus obtained so far is not yet adequate for appreciable use in the applicable areas of research and technology. At the time of writing this thesis, we were able to generate a repository of more than 100 true pairs of comparable stories. We aim to accumulate many more such pairs, aligned at the document level, so that the corpus becomes helpful to researchers everywhere. In addition, the present work is for the language pair Hindi-English. With language-specific resources in hand, it is possible to extend the system to generate a multilingual corpus in the participating languages. The resources for the Bengali language, the second most spoken language in India [Languages], are under development, and we wish to extend the present bilingual corpus to a trilingual corpus (English-Hindi-Bengali).

The present system crawls news articles from only one website for each language. NavBharatTimes lacks archives, so a story missed on a particular day might never reach the final collection. We found Bhaskar.com to contain archives of news articles published in preceding years. We wish to extend the present system to many more websites, like Bhaskar.com, to extract a large amount of comparable data.

The next step of the project is to align the identified comparable documents at the sentence level, thus gathering comparable sentences from among the similar documents.

This project is a foundation stone for progress in numerous areas associated directly or indirectly with natural language processing. Appreciable amounts of the collected corpus will act as a catalyst for research in the languages we built it for, and in many related languages. We hope this effort helps the Indian community in our purpose of shrinking the gap between English and prominent Indian languages like Hindi and Bengali, and makes information access easier.


6.1 Future work

We did not include word sense disambiguation in the present system. Many words have different meanings depending on the surrounding words in the sentence. For instance, "bank" can refer to a financial institution as well as the bank of a river; in some places the word is used as a synonym for trust. A dissimilar pair using the word in different senses can be brought closer, harming precision.

Though the precision is acceptable, the recall of the present system is low. This results from the cascading effect of the inefficiency of the named entity recognizers in both languages on the overall similarity. A significant number of documents had a low percentage of proper names mined by the recognizers in both languages. Improving the named entity recognition systems so that they extract most of the named entities will help positive pairs attain higher similarity values, thus reducing the number of false positives and helping many correctly identified pairs, whose total association value is below the threshold, to cross it.


Bibliography

[Arora] Karunesh Kr. Arora, Sunita Arora, Vijay Gugnani, V N Shukla, Dr S S Agrawal, GyanNidhi: A Parallel Corpus for Indian Languages including Nepali.

    [Shinyama] Yusuke Shinyama, Satoshi Sekine, Paraphrase Acquisition For Information

    Extraction, Computer Science Department, New York University, Proceedings of the

    second international workshop on Paraphrasing - Volume 16, Sapporo, Japan. Pages: 65 -

    71 Year of Publication: 2003.

    [Resnik] P Resnik, NA Smith, Web as a Parallel Corpus, Resnik - Computational

    Linguistics, Vol. 29, No. 3, Pages 349-380, Year of Publication: 2003.

    [Somers] Harold Somers, Bilingual Parallel Corpora and Language Engineering,

    Department of Language Engineering, UMIST, Manchester, England.

[Almeida] José João Almeida, Alberto Manuel Simões, José Alves de Castro, Grabbing parallel corpora from the Web, Departamento de Informática, Universidade do Minho.

[Ma] Xiaoyi Ma and Mark Liberman, BITS: A method for bilingual text search over the web, in Machine Translation Summit VII, Year of Publication: 1999.

[Barzilay] Regina Barzilay, Noemie Elhadad, Sentence alignment for monolingual comparable corpora, Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, Volume 10, Pages: 25-32, Year of Publication: 2003.

    [Mandal] Debasis Mandal, Sandipan Dandapat, Mayank Gupta, Pratyush Banerjee,

    Sudeshna Sarkar, Bengali and Hindi to English Cross-language Text Retrieval under

    Limited Resources, In Working Notes for the CLEF 2007 Workshop (2007).

[Liddy] Elizabeth D. Liddy, How Might CLIR Be Accomplished, ASIST Annual Meeting, Chicago, IL, 13 November 2000. http://www.cnlp.org/presentations/slides/CLIR.pdf

    [Ethnologue] Ethnologue list of most spoken languages in the world

    http://en.wikipedia.org/wiki/Ethnologue_list_of_most_spoken_languages

    [Languages] Languages of India http://languages.iloveindia.com/

    [Arist] Arist Chapter, Douglas W. Oard & Anne R. Diekema, Cross-Language

    Information Retrieval.



[McEnery] Anthony McEnery & Zhonghua Xiao, Parallel and comparable corpora: What are they up to?

    [Maia] Belinda Maia, Creating parallel and comparable corpora for work in domain

    specific areas of language, FLUP.

    [Peters] Carol Peters, Multilingual Information Access for Digital Libraries, ISTI-CNR,

    Pisa

    [CNNIC] Serving the Needs of the Community — IDN and Alternatives

    [Yale] Mohan Yale, Building a Sustainable Framework for a Multilingual Internet,

    Internationalized Domain Names (IDNs), A2K2 Conference, New Haven, USA April 27,

    2007

Tools and resources:

LingPipe home: http://alias-i.com/lingpipe/