
  • Automatic Acquisition Of

    Bilingual Comparable Corpus

    Thesis Submitted In Partial Fulfillment of the Requirement for the Degree In

    Master of Technology

    In

    Computer Science & Engineering

    Submitted by

    Mayank Gupta (06CS6015)

    Under the guidance of

    Prof. Sudeshna Sarkar Department Of Computer Science and Engineering

    IIT Kharagpur

    DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, IIT KHARAGPUR

    Indian Institute of Technology

    Kharagpur, West Bengal – 721302

    9 May 2008


    Department of Computer Science and Engineering

    CERTIFICATE

    This is to certify that the thesis entitled “Automatic Acquisition Of

    Bilingual Comparable Corpora” is a record of bona fide work carried out

    by Mr. Mayank Gupta (06CS6015), under my supervision and guidance,

    during the academic session 2007-2008, in partial fulfillment of the

    requirement for the degree of Master of Technology in Computer Science and

    Engineering, Department of Computer Science & Engineering, Indian

    Institute of Technology, Kharagpur. The results presented in this thesis have

    not been submitted elsewhere for the award of any other degree or diploma.

    Dr. Sudeshna Sarkar

    Department of Computer Science & Engineering,

    Indian Institute of Technology Kharagpur- 721302.

    Date: 30 April 2008

    Place: Kharagpur



    ACKNOWLEDGEMENTS

    I wish to extend my sincere thanks to my supervisor, Prof. Sudeshna Sarkar, for keeping faith in me and for being a driving force all along the way. The project would not have been so smooth and so interesting without her encouragement. I am indebted to the Department of Computer Science & Engineering and IIT Kharagpur for providing me with all the facilities required to carry out my work in a congenial environment.

    I also wish to thank Mr. Debasis Mandal and Mr. Sujan Saha for providing me with valuable resources for my project, and for adding a lot to my expertise through their invaluable suggestions. I extend my gratitude to the CSE department lab staff for providing the needful from time to time whenever requested.

    Above all, I am grateful to my parents, friends and well-wishers for their patience and their continuous supply of inspiration and suggestions for my ever-growing performance. Last but not least, I thank the Almighty for making me a part of the world.

    MAYANK GUPTA 06CS6015



    Dedicated to…

    My mom, who loves me more than I love myself;

    And to all those who love her…



    Contents

    Abstract...................................................................................................................... x

    Introduction......................................................................................................... 1–12

    1.1 Cross-Language Information Retrieval ........................................1–13

    1.2 Multilingual nature of web .............................................................1–14

    1.3 Motivation ........................................................................................1–15

    1.4 Problem Statement ..........................................................................1–15

    1.5 Organization of thesis .....................................................................1–16

    Background ......................................................................................................... 2–19

    2.1 Notion of corpora.............................................................................2–20

    2.2 Comparable Corpora ......................................................................2–20

    2.3 Classification of CC .........................................................................2–22

    2.4 Some Examples of Comparable Corpora .....................................2–22

    2.5 Applications of Comparable Corpora ...........................................2–24

    2.6 Limitations of CC ............................................................................2–24

    Related Work ...................................................................................................... 3–26

    3.1 Effect of poor dictionary .................................................................3–28

    Work Done........................................................................................................... 4–33

    4.1 Newspaper Sites: Open Source of CC ...........................................4–33



    4.2 Methodology.....................................................................................4–33

    4.3 Details of the implementation.........................................................4–34

    4.3.1 Gathering News .........................................................................4–35

    4.3.2 Named Entity Recognition (NER) ...........................................4–35

    4.3.3 Preprocessing.............................................................................4–35

    4.3.4 Translation.................................................................................4–36

    4.3.4.1 Dictionary Lookup (DL).......................................................4–36

    4.3.4.2 Gazetteer List Lookup (GL) .................................................4–36

    4.3.4.3 Abbreviation List Lookup (AL) ...........................................4–36

    4.3.4.4 Transliteration Similarity (TS) .............................................4–37

    4.5 Filtering ............................................................................................4–39

    4.6 Conflict Resolution ..........................................................................4–39

    4.7 Example ............................................................................................4–40

    Results .................................................................................................................. 5–43

    5.1 Sample data ......................................................................................5–44

    5.2 Evaluations .......................................................................................5–44

    5.3 Analysis.............................................................................................5–47

    5.3.1 Positive results ...........................................................................5–47

    5.3.2 Negative results .........................................................................5–48

    5.3.2.1 Pair unable to cross the thresholds .......................................5–48

    5.3.2.2 Absence of corresponding news article in target language..5–49

    5.3.2.3 Effect of similar names found in dissimilar stories..............5–50

    5.3.2.4 Effect of similar dictionary words in stories depicting news on

    similar topic, yet describing different events ...................................5–53

    5.3.2.5 Effect of wrongly identified pair of names ..........................5–54

    5.3.2.6 Effect of poor dictionary ......................................................5–55

    5.4 Improvements ..................................................................................5–56



    Conclusion .......................................................................................................... 6–59

    6.1 Future work .....................................................................................6–60

    Bibliography ........................................................................................................ 6–62



    List of Figures

    1. Difference between the volumes occupied by English-speaking users on the web, as against that occupied by non-English speaking people ................1–14
    2. Frequency distribution of number of translations in Hindi bilingual dictionary ................3–28
    3. Recall plotted versus Precision for the six runs for Bengali and Hindi to English CLIR Evaluation experiment ................3–30
    4. Precision versus evaluation of system on ten random points for a sample data of stories collected over a period of one month ................5–44
    5. Average precision values for the evaluations for one month sample data ................5–45
    6. Number of documents retrieved versus precision obtained for one month test data ................5–46
    7. Number of documents retrieved versus precision obtained on improved system ................5–57



    List of Tables

    1. Cross language runs submitted in CLEF 2007 ................3–30
    2. Summary of bilingual runs of the CLIR evaluation experiment ................3–30
    3. Phonetic rules for recognizing similar sounding names ................4–37
    4. Number of retrieved documents versus the Precision values for top 70 pairs ................5–47
    5. Number of retrieved documents versus the Precision values for top 70 pairs for improved case ................5–57



    Abstract

    Corpora are the main knowledge foundation for progress in the field of information

    retrieval [Arora]. Processing of multilingual corpora helps in the construction of efficient

    language-specific resources and in cross-lingual information access [Peters] [Mandal].

    This report presents the work aimed at constructing a comparable corpus automatically in

    English, and the most common Indian language, Hindi, by collecting similar news stories

    from newspaper websites. Our system identifies comparable news articles by recognizing

    intersecting proper names, and content overlap using a medium coverage dictionary,

    transliteration similarity, temporal closeness and filtering mechanisms at various levels.

    The system scored 67.4% and 38.4% precision for the best and worst case respectively,

    evaluated randomly on our sample set of Hindi news articles of 30 consecutive days. We

    adopted some enhancements in the present system after in-depth scrutiny and achieved

    substantial improvements in the results.

    Based on the achieved results, we draw a conclusion in accordance with the general perception: the first half of a news article furnishes the largest part of the essential information conveyed. However, it is consideration of the whole story that results in better identification of similar stories. The consistent performance obtained when we consider whole stories, against the fluctuating precision values when we examine only the first half of each story, leads us to this conclusion. Detecting the association between a source-target pair through intersecting proper names is indispensable, but it should not be the only criterion, especially for languages like Hindi that are highly undersupplied with valuable cross-language assets.


Chapter 1

    Introduction

    The first step towards advancement in research in various emerging areas of natural

    language processing and interconnected fields is the availability of huge collections of

    easily accessible text [Arora] [Resnik]. Such corpora are useful in the extraction of

    knowledge for the participating languages, and for the construction of many efficient

    resources for these languages. Resnik and Smith [Resnik] list some of the uses of parallel corpora, and elucidate their algorithm for building one through mining the web (STRAND).

    Corpora can be monolingual or multilingual. A monolingual corpus, as the name implies, involves only one language. A multilingual corpus links two or more languages, and is further classified as parallel or comparable. Parallel corpora are near-exact translations of text in the participating languages, whereas comparable corpora are collections of texts in different languages gathered according to some criterion of similarity.

    Examples of parallel corpora include translations of a text into different languages, such as user manuals, educational books or proceedings of some event, either written independently by different persons or translated from a single source. In the case of comparable data, the texts may be similar in content (comparable content) or along some other dimension, such as time or domain. Newspaper articles in multiple languages from independent press agencies, or descriptions of a product or occasion by different people, are examples of comparable text.

    Aligning comparable text at the sentence and word level can help automatically create efficient resources for languages presently deficient in these assets. Such a corpus acts as the starting step of cross-lingual studies. It also helps in multilingual summarization, the automatic procedure for extracting information from multiple texts written about the same topic in multiple languages. It offers a wide scope of


    applications for research in Discourse analysis (analysing written, spoken or signed language use), Pragmatics (the ability to understand the intended meaning of a speaker, i.e. pragmatic competence), Information Retrieval, etc.

    1.1 Cross-Language Information Retrieval

    Cross-Language Information Retrieval (CLIR) is an enormous and ever-growing field, not limited to one particular subject or discipline but multidisciplinary. People from interconnected fields like Information Retrieval, NLP, Machine Translation and Speech Processing come together for information access. There are various information processing issues (cross-lingual information access, speech processing) and a strong need for language resources (dictionaries, thesauri, corpora, test collections).

    The escalating demand for CLIR is due to various reasons. Growing internationalization has made many developed and developing countries multilingual, with no fewer speakers of non-native languages than of native ones. The United States, Canada and even India are among the countries where no single language dominates, and people from all parts of the world, with their own culture and language, share the boundaries. Globalization of the economy has reduced the problem of localization of employment, and multinational companies now have employees and customers from all parts of the world working under one roof, speaking and using multiple languages. The global information society has dissolved the physical boundaries of the world, shrinking the space between people seeking education and entertainment.

    Of course, at such places the desire to achieve excellence in some common language, in addition to the native speech, often arises for the sake of better communication. For many this is not a setback, but for those with no background in other languages, communicating becomes a real difficulty. The availability of resources like translators, multilingual dictionaries and other cross-lingual packages has made information access in multiple languages much easier than ever. For the ease of customers, and of producers too, official documents and manuals are no longer restricted to a single language. Distance learning, digital libraries and other resources provided to students worldwide by top educational institutions are no


    longer constraining students to follow the path to success in a language they are not comfortable in.

    1.2 Multilingual nature of web

    A survey shows that the Internet is no longer monolingual and that non-English content is growing rapidly [CNNIC]. Figure 1 shows the difference between the volumes occupied by English-speaking users on the web, as against that occupied by non-English speaking people. This is where the need for MLIA arises. Carol Peters [Carol] estimated that 78% of Internet users would be non-English speaking by 2005. The latest figures, as of 30 June 2007, put the non-English group at 70% of the online population. Even though the number of such corpus collections is increasing, the number of languages in which such collections are available is still limited [Somers]. The coverage of such collections for a particular need is also not always satisfactory.

    Figure 1

    Difference between the volumes occupied by English-speaking users on the web, as against that occupied by non-English speaking people

    Though English still tops the chart, more and more people with diverse backgrounds are connecting to the web. In addition to English, the most extensively exploited language, work in European languages such as French, German and Portuguese, and Asian languages such as Chinese, Korean and Japanese, has also escalated in recent times [McEnery]. With the


    advancement in technology, more and more languages are becoming a part of the efforts in

    the field of Natural Language Processing and associated areas.

    1.3 Motivation

    With 22 regional languages listed in the eighth schedule of the Constitution1, India has been a multilingual country with a wealthy unexplored reserve of knowledge [Yale] [Languages]. According to the latest survey, Hindi is the fifth and Bengali the seventh most spoken language in the world, with Chinese and English ranked first and third respectively [Ethnologue]. Most of the inhabitants of India are bilingual, and are in general exposed to English or Hindi (or both) in addition to their mother tongue [Mandal]. They, as well as those who are not, seek information from different domains (like news) and often face trouble in doing so. The motivation of this venture is easing the troubles in information access between Hindi and English. We selected Hindi to maximize the effect of the project.

    However, merely collecting texts from different sources does not constitute a corpus. Inferring knowledge from a corpus is as important as the selection of the corpus and the development of tools. In addition to collecting pairs of similar news stories in the participating languages, we develop, as by-products of our project, a list of English named entities collected from English news stories, and an automatically generated gazetteer list that contains English names for the Hindi names that the system recognizes as proper nouns.

    1.4 Problem Statement

    Our work is an attempt in the direction of making a comparable corpus in the most common Indian language, that is, Hindi, by collecting similar news articles automatically from online news websites. The objective of this project is to ease information access between Hindi and English, and for the collection to act as the initiator of much research in language technology. We selected Hindi to maximize the effect of the project.

    1 http://languages.iloveindia.com/


    The process extracts proper names from the news stories in each language. For languages other than English, we use the limited-coverage bilingual lexicon available with us to translate some key words. The similarity between two stories in different languages is a function of the extent of correspondence between the identified names and the translations. We use various phonetic substitutions to identify names with the same pronunciation in the two languages. The pair closest in temporal terms, in addition to similarity in names and translations, is tagged as the best match.
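The matching idea described above can be sketched roughly as follows. The phonetic substitutions, weights, time window and scoring formula below are illustrative stand-ins, not the actual rules of our system (the real phonetic rules appear in Table 3 of Chapter 4):

```python
# Illustrative sketch of the story-matching procedure. All rules,
# weights and thresholds here are hypothetical.

from datetime import date

# Toy substitutions for matching romanized names that sound alike.
PHONETIC_RULES = [("ee", "i"), ("oo", "u"), ("aa", "a"), ("ph", "f"), ("w", "v")]

def normalize(name):
    """Reduce a romanized name to a crude phonetic key."""
    key = name.lower()
    for old, new in PHONETIC_RULES:
        key = key.replace(old, new)
    return key

def similarity(src_names, tgt_names, src_translations, tgt_words,
               src_date, tgt_date, window_days=3):
    """Score a source-target story pair by overlap of phonetically
    normalized names and dictionary-translated content words, subject
    to temporal closeness."""
    if abs((src_date - tgt_date).days) > window_days:
        return 0.0  # too far apart in time to describe the same event
    src_keys = {normalize(n) for n in src_names}
    tgt_keys = {normalize(n) for n in tgt_names}
    name_overlap = len(src_keys & tgt_keys) / max(len(src_keys | tgt_keys), 1)
    content_overlap = (len(set(src_translations) & set(tgt_words))
                       / max(len(set(src_translations)), 1))
    return 0.6 * name_overlap + 0.4 * content_overlap  # hypothetical weights
```

With these toy rules, differently romanized spellings such as "Seengh" and "Singh" collapse to the same key, so a name extracted from a Hindi story can still match its spelling variant in an English story.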

    1.5 Organization of thesis

    We present the thesis in the following manner.

    Chapter 1 introduces the concept of Cross-Language Information Retrieval, highlighting the escalating call for proficient cross-language resources. The motivation of this project, enhancing the resources for Indian languages, is also explained along with the problem statement.

    The next chapter, Chapter 2, discusses the background of the area of natural language processing, particularly what a corpus is. This chapter familiarizes the reader with the concept of a comparable corpus, its classifications and its applications. A section describing some of the available corpora is also present.

    Chapter 3 describes the efforts worldwide towards the creation of text corpora. The section includes our own effort to understand the need for a corpus to enrich the deficient and undersupplied resources of the Indian languages Hindi and Bengali. We highlight the need for an effective bilingual lexicon for noticeable Cross-Language Information Retrieval, in addition to other language-specific needs like a Named Entity Recognizer and a Feedback System.

    Chapter 4 explains the algorithm we adopted for identifying similar stories among the news stories crawled from websites in the participating languages, English and Hindi.


    Chapter 5 analyzes the results of our test run on a sample data set. The set was a collection of more than 1700 stories in the source language, Hindi, collected over a period of 30 days. The target collection contained English news articles collected over the same period. We analyzed the causes of failure of the present work and evaluated the system again after adopting some improvement measures. The section discusses the achieved performance and how the accomplished results compare with expectations.

    We conclude the thesis with Chapter 6, where we discuss the future work we wish to undertake in this area. The references follow the scope and limitations of our present work.


    Chapter 2

    Background

    In corpus linguistics, a corpus is a self-contained compilation of texts, spoken and/or written, accumulated and assembled according to a set of clearly defined criteria. A corpus is generally collected to serve a particular purpose for the person gathering it, and for others working in the languages it covers, through its exploitation for various resources and language studies. ICAME (International Computer Archive of Modern English) is a centre that aims to organize and assist the sharing of computer-based corpora. Some examples of available English corpora are listed on the corpus linguistics website2.

    Some of the corpora in English are as follows. For British English, there is the BNC, a corpus of written and spoken British English used extensively by researchers and by the Oxford University Press, Chambers and Longman publishing houses. CANCODE (Cambridge Nottingham Corpus of the Discourse of English) is a corpus of spoken British English, used at length by researchers and Cambridge University Press. In addition, there is ICE (International Corpus of English), which includes international varieties of spoken and written English. The corpus has a major drawback: most of it is not yet available.

    The Brown University Corpus and the LOB (Lancaster-Oslo-Bergen) Corpus are parallel corpora of written texts, but are now rather outdated. The Bank of English is a compilation of written and spoken English, an important resource for researchers and for the COBUILD series of English language books. The London-Lund Corpus (Survey of English Usage) is a collection of spoken British English, but it is now quite old. The Santa Barbara Corpus collects spoken American English texts. This corpus has a drawback similar to the ICE corpus: most of it is not yet accessible for use. The Hong Kong Corpus of Spoken English is still under compilation.

    2 http://www.engl.polyu.edu.hk/corpuslinguist/corpus.htm#Definition%20of%20a%20corpus.


    2.1 Notion of corpora

    There are two approaches to multilingual corpora: parallel and comparable. A parallel corpus is a compilation of texts, each translated into at least one language other than the original, and thus clubs together perfectly aligned (parallel) translated text. The simplest case exists where only two languages participate: one corpus is an exact transformation of the other, with the direction of translation virtually irrelevant. Examples of parallel corpora and some efforts worldwide can be found in [Arora].

    In order to analyze a parallel or comparable text, some kind of text alignment is essential, which identifies equivalent text segments like sentences or words. One example of a parallel corpus is the European Parliament corpus, a corpus of pairwise-aligned files created by Philipp Koehn. The corpus is available in Danish-English, German-English, Greek-English, Spanish-English, Finnish-English, French-English, Italian-English, Dutch-English, Portuguese-English and Swedish-English. Each corpus is about 100 MB [Athel].

    TRIPTIC, the TRIlingual Parallel Text Information Corpus, forms part of the empirical data used for research on the contrastive analysis of prepositions. Developed in English, French and Dutch, the corpus investigates the way in which languages converge and diverge in the semantic structure of so-called function words. According to [Athel], the paragraph-aligned corpus consists of two million words, one million each of fiction and non-fiction data. The corpus offers automatic selection of the n-th paragraph in each of the three languages.

    Parallel corpora are objects of curiosity because of the prospect they offer of aligning original and translation and gaining insights into the nature of translation. Tools to aid translation can be formulated. In addition, probabilistic machine translation systems can be trained on such a collection of parallel text.

    2.2 Comparable Corpora

    A comparable corpus collects alike texts in more than one language or variety, based on some criterion of similarity. The sub-corpora are not exact translations of each other, but


    are collected using either the same sampling frame or some measure of comparability. In simpler words, comparable corpora are corpora chosen under “...similar circumstances of communication”. There is no strict agreement on the nature of the similarity, and there are very few examples of comparable corpora [Resnik]. One such example is ICE, the International Corpus of English.

    One example of comparable content could be descriptions of a new product written independently by different people, each in the language they are comfortable in. The style of writing and the presentation would vary a lot. In addition, one writer may highlight more features depending upon his or her perception of the product, and may even include comments and suggestions for further improvement. An author may also include feedback from other users known to the writer. Even with such variability in the accounts of the same product, there is a high chance of sentences in different languages speaking of the same feature of the product. We can identify and extract such sentences, and exploit a huge collection of such accounts of different products to build an enormous collection of highly valuable multilingual corpora.
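As a rough illustration of how such sentences might be identified, the sketch below pairs sentences across languages by overlap of dictionary-translated vocabulary. The toy romanized-Hindi-to-English lexicon, the tokenization and the 0.5 threshold are all invented for the example:

```python
# A minimal sketch of extracting comparable sentence pairs from two
# independent descriptions of the same product. The tiny lexicon and
# the overlap threshold are purely illustrative.

TOY_LEXICON = {"kamera": "camera", "achha": "good", "hai": "is",
               "batri": "battery", "kharab": "bad"}

def tokens(sentence):
    """Lowercase a sentence and strip simple punctuation."""
    return [w.lower().strip(".,") for w in sentence.split()]

def comparable_pairs(src_sents, tgt_sents, threshold=0.5):
    """Pair each source sentence with any target sentence that shares
    enough dictionary-translated vocabulary with it."""
    pairs = []
    for s in src_sents:
        s_words = {TOY_LEXICON.get(w, w) for w in tokens(s)}
        for t in tgt_sents:
            t_words = set(tokens(t))
            overlap = len(s_words & t_words) / max(len(s_words), 1)
            if overlap >= threshold:
                pairs.append((s, t))
    return pairs
```

With this toy data, an (invented) source sentence "kamera achha hai" would pair with "The camera is good." but not with "Delivery was slow.", since only the former shares enough translated vocabulary.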

    Such corpora would be an example of corpora generated based on content. If there exists a time bound on the selection of descriptions, assigning more closeness to descriptions within a specified time window, then the corpus thus chosen is a concurrent corpus. Newspaper reports are an example of such corpora. The next subsection describes the major classifications of comparable corpora based on the similarity condition used.

    Comparable corpora enjoy many advantages over parallel corpora in terms of availability, versatility, extensibility and accessibility. Moreover, parallel corpora work on the assumption that the amount of variation in the texts under consideration is limited. The procedures for acquiring comparable corpora largely relax this limitation. Mountains of comparable text exist online in the form of news reports, and in print in the form of legal texts, socially conventional texts like marriage announcements and advertisements, books, magazines, etc. Academic and scientific text written in accordance with local conventions is a high-quality source of related data. Comparable text benefits from being easily extensible, with negligible data acquisition issues in most cases, something parallel corpora are deficient in.


    2.3 Classification of CC

    Based on the similarity measure used, text can be treated as comparable on four bases. The first is the form of the data, that is, the size of files, number of words, sentences or paragraphs, or even the length of the texts; it can also be the file format (.txt, .doc, .html, .xml, etc.). Second, content can be compared to find similar documents; the corpus can be in general language or cover specialised domains, and newspaper articles, reports of war or politics, interviews and discussions, views and reviews all fall under this category. Third, the structure of the documents can be considered, where the text may be formal, carefully constructed text like legal texts, or informal, loosely organized discourse like transcriptions of conversation. Mode is the fourth category, where the similarity measure is based on whether the text is spoken (e.g. speech, formal dialogue, conversation) or written (e.g. book, essay, instruction manual). A very large corpus can also be treated as comparable to another corpus similar in size, constructed according to the same criteria of quantity and quality of text types. Many more measures are possible beyond those described above for categorizing collected data as comparable.

    2.4 Some Examples of Comparable Corpora

    This section describes three major global efforts to acquire comparable corpora.

    2.4.1 ICE (International Corpus of English)

ICE is a corpus of around 1 million words in each of many varieties of English around the world. It began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Fifteen research teams around the world are preparing electronic corpora of their own national or regional variety of English.

    Each ICE corpus consists of 1 million words of spoken and written English produced after

    1989. To ensure compatibility among the constituent corpora, each team is following a

    common corpus design, as well as a common scheme for grammatical annotation. More

    information is available on the website http://www.ucl.ac.uk/english-usage/ice/index.htm.


    2.4.2 The Brown Corpus (American English)

The Brown Corpus of Standard American English is the first of the modern, computer-readable general corpora, compiled by W.N. Francis and H. Kucera at Brown University, Providence, RI. It has 1 million words of American English texts printed in 1961, sampled from fifteen different text categories to make the corpus a good standard reference. The corpus is small and somewhat dated, but it is still used, and still imitated by other corpus compilers. The LOB Corpus (British English) and the Kolhapur Corpus (Indian English) are two examples of corpora made to match the Brown Corpus. These corpora make it easy to compare different varieties of the same language, such as English.

    2.4.3 LOB Corpus (British English)

    Researchers in Lancaster, Oslo and Bergen compiled the Lancaster-Oslo-Bergen Corpus.

    It has 1 million words of British English texts from 1961 sampled from fifteen different

    text categories. Each text is just over 2,000 words long (cut at the first sentence boundary

    after 2,000 words for longer texts) and the number of texts in each category varies. The

    corpus has been grammatically tagged (all words have been given a word-class label). The

    tagged and untagged versions of the corpus are available through ICAME. This corpus is

    the British counterpart of the Brown Corpus of American English, which contains texts

    printed in the same year for comparison between both varieties.

    2.4.4 Kolhapur Corpus (Indian English)

This corpus is comparable to the Brown and the LOB corpora. The motive behind its construction was to serve as source material for comparative studies of American, British and Indian English. The material is drawn from texts printed and published in 1961 for the Brown and LOB corpora, and in 1978 for the Indian corpus. It consists of 500 texts sampled from 15 different text categories, each consisting of just over 2,000 words, drawn mainly from government documents, foundation reports, industry reports, college catalogues, fiction, religion, press editorials, etc.


    2.5 Applications of Comparable Corpora

Comparable texts in different languages can be used to extract multilingual lexicons, paraphrases and other language-specific resources, and to enrich existing resources in the same languages. This is particularly helpful in creating effective bilingual dictionaries for language pairs for which either no dictionary exists or the available ones are very ineffective. Comparable corpora also help in multilingual summarization, the automatic extraction of information from multiple texts written about the same topic in multiple languages. They offer wide scope for research in discourse analysis (analyzing written, spoken or signed language use), pragmatics (the ability to understand the intended meaning of a speaker), information retrieval, etc.

2.6 Limitations of CC

The disadvantages of a comparable corpus lie in the difficulty of managing the corpus for more delicate analysis. Also, comparable data is not applicable to all areas of language technology, and is unnecessary for certain types of research.


Chapter 3

Related Work

Karunesh Arora et al., CDAC, Noida [Arora] have built parallel corpora for 12 Indian languages (Hindi, Punjabi, Marathi, Bengali, Oriya, Gujarati, Telugu, Tamil, Malayalam, Assamese, Kannada), including Nepali, aligned at the paragraph level. The corpus was OCRed from various books in different Indian languages from publishers such as National Book Trust India, Sahitya Akademi, and Navjivan Publications. They found candidate pairs for paragraph alignment based on the size of the content. The paper also describes some of the major efforts towards the acquisition of corpora in many languages all over the world.

Harold Somers [Somers], England deals with techniques for building parallel corpora from the web through "tricks", like filename markers (.fr, .en) or "anchors" in the text. They find candidate pairs based on content, that is, the amount of text available between each anchor. In addition, they highlight the issues related to the alignment of the parallel data thus obtained. The techniques discussed include identifying anchor points and evaluating the extent of match of the text between these anchors, as well as other features such as the use of machine-readable dictionaries and other language-specific resources.

Almeida and Alberto [Almeida] grab parallel corpora from the web by getting file pairs from a list of URLs, from user-supplied file pairs, from the results of queries on search engines, from a web site, etc. They apply in-depth validation techniques, such as file size comparison, string normalization and edit distance. For example, they normalize both "index_pt.html" (for Portuguese) and "index_en.html" (for English) to "index". Shinmaya and Sekine [Shinmaya], 2003 took Japanese newspaper stories involving deaths, and compared sentences on the extent of match of the Named Entities present in them.
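The filename normalization used by Almeida and Alberto can be sketched roughly as follows; the marker list and function name here are illustrative assumptions, not taken from [Almeida]:

```python
import re

# Illustrative language markers commonly embedded in file names.
LANG_MARKERS = ("en", "pt", "fr", "hi")

def normalize_filename(name):
    """Strip the extension and a trailing language marker: 'index_pt.html' -> 'index'."""
    stem = re.sub(r"\.[A-Za-z]+$", "", name)              # drop the extension
    marker = r"[._-](?:%s)$" % "|".join(LANG_MARKERS)     # e.g. '_pt', '-en'
    return re.sub(marker, "", stem)

# Two files whose normalized names coincide become a candidate pair.
print(normalize_filename("index_pt.html"))  # index
print(normalize_filename("index_en.html"))  # index
```

Any two crawled files whose normalized names coincide can then be proposed as a candidate translation pair for the deeper validation steps (size comparison, edit distance).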


Resnik and Smith [Resnik] utilize the Internet Archive for grabbing parallel data, using their STRAND (Structural Translation Recognition, Acquiring Natural Data) web-mining architecture to identify pages that might be translations of each other in different languages. Their algorithm first locates pages that might have parallel text in multiple languages, generates candidate pairs that could be translations through a URL-matching algorithm, and applies structural filtering to throw out negative pairs. They use their algorithm to build an English-Arabic corpus with 2910 translation pairs.
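The URL-matching step can be sketched as below; the substitution table and function name are illustrative assumptions, and STRAND's actual matching rules are richer:

```python
# Hypothetical substitution rules mapping a URL to its possible counterpart
# in the other language.
SUBSTITUTIONS = [("/en/", "/ar/"), ("_en.", "_ar.")]

def candidate_pairs(urls):
    """Pair each crawled URL with a counterpart obtained by swapping language markers."""
    seen = set(urls)
    pairs = []
    for url in urls:
        for src, tgt in SUBSTITUTIONS:
            if src in url and url.replace(src, tgt) in seen:
                pairs.append((url, url.replace(src, tgt)))
    return pairs

urls = ["http://site/en/news1.html", "http://site/ar/news1.html",
        "http://site/en/only-english.html"]
print(candidate_pairs(urls))
```

Pairs produced this way are only candidates; structural filtering then discards those whose page structures do not correspond.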

The Parallel Text Miner (PTMiner) of Chen and Nie [Chen] uses existing search engines to locate parallel data through a query in a specific language. The search engine returns links to pages, which they first verify for the language of interest using a length filter and automatic language identification. Once they locate candidate multilingual sites, these are crawled deeply for data. By this method, Chen and Nie accumulate a 118MB/135MB English-French corpus with 95% precision, and a 137MB/117MB English-Chinese corpus with 90% precision [Chen] [Resnik].

Ma and Liberman [Ma] in 1999 proposed BITS (Bilingual Internet Text Search), wherein multilingual pages from a pre-specified list of domains are identified using language identification. The recognized pages are crawled exhaustively, compared based on content, and filtered on a threshold.

To highlight the need for a bilingual lexicon for effective CLIR, we developed a system to retrieve relevant documents from an English target collection in response to queries in Hindi and Bengali using a Machine Translation approach. The system returned its results in ranked order. The best MAP (Mean Average Precision) values for Bengali and Hindi CLIR in our experiment were 7.26% and 4.77%, which were 20% and 13% of our best monolingual retrieval, respectively. This system was part of our first participation in the Cross-Language Evaluation Forum (CLEF) 2007 [Mandal].

We followed the dictionary-based Machine Translation approach to generate the corresponding English query from the Indian language (Hindi, Bengali) topics. Our main challenge was to work with a limited-coverage dictionary (coverage ~20%) for Hindi-English, and a virtually non-existent dictionary for Bengali-English. Therefore, we depended mostly on a phonetic transliteration system to overcome this. We had access to a Hindi-English bilingual lexicon of approximately 26000 Hindi words, a Bengali bio-chemical lexicon of around 9000 Bengali words, a Bengali morphological analyzer and a Hindi stemmer. In order to achieve successful retrieval with this limited set of resources, we adopted the following strategies: structured query translation, phoneme-based followed by list-based named-entity transliteration, and no relevance judgment. Finally, we fed the English query into the Lucene search engine, which follows the Vector Space Model (VSM) of Information Retrieval, and retrieved the documents along with their normalized scores.

    3.1 Effect of poor dictionary

We emphasize in this section the effect of an impoverished bilingual lexicon on the overall results.

    Figure 2

    Frequency distribution of number of translations in Hindi bilingual dictionary

The above graph (Figure 2) shows the frequency distribution of the number of translations for Hindi words in the Hindi bilingual dictionary we used. With the increase of lexical entries


and Structured Query Translation (SQT), more and more "noisy words" were incorporated into the final query in the absence of any translation disambiguation algorithm, thus bringing down the overall performance. The average number of English translations per Hindi word in the lexicon was 1.29, with 14.89% of Hindi words having two or more translations. For example, the Hindi word ‘रोकना’ (to stop) had 20 translations in the dictionary, making it highly susceptible to noise. The process followed for CLIR was as follows:

[Flow diagram] Query generation: stopword removal; structured query translation through bilingual lexicon lookup (all translations used); transliteration (ITRANS), with an edit-distance algorithm used for named-entity matching. Corpus processing (1.35 lakh documents, ~433 MB): language-specific stopword removal; stemming (morphological analyzer for Bengali, stemmer for Hindi, Lucene stemmer for English; all possible stems); indexing with Lucene. Document retrieval: the generated query is run through Lucene to produce the results.

    The topics were in the Indian languages Hindi and Bengali, with a target collection of

    English documents. The topics consisted of three fields namely Title, Description and

    Narration. Table 1 shows the various Cross language runs submitted in CLEF 2007. Table

    2 shows the summary of bilingual runs of the CLIR evaluation experiment.


    Table 1

    Cross language runs submitted in CLEF 2007

    Table 2

    Summary of bilingual runs of the CLIR evaluation experiment

    Figure 3

    Recall plotted versus Precision for the six runs for Bengali and Hindi to English CLIR Evaluation


The above graph (Figure 3) shows Recall plotted versus Precision for the six runs on which we evaluated the system, two for each of English (monolingual), Hindi and Bengali. We found that the difference in performance is due to missing specialized vocabulary, missing general terms, wrong translation due to ambiguity, and correct identical translation. There is a strong need for effective translation, memory and processing capacity, an effective bilingual lexicon, and the availability of a parallel corpus to build a statistical lexicon.

Translation disambiguation during query generation explains the anomalous behavior of Hindi. Query-wise score breakup revealed that queries with more named entities always provided better results than those lacking them. The poorer performance of our system with respect to other resource-rich participants clearly pointed out the necessity of a rich bilingual lexicon, a good transliteration system, and effective Named Entity recognition. We found that a trilingual comparable corpus in English, Hindi and Bengali could prove to be the key to constructing the needed resources. Since at present no such corpus exists that could serve our purpose, we decided to construct one in English and Hindi. The next phase of the project is towards fulfilling these preliminary requirements, without which no CLIR in these two languages can be of much help.


Chapter 4

    Work Done

    4.1 Newspaper Sites: Open Source of CC

For this project, the source of comparable data is freely available news articles from the online newspapers Navbharat Times (NBT) for Hindi and Times of India (TOI) for English. We chose online news articles because news items offer high variety and easy availability, with no legal issues of data acquisition.

In the case of a comparable corpus, there is no complete information overlap, particularly for languages like English and Hindi with very different foundations. Two texts written by different press agencies describing the same incident or event can vary widely in presentation and content, depending on when the event was covered and on the author. One may contain additional facts or comments, and the comparable information may appear in a different order, adding to the "noise". We assume beforehand that the amount of insertion and deletion between similar stories is limited. Where this assumption does not hold, the chances of false positives cropping up in the final pair list escalate.

    4.2 Methodology

Different authors writing about the same event in their respective newspapers produce numerous versions of similar content on the websites. However, certain kinds of noun phrases, such as names, dates and numbers, behave as "anchors" that are shared by similar articles. Our key idea is to identify these anchors in comparable articles and compute a similarity score; this way we can extract stories that convey similar information.


4.3 Details of the implementation

[System diagram] News articles in Hindi and English are crawled from the WWW and accumulated. Preprocessing and translation follow, using named entity recognition, dictionary lookup, transliteration similarity, gazetteer list lookup and abbreviations list lookup. Candidate pairs then pass through temporal filtering, size filtering, similarity finding, threshold filtering and conflict resolution to produce the comparable corpus.


    4.3.1 Gathering News

The first step was to gather news items from the mentioned websites automatically. On a daily basis, we collected the crawled news stories and extracted the actual news article along with the title, date and place of publication. We also saved special reports with no place defined, such as letters to the editor, discussions and reviews; this helped in matching such special reports across the two newspapers.

    4.3.2 Named Entity Recognition (NER)

For identifying proper names in the English stories, we used the freely available LingPipe Named Entity demo code (http://alias-i.com/lingpipe/web/demo-ne.html), which identifies proper names present in English files and tags them in three classes: Person, Organization and Location. However, the number of names identified by the system was small and included some false positives. To enhance the collection of identified names, we built a "crude" Named Entity recognizer for English that considers any capitalized word a proper name, including the first word of a sentence. It treats multi-word names like "Prime Minister" as a single name. To suppress false positives, we constructed a new stopword list containing common sentence initiators such as "Currently" and "Today", in addition to common stopwords such as "The".
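A rough sketch of such a crude recognizer is given below; the stopword lists here are illustrative assumptions, not the actual lists used in the system:

```python
import re

# Illustrative lists; the actual system used longer ones.
COMMON_STOPWORDS = {"The", "A", "An"}
SENTENCE_INITIATORS = {"Currently", "Today", "However"}
SKIP = COMMON_STOPWORDS | SENTENCE_INITIATORS

def crude_ner(text):
    """Treat runs of capitalized words as single (possibly multi-word) names."""
    names = []
    for run in re.findall(r"(?:[A-Z][a-z]+\s?)+", text):
        # Drop stopwords and common sentence initiators from the run.
        words = [w for w in run.split() if w not in SKIP]
        if words:
            names.append(" ".join(words))
    return names

print(crude_ner("Today the Prime Minister met Sonia Gandhi in New Delhi."))
# ['Prime Minister', 'Sonia Gandhi', 'New Delhi']
```

Note how "Prime Minister" survives as one multi-word name, while the sentence initiator "Today" is filtered out.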

For identifying names in the Hindi stories, we used a Named Entity recognizer (courtesy Mr. Sujan Saha, Communication Empowerment Laboratory, Department of Computer Science and Engineering, IIT Kharagpur) and collected the Hindi proper names it identified. This recognizer processes the Hindi file, extracts the proper names present in it, and tags them in four classes: Person, Location, Organization and Date. For example, the system tags "Manmohan Singh" as PERSON and "20 June 2008" as DATE.

    4.3.3 Preprocessing

Stopword removal and processing of special cases for both languages were undertaken. Special cases include normalizing currencies like $20 to 20 dollars, mathematical quantities like 35.5% to 35.5 percent, breaking ranges of years like 2007-09 into 2007 2009,

and others. To handle the domain-dependence of stopwords, we constructed news-specific stopword lists for both languages. For instance, words like "report", "article" and "incident", though meaningful, add little to the semantic content of a news text. We appended a list of such words to the existing stopword list. In addition, common abbreviations in English articles were expanded using an abbreviations list (AL). For preprocessing articles in languages other than English, we used the techniques of dictionary lookup (DL), gazetteer list lookup (GL), abbreviation list lookup (AL) and Transliteration Similarity (TS).
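The special-case rewrites above can be sketched with regular expressions; the exact rules of the system are not listed in the thesis, so these patterns are illustrative:

```python
import re

def expand_year_range(m):
    """Expand a short year range: 2007-09 -> 2007 2009."""
    start = m.group(1)
    return start + " " + start[:2] + m.group(2)

def normalize(text):
    text = re.sub(r"\$(\d+(?:\.\d+)?)", r"\1 dollars", text)   # $20 -> 20 dollars
    text = re.sub(r"(\d+(?:\.\d+)?)%", r"\1 percent", text)    # 35.5% -> 35.5 percent
    text = re.sub(r"\b(\d{4})-(\d{2})\b", expand_year_range, text)
    return text

print(normalize("A $20 deal grew 35.5% over 2007-09."))
# A 20 dollars deal grew 35.5 percent over 2007 2009.
```

Applying the same rewrites to both sides of a candidate pair ensures that, for example, "$20" in English and "20 डॉलर" translated as "20 dollars" can match.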

    4.3.4 Translation

    4.3.4.1 Dictionary Lookup (DL)

The translation step at DL used a Hindi-English dictionary with 24824 Hindi words. The mapping was one-to-many, that is, in many cases more than one English translation was present for a Hindi word. This dictionary was different from the one we used for our earlier results. To suppress the effect of noise, we considered at most the first two translations for each Hindi word found.

    4.3.4.2 Gazetteer List Lookup (GL)

We manually constructed a list of the corresponding English names for regularly occurring Hindi proper names. The list was incremental in nature: at the end of each run, the system automatically appended to it, for future use, the proper names it had identified whose corresponding English names had been found by TS.

    4.3.4.3 Abbreviation List Lookup (AL)

We used a list of commonly found acronyms and their expansions. In addition, the print media tends to adopt many compression techniques and abbreviations, generally to shorten titles in a bid to pack more significant terms into them. For instance, "President" is frequently written as "Prez", and "Prime Minister" and "Chief Minister" as "PM" and "CM" respectively. Such abbreviations were also included in the gazetteer list to facilitate better detection of correspondence between titles and even news stories.

    4.3.4.4 Transliteration Similarity (TS)

For identifying names, we exploited the phonetic correspondence of letters and sub-strings in English and Hindi. For example, "ph" and "f" both map to the sound of "फ" (f). Likewise, "sha" in Hindi (as in Roshan) and "tio" in English (as in ration) sound similar. Before executing the content-based similarity-finding algorithm, we applied the TS approach to the list of English Named Entities identified by the system. Using the edit-distance method, we collected the mappings identified as valid name matches and appended them to the existing gazetteer list for future use.

For each case, we calculated the similarity between the identified proper names for each source language and target language pair. The system considered the pair with maximum similarity above a certain threshold a valid pair. To facilitate similarity matching based on phonetics, we wrote language-specific rules for both languages. Some of them are listed below in Table 3:

    Table 3

ष -> श (sh)
ट / ठ / त / थ -> ट (t)
ढ / ड / द / ध / ङ -> द (d)
ज / झ -> ज (j)
ब / भ / व -> व (v)
श / प / फ -> प (p)
ग / घ -> ग (g)
न / ण / ञ -> न (n)
ख / क -> क (k)

    Phonetic rules for recognizing similar sounding names
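A minimal sketch of the TS step for romanized names: fold similar-sounding substrings to one canonical form, then compare with edit distance. The folding table and threshold below are illustrative assumptions, not the full rule set of Table 3:

```python
# Illustrative folding of similar-sounding substrings in romanized names.
FOLD = [("ph", "f"), ("sh", "s"), ("w", "v"), ("z", "j")]

def fold(name):
    name = name.lower()
    for src, tgt in FOLD:
        name = name.replace(src, tgt)
    return name

def edit_distance(a, b):
    """Standard Levenshtein distance with a one-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def name_match(a, b, threshold=0.8):
    """Accept a name pair whose folded forms are close enough in edit distance."""
    a, b = fold(a), fold(b)
    sim = 1 - edit_distance(a, b) / max(len(a), len(b))
    return sim >= threshold

print(name_match("Pharid", "Farid"))   # True: both fold to 'farid'
print(name_match("Sharma", "Verma"))   # False
```

Valid matches found this way are what get appended to the gazetteer list for future runs.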


    4.4 Similarity Calculation

The final similarity between two stories is the sum of the similarity of the names found in them and the matches between words appearing in the title as well as the actual news story. The prime focus was on boosting, through several techniques, the similarity value of true positive pairs, and on heavily penalizing sure mismatches, thus pushing the truly matching pairs above all others.

Extent of Match = N / √(SOURCE × TARGET)    (1)

where
N: number of intersecting words in the source string and the target string
SOURCE: number of words in the source language string
TARGET: number of words in the target language string

We calculated the similarity between the source and target strings by computing the cosine of the angle between the two string vectors. Matching proper names were assigned the highest weight (6.0), followed by title matches (3.0) and matches in the story (2.0), each weight being multiplied by the corresponding extent value. The total similarity value was the sum of the similarity values obtained from the correspondence in the three cases, viz. proper names, title and actual story. We normalized the total value to the range 0 to 100, depicting no match at all and perfect match respectively.
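Under these weights, the combined score could be computed roughly as follows. This is a sketch: the function names and the bag-of-words cosine helper are our own simplification of the scheme described above:

```python
import math
from collections import Counter

def cosine(src_words, tgt_words):
    """Cosine of the angle between two bag-of-words vectors."""
    a, b = Counter(src_words), Counter(tgt_words)
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Weights from the text: proper names 6.0, title 3.0, story 2.0.
def total_similarity(names_pair, title_pair, story_pair):
    raw = (6.0 * cosine(*names_pair)
           + 3.0 * cosine(*title_pair)
           + 2.0 * cosine(*story_pair))
    return 100 * raw / (6.0 + 3.0 + 2.0)   # normalized to the range 0..100

score = total_similarity((["sanjay", "pune"], ["sanjay", "pune"]),
                         (["released"], ["released"]),
                         (["jail"], ["jail"]))
print(score)  # 100.0 for a perfect match
```

A pair with identical names, title and story words thus scores 100, and a pair with nothing in common scores 0.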

A match in titles increases the chances of a pair being comparable, because in reality related stories rarely have highly similar titles: newspapers often present titles in sensational ways to attract readers, deviating from the "real" content.

Based on the closeness of the stories, we put heavy penalties on pairs with different dates of publication and/or different places. The idea is that even if a matching pair appears on the websites with a wide time gap, its higher similarity value will make up for the penalty. This step reduces the chances of mismatched stories attaining a high similarity value due to accidental similarities in content.

    4.5 Filtering

Rejecting pairs of stories that cannot be similar under any circumstances at an early stage can lead to a substantial saving in the time and effort required to get the results. To reject obvious mismatches, we used a combination of temporal and size filtering. We permitted a tolerance of seven days on either side between the dates of publication of the source and target stories. The tolerance was adjustable, but we kept it at a considerable value to reduce the running time of the program without losing any true pair.

We also used length filtering to discard pairs with a large size variation. For a very low match between titles combined with a high difference between the lengths of the stories, we rejected the current target-language text and continued with the next. As already stated, we assume that the amount of insertion and/or deletion between the two sides of a matching pair is limited. After a careful analysis of news stories, we kept the maximum ratio of sizes at four. This step reduced the total number of target files considered for each source story, avoiding the comparison of obviously mismatched stories.
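The two filters can be sketched as a single predicate applied before any similarity computation; the parameter names are illustrative, while the thresholds (seven days, ratio four) come from the text:

```python
from datetime import date

MAX_DAY_GAP = 7     # tolerance of seven days on either side
MAX_SIZE_RATIO = 4  # reject pairs whose lengths differ by more than 4x

def passes_filters(src_date, tgt_date, src_len, tgt_len):
    """Cheap pre-check: temporal filtering, then size filtering."""
    if abs((src_date - tgt_date).days) > MAX_DAY_GAP:
        return False
    longer, shorter = max(src_len, tgt_len), min(src_len, tgt_len)
    return longer <= MAX_SIZE_RATIO * shorter

print(passes_filters(date(2008, 4, 1), date(2008, 4, 5), 400, 900))   # True
print(passes_filters(date(2008, 4, 1), date(2008, 4, 20), 400, 900))  # False
```

Only pairs passing this predicate go on to the (much more expensive) similarity calculation.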

    4.6 Conflict Resolution

At each point of time, for one source story, we find the similarity value for the corresponding target story. If the calculated similarity is above a certain pre-specified threshold, and more than the maximum similarity found so far, we save the corresponding pair.

It is possible that at a certain point the total similarity value attained for a source story is the same for two or more target stories. In this case, we select the appropriate target story through conflict resolution. The process followed when two or more pairs have the same total similarity value for the same source story is as follows. First, we check the temporal closeness of both target stories with the source news. Temporal closeness measures the closeness of two stories based on the variance in date and place of publication. A pair with the same date as well as place of publication is closest. Next in closeness comes the case where the date of publication is one day on either side and the place of publication is the same; this takes care of instances where two different press agencies publish the same story on adjacent dates for various reasons. Closeness reduces further when the date of publication is the same but the place of publication is different, which helps differentiate dissimilar articles published on the same date. The final discrete case is where the place is different and the date is one day on either side. Beyond that, closeness is a function of the difference between the dates of publication of the source-target pair, and goes on reducing as the difference grows.

When two pairs of news items have the same maximum similarity for one source file: choose the pair with maximum temporal closeness; if that is the same, choose the pair with more similarity from the story; if that too is the same, add both pairs to the final set.
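One way to sketch the temporal-closeness ordering used for tie-breaking; the numeric ranks here are an illustrative encoding (lower means closer), not values from the thesis:

```python
from datetime import date

def temporal_closeness(src_date, src_place, tgt_date, tgt_place):
    """Rank a candidate pair: lower rank means temporally closer."""
    gap = abs((src_date - tgt_date).days)
    same_place = src_place == tgt_place
    if gap == 0 and same_place:
        return 0          # same date, same place: closest
    if gap <= 1 and same_place:
        return 1          # one day on either side, same place
    if gap == 0:
        return 2          # same date, different place
    if gap <= 1:
        return 3          # one day on either side, different place
    return 3 + gap        # otherwise closeness decays with the date gap

a = temporal_closeness(date(2007, 8, 25), "Pune", date(2007, 8, 25), "Pune")
b = temporal_closeness(date(2007, 8, 25), "Pune", date(2007, 8, 26), "Delhi")
print(a < b)  # True: the same-date, same-place pair wins the tie
```

Ties between candidates are then broken by picking the candidate with the lowest rank.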

    4.7 Example

This hypothetical situation demonstrates the series of events for a source story (at the center) surrounded by a set of target stories. Some of the target stories are not comparable to the source, some articles describe the same happening, and there is one story (or more) in the target language that matches the source. The source story is in Hindi, translated in the figure for convenience. The target stories are in English. The breadth of the arrow between a pair depicts the closeness of a story to the source story at the center. For the stories on the left-hand side, the dotted arrows indicate that the stories are far from the source story, while the ones on the right are closer to it. In each bubble, the first line shows the title, followed by the date of publication, and the text marked between hyphens (- ABC -) shows the possible reason for its selection or rejection.


[Figure] Source story (Hindi): "Sanjay Dutt released from Pune jail", August 25, 2007. Candidate target stories:
- "PM talks to Sonia Gandhi", 08/25/2007 (uncommon names)
- "Salman Khan released from jail", Aug 25, 2007 (names uncommon)
- "Sanjay Dutt's movie attracting crowds", Aug 25, 2007 (total similarity below threshold)
- "3 killed in Nandigram", Aug 25, 2007 (NE)
- "Sanjay arrested", Aug 21, 2007 (total similarity)
- "Sanjay Dutt released", Aug 25, 2006 (temporal filtering)
- "Munnabhai's story till now – '93–'07", Aug 25, 2007 (size)
- "Sanjay celebrates freedom", Aug 26, 2007 (total similarity)
- "Dutt to be released", Aug 24, 2007 (closeness)
- "Sanjay's case hearing today", Aug 24, 2007 (same date)
- "Sanjay's verdict deferred", Aug 23, 2007 (total similarity)
- "Sanjay released from Pune", Aug 25, 2007 (maximum similarity)
- "Bollywood relieved, Sanjay back home", Aug 25, 2007 (same maximum similarity)


Chapter 5

Results

Precision is the ratio of true positives to total retrieved pairs, whereas recall is the ratio of relevant documents retrieved to the total number of relevant documents. We evaluated our system on precision, that is, the number of pairs actually comparable divided by the number of pairs tagged positive by the system. The aim is a high-precision system, though a system high on recall along with precision is difficult to attain.
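As a concrete illustration of the two measures (the numbers below are made up, not from our experiments):

```python
def precision(true_positives, retrieved):
    """Fraction of retrieved pairs that are actually comparable."""
    return true_positives / retrieved if retrieved else 0.0

def recall(true_positives, relevant):
    """Fraction of all truly comparable pairs that were retrieved."""
    return true_positives / relevant if relevant else 0.0

# E.g. 67 correct pairs among 100 retrieved, out of 150 truly comparable pairs:
print(precision(67, 100))          # 0.67
print(round(recall(67, 150), 3))   # 0.447
```

Note that recall requires knowing the total number of truly comparable pairs, which is why our evaluation below is restricted to precision on sampled pairs.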

Out of the total pairs retrieved, we picked some random pairs and evaluated the system on them. For manual evaluation of the returned results, we considered a pair a valid match if at least one of the three conditions described below held good.

A news story pair is obviously comparable if it talks of the same event; news in both newspapers describing the Indian cricket team winning a Twenty-20 World Cup match in one particular year is an example of such a case. When both stories describe the same event, with related contents and/or actions, we also treat them as comparable; for instance, a report in one language on the Indian team winning a Twenty-20 cricket World Cup match, and an interview of the captain of the winning team for the same competition in the other. Finally, two stories describing similar events, with at least one comparable sentence present in the pair, are also comparable, as they serve our purpose of getting similar news and comparable sentences; an example is a report on the Indian team winning a Twenty-20 cricket World Cup match in one language, and a report in the other on past World Cup matches won by India against the same team.

For each source-target pair, we compared the proper names (N) present in the whole stories for the calculation of similarity. In addition, depending on the amount of content taken from the news article, we calculated similarity for the following four cases:


1. N+T: title only (T); the actual news story was NOT considered.
2. N+T+HS: title plus the first half of the news story (HS) (up to a maximum of the first 10 lines).
3. N+T+FS: title and the whole news story (FS).
4. N: only names; no title or news story.

    5.1 Sample data

The sample data set consisted of 1711 different stories in the source language Hindi, collected over a period of 30 days (29 March 2008 to 27 April 2008). The target language (English) side had 3500+ stories collected from 27 March 2008 to 1 May 2008, around the dates of the Hindi collection, to take care of the time gap between the uploading of comparable items on the websites. We are in the process of evaluating the system on a larger data set. For the improvements, we used a slightly bigger data set consisting of 2400+ source language files collected over 35 days, and corresponding English data from a similar sampling period.

    5.2 Evaluations

Figure 4
Precision versus evaluation of the system on ten random points, for a sample of stories collected over a period of one month (cases: N+T, N+T+HS, N+T+FS, N)


The graph (Figure 4) shows Precision on the Y-axis plotted against ten random evaluation points on the X-axis. The next graph (Figure 5) shows the average precision values of the evaluations for the four cases considered.

Figure 5
Average precision values for the evaluations for the one-month sample data

The overall system behaved satisfactorily: the average precision for the evaluations turned out to be 67.4 percent for the best case and 37.4 percent for the worst case (Figure 4). We infer that comparing news articles based only on their proper names is not a good choice, and in general leads to false positives. Comparing the titles along with the named entities is better in some places, as similar words in the title boost the similarity value of similar news stories. However, as stated earlier, similar news stories may have mismatching titles, twisted and compressed in a bid to make them more striking to the reader. Two dissimilar stories can also share common words in their titles, for various reasons (analyzed next).

The graph clearly shows Case 2 outperforming Case 3 at most points. In fact, the highest precision attained at any point is 78%, achieved at two points (points 4 and 10), which is higher than the highest precision obtained by Case 3 (76%, at point 4). We conclude that for the majority of stories, the first half of the news carries the largest part of the vital information. This is in accordance with the usual scenario in which the first few sentences highlight the actual current event, while the trailing sentences provide additional related information or comments, often repeating data from preceding incidents.

However, on investigation we deduce that for appropriate document-level alignment of news articles, we need to consider the key features, such as names and dictionary words, in the whole story. The average precision graph (Figure 5) shows Case 3 (N+T+FS) outperforming Case 2 (N+T+HS) over all the random points taken together.

Figure 6 shows the number of documents retrieved versus the precision obtained. For alignment of stories at the document level, considering only the first half of the story is not a good choice, as the average precision curve depicts: Case 3 performs consistently, whereas the precision of Case 2 fluctuates in the preceding graph. In any case, considering only the titles is not a good choice.

Figure 6
Number of documents retrieved versus precision obtained for the top 140 results (curves: N+T, N+T+HS, N+T+FS, N)

Table 4 shows the number of retrieved documents versus the precision values for the top 70 pairs.


Table 4

CASE     P@10   P@20   P@30    P@40   P@50   P@60    P@70
N+T      100    100    96.66   92.5   88     86.67   85.71
N+T+HS   100    100    90      85     86     85      84.29
N+T+FS   90     95     90      87.5   86     83.33   81.43
N        80     70     73.33   75     70     63.33   65.71

Number of retrieved documents versus precision values for top 70 pairs

    5.3 Analysis

    5.3.1 Positive results


A pair is termed legitimate if it conforms to the specifications laid out at the start of this chapter, irrespective of the dates and places of publication of the paired articles. The image on the previous page shows a comparable story pair marked positive by the system.

The system identifies the pair correctly even though both the dates and places of publication differ. The result can be attributed to the high frequency of the common proper name "मनु" (Manu) found in both stories (frequency 6 in Hindi and 10 in English). As expected, even after a high penalty for the difference in publication date and the place mismatch, the high score from proper names kept the total similarity value above all others. We tuned the penalties to ensure that the similarity values of pairs like these remain above the threshold level after the penalty is applied, and thus become part of the final set if no other target story achieves a higher similarity.
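The interplay between penalties and a strong proper-name match described above can be sketched as follows; the component scores, penalty values, and threshold are illustrative placeholders, not the tuned values used in the system.

```python
def penalized_similarity(name_score, title_score, story_score,
                         date_gap_days, place_match,
                         date_penalty=0.05, place_penalty=0.1):
    """Combine component similarities, then subtract penalties for
    date gaps and place mismatches (illustrative values)."""
    base = name_score + title_score + story_score
    penalty = date_penalty * date_gap_days
    if not place_match:
        penalty += place_penalty
    return max(base - penalty, 0.0)

# A pair with a strong proper-name match survives the penalties,
# while a weak pair falls below an illustrative threshold of 0.5.
strong = penalized_similarity(0.9, 0.2, 0.3, date_gap_days=3, place_match=False)
weak = penalized_similarity(0.2, 0.1, 0.1, date_gap_days=3, place_match=False)
```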

    5.3.2 Negative results

As explained earlier, in some places the method did not behave as intended. We identified the following points of failure in the present arrangement that lead to wrong results.

    5.3.2.1 Pair unable to cross the thresholds

With fewer matches in names, combined with a lower number of dictionary words translated, the total similarity value of a correct pair might not be adequate to cross the threshold boundary. The example below shows such a case.


The similarity of the shown pair exceeded the threshold when considering only proper names, and the pair made it to the final list of comparable documents. Nevertheless, due to the absence of an adequate number of translations, the total similarity value was reduced in the other cases. To boost the similarity value for such pairs, we enhanced the effect of matches in names to the extent that the effect of named entities is substantial without overshadowing the effect of the other parameters. Tuning the weight associated with the names, we set it to twice the weight for the title and three times the weight for the story.
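The weight relation described above (name weight twice the title weight and three times the story weight) can be sketched as a normalized weighted sum; the concrete weights 6, 3, and 2 are just one assignment satisfying that relation.

```python
# One assignment satisfying w_name = 2 * w_title = 3 * w_story.
W_NAME, W_TITLE, W_STORY = 6, 3, 2

def total_similarity(name_sim, title_sim, story_sim):
    """Normalized weighted combination of the component similarities."""
    weighted = W_NAME * name_sim + W_TITLE * title_sim + W_STORY * story_sim
    return weighted / (W_NAME + W_TITLE + W_STORY)
```

With these weights, a perfect name match alone contributes 6/11 of the maximum total, so strong name evidence dominates without completely masking the title and story components.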

    5.3.2.2 Absence of corresponding news article in target language

With no corresponding story to match, accidental matches in a dissimilar target-language story can cross the thresholds and be counted as a "positive" pair. Accidental matches can occur in proper names and in dictionary words; sometimes numeric figures also match, further increasing the similarity value. Although we set a threshold value to suppress the effect of such "accidental" matches on the overall results, and thus on precision, we cannot rule out the occurrence of such cases.

    5.3.2.3 Effect of similar names found in dissimilar stories

When similar names occur in dissimilar stories, there is a chance that the match in names results in a high similarity value for that pair. Two stories, one on the population of India and the other on some world event with India as a participant, with the name India occurring frequently in both, can obtain a high similarity score from names. Since the total similarity depends on both the source and target file, even when a comparable target story is present, a different target story with a higher density of such names might generate a higher correspondence.

The next example shows a situation where a highly mismatching pair is tagged positive by the system due to the occurrence of the same name "राहुल" (Rahul) in both news items.


While the Hindi story speaks of cricket and contains the name "Rahul Dravid", the target story describes the latest fluctuations in politics, with "Rahul Gandhi" occurring frequently. On analysis, we found that no corresponding news was present in the target collection, due to which the system paired the above target story with the source story at the end of the run.

There was also a pair in our sample data set where the system identified the correct target story for cases 1 to 3, but in case 4 paired the same source story with some other mismatching story because of a larger number of intersecting names. The images below show the two pairs. The first is the correct pair, identified when some content from the story is compared: the title in case 1, the first half of the story in case 2, and the full story in case 3. In case 4, however, we compare only names when calculating the total similarity. The next figure depicts such a case.


The number of proper-name matches in the true pair was six ("इरान" (Iran) occurring six times in the source story and nine times in the target), whereas in the mismatching pair "इरान" (Iran) occurred six times in each story and "तेहरान" (Tehran) occurred twice, making the total number of proper names common to both eight. When the comparison is based on names only, the pair with the higher number of common names achieves a higher total.

Topic identification could be one possible way to reduce such false positives. Classifying news stories into independent fields by identifying their context can group similar news together. In that case, the above two stories would belong to separate clusters, and penalizing matches across dissimilar clusters could reduce false positives like this to a great degree.
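The proposed cluster-based penalty could be sketched as follows; the cluster labels and penalty factor are hypothetical, since topic identification was not implemented in the present system.

```python
def cluster_adjusted(similarity, source_cluster, target_cluster,
                     penalty_factor=0.5):
    """Scale down the similarity of pairs whose stories fall in
    different topic clusters (penalty factor is illustrative)."""
    if source_cluster != target_cluster:
        return similarity * penalty_factor
    return similarity
```

Under this sketch, a sports story and a politics story that happen to share a frequent name would have their similarity halved, while same-cluster pairs are unaffected.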


In addition, we identified many pairs where matching proper names had a large effect in mismatching stories. We discovered that most of them had a negligible similarity value in their story text. To suppress pairs whose total similarity value results from similarity in proper names alone, we rejected pairs with very low similarity in the story. In this way, a pair needs some correspondence in all the measures applied in order to enter the final list.
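The story-similarity filter described above can be sketched as follows; the floor and threshold values are illustrative.

```python
# Illustrative floor: pairs whose story similarity is below this are
# rejected regardless of how strong the proper-name match is.
STORY_FLOOR = 0.1

def accept_pair(name_sim, story_sim, total_sim, threshold=0.5):
    """Require some correspondence in the story itself, not just in
    proper names, before accepting a pair."""
    if story_sim < STORY_FLOOR:
        return False
    return total_sim >= threshold
```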

5.3.2.4 Effect of similar dictionary words in stories on a similar topic yet describing different events

This happened when the names either all mismatched or their effect was negligible, while a considerable number of common dictionary translations appeared in both stories. For instance, separate accidents occurring at different places, with coincidental matches of dictionary words and even names, resulted in a higher similarity value from the story and/or title.

Analysis of some of the pairs revealed that the dictionary too was responsible for erroneous matches: a translation found in the dictionary was actually a stopword, for the news domain or in general, yet was included in the final similarity calculation. For example, the DL step translated the Hindi word "अब" to "now", and a story pair containing this common word got a boost in the correspondence value. In particular, our test data included a pair with "now" in the title itself, which helped the pair cross the minimum similarity levels.

Below is a snapshot of a case where the words "train" and "rail" were prominent in both stories, which describe different events. The pair is not comparable, yet it was marked positive by the system.


We tuned the weights associated with the similarities to minimize the effect of matches in dictionary words alone. In titles particularly, small coincidental matches led to a large boost in similarity. To suppress this effect, at matching time we rechecked the target-language stopword list to filter out any such words, which reduced such accidental matches in the final evaluations.
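The stopword re-check can be sketched as follows; the stopword list here is a tiny illustrative sample, not the list used in the system.

```python
# Tiny illustrative sample of a target-language stopword list.
STOPWORDS_EN = {"now", "the", "a", "is", "of"}

def filter_translations(translations):
    """Drop dictionary translations that are target-language
    stopwords before they enter the similarity calculation."""
    return [t for t in translations if t.lower() not in STOPWORDS_EN]
```

For instance, the translation "now" (from Hindi "अब") would be removed, while content words like "train" and "rail" are kept.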

    5.3.2.5 Effect of wrongly identified pair of names

This was the result of errors in the TS stage, with some mismatching pair crossing the threshold and marked "true" by the system. It occurred when a proper name identified by the system in the target language matched some source-language word. Considering the target word a transliteration of the source word, the system included it in the final list of identified names. For instance, "बु कग" (booking) in Hindi was transformed into the name "Viking" after passing the TS step; the corresponding word booking, not being a proper name, was not present in the English named-entity list. The example shows a case where the Hindi word "ली" (to take) gets mapped to "Lee", and comes closer to the name "Lee" present in the target story.

To suppress such mismatches, we assigned higher threshold levels for similarity in TS. We tuned the thresholds on the length of the compared strings, requiring higher thresholds for longer strings.
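A length-dependent threshold for transliteration similarity (TS) might look like the following sketch; the similarity measure (difflib's ratio) and the threshold values are assumptions, not the measure used in the thesis.

```python
import difflib

def ts_accept(source_roman, target_name):
    """Accept a romanized source word as matching a target name only
    if their string similarity clears a length-dependent threshold:
    longer strings must match more closely."""
    sim = difflib.SequenceMatcher(None, source_roman.lower(),
                                  target_name.lower()).ratio()
    n = max(len(source_roman), len(target_name))
    threshold = 0.7 if n <= 4 else 0.8 if n <= 8 else 0.9
    return sim >= threshold
```

Under this sketch, "booking" versus "Viking" scores well below the 0.8 required for seven-character strings and is rejected, while an exact short match like "lee" versus "Lee" is accepted.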

    5.3.2.6 Effect of poor dictionary

The bilingual dictionary used for this experiment is still under development; at the time of our experiments it contained 24,824 Hindi words. Inclusion of more source-language words will lead to more translations being identified in the DL step, escalating the effect of similarity in dictionary words and thus resulting in the identification of more similar pairs.

    5.4 Improvements

To achieve a further improvement in the results, we adopted some additional techniques. First, we adopted a three-level weight system for the English names found in the stories: the highest weight for proper names identified by both the LingPipe code and our custom-made NER, followed by names identified only by LingPipe, followed by names found only by our recognizer. This ensured the maximum effect for high-confidence names identified by both systems for English. When finding intersecting names, we checked the weight of the identified Hindi name in the English name collection, and the total similarity was a function of that weight.
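The three-level weight system can be sketched as follows; the concrete weights 3, 2, and 1 are illustrative.

```python
def name_weight(found_by_lingpipe, found_by_custom):
    """Three-level weight for an English proper name, highest when
    both LingPipe and the custom NER agree (weights illustrative)."""
    if found_by_lingpipe and found_by_custom:
        return 3  # high-confidence: both recognizers agree
    if found_by_lingpipe:
        return 2  # identified by LingPipe only
    if found_by_custom:
        return 1  # identified by the custom recognizer only
    return 0
```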

In addition, a deeper analysis of mismatching pairs revealed a common trend: stories came closer due to high similarity in names but had significantly low correspondence in the actual news story. We reduced the effect of such mismatches by rejecting any pair with a low similarity value for the news story. As expected, the system responded positively, and we obtained much better results than the previous run on the same sample set.

Figure 7 shows the results obtained with the improved version of our system on a larger collection: the number of top retrieved documents, taken 10 at a time, versus precision. This collection included Hindi news for a period of 35 days with a supporting target collection in English. The graph met our expectations and shows better precision for the case where the whole story is included for document-level alignment, contrary to Figure 6, where including only the title with the proper names for similarity calculation gave better precision. The precision values for the top 70 retrieved pairs are given in Table 5.


Figure 7
Number of documents retrieved versus precision obtained for the improved system (curves: N+T, N+T+HS, N+T+FS, N)

Table 5

CASE     P@10   P@20   P@30    P@40   P@50   P@60    P@70
N+T      100    100    96.67   97.5   92     90      87.14
N+T+HS   100    100    96.67   97.5   96     90      87.14
N+T+FS   100    100    100     100    96     90      85.71
N        100    95     96.67   90     82     73.33   70

Number of retrieved documents versus precision values for top 70 pairs for the improved case

Thus, based on the improved results, we restate our observation in accordance with the general perception that the former half of a news article furnishes the largest part of the essential information conveyed. However, it is consideration of the whole story that results in better identification of similar stories. Detecting association in a source-target pair through intersecting proper names is indispensable, but it should not be the only criterion, especially for languages like Hindi that are highly undersupplied with valuable cross-language resources.


Chapter 6

    Conclusion

The size of the corpus obtained so far is not yet adequate for appreciable use in the applicable areas of research and technology. At the time of writing this thesis, we were able to generate a repository of more than 100 true pairs of comparable stories. We aim to accumulate many more such pairs, aligned at the document level, so that the corpus becomes helpful to researchers everywhere. In addition, the present work is for the language pair Hindi-English. With language-specific resources in hand, it is possible to extend the system to generate a multilingual corpus in the participating languages. The resources for the Bengali language, the second most spoken language in India [Languages], are under development, and we wish to extend the present bilingual corpus to a trilingual corpus (English-Hindi-Bengali).

The present system crawls news articles from only one website for each language. NavBharatTimes lacks archives, so a story missed on a particular day might never reach the final collection. We found Bhaskar.com to contain archives of news articles published in preceding years. We wish to extend the present system to many more websites, like Bhaskar.com, to extract a large amount of comparable data.

The next step of the project is to align the identified comparable documents at the sentence level, thus gathering comparable sentences from among the similar documents.

This project is a foundation stone for progress in numerous areas associated directly or indirectly with natural language processing. Appreciable amounts of the collected corpus will act as a catalyst for research in the languages we built it for, and in many related languages. We hope this effort helps the Indian community in our purpose of shrinking the gap between English and prominent Indian languages like Hindi and Bengali, and makes information access easier.


6.1 Future work

We did not include word sense disambiguation in the present system. Many words have different meanings depending on the surrounding words in the sentence. For instance, "bank" can refer to a financial institution as well as the bank of a river; in some places the word is used as a synonym for trust. A dissimilar pair using the word in different senses can be brought closer, harming precision.

Though the precision is acceptable, the recall of the present system is low. This results from the cascading effect of the inefficiency of the named entity recognizers in both languages on the overall similarity. A significant number of documents had a low percentage of proper names mined by the recognizers in both languages. Improving the named entity recognition systems so that they extract most of the named entities will help positive pairs attain higher similarity values, thus reducing the number of false positives and helping many correctly identified pairs, whose total association value is below the threshold, to cross it.


Bibliography

[Arora] Karunesh Kr. Arora, Sunita Arora, Vijay Gugnani, V N Shukla, Dr S S Agrawal, GyanNidhi: A Parallel Corpus for Indian Languages including Nepali.

    [Shinyama] Yusuke Shinyama, Satoshi Sekine, Paraphrase Acquisition For Information

    Extraction, Computer Science Department, New York University, Proceedings of the

    second international workshop on Paraphrasing - Volume 16, Sapporo, Japan. Pages: 65 -

    71 Year of Publication: 2003.

    [Resnik] P Resnik, NA Smith, Web as a Parallel Corpus, Resnik - Computational

    Linguistics, Vol. 29, No. 3, Pages 349-380, Year of Publication: 2003.

    [Somers] Harold Somers, Bilingual Parallel Corpora and Language Engineering,

    Department of Language Engineering, UMIST, Manchester, England.

[Almeida] José João Almeida, Alberto Manuel Simões, José Alves de Castro, Grabbing parallel corpora from the Web, Departamento de Informática, Universidade do Minho.

[Ma] Xiaoyi Ma and Mark Liberman, BITS: A method for bilingual text search over the web, in Machine Translation Summit VII, Year of Publication: 1999.

[Barzilay] Regina Barzilay, Noemie Elhadad, Sentence alignment for monolingual comparable corpora, Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, Volume 10, Pages: 25-32, Year of Publication: 2003.

    [Mandal] Debasis Mandal, Sandipan Dandapat, Mayank Gupta, Pratyush Banerjee,

    Sudeshna Sarkar, Bengali and Hindi to English Cross-language Text Retrieval under

    Limited Resources, In Working Notes for the CLEF 2007 Workshop (2007).

[Liddy] Elizabeth D. Liddy, How Might CLIR Be Accomplished, ASIST Annual Meeting, Chicago, IL, 13 November 2000. http://www.cnlp.org/presentations/slides/CLIR.pdf

    [Ethnologue] Ethnologue list of most spoken languages in the world

    http://en.wikipedia.org/wiki/Ethnologue_list_of_most_spoken_languages

    [Languages] Languages of India http://languages.iloveindia.com/

    [Arist] Arist Chapter, Douglas W. Oard & Anne R. Diekema, Cross-Language

    Information Retrieval.



[McEnery] Anthony McEnery & Zhonghua Xiao, Parallel and comparable corpora: What are they up to?

    [Maia] Belinda Maia, Creating parallel and comparable corpora for work in domain

    specific areas of language, FLUP.

    [Peters] Carol Peters, Multilingual Information Access for Digital Libraries, ISTI-CNR,

    Pisa

    [CNNIC] Serving the Needs of the Community — IDN and Alternatives

    [Yale] Mohan Yale, Building a Sustainable Framework for a Multilingual Internet,

    Internationalized Domain Names (IDNs), A2K2 Conference, New Haven, USA April 27,

    2007

Tools and resources:

LingPipe home: http://alias-i.com/lingpipe/