Information Extraction from Myanmar Text using Conditional Random Fields

Win Thuzar Kyaw, Khin Mar Soe, and Hla Hla Htay
University of Computer Studies, Yangon, Republic of the Union of Myanmar (e-mail: [email protected])

3rd International Conference on Computational Techniques and Artificial Intelligence (ICCTAI'2014), Feb. 11-12, 2014, Singapore

Abstract— Information Extraction (IE) is the process of extracting structured information from unstructured or semi-structured text by annotating semantic information. It is becoming more and more important because of the explosion of unstructured electronic text on the Web, such as newswire, blogs, and e-mail communications. IE is applied in a variety of application areas such as the news, financial, and biomedical domains. Moreover, it serves as a key component of Natural Language Processing (NLP) tasks including Automatic Machine Translation, Automatic Text Summarization, and Question Answering. Because of the effects of global warming, natural disasters are occurring around the world in growing numbers. To support data analysis, comparisons, and decisions at the management level, it is important to view natural disaster news in summary form. Automatic Myanmar Text Summarization saves time by giving control over the information flood, and it removes the tedious job of manually extracting the main facts from text documents. In text summarization, it is beneficial to pass the templates produced by template-driven Information Extraction to the summary generation module. Thus, this paper presents Information Extraction based on Conditional Random Fields (CRFs), treating the IE task as a sequence labeling task, to serve as a component of an Automatic Myanmar Text Summarization System. The proposed approach is evaluated on different kinds of information and achieves an F-measure of 0.82.

Keywords— Information Extraction, Automatic Text Summarization, Conditional Random Fields

I. INTRODUCTION

INFORMATION Extraction (IE) can be defined as locating instances of facts in unstructured or semi-structured text such as Web pages, news articles, calls for papers (CFP), e-mail, blogs, and so on, and producing them in a structured representation such as a relational database. It can also be described as a template filling process, in which a predefined set of slots in a template for a specific domain is filled with suitable values from the text and the filled template is delivered. Information Extraction is an element of other natural language processing applications including Automatic Text Summarization, Question Answering (QA), Machine Translation, and document indexing.

Information Extraction (IE) is mainly used for Named Entity Recognition (NER), which can stand independently as an application or serve as a component of other natural language processing applications such as Machine Translation and Question Answering. NER means the identification of proper nouns (names of organizations, persons, locations, etc.) as well as dates, identification numbers, phone numbers, e-mail addresses, and so on.

IE is used in a variety of domains. In biomedical mining, it is important to identify genes, proteins, or other biomedical entities automatically from a large set of scientific publications. Similarly, intelligence analysts need to extract information about terrorism events, the people involved, the weapons used, and the targets of the events automatically from a huge volume of text documents. Information Extraction can also assist the advanced search technology of search engines, such as entity search, structured search, and question answering. In addition, Information Extraction is used to automatically update a natural disaster database by extracting relevant facts from natural disaster news reports.
The intention of both Information Extraction and Text Summarization is to extract the applicable facts from documents of interest to the user. However, the way the output is presented to the user differs. IE delivers its output as templates or structured information (e.g., databases). Text Summarization, on the other hand, presents its output as a summary, either running text or a visualized form such as tables, charts, or graphs. Because the goal of both tasks is to retrieve relevant facts, passing the templates obtained from template-based IE to a summary generation task that builds summaries from templates enhances the summarization job.

The Information Extraction process can be performed with various methodologies. The three most popular supervised machine learning approaches for the IE task are the rule learning based method, the classification model based method, and the sequential labeling based method. These methods have two main phases: training and extraction. The training stage treats the input text as a sequence of words and produces a model that identifies the subsequences of words that need to be extracted. In the extraction stage, the resulting model is used to extract the data and annotate it as particular information according to the predefined metadata.

The rest of the paper is organized as follows. In Section II, some related work on information extraction is described. Section III reviews Information Extraction approaches. In Section IV, a detailed explanation of the proposed approach is presented. Experimental results and the evaluation of the proposed approach are presented in Section V. The paper is concluded with a summary and future work directions in Section VI.


II. RELATED WORK

C. Zhang et al. implemented automatic keyword extraction from documents based on Conditional Random Fields by treating keyword extraction as a string labeling task [4]. They compared their performance with other machine learning methods, such as support vector machines and multiple linear regression, and found that CRF performs better than those methods.

K. M. Schneider applied Conditional Random Fields, regarding the information extraction problem as a token classification task, to extract important information such as conference names, titles, dates, locations, and submission deadlines from calls for papers (CFP) about academic conferences, workshops, etc. received via e-mail [11]. Generic token classes, domain-specific dictionaries, and layout features were used together, and the layout features were found to improve accuracy.

F. Peng and A. McCallum applied Conditional Random Fields (CRFs) to the task of extracting various common fields from the headers and citations of research papers [8]. They described a large collection of experimental results on two traditional benchmark data sets, obtaining dramatic improvements over previous SVM and HMM based results.

A. T. Valero et al. proposed a system that uses a machine learning approach to process online news reports and automatically populate a natural disaster database [3]. Although their system can be easily adapted to specific domains and languages, because it uses only lexical features without depending on complex syntactic attributes, it has two drawbacks. First, it cannot extract information from documents that describe more than one disaster. Second, the system cannot group the extracted data about the same natural disaster from different documents. They evaluated their system using Boolean and term weightings with the SVM, Naïve Bayes, and C4.5 learning algorithms, and found that the combination of Boolean weighting and the SVM algorithm achieved better accuracy than the other two supervised learning approaches.

M. Hatmi et al. proposed a French named entity recognition (NER) system based on a multi-level methodology using conditional random fields (CRFs) [12]. In order to handle structured tagging, they defined three levels of annotation. The first level annotates the 32 categories in a flat way. The second level deals with the annotation of components. The last level allows overlapping annotation when one category includes another. They trained a CRF model for each level of annotation using the CRF++ toolkit, an open source implementation of CRFs, to implement the different models.

III. INFORMATION EXTRACTION APPROACHES

A. Rule Learning based Approach

This approach can be categorized into three groups: the dictionary based method, the rule based method, and wrapper induction. Traditional information extraction systems, also called pattern based systems, construct a pattern (template) dictionary and then use the dictionary to extract the needed information from new, untagged text. AutoSlog [6], AutoSlog-TS [7], and CRYSTAL [14] are dictionary based extraction systems.

In the rule based method, information extraction grammars are developed manually by linguistic and domain experts to recognize entities or relations. There are two main rule learning algorithms in rule based information systems: the bottom-up method, which learns rules from special cases to general ones, and the top-down method, which learns rules from general cases to special ones.

Wrapper induction is another type of rule based method. A wrapper is an extraction procedure that consists of a set of extraction rules together with the program code required to apply those rules. Wrapper induction is a technique for learning wrappers automatically: given a training data set, the induction algorithm learns a wrapper for extracting the target information. Typical wrapper systems include WIEN [13], Stalker [9], and BWI [5].

Although this approach is simple and, with skill and experience, fast to construct, the collection and maintenance of rules is a laborious and tedious process, and it cannot resolve the ambiguity caused by the variety of forms and contexts in source text.

B. Classification Model based Approach

This method formalizes the IE problem as a classification problem. One of the most popular classification methods is Support Vector Machines (SVMs). This type of IE system has two distinct phases: learning and extraction. In the learning phase, the system uses a set of labeled documents to generate models that can be used for future predictions. The extraction phase applies the learned models to new, unlabelled documents to generate extractions. Many approaches can be used to train the classification models, for example Support Vector Machines [15] and Maximum Entropy [1]. This approach therefore has more generalization capability than the rule based method, and in several real-world applications it can outperform it. Its drawback is that the model is usually complex and difficult for a general user to understand (e.g., the feature definitions), so extraction performance differs from application to application.
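To make the learning/extraction split concrete, the following is a minimal sketch of classification-based IE, not the system described in this paper: each token is classified independently into an IOB2 tag from a few lexical features with a linear SVM. The feature names and toy data are illustrative assumptions.

```python
# Minimal sketch of classification-based IE: each token is classified
# independently using simple lexical features and a linear SVM.
# The features and toy data below are illustrative assumptions only.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def token_features(tokens, i):
    """Lexical features of the i-th token and its immediate neighbors."""
    return {
        "word": tokens[i],
        "prev": tokens[i - 1] if i > 0 else "<BOS>",
        "next": tokens[i + 1] if i + 1 < len(tokens) else "<EOS>",
        "is_digit": tokens[i].isdigit(),
    }

# Learning phase: labeled tokens -> model.
tokens = ["Earthquake", "struck", "Bago", "on", "Monday"]
tags = ["O", "O", "B-Place", "O", "B-Date"]
model = make_pipeline(DictVectorizer(), LinearSVC())
model.fit([token_features(tokens, i) for i in range(len(tokens))], tags)

# Extraction phase: the learned model labels tokens of a new document.
print(model.predict([token_features(tokens, 2)]))  # expected: ['B-Place']
```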

C. Sequence Labeling based Approach

IE tasks can also be treated as a sequence labeling problem, in which each word (token) in the text is annotated with a tag chosen from a predefined tag set using statistical sequence models such as Hidden Markov Models (HMMs) [16], Maximum Entropy Markov Models (MEMMs) [2], and Conditional Random Fields (CRFs) [10]. HMMs have been successfully applied to Information Extraction by treating IE as a sequence labeling task. However, an HMM models the joint probability distribution P(label sequence y, observation sequence x) and cannot represent overlapping or non-independent features of the observed elements. The Conditional Random Field (CRF), a discriminative conditional model P(label sequence y | observation sequence x), has proven advantageous over the HMM. CRFs allow arbitrary, non-independent features on the observation sequence X. They solve the problem of complex dependencies, the main difficulty of Hidden Markov Models (HMMs), and avoid the label bias problem, which is a weakness of Maximum Entropy Markov Models (MEMMs). Statistical approaches need a large amount of training data, which is very expensive to acquire, and re-annotation of large quantities of training data is required when the specification changes. However, no system expertise is required for such a change, and domain portability is quite easy.

IV. INFORMATION EXTRACTION USING CRF

A. Introduction to CRF

A Conditional Random Field (CRF), a variant of the Markov random network, can be viewed as an undirected graphical model. It combines classification and graphical modeling for segmenting and labeling sequential data, and it has therefore been widely used in many natural language processing tasks. Because a CRF is simply a conditional probability distribution, it avoids the problem of complex dependencies, the main difficulty of Hidden Markov Models (HMMs), which define a joint probability distribution. Moreover, a CRF avoids the label bias problem, a weakness of Maximum Entropy Markov Models (MEMMs).

Let D = (D1, D2, …, Dn) be the observation sequence and L = (L1, L2, …, Ln) be the corresponding label sequence. A linear-chain CRF can be defined as

P(L \mid D) = \frac{1}{Z_D} \exp\Big( \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(l_{i-1}, l_i, D, i) \Big)   (1)

in which Z_D is a normalization factor, which can be defined as

Z_D = \sum_{L} \exp\Big( \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(l_{i-1}, l_i, D, i) \Big)   (2)

and f_j(l_{i-1}, l_i, D, i) is a feature function with \lambda_j the weight for feature f_j.
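To make Eqs. (1) and (2) concrete, the toy sketch below scores every possible labeling of a three-token sequence by brute-force enumeration, exactly as the normalization factor Z_D prescribes. All labels, features, weights, and data here are invented for illustration.

```python
import itertools
import math

# Toy illustration of Eqs. (1)-(2): a linear-chain CRF over two labels with
# two hand-set feature functions. All values here are invented assumptions.
LABELS = ["O", "Place"]
D = ["quake", "hit", "bago"]          # observation sequence

def f0(prev, cur, D, i):              # word-identity feature
    return 1.0 if cur == "Place" and D[i] == "bago" else 0.0

def f1(prev, cur, D, i):              # label-transition feature
    return 1.0 if prev == "O" and cur == "Place" else 0.0

FEATURES = [(f0, 2.0), (f1, 1.0)]     # (feature function f_j, weight lambda_j)

def score(labels):
    """sum_i sum_j lambda_j * f_j(l_{i-1}, l_i, D, i), with a start symbol."""
    return sum(w * f(labels[i - 1] if i > 0 else "<S>", labels[i], D, i)
               for i in range(len(D)) for f, w in FEATURES)

# Z_D of Eq. (2): sum over all |LABELS|^n labelings.
all_labelings = list(itertools.product(LABELS, repeat=len(D)))
Z = sum(math.exp(score(l)) for l in all_labelings)

# P(L|D) of Eq. (1) for the most probable labeling.
best = max(all_labelings, key=score)
print(best, math.exp(score(best)) / Z)
```

In practice the exponential enumeration is replaced by dynamic programming (forward-backward and Viterbi), which is what toolkits such as CRF++ implement.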

B. CRF based Information Extraction Process

1) Preprocessing

Myanmar natural disaster news articles collected from official Myanmar newspapers are accepted as input. In the Myanmar language, as in other Asian languages such as Japanese, Thai, and Korean, there are no boundaries between words. Word segmentation is therefore an important preprocessing stage for most Natural Language Processing applications. Before word segmentation, syllabification is performed: the process of grouping characters into syllables, where a syllable is a unit of sound composed of a central peak of sonority (usually a vowel) and the consonants that cluster around this central peak. The syllabification phase is required because word segmentation works better with syllables than with characters. After preprocessing, segmented words are produced as output.
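As a rough illustration of the syllabification step, the sketch below approximates Myanmar syllables with a single regular expression over the Myanmar Unicode block. Real syllabifiers use considerably more complete rules, so this is a simplification under stated assumptions, not the system's actual algorithm.

```python
import re

# Simplified sketch of Myanmar syllabification: a syllable is approximated as
# one base letter (U+1000-U+102A) followed by any run of dependent signs
# (vowel signs, medials, tone marks) or a stacked consonant introduced by the
# virama U+1039. Real syllabification rules are more involved than this.
SYLLABLE = re.compile(
    r"[\u1000-\u102A]"                              # base letter
    r"(?:\u1039[\u1000-\u1021]|[\u102B-\u1038\u103A-\u103E])*"
)

def syllabify(text: str) -> list[str]:
    """Return the approximate syllables of a Myanmar string, in order."""
    return SYLLABLE.findall(text)
```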

2) Training for Information Extraction Model

We used the CRF++ tool, a simple, customizable, open source implementation of Conditional Random Fields (CRFs) for segmenting and labeling sequential data, treating Information Extraction as a sequence labeling task. To train a CRF model, we have to develop manually annotated data. In this annotation, each word carrying information that needs to be extracted is tagged with the type of that information. For example, the sequence of words naming the location of a natural disaster is tagged "B-Place", "I-Place", …, "I-Place", marking the place of the disaster with the tag 'Place'. The tags are described in IOB2 format: 'B' marks the beginning word of the information, 'I' marks a word inside (continuing) the information, and 'O' marks a word that is not part of any answer tag. The tag sets for each type of information are described in Table I.

TABLE I
TAG SET OF THE RESPECTIVE NATURAL DISASTERS FOR THE CRF MODEL TO EXTRACT REQUIRED INFORMATION

Types of Natural Disasters   Tags
Earthquake                   Date, Time, Place, Magnitude, Epicenter, Latitude, Longitude, Depth, Fatalities, Injuries, H, Damage
Landslides                   Date, Time, Place, Cause, Volume, Fatalities, Injuries, Missing People, Damage
Floods                       Date, Time, Place, Cause, Rainfall, Fatalities, Injuries, Missing People, Damage
Volcanic Eruption            Date, Time, Place, Volcano, Fatalities, Injuries, Missing People, Damage
Forest Fire                  Date, Time, Place, Size, Fatalities, Injuries, Missing People, Damage
Tornado                      Date, Time, Place, Fatalities, Injuries, Missing People, Damage
Storms                       Date, Time, Place, Name, Type, Rate, Fatalities, Injuries, Missing People, Damage

In CRF++, feature functions are produced according to predefined feature templates. The templates defined in our system are shown in Table II. In a template, the first character 'U' denotes a unigram template and 'B' denotes a bigram template. The macro %x[row,col] specifies a token in the input data: row specifies the position relative to the current token in focus, and col specifies the absolute position of the column. Here, the current word, its neighboring words, and combinations of the current word with its neighboring words are used as features.


TABLE II
FEATURE TEMPLATE

# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-2,0]/%x[-1,0]/%x[0,0]
U06:%x[-1,0]/%x[0,0]/%x[1,0]
U07:%x[0,0]/%x[1,0]/%x[2,0]
U08:%x[-1,0]/%x[0,0]
U09:%x[0,0]/%x[1,0]

# Bigram
B
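As a sketch of the %x[row,col] convention, the snippet below shows how a unigram template such as U05:%x[-2,0]/%x[-1,0]/%x[0,0] expands into a concrete feature string for the token in focus. The token names and the boundary placeholders are illustrative assumptions.

```python
# Sketch: expanding a CRF++-style unigram template such as
# U05:%x[-2,0]/%x[-1,0]/%x[0,0] for the token at position i. Column 0 holds
# the word itself; out-of-range positions get boundary placeholder strings.
def expand(offsets, column0, i):
    def cell(rel):  # value of %x[rel,0]
        j = i + rel
        return column0[j] if 0 <= j < len(column0) else f"_B{rel:+d}"
    return "/".join(cell(rel) for rel in offsets)

words = ["w0", "w1", "w2", "w3", "w4"]
print(expand([-2, -1, 0], words, 2))  # U05 at i=2 -> "w0/w1/w2"
print(expand([-2, -1, 0], words, 0))  # at the start -> "_B-2/_B-1/w0"
```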

After the training phase, the CRF model for information extraction is ready for use in the testing phase.
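Concretely, training and testing with CRF++ can be driven as sketched below. The file names are placeholders; the training file holds one word per line followed by its IOB2 tag, with blank lines separating sentences.

```python
import subprocess

# Train a CRF model with CRF++ from the feature template of Table II and a
# manually annotated corpus, then tag an unseen file. File names are
# placeholders: train.data holds one token per line followed by its IOB2 tag
# ("B-Place", "I-Place", "O", ...), with blank lines between sentences.
subprocess.run(["crf_learn", "template", "train.data", "model"], check=True)

# crf_test appends a predicted-tag column to each token line of test.data.
result = subprocess.run(["crf_test", "-m", "model", "test.data"],
                        check=True, capture_output=True, text=True)
print(result.stdout)
```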

3) Information Extraction as Template
After new natural disaster news articles are accepted, they are preprocessed and their features are extracted. Predicted answer tags are then produced using the CRF model obtained from the training stage. The information extracted from a sample earthquake report can be depicted as a template, as in Table III.

TABLE III
TEMPLATE OF EARTHQUAKE NEWS
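A minimal sketch of this template-filling step, assuming the (word, tag) pairs produced by the model as input, might look as follows; the slot names follow Table I, and the input format is an assumption rather than the paper's actual code.

```python
# Sketch: turn IOB2-tagged tokens from the CRF model into a slot -> value
# template (Table III style). Input format is an assumption: (word, tag)
# pairs such as [("...", "B-Place"), ("...", "I-Place"), ("...", "O"), ...].
def fill_template(tagged_tokens):
    template, current = {}, None
    for word, tag in tagged_tokens:
        if tag.startswith("B-"):            # start a new span for this slot
            current = tag[2:]
            template.setdefault(current, []).append(word)
        elif tag.startswith("I-") and current == tag[2:]:
            template[current][-1] += word   # Myanmar script joins w/o spaces
        else:                               # 'O' or a stray 'I-' tag
            current = None
    return {slot: "; ".join(values) for slot, values in template.items()}
```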

V. EXPERIMENTAL RESULTS

A. Natural Disaster News Corpus

We collected around 600 Myanmar news articles about seven types of natural disasters (earthquakes, floods, landslides, volcanic eruptions, tornadoes, hurricanes, and wildfires) from official Myanmar newspapers. Since CRF is a supervised machine learning approach, the collected news had to be tagged manually.

B. Evaluation Measures

The results of Information Extraction based on CRF are evaluated with the standard measures used in Information Retrieval evaluation. Precision, Recall, and F-Measure are defined as follows:

\text{Precision} = \frac{\text{number of correctly extracted items}}{\text{total number of extracted items}}   (3)

\text{Recall} = \frac{\text{number of correctly extracted items}}{\text{total number of items that should be extracted}}   (4)

\text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}   (5)
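As a quick sketch, the three measures of Eqs. (3)-(5) can be computed from raw counts; the count names and the example numbers below are assumptions for illustration, not the paper's bookkeeping.

```python
# Sketch of Eqs. (3)-(5): compute Precision, Recall, and F-Measure from the
# number of correctly extracted items, the total number extracted, and the
# number of items that should have been extracted (the gold standard).
def precision_recall_f(correct: int, extracted: int, relevant: int):
    precision = correct / extracted if extracted else 0.0
    recall = correct / relevant if relevant else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

print(precision_recall_f(82, 102, 95))  # toy counts -> (0.80..., 0.86..., 0.83...)
```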

C. Experimental Results and Discussion

TABLE IV
EVALUATION RESULTS FOR DIFFERENT TYPES OF INFORMATION

Type of Information       Precision  Recall  F-Measure
Date of Disaster          0.96       0.96    0.96
Time of Disaster          0.93       1.00    0.96
Place of Disaster         0.92       0.98    0.95
Fatalities                0.95       0.98    0.96
Injuries                  0.44       0.47    0.46
Missing People            0.46       0.67    0.54
Physical Damage           0.82       0.93    0.87
Magnitude (Earthquake)    0.95       0.95    0.95
Epicenter (Earthquake)    0.75       0.86    0.80
Latitude (Earthquake)     0.93       1.00    0.97
Longitude (Earthquake)    0.93       1.00    0.97
Depth (Earthquake)        0.80       0.50    0.62
Cause (Landslide)         0.87       1.00    0.93
Name of Volcano           0.80       0.80    0.80
Size (Forest Fire)        0.60       0.75    0.67
Type (Storms)             0.92       1.00    0.96
Rate (Storms)             0.71       0.83    0.77
Name of Storms            0.64       0.69    0.67
Average                   0.80       0.86    0.82

Table IV shows the evaluation results for the different types of extracted information. As the table shows, there is a variety of categories, and some categories are quite similar to one another; for these similar kinds of information, the performance of the approach is poor. For example, the values of the attributes fatalities, injuries, and missing people are all cardinalities. Thus, although fatalities are extracted well, the precision for injuries and missing people is poor, because fatalities occur far more often in the training set than the other two. To improve this performance, it will be necessary to collect a larger training corpus. It is interesting to note that for all categories the recall rates are better than the precision scores. This indicates that our system can extract most of the relevant information from the natural disaster news, but that it also extracts some irrelevant data.

VI. CONCLUSION AND FUTURE WORK

Using the templates produced by template-driven Information Extraction in the summary generation module enhances an automatic text summarization system. This paper proposed Information Extraction from Myanmar text using Conditional Random Fields, a supervised machine learning approach, by treating information extraction as a sequence labeling task. The proposed approach was applied to natural disaster news collected from official Myanmar newspapers. About 600 news articles across seven disaster types were used as training data, and over 80 documents covering the same seven disaster types were used for testing. The experimental results are reported by type of extracted information, with an overall F-measure of 0.82. Since the approach is based on lexical features without complex syntactic attributes, it is easily adaptable to specific domains and languages. To achieve higher performance, a bigger training set and additional features such as part-of-speech tags are needed.

REFERENCES

[1] A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra, "A maximum entropy approach to natural language processing," Computational Linguistics, vol. 22, pp. 39-71, 1996.
[2] A. McCallum, D. Freitag, and F. Pereira, "Maximum Entropy Markov Models for information extraction and segmentation," in Proceedings of the 17th International Conference on Machine Learning (ICML'00), 2000, pp. 591-598.
[3] A. T. Valero, M. M. Gomez, and L. V. Pineda, "Using machine learning for extracting information from natural disaster news reports," Computación y Sistemas, vol. 13, no. 1, pp. 33-44, 2009, ISSN 1405-5546.
[4] C. Zhang, H. Wang, Y. Liu, D. Wu, Y. Liao, and B. Wang, "Automatic keyword extraction from documents using Conditional Random Fields," Journal of Computational Information Systems, vol. 4, no. 3, pp. 1169-1180, 2008.
[5] D. Freitag and N. Kushmerick, "Boosted wrapper induction," in Proceedings of the 17th National Conference on Artificial Intelligence, 2000, pp. 577-58.
[6] E. Riloff, "Automatically constructing a dictionary for information extraction tasks," in Proceedings of the Eleventh National Conference on Artificial Intelligence, 1993, pp. 811-816.
[7] E. Riloff, "Automatically generating extraction patterns from untagged text," in Proceedings of the Thirteenth National Conference on Artificial Intelligence, 1996, pp. 1044-1049.
[8] F. Peng and A. McCallum, "Accurate information extraction from research papers using Conditional Random Fields," in HLT-NAACL, 2004, pp. 329-336.
[9] I. Muslea, S. Minton, and C. Knoblock, "STALKER: Learning extraction rules for semi-structured, web-based information sources," in AAAI Workshop on AI and Information Integration, 1998, pp. 74-81.
[10] J. Lafferty, A. McCallum, and F. Pereira, "Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data," in Proceedings of the 18th International Conference on Machine Learning (ICML'01), 2001, pp. 282-289.
[11] K. M. Schneider, "Information extraction from calls for papers with Conditional Random Fields and layout features," Artificial Intelligence Review, vol. 25, no. 1-2, pp. 67-77, April 2006.
[12] M. Hatmi, C. Jacquin, E. Morin, and S. Meignier, "Named entity recognition in speech transcripts following an extended taxonomy," in Proceedings of the First Workshop on Speech, Language and Audio in Multimedia (SLAM), Marseille, France, August 22-23, 2013.
[13] N. Kushmerick, D. S. Weld, and R. Doorenbos, "Wrapper induction for information extraction," in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI'97), 1997, pp. 729-737.
[14] S. Soderland, D. Fisher, J. Aseltine, and W. Lehnert, "CRYSTAL: Inducing a conceptual dictionary," in Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI'95), 1995, pp. 1314-1319.
[15] V. Vapnik, Statistical Learning Theory. New York: Springer-Verlag, 1998.
[16] Z. Ghahramani and M. I. Jordan, "Factorial Hidden Markov Models," Machine Learning, vol. 29, pp. 245-273, 1997.
