EVENT AND TEMPORAL ENTITY
EXTRACTION IN URDU LANGUAGE
TEXT
By
DALER ALI
A thesis submitted in partial fulfilment of the requirements
for the degree of
DOCTOR OF PHILOSOPHY
In
COMPUTER SCIENCE
Fall 2016-2019
Department of Information Technology
The Islamia University of Bahawalpur, Pakistan
Dedication
To the ALLAH Subhanahu Wa ta'ala
To the Last Prophet Hazrat MUHAMMAD (S.A.W.W.)
Student’s Declaration
I hereby declare that the work described in this dissertation was carried out by me under
the supervision of Dr. Malik Muhammad Saad Missen at the Department of Computer
Science & Information Technology, The Islamia University of Bahawalpur, Pakistan.
I also declare that the substance of this dissertation has neither been submitted
elsewhere nor is being concurrently submitted for any other degree.
I further declare that the dissertation embodies the results of my own research and advanced
studies, that it has been composed by me, and that, where appropriate, I have acknowledged
the work of others.
Daler Ali S/O Malik Hazoor Bukhsh Awan
Supervisor’s Declaration
It is hereby certified that the work presented by Mr. Daler Ali S/O Malik Hazoor Bukhsh,
Roll No. FA16M2PA003, in the thesis “EVENT AND TEMPORAL ENTITY
EXTRACTION IN URDU LANGUAGE TEXT” is based on a research study conducted
under my supervision. No portion of this work has been previously submitted for a higher
degree in this university or any other institute of learning and, to the best of the author's
knowledge, no material has been used which is not his own work except where due
acknowledgment has been made.
He has fulfilled all the requirements and is qualified to submit this thesis in partial
fulfilment of the degree of Doctor of Philosophy (Ph.D.) in the field of Computer Science,
in the Faculty of Computing, at The Islamia University of Bahawalpur.
Dr. Malik Muhammad Saad Missen
Assistant Professor,
Department of Information Technology
Faculty of Computing
The Islamia University of Bahawalpur
Acknowledgment
All my gratitude and bows to Allah Almighty, Who gave me the strength to achieve
my research goals.
I would like to pay the deepest gratitude to my great supervisor, Dr. Malik
Muhammad Saad Missen, Assistant Professor, Department of Information
Technology, who stood by me through thick and thin. He spent countless hours
facilitating and guiding me in my research work throughout the research period. His
conduct during discussions of research issues was tremendous.
I am also very thankful to Dr. Dost Muhammad Khan, Assistant Professor and HoD,
Department of Information Technology. He supported and encouraged me to
complete my research work. His kind and cooperative behaviour is a role model for
me.
A very special thanks to Dr. Mujtaba Husnain, Assistant Professor, Department of
Information Technology, who encouraged and helped me to synthesize my research
work. His moral support was an invaluable factor in my research work. I am very thankful
to my parents for their precious prayers and moral support.
Daler Ali
Table of Contents
Contents Page No.
Dedication i
Declaration of Originality iii
Acknowledgement vi
Table of Contents vii
List of Tables x
List of Figures xii
List of Abbreviations xiii
Abstract xiv
Chapter No. 1
Introduction
1.1 Concept of Event and Temporal Entity 05
1.1.1 Event 05
1.1.2 Temporal Entity 06
1.1.2.1 Fully Qualified 06
1.1.2.2 Deictic 06
1.1.2.3 Anaphoric 07
1.1.3 Temporal Entities in the Urdu Language 07
1.1.3.1 Urdu Fully Qualified Date 08
1.1.3.2 Different Types of Urdu Fully Qualified Date 08
1.1.3.3 Urdu Deictic 09
1.1.3.4 Urdu Anaphoric 09
1.1.4 Applications of Temporal Entities 10
1.2 Event Detection 10
1.3 Event Classification 11
1.3.1 Binary Classification 11
1.3.2 Multiclass Classification 12
1.3.3 Multilabel Classification 12
1.4 Event Classification in the Urdu Language 13
1.5 Challenges in Event Detection and Classification 14
1.5.1 General Challenges 14
1.5.2 Event Detection and Classification Methodology Challenges 15
1.5.3 Pre-Processing Challenges in Event Detection and Classification 15
1.5.4 Feature Extraction Challenges 15
1.6 Challenges in Event Detection and Classification from the Urdu Language 15
1.7 Importance of Time in Event Detection 16
1.8 Research Motivation 16
1.9 Research Problem 18
1.10 Research Objectives 19
1.11 Thesis Organization/Structure 20
Chapter No. 2
Background and Related Work 21
2.1 Event Detection and Classification 22
2.2 Existing Methodologies 28
2.2.1 Data Driven Approach 28
2.2.2 Knowledge Driven Approach 28
2.2.3 Hybrid Approach 28
2.3 Temporal Entity Extraction 29
Chapter No. 3
Dataset 32
3.1 Multiclass Urdu Language Labelled Sentences (MULLS) 34
3.1.1 Data Collection 35
3.1.2 Pre-processing 36
3.1.2.1 Post Splitting 36
3.1.2.2 Stop Words Elimination 36
3.1.2.3 Noise Removal 36
3.1.2.4 Filtering Sentences 37
3.1.3 Annotation Guidelines 37
3.1.4 Training Dataset 41
3.1.5 Testing/Validation Dataset 42
3.2 Urdu Named Entity Recognition (UNER) Dataset 42
Chapter No. 4
Event Classification
4.1 Proposed Methodology for Event Classification 45
4.2 Experimental Setup of Multiclass Event Classification 47
4.2.1 Feature Space 47
4.2.2 Feature Vector Generating Techniques 47
4.2.2.1 Word Embedding 47
4.2.2.2 Pretrained Word Embedding Models 47
4.2.2.3 One Hot Encoding 49
4.2.2.4 TF-IDF 49
4.3 Deep Learning Models 50
4.3.1 Deep Neural Network Architecture (Feedforward/DNN) 50
4.3.2 Recurrent Neural Network (RNN) 50
4.3.3 Convolutional Neural Network (CNN) 50
4.3.4 Hyperparameters 50
4.3.5 Performance Measuring Parameters 52
4.4 Results 52
4.4.1 Deep Learning Classifiers 52
4.4.1.1 Pretrained Word Embedding Models 53
4.4.1.2 TF-IDF Feature Vector 53
4.4.1.3 One-Hot Encoding 57
4.5 Traditional Machine Learning Classifiers 57
4.5.1 K-Nearest Neighbour (K-NN) 57
4.5.2 Decision Tree 58
4.5.3 Naïve Bayes Multinomial (NBM) 59
4.5.4 Logistic Regression (LR) 60
4.5.5 Random Forest (RF) 60
4.5.6 Support Vector Machine (SVM) 61
Chapter No. 5
Temporal Entity Extraction
5.1 Proposed Methodology for Temporal Entity Extraction 64
5.2 Rule-based Approach (Regular Expression) 64
5.3 Experimental Setup of Temporal Entity Extraction 65
5.4 Results 65
5.5 Discussion 67
Chapter No. 6
Conclusion and Future Work
6.1 Conclusion 70
6.2 Future Work 71
References 72
Appendix 80
Published Work 87
Special Thanks 88
List of Tables
Table No.    Title of Tables    Page No.
1.1 Top 5 widely spoken languages in the world 04
1.2 Deictic words 09
1.3 Examples of Event 14
2.1 TempEx Tagger for an exact match on tag span and value calculation (Bittar et al., 2011) 23
2.2 Performance of classifier on Portuguese dataset (Costa & Branco, 2012) 24
2.3 Evaluation results for PET for event recognition (Yaghoobzadeh et al., 2012) 25
2.4 Summary of the related research 27
3.1 Urdu label sentences 38
3.2 Sentence tokenization 38
3.3 Class label 38
3.4 The summary of dataset 39
3.5 The example of count vectorizer 41
4.1 Pretrained word embedding model and custom word embedding model 49
4.2 Event sentence 49
4.3 Event sentence converted to numbers using One-Hot Encoding 49
4.4 DNN’s Hyperparameters 51
4.5 RNN’s Hyperparameters 51
4.6 CNN’s Hyperparameters 52
4.7 Classification accuracy of the CNN model 53
4.8 Performance measuring parameters for DNN model 54
4.9 Performance measuring parameters for RNN model 55
4.10 Performance measuring parameters for the CNN model 56
4.11 Performance measuring parameters for the K-NN model 58
4.12 Performance measuring parameters for the DT model 59
4.13 Performance measuring parameters for NB Multinomial model 59
4.14 Performance measuring parameters for the LR model 60
4.15 Performance measuring parameters for the RF model 60
4.16 Performance measuring parameters for SVM model 61
5.1 All dates extraction results on original dataset 65
5.2 UFQD & UPFQD on extended dataset 66
5.3 Deictic date analysis 66
List of Figures
Figure No.    Caption of Figures    Page No.
1.1 Internet users in the world 02
1.2 Social Media users in the world 03
1.3 Usage of Urdu language on Facebook 04
1.4 Usage of Hindi language on Facebook 05
1.5 Usage of Arabic language on Facebook 05
1.6 Types of temporal entities 07
1.7 Types of temporal entities in Urdu language text 08
1.8 Binary classification 11
1.9 Multiclass classification 12
1.10 Three types of classification 13
1.11 A generic application diagram of our proposed system 19
3.1 Dataset life cycle (DLC) 33
3.2 Urdu and Hindi language text on Social Media 35
3.3 Instances of the pre-processed dataset 39
3.4 Maximum number of instances of each type of event 40
4.1 Event classification methodology 46
4.2 Event classification methodology’s flow diagram 46
4.3 RNN’s accuracy 55
4.4 CNN’s accuracy distribution 56
4.5 CNN, RNN, and DNN accuracy using one-hot-encoding 57
4.6 Machine learning algorithms’ accuracy using TF_IDF 61
5.1 Temporal entity extraction methodology 65
5.2 All Dates extraction results on original dataset 66
5.3 Average of UFQD & UPFQD on extended dataset 67
5.4 Deictic date analysis 67
List of Acronyms
NLP: Natural Language Processing
ML: Machine Learning
DL: Deep Learning
MULLS: Multiclass Urdu Language Labelled Sentences
UNER: Urdu Named Entity Recognition
DNN: Deep Neural Network
RNN: Recurrent Neural Network
CNN: Convolutional Neural Network
TE: Temporal Entity
Regex: Regular Expression
UFQD: Urdu Fully Qualified Date
HUFQD: Hybrid Urdu Fully Qualified Date
SVM: Support Vector Machine
RF: Random Forest
DT: Decision Tree
K-NN: K-Nearest Neighbour
LR: Logistic Regression
NBM: Naïve Bayes Multinomial
Abstract
The digital world has created space for multiple languages to be used for communication via
the Internet. The Internet provides various facilities, like real-time availability and open access
to different platforms, e.g., social media, news websites, vlogs, and weblogs, for
communication. People located in different areas of the world are connected with one another
like a global village via the Internet. They generate unstructured and structured
(heterogeneous/homogeneous) data during conversation. A huge volume of data exists in
different languages on social media and news websites that contains invaluable insights. To
add new milestones in the field of NLP, it is very important to process the content of
languages other than English. Many applications can be developed by processing local
languages, e.g., monitoring systems, topic detection, event classification, and recommendation
systems, which will certainly improve business services and the performance of private and
public/government institutions.
The Urdu language is a resource-poor language that has more than 300 million speakers all
around the world. A huge volume of Urdu content in textual form exists on social media and
news websites and contains valuable insights related to different events happening around us
in a specific time span, e.g., terrorist attacks, political campaigns, protests, and sports.
The research work consists of two major tasks, i.e., temporal (time/date) entity extraction and
event classification, with several sub-tasks under each. Temporal entity extraction and
multiclass event classification are performed on Urdu language text. The research tasks are
performed on a textual corpus consisting of 0.15 million labelled instances (sentences), named
“Multiclass Urdu Language Labelled Sentences (MULLS)”.
In this thesis, we described and accomplished three main tasks related to Urdu language text:
1) Event Extraction: the task of retrieving event information,
2) Event Classification: the task of assigning a pre-defined event label to input data, and
3) Extracting temporal information associated with events.
To achieve our goals, the proposed methodologies are data-driven (deep learning) for event
classification and knowledge-driven (rule-based) for temporal entity extraction.
To achieve our research objectives, we have explored machine learning and deep learning
classifiers. The well-known machine learning classifiers, i.e., SVM, K-NN, DT, LR, and
NBM, are evaluated on the MULLS corpus. We also evaluated the deep learning classifiers,
i.e., CNN, DNN (deep/feedforward), and RNN (bi-directional), on the same corpus. Deep
learning models outperformed the machine learning models. Among the deep learning
models, DNN showed the highest accuracy of 84% for multiclass event classification.
Secondly, temporal entities (TEs) are very important for predicting the time of events. We
have explored the various types of TEs that are helpful for developing NLP applications. We
decided to use a rule-based approach, i.e., regular expressions. TEs are extracted from the
UNER dataset, which is publicly available for research purposes. We have also suggested
new names for the different types of fully qualified dates (FQDs) occurring in Urdu language
text. Our proposed regexes showed considerable results for FQD TEs. Although anaphoric
and deictic TEs were extracted successfully, they require contextual information to yield
their actual meaning. Such TEs can be analysed using a deep learning approach.
CHAPTER 1
INTRODUCTION
1 Introduction
In this age of technology, the use of paper and pen is fading away. People prefer to use
digital gadgets instead of traditional communication sources like postal mail. High-speed
communication networks, i.e., the Internet, have become popular for instant communication
(chat). The Internet has become a vital source of communication among people located at
different places. The rapid growth of Internet users up to 2020 can be observed in Figure 1.1.
Figure 1.1: Internet Users in the World1
These days, social networks have fascinated millions of people all around the world. People
use these networks for different purposes, e.g., to share opinions, events, news,
advertisements, and research ideas on Facebook, Twitter, Instagram, and news websites. A
pictorial overview of social network users from 2010 to 2021 by Statista (website) is given in
figure 1.2. It shows the popularity, dominating influence, and craze of such communication
networks in society.
1https://dazeinfo.com/2016/06/13/number-internet-users-worldwide-2016-2020/
Figure 1.2: Social Media Users in the World2
People from every corner of the world, speaking different languages, are producing
gigabytes of information regularly. The Internet provides online input tools, like google
input tools3, that support more than 187 languages. These tools allow users to type quickly,
efficiently, and easily in various languages. The fundamental factors causing the production
of a huge amount of multilingual data are the continuous growth of Internet users, tools
supporting multiple languages, and platforms that provide communication facilities in
multiple languages.
To develop intelligent applications, it is essential to extract worthy insights from the huge
amount of structured and unstructured data existing on digital platforms. English is indeed
the dominant language of the Internet. However, online social networks have allowed their
users to use local languages for communication, which has produced a huge amount of data
in local languages. The use of local languages is highly preferred because of many factors,
like ease of use and the desire to promote local languages. More than 187 languages are
supported by google input tools. The top 5 most widely used languages in the
2 https://www.statista.com/statistics/278414/number-of-worldwide-social-network-users/
3 https://www.google.com/intl/ur/inputtools/try/
world are mentioned in Table 1.1.
Table 1.1: Top 5 Widely Spoken Languages in the World
Language Speakers
English 1132 Million
Mandarin 1117 Million
Hindi 615 Million
Spanish 514 Million
Urdu 300 Million
The English language is the most widely used language, with 1132 million users in the
world. Other non-English languages, i.e., Mandarin (1117 million), Hindi (615 million),
Spanish (534 million), and Urdu (300 million) (M. D. Eberhard, 2019), also have a
considerable number of users. The statistics of multilingual users, the small amount of
reference work for non-English languages, and the importance of predicting and monitoring
conversations motivated us to focus on non-English languages of Asian countries.
Naturally, humans prefer local languages for communication, as they deliver the
implications of a situation effectively. All the above-mentioned factors are the reasons
promoting the use of local languages over the Internet. It can be seen in figures 1.3 to 1.5
that the usage of local languages, i.e., Urdu, Hindi, and Arabic, on social networks is
becoming popular. The use of Urdu, Hindi, Arabic, and their Roman scripts is very common
in Pakistan, India, Bangladesh, and Saudi Arabia.
Figure 1.3: Usage of Urdu language on Facebook
Figure 1.4: Usage of Hindi language on Facebook
Figure 1.5: Usage of Arabic Language on Social Media
The above examples highlight the usage of local languages on social networks. It is
noticeable that people turn instantly to social networks when something happens or a
specific event occurs, e.g., music concerts, sports, political campaigns, protests, and bomb
blasts, to get to know the details and real facts. People give their opinions, show reactions,
and post comments on these events. These responses are very important for developing
Natural Language Processing (NLP) applications.
1.1 Concept of Event and Temporal Entity
1.1.1 Event
The definition of an event varies from domain to domain. In the literature, the event is
defined from various aspects, such as a verb-, adjective-, or noun-based environmental
situation (Dr. D. Ramehs, 2016)(AHMED et al., 2016). Similarly, an event can be defined as
“specific actions, situations or happenings occurring in a certain period” (Yang et al.,
1998)(Tomas, 2015).
Events can be represented as follows (Dr. D. Ramehs, 2016):
• Tensed verbs, e.g., Ali took juice yesterday. “Took” represents the event.
• Un-tensed words, e.g., the valuable statement of the PM is to appreciate talented and
industrial people. “Appreciate” is the event.
• Nominalizations, e.g., the Pakistani Air Force (PAF) strike has opened the eyes of the
whole world. “Strike” is the nominal event related to the PAF.
• Adjectives, e.g., the Pakistani cricket team seems helpless before the Australian
Kangaroos. “Helpless” is an adjective event.
• Predicative clauses, e.g., there is no reason why our people would not be prepared
to face the war escalation. In the given example, “be prepared” is a predicative
clause event.
• Prepositional phrases, e.g., all the 157 people on board the Boeing 737 jet died in the
airplane crash. “On board” represents the event.
1.1.2 Temporal Entity
Temporal entities are time-dependent pieces of information whose state changes over time,
e.g., events and acts. A temporal entity can also be defined as “anything in a temporal
expression that represents ‘when’, ‘how long’, or ‘how often’ something happens.” It
represents the time of a specific phenomenon. Time and date are very important temporal
entities for developing many Natural Language Processing (NLP) applications.
Categorically, a temporal entity (date) can be classified into three types, i.e., fully qualified
date, deictic date, and anaphoric date (Ahn, 2005)(Filannino, 2015).
1.1.2.1 Fully Qualified
A temporal entity that consists of complete date information, i.e., day, month, and year.
The format of a fully qualified temporal entity is dd/mm/yyyy, for example, 08/12/2020.
1.1.2.2 Deictic
A temporal expression that represents a date requiring further analysis: an expression of
words whose interpretation requires the utterance time of the words. For example, today and
tomorrow.
1.1.2.3 Anaphoric
A case of deictic expression for which the reference time varies according to a previously
mentioned temporal expression. For example, that year, last week, and two months.
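The practical contrast between these categories is that only the fully qualified form carries a self-contained surface pattern; deictic and anaphoric expressions have no date digits to match. As a minimal illustration (the regular expression and the sample sentence are ours, not from the thesis), a fully qualified dd/mm/yyyy date can be extracted as follows:

```python
import re

# Fully qualified dates carry day, month, and year explicitly, so a plain
# pattern suffices. Deictic ("today") and anaphoric ("that year") forms
# have no such surface pattern and need context to resolve.
FQD_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def extract_fqd(text):
    """Return (day, month, year) tuples for every dd/mm/yyyy match."""
    return [tuple(map(int, m)) for m in FQD_PATTERN.findall(text)]

print(extract_fqd("The meeting was held on 08/12/2020 and again on 3/1/2021."))
# → [(8, 12, 2020), (3, 1, 2021)]
```

Note that this sketch does not validate day/month ranges; a production extractor would also check that the matched fields form a real calendar date.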
A temporal entity is “an entity that represents time in a dataset”. In the literature, generally
three types of temporal entities have been discussed. We have further explored these types of
entities and proposed (introduced) a new type of temporal entity named “Partially Fully
Qualified Date”.
Figure 1.6: Types of Temporal Entities
1.1.3 Temporal Entities in the Urdu Language
Deep analysis of dates written in Urdu language text revealed that they can be divided into
four types, i.e., Urdu Fully Qualified, Urdu Partially Fully Qualified, Urdu Deictic, and
Urdu Anaphoric. The tree diagram in figure 1.7 is a vivid depiction of the various types of
temporal entities in the Urdu language. These types of temporal entities were not reported
before our work for any language.
Figure 1.7: Types of Temporal Entities in Urdu Language Text
1.1.3.1 Urdu Fully Qualified Date
A temporal expression that consists of complete date information: day, month, and year, i.e.,
dd/mm/yyyy (20/10/2018) بیس اکتوبر دوہزار اٹھارہ. A date written in the Urdu language which
consists of day, month, and year is called an Urdu Fully Qualified Date. For example,
(1) آٹھ فروری انیس سو اکانوے
Day, month, year, and century can be represented in the following ways:
(1) Western (ASCII) digits 0, 1, 2, …, 9, e.g., 02-10-2012
(2) Urdu (Arabic-Indic) digits, e.g., (۰۳/۱۱/۱۹۹۱)
(3) Urdu words ایک، دو، تین، چار۔۔۔ e.g., دو ہزار اٹھارہ دسمبر چودہ اگست انیس
(4) A mix of all of these, e.g., 25-جولائی-2018
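The mixed digit styles above suggest a simple normalization step before matching: mapping Urdu (Extended Arabic-Indic) digits to ASCII lets one date pattern cover both styles. The following Python sketch is our illustration of that idea, not the thesis's implementation:

```python
import re

# Urdu text uses Extended Arabic-Indic digits U+06F0..U+06F9 (۰..۹).
# Translating them to ASCII digits first lets a single regex match
# dates written in either digit style.
URDU_TO_ASCII = {ord('۰') + i: str(i) for i in range(10)}

def normalize_digits(text):
    """Replace Urdu digits with ASCII digits, leaving other text intact."""
    return text.translate(URDU_TO_ASCII)

DATE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")

sample = "۰۳/۱۱/۱۹۹۱"  # the Urdu-digit example shown above
print(DATE.findall(normalize_digits(sample)))  # → ['03/11/1991']
```

Handling dates written out as Urdu words (format 3) would additionally require a word-to-number lexicon, which this sketch does not cover.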
1.1.3.2 Different Types of UFQD Regarding Processing
Our analysis showed that the Fully Qualified Date (FQD) in the Urdu language can be
represented in different formats, so for convenience of understanding we have suggested a
name for them, i.e., Hybrid Urdu Fully Qualified Date (HUFQD). Examples are given here:
• Numeric day and Urdu month/year, e.g., 25دسمبر دوہزارسات, 25مارچ انیس سو چالیس
• Urdu day/year and numeric month, e.g., دوہزار بارہ 5چار, دس 8دوہزار تیرہ
• Urdu day/month and numeric year, e.g., یکم مارچ 2008, پندرہ جون 2009
• Urdu Partially Fully Qualified Date
A type of date written in Urdu text which is missing one of the components, i.e., day,
month, or year. For example, 26/2008, 08/2016, or 26/08 in English, while in Urdu دس جولائی
(10/07), جولائی دوہزارسات (07/2007).
1.1.3.3 Urdu Deictic
A temporal expression that represents a type of date requiring further analysis: an
expression of words whose interpretation requires the utterance time (Filannino M. G.,
2015). For example, (1) آج (‘today’) and (2) کل (‘tomorrow’) are dates that cannot be directly
mapped to a standard date format. They require further analysis of the context to yield
purposeful meanings, e.g., اب،تب، وقت، دن، رات اور صبح وغیرہ. A representative but non-
exhaustive collection of deictic words used in the Urdu language to represent time is given
in table 1.2.
Table 1.2: Deictic Words
Deictic Words Representing Time
فرصت جمعرات رات لمحہ
مہلت جمعہ روز دقیقہ
وقفہ ہفتہ یوم ساعت
دورانیہ اتوار وار پل
آغاز تب شب گھڑی
شروع کب صبح لحظہ
اختتام ابھی مہینہ سیکنڈ
رُت دیر سال آن
تاریخ تاخیر برس دم
حیات جلدی صدی عہد
زندگی اثنا سحر دور
باری سردی فجر زمانہ
موسم گرمی دوپہر آن
موقع خزاں سہ پہر وقت
عمر بہار شام قرن
دراز اوقات تڑکا مدت
روزانہ میعاد سویرا زمانہ
آج ازل سوموار منٹ
کل ابد منگل گھنٹا
پرسوں عرصہ بدھ دن
1.1.3.4 Urdu Anaphoric
A case of deictic expression for which the reference time varies according to a temporal
expression previously mentioned in the text. For example, اسُ سال (‘that year’), پچھلے ہفتے
(‘last week’), and دو ماہ (‘two months’). It is a special case of the deictic date which requires
a reference time, varying from context to context, to conclude meaningful information, e.g.,
اگلے سال، پچھلے دن، کئی سال.
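The dependence of deictic expressions on an utterance time can be made concrete with a small sketch. In this Python example (the word list, offsets, and document creation time are our own assumptions, for illustration only), a few deictic Urdu words are resolved against a given reference date:

```python
from datetime import date, timedelta

# Assumed day offsets relative to the utterance time:
# آج = today, کل = tomorrow (کل can also mean 'yesterday' in Urdu;
# a real system needs context to disambiguate), پرسوں = day after tomorrow.
DEICTIC_OFFSETS = {"آج": 0, "کل": 1, "پرسوں": 2}

def resolve_deictic(word, utterance_time):
    """Map a deictic word to a concrete date, given when it was uttered."""
    offset = DEICTIC_OFFSETS.get(word)
    if offset is None:
        return None  # not a known deictic word
    return utterance_time + timedelta(days=offset)

dct = date(2020, 12, 8)  # assumed document creation time
print(resolve_deictic("کل", dct))  # → 2020-12-09
```

Anaphoric expressions such as اسُ سال (‘that year’) are harder still: the reference time is not the utterance time but an earlier temporal expression in the text, so resolving them requires tracking previously mentioned dates.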
1.1.4 Applications of Temporal Entities
There are several applications that utilize extracted insights related to temporal entities, e.g.,
timeline construction, tracking the history of stories, estimating document creation time
(DCT), improving the news reading experience, and enhancing the information retrieval
capabilities of systems (Zaraket & Makhlouta, 2012).
Detecting and extracting temporal entities from the content of documents, instead of from
metadata such as the Document Creation Time, is preferable for the research community,
because the last modification date of a document is not its DCT: data may be copied or
uploaded, at which time the metadata of the document is updated, and this is not the actual
DCT (Li, Hang, et al., 2009). Automatically assigning an event-time period to documents,
e.g., medical reports, accident reports, news articles, and traveling histories, is probably
more significant than relying on the writing/publishing date of these documents (Llidó et
al., 2001).
1.2 Event Detection
Event detection is a fundamental task in NLP which can be used to analyze risk factors, to
predict the law and order situation, and to support decision making (Hogenboom et al.,
2016). It is applied in mediation information systems (Barthe-Delanoë et al., 2014), firm-
specific analysis and social media monitoring (Jiang et al., 2014), vehicle routing (Pillac et
al., 2012), environment scanning (Wei & Lee, 2004), news personalization systems (Borsje
et al., 2010), discovering defects in products (Hogenboom et al., 2016), advanced spatio-
temporal reasoning about moving objects (Jin et al., 2013), algorithmic trading (Nuij et al.,
2014), financial risk analysis, quality assurance (Abrahams et al., 2012), terrorism detection
(Conlon et al., 2015), e-commerce (A.S. Abrahams, 2002), timeline design, etc. Event
extraction, with its origins in the 1980s, has become an interesting and popular problem due
to the availability of resources, i.e., datasets, processing tools, etc., for many languages. As
defined by (Li et al., 2017), an event is a general term used to refer to a happening: some
situation or action depending on time. Events consist of an “event trigger” and “event
arguments”. Event triggers are the factors that cause events to happen, i.e., action/verb
words. Event arguments are the named entities, i.e., person, place, and organization. Event
detection is categorized into two subtasks, i.e., retrospective event detection and new event
detection (Jin et al., 2013). The former extracts events from pre-collected resources while
the latter discovers new events from real-time streams of text.
Event detection can be performed at the sentence (Naughton et al., 2010), paragraph
(D’Andrea et al., 2019), phrase, and document level (Jacobs et al., 2019). The extracted
information can represent different types of events, e.g., sports, politics, terrorist attacks, and
inflation. Information related to an event can be detected and classified at different levels of
granularity, i.e., document level (D’Andrea et al., 2019), sentence level (Jacobs et al., 2019),
word level, character level, and phrase level (D’Andrea et al., 2019).
1.3 Event Classification
“Event classification is an automated way to assign a predefined label to new instances.” It
can also be defined as “the automated way of assigning predefined event labels to new
instances by using pre-trained classification models.” All the classifiers are trained on
labelled instances of a dataset and are later used to predict the event class of new, unknown
instances.
Event classification information can be used to develop several different NLP applications
like content labeling, topic modeling, and finding the latest trends. Event classification is
one of the practical yet challenging tasks of NLP. It is pertinent to mention that events can
be classified in more than one way (Sokolova & Lapalme, 2009), i.e., binary classification,
multiclass classification, and multilabel classification.
1.3.1 Binary Classification
In binary classification, a data point is assigned one (1) class out of a total of two (2)
classes. It is used for datasets that contain two classes as output, for example, positive or
negative, male or female (D. Ali et al., 2016), and spam or not-spam.
Figure 1.8: Binary Classification
1.3.2 Multiclass Classification
A type of classification in which new instances or data points are classified into one class
out of multiple classes, for example, event classes, i.e., terrorist attack, murder, accident,
and outbreak. In this type of classification, every sentence is labelled with one of the
multiple classes.
Figure 1.9: Multiclass Classification
It is the task of automatically assigning the single most relevant class from the given
multiple classes. A major and serious challenge for multiclass classification is that
sentences overlap in multiple classes (Kong et al., 2011; Sarker & Gonzalez, 2015), which
generally affects the overall performance of the classification system.
1.3.3 Multilabel Classification
In multilabel classification, an object can be assigned more than one class. A collective
comparison of the three types of classification is shown in figure 1.10.
Figure 1.10: Three types of Classification Including Multilabel Classification
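The difference between the three settings is essentially the shape of the target variable. The following Python sketch (with toy labels of our own, for illustration only) contrasts the single label per instance of binary and multiclass classification with the indicator-matrix encoding commonly used for multilabel targets:

```python
# Binary: each instance gets one of exactly two classes.
binary_labels = ["spam", "not-spam", "spam"]

# Multiclass: each instance gets exactly one of many classes.
multiclass_labels = ["terrorist-attack", "murder", "accident"]

# Multilabel: each instance gets a SET of classes, usually encoded
# as a row of 0/1 indicators over the full class list.
classes = ["sports", "politics", "protest"]
multilabel_rows = [{"sports", "politics"}, {"protest"}]

def to_indicator(label_set, classes):
    """One row of a multilabel indicator matrix: 1 if the class applies."""
    return [1 if c in label_set else 0 for c in classes]

matrix = [to_indicator(row, classes) for row in multilabel_rows]
print(matrix)  # → [[1, 1, 0], [0, 0, 1]]
```

In the multiclass setting studied in this thesis, each MULLS sentence would correspond to a single event label, i.e., one entry of `multiclass_labels`, rather than an indicator row.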
1.4 Event Classification in the Urdu Language
There are several hurdles in processing Urdu language text for event classification. Some of
them are determining the boundary of events in a sentence, identifying event triggers, and
assigning an appropriate label. (Naz et al., 2013) reported that more than 100 million people
understand, speak, and write the Urdu language in the sub-continent. Urdu has become very
popular on the web, especially in online social networks, because of the availability of input
tools for Urdu. Extracting events from social networks for the Urdu language is a unique
and challenging task. The Urdu language has a complex writing script and a right-to-left
writing style. Its grammatical structure is different from other languages, e.g., English,
French, and German; it follows the subject-object-verb (SOV) sequence (Daud et al., 2017).
The Urdu language is complex because it consists of joined-letter and non-joined-letter sets
(Pal & Sarkar, 2003). Each letter of the joined-letter set can be written at three different
locations in a word, with a different form at the beginning, middle, and end (Pal & Sarkar,
2003). Some words can be combined to make a single word, e.g., اس لیے written as اسلیے
(‘so that’), which makes it hard to process the Urdu language with existing tools.
Table 1.3: Examples of Events

Urdu: جنوبی کوریا: کتوں کا سب سے بڑا مذبح خانہ بند کر دیا گیا۔
Roman Urdu: Janobi Koriya: Kutoon ka sab sy barra mizbah khana bandd ker dia gaya.
English: South Korea: The largest dog slaughterhouse has been banned.

Urdu: 1981 کے بعد پہلی مرتبہ خواتین کو ایران میں سٹیڈیم میں آ کر میچ دیکھنے کی اجازت ملی۔
Roman Urdu: 1981 kay baad pahli martaba khawateen ko Iran mein stadium mein aa ker match daikhny ki ijazat mili.
English: After 1981, women were allowed to come to the stadium to watch a match in Iran for the first time.
1. In the first example, the words (بند کرنا, ‘to ban’) represent the action word, while
(کتوں, ‘dogs’) and (مذبح خانہ, ‘slaughterhouse’) are nouns. Extracting and sorting this
information is helpful for identifying events in the Urdu language.
2. In the second example, 1981 and (بعد, ‘after’) are temporal entities while (اجازت ملنا,
‘to be allowed’) is a word representing the event. Such information can be used to construct
a historical timeline about Iran.
In our research problem, an event can be defined as “an environmental change that occurs
due to some reason or action for a specific period”, for example, the explosion of a gas
container, a collision between vehicles, terrorist attacks, and rainfall.
Social media provides a platform to share information in different languages on various
topics. The classification of this information is very important in NLP tasks. Urdu is one of
the local languages being used on online social media for sharing information. Classifying
such information into different types can be helpful for different NLP applications, e.g., a
risk factor analyzer, a law and order situation predictor, and an event timeline constructor
for certain areas of the world. In our research work, we explore social media textual data
written in the Urdu language to classify events into different categories. To the best of our
knowledge, we are the first to explore Urdu textual data for event classification.
1.5 Challenges in Event Detection and Classification
1.5.1 General Challenges
Social networks have become the world's central hub for sharing information, generating
a large volume and variety of data (Al-Dyani et al., 2018). Extracting worthy information
from this huge bulk of data is a challenging task. Publicly available datasets are scarce and
platform-dependent (McMinn et al., 2013), i.e., tied to Twitter data, which hampers the
replication and comparison of different approaches (Panagiotou et al., 2016). Much user-
generated data on social networks has irregular grammar, irrelevant terms, limited length,
and spelling errors (Parikh & Karlapalem, 2013).
1.5.2 Event Detection and Classification Methodology Challenges
In general, there are two main approaches to event detection: document pivot and feature
pivot (McMinn et al., 2013). In the document pivot approach, clustering is performed
based on document similarity, which cannot handle the large amount of data on social
networks. Moreover, identical terms are used in different events, which degrades the
accuracy of event detection with this approach (McMinn et al., 2013). Supervised, feature
pivot event detection shows good results, as mentioned in the literature (Mohamad et al.,
2010; Lavanya et al., 2014), but it is time-consuming and requires a large volume of
training data and a lot of human effort (Al-Dyani et al., 2018).
1.5.3 Pre-Processing Challenges in Event Detection and Classification
The social stream is full of noise, e.g., advertisements, hoaxes, and spam messages;
identifying event-related content amid this noise is another challenge in event detection
(Li & Zheng, 2014). Data representation techniques, i.e., bag of words (BOW) and term
frequency (TF), have their limitations. With the BOW technique, event classification is
challenging because it does not preserve the sequence of words, so overlapping sentences
lead to misclassification; term frequency, on the contrary, utilizes more resources, i.e.,
time and memory (Al-Dyani et al., 2018).
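The word-order limitation of BOW noted above can be seen in a few lines of code; the example sentences are toy English stand-ins for Urdu text:

```python
from collections import Counter

# Two sentences with opposite meanings produce identical bags, because a
# bag-of-words representation discards word order entirely.
def bag_of_words(sentence: str) -> Counter:
    """Represent a sentence as an unordered multiset of its tokens."""
    return Counter(sentence.lower().split())

a = bag_of_words("police stopped the protest")
b = bag_of_words("the protest stopped police")
print(a == b)  # True: order is lost, so a classifier sees the same features
```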
1.5.4 Feature Extraction Challenge
Features are a crucial element of event detection. Social streams contain a huge number
of features (Dou et al., 2012; Lu et al., 2015). Dependency among the extracted features
leads to ambiguity in event detection and classification, because different events can be
expressed using similar words (identical features).
1.6 Challenges in Event Detection and Classification from the Urdu
Language
The major challenges of event extraction and classification for the Urdu language are:
• The writing style and structure of the Urdu language pose challenges to event
extraction and classification,
• The lack of processing resources, i.e., part-of-speech (PoS) taggers, named entity
recognizers, and annotation tools, is another big challenge in event detection and
classification for the Urdu language,
• People are generally unfamiliar with the meaning and usage of the Urdu language,
• The misuse of different terms representing events makes event classification more
challenging, because the same words often represent different events; this badly
affects the accuracy of a classification system,
• Knowledge-based and data-driven approaches to extracting and classifying events
from Urdu language text face the unavailability of publicly available event datasets,
• Extracting temporal entities is important for classifying events as retrospective
(old) or real-time (new). The cursive and complex writing format makes extracting
events and temporal information from the Urdu language an interesting and
challenging task.
1.7 Importance of Time in Event Detection
Events are occurrences in a certain period. Time detection is crucial in various NLP and
IR applications: retrieving information about an event that happened at a specific time
requires temporal entities. Time plays an important role in retrieving exact information,
saving the user's time and machine resources, e.g., processing power. In the case of natural
disasters, terrorist attacks, and acute accidents, temporal information can be useful to
predict the start of a rescue operation, analyze the risk of injuries, estimate losses, and
project the expected number of casualties for a certain duration. Identifying events from
social media in a specific time interval relies on temporal information (Kamila et al.,
2018). Detecting temporal entities in textual data helps to construct a timeline of events,
order them, and classify them as retrospective or real-time events (Li et al., 2017).
1.8 Research Motivation
Although very useful information can be extracted from social networks, extracting
events and temporal information is a demanding and challenging task. Temporal
information is also crucial for ordering the sequence of events and distinguishing
retrospective from real-time events. Appreciable research work on information extraction
from textual data exists for non-cursive languages, e.g., English, German, French, and
Japanese (Nadeau & Sekine, 2007), but very little research exists for other languages
(Riaz, 2008). Urdu is one of the languages heavily used on the web, yet it has minimal
processing resources (Malik & Sarwar, 2016). It has more than 100 million users in the
world (Naz et al., 2013).
Generally, it can be observed that research on conversation prediction and monitoring
has considerable reference work for the English language (Konstantinidis et al., 2017).
The pandemic outbreak (COVID-19) severely affected Italy (De Santis et al., 2020),
generating a heated debate on Twitter in the Italian language. Italian keywords, e.g.,
Salvini, Conte, PD, Calcio, and Carcere, were used to collect 1,044,645 tweets to predict
the relevant topics (De Santis et al., 2020). To monitor the sentiment, loyalty, and
behavior of people towards a product, an Italian public broadcasting service was analyzed
using 1,000 Facebook posts (O'Keeffe et al., 2011). Cyberbullying on social media causes
aggression; it is a major (Xu et al., 2012) and national health problem (Limber, n.d.)
badly affecting people psychologically, physically, and academically (Al-Garadi et al.,
2019). An aggression-detection system was designed to predict and monitor cyberbullying
(Somooro, 2019).
During the 2018 general election of Pakistan, the status of the Urdu language on Twitter
was analyzed (Jaidka et al., 2019). The work highlighted the occasionally rapid use of the
Urdu language on social media, i.e., Twitter. A lot of Urdu text was generated to promote
the election campaigns of political parties. The purpose of the system was to predict
sentiment regarding the general election (Jaidka et al., 2019). Sentiment information has
also been mined from tweets to predict election outcomes in Pakistan, India, and Malaysia
(Jaidka et al., 2019).
To narrow down our research problem, we decided to choose one of the 187 languages,
i.e., Urdu, for conversation prediction and monitoring. The Urdu language has a complex
writing script, a right-to-left writing style, and free word order. It is one of the resource-
poor languages (Chowdhury et al., 2013). To the best of our knowledge, no event
extraction, event classification, or conversation prediction and monitoring system exists
for Urdu language text. Instead of predicting the sentiment of a specific group of people,
or feedback about a product, personality, or policy, we decided to extract and classify
events and to predict and monitor the general and specific conversations of all types of
users. The events are classified into twelve broad categories, i.e., sports, inflation,
terrorist attack, murder, death, sexual assault, fraud and corruption, weather, earthquake,
business, politics, and showbiz.
Urdu language text has a considerable volume on social media and news websites. It
contains invaluable information that is essential for developing many NLP applications.
To the best of our knowledge, no reference research work yet exists for event extraction
in the Urdu language. In this thesis, we propose to extract events and the temporal
information related to them from Urdu textual data.
1.9 Research Problem
Nowadays, the local languages of the most populous countries have created their own
interaction spaces on social media, mobile devices, and news websites. The use of Urdu
language text on these platforms is growing rapidly, and invaluable information can be
extracted from these data sources. Event classification and temporal entity extraction
from text written in the Urdu script are the major problems of our research work.
In our research work, we accomplish three main tasks related to the Urdu language:
1) event extraction: extracting event information from input data; 2) event classification:
assigning a predefined label to the input, e.g., public protest, sports, terrorist attack,
inflation, murder, or death; and 3) extracting the temporal information associated with
events to identify them as retrospective (old) or real-time (new) events in the Urdu script.
Figure 1.11: A generic application diagram of our proposed system
1.10 Research Objectives
The objectives of our research are given here:
1 To propose an approach to identify and classify events in Urdu language text,
2 To propose an approach to classify events in Urdu language text into different
types, i.e., sports, politics, and protest,
3 To propose an approach for extracting temporal entities from Urdu language text,
4 To propose an approach to segregate events into real-time and retrospective by
using extracted temporal entities,
5 To develop an event-based Urdu text data collection.
1.11 Thesis Organization
The thesis is organized into different chapters to explain the research work. A detailed
summary of the chapters and related information is given below:
Chapter 1
In this chapter, a comprehensive introduction of the research work is presented that covers
the importance and scope of our work.
Chapter 2
The research work related to our research problem is given in the second chapter.
Chapter 3
The detail of the dataset is given in chapter 3.
Chapter 4
The detail of the methodology and experimental results are given in chapter 4.
Chapter 5
All the detail about temporal entities and an overall discussion of the research work is
given in chapter 5.
Chapter 6
The research work is concluded in this chapter, followed by a discussion of future work.
CHAPTER 2
BACKGROUND AND RELATED
WORK
In this chapter, we present a comprehensive review of the literature. The related work is
presented in two sections. The first section reports event detection and classification
work that already exists for other languages, while the second section reviews the
literature on temporal entity extraction.
2.1 Event Detection and Classification
Initially, event detection was used in the biomedical domain, i.e., to extract gene and
protein entities (Hogenboom et al., 2016). Event extraction from textual data found on the
internet, supported by the Advanced Research Projects Agency (ARPA), originated in the
late 1980s for message understanding and the automatic detection of terrorism-related
text in newswire (Hogenboom et al., 2016). Event detection has since enlarged its scope
from gene expression to protein expression, i.e., finding events related to gene and
protein entities (Yakushiji et al., 2001). Event detection is not confined to the biomedical
domain but is also used in other domains. To remain updated on the latest events
occurring on social media (Ritter et al., 2015), weakly supervised approaches are used:
seed examples of candidate events are given as input to the system, and new events are
then categorized. Computer security events, i.e., denial of service, data breaches, and
data hijacking, are reported in that work.
Identifying events and event locations has been performed using keywords and
contextual information (Bahir & Peled, 2016). The analysis showed that the name of the
event location was used in many instances of the textual messages. Event location
detection is very important; for example, in the case of disastrous events, i.e., fires,
earthquakes, or typhoons, the rescue team needs to know the "location" of the event. The
Edinburgh Twitter corpus (Petrovic et al., 2010), consisting of 97 million tweets, has
been used for retrospective event extraction; (Li et al., 2017) used a temporal module to
differentiate between clusters of real-time (new) and retrospective (old) events. Jyoti et
al. (J. P. Singh et al., 2019) developed a neural-network-based system to classify events
to help people in natural disasters such as floods by analyzing tweets. A Markov model
was used to classify tweets and predict locations, showing 81% accuracy for classifying
a tweet as a request for help and 87% accuracy for locating the event. Research work has
been conducted on life event detection and classification, i.e., marriage, birthday,
traveling, etc., to anticipate the products and services that would facilitate people
(Cavalin Rodrigo Paulo, 2016). Data about life events exists in very small amounts.
Linear regression, naïve Bayes, and nearest-neighbor algorithms were evaluated on the
original, very small dataset but did not show favorable results. Oversampling the training
dataset greatly affected the performance of the algorithms, among which linear
regression outperformed the others and showed considerable results.
An exhaustive review of the literature focused our findings on three general approaches
by which events can be detected, i.e., data-driven, knowledge-driven, and hybrid (Allen,
1983). In general, classification is performed based on hand-crafted rules, i.e., pattern
matching by regular expressions, hand-crafted rules using lexical features (full-length
words, part-of-speech tags), syntactic features (dependency parses), and external
knowledge features (WordNet). A knowledge-based system developed by (Ferro et al.,
2005) was applied to Arabic tweets using an unsupervised rule-based technique. It
focused on extracting three parameters related to events, i.e., trigger, time, and
identification; the proposed system achieved accuracies of 75.9%, 87.5%, and 97.7%,
respectively. TWICAL, an open-domain event extractor for Twitter, was developed to
extract events from tweets, a noisy source of information (Filannino & Nenadic, 2015).
The French TimeBank corpus was developed by André Bittar et al. (Bittar et al., 2011) to
extract times, events, and the relations between them. Cross-language annotation and
French-language guidelines were specified to improve ISO-TimeML. An improvement
to the modality-capturing system was made after analysis of French text revealed that
modality is expressed using inflected verbs. A set of normalized values for the modality
attribute, i.e., necessity, possibility, obligation, and permission, was provided in the
manual annotation context. Another contribution was a way to capture the difference
between the neutral and inchoative aspect values of support-verb constructions. Finally,
a new type of event class, i.e., Event-Container, was introduced to distinguish predicates
that take an event nominal as subject (Bittar et al., 2011). Some correspondences were
made between English and French grammar; the imperfect morphological tense of
French does not exist in English. The tools of (Caselli & Sprugnoli, 2017) were used to
process and evaluate contents consisting of 61,000 tokens.
Table 2.1: TempEx tagger results for an exact match on tag span and value (Bittar et al., 2011).

Task    System   Precision  Recall  F1-Measure
Match   TempEx   84.2       81.8    83.0
        DEDO     83.0       79.0    81.0
Value   TempEx   55.0       44.9    49.4
        DEDO     56.0       45.0    50.0
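As a side note, the F1-Measure column in Table 2.1 is simply the harmonic mean of precision and recall, which can be checked directly:

```python
# Reproduce the TempEx "Match" row of Table 2.1: F1 is the harmonic mean
# of precision and recall.
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(84.2, 81.8), 1))  # 83.0
```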
TempEval-2 released TimeML-annotated data for Chinese, English, French, Italian,
Korean, and Spanish (Costa & Branco, 2012; Caselli & Sprugnoli, 2017). A Portuguese
dataset consisting of 70,000 words annotated in the TimeML language was developed by
(Costa & Branco, 2012) and used to extract events and the temporal information related
to them.
Table 2.2: Performance of classifiers on the Portuguese dataset (Costa & Branco, 2012)

                          F-Measure
Group   Algorithm       English  Portuguese
A       KStar           0.59     0.58
        Baseline        0.57     0.59
B       Decision Table  0.73     0.77
        Baseline        0.56     0.56
C       SMO             0.54     0.54
        Baseline        0.47     0.47
A corpus based on ISO-TimeML annotation was developed for the Persian language to
extract events and temporal information. (Yaghoobzadeh et al., 2012) made the first
attempt for the Persian language and annotated 4,237 events in 30,000 sentences.
• Multiple tokens located at different positions in the same sentence can be
marked as one event, e.g., Bârân (rain) be (to) mantaq-e (area) sadame-e
(damage) zyâdê (large) khâhad zad (will do).
Translation: The rain will largely damage the area.
Part of the PersTimeML output would be: <Event xml:id="e1"
target="#token3#token5" text="sadame-e khâhad zad" … />
• Some changes were made to event attributes, the values of these attributes,
annotation rules, and event extents.
• The PersTimeML corpus specifies the annotation of generics as events.
• Gerund phrases are also annotated as events, even when they represent
generic events.
• Objective deverbal adjectives in PersTimeML are adjectives derived
from the passive modes of verbs.
• Compound words consisting of a non-verbal element and a light verb are
marked as events.
Table 2.3: Evaluation Results for PET for Event Recognition (Yaghoobzadeh et al., 2012)

            Rule-based                       Learning-based
Category    Precision  Recall  F-Measure    Precision  Recall  F-Measure
All         78.9       72.5    75.6         79.2       87.5    83.1
Verb        96.5       99.3    97.9         97.1       99.5    98.3
Noun        66.3       64.4    65.3         82.1       81.8    77.3
Adjective   88.5       55.8    68.4         78.3       76.4    77.3
Using TimeML for non-English languages has coined two approaches:
• modification of the annotation scheme, starting from automatic porting of an
existing annotated English corpus to other languages,
• design of language-specific annotation specifications and the corresponding
annotated resources from scratch (Caselli & Sprugnoli, 2017).
In the past, researchers were indifferent to the Urdu language because of its limited
processing resources, i.e., datasets, annotators, part-of-speech (PoS) taggers, and
translators (A. R. Ali & Ijaz, 2009). However, in the last few years, feature-based
classification of Urdu text documents has begun to use machine learning models (Daud
et al., 2017; Mehmood et al., 2019; Ahmed et al., 2016). A framework was proposed
(Zia et al., 2015) to classify Chinese short texts into seven kinds of emotion and product
reviews. The event-level information of a sentence from the text and contextual
information from external sources (lexicons, knowledge bases) are provided as
supplementary supporting material to the neural models.
A fusion of CNN and RNN models was used to classify sentences using a movie review
dataset and achieved 93% accuracy (Abdlrauf, 2017). Urdu text classification at the
document level is presented in (Zhou et al., 2018), which shows a comparative analysis
of machine learning and deep learning models. CNN and RNN single-layer/multilayer
architectures were used to evaluate three different sizes of dataset (Liu & Guo, 2019).
The idea was to analyze and predict product quality using customer feedback,
categorizing the feedback as Valuable, Not Valuable, Relevant, Irrelevant, Bad, Good,
or Very Good (Y. Zhang, 2012).
Different datasets are reported in the state of the art, i.e., Northwestern Polytechnical
University Urdu (NPUU), which consists of 10K news articles labeled into six classes;
the Naïve dataset, comprising 5,003 news articles in five classes (Zia et al., 2015); and
COUNTER, with 1,200 news articles and five classes (Y. Zhang, 2012). A joint
framework consisting of CNN and RNN layers was used for sentiment analysis (A. R.
Ali & Ijaz, 2009). Two datasets, the Stanford movie review and the Stanford Treebank
datasets, were used to evaluate the designed system, which showed 93.3% and 89.2%
accuracy, respectively.
In (A. R. Ali & Ijaz, 2009), the authors performed supervised text classification of the
Urdu language using statistical approaches, i.e., Support Vector Machine (SVM) and
naïve Bayes. The classification was initiated by applying different preprocessing
approaches, namely stemming, stop-word removal, and both stop-word elimination and
stemming together. The experimental results showed that the stemming process had little
impact on performance improvement; on the other hand, the elimination of stop words
had a positive effect on results. The SVM outperformed naïve Bayes, achieving
classification accuracies of 89.53% and 93.34% with polynomial and radial basis
function kernels, respectively.
Similarly, SVM has also been applied to news headline classification in Urdu text
(Usman et al., 2016), showing a very low accuracy improvement of 3.5%. News
headlines are small pieces of information that frequently do not describe the contextual
meaning of the contents. In (Usman et al., 2016), a majority-voting algorithm used for
text classification in the Urdu language showed 94% accuracy. The classification was
performed on seven different types of news text; however, the number of instances was
very limited. A dynamic convolutional neural network (Kalchbrenner et al., 2014) was
designed to model the sentiment of sentences. It consists of dynamic k-max pooling and
global pooling over a linear sequence and performs multi-class sentiment classification.
A quite different task was performed in (Awais & Shoaib, 2019), in which the authors
used a hybrid of rule-based and machine-learning techniques to perform sentiment
classification while analyzing Urdu script at the phrase level. The performance of the
system was 31.25%, 8.46%, and 21.6% for recall, precision, and accuracy, respectively.
In (J. P. Singh et al., 2019), the limitations of the traditional BOW and n-gram features
were tackled by using a variant of RNN known as Long Short-Term Memory (LSTM).
A multiple minimal reduct extraction algorithm was designed (Al-Radaideh & Al-Abrat,
2019) by improving the quick reduct algorithm. The multiple reducts are used to generate
the set of classification rules that represent the rough set classifier. A corpus of 2,700
Arabic text documents was evaluated using multiple and single reducts; the proposed
system showed 94% and 86% accuracy, respectively. Experimental results also showed
that both the k-NN and J48 algorithms performed well in classification accuracy on the
dataset at hand. Table 2.4 depicts a summary of the related research discussed.
Table 2.4: Summary of the Related Research Work

Paper Reference                 Classifier Used         Dataset                             Accuracy
(Hassan, 2018)                  CNN and RNN             Movie reviews                       92%
(Zia et al., 2015)              CNN and RNN             1. Stanford movie review dataset    93.3% and 89.2%
                                                        2. Stanford Treebank dataset
(A. R. Ali & Ijaz, 2009)        Naïve Bayes and SVM     Corpus of Urdu documents            89.53% and 93.34%
(Usman et al., 2016)            Dynamic neural network  News articles                       96.5%
(Awais & Shoaib, 2019)          Rule-based modeling     Urdu corpus of news headlines       31.25%
(J. P. Singh et al., 2019)      LSTM                    Tweets                              81.00%
(Al-Radaideh & Al-Abrat, 2019)  k-NN and J48            Arabic corpus of 2700 documents     95% and 86%
2.2 Existing Methodologies
Three broad approaches to temporal entity and event extraction are the data-driven,
knowledge-driven, and hybrid approaches (Allen, 1983).
2.2.1 Data-Driven Approach
Statistical, machine learning, and linear algebra methods are used in this approach. It
requires a huge volume of corpora. It does not consider the semantics of words, i.e., the
meanings of words, while discovering relations in the dataset. It is helpful for developing
language-independent event detection systems (Allen, 1983). Classification, clustering,
and regression are common types of machine learning approaches. Deep learning, an
emerging innovation in machine learning, is another approach for NLP.
2.2.1.1 Different Techniques/Methods in the Data-Driven Approach (Allen, 1983)
• Word Frequency Count,
• Ranking by mean of TF-IDF,
• Word Sense Disambiguation,
• N-grams,
• Clustering,
• Hierarchical Clustering,
• Weighted undirected bipartite Graph and Clustering.
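To make the second technique concrete, the following sketch ranks terms by TF-IDF over a toy corpus; the documents are hypothetical English stand-ins for Urdu text:

```python
import math
from collections import Counter

# Toy corpus for illustration: terms frequent in one document but rare
# across the corpus get the highest TF-IDF weight.
docs = [
    "flood rescue operation started in the city",
    "cricket match started in the city stadium",
    "flood damaged the city roads",
]

def tf_idf(term: str, doc: str, corpus: list[str]) -> float:
    """Term frequency times inverse document frequency (natural log)."""
    tf = Counter(doc.split())[term]
    df = sum(1 for d in corpus if term in d.split())
    return tf * math.log(len(corpus) / df) if df else 0.0

# "flood" appears in 2 of 3 documents, "rescue" in only 1, so "rescue"
# receives the larger weight in the first document.
print(tf_idf("rescue", docs[0], docs) > tf_idf("flood", docs[0], docs))  # True
```

Real systems vary in the exact TF and IDF formulas (smoothing, normalization), but the ranking idea is the same.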
2.2.2 Knowledge-Driven Approach
It is a pattern-based approach. Patterns help design rules for event extraction from
textual data. There are two types of linguistic patterns, i.e., lexico-syntactic and lexico-
semantic. Lexico-syntactic patterns use grammatical features, i.e., part of speech and
tense, while lexico-semantic patterns use the contextual meanings of words. Regular
expressions are used to combine lexical and syntactic patterns. In personal blogs,
experience events were extracted using three words, i.e., Place, Object, and Verb, that
together represent the event. Semantic information is used to find patterns about the
event; semantics are added by using gazetteers or an ontology (Allen, 1983).
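A lexico-syntactic pattern of the kind described above can be sketched as a regular expression over word/TAG pairs; the tagged sentence and tag set here are hypothetical, since a real system would rely on an Urdu PoS tagger:

```python
import re

# Hypothetical PoS-tagged sentence (word/TAG format) for illustration.
tagged = "protesters/NN blocked/VB the/DT main/JJ road/NN"

# Lexico-syntactic pattern: a noun immediately followed by a verb,
# combining lexical content with grammatical (PoS) structure.
pattern = re.compile(r"(\w+)/NN\s+(\w+)/VB")

match = pattern.search(tagged)
if match:
    obj, action = match.groups()
    print(obj, action)  # protesters blocked
```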
2.2.3 Hybrid Approach
The hybrid approach combines the properties of both the data-driven and knowledge-
driven approaches. It is highly suitable when we have a small amount of data and limited
command of the target language (Allen, 1983).
2.3 Temporal Entity Extraction
An early initiative in named entity recognition was taken by (Woodward, 2001), which
extracted 'company' names by using heuristic, hand-crafted rules; the rules were designed
by humans learning patterns in the contents. Language is a big factor in textual data
analysis: a good portion of the research has been done for the English language, but
most researchers highlight multilingual and language-independent approaches, owing to
the heterogeneous and unstructured nature of data (Nadeau & Sekine, 2007).
To quickly convey the idea of contents, a tool was developed (Nadeau & Sekine, 2007)
for learning about people, places, things, and events. Information is provided visually by
extracting entities and events from textual data. A historical collection of American Civil
War news articles published on Wikipedia, incorporating all named entities with proper
times, was analyzed with the Stanford NLP CRF, which achieved a 79.1% F-measure.
Change, causality, and actions are defined in terms of time, and many artificial
intelligence applications require time and reasoning about time; e.g., to answer a "when"
query, the system needs to anchor events. Similarly, "how long" questions require event
durations to respond properly. An approach was developed (Llidó et al., 2001) to
automatically assign a document event-time by extracting temporal expressions from
the text. It helps to retrieve related documents based on temporal values and to find the
relationships between them.
(Hao et al., 2018) designed a novel method, TEER, to extract and normalize temporal
expressions from heterogeneous clinical text. They used heuristic rules, summarization,
and automatic pattern learning. The developed system was evaluated on two datasets,
i.e., English and Chinese clinical text consisting of 400 English and 1,459 Chinese
discharge summaries. Precision and recall were 0.948 and 0.877 for English, and 0.941
and 0.932 for Chinese, respectively. A sequencer system was developed for the analysis
of temporal entities (Walenz et al., 2010) existing in news articles and user-generated
unstructured contents. It is based on crawling, clustering, extracting, and visualizing.
WordNet-based features were used in the CRF model. Many annotation schemes, i.e.,
PoS tagging, partial parsing, semantic interpretation, case frame instantiation, and
discourse analysis, have been used to extract temporal expressions from textual data
(Ferro et al., 2005). Another system was developed to extract fluent information that is
valuable for a certain period; the authors claimed that many proposed systems focus on
static information, while temporal expressions predominate in newswire text and
Wikipedia (Ling & Weld, 2010). The precision and recall of temporal information
extraction ranged from 0.50 to 0.99. In general, temporal expression identification is
performed by machine learning approaches based on lexical and morphological features
(Ahn et al., 2005); Support Vector Machines and Conditional Random Fields (CRF)
give considerable results for non-cursive languages (John et al., 2001; Khan et al.,
2016). A system was also designed to monitor events being reported on social media
during a specific time (Huang et al., 2018).
A considerable volume of research work exists for non-cursive languages, especially
English, French, German, Dutch, and Spanish (Nadeau & Sekine, 2007), which has
achieved noticeable accuracy and enabled mature artificial intelligence applications.
For Urdu, no exhaustive research work on temporal entities exists in the literature. In 2008, the International Joint Conference on Natural Language Processing (IJCNLP) proposed a set of 12 named entities for South Asian languages, including a temporal entity that treats date and time as a single entity (Liao & Veeramachaneni, 2009). A rule-based approach focused on date and time tags was adopted in (U. P. Singh et al., 2012). It used regular expressions (RE) to extract specific date patterns, e.g., 01.08.2015 or 01/01/2014; the same system can also identify dates such as May 01, 2018 and achieved a 90.83% F1-measure. To the best of our knowledge, however, there is no detailed discussion of the different types and formats of dates in the Urdu language. The lack of resources, i.e., lexicons, gazetteers, and datasets, is the main factor pushing researchers toward rule-based approaches. In (Riaz, 2010), a generic named entity recognition system used a rule-based approach to extract named entities, including dates, from Urdu text. It achieved a considerable F1-measure for specific patterns, e.g., '1996', but again gives no detail about the types and formats of dates in Urdu. The Center for Language Engineering (CLE) works on the Urdu language and offers various datasets on its website at affordable prices. CLE has also developed a part-of-speech (PoS) tagger that is offered as an online service; it tags 100 words per attempt for free, and full access for further processing can be granted on request. The CLE system, however, does not handle temporal entities (TE). The same website also provides, for a small charge, a compact WordNet whose data is stored in UTF-8 format. Over the last few decades, cursive languages have grown in popularity and attracted researchers developing NLP applications. Temporal data in Urdu has been introduced at a very basic level in various national and international research papers, but no significant work has yet addressed the Urdu temporal entity 'date' and its extraction. To the best of our knowledge, we are the first to work on the temporal entity 'date' in the cursive language Urdu.
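The numeric date patterns mentioned above can be matched with a short regular expression. The sketch below is illustrative only, not the actual system of U. P. Singh et al.; the pattern and function names are our own.

```python
import re

# One or two digits for day and month, four for the year, with '.' or '/'
# as the separator; the backreference \2 forces a consistent separator.
NUMERIC_DATE = re.compile(r"\b(\d{1,2})([./])(\d{1,2})\2(\d{4})\b")

def extract_numeric_dates(text):
    """Return every numeric date string found in `text`."""
    return [m.group(0) for m in NUMERIC_DATE.finditer(text)]

print(extract_numeric_dates("match held on 01.08.2015, replayed on 01/01/2014"))
```

A fuller system would add patterns for month-name dates (e.g., May 01, 2018) and their Urdu counterparts.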
A dataset for Urdu language processing (Khan Wahab, Daud Ali, Nassir A Jamal, 2016) is publicly available for researchers. It is named Urdu Named Entity Recognition (UNER) and was developed specifically for named entity extraction from Urdu text. The dataset reports 12 different types of named entities, including temporal entities; 206 tags in the dataset represent temporal entities. Existing tools designed for English are incompatible with Urdu language processing, which highlights the need for new methods and approaches to process Urdu text. For Arabic, temporal entities have been extracted using morphological analysis and finite state transducers. That system identified 12 temporal morphological categories and augmented the Arabic lexicon with 550 additional tags. It achieved 94.6% recall and 84.2% precision for temporal entity detection; analysis of the dataset further showed 89.7% recall and 90.8% precision for temporal entity boundary detection (Zaraket & Makhlouta, 2012).
In the past, many cursive languages such as Urdu, Arabic, Persian, and Hindi were neglected by researchers because of a lack of resources (Malik, M. K. and Sarwar, 2016b). Only a few cursive languages were publicly known, owing to a lack of interest, inconvenience in processing, and the unavailability of resources, i.e., lexicons, databases, dictionaries, annotation schemes, and datasets (Riaz, 2008). To develop generic NLP applications, it is now necessary to bring cursive languages into the research stream. A temporal-entity-based module was used to extract clusters of real-time and retrospective (old) events (Li et al., 2017); temporal information is essential to distinguish recent events from older ones. An approach was developed (Llidó et al., 2001) to automatically assign a document's event time by extracting temporal expressions from the text, which helped to retrieve related documents based on temporal values and to find the relationships between them.
Summary
We have discussed the research work that exists on event detection, event classification, and temporal entity extraction for different languages, i.e., English, Arabic, Hindi, Persian, etc. We have also tried our best to cover the related work on the Urdu language. The details of the existing approaches used for event and temporal entity extraction have also been described in this chapter.
CHAPTER 3
DATASETS
The purpose of our research work is to classify multiple types of events occurring within a specific time duration. Extracting temporal entities is also very important for constructing event timelines. To achieve these objectives, we decided to use two different datasets: one for event classification, named "Multiclass Urdu Language Labelled Sentences" (MULLS), and the "Urdu Named Entity Recognition (UNER)" dataset (Khan Wahab, Daud Ali, Nassir A Jamal, 2016) for temporal entity extraction.
1. Multiclass Urdu Language Labelled Sentences (MULLS)
2. Urdu Named Entity Recognition (UNER)
A dataset plays a vital role in achieving research goals. We prepared our dataset in comma-separated value (CSV) format. Dataset preparation consists of various phases; the complete life cycle is presented in figure 3.1.
Figure 3.1: Dataset Life Cycle
(The life cycle comprises Phase I: data collection, assigning labels, paragraph splitting, and concatenating other information; Phase II: pre-processing, i.e., stop word elimination, data cleaning, tokenization, and maximum/minimum length filtering; Phase III: dataset distribution for machine learning and deep learning.)
The details of both datasets are given in the following sections.
Phase I
The initial phase of data collection consists of several important steps, each explained in detail in the following sections.
3.1 Multiclass Urdu Language Labelled Sentences (MULLS)
The MULLS corpus is used in our research work for multiclass event extraction and classification. We have named it Multiclass Urdu Language Labelled Sentences (MULLS). The details of the different steps performed during dataset preparation are given in the subsections below.
3.1.1 Data Collection
In the literature, various datasets are reported, but none of them is specific to event classification (Sharjeel et al., 2017; Zia et al., 2015). We therefore created a larger dataset specifically for event classification. Instead of focusing on product-specific analysis (Akhter et al., 2020) or phrase-level sentiment analysis (Awais & Shoaib, 2019), we decided to classify sentences into multiple event classes. Likewise, instead of using a joint CNN and RNN framework for sentiment analysis (A. R. Ali & Ijaz, 2009), we evaluated the performance of deep learning models and popular machine learning classification models for multiclass event classification.
Data is the core element of the research phase, so selecting the data source is a very sensitive decision; authentic, reliable, and popular data sources should be chosen. Nowadays, social networks and news websites are very popular sources of information. We decided to collect Urdu text from several sources rather than a single one, in order to develop a more generic system. In our case, we collected data from both social networks (Twitter and Facebook) and news websites (the Geo News website and Urdu Point).
A Python (version 3.6) crawler script was used to retrieve data from Twitter and Facebook using event-related keywords (دھماکہ، کھیل، بارش، موت، جلسہ اور حکومت وغیرہ, i.e., blast, sports, rain, death, rally, government, etc.). We spent a couple of weeks and crawled 26,000 posts for a specific period from 2017 to 2018 (25-06-2017 to 15-11-2018). The collection comprises twenty-two (22) classes of events, i.e., sports, inflation, murder, terrorist attack, death, accident, politics, education, showbiz, interesting and strange, government-official, earthquake, fraud and corruption, religious, weather, science and technology, international, business, health, law and order, sexual assault, and others.
While collecting data, a PHP-based web scraper was also written to crawl data from popular news websites, i.e., the Geo News website4, BBC Urdu5, and Urdu Point6. A complete post is retrieved from each website and stored in MariaDB (a database); it consists of a title, body, published date, location, and URL. Sample text from two languages of the South Asian countries, i.e., Urdu on Twitter and Hindi on Facebook, is shown in figure 3.2.
Figure 3.2: Urdu and Hindi Language Text on Social Media
There are 0.15 million (150,000) Urdu language sentences. The diversity of the data sources helped us develop a multiclass dataset; it covers twelve types of events, and subsets of the dataset can also be useful to other researchers. As described earlier, our task is to classify events at the sentence level rather than classifying whole documents. Our dataset contains 0.15 million instances (sentences), all labelled with twelve (12) different types of events. The details of each event type and its total number of instances are shown in figure 3.4.
4 https://urdu.geo.tv/
5 https://www.bbc.com/urdu
6 https://www.urdupoint.com/daily/
Phase II
3.1.2 Pre-processing
In the first phase of dataset preparation, we performed some pre-processing steps, i.e., noise removal and sentence annotation/labelling. All non-Urdu words and sentences, hyperlinks, URLs, and special symbols were removed from the dataset; cleaning the dataset was necessary in order to annotate/label the sentences properly. These initial steps prepare the corpus for the machine learning algorithms, because textual data cannot be processed directly by machine learning classifiers and also contains many irrelevant words. The details of all the pre-processing steps followed in our research problem are given below.
3.1.2.1 Post Splitting
The PHP crawler extracted the body of each post, which comprises many sentences as a paragraph. In Urdu script, sentences end with the full stop sign "۔", called khatma (ختمہ). It is the standard punctuation mark that marks the end of a sentence in Urdu. As mentioned earlier, we perform event classification at the sentence level, so we split the paragraphs of every post into sentences: every span of text ending at a khatma is taken as a single sentence.
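The splitting step can be sketched in a few lines of Python. The function name and sample paragraph are illustrative, but the split character is the actual Urdu full stop (U+06D4).

```python
URDU_FULL_STOP = "\u06d4"  # the khatma sign '۔'

def split_sentences(paragraph):
    """Split an Urdu paragraph into sentences at the khatma, dropping empties."""
    return [s.strip() for s in paragraph.split(URDU_FULL_STOP) if s.strip()]

paragraph = "بارش ہوئی\u06d4 میچ ملتوی کر دیا گیا\u06d4"
print(split_sentences(paragraph))  # two sentences
```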
3.1.2.2 Stop Words Elimination
Generally, words that occur very frequently in a text corpus are considered stop words. These words barely affect a classifier's performance. Punctuation marks ("!", "@", "#", etc.) and frequent Urdu function words (کا، کے، کی وغیرہ, etc.) are common examples of stop words. All stop words (Capet et al., 2008) that play no influential role in event classification for Urdu text were eliminated from the corpus. Stop word elimination reduces memory and processing utilization and makes processing more efficient. A list of standard stop words of the Urdu language is available here7.
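A minimal sketch of the elimination step, assuming a small illustrative stop word set rather than the full published list:

```python
# A handful of illustrative Urdu stop words; in practice the full standard
# list (e.g. the Kaggle list referenced above) would be loaded from a file.
STOP_WORDS = {"کا", "کے", "کی", "نے", "سے", "اور"}

def remove_stop_words(tokens):
    """Drop every token that appears in the stop word set."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["بارش", "کی", "وجہ", "سے", "میچ", "ملتوی"]))
```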
3.1.2.3 Noise Removal
Our data was collected from different sources (see section 3.1), so it contains many noisy elements, i.e., multilingual words, links, mathematical characters, special symbols, etc. In the collected corpus we found many multilingual sentences in the posts. To make our corpus clean and ready for further processing, we removed those sentences, irrelevant links, and special characters from the corpus.
7 https://www.kaggle.com/rtatman/urdu-stopwords-list
3.1.2.4 Filtering Sentences
The nature of our problem required us to define a limit on the number of words per sentence. Because of the multiple types of events, it is hard to find sentences of the same length, and we wanted to keep as many sentences as possible in our corpus; only the very short and very long sentences were removed. Our observation showed that sentence lengths vary from 5 words to 250 words. To limit our research problem and the consumption of processing resources, we decided to keep sentences of 5 to 150 words.
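The 5-150 word filter can be expressed as a simple predicate over whitespace tokens; the helper name and the sample sentences below are our own.

```python
def within_length_bounds(sentence, lo=5, hi=150):
    """Keep a sentence only if its whitespace token count lies in [lo, hi]."""
    return lo <= len(sentence.split()) <= hi

sentences = ["بہت چھوٹا جملہ",
             "طوفانی بارش کی وجہ سے کئی گھروں کی چھت گر گئی"]
kept = [s for s in sentences if within_length_bounds(s)]
print(len(kept))  # the 3-word sentence is dropped
```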
3.1.3 Annotation Guidelines
• Go through each sentence and assign a class label.
• Remove ambiguous sentences.
• Merge related classes into a single class, e.g., accident, murder, and death.
• Assign one of the twelve event types, i.e., Sports, Inflation, Murder and Death, Terrorist Attack, Politics, Law and Order, Earthquake, Showbiz, Fraud and Corruption, Weather, Sexual Assault, or Business, to each sentence.
Sentence annotation is an imperative and attention-demanding task in our research project. It was very exhaustive and time-consuming, taking a couple of weeks to assign a label to each sentence. To annotate the dataset, two language experts holding an M.Phil. in Urdu were engaged. They read and analyzed the dataset sentence by sentence before assigning event labels. They recommended removing 46,035 sentences from the dataset because those sentences did not contain information useful for event classification. After annotation, the dataset was thus reduced to 103,965 imbalanced instances of twelve different types of events. The inter-annotator agreement, i.e., Cohen's kappa score, is 0.93, which indicates strong agreement between the two language experts; by this score the annotated dataset is almost perfect.
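Cohen's kappa can be computed with scikit-learn's `cohen_kappa_score`. The annotator labels below are dummy values for illustration, not the actual annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators over six sentences
# (1 = Sports, 3 = Murder and Death, 5 = Politics).
annotator_a = [1, 5, 5, 3, 1, 3]
annotator_b = [1, 5, 5, 3, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(round(kappa, 3))
```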
A few examples of labelled sentences are given in the table below:
Table 3.1: Urdu Labelled Sentences
Example no. | Title | Sentence | Class/Label
1 | کھیل | انگلینڈ اور نیوزی لینڈ کے درمیان کرکٹ ورلڈ کپ کا یادگار میچ ہوا ہے۔ | Sports
2 | سیاست | عمران خان نے جلسوں کی مدد سے لوگوں میں سیاسی شعور پیدا کیا۔ | Politics
3 | موت | ڈینگی مچھر کے کاٹنے کی وجہ سے دس افراد لقمہ اجل بن گئے۔ | Death
4 | مہنگائی | خوردونوش اشیا کی قیمتوں میں اضافہ نے غریب عوام کی کمر توڑ دی۔ | Inflation
5 | دھماکہ | کوئٹہ کے نواحی علاقہ میں خود کش دھماکہ، متعدد افراد جاں بحق ہوئے۔ | Terrorist Attack
After data cleaning and stop word removal, every sentence is tokenized into words based on white space. An example of sentence tokenization is given in table 3.2.
Table 3.2: Sentence Tokenization
Sentence | Tokenized sentence
کرونا وائرس نے متعدد لوگوں کی جان لے لی۔ | کرونا وائرس متعدد لوگوں جان لے لی
طوفانی بارش سے کئی گھروں کی چھت گر گئی۔ | طوفانی بارش کئی گھروں چھت گر گئی
The previous pre-processing steps revealed that sentences vary greatly in length: some were very short and many very long. We decided to define length boundaries for the tokenized sentences. Sentences in the dataset range from 5 to 250 words; we selected those of 5 to 150 words. An integer value was then assigned to the event type of every selected sentence. The event types and their corresponding numeric (integer) values used in the dataset are given in the table below.
Table 3.3: Class Labels
Event | Label | Event | Label
Sports | 1 | Earthquake | 7
Inflation | 2 | Showbiz | 8
Murder and Death | 3 | Fraud and Corruption | 9
Terrorist Attack | 4 | Rain/Weather | 10
Politics | 5 | Sexual Assault | 11
Law and Order | 6 | Business | 12
Figure 3.3 shows a few instances of the dataset after pre-processing. It is a comma-separated value (CSV) file consisting of two fields, i.e., sentence and label.
Figure 3.3: Instances of the Pre-Processed Dataset
In our dataset, three event types have a large number of instances, i.e., sports (18,746), politics (33,421), and fraud and corruption (10,078), while three others have comparatively few, i.e., sexual assault (2,916), inflation (3,196), and earthquake (3,238). The remaining event types show smaller differences in instance counts among themselves. There are 51,814 unique words, and the total number of tokens is 2,079,967. A summary of the dataset is given in the table below.
Table 3.4: Summary of the Dataset
Event types | Total tokens | Unique tokens | Total sentences | Labelled sentences
12 | 2,079,967 | 51,814 | 150,000 | 103,965
The visualization in figure 3.4 shows that the dataset is imbalanced.
Figure 3.4: Maximum Number of Instances of Each Type of Event
During the second phase of pre-processing, the dataset is converted into a machine-understandable format. The steps performed to convert the text data into numeric format are:
• Count Vectorizer
Machine learning classifiers learn patterns from numeric values. As human beings we can understand the structure of information directly from text, but machines cannot; so, whether a natural language problem is simple or complex, all non-numeric data (text, audio, video, images, graphs, etc.) must be converted into numeric format before being fed to a machine learning classifier. The count vectorizer counts the total number of times each term occurs in the documents. It tokenizes the collection of documents and creates a vocabulary of known words; this vocabulary works like a dictionary for generating feature vectors, and new documents can be encoded using it. The encoded vector returned has the length of the entire vocabulary and contains an integer count of the number of times each word appears in the document. The working of the count vectorizer is illustrated in table 3.5. For example, consider two sentences:
i. Millions of people died of Covid-19.
ii. The fifth war is a psychological game among people.
Table 3.5: Example of the Count Vectorizer
 | Million | War | Fifth | Covid-19 | Died | People | Psychological | Game
Sentence 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0
Sentence 2 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1
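The counting in Table 3.5 corresponds to scikit-learn's `CountVectorizer`. Note that its default tokenizer lowercases, drops one-letter tokens, and splits "Covid-19" into two tokens, so the learned vocabulary differs slightly from the hand-built table.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Millions of people died of Covid-19.",
    "The fifth war is a psychological game among people.",
]

# Learn the vocabulary and encode both documents as term-count vectors.
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))
print(matrix.toarray())
```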
• Term Frequency-Inverse Document Frequency (TF-IDF)
Term frequency refers to the total count of a term t in a document. Text mining, information retrieval, and classification are generally based on weighted tf-idf values. These weighted values are a statistical measure of how important a word/term is within the collection of corpus/dataset documents. The importance increases proportionally to the number of times a word appears in a document, but is offset by the frequency of the word across the corpus. One of the simplest ranking functions is computed by summing the tf-idf of each query term; many more sophisticated ranking functions are variants of this simple model. Tf-idf can also be used successfully for stop-word filtering in various subject fields, including text summarization and classification.
IDF(t) = log_e(Total number of documents / Number of documents with term t in it) (Wu et al., 2008) (1)
TF-IDF = TF × IDF (2)
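Equations (1) and (2) can be checked on a toy corpus; the three "documents" below are invented for illustration (note that library implementations such as scikit-learn use a smoothed variant of the IDF formula).

```python
import math

# Toy corpus of three "documents" (token lists); idf follows equation (1):
# idf(t) = ln(N / df(t)), and tf(t, d) is the raw count of t in d.
docs = [["rain", "match"], ["rain", "rain", "goal"], ["election", "vote"]]
N = len(docs)

def tf(term, doc):
    return doc.count(term)

def idf(term):
    df = sum(1 for d in docs if term in d)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(round(tf_idf("rain", docs[1]), 4))  # 2 * ln(3/2)
```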
3.1.4 Training Dataset
To develop a generic model for multiclass event classification, we divided our dataset into three subsets, i.e., a training dataset, a testing dataset, and a validation dataset. Random distribution of the data was performed using the Python library scikit-learn. We randomly assigned 75% of the dataset for training, giving 77,974 labelled instances across the different event types in our training dataset. This multiclass training dataset is used to train the deep learning models.
3.1.5 Testing/Validation Dataset
To evaluate the performance of our trained models, we used the remaining 25% of the dataset for testing/validation purposes. It consists of 25,991 instances never seen by the trained models; 10% of this testing dataset was used for validation.
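Assuming scikit-learn's `train_test_split` (the library the text refers to), the 75/25 split with a 10% validation slice of the test portion looks like this on dummy data:

```python
from sklearn.model_selection import train_test_split

# Dummy data: 1000 "sentences" with twelve event labels.
X = list(range(1000))
y = [i % 12 for i in X]

# 75% train / 25% test, then 10% of the test portion held out for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(
    X_test, y_test, test_size=0.10, random_state=42)

print(len(X_train), len(X_test), len(X_val))  # 750 225 25
```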
Phase III
Data distribution
We further divided our event classification dataset into two sub-datasets, in order to evaluate the performance of both traditional machine learning classifiers and advanced machine learning (deep learning) classifiers. It can be observed in figure 3.1 that machine learning itself is not used during dataset preparation; rather, in the last stage of the dataset preparation life cycle we created two different datasets from the same corpus, one for the machine learning classifiers and the other for the deep learning classifiers. The main difference between the two is that the dataset used for the deep learning classifiers contains only the sentence, while the dataset used for machine learning contains additional attributes such as title, length, and sentence.
1 Machine Learning
A dataset that includes the other related information (title, location, and date) described above is used to evaluate the traditional machine learning classifiers. It contains the same number of instances as the whole dataset.
2 Deep Learning
Another dataset, containing only the tokenized sentences without the other features (title, location, and date), is used to evaluate the deep learning classifiers. The number of sentences is the same as reported in the paragraphs above.
The details of the temporal entity dataset are given in the next section.
3.2 Urdu Named Entity Recognition (UNER) Dataset
Another dataset used in our work is publicly available for researchers to extract named entities from Urdu language text. Datasets for the Urdu language generally exist only for named entity extraction and contain a small number of instances:
• Enabling Minority Language Engineering (EMILLE) (only 200,000 tokens) (Baker, Paul, Andrew Hardie, Tony McEnery, 2003).
• Becker-Riaz corpus (only 50,000 tokens) (Becker & Riaz, 2002).
• International Joint Conference on Natural Language Processing (IJCNLP) workshop corpus (only 58,252 tokens).
• Computing Research Laboratory (CRL) annotated corpus (only 55,000 tokens publicly available) (Kanwal et al., 2019).
A rule-based named entity recognition system for the Urdu language was proposed to extract named entities (Riaz, 2010). To our knowledge, there is no dataset specifically available for temporal entity extraction from the Urdu language. We therefore selected a dataset developed for named entity extraction (Khan Wahab, Daud Ali, Nassir A Jamal, 2016). It contains 206 date tags, each consisting of a single month name, a year, or both, drawn from national, sports, and international news, and covering Urdu Fully Qualified, Urdu Hybrid Fully Qualified, Urdu Deictic, and Urdu Anaphoric dates. An exhaustive analysis revealed that there are only 5-10 fully qualified dates, which was discouraging. It also revealed 18 different date patterns within this limited set of date tags, which makes it hard to generate generic regular expressions for date extraction.
We decided to extend the existing dataset by adding 200 Urdu fully qualified dates and 50 Urdu deictic words: 50 dates for UFQD and 150 dates for HUFQD were added to the UNER dataset. Of the 50 deictic words, 25 represent dates while 25 represent named entities. We placed these dates at different locations within the documents, i.e., at the beginning, middle, and end of sentences.
For example, in the sentence پاکستان کی تاریخ میں چھ ستمبر دو ہزار اٹھارہ کو سنہری الفاظ میں لکھا جائے گا۔ (Pakistan ki tareekh mein chey satambar do hazar atharah ko sunehri alfaz mein likha jaye ga, "In the history of Pakistan, the sixth of September two thousand eighteen will be written in golden words"), the phrase چھ ستمبر دو ہزار اٹھارہ (chey satambar do hazar atharah) represents a date placed in the middle of the sentence.
Summary
In this chapter we have discussed in detail the two datasets that we have used in our research work.
CHAPTER 4
EVENT CLASSIFICATION
In this chapter the details of the experiments, the different feature vector generation techniques, the proposed methodology, and the results related to event classification are discussed.
4.1 Proposed Methodology for Event Classification
The selection of methodology is tightly coupled to the research problem. For our problem, we decided to use machine learning classifiers, covering both traditional machine learning and deep learning approaches. The traditional machine learning algorithms evaluated for multiclass event classification are K-Nearest Neighbors (K-NN), Random Forest (RF), Support Vector Machine (SVM), Decision Tree (DT), and Multinomial Naïve Bayes (MNB). The deep learning models evaluated are a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), and a Recurrent Neural Network (RNN).
A collection of Urdu text documents D = {d1, d2, …, dn} is split into a set of sentences S = {s1, s2, …, sm}. Our purpose is to classify each sentence into a predefined set of events E = {e1, e2, …, ek}. Various feature generation methods, i.e., TF-IDF, one-hot encoding, and word embedding, are used to create feature vectors for the deep learning and machine learning classifiers. The feature vectors generated by these techniques are fed as input into the embedding layer of the neural networks; the output of the embedding layer is fed to the next, fully connected (dense) layer of the deep learning models, i.e., RNN, CNN, and DNN. At the end of model processing, in the testing/validation phase, one of the twelve class labels is assigned to each sentence.
Bag-of-words is a common method for representing text, but it ignores word order and the semantics of the text (Joachims, 1998), while the one-hot encoding method maintains the sequence of the text. The word embedding methods word2vec and GloVe8, used to generate feature vectors for deep learning models, are highly recommended for textual data. However, for Urdu text classification, the pre-existing word2vec and GloVe models are incompatible. The framework of our designed system is represented in figure 4.1; it shows the structure of the system from input to output.
8 https://ybbaigo.gitbooks.io/26/pretrained-word-embeddings.html
Figure 4.1: Event Classification Methodology
Figure 4.2 shows the comprehensive, detailed flow of the process. It is a generic representation of the experimental setup that summarises all of the steps.
Figure 4.2: Event Classification Methodology’s Flow Diagram
(The flow diagram traces an input Urdu sentence (شدید دھند کی وجہ سے نظامِ زندگی درہم برہم ہے, "life is disrupted due to dense fog") through pre-processing (stop word elimination, tokenization, annotation), feature selection and feature engineering (TF-IDF, one-hot encoding, word embedding), training and testing/validation with the classifiers (DNN (feedforward), RNN (LSTM), CNN, KNN, SVM, Random Forest, MNB, Decision Tree, Logistic Regression), and finally prediction of one of the twelve event classes: Sports, Inflation, Murder, Terrorist Attack, Politics, Law and Order, Earthquake, Showbiz, Fraud and Corruption, Weather, Sexual Assault, Business.)
In this chapter, the experiments and results for event classification and temporal entity extraction are described section-wise. The first section presents the experimental setup and results of multiclass event classification, while the second gives the details of the experiments and results for temporal entities.
4.2 Experimental Setup of Multiclass Event Classification
We performed many experiments on our dataset using various traditional machine learning and deep learning classifiers. The purpose of these experiments is to find the most efficient and accurate multiclass event classification model for an imbalanced Urdu language text dataset.
4.2.1 Feature Space
Unigram and bigram tokens of the whole corpus are used as features to create the feature space. TF-IDF vectorization is used to create a dictionary-based model, which consists of 656,608 features; the training and testing datasets are converted to TF-IDF dictionary-based feature vectors. A convolutional sequential model consisting of three layers, i.e., an input layer, a hidden layer, and an output layer, is used to evaluate our dataset. Similarly, word embedding and one-hot encoding are also included in our feature space to enlarge the scope of our research problem.
4.2.2 Feature Vector Generating Techniques
Feature vectors are the numerical representation of text and are the actual form of input that a machine learning classifier can process. Several feature generation techniques are used for text processing; we used the following feature vector generation techniques.
4.2.2.1 Word Embedding
Word embedding is a numerical representation of text in which each word is represented as a feature vector. It creates a dense vector of real values that captures the contextual, semantic, and syntactic meaning of the word, and it ensures that similar words have similar weighted values (AHMED et al., 2016).
4.2.2.2 Pretrained Word Embedding Models
Using a pre-trained word embedding model when only a small amount of data is available is highly recommended in the state of the art. GloVe and word2vec are famous word embedding models developed from very large amounts of data. Word embedding models have shown promising results for text classification, especially in the English language.
Word embedding has emerged as a powerful feature vector generation technique compared with others, i.e., TF, TF-IDF, and one-hot encoding. In our research case, using the word embedding technique for classifying Urdu sentences into event classes is potentially preferable. Unfortunately, the Urdu language lacks processing resources; we found only three word embedding models. One word embedding model (Baker, Paul, Andrew Hardie, Tony McEnery, 2003) was developed using three publicly available Urdu datasets: Wikipedia's Urdu text, a corpus of 90 million tokens (Jawaid et al., 2014), and one of 35 million tokens (Baker, Paul, Andrew Hardie, Tony McEnery, 2003). It has 102,214 unique tokens, each represented by 300-dimensional real values. Another model publicly available for research purposes consists of 25,925 unique Urdu words (Abdlrauf, 2017), each with a 400-dimensional vector. A third model, built from web-based text for text classification, consists of 64,653 unique Urdu words with 300 dimensions per word.
Our research did not stop here: to expand its scope and find the most efficient word embedding model for sentence classification, we decided to develop our own custom (domain/data-specific) word embedding models. We developed four word embedding models, each containing 57,251 unique words.
The results of the existing pre-trained word embedding models were reasonable at the initial level but remained low, the highest accuracy being 60.26%. Exploring the contents of these models revealed that many words are irrelevant or borrowed from other languages, i.e., Arabic and Persian. The contents of Wikipedia are also entirely different from those of news websites, which affected the performance of the embedding models, and the low amount of data further reduced the quality of feature vector generation. Moreover, stop words are not eliminated in the pre-trained models and are treated as tokens, while in our dataset all stop words are removed; this also reduces the effective vocabulary of the model when generating feature vectors. We therefore decided to develop custom word embedding models on our pre-processed dataset; to broaden the research task, four different word embedding models were developed. The details of all the word embedding models used are given in Table 4.1 below.
Table 4.1: Pre-trained and Custom Word Embedding Models
Existing pre-trained word embedding models
Sr. No. | Unique Words | Dimension | Window Size
1 (Pillac et al., 2012) | 64653 | 300 | -
2 (Nuij et al., 2014) | 102214 | 100 | -
3 | 53454 | 300 | -
Custom word embedding models
Sr. No. | Unique Words | Dimension | Window Size
1 | 57251 | 50 | 2
2 | 57251 | 100 | 2
3 | 57251 | 100 | 3
4 | 57251 | 350 | 1
4.2.2.3 One Hot Encoding
Text cannot be processed directly by machine learning classifiers; therefore, we need to convert it into real values. We used one-hot encoding to convert text into numeric features. For example, the sentences given in table 4.2 can be converted to numeric feature vectors using one-hot encoding, as shown in table 4.3.
Table 4.2: Event Sentences
Urdu Sentence | English Sentence
علی فٹ بال کھیلتا ہے۔ | Ali plays football.
کرونا وائرس نے لاکھوں لوگوں کی جان لے لی۔ | Corona virus killed millions of people.
Table 4.3: Event Sentences Encoded Using One-Hot Encoding
Sentence | علی | فٹ | بال | کھیلتا | کرونا | وائرس | لاکھوں | لوگوں | جان
1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0
2 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1
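The presence-based encoding of Table 4.3 can be reproduced with a few lines of plain Python; the helper names below are our own.

```python
# One column per vocabulary word: 1 if the word occurs in the sentence, else 0.
sentences = [
    ["علی", "فٹ", "بال", "کھیلتا"],
    ["کرونا", "وائرس", "لاکھوں", "لوگوں", "جان"],
]
vocab = sorted({w for s in sentences for w in s})

def encode(sentence):
    """Binary presence vector over the shared vocabulary."""
    return [1 if w in sentence else 0 for w in vocab]

vectors = [encode(s) for s in sentences]
print(vectors[0])
```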
4.2.2.4 TF-IDF
TF and TF-IDF are feature engineering techniques that transform text into numerical format; TF-IDF is among the most widely used feature vector creation methods for text data. Three deep learning models were evaluated on our corpus. The sequential model with an embedding layer outperformed the pre-trained word embedding models (Haider, 2019) reported in the state of the art (Adeeba, F., Akram, Q., Khalid, H., and Hussain, 2014). A detailed summary of the evaluation results of CNN, RNN, and DNN is given in the following section.
4.3 Deep Learning Models
4.3.1 Deep Neural Network Architecture/Feedforward Neural Network
(DNN)
Deep neural networks are artificial neural networks; the simplest DNN is also known as a feedforward neural network. The DNN architecture used in our research work consists of three layers, i.e., an input layer, a hidden (dense) layer of 150 units, and an output layer of 12 units. The feature vector is given as input to the fully connected dense layer, and the softmax activation function is used in the output layer to classify sentences into the multiple classes.
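A forward pass of such a network can be sketched in NumPy. The random weights, the 300-dimensional input, and the ReLU hidden activation are our own assumptions for illustration (training itself is done by the deep learning framework), but the 150-unit hidden layer and 12-way softmax output mirror the architecture described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.normal(size=300)                # embedded sentence vector (assumed dim)
W1 = rng.normal(size=(150, 300)) * 0.1  # input -> 150 hidden units
W2 = rng.normal(size=(12, 150)) * 0.1   # hidden -> 12 event classes

hidden = np.maximum(0, W1 @ x)          # ReLU activation (assumption)
probs = softmax(W2 @ hidden)            # class probability distribution

print(probs.shape, round(float(probs.sum()), 6))
```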
4.3.2 Recurrent Neural Network (RNN)
The recurrent neural network is evaluated using a long short-term memory (LSTM) classifier. The RNN consists of embedding, dropout, LSTM, and dense layers. A dictionary of the 30000 most frequent unique tokens is built, and the sentences are standardized to the same length using sequence padding. The dimension of the feature vector is set to 250. The RNN showed an overall accuracy of 81%, the second highest in our work.
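The sequence-padding step can be sketched in plain Python. This is a simplified stand-in for library routines such as Keras' pad_sequences (which pads at the front by default); here sequences are post-padded with zeros.

```python
# Standardize token-id sequences to a fixed length (250 in our setup):
# long sequences are truncated, short ones are post-padded with 0.
def pad_sequences(seqs, maxlen, pad_value=0):
    padded = []
    for s in seqs:
        s = s[:maxlen]                              # truncate long sequences
        padded.append(s + [pad_value] * (maxlen - len(s)))
    return padded

batch = pad_sequences([[4, 17, 9], [8, 2]], maxlen=5)
print(batch)  # [[4, 17, 9, 0, 0], [8, 2, 0, 0, 0]]
```

Padding gives every sentence the same shape, which the embedding and LSTM layers require for batched training.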
4.3.3 Convolutional Neural Network (CNN)
CNN is a class of deep neural networks that is highly recommended for image processing (Valueva et al., 2020). It consists of an input (embedding) layer, multiple hidden layers, and an output layer. A series of convolutional layers convolve the input with learned filters. An embedded sequence layer and an average pooling layer (GlobalAveragePooling1D) are also part of the hidden layers. The most common activation function in CNNs is the ReLU layer. The details of the hyperparameters used to train the CNN model for our problem are given in Table 4.6.
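The effect of the GlobalAveragePooling1D layer can be illustrated in NumPy: the pooled output is simply the mean of the token embeddings over the sequence axis (the toy matrix below is an assumption for illustration).

```python
import numpy as np

# GlobalAveragePooling1D, sketched: collapse a (sequence_length, embedding_dim)
# matrix into one embedding_dim vector by averaging over the token axis.
seq_len, emb_dim = 4, 3
embedded = np.arange(seq_len * emb_dim, dtype=float).reshape(seq_len, emb_dim)

pooled = embedded.mean(axis=0)   # average over the 4 tokens
print(pooled)                    # [4.5 5.5 6.5]
```

This pooling gives a fixed-size sentence representation regardless of sentence length, which the final dense layers can then classify.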
4.3.4 Hyperparameters
In this section, all the hyperparameters used in our experiments are given in tabular format. To maintain the conciseness and brevity of the dissertation, only those hyperparameters that achieved the highest accuracy for the DNN, RNN, and CNN models are discussed. The hyperparameters of the DNN that were fine-tuned in our work are given in Table 4.4.
Table 4.4: DNN's Hyperparameters
Parameter            Value
Max_words            5000
Batch Size           128
Embedding_Dim        512
Activation Function  SoftMax
Layers               04
Training/Testing     70%-30%
No. of Epochs        05
Loss Function        Sparse Categorical Cross-Entropy
The RNN model showed accuracies of 80.3% and 81% on the two sets of hyperparameters given in Table 4.5. Similarly, Table 4.6 provides the details of the hyperparameters of the convolutional neural network.
Table 4.5: RNN's Hyperparameters
RNN (LSTM) (80.3%)
Parameter            Value
Max_words            50000
Batch Size           64
Embedding_Dim        100
Activation Function  SoftMax
Recurrent Dropout    0.2
Training/Testing     90%-10%
No. of Epochs        15
Loss Function        Sparse Categorical Cross-Entropy
RNN (LSTM) (81%)
Parameter            Value
Max_words            30000
Batch Size           128
Embedding_Dim        100
Activation Function  SoftMax
Recurrent Dropout    0.2
Training/Testing     80%-20%
No. of Epochs        05
Loss Function        Sparse Categorical Cross-Entropy
Table 4.6: CNN's Hyperparameters
CNN (79.28%)
Parameter            Value
Max_words            20000
Batch Size           128
Embedding_Dim        50
Activation Function  SoftMax
Dense_Node           256
Training/Testing     70%-30%
No. of Epochs        20
Loss Function        Categorical Cross-Entropy
Note: These are the optimal numbers of epochs for our models, i.e., those that yielded the highest results.
4.3.5 Performance Measuring Parameters
The most common performance measuring parameters (Al-Radaideh & Al-Abrat, 2019), i.e., precision, recall, and F1-measure, are used to evaluate the proposed framework. These parameters were selected because of the multiclass classification task and the imbalanced dataset. In the case of an imbalanced dataset, reporting only the accuracy of a system is biased and unreliable. Hence, we report all the other standard metrics to determine the reliability of the proposed system.
Precision = TP / (TP + FP)                               (3)
Recall = TP / (TP + FN)                                  (4)
F1 = 2 * (Precision * Recall) / (Precision + Recall)     (5)
Accuracy = (TP + TN) / (TP + TN + FP + FN)               (6)
Where TP, TN, FP, and FN represent the True Positive, True Negative, False Positive, and False Negative counts, respectively. Precision is the fraction of retrieved instances that are relevant (the TP values), and recall is the fraction of all relevant instances that were retrieved during the experimental work. It is noteworthy that both precision and recall are relative measures of relevance.
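Equations (3)-(6) can be checked numerically; the confusion-matrix counts below are made-up values for illustration only.

```python
# The four metrics above, computed from illustrative confusion-matrix counts.
tp, tn, fp, fn = 80, 90, 20, 10

precision = tp / (tp + fp)                  # Eq. (3): 0.8
recall = tp / (tp + fn)                     # Eq. (4): ~0.889
f1 = 2 * precision * recall / (precision + recall)   # Eq. (5)
accuracy = (tp + tn) / (tp + tn + fp + fn)  # Eq. (6): 0.85

print(round(precision, 3), round(recall, 3), round(f1, 3), accuracy)
```

Note how a classifier can have 85% accuracy while the F1 for the positive class tells a more balanced story, which is why both are reported for our imbalanced dataset.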
4.4 Results
4.4.1 Deep Learning Classifiers
The feature vector can be generated using different techniques. The results of the feature vector generating techniques used in our work, i.e., "multiclass event classification for the Urdu language text", are given in the following subsections.
4.4.1.1 Pre-trained Word Embedding Models
The convolutional neural network model is evaluated on the feature vectors generated by all the pre-trained word embedding models. A summary of all the results generated by the existing pre-trained (Haider, 2019) and custom pre-trained word embedding models is given in Table 4.7. Our custom pre-trained word embedding model, which contains 57251 unique tokens, a larger dimension size of 350, and a window size of 1, showed 38.68% accuracy. The purpose of developing a separate custom pre-trained word embedding model was to obtain a domain-specific model and achieve the highest accuracy. However, the results of both the pre-existing pre-trained word embedding models and the domain-specific custom word embedding models are very low.
Table 4.7: Classification Accuracy of the CNN Model
Sr. No.   Existing pre-trained model's validation accuracy   Custom pre-trained model's validation accuracy
1         58.00                                              36.85
2         60.26                                              38.04
3         56.68                                              37.38
4         -                                                  38.68
4.4.1.2 TF-IDF Feature Vector
The DNN architecture consists of an input layer, a dense layer, and a max pooling layer. The dense layer, also called a fully connected layer, comprises 150 nodes. The SoftMax activation function and the sparse categorical cross-entropy loss are used to compile the model on the dataset.
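The sparse categorical cross-entropy loss used to compile the model can be sketched as follows; the probabilities and labels are toy values, and the point is that this loss accepts integer class labels rather than one-hot targets.

```python
import math

# Sparse categorical cross-entropy: the mean negative log-probability the
# model assigns to the true class, with labels given as plain integers.
def sparse_categorical_crossentropy(y_true, y_prob):
    return -sum(math.log(p[t]) for t, p in zip(y_true, y_prob)) / len(y_true)

probs = [[0.7, 0.2, 0.1],   # model output for sample 1
         [0.1, 0.8, 0.1]]   # model output for sample 2
labels = [0, 1]             # integer labels, not one-hot vectors
loss = sparse_categorical_crossentropy(labels, probs)
print(round(loss, 4))  # ≈ 0.2899
```

With twelve event classes, integer labels avoid materializing a 12-wide one-hot target for every sentence, which is why the sparse variant is used instead of plain categorical cross-entropy.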
25991 instances are used to validate the accuracy of the DNN model. The DNN with fully connected layer architecture showed 84% overall accuracy across all event classes. The details of the performance measuring parameters for each class of events are given in the table below. Law and Order, the 6th type of event in our dataset, consists of 2000 instances that are used for validation. It showed 66% accuracy, which is comparatively low compared with the other types of events, and this affected the overall performance of the DNN model. The main reason behind these results is that law-and-order sentences overlap with sentences about politics; even humans sometimes can hardly distinguish between law-and-order and political statements.
For example,
"حکومت کے وزیر کی غیر ذمہ دارانہ گفتگو خطے کے امن کے لیے خطرہ ہے۔"
“The irresponsible talk of state minister is a threat to peace in the region.”
The performance details of the DNN model, which showed 84% accuracy over the multiple classes of events, are given in Table 4.8. All the other performance measuring parameters, i.e., precision, recall, and F1-score, of each class of events are also given in Table 4.8.
Table 4.8: Performance Measuring Parameters for DNN Model
Class Precision Recall F1-Score Support
1 0.96 0.95 0.96 4604
2 0.91 0.91 0.91 776
3 0.75 0.75 0.75 1697
4 0.78 0.70 0.74 770
5 0.81 0.85 0.83 8424
6 0.71 0.63 0.67 2000
7 1.00 1.00 1.00 817
8 0.92 0.90 0.91 1839
9 0.70 0.70 0.71 2524
10 0.95 0.99 0.97 856
11 0.95 0.99 0.97 741
12 0.82 0.73 0.77 943
Accuracy 0.84 25991
Macro avg 0.84 0.84 0.85 25991
Weighted avg 0.84 0.84 0.84 25991
The expected solution to the problem of sentences overlapping multiple classes is to use a pre-trained word embedding model such as Word2Vec or GloVe (for the English language). However, for the Urdu language, there is no mature (efficient, accurate) pre-trained word embedding model.
The RNN sequential deep learning model architecture is used in our experiments. The recurrent deep learning model architecture consists of the following sequence of layers: an embedding layer with 100 dimensions, SpatialDropout1D, an LSTM layer, and dense layers. The sparse categorical cross-entropy loss function is used to compile the model; multiclass classification is handled by sparse categorical cross-entropy instead of categorical cross-entropy. A SoftMax activation function is used at the dense layer instead of the sigmoid function, since SoftMax produces a probability distribution over multiple classes, while sigmoid is suited to binary classification.
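The contrast between the two activations can be illustrated numerically (the logits below are arbitrary values chosen for illustration):

```python
import math

# Softmax turns a vector of logits into a probability distribution over all
# classes; sigmoid squashes a single logit into one probability (binary case).
def softmax(logits):
    e = [math.exp(z) for z in logits]
    s = sum(e)
    return [v / s for v in e]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

multi = softmax([2.0, 1.0, 0.1])
print(round(sum(multi), 6))    # 1.0 -> one probability per class, summing to 1
print(sigmoid(0.0))            # 0.5 -> a single binary probability
```

Because the twelve event classes are mutually exclusive at the sentence level, a distribution over all classes (softmax) is the appropriate output, not twelve independent sigmoids.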
A bag-of-words of 30000 unique Urdu language words is used to generate the feature vector. The maximum length of the feature vector is 250 tokens.
The overall accuracy of the RNN model is presented in Table 4.9; it achieved 81% validation accuracy on our problem using TF-IDF feature vectors. The other performance evaluation parameters for each class are also given in Table 4.9.
Table 4.9: Performance Measuring Parameters for RNN Model
Class Precision Recall F1-score Support
1 0.95 0.95 0.95 4604
2 0.78 0.77 0.78 776
3 0.70 0.72 0.71 1697
4 0.78 0.64 0.70 770
5 0.78 0.84 0.81 8424
6 0.67 0.57 0.62 2000
7 1.00 1.00 1.00 817
8 0.91 0.87 0.89 1839
9 0.70 0.63 0.66 2524
10 0.93 0.98 0.95 856
11 0.86 0.94 0.90 741
12 0.76 0.67 0.71 943
Accuracy 0.81 25991
macro avg 0.82 0.80 0.81 25991
weighted avg 0.81 0.81 0.81 25991
The accuracy of the RNN model can be viewed in figure 4.3, where the y-axis represents
the accuracy, and the x-axis represents the number of epochs. RNN achieved 81% accuracy
for multiclass event classification.
Figure 4.3: RNN’s Accuracy
Although CNN is highly recommended for image processing, it showed considerable results for multiclass event classification on textual data. The performance measuring parameters of the CNN classifier are given in Table 4.10.
Table 4.10: Performance Measuring Parameters for the CNN Model
Class Precision Recall F1-score Support
1 0.96 0.93 0.95 5661
2 0.81 0.65 0.72 967
3 0.72 0.68 0.70 2115
4 0.78 0.54 0.64 878
5 0.73 0.88 0.80 10030
6 0.64 0.51 0.57 2293
7 0.99 0.99 0.99 970
8 0.91 0.86 0.88 2259
9 0.71 0.61 0.66 3044
10 0.93 0.94 0.93 1031
11 0.91 0.82 0.86 889
12 0.77 0.63 0.70 1052
Accuracy 0.80 31189
macro avg 0.82 0.75 0.78 31189
weighted avg 0.80 0.80 0.80 31189
The class-wise accuracy distribution of the CNN classifier over the twelve classes can be viewed in Figure 4.4. The presence of more than one peak (higher accuracies) in Figure 4.4 shows that the dataset is imbalanced.
Figure 4.4: CNN’s Distribution Accuracy
4.4.1.3 ONE-HOT-ENCODING
The performance of the deep learning classifiers used in our research work on One-Hot-Encoding features is presented in Figure 4.5. The one-hot-encoded feature vectors are given as input to the CNN, DNN, and RNN deep learning classifiers. RNN showed better accuracy than CNN, while DNN outperformed both: RNN and DNN achieved 81% and 84% accuracy, respectively, for multiclass event classification.
Figure 4.5: CNN, RNN, and DNN Accuracy Using One-Hot-Encoding
4.5 Traditional Machine Learning Classifiers
To enlarge the scope of the research and develop a generic, efficient model, some well-known machine learning classifiers are also evaluated for multiclass event classification on the MULLS dataset: k-NN, Decision Tree, Naïve Bayes Multinomial, Random Forest, Logistic Regression, and Support Vector Machine.
All these models are evaluated using TF-IDF and one-hot encoding features as feature vectors. It is observed that the results generated using TF-IDF features are better than those generated using one-hot encoding features. A detailed summary of the results of the above-mentioned machine learning classifiers is given in the following sections.
4.5.1 K-Nearest Neighbour (k-NN)
K-NN performs the classification of a new data point by measuring the similarity distance
between the nearest neighbors. In our experiments, we set k = 5, which measures the similarity distance to the five nearest existing data points (Guo G, Wang H, Bell D, Bi Y, n.d.). Although the performance of the traditional machine learning classifiers is considerable, it must be noted that it is lower than that of the deep learning classifiers. The main factors degrading classifier performance are the imbalanced number of instances and sentence overlapping. The performance of the K-NN machine learning model is given in Table 4.11; it showed 78% overall accuracy across all classes.
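The k-NN decision rule with k = 5 can be sketched in plain Python. The toy binary vectors and class labels below are illustrative assumptions; the actual experiments use TF-IDF vectors over the MULLS dataset.

```python
from collections import Counter

# Bare-bones k-NN: find the k nearest training points by squared Euclidean
# distance and return the majority class among them.
def knn_predict(train_x, train_y, query, k=5):
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_x, train_y)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]   # majority class among the k nearest

train_x = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 1), (0, 1, 1), (0, 0, 0)]
train_y = ["sports", "sports", "sports", "politics", "politics", "politics"]
print(knn_predict(train_x, train_y, (1, 1, 0)))  # "sports"
```

Since k-NN stores the whole training set and votes at query time, its cost grows with corpus size, one reason the deep models scale better on our data.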
Table 4.11: Performance Measuring Parameters for the K-NN Model
Class Precision Recall F1-score Support
1 0.91 0.93 0.92 5661
2 0.62 0.83 0.71 967
3 0.67 0.71 0.69 2115
4 0.64 0.60 0.62 878
5 0.78 0.82 0.80 10030
6 0.66 0.50 0.57 2293
7 0.93 1.00 0.96 970
8 0.91 0.80 0.85 2259
9 0.71 0.62 0.66 3044
10 0.85 0.93 0.89 1031
11 0.72 0.85 0.78 889
12 0.75 0.61 0.67 1052
Accuracy 0.78 31189
Macro avg 0.76 0.77 0.76 31189
Weighted avg 0.78 0.78 0.78 31189
4.5.2 Decision Tree (DT)
Decision Tree (DT) is a type of supervised machine learning algorithm (Zhong, 2016) in which the input data is split according to certain parameters. The Decision Tree showed 73% accuracy; the other performance details for each class are given in Table 4.12.
Table 4.12: Performance Measuring Parameters for The DT Model
Class Precision Recall F1-score Support
1 0.91 0.89 0.90 5661
2 0.83 0.97 0.89 967
3 0.57 0.52 0.54 2115
4 0.58 0.54 0.56 878
5 0.72 0.75 0.73 10030
6 0.44 0.41 0.42 2293
7 0.99 1.00 1.00 970
8 0.79 0.77 0.78 2259
9 0.57 0.55 0.56 3044
10 0.98 0.98 0.93 1031
11 0.86 0.98 0.92 889
12 0.61 0.56 0.58 1031
Accuracy 0.73 31189
Macro avg 0.73 0.74 0.74 31189
Weighted avg 0.73 0.73 0.73 31189
4.5.3 Naïve Bayes Multinomial (NBM)
Naïve Bayes Multinomial is one of the most computationally efficient classifiers for text classification (S., 2018), but it showed only 70% accuracy, which is very low compared with K-NN, DT, and RF. The performance details for all twelve (12) classes are given in Table 4.13.
Table 4.13: Performance Measuring Parameters for the NB Multinomial Model
Class Precision Recall F1-score Support
1 0.94 0.91 0.93 5683
2 0.82 0.34 0.48 956
3 0.66 0.47 0.55 2121
4 0.91 0.20 0.32 919
5 0.56 0.95 0.70 10013
6 0.70 0.22 0.34 2387
7 0.98 0.95 0.97 959
8 0.94 0.75 0.83 2188
9 0.75 0.40 0.52 3031
10 0.96 0.78 0.86 998
11 0.96 0.32 0.48 863
12 0.84 0.25 0.39 1071
Accuracy 0.70 31189
Macro avg 0.84 0.54 0.61 31189
Weighted avg 0.76 0.70 0.67 31189
4.5.4 Logistic Regression (LR)
Linear regression is recommended for predicting continuous outputs rather than for categorical classification (T. Zhang & Oles, 2001), whereas logistic regression is used for multiclass classification tasks. Table 4.14 presents the performance of the Logistic Regression model, i.e., 80% overall accuracy for multiclass event classification.
Table 4.14: Performance Measuring Parameters for The LR Model
Class Precision Recall F1-score Support
1 0.95 0.94 0.94 5661
2 0.83 0.64 0.72 967
3 0.72 0.69 0.70 2115
4 0.77 0.55 0.64 878
5 0.73 0.88 0.80 10030
6 0.64 0.53 0.58 2293
7 1.00 1.00 1.00 970
8 0.91 0.84 0.88 2259
9 0.73 0.62 0.67 3044
10 0.94 0.92 0.93 1031
11 0.90 0.80 0.85 889
12 0.77 0.66 0.71 1052
Accuracy 0.80 31189
Macro avg 0.82 0.76 0.79 31189
Weighted avg 0.80 0.80 0.80 31189
4.5.5 Random Forest (RF)
Random Forest comprises many decision trees (Ali J, Khan R, Ahmad N, 2012). It showed the highest accuracy among all the evaluated machine learning classifiers. A detailed summary of the results is given in Table 4.15.
Table 4.15: Performance Measuring Parameters for The RF Model
Class Precision Recall F1-score Support
1 0.94 0.93 0.94 5661
2 0.94 0.96 0.95 967
3 0.72 0.63 0.67 2115
4 0.80 0.58 0.67 878
5 0.71 0.90 0.79 10030
6 0.67 0.41 0.51 2293
7 1.00 1.00 1.00 970
8 0.93 0.80 0.86 2259
9 0.75 0.58 0.65 3044
10 0.94 0.98 0.96 1031
11 0.96 0.98 0.97 889
12 0.84 0.63 0.72 1052
Accuracy 0.80 31189
Macro avg 0.85 0.78 0.81 31189
Weighted avg 0.81 0.80 0.80 31189
4.5.6 Support Vector Machine (SVM)
The support vector machine (SVM) is one of the most highly recommended models for binary classification and is based on statistical learning theory (Y. Zhang, 2012). Its performance details are given below in Table 4.16.
Table 4.16: Performance Measuring Parameters for SVM Model
Class Precision Recall F1-score Support
1 0.84 0.94 0.89 5683
2 0.72 0.43 0.54 956
3 0.72 0.49 0.58 2121
4 0.73 0.43 0.54 919
5 0.64 0.90 0.75 10013
6 0.74 0.24 0.36 2387
7 0.90 0.99 0.94 959
8 0.86 0.78 0.82 2188
9 0.65 0.47 0.57 3031
10 0.85 0.87 0.82 998
11 0.81 0.62 0.70 863
12 0.77 0.63 0.67 1071
Accuracy 0.73 31189
Macro avg 0.77 0.63 0.67 31189
Weighted avg 0.77 0.73 0.71 31189
A comparative depiction of the results obtained by the traditional machine learning classifiers is given in Figure 4.6. Random Forest showed the highest accuracy among all the machine learning classifiers. Although the machine learning classifiers showed considerable results, they are low compared with the deep learning models.
Figure 4.6: Machine Learning Algorithms' Accuracy Using TF-IDF
(Note that the results reported in the dissertation belong to one (deep learning) dataset, while the results on the other dataset are under consideration.)
Summary
In this chapter, we have discussed the experiments and results in detail. We explored both machine learning and deep learning classifiers, and the best results are reported in this chapter. It can be observed that the deep learning classifiers outperformed the machine learning classifiers. The Deep Neural Network (feedforward) showed the highest accuracy, 84%, using TF-IDF feature vectors.
CHAPTER 5
TEMPORAL ENTITY EXTRACTION
5.1 Proposed Methodology for Temporal Entity Extraction
Extracting and classifying information using a rule-based approach requires expert-level skills and deep knowledge of the language concerned. Language features, i.e., grammatical, morphological, and lexical features, are the basic components of such deep knowledge of a specific language. Rules are designed based on patterns to extract a specific entity (U. P. Singh et al., 2012).
5.2 Rule-based Approach (Regular Expression)
A regular expression (regex) is a string that defines a text matching pattern. These patterns can match strings of numbers or text, for example, "1234" and "banana". Some examples of regular expressions used to extract temporal entities from the Urdu Named Entity Recognition dataset are given here:
• Fully Qualified Date in Urdu
Regular expressions for the other types of temporal entities are given in the appendix at the end of the thesis.
Expression1=("\d+\s+[ جنوری|فروری|مارچ|اپریل|مئی
ایک|دو|تین|چار|پانچ|چھ|سات|آٹھ|نو| ]+s\+[دوہزار]+s\+[جون|جولائي|اگست|ستمبر|اکتوبر|نومبر|دسمبر
دس|گیارہ|بارہ|تیرہ|چودہ|پندرہ|سولہ|سترہ|اٹھارہ|انیس|بیس|اکیس|بائیس|تئیس|چوبیس|پچیس|چھبیس|ستائیس
س|اکتیس|بتیس|تینتیس|چونتیس|پینتیس|چھتیس|سینتیس|اڑتیس|انتالیس|چالیس|اکتالیس|بیالی |اٹھائیس|انتیس|تی
س|تینتالیس|چوالیس|پینتالیس|چھیالیس|سینتالیس|اڑتالیس|انچاس|پچاس|اکاون|باون|تریپن|چون|پچپن|چھپن|
تر|بہتر|تہتر|چوہتر| ستاون|اٹھاون|انسٹھ|ساٹھ|اکسٹھ|باسٹھ|تریسٹھ|چونسٹھ|پینسٹھ|سڑسٹھ|اڑسٹھ|انہتر|ستر|اکہ
پچھتر|چھہتر|ستتر|اٹھتر|اناسی|اسی|اکاسی|بیاسی|تریاسی|چوراسی|پچاسی|چھیاسی|ستاسی|اٹھاسی|نواسی|
("+[نوے|اکانوے|بانوے|ترانوے|چورانوے|پچانوے|چھیانوے|ستانوے|اٹھانوے|ننانوے
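A deliberately simplified version of the fully-qualified-date pattern can be tested with Python's re module. Only the month alternation is reproduced here, with standard spellings that may differ slightly from the expression above; the full day/year alternation is given in the appendix.

```python
import re

# Simplified fully-qualified-date pattern: a numeric day followed by one
# of the twelve Urdu month names (illustrative subset of the full regex).
URDU_MONTHS = "جنوری|فروری|مارچ|اپریل|مئی|جون|جولائی|اگست|ستمبر|اکتوبر|نومبر|دسمبر"
date_pattern = re.compile(r"\d+\s+(?:%s)" % URDU_MONTHS)

text = "وہ 14 اگست کو واپس آئے گا"   # "He will return on 14 August"
matches = date_pattern.findall(text)
print(matches)  # ['14 اگست']
```

In Python 3, regex matching is Unicode-aware by default, so the Urdu month names can be placed directly in the alternation without any special flags.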
Figure 5.1: Temporal Entity Extraction Methodology
5.3 Experimental Setup of Temporal Entity Extraction
We started our experiments on a plain-text Urdu corpus, ignoring the annotation tags. The complex structure and varying formats of Urdu temporal entities led us to use regular expressions. In our exhaustive analysis, we found different writing styles for dates in Urdu language text, i.e., the Urdu Fully Qualified date, the Urdu Deictic date, and the Urdu Anaphoric date.
5.4 Results
All the date extraction results are shown in the tables below, which show considerable results for all types of date.
Table 5.1: All Dates Extraction Results on the original dataset
Type of Date Precision Recall F1-Measure
Numeric Year 0.91 1.00 0.95
Urdu Month and Urdu Year 0.58 1.00 0.77
Urdu Year 1.00 1.00 1.00
Urdu Month and Numeric Year 1.00 1.00 1.00
Numeric Day and Urdu Month 0.95 1.00 0.97
Only Urdu Month 1.00 1.00 1.00
Urdu Day and Month 0.50 1.00 0.67
UFQ Date and Urdu Hybrid FQ Date 0.95 0.95 0.95
Deictic and Anaphoric 1.00 1.00 1.00
Table 5.2: UFQD & UPFQD on Extended Dataset
Date Type      Precision  Recall  F1-Measure  Example
Numeric Year   0.96       0.92    0.94        جون 2019
Numeric Month  1.00       1.00    1.00        پانچ 8
Numeric Day    0.94       1.00    0.97        8 فروری
Urdu FQ Date   1.00       1.00    1.00        تین اگست دو ہزار انیس
Average        0.97       0.98    0.98
Table 5.3: Deictic Date Analysis
Deictic date   Precision  Recall  F1-Measure
Recognition    0.50       1.00    0.66
Retrieval      1.00       1.00    1.00
Figure 5.2: All Dates Extraction Results on Original Dataset
Figure 5.3: F-Measure of UFQD & UPFQD on an extended dataset
Figure 5.4: Deictic date analysis
5.5 Discussion
The influence of the Internet via social media and news websites is remarkable. Nowadays, sharing feelings, thoughts, ideas, events, problems, research work, advertisements, criticism, etc. on the Internet through social media is common practice. It can be observed that the usage of local languages on social networks such as Twitter, Facebook, and WhatsApp is increasing, because people find it easier to convey messages in their local languages. Many tools are available to support multiple languages on the Internet. As a result, a huge bulk of multilingual data is being generated at an exponential speed. Analyzing such material with pre-existing processing tools is inadequate, because these tools are insufficient for, and incompatible with, many languages.
The Urdu language is one of the resource-poor languages that cannot be handled by the existing tools, which are good enough for the English language. It is morphologically rich, has a complex right-to-left writing style, and uses a diacritic writing script. These characteristics distinguish it from other languages. It lacks processing resources: language annotators, Part-of-Speech (PoS) taggers, Word2vec models, and datasets. Only a few datasets exist on the website of the Center for Language Engineering (CLE)9, and they are purchasable. The Urdu language has more than 300 million users who can read, write, and understand it. It is also the national language of Pakistan, the 6th most populous country in the world.
The lack of resources is a major hurdle for research on Urdu language text. Events are an important piece of information related to our lives. During the research period we found in the literature only a few datasets, and these were developed specifically for Named Entity Recognition rather than for event classification. We therefore developed our own dataset for multiclass event classification, collecting more than 0.15 million sentences covering different types of events. To classify multiclass events at the sentence level, we decided to use machine learning and deep learning approaches. Six well-known machine learning classifiers, i.e., SVM, RF, DT, NBM, K-NN, and LR, and three deep learning models, i.e., CNN, RNN, and DNN, are used for multiclass event classification. Different feature vector generating techniques are explored, such as the count vectorizer, TF-IDF, one-hot encoding, and word embedding. Interestingly, TF-IDF outperformed the other techniques: DNN showed 84% accuracy using TF-IDF feature vectors.
In the case of temporal entity extraction, we deeply analyzed the different writing formats of dates in Urdu language text and found more than 20 different date formats. It is observed that people generally do not follow a standard format for temporal entities. Fully qualified TEs can be extracted by writing appropriate regular expressions and their accuracy can be ensured, but in the case of anaphoric and deictic TEs, regex alone is insufficient to resolve the temporal values from the text. Contextual information needs to be analyzed for anaphoric and deictic TEs.
Summary
Temporal entities are necessary to predict the time of occurrence of any event. In this chapter we explored the various types of TEs that exist in Urdu language text. Regular expressions have been used to extract the different TEs from plain text.
9 http://www.cle.org.pk/
CHAPTER 6
CONCLUSION AND FUTURE WORK
6.1 Conclusion
In a comprehensive review of the Urdu literature, we found only a few reference works related to Urdu text processing. The main hurdle in Urdu exploration is the unavailability of processing resources, i.e., event datasets, a closed-domain part-of-speech tagger, lexicons, annotators, and other supporting tools.
As reported in this dissertation, the dataset is imbalanced, and we performed the experiments on the same (imbalanced) dataset. In the case of an imbalanced dataset, accuracy values alone are unreliable, since the results produced by the classifiers are biased. To resolve this issue, we reported the output performance on the basis of other metrics, namely precision, recall, and F-measure.
We have explored many feature vector generating techniques. Different classification algorithms from both traditional machine learning and deep learning approaches are evaluated on these feature vectors. The purpose of performing many experiments on various feature vector generating techniques was to develop the most efficient and generic model of multiclass event classification for Urdu language text.
The word embedding feature generating technique is considered an efficient and powerful technique for text analysis. Word2Vector (W2Vec) feature vectors can be generated by pre-trained word embedding models or by using trainable parameters in the embedding layers of deep neural networks.
In general, word embedding performs well as a feature vector generating technique, and it is one of the most widely used techniques for relatively small datasets. Furthermore, pre-trained word embedding models play a key role in handling large and complex datasets. In our research problem, we explored only three pre-trained word embedding models for the Urdu language, also cited in the dissertation. Unfortunately, those pre-trained word embedding models did not perform well in our case, since they are trained on text extracted from the blogosphere. In contrast to such datasets, our dataset is relatively different, since it is a collection of different events reported and discussed on social media. This is the reason these models showed results with quite low accuracy.
Another argument in support of this conclusion is that only a few pre-trained word embedding models exist for Urdu language text. These models are trained on a considerable number of tokens, but on domain-specific Urdu text. There is a need to develop generic word embedding models for the Urdu language on a large corpus. The choice between single-layer and multilayer CNN and RNN (LSTM) architectures did not affect the performance of the proposed system.
The experimental results clearly show that the one-hot encoding method is better than the word embedding and pre-trained word embedding models. However, TF-IDF outperformed the other feature generating techniques, such as word embedding and one-hot encoding; it showed the highest accuracy, 84%, using the DNN deep learning classifier. The same task using traditional machine learning classifiers showed considerable performance, but lower than the deep learning models. Deep learning algorithms, i.e., CNN, DNN, and RNN, are preferable to traditional machine learning algorithms because, unlike traditional machine learning, deep learning does not need a domain expert to find the relevant features. DNN and RNN outperformed all the other classifiers and showed overall accuracies of 84% and 81%, respectively, for the twelve classes of events. Comparatively, the performance of CNN and RNN is better than that of Naïve Bayes and SVM.
Multiclass event classification at the sentence level was performed on an imbalanced dataset; classes with a low number of instances affect the overall performance of the classifiers. Performance can be improved by balancing the instances of each class. It can be concluded that:
• The lack of resources is the main barrier to research work,
• For the Urdu language, there are only a few pre-trained word embedding models, and those models showed very poor results,
• We also created our own word embedding models, but they also showed very poor results,
• We evaluated six well-known machine learning classifiers and three deep learning classifiers,
• Deep learning classifiers using TF-IDF showed the best results compared with the machine learning classifiers; the DNN showed 84% accuracy,
• There is no specific work in the literature related to Urdu temporal entities,
• Regular expressions showed considerable results for fully qualified dates, while deictic and anaphoric dates require contextual information.
6.2 Future Work
There are many tasks that can be accomplished for Urdu language text in the future. Some of them are mentioned here:
1. In the future we plan to extend our research work to improve the accuracy of the proposed models, to increase the size of the datasets, to use the BERT encoder, and to perform event classification at the document and phrase levels.
2. We also plan to use machine learning and deep learning approaches to extract and classify deictic and anaphoric TEs.
3. To classify events as real-time or retrospective using fuzzy rules.
4. To propose an approach to differentiate temporal entities written in Urdu language text from those in other languages, such as Sindhi, Arabic, and Persian, that have relatively similar writing scripts.
References
A.S. Abrahams. (2002). Developing and Executing Electronic Commerce Applications with Occurrences.
Abdlrauf, H. and M. A. (2017). Deep learning for sentence classification. IEEE Explorer.
Abrahams, A. S., Jiao, J., Wang, G. A., & Fan, W. (2012). Vehicle defect discovery from
social media. Decision Support Systems. https://doi.org/10.1016/j.dss.2012.04.005
Adeeba, F., Akram, Q., Khalid, H., & Hussain, S. (2014). CLE Urdu books N-grams. In Conference on Language and Technology, CLT 14, Karachi, Pakistan.
AHMED, K., ALI, M., KHALID, S., & KAMRAN, M. (2016). Framework for Urdu
News Headlines Classification. Journal of Applied Computer Science &
Mathematics. https://doi.org/10.4316/jacsm.201601002
Ahn, D., Adafre, S. F., & De Rijke, M. (2005). Towards task-based temporal extraction
and recognition. Dagstuhl Seminar Proceedings.
Akhter, M. P., Jiangbin, Z., Naqvi, I. R., Abdelmajeed, M., & Fayyaz, M. (2020).
Exploring deep learning approaches for Urdu text classification in product
manufacturing. Enterprise Information Systems.
https://doi.org/10.1080/17517575.2020.1755455
Al-Dyani, W. Z., Yahya, A. H., & Ahmad, F. K. (2018). Challenges of event detection
from social media streams. International Journal of Engineering & Technology, 7,
72–75.
Al-Garadi, M. A., Hussain, M. R., Khan, N., Murtaza, G., Nweke, H. F., Ali, I., Mujtaba,
G., Chiroma, H., Khattak, H. A., & Gani, A. (2019). Predicting Cyberbullying on
Social Media in the Big Data Era Using Machine Learning Algorithms: Review of
Literature and Open Challenges. IEEE Access.
https://doi.org/10.1109/ACCESS.2019.2918354
Al-Radaideh, Q. A., & Al-Abrat, M. A. (2019). An Arabic text categorization approach
using term weighting and multiple reducts. Soft Computing.
https://doi.org/10.1007/s00500-018-3249-z
Ali, A. R., & Ijaz, M. (2009). Urdu text classification. Proceedings of the 6th
International Conference on Frontiers of Information Technology, FIT ’09.
https://doi.org/10.1145/1838002.1838025
Ali, D., Muhammad, M., Akhtar, N., Salamat, N., Asmat, H., & Firdous, A. (2016).
Gender Prediction for Expert Finding Task. International Journal of Advanced
Computer Science and Applications. https://doi.org/10.14569/ijacsa.2016.070525
Ali, J., Khan, R., Ahmad, N., & Maqsood, I. (2012). Random forests and decision trees. International Journal of Computer Science Issues.
Allen, J. F. (1983). Maintaining Knowledge about Temporal Intervals. Communications
of the ACM. https://doi.org/10.1145/182.358434
Awais, M., & Shoaib, M. (2019). Role of discourse information in Urdu sentiment
classification: A Rule-based Method and Machine-learning Technique. ACM
Transactions on Asian and Low-Resource Language Information Processing.
https://doi.org/10.1145/3300050
Bahir, E., & Peled, A. (2016). Geospatial extreme event establishing using social
network’s text analytics. GeoJournal. https://doi.org/10.1007/s10708-015-9622-x
Baker, P., Hardie, A., McEnery, T., & Jayaram, B. D. (2003). Corpus data for South Asian language processing. In Proceedings of the 10th Annual Workshop for South Asian Language Processing, EACL.
Barthe-Delanoë, A. M., Truptil, S., Bénaben, F., & Pingaud, H. (2014). Event-driven
agility of interoperability during the Run-time of collaborative processes. Decision
Support Systems. https://doi.org/10.1016/j.dss.2013.11.005
Becker, D., & Riaz, K. (2002). A study in Urdu corpus construction.
https://doi.org/10.3115/1118759.1118760
Bittar, A., Amsili, P., Denis, P., & Danlos, L. (2011). French TimeBank: An ISO-TimeML annotated reference corpus. ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
Borsje, J., Hogenboom, F., & Frasincar, F. (2010). Semi-automatic financial events
discovery based on lexico-semantic patterns. International Journal of Web
Engineering and Technology. https://doi.org/10.1504/IJWET.2010.038242
Capet, P., Delavallade, T., Nakamura, T., Sandor, A., Tarsitano, C., & Voyatzi, S. (2008).
A risk assessment system with automatic extraction of event types. IFIP
International Federation for Information Processing. https://doi.org/10.1007/978-0-
387-87685-6_27
Caselli, T., & Sprugnoli, R. (2017). It-TimeML and the Ita-TimeBank: Language Specific
Adaptations for Temporal Annotation. In Handbook of Linguistic Annotation.
https://doi.org/10.1007/978-94-024-0881-2_36
Cavalin Rodrigo Paulo, D. F. and C. da S. M. S. (2016). Classification of Life Events on
Social Media.
Chowdhury, S. R., Imran, M., Asghar, M. R., Amer-Yahia, S., & Castillo, C. (2013).
Tweet4act: Using incident-specific profiles for classifying crisis-related messages.
ISCRAM 2013 Conference Proceedings - 10th International Conference on
Information Systems for Crisis Response and Management.
Conlon, S. J., Abrahams, A. S., & Simmons, L. L. (2015). Terrorism information
extraction from online reports. Journal of Computer Information Systems.
https://doi.org/10.1080/08874417.2015.11645768
Costa, F., & Branco, A. (2012). TimeBankPT: A TimeML annotated corpus of
Portuguese. Proceedings of the 8th International Conference on Language
Resources and Evaluation, LREC 2012.
D’Andrea, E., Ducange, P., Bechini, A., Renda, A., & Marcelloni, F. (2019). Monitoring
the public opinion about the vaccination topic from tweets analysis. Expert Systems
with Applications. https://doi.org/10.1016/j.eswa.2018.09.009
Daud, A., Khan, W., & Che, D. (2017). Urdu language processing: a survey. Artificial
Intelligence Review. https://doi.org/10.1007/s10462-016-9482-x
De Santis, E., Martino, A., & Rizzi, A. (2020). An Infoveillance System for Detecting
and Tracking Relevant Topics from Italian Tweets during the COVID-19 Event.
IEEE Access. https://doi.org/10.1109/ACCESS.2020.3010033
Dou, Wenwen, Xiaoyu Wang, William Ribarsky, and M. Z. (2012). Event detection in
social media data. In IEEE Vis Week Workshop on Interactive Visual Text Analytics-
Task Driven Analytics of Social Media Content.
Ramesh, D., & S. S. K. (2016). Event extraction from natural language text. IJESRT.
Ferro, L., Gerber, L., Mani, I., Sundheim, B., & Wilson, G. (2005). TIDES 2005 standard for the annotation of temporal expressions.
Filannino, M., & Nenadic, G. (2015). Temporal expression extraction with extensive
feature type selection and a posteriori label adjustment. Data and Knowledge
Engineering. https://doi.org/10.1016/j.datak.2015.09.002
Guo, G., Wang, H., Bell, D., Bi, Y., & Greer, K. (n.d.). On the Move to Meaningful Internet Systems. OTM Confederated International Conferences.
Haider, S. (2019). Urdu word embeddings. LREC 2018 - 11th International Conference
on Language Resources and Evaluation.
Hao, T., Pan, X., Gu, Z., Qu, Y., & Weng, H. (2018). A pattern learning-based method
for temporal expression extraction and normalization from multi-lingual
heterogeneous clinical texts. BMC Medical Informatics and Decision Making.
https://doi.org/10.1186/s12911-018-0595-9
Hogenboom, F., Frasincar, F., Kaymak, U., De Jong, F., & Caron, E. (2016). A Survey of
event extraction methods from text for decision support systems. Decision Support
Systems. https://doi.org/10.1016/j.dss.2016.02.006
Huang, P. Y., Liang, J., Lamare, J. B., & Hauptmann, A. G. (2018). Multimodal filtering
of social media for temporal monitoring and event analysis. ICMR 2018 -
Proceedings of the 2018 ACM International Conference on Multimedia Retrieval.
https://doi.org/10.1145/3206025.3206079
Jacobs, G., Lefever, E., & Hoste, V. (2019). Economic Event Detection in Company-
Specific News Text. https://doi.org/10.18653/v1/w18-3101
Jaidka, K., Ahmed, S., Skoric, M., & Hilbert, M. (2019). Predicting elections from social
media: a three-country, three-method comparative study. Asian Journal of
Communication. https://doi.org/10.1080/01292986.2018.1453849
Jawaid, B., Kamran, A., & Bojar, O. (2014). A tagged corpus and a tagger for Urdu.
Proceedings of the 9th International Conference on Language Resources and
Evaluation, LREC 2014.
Jiang, S., Chen, H., Nunamaker, J. F., & Zimbra, D. (2014). Analyzing firm-specific
social media and market: A stakeholder-based event analysis framework. Decision
Support Systems. https://doi.org/10.1016/j.dss.2014.08.001
Jin, B., Zhuo, W., Hu, J., Chen, H., & Yang, Y. (2013). Specifying and detecting spatio-
temporal events in the internet of things. Decision Support Systems.
https://doi.org/10.1016/j.dss.2013.01.027
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics).
Lafferty, J., McCallum, A., & Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning.
Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A convolutional neural
network for modelling sentences. 52nd Annual Meeting of the Association for
Computational Linguistics, ACL 2014 - Proceedings of the Conference.
https://doi.org/10.3115/v1/p14-1062
Kamila, S., Hasanuzzaman, M., Ekbal, A., & Bhattacharyya, P. (2018). Tempo-HindiWordNet: A lexical knowledge-base for temporal information processing. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP).
Kanwal, S., Malik, K., Shahzad, K., Aslam, F., & Nawaz, Z. (2019). Urdu named entity
recognition: Corpus generation and deep learning applications. ACM Transactions
on Asian and Low-Resource Language Information Processing.
https://doi.org/10.1145/3329710
Khan, W., Daud, A., Nasir, J. A., & Amjad, T. (2016). Named entity dataset for Urdu named entity recognition task.
Kong, X., Shi, X., & Yu, P. S. (2011). Multi-label collective classification. Proceedings
of the 11th SIAM International Conference on Data Mining, SDM 2011.
https://doi.org/10.1137/1.9781611972818.53
Konstantinidis, K., Papadopoulos, S., & Kompatsiaris, Y. (2017). Exploring twitter
communication dynamics with evolving community analysis. PeerJ Computer
Science. https://doi.org/10.7717/peerj-cs.107
Li, H., Hu, Y., Gao, G., Shnitko, Y., Meyerzon, D., & D. M. (n.d.). Techniques for extracting authorship dates of documents. U.S. Patent Application 141,935.
Li, X., Zheng, Y., & Y. D. (2014). Discovering evolution of complex event based on correlations between events. IEEE.
Li, Q., Nourbakhsh, A., Shah, S., & Liu, X. (2017). Real-Time novel event detection from
social media. Proceedings - International Conference on Data Engineering.
https://doi.org/10.1109/ICDE.2017.157
Liao, W., & Veeramachaneni, S. (2009). A simple semi-supervised algorithm for named
entity recognition. https://doi.org/10.3115/1621829.1621837
Kowalski, R. M., & Limber, S. P. (n.d.). Psychological, physical, and academic correlates of cyberbullying and traditional bullying. Journal of Adolescent Health, 13–20.
Ling, X., & Weld, D. S. (2010). Temporal information extraction. Proceedings of the
National Conference on Artificial Intelligence.
Liu, G., & Guo, J. (2019). Bidirectional LSTM with attention mechanism and
convolutional layer for text classification. Neurocomputing.
https://doi.org/10.1016/j.neucom.2019.01.078
Llidó, D., Berlanga, R., & Aramburu, M. J. (2001). Extracting temporal references to
assign document event-time periods. Lecture Notes in Computer Science (Including
Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics). https://doi.org/10.1007/3-540-44759-8_8
Lu, Z., Yu, W., Zhang, R., Li, J., & H. W. (2015). Discovering event evolution chain in microblog. In 2015 IEEE International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), and 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS).
Eberhard, D. M., Simons, G. F., & Fennig, C. D. (2019). Ethnologue: Languages of the World. SIL International.
Malik, M. K., & Sarwar, S. M. (2016). Named entity recognition system for postpositional languages: Urdu as a case study. International Journal of Advanced Computer Science and Applications, 141–147.
McMinn, A. J., Moshfeghi, Y., & Jose, J. M. (2013). Building a large-scale corpus for
evaluating event detection on twitter. International Conference on Information and
Knowledge Management, Proceedings. https://doi.org/10.1145/2505515.2505695
Mehmood, K., Essam, D., & Shafi, K. (2019). Sentiment analysis system for Roman
Urdu. Advances in Intelligent Systems and Computing. https://doi.org/10.1007/978-
3-030-01174-1_3
Mohamad, A. Y., Mustapha, S. S., & Razali, M. S. (2010). Automatic Event Detection on
Reuters News.
Nadeau, D., & Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticæ Investigationes, 30(1). https://doi.org/10.1075/li.30.1.03nad
Naughton, M., Stokes, N., & Carthy, J. (2010). Sentence-level event classification in unstructured texts. Information Retrieval. https://doi.org/10.1007/s10791-009-9113-0
Naz, M., Akram, Q. U. A., & Hussain, S. (2013). Binarization and its evaluation for Urdu
Nastalique document images. 2013 16th International Multi Topic Conference,
INMIC 2013. https://doi.org/10.1109/INMIC.2013.6731352
Nuij, W., Milea, V., Hogenboom, F., Frasincar, F., & Kaymak, U. (2014). An automated
framework for incorporating news into stock trading strategies. IEEE Transactions
on Knowledge and Data Engineering. https://doi.org/10.1109/TKDE.2013.133
O’Keeffe, G. S., Clarke-Pearson, K., Mulligan, D. A., Altmann, T. R., Brown, A.,
Christakis, D. A., Falik, H. L., Hill, D. L., Hogan, M. J., Levine, A. E., & Nelson, K.
G. (2011). Clinical report - The impact of social media on children, adolescents, and
families. In Pediatrics. https://doi.org/10.1542/peds.2011-0054
Pal, U., & Sarkar, A. (2003). Recognition of printed Urdu script. Proceedings of the
International Conference on Document Analysis and Recognition, ICDAR.
https://doi.org/10.1109/ICDAR.2003.1227844
Panagiotou, N., Katakis, I., & Gunopulos, D. (2016). Detecting events in online social
networks: Definitions, trends and challenges. Lecture Notes in Computer Science
(Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics). https://doi.org/10.1007/978-3-319-41706-6_2
Parikh, R., & Karlapalem, K. (2013). ET: Events from tweets. In Proceedings of the 22nd International Conference on World Wide Web, 613–620.
Petrovic, S., Osborne, M., & Lavrenko, V. (2010). The Edinburgh Twitter Corpus.
Computational Linguistics.
Pillac, V., Guéret, C., & Medaglia, A. L. (2012). An event-driven optimization
framework for dynamic vehicle routing. Decision Support Systems.
https://doi.org/10.1016/j.dss.2012.06.007
Riaz, K. (2008). Concept search in urdu. International Conference on Information and
Knowledge Management, Proceedings. https://doi.org/10.1145/1458550.1458557
Riaz, K. (2010). Rule-Based Named Entity Recognition in Urdu. Proceedings of the 2010
Named Entities Workshop.
Ritter, A., Wright, E., Casey, W., & Mitchell, T. (2015). Weakly supervised extraction of
computer security events from twitter. WWW 2015 - Proceedings of the 24th
International Conference on World Wide Web.
https://doi.org/10.1145/2736277.2741083
Lavanya, S., Kavipriya, R., Yang, Y., Carbonell, J. Q., Brown, R. D., Archibald, B., & X. L. (2014). A survey on event detection in news streams. 2(5), 33–35.
Xu, S. (2018). Bayesian naïve Bayes classifiers to text classification. Journal of Information Science, 48–59.
Sarker, A., & Gonzalez, G. (2015). Portable automatic text classification for adverse drug
reaction detection via multi-corpus training. Journal of Biomedical Informatics.
https://doi.org/10.1016/j.jbi.2014.11.002
Sharjeel, M., Nawab, R. M. A., & Rayson, P. (2017). COUNTER: corpus of Urdu news
text reuse. Language Resources and Evaluation. https://doi.org/10.1007/s10579-016-
9367-2
Singh, J. P., Dwivedi, Y. K., Rana, N. P., Kumar, A., & Kapoor, K. K. (2019). Event
classification and location prediction from tweets during disasters. Annals of
Operations Research. https://doi.org/10.1007/s10479-017-2522-3
Singh, U. P., Goyal, V., & Lehal, G. S. (2012). Named entity recognition system for
Urdu. 24th International Conference on Computational Linguistics - Proceedings of
COLING 2012: Technical Papers.
Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for
classification tasks. Information Processing and Management.
https://doi.org/10.1016/j.ipm.2009.03.002
Soomro, S. M. G., & T. R. (2019). Current status of Urdu on Twitter. Sukkur IBA Journal of Computing and Mathematical Sciences.
Tomas, K. (2015). Event detection from text data. Computational Intelligence, 1312–164.
Usman, M., Shafique, Z., Ayub, S., & Malik, K. (2016). Urdu Text Classification using
Majority Voting. International Journal of Advanced Computer Science and
Applications. https://doi.org/10.14569/ijacsa.2016.070836
Valueva, M. V., Nagornov, N. N., Lyakhov, P. A., Valuev, G. V., & Chervyakov, N. I.
(2020). Application of the residue number system to reduce hardware costs of the
convolutional neural network implementation. Mathematics and Computers in
Simulation. https://doi.org/10.1016/j.matcom.2020.04.031
Walenz, B., Gandhi, R., Mahoney, W., & Zhu, Q. (2010). Exploring social contexts along
the time dimension: Temporal analysis of named entities. Proceedings - SocialCom
2010: 2nd IEEE International Conference on Social Computing, PASSAT 2010: 2nd
IEEE International Conference on Privacy, Security, Risk and Trust.
https://doi.org/10.1109/SocialCom.2010.80
Wei, C. P., & Lee, Y. H. (2004). Event detection from online news documents for
supporting environmental scanning. Decision Support Systems.
https://doi.org/10.1016/S0167-9236(03)00028-9
Woodward, D. (2001). Extraction and Visualization of Temporal Information and Related
Named Entities from Wikipedia. Springs.
Wu, H. C., Luk, R. W. P., Wong, K. F., & Kwok, K. L. (2008). Interpreting TF-IDF term
weights as making relevance decisions. ACM Transactions on Information Systems.
https://doi.org/10.1145/1361684.1361686
Xu, J. M., Jun, K. S., Zhu, X., & Bellmore, A. (2012). Learning from bullying traces in
social media. NAACL HLT 2012 - 2012 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies,
Proceedings of the Conference.
Y., Z. (2016). The analysis of cases based on decision tree. In 2016 7th IEEE International Conference on Software Engineering and Service Science (ICSESS).
Yaghoobzadeh, Y., Ghassem-Sani, G., Mirroshandel, S. A., & Eshaghzadeh, M. (2012).
ISO-TimeML event extraction in persian text. 24th International Conference on
Computational Linguistics - Proceedings of COLING 2012: Technical Papers.
Yakushiji, A., Tateisi, Y., Miyao, Y., & Tsujii, J. (2001). Event extraction from
biomedical papers using a full parser. Pacific Symposium on Biocomputing. Pacific
Symposium on Biocomputing. https://doi.org/10.1142/9789814447362_0040
Yang, Y., Pierce, T., & Carbonell, J. (1998). Study on retrospective and on-line event
detection. SIGIR Forum (ACM Special Interest Group on Information Retrieval).
https://doi.org/10.1145/290941.290953
Zaraket, F., & Makhlouta, J. (2012). Arabic Temporal Entity Extraction using
Morphological Analysis. 3(1), 121–136.
Zhang, T., & Oles, F. J. (2001). Text Categorization Based on Regularized Linear
Classification Methods. Information Retrieval.
https://doi.org/10.1023/A:1011441423217
Zhang, Y. (2012). Support vector machine classification algorithm and its application.
Communications in Computer and Information Science. https://doi.org/10.1007/978-
3-642-34041-3_27
Zhou, H., Huang, M., Zhang, T., Zhu, X., & Liu, B. (2018). Emotional chatting machine: Emotional conversation generation with internal and external memory. 32nd AAAI Conference on Artificial Intelligence, AAAI 2018.
Zia, T., Akhter, M. P., & Abbas, Q. (2015). Comparative study of feature selection
approaches for Urdu text categorization. Malaysian Journal of Computer Science.
Appendix
A. Regular Expressions for Temporal Entities
1. Numeric day _ Urdu month _ Numeric year
2. Numeric day _ Urdu month
3. Urdu day _ Numeric month _ Urdu year (e.g., چودہ 12 دو ہزار بیس)

More expressions for the different date formats are given below. For readability, the long Urdu alternations are defined once and reused; s denotes the input Urdu text.

months = "جنوری|فروری|مارچ|اپریل|مئی|جون|جولائی|اگست|ستمبر|اکتوبر|نومبر|دسمبر"
# Day words (1st to 31st)
days = "یکم|دو|تین|چار|پانچ|چھ|سات|آٹھ|نو|دس|گیارہ|بارہ|تیرہ|چودہ|پندرہ|سولہ|سترہ|اٹھارہ|انیس|بیس|اکیس|بائیس|تئیس|چوبیس|پچیس|چھبیس|ستائیس|اٹھائیس|انتیس|تیس|اکتیس"
# Number words (1 to 99), used for the year part after دوہزار or انیس سو
units = "ایک|دو|تین|چار|پانچ|چھ|سات|آٹھ|نو|دس|گیارہ|بارہ|تیرہ|چودہ|پندرہ|سولہ|سترہ|اٹھارہ|انیس|بیس|اکیس|بائیس|تئیس|چوبیس|پچیس|چھبیس|ستائیس|اٹھائیس|انتیس|تیس|اکتیس|بتیس|تینتیس|چونتیس|پینتیس|چھتیس|سینتیس|اڑتیس|انتالیس|چالیس|اکتالیس|بیالیس|تینتالیس|چوالیس|پینتالیس|چھیالیس|سینتالیس|اڑتالیس|انچاس|پچاس|اکاون|باون|تریپن|چون|پچپن|چھپن|ستاون|اٹھاون|انسٹھ|ساٹھ|اکسٹھ|باسٹھ|تریسٹھ|چونسٹھ|پینسٹھ|سڑسٹھ|اڑسٹھ|انہتر|ستر|اکہتر|بہتر|تہتر|چوہتر|پچھتر|چھہتر|ستتر|اٹھتر|اناسی|اسی|اکاسی|بیاسی|تریاسی|چوراسی|پچاسی|چھیاسی|ستاسی|اٹھاسی|نواسی|نوے|اکانوے|بانوے|ترانوے|چورانوے|پچانوے|چھیانوے|ستانوے|اٹھانوے|ننانوے"

# Numeric day + Urdu month + numeric year
date1 = re.findall(r"(\d+\s)+(" + months + r")\s+(\d+)", s)

# Numeric day + Urdu month + Urdu year (دوہزار + number word)
expression1 = re.findall(r"\d+\s+(" + months + r")\s+(دوہزار)\s+(" + units + ")", s)

# Urdu day word + numeric part + Urdu year
expression2 = re.findall("(" + days + r")\s+\d+\s+(دوہزار)\s+(" + units + ")", s)

1. Keyword سنہ and a complete Urdu year, e.g. سنہ دوہزارتین
2. Month and a complete Urdu year, e.g. جنوری دوہزارپانچ
Weakness: although this gives the expected output, it has a basic issue: it treats every related piece of the string as a date, which makes the output noisy; it returns dates together with other matched words.
3. Complete Urdu date, e.g. تین مارچ دوہزارچھ
Weakness: it extracts all fully qualified dates accurately, but it produces some noise in the output, i.e., it also returns some related keywords.
4. Month / سال 2001 / سنہ 2001 …

# Keyword سنہ followed by a year written in words (انیس سو … or دوہزار …)
regex4a = re.findall("(سنہ)\s+(انیس سو|دوہزار)\s*(" + units + ")", s)

# Month name followed by a year written in words
regex4b = re.findall("(" + months + r")\s+(دوہزار)\s*(" + units + ")", s)

# Month with an adjacent numeric day or year, in either order
regex4c = re.findall("(" + months + r")\s*\d+|\d+\s*(" + months + ")", s)

# Numeric year introduced by سنہ or سال
regex4d = re.findall(r"سنہ\s+\d+|سال\s+\d{1,4}", s)

The regular expressions above cover Month/Year with the keywords سنہ and سال.

5. Fully qualified Urdu date with the keyword سنہ

# سنہ + Urdu day word + month + Urdu year in words
regex5 = re.findall("(سنہ)\s+(" + days + r")\s+(" + months + r")\s+(دوہزار)\s+(" + units + ")", s)

Table 1: Different formats of extracted dates. Sample strings matched by the five expression groups above include:

جنوری سنہ دوہزارایک، تین جنوری دوہزارایک، سنہ 2012، دو اکتوبر دوہزار چار، یکم مارچ دوہزاردو، سنہ دوہزاردو، دس فروری دوہزار چار، تین نومبر، اپریل دوہزارتین، سنہ دوہزارتین، سال 2025، فروری سنہ دوہزارچار، تین مارچ دوہزارچار، سنہ آٹھ دسمبر دوہزار ایک، دو اپریل، مئی دوہزارپانچ، سات اپریل، جون دوہزارچھ، سنہ دس مارچ 2000، جولائی سنہ دوہزارتین، دوہزارسات، دو مئی دوہزارچھ، دوہزار پانچ

# Urdu day word + month + numeric year
expression3 = re.findall("(" + days + r")\s+(" + months + r")\s+\d+", s)

# Numeric day + month + numeric year (same pattern as date1)
expression4 = re.findall(r"(\d+\s)+(" + months + r")\s+(\d+)", s)

The preceding expressions extract only fully qualified Urdu dates written in Urdu words, e.g. چودہ دسمبر دو ہزار بیس (14/12/2020).

# سنہ + day word + month + Urdu year in words
expression5 = re.findall("(سنہ)\s+(" + days + r")\s+(" + months + r")\s+(دوہزار)\s+(" + units + ")", s)

# Day word + month + year written as دوہزار … or انیس سو …
expression6 = re.findall("(" + days + r")\s+(" + months + r")\s+(دوہزار|انیس سو)\s+(" + units + ")", s)
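The extraction approach of this appendix can be sketched as a small runnable program. The word lists below are abridged for brevity (the full alternations appear in the expressions above), and the sample sentence is only illustrative:

import re

# Abridged alternations; the appendix defines the complete 1–31 and 1–99 lists.
MONTHS = "جنوری|فروری|مارچ|اپریل|مئی|جون|جولائی|اگست|ستمبر|اکتوبر|نومبر|دسمبر"
DAYS = "یکم|دو|تین|چار|پانچ|چودہ"          # day words, abridged
UNITS = "ایک|دو|تین|چار|پانچ|بیس"          # year number words, abridged

# Urdu day word + month + year written in words (دوہزار + number word);
# \s* allows the year to be written as one word, e.g. دوہزاربیس
fully_qualified = re.compile(rf"({DAYS})\s+({MONTHS})\s+(دوہزار)\s*({UNITS})")

# Numeric day + Urdu month + numeric year, e.g. "5 جنوری 2021"
numeric_date = re.compile(rf"(\d{{1,2}})\s+({MONTHS})\s+(\d{{4}})")

def extract_dates(text):
    """Return all date-like spans matched by either pattern."""
    spans = [m.group(0) for m in fully_qualified.finditer(text)]
    spans += [m.group(0) for m in numeric_date.finditer(text)]
    return spans

sample = "اجلاس چودہ دسمبر دوہزار بیس کو ہوا اور اگلا اجلاس 5 جنوری 2021 کو ہو گا"
print(extract_dates(sample))

As noted in the weaknesses above, patterns like these can also fire on partial fragments, so in practice the matched spans are post-filtered.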
B. List of Stop Words in Urdu Language
Source: https://raw.githubusercontent.com/SyedMuhammadMuhsinKarim/Urdu-Stop-Words/main/stop_words.txt
آجاؤ آج آئے آئیں آئی آو آؤ آ
آپکو آپکا آپ آجکل آجاو آجائیے آجائیں
اسکی اسکا اسطرح اس ابھی اب آیا آپکی
انکی انکا ان الگ اطراف اسے اسی اسکے
اونچے اونچی اونچا اور انہیں انھیں انھوں انکے
اہم اگرچہ اگر اکثر اپنے اپنی اپنا اوپر
بظاہر بس بذریعہ باہم باہر بارے بار بائیں
بیشک بیشتر بھی بہت بند بلاشبہ بغیر بعد
تعداد ترین تر تجھے تجھ تب تاہم بےشک
تمھیں تمھارے تمھاری تمھارا تمکو تمام تم تلک
تھی تھا تک تو تمہیں تمہارے تمہاری تمہارا
جاتی جاتا جائیں تیسرے تیسری تیسرا تھے تھیں
جو جبکہ جبھی جبہی جب جانے جانا جاتے
دائیں خود جیسےکہ جیسے جیسی جیساکہ جیسا جہاں
دوسری دوسرا دوران دور دو دفعہ درمیان
دیکھو دیکھا دینے دینی دینا دی دونوں دوسرے
رکھے رکھیں رکھی رکھا رکھ دے دیکھیں دیکھی
سارا ساتھ زیادہ رہے رہیں رہی رہا رہ
سکتیں سکتی سکتا سبہی سبھی سب سارے ساری
صرف صحیح شدہ شبہ شاید سے سوا سکتے
لو لازمی لا لئے غیر غلط طرف طرح
لینی لینا لیا لی لگے لگیں لگی لگا
مجھے مجھکو مجھ لے لیئے لیے لیکن لینے
ملی ملو ملا مل مشتمل مزید مرتبہ مربوط
نھیں نا میں میرے میری میرا مگر ملے
والا و نے نیچے نیچی نیچا نہیں نہ
پاس ویں وہی وہاں وہ والے والی والوں
پھر پورے پوری پورا پڑی پڑے پڑا پر
چاہا پہلےسے پہلے پہلی پہلا پھرے پھری پھرو
چاہیئے چاہی چاہنا چاہتے چاہتیں چاہتی چاہتا
کب کا چکے چکیں چکی چکا چاہے چاہیے
کردو کرتےہو کرتے کرا کر کتنا کبھی
کرنا کررہے کررہیں کررہی کررہا کردی کردیے کردیا
کرواسکتی کرواسکتا کروانے کروانا کرو کرنے کرنی
کرچکیں کرچکی کرچکا کروائے کروائی کروایا کرواسکتے
کس کرے کریں کرسکتیں کرسکتے کرسکتی کرسکتا کرچکے
کچھ کون کوئی کو کمی کم کل کسی
کہنا کہتے کہتی کہتا کہاں کہا کہ
کیا کی کہے کہیں کہوں کہو کہنے کہنی
کیں کیوں کیلیے کیلئے کیسے کیساتھ کیجیے کیجئے
ہاں گے گیا گی گئے گئیں گا کے
ہو ہمیں ہمارے ہماری ہمارا ہم ہر
ہوسکتا ہورہے ہورہی ہورہا ہوا ہوئے
ہوچکا ہونے ہونگے ہونگی ہونگا ہونا ہوسکتے ہوسکتی
ہوگے ہوگیا ہوگی ہوگا ہوگئی ہوچکی
یا ہے ہیں ہی ہوے ہوئے ہوں ہوگئے
یہاں یہ یوں
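A minimal sketch of stop-word filtering with this list. The set below is only a small subset of the full list above (which can also be loaded from the linked stop_words.txt file), and the sample sentence is illustrative:

# Small subset of the Urdu stop-word list in this appendix
URDU_STOP_WORDS = {
    "آج", "اب", "اس", "اور", "بھی", "پر", "سے", "کا", "کی",
    "کے", "کو", "کیا", "میں", "نے", "ہے", "ہیں", "یہ",
}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in URDU_STOP_WORDS]

tokens = "حکومت نے آج نئی پالیسی کا اعلان کیا".split()
print(remove_stop_words(tokens))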
C. PUBLISHED WORK
1. Ali, D., Missen, M. M. S., & Husnain, M. (2021). Multiclass Event Classification from Text. Scientific Programming. https://doi.org/10.1155/2021/6660651
2. Ali, D., Missen, M. M. S., Memon, M. A., Nizamani, M. A., & Shaikh, A. (2020). Extracting Temporal Entity from Urdu Language Text. University of Sindh Journal of Information and Communication Technology, 4(3), 181–188. Retrieved from https://sujo.usindh.edu.pk/index.php/USJICT/article/view/2886
Special Thanks
All prostration is for Allah Subhanahu wa Ta'ala.
I am very grateful to these honourable people, especially my supervisor and the Head of the Department.
Dr. Dost Muhammad Khan, HoD (Assistant Professor) ________________________
Dr. Malik Muhammad Saad Missen (Assistant Professor) ________________________
Dr. Mujtaba Husnain (Assistant Professor) ________________________
Dr. Najia Saher (Assistant Professor) ________________________
Dr. Muhammad Omer (Assistant Professor) ________________________
Dr. Waheed Anwar (Assistant Professor) ________________________
I would specially thank my MS fellows Mr. Zahid Khurshid and Mr. Muzammil
Zubair.