EVENT AND TEMPORAL ENTITY
EXTRACTION IN URDU LANGUAGE
TEXT
By
DALER ALI
A thesis submitted in partial fulfilment of the requirements
for the degree of
DOCTOR OF PHILOSOPHY
In
COMPUTER SCIENCE
Fall 2016-2019
Department of Information Technology
The Islamia University of Bahawalpur, Pakistan
Dedication
To the ALLAH Subhanahu Wa ta'ala
To the Last Prophet Hazrat MUHAMMAD (S.A.W.W.)
Student’s Declaration
I hereby declare that the work described in this dissertation was carried out by me under
the supervision of Dr. Malik Muhammad Saad Missen at the Department of Computer
Science & Information Technology, The Islamia University of Bahawalpur, Pakistan.
I also declare that the substance of this dissertation has neither been submitted
elsewhere nor is being concurrently submitted for any other degree.
I further declare that the dissertation embodies the results of my own research and advanced
studies, that it has been composed by me, and that, where appropriate, I have acknowledged
the work of others.
Daler Ali S/O Malik Hazoor Bukhsh Awan
Supervisor’s Declaration
It is hereby certified that the work presented by Mr. Daler Ali S/O Malik Hazoor Bukhsh,
Roll No. FA16M2PA003, in the thesis “EVENT AND TEMPORAL ENTITY
EXTRACTION IN URDU LANGUAGE TEXT” is based on a research study conducted
under my supervision. No portion of this work has been previously submitted for a higher
degree in this university or any other institute of learning and, to the best of the author's
knowledge, no material has been used which is not his own work except where due
acknowledgment has been made.
He has fulfilled all the requirements and is qualified to submit this thesis in partial
fulfilment of the degree of Doctor of Philosophy (Ph.D.) in the field of Computer Science,
in the Faculty of Computing, at The Islamia University of Bahawalpur.
Dr. Malik Muhammad Saad Missen
Assistant Professor,
Department of Information Technology
Faculty of Computing
The Islamia University of Bahawalpur
Acknowledgment
All my gratitude and bows to Allah Almighty, Who gave me the strength to achieve
my research goals.
I would like to pay the deepest gratitude to my great supervisor, Dr. Malik
Muhammad Saad Missen, Assistant Professor, Department of Information
Technology, who stood by me through thick and thin. He spent countless hours
facilitating and guiding me in my research work throughout the research period. His
conduct during discussions of research issues was tremendous.
I am also very thankful to Dr. Dost Muhammad Khan, Assistant Professor and HoD,
Department of Information Technology. He supported and encouraged me to
complete my research work. His kind and cooperative behaviour is a role model for
me.
A very special thanks to Dr. Mujtaba Husnain, Assistant Professor, Department of
Information Technology, who encouraged and helped me to synthesize my research
work. His moral support was an invaluable factor in my research work. I am very thankful
to my parents for their precious prayers and moral support.
Daler Ali
Table of Contents
Contents Page No.
Dedication i
Declaration of Originality iii
Acknowledgement vi
Table of Contents vii
List of Tables x
List of Figures xii
List of Abbreviations xiii
Abstract xiv
Chapter No. 1
Introduction
1.1 Concept of Event and Temporal Entity 05
1.1.1 Event 05
1.1.2 Temporal Entity 06
1.1.2.1 Fully Qualified 06
1.1.2.2 Deictic 06
1.1.2.3 Anaphoric 07
1.1.3 Temporal Entities in the Urdu Language 07
1.1.3.1 Urdu Fully Qualified Date 08
1.1.3.2 Different Types of Urdu Fully Qualified Date 08
1.1.3.3 Urdu Deictic 09
1.1.3.4 Urdu Anaphoric 09
1.1.4 Applications of Temporal Entities 10
1.2 Event Detection 10
1.3 Event Classification 11
1.3.1 Binary Classification 11
1.3.2 Multiclass Classification 12
1.3.3 Multilabel Classification 12
1.4 Event Classification in the Urdu Language 13
1.5 Challenges in Event Detection and Classification 14
1.5.1 General Challenges 14
1.5.2 Event Detection and Classification Methodology Challenges 15
1.5.3 Pre-Processing Challenges in Event Detection and Classification 15
1.5.4 Feature Extraction Challenges 15
1.6 Challenges in Event Detection and Classification from the Urdu Language 15
1.7 Importance of Time in Event Detection 16
1.8 Research Motivation 16
1.9 Research Problem 18
1.10 Research Objectives 19
1.11 Thesis Organization/Structure 20
Chapter No. 2
Background and Related Work 21
2.1 Event Detection and Classification 22
2.2 Existing Methodologies 28
2.2.1 Data Driven Approach 28
2.2.2 Knowledge Driven Approach 28
2.2.3 Hybrid Approach 28
2.3 Temporal Entity Extraction 29
Chapter No. 3
Dataset 32
3.1 Multiclass Urdu Language Labelled Sentences (MULLS) 34
3.1.1 Data Collection 35
3.1.2 Pre-processing 36
3.1.2.1 Post Splitting 36
3.1.2.2 Stop Words Elimination 36
3.1.2.3 Noise Removal 36
3.1.2.4 Filtering Sentences 37
3.1.3 Annotation Guidelines 37
3.1.4 Training Dataset 41
3.1.5 Testing/Validation Dataset 42
3.2 Urdu Named Entity Recognition (UNER) Dataset 42
Chapter No. 4
Event Classification
4.1 Proposed Methodology for Event Classification 45
4.2 Experimental Setup of Multiclass Event Classification 47
4.2.1 Feature Space 47
4.2.2 Feature Vector Generating Techniques 47
4.2.2.1 Word Embedding 47
4.2.2.2 Pretrained Word Embedding Models 47
4.2.2.3 One Hot Encoding 49
4.2.2.4 TF-IDF 49
4.3 Deep Learning Models 50
4.3.1 Deep Neural Network Architecture (Feedforward/DNN) 50
4.3.2 Recurrent Neural Network (RNN) 50
4.3.3 Convolutional Neural Network (CNN) 50
4.3.4 Hyperparameters 50
4.3.5 Performance Measuring Parameters 52
4.4 Results 52
4.4.1 Deep Learning Classifiers 52
4.4.1.1 Pretrained Word Embedding Models 53
4.4.1.2 TF-IDF Feature Vector 53
4.4.1.3 One-Hot Encoding 57
4.5 Traditional Machine Learning Classifiers 57
4.5.1 K-Nearest Neighbour (K-NN) 57
4.5.2 Decision Tree 58
4.5.3 Naïve Bayes Multinomial (NBM) 59
4.5.4 Logistic Regression (LR) 60
4.5.5 Random Forest (RF) 60
4.5.6 Support Vector Machine (SVM) 61
Chapter No. 5
Temporal Entity Extraction
5.1 Proposed Methodology for Temporal Entity Extraction 64
5.2 Rule-based Approach (Regular Expression) 64
5.3 Experimental Setup of Temporal Entity Extraction 65
5.4 Results 65
5.5 Discussion 67
Chapter No. 6
Conclusion and Future Work
6.1 Conclusion 70
6.2 Future Work 71
References 72
Appendix 80
Published Work 87
Special Thanks 88
List of Tables
Table No.    Title of Tables    Page No.
1.1 Top 5 widely spoken languages in the world 04
1.2 Deictic words 09
1.3 Examples of Event 14
2.1 TempEx Tagger for an exact match on tag span and value calculation (Bittar et al., 2011) 23
2.2 Performance of classifier on Portuguese dataset (Costa & Branco, 2012) 24
2.3 Evaluation results for PET for event recognition (Yaghoobzadeh et al., 2012) 25
2.4 Summary of the related research 27
3.1 Urdu label sentences 38
3.2 Sentence tokenization 38
3.3 Class label 38
3.4 The summary of dataset 39
3.5 The example of count vectorizer 41
4.1 Pretrained word embedding model and custom word embedding model 49
4.2 Event sentence 49
4.3 Event sentence converted to numbers using One-Hot Encoding 49
4.4 DNN’s Hyperparameters 51
4.5 RNN’s Hyperparameters 51
4.6 CNN’s Hyperparameters 52
4.7 Classification accuracy of the CNN model 53
4.8 Performance measuring parameters for DNN model 54
4.9 Performance measuring parameters for RNN model 55
4.10 Performance measuring parameters for the CNN model 56
4.11 Performance measuring parameters for the K-NN model 58
4.12 Performance measuring parameters for the DT model 59
4.13 Performance measuring parameters for NB Multinomial model 59
4.14 Performance measuring parameters for the LR model 60
4.15 Performance measuring parameters for the RF model 60
4.16 Performance measuring parameters for SVM model 61
5.1 All dates extraction results on original dataset 65
5.2 UFQD & UPFQD on extended dataset 66
5.3 Deictic date analysis 66
List of Figures
Figure No.    Caption of Figures    Page No.
1.1 Internet users in the world 02
1.2 Social Media users in the world 03
1.3 Usage of Urdu language on Facebook 04
1.4 Usage of Hindi language on Facebook 05
1.5 Usage of Arabic language on Facebook 05
1.6 Types of temporal entities 07
1.7 Types of temporal entities in Urdu language text 08
1.8 Binary classification 11
1.9 Multiclass classification 12
1.10 Three types of classification 13
1.11 A generic application diagram of our proposed system 19
3.1 Dataset life cycle (DLC) 33
3.2 Urdu and Hindi language text on Social Media 35
3.3 Instances of the pre-processed dataset 39
3.4 Maximum number of instances of each type of event 40
4.1 Event classification methodology 46
4.2 Event classification methodology’s flow diagram 46
4.3 RNN’s accuracy 55
4.4 CNN’s accuracy distribution 56
4.5 CNN, RNN, and DNN accuracy using one-hot-encoding 57
4.6 Machine learning algorithms’ accuracy using TF_IDF 61
5.1 Temporal entity extraction methodology 65
5.2 All Dates extraction results on original dataset 66
5.3 Average of UFQD & UPFQD on extended dataset 67
5.4 Deictic date analysis 67
List of Acronyms
NLP: Natural Language Processing
ML: Machine Learning
DL: Deep Learning
MULLS: Multiclass Urdu Language Labelled Sentences
UNER: Urdu Named Entity Recognition
DNN: Deep Neural Network
RNN: Recurrent Neural Network
CNN: Convolutional Neural Network
TE: Temporal Entity
Regex: Regular Expression
UFQD: Urdu Fully Qualified Date
HUFQD: Hybrid Urdu Fully Qualified Date
SVM: Support Vector Machine
RF: Random Forest
DT: Decision Tree
K-NN: K-Nearest Neighbour
LR: Logistic Regression
NBM: Naïve Bayes Multinomial
Abstract
The digital world has created space for multiple languages to be used for communication via
the Internet. The Internet provides various facilities, like real-time availability and open access
to different platforms, e.g., social media, news websites, vlogs, and weblogs, for
communication. People located in different areas of the world are connected with one another
like a global village via the Internet. They generate unstructured and structured
(heterogeneous/homogeneous) data during conversation. A huge volume of data exists in
different languages on social media and news websites that contains invaluable insights. To
add new milestones in the field of NLP, it is very important to process the content of
languages other than English. Many applications can be developed by processing local
languages, e.g., monitoring systems, topic detection, event classification, and recommendation
systems, which will certainly improve business services and the performance of private and
public/government institutions.
The Urdu language is a resource-poor language that has more than 300 million speakers all
around the world. A huge volume of Urdu content in textual form exists on social media and
news websites and contains valuable insights related to different events happening around us
in a specific time span, e.g., terrorist attacks, political campaigns, protests, and sports.
The research work consists of two major tasks, i.e., temporal (time/date) entity extraction and
event classification, with several sub-tasks under each. Temporal entity extraction and
multiclass event classification are performed on Urdu language text. The research tasks are
performed on a textual corpus consisting of 0.15 million labelled instances (sentences), named
“Multiclass Urdu Language Labelled Sentences (MULLS)”.
In this thesis, we described and accomplished three main tasks related to Urdu language text:
1) Event Extraction: the task of retrieving event information,
2) Event Classification: the task of assigning a pre-defined event label to input data, and
3) Extracting temporal information associated with events.
To achieve our goals, the proposed methodologies are data-driven (deep learning) for event
classification and knowledge-driven (rule-based) for temporal entity extraction.
To achieve our research objectives, we have explored machine learning and deep learning
classifiers. The well-known machine learning classifiers, i.e., SVM, K-NN, DT, LR, and
NBM, are evaluated on the MULLS corpus. We also evaluated the deep learning classifiers,
i.e., CNN, DNN (deep/feedforward), and RNN (bi-directional), on the same corpus. Deep
learning models outperformed the machine learning models. Among the deep learning
models, DNN showed the highest accuracy of 84% for multiclass event classification.
Secondly, temporal entities (TEs) are very important for predicting the time of events. We
have explored the various types of TEs that are helpful for developing NLP applications. We
decided to use a rule-based approach, i.e., regular expressions. TEs are extracted from the
UNER dataset, which is publicly available for research purposes. We have also suggested
new names for the different types of fully qualified dates (FQDs) occurring in Urdu language
text. Our proposed regexes showed considerable results for FQD TEs. Although anaphoric
and deictic TEs were extracted successfully, they require contextual information to yield
their actual meaning. Such TEs can be analysed using a deep learning approach.
CHAPTER 1
INTRODUCTION
1 Introduction
In this age of technology, the use of paper and pen is fading away. People prefer to use
digital gadgets instead of traditional communication sources like postal mail. High-speed
communication networks, i.e., the Internet, have become popular for instant communication
(chat). The Internet has become a vital source of communication among people located at
different places. The rapid growth of Internet users up to 2020 can be observed in Figure 1.1.
Figure 1.1: Internet Users in the World1
These days, social networks have fascinated millions of people all around the world. People
use these networks for different purposes, e.g., to share opinions, events, news,
advertisements, and research ideas on Facebook, Twitter, Instagram, and news websites. A
pictorial overview of social network users from 2010 to 2021 by Statista (website) is given in
figure 1.2. It shows the popularity, dominating influence, and craze of such communication
networks in society.
1https://dazeinfo.com/2016/06/13/number-internet-users-worldwide-2016-2020/
Figure 1.2: Social Media Users in the World2
People from every corner of the world, speaking different languages, are producing
gigabytes of information regularly. The Internet provides online input tools, like google
input tools3, that support more than 187 languages. These tools allow users to type quickly,
efficiently, and easily in various languages. The fundamental factors causing the production
of a huge amount of multilingual data are the continuous growth of Internet users, tools
supporting multiple languages, and platforms that provide communication facilities in
multiple languages.
To develop intelligent applications, it is essential to extract worthy insights from the huge
amount of structured and unstructured data existing on digital platforms. English is indeed
the dominant language of the Internet. However, online social networks have allowed their
users to use local languages for communication, which has produced a huge amount of data
in local languages. The use of local languages is highly preferred because of many factors,
like ease of use and the desire to promote local languages. More than 187 languages are
supported by google input tools. The top 5 most widely used languages in the
2 https://www.statista.com/statistics/278414/number-of-worldwide-social-network-users/
3 https://www.google.com/intl/ur/inputtools/try/
world are mentioned in Table 1.1.
Table 1.1: Top 5 Widely Spoken Languages in the World
Language Speakers
English 1132 Million
Mandarin 1117 Million
Hindi 615 Million
Spanish 514 Million
Urdu 300 Million
The English language is the most widely used language, with 1132 million users in the
world. Other non-English languages, i.e., Mandarin (1117 million), Hindi (615 million),
Spanish (534 million), and Urdu (300 million) (M. D. Eberhard, 2019), also have a
considerable number of users. The statistics of multilingual users, the small amount of
reference work for non-English languages, and the importance of predicting and monitoring
conversations motivated us to focus on non-English languages of Asian countries.
Naturally, humans prefer local languages for communication, as they deliver the
implications of a situation effectively. All the above-mentioned factors are the reasons
promoting the use of local languages over the Internet. It can be seen in figures 1.3 to 1.5
that the usage of local languages, i.e., Urdu, Hindi, and Arabic, on social networks is
becoming popular. The use of Urdu, Hindi, Arabic, and their Roman scripts is very common
in Pakistan, India, Bangladesh, and Saudi Arabia.
Figure 1.3: Usage of Urdu language on Facebook
Figure 1.4: Usage of Hindi language on Facebook
Figure 1.5: Usage of Arabic Language on Social Media
The above examples highlight the usage of local languages on social networks. It is
noticeable that people turn instantly to social networks when something happens or a
specific event occurs, e.g., music concerts, sports, political campaigns, protests, and bomb
blasts, to get to know the details and real facts. People give their opinions, show reactions,
and post comments on these events. These responses are very important for developing
Natural Language Processing (NLP) applications.
1.1 Concept of Event and Temporal Entity
1.1.1 Event
The definition of an event varies from domain to domain. In the literature, the event is
defined from various aspects, such as a verb-, adjective-, or noun-based environmental
situation (Dr. D. Ramehs, 2016)(AHMED et al., 2016). Similarly, an event can be defined as
“specific actions, situations or happenings occurring in a certain period” (Yang et al.,
1998)(Tomas, 2015).
Events can be represented as follows (Dr. D. Ramehs, 2016):
• Tensed verbs, e.g., Ali took juice yesterday. “Took” represents the event.
• Un-tensed words, e.g., the valuable statement of the PM is to appreciate talented and
industrial people. “Appreciate” is the event.
• Nominalizations, e.g., the Pakistani Air Force (PAF) strike has opened the eyes of the
whole world. “Strike” is the nominal event related to the PAF.
• Adjectives, e.g., the Pakistani cricket team seems helpless before the Australian
Kangaroos. “Helpless” is an adjective event.
• Predicative clauses, e.g., there is no reason why our people would not be prepared
to face the war escalation. In the given example, “be prepared” is a predicative
clause event.
• Prepositional phrases, e.g., all the 157 people on board the Boeing 737 jet died in the
airplane crash. “On board” represents the event.
1.1.2 Temporal Entity
Temporal entities are time-dependent pieces of information whose state changes over time,
e.g., events and acts. A temporal entity can also be defined as “anything in a temporal
expression that represents ‘when’, ‘how long’, or ‘how often’ something happens.” It
represents the time of a specific phenomenon. Time and date are very important temporal
entities for developing many Natural Language Processing (NLP) applications.
Categorically, a temporal entity (date) can be classified into three types, i.e., fully qualified
date, deictic date, and anaphoric date (Ahn, 2005)(Filannino, 2015).
1.1.2.1 Fully Qualified
A temporal entity that consists of complete date information, i.e., day, month, and year.
The format of a fully qualified temporal entity is dd/mm/yyyy, for example, 08/12/2020.
1.1.2.2 Deictic
A temporal expression that represents a date requiring further analysis: an expression of
words whose interpretation requires the utterance time of the words. For example, today and
tomorrow.
1.1.2.3 Anaphoric
A case of deictic expression for which the reference time varies according to a previously
mentioned temporal expression. For example, that year, last week, and two months.
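The practical contrast between these categories is that only the fully qualified form carries a self-contained surface pattern; deictic and anaphoric expressions have no date digits to match. As a minimal illustration (the regular expression and the sample sentence are ours, not from the thesis), a fully qualified dd/mm/yyyy date can be extracted as follows:

```python
import re

# Fully qualified dates carry day, month, and year explicitly, so a plain
# pattern suffices. Deictic ("today") and anaphoric ("that year") forms
# have no such surface pattern and need context to resolve.
FQD_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def extract_fqd(text):
    """Return (day, month, year) tuples for every dd/mm/yyyy match."""
    return [tuple(map(int, m)) for m in FQD_PATTERN.findall(text)]

print(extract_fqd("The meeting was held on 08/12/2020 and again on 3/1/2021."))
# → [(8, 12, 2020), (3, 1, 2021)]
```

Note that this sketch does not validate day/month ranges; a production extractor would also check that the matched fields form a real calendar date.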
A temporal entity is “an entity that represents time in a dataset”. In the literature, generally
three types of temporal entities have been discussed. We have further explored these types of
entities and proposed (introduced) a new type of temporal entity named “Partially Fully
Qualified Date”.
Figure 1.6: Types of Temporal Entities
1.1.3 Temporal Entities in the Urdu Language
Deep analysis of dates written in Urdu language text revealed that they can be divided into
four types, i.e., Urdu Fully Qualified, Urdu Partially Fully Qualified, Urdu Deictic, and
Urdu Anaphoric. The tree diagram in figure 1.7 is a vivid depiction of the various types of
temporal entities in the Urdu language. These types of temporal entities were not reported
before our work for any language.
Figure 1.7: Types of Temporal Entities in Urdu Language Text
1.1.3.1 Urdu Fully Qualified Date
A temporal expression that consists of complete date information: day, month, and year, i.e.,
dd/mm/yyyy (20/10/2018) بیس اکتوبر دوہزار اٹھارہ. A date written in the Urdu language which
consists of day, month, and year is called an Urdu Fully Qualified Date. For example,
(1) آٹھ فروری انیس سو اکانوے
Day, month, year, and century can be represented in the following ways:
(1) Western (ASCII) digits 0, 1, 2, …, 9, e.g., 02-10-2012
(2) Urdu (Arabic-Indic) digits, e.g., (۰۳/۱۱/۱۹۹۱)
(3) Urdu words ایک، دو، تین، چار۔۔۔ e.g., دو ہزار اٹھارہ دسمبر چودہ اگست انیس
(4) A mix of all of these, e.g., 25-جولائی-2018
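The mixed digit styles above suggest a simple normalization step before matching: mapping Urdu (Extended Arabic-Indic) digits to ASCII lets one date pattern cover both styles. The following Python sketch is our illustration of that idea, not the thesis's implementation:

```python
import re

# Urdu text uses Extended Arabic-Indic digits U+06F0..U+06F9 (۰..۹).
# Translating them to ASCII digits first lets a single regex match
# dates written in either digit style.
URDU_TO_ASCII = {ord('۰') + i: str(i) for i in range(10)}

def normalize_digits(text):
    """Replace Urdu digits with ASCII digits, leaving other text intact."""
    return text.translate(URDU_TO_ASCII)

DATE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")

sample = "۰۳/۱۱/۱۹۹۱"  # the Urdu-digit example shown above
print(DATE.findall(normalize_digits(sample)))  # → ['03/11/1991']
```

Handling dates written out as Urdu words (format 3) would additionally require a word-to-number lexicon, which this sketch does not cover.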
1.1.3.2 Different Types of UFQD Regarding Processing
Our analysis showed that the Fully Qualified Date (FQD) in the Urdu language can be
represented in different formats, so for convenience of understanding we have suggested a
name for them, i.e., Hybrid Urdu Fully Qualified Date (HUFQD). Examples are given here:
• Numeric day and Urdu month/year, e.g., 25دسمبر دوہزارسات, 25مارچ انیس سو چالیس
• Urdu day/year and numeric month, e.g., دوہزار بارہ 5چار, دس 8دوہزار تیرہ
• Urdu day/month and numeric year, e.g., یکم مارچ 2008, پندرہ جون 2009
• Urdu Partially Fully Qualified Date
A type of date written in Urdu text which is missing one of the components, i.e., day,
month, or year. For example, 26/2008, 08/2016, or 26/08 in English, while in Urdu دس جولائی
(10/07), جولائی دوہزارسات (07/2007).
1.1.3.3 Urdu Deictic
A temporal expression that represents a type of date requiring further analysis: an
expression of words whose interpretation requires the utterance time (Filannino M. G.,
2015). For example, (1) آج (‘today’) and (2) کل (‘tomorrow’) are dates that cannot be directly
mapped to a standard date format. They require further analysis of the context to yield
purposeful meanings, e.g., اب،تب، وقت، دن، رات اور صبح وغیرہ. A representative but non-
exhaustive collection of deictic words used in the Urdu language to represent time is given
in table 1.2.
Table 1.2: Deictic Words
Deictic Words Representing Time
فرصت جمعرات رات لمحہ
مہلت جمعہ روز دقیقہ
وقفہ ہفتہ یوم ساعت
دورانیہ اتوار وار پل
آغاز تب شب گھڑی
شروع کب صبح لحظہ
اختتام ابھی مہینہ سیکنڈ
رُت دیر سال آن
تاریخ تاخیر برس دم
حیات جلدی صدی عہد
زندگی اثنا سحر دور
باری سردی فجر زمانہ
موسم گرمی دوپہر آن
موقع خزاں سہ پہر وقت
عمر بہار شام قرن
دراز اوقات تڑکا مدت
روزانہ میعاد سویرا زمانہ
آج ازل سوموار منٹ
کل ابد منگل گھنٹا
پرسوں عرصہ بدھ دن
1.1.3.4 Urdu Anaphoric
A case of deictic expression for which the reference time varies according to a temporal
expression previously mentioned in the text. For example, اسُ سال (‘that year’), پچھلے ہفتے
(‘last week’), and دو ماہ (‘two months’). It is a special case of the deictic date which requires
a reference time, varying from context to context, to conclude meaningful information, e.g.,
اگلے سال، پچھلے دن، کئی سال.
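The dependence of deictic expressions on an utterance time can be made concrete with a small sketch. In this Python example (the word list, offsets, and document creation time are our own assumptions, for illustration only), a few deictic Urdu words are resolved against a given reference date:

```python
from datetime import date, timedelta

# Assumed day offsets relative to the utterance time:
# آج = today, کل = tomorrow (کل can also mean 'yesterday' in Urdu;
# a real system needs context to disambiguate), پرسوں = day after tomorrow.
DEICTIC_OFFSETS = {"آج": 0, "کل": 1, "پرسوں": 2}

def resolve_deictic(word, utterance_time):
    """Map a deictic word to a concrete date, given when it was uttered."""
    offset = DEICTIC_OFFSETS.get(word)
    if offset is None:
        return None  # not a known deictic word
    return utterance_time + timedelta(days=offset)

dct = date(2020, 12, 8)  # assumed document creation time
print(resolve_deictic("کل", dct))  # → 2020-12-09
```

Anaphoric expressions such as اسُ سال (‘that year’) are harder still: the reference time is not the utterance time but an earlier temporal expression in the text, so resolving them requires tracking previously mentioned dates.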
1.1.4 Applications of Temporal Entities
There are several applications that utilize extracted insights related to temporal entities, e.g.,
timeline construction, tracking the history of stories, estimating document creation time
(DCT), improving the news reading experience, and enhancing the information retrieval
capabilities of systems (Zaraket & Makhlouta, 2012).
Detecting and extracting temporal entities from the content of documents, instead of from
metadata such as the Document Creation Time, is preferable for the research community,
because the last modification date of a document is not its DCT: data may be copied or
uploaded, at which time the metadata of the document is updated, and this is not the actual
DCT (Li, Hang, et al., 2009). Automatically assigning an event-time period to documents,
e.g., medical reports, accident reports, news articles, and traveling histories, is probably
more significant than relying on the writing/publishing date of these documents (Llidó et
al., 2001).
1.2 Event Detection
Event detection is a fundamental task in NLP which can be used to analyze risk factors, to
predict the law and order situation, and to support decision making (Hogenboom et al.,
2016). It is applied in mediation information systems (Barthe-Delanoë et al., 2014), firm-
specific analysis and social media monitoring (Jiang et al., 2014), vehicle routing (Pillac et
al., 2012), environment scanning (Wei & Lee, 2004), news personalization systems (Borsje
et al., 2010), discovering defects in products (Hogenboom et al., 2016), advanced spatio-
temporal reasoning about moving objects (Jin et al., 2013), algorithmic trading (Nuij et al.,
2014), financial risk analysis, quality assurance (Abrahams et al., 2012), terrorism detection
(Conlon et al., 2015), e-commerce (A.S. Abrahams, 2002), timeline design, etc. Event
extraction, with its origins in the 1980s, has become an interesting and popular problem due
to the availability of resources, i.e., datasets, processing tools, etc., for many languages. As
defined by (Li et al., 2017), an event is a general term used to refer to a happening: some
situation or action depending on time. Events consist of an “event trigger” and “event
arguments”. Event triggers are the factors that cause events to happen, i.e., action/verb
words. Event arguments are the named entities, i.e., person, place, and organization. Event
detection is categorized into two subtasks, i.e., retrospective event detection and new event
detection (Jin et al., 2013). The former extracts events from pre-collected resources while
the latter discovers new events from real-time streams of text.
Event detection can be performed at the sentence (Naughton et al., 2010), paragraph
(D’Andrea et al., 2019), phrase, and document level (Jacobs et al., 2019). The extracted
information can represent different types of events, e.g., sports, politics, terrorist attacks, and
inflation. Information related to an event can be detected and classified at different levels of
granularity, i.e., document level (D’Andrea et al., 2019), sentence level (Jacobs et al., 2019),
word level, character level, and phrase level (D’Andrea et al., 2019).
1.3 Event Classification
“Event classification is an automated way to assign a predefined label to new instances.” It
can also be defined as “the automated way of assigning predefined event labels to new
instances by using pre-trained classification models.” All the classifiers are trained on
labelled instances of a dataset and are later used to predict the event class of new, unknown
instances.
Event classification information can be used to develop several different NLP applications
like content labeling, topic modeling, and finding the latest trends. Event classification is
one of the practical yet challenging tasks of NLP. It is pertinent to mention that events can
be classified in more than one way (Sokolova & Lapalme, 2009), i.e., binary classification,
multiclass classification, and multilabel classification.
1.3.1 Binary Classification
In binary classification, a data point is assigned one (1) class out of a total of two (2)
classes. It is used for datasets that contain two classes as output, for example, positive or
negative, male or female (D. Ali et al., 2016), and spam or not-spam.
Figure 1.8: Binary Classification
1.3.2 Multiclass Classification
A type of classification in which new instances or data points are classified into one class
out of multiple classes, for example, event classes, i.e., terrorist attack, murder, accident,
and outbreak. In this type of classification, every sentence is labelled with one of the
multiple classes.
Figure 1.9: Multiclass Classification
It is the task of automatically assigning the single most relevant class from the given
multiple classes. A major and serious challenge for multiclass classification is that
sentences overlap in multiple classes (Kong et al., 2011; Sarker & Gonzalez, 2015), which
generally affects the overall performance of the classification system.
1.3.3 Multilabel Classification
In multilabel classification, an object can be assigned more than one class. A collective
comparison of the three types of classification is shown in figure 1.10.
Figure 1.10: Three types of Classification Including Multilabel Classification
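The difference between the three settings is essentially the shape of the target variable. The following Python sketch (with toy labels of our own, for illustration only) contrasts the single label per instance of binary and multiclass classification with the indicator-matrix encoding commonly used for multilabel targets:

```python
# Binary: each instance gets one of exactly two classes.
binary_labels = ["spam", "not-spam", "spam"]

# Multiclass: each instance gets exactly one of many classes.
multiclass_labels = ["terrorist-attack", "murder", "accident"]

# Multilabel: each instance gets a SET of classes, usually encoded
# as a row of 0/1 indicators over the full class list.
classes = ["sports", "politics", "protest"]
multilabel_rows = [{"sports", "politics"}, {"protest"}]

def to_indicator(label_set, classes):
    """One row of a multilabel indicator matrix: 1 if the class applies."""
    return [1 if c in label_set else 0 for c in classes]

matrix = [to_indicator(row, classes) for row in multilabel_rows]
print(matrix)  # → [[1, 1, 0], [0, 0, 1]]
```

In the multiclass setting studied in this thesis, each MULLS sentence would correspond to a single event label, i.e., one entry of `multiclass_labels`, rather than an indicator row.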
1.4 Event Classification in the Urdu Language
There are several hurdles in processing Urdu language text for event classification. Some of
them are determining the boundary of events in a sentence, identifying event triggers, and
assigning an appropriate label. (Naz et al., 2013) reported that more than 100 million people
understand, speak, and write the Urdu language in the sub-continent. Urdu has become very
popular on the web, especially in online social networks, because of the availability of input
tools for Urdu. Extracting events from social networks for the Urdu language is a unique
and challenging task. The Urdu language has a complex writing script and a right-to-left
writing style. Its grammatical structure is different from other languages, e.g., English,
French, and German; it follows the subject-object-verb (SOV) sequence (Daud et al., 2017).
The Urdu language is complex because it consists of joined-letter and non-joined-letter sets
(Pal & Sarkar, 2003). Each letter of the joined-letter set can be written at three different
locations in a word, with a different form at the beginning, middle, and end (Pal & Sarkar,
2003). Some words can be combined to make a single word, e.g., اس لیے written as اسلیے
(‘so that’), which makes it hard to process the Urdu language with existing tools.
Table 1.3: Examples of Events

Urdu: جنوبی کوریا: کتوں کا سب سے بڑا مذبح خانہ بند کر دیا گیا۔
Roman Urdu: Janobi Koriya: Kutoon ka sab sy barra mizbah khana bandd ker dia gaya.
English: South Korea: The largest dog slaughterhouse has been banned.

Urdu: 1981 کے بعد پہلی مرتبہ خواتین کو ایران میں سٹیڈیم میں آ کر میچ دیکھنے کی اجازت ملی۔
Roman Urdu: 1981 kay baad pahli martaba khawateen ko Iran mein stadium mein aa ker match daikhny ki ijazat mili.
English: After 1981, women were allowed to come to the stadium to watch a match in Iran for the first time.
1. In the first example, the words (بند کرنا, ‘to ban’) represent the action word, while
(کتوں, ‘dogs’) and (مذبح خانہ, ‘slaughterhouse’) are nouns. Extracting and sorting this
information is helpful for identifying events in the Urdu language.
2. In the second example, 1981 and (بعد, ‘after’) are temporal entities while (اجازت ملنا,
‘to be allowed’) is a word representing the event. Such information can be used to construct
a historical timeline about Iran.
In our research problem, an event can be defined as “an environmental change that occurs
due to some reason or action for a specific period”, for example, the explosion of a gas
container, a collision between vehicles, terrorist attacks, and rainfall.
Social media provides a platform to share information in different languages on various
topics. The classification of this information is very important in NLP tasks. Urdu is one of
the local languages being used on online social media for sharing information. Classifying
such information into different types can be helpful for different NLP applications, e.g., a
risk factor analyzer, a law and order situation predictor, and an event timeline constructor
for certain areas of the world. In our research work, we explore social media textual data
written in the Urdu language to classify events into different categories. To the best of our
knowledge, we are the first to explore Urdu textual data for event classification.
1.5 Challenges in Event Detection and Classification
1.5.1 General Challenges
Social networks have become the world's central hub for sharing information, generating
a large volume and variety of data (Al-Dyani et al., 2018). Extracting worthy information
from this huge bulk of data is a challenging task. Publicly available datasets are scarce and
platform-dependent (McMinn et al., 2013), i.e., tied to Twitter data, which hampers the
replication and comparison of different approaches (Panagiotou et al., 2016). Much user-
generated data on social networks has irregular grammar, irrelevant terms, limited length,
and spelling errors (Parikh & Karlapalem, 2013).
1.5.2 Event Detection and Classification Methodology Challenges
In general, there are two main approaches to event detection: document pivot and feature
pivot (McMinn et al., 2013). In the document pivot approach, clustering is performed
based on document similarity, which cannot handle the large amount of data on social
networks. Moreover, identical terms are used in different events, which degrades the
accuracy of event detection with this approach (McMinn et al., 2013). Supervised, feature
pivot event detection shows good results, as mentioned in the literature (Mohamad et al.,
2010; Lavanya et al., 2014), but it is time-consuming and requires a large volume of
training data and a lot of human effort (Al-Dyani et al., 2018).
1.5.3 Pre-Processing Challenges in Event Detection and Classification
The social stream is full of noise, e.g., advertisements, hoaxes, and spam messages;
identifying event-related content amid this noise is another challenge in event detection
(Li & Zheng, 2014). Data representation techniques, i.e., bag of words (BOW) and term
frequency (TF), have their limitations. With the BOW technique, event classification is
challenging because it does not preserve the sequence of words, so overlapping sentences
lead to misclassification; term frequency, on the contrary, utilizes more resources, i.e.,
time and memory (Al-Dyani et al., 2018).
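The word-order limitation of BOW noted above can be seen in a few lines of code; the example sentences are toy English stand-ins for Urdu text:

```python
from collections import Counter

# Two sentences with opposite meanings produce identical bags, because a
# bag-of-words representation discards word order entirely.
def bag_of_words(sentence: str) -> Counter:
    """Represent a sentence as an unordered multiset of its tokens."""
    return Counter(sentence.lower().split())

a = bag_of_words("police stopped the protest")
b = bag_of_words("the protest stopped police")
print(a == b)  # True: order is lost, so a classifier sees the same features
```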
1.5.4 Feature Extraction Challenge
Features are a crucial element of event detection. Social streams contain a huge number
of features (Dou et al., 2012; Lu et al., 2015). Dependency among the extracted features
leads to ambiguity in event detection and classification, because different events can be
expressed using similar words (identical features).
1.6 Challenges in Event Detection and Classification from the Urdu
Language
The major challenges of event extraction and classification for the Urdu language are:
• The writing style and structure of the Urdu language pose challenges to event
extraction and classification,
• The lack of processing resources, i.e., part-of-speech (PoS) taggers, named entity
recognizers, and annotation tools, is another big challenge in event detection and
classification for the Urdu language,
• People are generally unfamiliar with the meaning and usage of the Urdu language,
• The misuse of different terms representing events makes event classification more
challenging, because the same words often represent different events; this badly
affects the accuracy of a classification system,
• Knowledge-based and data-driven approaches to extracting and classifying events
from Urdu language text face the unavailability of publicly available event datasets,
• Extracting temporal entities is important for classifying events as retrospective
(old) or real-time (new). The cursive and complex writing format makes extracting
events and temporal information from the Urdu language an interesting and
challenging task.
1.7 Importance of Time in Event Detection
Events are occurrences in a certain period. Time detection is crucial in various NLP and
IR applications: retrieving information about an event that happened at a specific time
requires temporal entities. Time plays an important role in retrieving exact information,
saving the user's time and machine resources, e.g., processing power. In the case of natural
disasters, terrorist attacks, and acute accidents, temporal information can be useful to
predict the start of a rescue operation, analyze the risk of injuries, estimate losses, and
project the expected number of casualties for a certain duration. Identifying events from
social media in a specific time interval relies on temporal information (Kamila et al.,
2018). Detecting temporal entities in textual data helps to construct a timeline of events,
order them, and classify them as retrospective or real-time events (Li et al., 2017).
1.8 Research Motivation
Although very useful information can be extracted from social networks, extracting
events and temporal information is a demanding and challenging task. Temporal
information is also crucial for ordering the sequence of events and distinguishing
retrospective from real-time events. Appreciable research work on information extraction
from textual data exists for non-cursive languages, e.g., English, German, French, and
Japanese (Nadeau & Sekine, 2007), but very little research exists for other languages
(Riaz, 2008). Urdu is one of the languages heavily used on the web, yet it has minimal
processing resources (Malik & Sarwar, 2016). It has more than 100 million users in the
world (Naz et al., 2013).
Generally, it can be observed that research on conversation prediction and monitoring
has considerable reference work for the English language (Konstantinidis et al., 2017).
The pandemic outbreak (COVID-19) severely affected Italy (De Santis et al., 2020),
generating a heated debate on Twitter in the Italian language. Italian keywords, e.g.,
Salvini, Conte, PD, Calcio, and Carcere, were used to collect 1,044,645 tweets to predict
the relevant topics (De Santis et al., 2020). To monitor the sentiment, loyalty, and
behavior of people towards a product, an Italian public broadcasting service was analyzed
using 1,000 Facebook posts (O'Keeffe et al., 2011). Cyberbullying on social media causes
aggression; it is a major (Xu et al., 2012) and national health problem (Limber, n.d.)
badly affecting people psychologically, physically, and academically (Al-Garadi et al.,
2019). An aggression-detection system was designed to predict and monitor cyberbullying
(Somooro, 2019).
During the 2018 general election of Pakistan, the status of the Urdu language on Twitter
was analyzed (Jaidka et al., 2019). The work highlighted the occasionally rapid use of the
Urdu language on social media, i.e., Twitter. A lot of Urdu text was generated to promote
the election campaigns of political parties. The purpose of the system was to predict
sentiment regarding the general election (Jaidka et al., 2019). Sentiment information has
also been mined from tweets to predict election outcomes in Pakistan, India, and Malaysia
(Jaidka et al., 2019).
To narrow down our research problem, we decided to choose one of the 187 languages,
i.e., Urdu, for conversation prediction and monitoring. The Urdu language has a complex
writing script, a right-to-left writing style, and free word order. It is one of the resource-
poor languages (Chowdhury et al., 2013). To the best of our knowledge, no event
extraction, event classification, or conversation prediction and monitoring system exists
for Urdu language text. Instead of predicting the sentiment of a specific group of people,
or feedback about a product, personality, or policy, we decided to extract and classify
events and to predict and monitor the general and specific conversations of all types of
users. The events are classified into twelve broad categories, i.e., sports, inflation,
terrorist attack, murder, death, sexual assault, fraud and corruption, weather, earthquake,
business, politics, and showbiz.
Urdu language text has a considerable volume on social media and news websites. It
contains invaluable information that is essential for developing many NLP applications.
To the best of our knowledge, no reference research work yet exists for event extraction
in the Urdu language. In this thesis, we propose to extract events and the temporal
information related to them from Urdu textual data.
1.9 Research Problem
Nowadays, the local languages of the most populous countries have created their own
interaction spaces on social media, mobile devices, and news websites. The use of Urdu
language text on these platforms is growing rapidly, and invaluable information can be
extracted from these data sources. Event classification and temporal entity extraction
from text written in the Urdu script are the major problems of our research work.
In our research work, we accomplish three main tasks related to the Urdu language:
1) event extraction: extracting event information from input data; 2) event classification:
assigning a predefined label to the input, e.g., public protest, sports, terrorist attack,
inflation, murder, or death; and 3) extracting the temporal information associated with
events to identify them as retrospective (old) or real-time (new) events in the Urdu script.
Figure 1.11: A generic application diagram of our proposed system
1.10 Research Objectives
The objectives of our research are given here:
1 To propose an approach to identify and classify events in Urdu language text,
2 To propose an approach to classify events in Urdu language text into different
types, i.e., sports, politics, and protest,
3 To propose an approach for extracting temporal entities from Urdu language text,
4 To propose an approach to segregate events into real-time and retrospective by
using extracted temporal entities,
5 To develop an event-based Urdu text data collection.
1.11 Thesis Organization
The thesis is organized into different chapters to explain the research work. A detailed
summary of the chapters and related information is given below:
Chapter 1
In this chapter, a comprehensive introduction of the research work is presented that covers
the importance and scope of our work.
Chapter 2
The research work related to our research problem is given in the second chapter.
Chapter 3
The detail of the dataset is given in chapter 3.
Chapter 4
The detail of the methodology and experimental results are given in chapter 4.
Chapter 5
All the detail about temporal entities and an overall discussion of the research work is
given in chapter 5.
Chapter 6
The research work is concluded in this chapter, followed by a discussion of future work.
CHAPTER 2
BACKGROUND AND RELATED
WORK
In this chapter, we present a comprehensive review of the literature. The related work is
presented in two sections. The first section reports event detection and classification
work that already exists for other languages, while the second section reviews the
literature on temporal entity extraction.
2.1 Event Detection and Classification
Initially, event detection was used in the biomedical domain, i.e., to extract gene and
protein entities (Hogenboom et al., 2016). Event extraction from textual data found on the
internet, supported by the Advanced Research Projects Agency (ARPA), originated in the
late 1980s for message understanding and the automatic detection of terrorism-related
text in newswire (Hogenboom et al., 2016). Event detection has since enlarged its scope
from gene expression to protein expression, i.e., finding events related to gene and
protein entities (Yakushiji et al., 2001). Event detection is not confined to the biomedical
domain but is also used in other domains. To remain updated on the latest events
occurring on social media (Ritter et al., 2015), weakly supervised approaches are used:
seed examples of candidate events are given as input to the system, and new events are
then categorized. Computer security events, i.e., denial of service, data breaches, and
data hijacking, are reported in that work.
Identifying events and event locations has been performed using keywords and
contextual information (Bahir & Peled, 2016). The analysis showed that the name of the
event location was used in many instances of the textual messages. Event location
detection is very important; for example, in the case of disastrous events, i.e., fires,
earthquakes, or typhoons, the rescue team needs to know the "location" of the event. The
Edinburgh Twitter corpus (Petrovic et al., 2010), consisting of 97 million tweets, has
been used for retrospective event extraction; (Li et al., 2017) used a temporal module to
differentiate between clusters of real-time (new) and retrospective (old) events. Jyoti et
al. (J. P. Singh et al., 2019) developed a neural-network-based system to classify events
to help people in natural disasters such as floods by analyzing tweets. A Markov model
was used to classify tweets and predict locations, showing 81% accuracy for classifying
a tweet as a request for help and 87% accuracy for locating the event. Research work has
been conducted on life event detection and classification, i.e., marriage, birthday,
traveling, etc., to anticipate the products and services that would facilitate people
(Cavalin Rodrigo Paulo, 2016). Data about life events exists in very small amounts.
Linear regression, naïve Bayes, and nearest-neighbor algorithms were evaluated on the
original, very small dataset but did not show favorable results. Oversampling the training
dataset greatly affected the performance of the algorithms, among which linear
regression outperformed the others and showed considerable results.
An exhaustive review of the literature focused our findings on three general approaches
by which events can be detected, i.e., data-driven, knowledge-driven, and hybrid (Allen,
1983). In general, classification is performed based on hand-crafted rules, i.e., pattern
matching by regular expressions, hand-crafted rules using lexical features (full-length
words, part-of-speech tags), syntactic features (dependency parses), and external
knowledge features (WordNet). A knowledge-based system developed by (Ferro et al.,
2005) was applied to Arabic tweets using an unsupervised rule-based technique. It
focused on extracting three parameters related to events, i.e., trigger, time, and
identification; the proposed system achieved accuracies of 75.9%, 87.5%, and 97.7%,
respectively. TWICAL, an open-domain event extractor for Twitter, was developed to
extract events from tweets, a noisy source of information (Filannino & Nenadic, 2015).
The French TimeBank corpus was developed by André Bittar et al. (Bittar et al., 2011) to
extract times, events, and the relations between them. Cross-language annotation and
French-language guidelines were specified to improve ISO-TimeML. An improvement
to the modality-capturing system was made after analysis of French text revealed that
modality is expressed using inflected verbs. A set of normalized values for the modality
attribute, i.e., necessity, possibility, obligation, and permission, was provided in the
manual annotation context. Another contribution was a way to capture the difference
between the neutral and inchoative aspect values of support-verb constructions. Finally,
a new type of event class, i.e., Event-Container, was introduced to distinguish predicates
that take an event nominal as subject (Bittar et al., 2011). Some correspondences were
made between English and French grammar; the imperfect morphological tense of
French does not exist in English. The tools of (Caselli & Sprugnoli, 2017) were used to
process and evaluate contents consisting of 61,000 tokens.
Table 2.1: TempEx tagger results for an exact match on tag span and value (Bittar et al., 2011).

Task    System   Precision  Recall  F1-Measure
Match   TempEx   84.2       81.8    83.0
        DEDO     83.0       79.0    81.0
Value   TempEx   55.0       44.9    49.4
        DEDO     56.0       45.0    50.0
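As a side note, the F1-Measure column in Table 2.1 is simply the harmonic mean of precision and recall, which can be checked directly:

```python
# Reproduce the TempEx "Match" row of Table 2.1: F1 is the harmonic mean
# of precision and recall.
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(84.2, 81.8), 1))  # 83.0
```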
TempEval-2 released TimeML-annotated data for Chinese, English, French, Italian,
Korean, and Spanish (Costa & Branco, 2012; Caselli & Sprugnoli, 2017). A Portuguese
dataset consisting of 70,000 words annotated in the TimeML language was developed by
(Costa & Branco, 2012) and used to extract events and the temporal information related
to them.
Table 2.2: Performance of classifiers on the Portuguese dataset (Costa & Branco, 2012)

                          F-Measure
Group   Algorithm       English  Portuguese
A       KStar           0.59     0.58
        Baseline        0.57     0.59
B       Decision Table  0.73     0.77
        Baseline        0.56     0.56
C       SMO             0.54     0.54
        Baseline        0.47     0.47
A corpus based on ISO-TimeML annotation was developed for the Persian language to
extract events and temporal information. (Yaghoobzadeh et al., 2012) made the first
attempt for the Persian language and annotated 4,237 events in 30,000 sentences.
• Multiple tokens located at different positions in the same sentence can be
marked as one event, e.g., Bârân (rain) be (to) mantaq-e (area) sadame-e
(damage) zyâdê (large) khâhad zad (will do).
Translation: The rain will largely damage the area.
Part of the PersTimeML output would be: <Event xml:id="e1"
target="#token3#token5" text="sadame-e khâhad zad" … />
• Some changes were made to event attributes, the values of these attributes,
annotation rules, and event extents.
• The PersTimeML corpus specifies the annotation of generics as events.
• Gerund phrases are also annotated as events, even when they represent
generic events.
• Objective deverbal adjectives in PersTimeML are adjectives derived
from the passive modes of verbs.
• Compound words consisting of a non-verbal element and a light verb are
marked as events.
Table 2.3: Evaluation Results for PET for Event Recognition (Yaghoobzadeh et al., 2012)

            Rule-based                       Learning-based
Category    Precision  Recall  F-Measure    Precision  Recall  F-Measure
All         78.9       72.5    75.6         79.2       87.5    83.1
Verb        96.5       99.3    97.9         97.1       99.5    98.3
Noun        66.3       64.4    65.3         82.1       81.8    77.3
Adjective   88.5       55.8    68.4         78.3       76.4    77.3
Using TimeML for non-English languages has coined two approaches:
• modification of the annotation scheme, starting from automatic porting of an
existing annotated English corpus to other languages,
• design of language-specific annotation specifications and the corresponding
annotated resources from scratch (Caselli & Sprugnoli, 2017).
In the past, researchers were indifferent to the Urdu language because of its limited
processing resources, i.e., datasets, annotators, part-of-speech (PoS) taggers, and
translators (A. R. Ali & Ijaz, 2009). However, in the last few years, feature-based
classification of Urdu text documents has begun to use machine learning models (Daud
et al., 2017; Mehmood et al., 2019; Ahmed et al., 2016). A framework was proposed
(Zia et al., 2015) to classify Chinese short texts into seven kinds of emotion and product
reviews. The event-level information of a sentence from the text and contextual
information from external sources (lexicons, knowledge bases) are provided as
supplementary supporting material to the neural models.
A fusion of CNN and RNN models was used to classify sentences using a movie review
dataset and achieved 93% accuracy (Abdlrauf, 2017). Urdu text classification at the
document level is presented in (Zhou et al., 2018), which shows a comparative analysis
of machine learning and deep learning models. CNN and RNN single-layer/multilayer
architectures were used to evaluate three different sizes of dataset (Liu & Guo, 2019).
The idea was to analyze and predict product quality using customer feedback,
categorizing the feedback as Valuable, Not Valuable, Relevant, Irrelevant, Bad, Good,
or Very Good (Y. Zhang, 2012).
Different datasets are reported in the state of the art, i.e., Northwestern Polytechnical
University Urdu (NPUU), which consists of 10K news articles labeled into six classes;
the Naïve dataset, comprising 5,003 news articles in five classes (Zia et al., 2015); and
COUNTER, with 1,200 news articles and five classes (Y. Zhang, 2012). A joint
framework consisting of CNN and RNN layers was used for sentiment analysis (A. R.
Ali & Ijaz, 2009). Two datasets, the Stanford movie review and the Stanford Treebank
datasets, were used to evaluate the designed system, which showed 93.3% and 89.2%
accuracy, respectively.
In (A. R. Ali & Ijaz, 2009), the authors performed supervised text classification of the
Urdu language using statistical approaches, i.e., Support Vector Machine (SVM) and
naïve Bayes. The classification was initiated by applying different preprocessing
approaches, namely stemming, stop-word removal, and both stop-word elimination and
stemming together. The experimental results showed that the stemming process had little
impact on performance improvement; on the other hand, the elimination of stop words
had a positive effect on results. The SVM outperformed naïve Bayes, achieving
classification accuracies of 89.53% and 93.34% with polynomial and radial basis
function kernels, respectively.
Similarly, SVM has also been applied to news headline classification in Urdu text
(Usman et al., 2016), showing a very low accuracy improvement of 3.5%. News
headlines are small pieces of information that frequently do not describe the contextual
meaning of the contents. In (Usman et al., 2016), a majority-voting algorithm used for
text classification in the Urdu language showed 94% accuracy. The classification was
performed on seven different types of news text; however, the number of instances was
very limited. A dynamic convolutional neural network (Kalchbrenner et al., 2014) was
designed to model the sentiment of sentences. It consists of dynamic k-max pooling and
global pooling over a linear sequence and performs multi-class sentiment classification.
A quite different task was performed in (Awais & Shoaib, 2019), in which the authors
used a hybrid of rule-based and machine-learning techniques to perform sentiment
classification while analyzing Urdu script at the phrase level. The performance of the
system was 31.25%, 8.46%, and 21.6% for recall, precision, and accuracy, respectively.
In (J. P. Singh et al., 2019), the limitations of the traditional BOW and n-gram features
were tackled by using a variant of RNN known as Long Short-Term Memory (LSTM).
A multiple minimal reduct extraction algorithm was designed (Al-Radaideh & Al-Abrat,
2019) by improving the quick reduct algorithm. The multiple reducts are used to generate
the set of classification rules that represent the rough set classifier. A corpus of 2,700
Arabic text documents was evaluated using multiple and single reducts; the proposed
system showed 94% and 86% accuracy, respectively. Experimental results also showed
that both the k-NN and J48 algorithms performed well in classification accuracy on the
dataset at hand. Table 2.4 depicts a summary of the related research discussed.
Table 2.4: Summary of the Related Research Work

Paper Reference                 Classifier Used         Dataset                             Accuracy
(Hassan, 2018)                  CNN and RNN             Movie reviews                       92%
(Zia et al., 2015)              CNN and RNN             1. Stanford movie review dataset    93.3% and 89.2%
                                                        2. Stanford Treebank dataset
(A. R. Ali & Ijaz, 2009)        Naïve Bayes and SVM     Corpus of Urdu documents            89.53% and 93.34%
(Usman et al., 2016)            Dynamic neural network  News articles                       96.5%
(Awais & Shoaib, 2019)          Rule-based modeling     Urdu corpus of news headlines       31.25%
(J. P. Singh et al., 2019)      LSTM                    Tweets                              81.00%
(Al-Radaideh & Al-Abrat, 2019)  k-NN and J48            Arabic corpus of 2700 documents     95% and 86%
2.2 Existing Methodologies
Three broad approaches to temporal entity and event extraction are the data-driven,
knowledge-driven, and hybrid approaches (Allen, 1983).
2.2.1 Data-Driven Approach
Statistical, machine learning, and linear algebra methods are used in this approach. It
requires a huge volume of corpora. It does not consider the semantics of words, i.e., the
meanings of words, while discovering relations in the dataset. It is helpful for developing
language-independent event detection systems (Allen, 1983). Classification, clustering,
and regression are common types of machine learning approaches. Deep learning, an
emerging innovation in machine learning, is another approach for NLP.
2.2.1.1 Different Techniques/Methods in the Data-Driven Approach (Allen, 1983)
• Word Frequency Count,
• Ranking by mean of TF-IDF,
• Word Sense Disambiguation,
• N-grams,
• Clustering,
• Hierarchical Clustering,
• Weighted undirected bipartite Graph and Clustering.
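To make the second technique concrete, the following sketch ranks terms by TF-IDF over a toy corpus; the documents are hypothetical English stand-ins for Urdu text:

```python
import math
from collections import Counter

# Toy corpus for illustration: terms frequent in one document but rare
# across the corpus get the highest TF-IDF weight.
docs = [
    "flood rescue operation started in the city",
    "cricket match started in the city stadium",
    "flood damaged the city roads",
]

def tf_idf(term: str, doc: str, corpus: list[str]) -> float:
    """Term frequency times inverse document frequency (natural log)."""
    tf = Counter(doc.split())[term]
    df = sum(1 for d in corpus if term in d.split())
    return tf * math.log(len(corpus) / df) if df else 0.0

# "flood" appears in 2 of 3 documents, "rescue" in only 1, so "rescue"
# receives the larger weight in the first document.
print(tf_idf("rescue", docs[0], docs) > tf_idf("flood", docs[0], docs))  # True
```

Real systems vary in the exact TF and IDF formulas (smoothing, normalization), but the ranking idea is the same.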
2.2.2 Knowledge-Driven Approach
It is a pattern-based approach. Patterns help design rules for event extraction from
textual data. There are two types of linguistic patterns, i.e., lexico-syntactic and lexico-
semantic. Lexico-syntactic patterns use grammatical features, i.e., part of speech and
tense, while lexico-semantic patterns use the contextual meanings of words. Regular
expressions are used to combine lexical and syntactic patterns. In personal blogs,
experience events were extracted using three words, i.e., Place, Object, and Verb, that
together represent the event. Semantic information is used to find patterns about the
event; semantics are added by using gazetteers or an ontology (Allen, 1983).
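A lexico-syntactic pattern of the kind described above can be sketched as a regular expression over word/TAG pairs; the tagged sentence and tag set here are hypothetical, since a real system would rely on an Urdu PoS tagger:

```python
import re

# Hypothetical PoS-tagged sentence (word/TAG format) for illustration.
tagged = "protesters/NN blocked/VB the/DT main/JJ road/NN"

# Lexico-syntactic pattern: a noun immediately followed by a verb,
# combining lexical content with grammatical (PoS) structure.
pattern = re.compile(r"(\w+)/NN\s+(\w+)/VB")

match = pattern.search(tagged)
if match:
    obj, action = match.groups()
    print(obj, action)  # protesters blocked
```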
2.2.3 Hybrid Approach
The hybrid approach combines the properties of both the data-driven and knowledge-
driven approaches. It is highly suitable when we have a small amount of data and limited
command of the target language (Allen, 1983).
2.3 Temporal Entity Extraction
An early initiative in named entity recognition was taken by (Woodward, 2001), which
extracted 'company' names by using heuristic, hand-crafted rules; the rules were designed
by humans learning patterns in the contents. Language is a big factor in textual data
analysis: a good portion of the research has been done for the English language, but
most researchers highlight multilingual and language-independent approaches, owing to
the heterogeneous and unstructured nature of data (Nadeau & Sekine, 2007).
To quickly convey the idea of contents, a tool was developed (Nadeau & Sekine, 2007)
for learning about people, places, things, and events. Information is provided visually by
extracting entities and events from textual data. A historical collection of American Civil
War news articles published on Wikipedia, incorporating all named entities with proper
times, was analyzed with the Stanford NLP CRF, which achieved a 79.1% F-measure.
Change, causality, and actions are defined in terms of time, and many artificial
intelligence applications require time and reasoning about time; e.g., to answer a "when"
query, the system needs to anchor events. Similarly, "how long" questions require event
durations to respond properly. An approach was developed (Llidó et al., 2001) to
automatically assign a document event-time by extracting temporal expressions from
the text. It helps to retrieve related documents based on temporal values and to find the
relationships between them.
(Hao et al., 2018) designed a novel method, TEER, to extract and normalize temporal
expressions from heterogeneous clinical text. They used heuristic rules, summarization,
and automatic pattern learning. The developed system was evaluated on two datasets,
i.e., English and Chinese clinical text consisting of 400 English and 1,459 Chinese
discharge summaries. Precision and recall were 0.948 and 0.877 for English, and 0.941
and 0.932 for Chinese, respectively. A sequencer system was developed for the analysis
of temporal entities (Walenz et al., 2010) existing in news articles and user-generated
unstructured contents. It is based on crawling, clustering, extracting, and visualizing.
WordNet-based features were used in the CRF model. Many annotation schemes, i.e.,
PoS tagging, partial parsing, semantic interpretation, case frame instantiation, and
discourse analysis, have been used to extract temporal expressions from textual data
(Ferro et al., 2005). Another system was developed to extract fluent information that is
valuable for a certain period; the authors claimed that many proposed systems focus on
static information, while temporal expressions predominate in newswire text and
Wikipedia (Ling & Weld, 2010). The precision and recall of temporal information
extraction ranged from 0.50 to 0.99. In general, temporal expression identification is
performed by machine learning approaches based on lexical and morphological features
(Ahn et al., 2005); Support Vector Machines and Conditional Random Fields (CRF)
give considerable results for non-cursive languages (John et al., 2001; Khan et al.,
2016). A system was also designed to monitor events being reported on social media
during a specific time (Huang et al., 2018).
A considerable volume of research work exists for non-cursive languages, especially
English, French, German, Dutch, and Spanish (Nadeau & Sekine, 2007), which has
achieved noticeable accuracy and enabled mature artificial intelligence applications.
For Urdu, no exhaustive research work on temporal entities exists in the literature. In 2008, the International Joint Conference on Natural Language Processing (IJCNLP) proposed a set of 12 named entities for South Asian languages, including a temporal entity that treats date and time as a single entity (Liao & Veeramachaneni, 2009). A rule-based approach focused on date and time tags was adopted in (U. P. Singh et al., 2012). It used regular expressions (RE) to extract specific date patterns, e.g., 01.08.2015 or 01/01/2014; the same system can also identify dates such as May 01, 2018 and achieved a 90.83% F1-measure. To the best of our knowledge, however, there is no detailed discussion of the different types and formats of dates in the Urdu language. The lack of resources, i.e., lexicons, gazetteers, and datasets, is the main factor pushing researchers toward rule-based approaches. In (Riaz, 2010), a generic named entity recognition system used a rule-based approach to extract named entities, including dates, from Urdu text. It achieved a considerable F1-measure for specific patterns, e.g., '1996', but again gives no detail about the types and formats of dates in Urdu. The Center for Language Engineering (CLE) works on the Urdu language and offers various datasets on its website at affordable prices. CLE has also developed a part-of-speech (PoS) tagger that is offered as an online service; it tags 100 words per attempt for free, and full access for further processing can be granted on request. The CLE system, however, does not handle temporal entities (TE). The same website also provides, for a small charge, a compact WordNet whose data is stored in UTF-8 format. Over the last few decades, cursive languages have grown in popularity and attracted researchers developing NLP applications. Temporal data in Urdu has been introduced at a very basic level in various national and international research papers, but no significant work has yet addressed the Urdu temporal entity 'date' and its extraction. To the best of our knowledge, we are the first to work on the temporal entity 'date' in the cursive language Urdu.
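The numeric date patterns mentioned above can be matched with a short regular expression. The sketch below is illustrative only, not the actual system of U. P. Singh et al.; the pattern and function names are our own.

```python
import re

# One or two digits for day and month, four for the year, with '.' or '/'
# as the separator; the backreference \2 forces a consistent separator.
NUMERIC_DATE = re.compile(r"\b(\d{1,2})([./])(\d{1,2})\2(\d{4})\b")

def extract_numeric_dates(text):
    """Return every numeric date string found in `text`."""
    return [m.group(0) for m in NUMERIC_DATE.finditer(text)]

print(extract_numeric_dates("match held on 01.08.2015, replayed on 01/01/2014"))
```

A fuller system would add patterns for month-name dates (e.g., May 01, 2018) and their Urdu counterparts.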
A dataset for Urdu language processing (Khan Wahab, Daud Ali, Nassir A Jamal, 2016) is publicly available for researchers. It is named Urdu Named Entity Recognition (UNER) and was developed specifically for named entity extraction from Urdu text. The dataset reports 12 different types of named entities, including temporal entities; 206 tags in the dataset represent temporal entities. Existing tools designed for English are incompatible with Urdu language processing, which highlights the need for new methods and approaches to process Urdu text. For Arabic, temporal entities have been extracted using morphological analysis and finite state transducers. That system identified 12 temporal morphological categories and augmented the Arabic lexicon with 550 additional tags. It achieved 94.6% recall and 84.2% precision for temporal entity detection; analysis of the dataset further showed 89.7% recall and 90.8% precision for temporal entity boundary detection (Zaraket & Makhlouta, 2012).
In the past, many cursive languages such as Urdu, Arabic, Persian, and Hindi were neglected by researchers because of a lack of resources (Malik, M. K. and Sarwar, 2016b). Only a few cursive languages were publicly known, owing to a lack of interest, inconvenience in processing, and the unavailability of resources, i.e., lexicons, databases, dictionaries, annotation schemes, and datasets (Riaz, 2008). To develop generic NLP applications, it is now necessary to bring cursive languages into the research stream. A temporal-entity-based module was used to extract clusters of real-time and retrospective (old) events (Li et al., 2017); temporal information is essential to distinguish recent events from older ones. An approach was developed (Llidó et al., 2001) to automatically assign a document's event time by extracting temporal expressions from the text, which helped to retrieve related documents based on temporal values and to find the relationships between them.
Summary
We have discussed the research work that exists on event detection, event classification, and temporal entity extraction for different languages, i.e., English, Arabic, Hindi, Persian, etc. We have also tried our best to cover the related work on the Urdu language. The details of the existing approaches used for event and temporal entity extraction have also been described in this chapter.
CHAPTER 3
DATASETS
The purpose of our research work is to classify multiple types of events occurring within a specific time duration. Extracting temporal entities is also very important for constructing event timelines. To achieve these objectives, we decided to use two different datasets: one for event classification, named "Multiclass Urdu Language Labelled Sentences" (MULLS), and the "Urdu Named Entity Recognition (UNER)" dataset (Khan Wahab, Daud Ali, Nassir A Jamal, 2016) for temporal entity extraction.
1. Multiclass Urdu Language Labelled Sentences (MULLS)
2. Urdu Named Entity Recognition (UNER)
A dataset plays a vital role in achieving research goals. We prepared our dataset in comma-separated value (CSV) format. Dataset preparation consists of various phases; the complete life cycle is presented in figure 3.1.
Figure 3.1: Dataset Life Cycle
(The life cycle comprises Phase I: data collection, assigning labels, paragraph splitting, and concatenating other information; Phase II: pre-processing, i.e., stop word elimination, data cleaning, tokenization, and maximum/minimum length filtering; Phase III: dataset distribution for machine learning and deep learning.)
The details of both datasets are given in the following sections.
Phase I
The initial phase of data collection consists of several important steps, each explained in detail in the following sections.
3.1 Multiclass Urdu Language Labelled Sentences (MULLS)
The MULLS corpus is used in our research work for multiclass event extraction and classification. We have named it Multiclass Urdu Language Labelled Sentences (MULLS). The details of the different steps performed during dataset preparation are given in the subsections below.
3.1.1 Data Collection
In the literature, various datasets are reported, but none of them is specific to event classification (Sharjeel et al., 2017; Zia et al., 2015). We therefore created a larger dataset specifically for event classification. Instead of focusing on product-specific analysis (Akhter et al., 2020) or phrase-level sentiment analysis (Awais & Shoaib, 2019), we decided to classify sentences into multiple event classes. Likewise, instead of using a joint CNN and RNN framework for sentiment analysis (A. R. Ali & Ijaz, 2009), we evaluated the performance of deep learning models and popular machine learning classification models for multiclass event classification.
Data is the core element of the research phase, so selecting the data source is a very sensitive decision; authentic, reliable, and popular data sources should be chosen. Nowadays, social networks and news websites are very popular sources of information. We decided to collect Urdu text from several sources rather than a single one, in order to develop a more generic system. In our case, we collected data from both social networks (Twitter and Facebook) and news websites (the Geo News website and Urdu Point).
A Python (version 3.6) crawler script was used to retrieve data from Twitter and Facebook using event-related keywords (دھماکہ، کھیل، بارش، موت، جلسہ اور حکومت وغیرہ, i.e., blast, sports, rain, death, rally, government, etc.). We spent a couple of weeks and crawled 26,000 posts for a specific period from 2017 to 2018 (25-06-2017 to 15-11-2018). The collection comprises twenty-two (22) classes of events, i.e., sports, inflation, murder, terrorist attack, death, accident, politics, education, showbiz, interesting and strange, government-official, earthquake, fraud and corruption, religious, weather, science and technology, international, business, health, law and order, sexual assault, and others.
While collecting data, a PHP-based web scraper was also written to crawl data from popular news websites, i.e., the Geo News website4, BBC Urdu5, and Urdu Point6. A complete post is retrieved from each website and stored in MariaDB (a database); it consists of a title, body, published date, location, and URL. Sample text from two languages of the South Asian countries, i.e., Urdu on Twitter and Hindi on Facebook, is shown in figure 3.2.
Figure 3.2: Urdu and Hindi Language Text on Social Media
There are 0.15 million (150,000) Urdu language sentences. The diversity of the data sources helped us develop a multiclass dataset; it covers twelve types of events, and subsets of the dataset can also be useful to other researchers. As described earlier, our task is to classify events at the sentence level rather than classifying whole documents. Our dataset contains 0.15 million instances (sentences), all labelled with twelve (12) different types of events. The details of each event type and its total number of instances are shown in figure 3.4.
4 https://urdu.geo.tv/
5 https://www.bbc.com/urdu
6 https://www.urdupoint.com/daily/
Phase II
3.1.2 Pre-processing
In the first phase of dataset preparation, we performed some pre-processing steps, i.e., noise removal and sentence annotation/labelling. All non-Urdu words and sentences, hyperlinks, URLs, and special symbols were removed from the dataset; cleaning the dataset was necessary in order to annotate/label the sentences properly. These initial steps prepare the corpus for the machine learning algorithms, because textual data cannot be processed directly by machine learning classifiers and also contains many irrelevant words. The details of all the pre-processing steps followed in our research problem are given below.
3.1.2.1 Post Splitting
The PHP crawler extracted the body of each post, which comprises many sentences as a paragraph. In Urdu script, sentences end with the full stop sign "۔", called khatma (ختمہ). It is the standard punctuation mark that marks the end of a sentence in Urdu. As mentioned earlier, we perform event classification at the sentence level, so we split the paragraphs of every post into sentences: every span of text ending at a khatma is taken as a single sentence.
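The splitting step can be sketched in a few lines of Python. The function name and sample paragraph are illustrative, but the split character is the actual Urdu full stop (U+06D4).

```python
URDU_FULL_STOP = "\u06d4"  # the khatma sign '۔'

def split_sentences(paragraph):
    """Split an Urdu paragraph into sentences at the khatma, dropping empties."""
    return [s.strip() for s in paragraph.split(URDU_FULL_STOP) if s.strip()]

paragraph = "بارش ہوئی\u06d4 میچ ملتوی کر دیا گیا\u06d4"
print(split_sentences(paragraph))  # two sentences
```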
3.1.2.2 Stop Words Elimination
Generally, words that occur very frequently in a text corpus are considered stop words. These words barely affect a classifier's performance. Punctuation marks ("!", "@", "#", etc.) and frequent Urdu function words (کا، کے، کی وغیرہ, etc.) are common examples of stop words. All stop words (Capet et al., 2008) that play no influential role in event classification for Urdu text were eliminated from the corpus. Stop word elimination reduces memory and processing utilization and makes processing more efficient. A list of standard stop words of the Urdu language is available here7.
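A minimal sketch of the elimination step, assuming a small illustrative stop word set rather than the full published list:

```python
# A handful of illustrative Urdu stop words; in practice the full standard
# list (e.g. the Kaggle list referenced above) would be loaded from a file.
STOP_WORDS = {"کا", "کے", "کی", "نے", "سے", "اور"}

def remove_stop_words(tokens):
    """Drop every token that appears in the stop word set."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["بارش", "کی", "وجہ", "سے", "میچ", "ملتوی"]))
```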
3.1.2.3 Noise Removal
Our data was collected from different sources (see section 3.1), so it contains many noisy elements, i.e., multilingual words, links, mathematical characters, special symbols, etc. In the collected corpus we found many multilingual sentences in the posts. To make our corpus clean and ready for further processing, we removed those sentences, irrelevant links, and special characters from the corpus.
7 https://www.kaggle.com/rtatman/urdu-stopwords-list
3.1.2.4 Filtering Sentences
The nature of our problem required us to define a limit on the number of words per sentence. Because of the multiple types of events, it is hard to find sentences of the same length, and we wanted to keep as many sentences as possible in our corpus; only the very short and very long sentences were removed. Our observation showed that sentence lengths vary from 5 words to 250 words. To limit our research problem and the consumption of processing resources, we decided to keep sentences of 5 to 150 words.
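The 5-150 word filter can be expressed as a simple predicate over whitespace tokens; the helper name and the sample sentences below are our own.

```python
def within_length_bounds(sentence, lo=5, hi=150):
    """Keep a sentence only if its whitespace token count lies in [lo, hi]."""
    return lo <= len(sentence.split()) <= hi

sentences = ["بہت چھوٹا جملہ",
             "طوفانی بارش کی وجہ سے کئی گھروں کی چھت گر گئی"]
kept = [s for s in sentences if within_length_bounds(s)]
print(len(kept))  # the 3-word sentence is dropped
```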
3.1.3 Annotation Guidelines
• Go through each sentence and assign a class label.
• Remove ambiguous sentences.
• Merge related classes into a single class, e.g., accident, murder, and death.
• Assign one of the twelve event types, i.e., Sports, Inflation, Murder and Death, Terrorist Attack, Politics, Law and Order, Earthquake, Showbiz, Fraud and Corruption, Weather, Sexual Assault, or Business, to each sentence.
Sentence annotation is an imperative and attention-demanding task in our research project. It was very exhaustive and time-consuming, taking a couple of weeks to assign a label to each sentence. To annotate the dataset, two language experts holding an M.Phil. in Urdu were engaged. They read and analyzed the dataset sentence by sentence before assigning event labels. They recommended removing 46,035 sentences from the dataset because those sentences did not contain information useful for event classification. After annotation, the dataset was thus reduced to 103,965 imbalanced instances of twelve different types of events. The inter-annotator agreement, i.e., Cohen's kappa score, is 0.93, which indicates strong agreement between the two language experts; by this score the annotated dataset is almost perfect.
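Cohen's kappa can be computed with scikit-learn's `cohen_kappa_score`. The annotator labels below are dummy values for illustration, not the actual annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators over six sentences
# (1 = Sports, 3 = Murder and Death, 5 = Politics).
annotator_a = [1, 5, 5, 3, 1, 3]
annotator_b = [1, 5, 5, 3, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(round(kappa, 3))
```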
A few examples of labelled sentences are given in the table below:
Table 3.1: Urdu Labelled Sentences
Example no. | Title | Sentence | Class/Label
1 | کھیل | انگلینڈ اور نیوزی لینڈ کے درمیان کرکٹ ورلڈ کپ کا یادگار میچ ہوا ہے۔ | Sports
2 | سیاست | عمران خان نے جلسوں کی مدد سے لوگوں میں سیاسی شعور پیدا کیا۔ | Politics
3 | موت | ڈینگی مچھر کے کاٹنے کی وجہ سے دس افراد لقمہ اجل بن گئے۔ | Death
4 | مہنگائی | خوردونوش اشیا کی قیمتوں میں اضافہ نے غریب عوام کی کمر توڑ دی۔ | Inflation
5 | دھماکہ | کوئٹہ کے نواحی علاقہ میں خود کش دھماکہ، متعدد افراد جاں بحق ہوئے۔ | Terrorist Attack
After data cleaning and stop word removal, every sentence is tokenized into words based on white space. An example of sentence tokenization is given in table 3.2.
Table 3.2: Sentence Tokenization
Sentence | Tokenized sentence
کرونا وائرس نے متعدد لوگوں کی جان لے لی۔ | کرونا وائرس متعدد لوگوں جان لے لی
طوفانی بارش سے کئی گھروں کی چھت گر گئی۔ | طوفانی بارش کئی گھروں چھت گر گئی
The previous pre-processing steps revealed that sentences vary greatly in length: some were very short and many very long. We decided to define length boundaries for the tokenized sentences. Sentences in the dataset range from 5 to 250 words; we selected those of 5 to 150 words. An integer value was then assigned to the event type of every selected sentence. The event types and their corresponding numeric (integer) values used in the dataset are given in the table below.
Table 3.3: Class Labels
Event | Label | Event | Label
Sports | 1 | Earthquake | 7
Inflation | 2 | Showbiz | 8
Murder and Death | 3 | Fraud and Corruption | 9
Terrorist Attack | 4 | Rain/Weather | 10
Politics | 5 | Sexual Assault | 11
Law and Order | 6 | Business | 12
Figure 3.3 shows a few instances of the dataset after pre-processing. It is a comma-separated value (CSV) file consisting of two fields, i.e., sentence and label.
Figure 3.3: Instances of the Pre-Processed Dataset
In our dataset, three event types have a large number of instances, i.e., sports (18,746), politics (33,421), and fraud and corruption (10,078), while three others have comparatively few, i.e., sexual assault (2,916), inflation (3,196), and earthquake (3,238). The remaining event types show smaller differences in instance counts among themselves. There are 51,814 unique words, and the total number of tokens is 2,079,967. A summary of the dataset is given in the table below.
Table 3.4: Summary of the Dataset
Event types | Total tokens | Unique tokens | Total sentences | Labelled sentences
12 | 2,079,967 | 51,814 | 150,000 | 103,965
The visualization in figure 3.4 shows that the dataset is imbalanced.
Figure 3.4: Maximum Number of Instances of Each Type of Event
During the second phase of pre-processing, the dataset is converted into a machine-understandable format. The steps performed to convert the text data into numeric format are:
• Count Vectorizer
Machine learning classifiers learn patterns from numeric values. As human beings we can understand the structure of information directly from text, but machines cannot; so, whether a natural language problem is simple or complex, all non-numeric data (text, audio, video, images, graphs, etc.) must be converted into numeric format before being fed to a machine learning classifier. The count vectorizer counts the total number of times each term occurs in the documents. It tokenizes the collection of documents and creates a vocabulary of known words; this vocabulary works like a dictionary for generating feature vectors, and new documents can be encoded using it. The encoded vector returned has the length of the entire vocabulary and contains an integer count of the number of times each word appears in the document. The working of the count vectorizer is illustrated in table 3.5. For example, consider two sentences:
i. Millions of people died of Covid-19.
ii. The fifth war is a psychological game among people.
Table 3.5: Example of the Count Vectorizer
 | Million | War | Fifth | Covid-19 | Died | People | Psychological | Game
Sentence 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0
Sentence 2 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1
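The counting in Table 3.5 corresponds to scikit-learn's `CountVectorizer`. Note that its default tokenizer lowercases, drops one-letter tokens, and splits "Covid-19" into two tokens, so the learned vocabulary differs slightly from the hand-built table.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Millions of people died of Covid-19.",
    "The fifth war is a psychological game among people.",
]

# Learn the vocabulary and encode both documents as term-count vectors.
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))
print(matrix.toarray())
```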
• Term Frequency-Inverse Document Frequency (TF-IDF)
Term frequency refers to the total count of a term t in a document. Text mining, information retrieval, and classification are generally based on weighted tf-idf values. These weighted values are a statistical measure of how important a word/term is within the collection of corpus/dataset documents. The importance increases proportionally to the number of times a word appears in a document, but is offset by the frequency of the word across the corpus. One of the simplest ranking functions is computed by summing the tf-idf of each query term; many more sophisticated ranking functions are variants of this simple model. Tf-idf can also be used successfully for stop-word filtering in various subject fields, including text summarization and classification.
IDF(t) = log_e(Total number of documents / Number of documents with term t in it) (Wu et al., 2008) (1)
TF-IDF = TF × IDF (2)
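Equations (1) and (2) can be checked on a toy corpus; the three "documents" below are invented for illustration (note that library implementations such as scikit-learn use a smoothed variant of the IDF formula).

```python
import math

# Toy corpus of three "documents" (token lists); idf follows equation (1):
# idf(t) = ln(N / df(t)), and tf(t, d) is the raw count of t in d.
docs = [["rain", "match"], ["rain", "rain", "goal"], ["election", "vote"]]
N = len(docs)

def tf(term, doc):
    return doc.count(term)

def idf(term):
    df = sum(1 for d in docs if term in d)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(round(tf_idf("rain", docs[1]), 4))  # 2 * ln(3/2)
```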
3.1.4 Training Dataset
To develop a generic model for multiclass event classification, we divided our dataset into three subsets, i.e., a training dataset, a testing dataset, and a validation dataset. Random distribution of the data was performed using the Python library scikit-learn. We randomly assigned 75% of the dataset for training, giving 77,974 labelled instances across the different event types in our training dataset. This multiclass training dataset is used to train the deep learning models.
3.1.5 Testing/Validation Dataset
To evaluate the performance of our trained models, we used the remaining 25% of the dataset for testing/validation purposes. It consists of 25,991 instances never seen by the trained models; 10% of this testing dataset was used for validation.
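Assuming scikit-learn's `train_test_split` (the library the text refers to), the 75/25 split with a 10% validation slice of the test portion looks like this on dummy data:

```python
from sklearn.model_selection import train_test_split

# Dummy data: 1000 "sentences" with twelve event labels.
X = list(range(1000))
y = [i % 12 for i in X]

# 75% train / 25% test, then 10% of the test portion held out for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(
    X_test, y_test, test_size=0.10, random_state=42)

print(len(X_train), len(X_test), len(X_val))  # 750 225 25
```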
Phase III
Data distribution
We further divided our event classification dataset into two sub-datasets, in order to evaluate the performance of both traditional machine learning classifiers and advanced machine learning (deep learning) classifiers. It can be observed in figure 3.1 that machine learning itself is not used during dataset preparation; rather, in the last stage of the dataset preparation life cycle we created two different datasets from the same corpus, one for the machine learning classifiers and the other for the deep learning classifiers. The main difference between the two is that the dataset used for the deep learning classifiers contains only the sentence, while the dataset used for machine learning contains additional attributes such as title, length, and sentence.
1 Machine Learning
A dataset that includes the other related information (title, location, and date) described above is used to evaluate the traditional machine learning classifiers. It contains the same number of instances as the whole dataset.
2 Deep Learning
Another dataset, containing only the tokenized sentences without the other features (title, location, and date), is used to evaluate the deep learning classifiers. The number of sentences is the same as reported in the paragraphs above.
The details of the temporal entity dataset are given in the next section.
3.2 Urdu Named Entity Recognition (UNER) Dataset
Another dataset used in our work is publicly available for researchers to extract named entities from Urdu language text. Datasets for the Urdu language generally exist only for named entity extraction and contain a small number of instances:
• Enabling Minority Language Engineering (EMILLE) (only 200,000 tokens) (Baker, Paul, Andrew Hardie, Tony McEnery, 2003).
• Becker-Riaz corpus (only 50,000 tokens) (Becker & Riaz, 2002).
• International Joint Conference on Natural Language Processing (IJCNLP) workshop corpus (only 58,252 tokens).
• Computing Research Laboratory (CRL) annotated corpus (only 55,000 tokens publicly available) (Kanwal et al., 2019).
A rule-based named entity recognition system for the Urdu language was proposed to extract named entities (Riaz, 2010). To our knowledge, there is no dataset specifically available for temporal entity extraction from the Urdu language. We therefore selected a dataset developed for named entity extraction (Khan Wahab, Daud Ali, Nassir A Jamal, 2016). It contains 206 date tags, each consisting of a single month name, a year, or both, drawn from national, sports, and international news, and covering Urdu Fully Qualified, Urdu Hybrid Fully Qualified, Urdu Deictic, and Urdu Anaphoric dates. An exhaustive analysis revealed that there are only 5-10 fully qualified dates, which was discouraging. It also revealed 18 different date patterns within this limited set of date tags, which makes it hard to generate generic regular expressions for date extraction.
We decided to extend the existing dataset by adding 200 Urdu fully qualified dates and 50 Urdu deictic words: 50 dates for UFQD and 150 dates for HUFQD were added to the UNER dataset. Of the 50 deictic words, 25 represent dates while 25 represent named entities. We placed these dates at different locations within the documents, i.e., at the beginning, middle, and end of sentences.
For example, in the sentence پاکستان کی تاریخ میں چھ ستمبر دو ہزار اٹھارہ کو سنہری الفاظ میں لکھا جائے گا۔ (Pakistan ki tareekh mein chey satambar do hazar atharah ko sunehri alfaz mein likha jaye ga, "In the history of Pakistan, the sixth of September two thousand eighteen will be written in golden words"), the phrase چھ ستمبر دو ہزار اٹھارہ (chey satambar do hazar atharah) represents a date placed in the middle of the sentence.
Summary
In this chapter we have discussed in detail the two datasets that we have used in our research work.
CHAPTER 4
EVENT CLASSIFICATION
In this chapter the details of the experiments, the different feature vector generation techniques, the proposed methodology, and the results related to event classification are discussed.
4.1 Proposed Methodology for Event Classification
The selection of methodology is tightly coupled to the research problem. For our problem, we decided to use machine learning classifiers, covering both traditional machine learning and deep learning approaches. The traditional machine learning algorithms evaluated for multiclass event classification are K-Nearest Neighbors (K-NN), Random Forest (RF), Support Vector Machine (SVM), Decision Tree (DT), and Multinomial Naïve Bayes (MNB). The deep learning models evaluated are a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), and a Recurrent Neural Network (RNN).
A collection of Urdu text documents D = {d1, d2, …, dn} is split into a set of sentences S = {s1, s2, …, sm}. Our purpose is to classify each sentence into a predefined set of events E = {e1, e2, …, ek}. Various feature generation methods, i.e., TF-IDF, one-hot encoding, and word embedding, are used to create feature vectors for the deep learning and machine learning classifiers. The feature vectors generated by these techniques are fed as input into the embedding layer of the neural networks; the output of the embedding layer is fed to the next, fully connected (dense) layer of the deep learning models, i.e., RNN, CNN, and DNN. At the end of model processing, in the testing/validation phase, one of the twelve class labels is assigned to each sentence.
Bag-of-words is a common method for representing text, but it ignores word order and the semantics of the text (Joachims, 1998), while the one-hot encoding method maintains the sequence of the text. The word embedding methods word2vec and GloVe8, used to generate feature vectors for deep learning models, are highly recommended for textual data. However, for Urdu text classification, the pre-existing word2vec and GloVe models are incompatible. The framework of our designed system is represented in figure 4.1; it shows the structure of the system from input to output.
8 https://ybbaigo.gitbooks.io/26/pretrained-word-embeddings.html
Figure 4.1: Event Classification Methodology
Figure 4.2 shows the comprehensive, detailed flow of the process. It is a generic representation of the experimental setup that summarises all of the steps.
Figure 4.2: Event Classification Methodology’s Flow Diagram
(The flow diagram traces an input Urdu sentence (شدید دھند کی وجہ سے نظامِ زندگی درہم برہم ہے, "life is disrupted due to dense fog") through pre-processing (stop word elimination, tokenization, annotation), feature selection and feature engineering (TF-IDF, one-hot encoding, word embedding), training and testing/validation with the classifiers (DNN (feedforward), RNN (LSTM), CNN, KNN, SVM, Random Forest, MNB, Decision Tree, Logistic Regression), and finally prediction of one of the twelve event classes: Sports, Inflation, Murder, Terrorist Attack, Politics, Law and Order, Earthquake, Showbiz, Fraud and Corruption, Weather, Sexual Assault, Business.)
In this chapter, the experiments and results for event classification and temporal entity extraction are described section-wise. The first section presents the experimental setup and results of multiclass event classification, while the second gives the details of the experiments and results for temporal entities.
4.2 Experimental Setup of Multiclass Event Classification
We performed many experiments on our dataset using various traditional machine learning and deep learning classifiers. The purpose of these experiments is to find the most efficient and accurate multiclass event classification model for an imbalanced Urdu language text dataset.
4.2.1 Feature Space
Unigram and bigram tokens of the whole corpus are used as features to create the feature space. TF-IDF vectorization is used to create a dictionary-based model, which consists of 656,608 features; the training and testing datasets are converted to TF-IDF dictionary-based feature vectors. A convolutional sequential model consisting of three layers, i.e., an input layer, a hidden layer, and an output layer, is used to evaluate our dataset. Similarly, word embedding and one-hot encoding are also included in our feature space to enlarge the scope of our research problem.
4.2.2 Feature Vector Generating Techniques
Feature vectors are the numerical representation of text and are the actual form of input that a machine learning classifier can process. Several feature generation techniques are used for text processing; we used the following feature vector generation techniques.
4.2.2.1 Word Embedding
Word embedding is a numerical representation of text in which each word is represented as a feature vector. It creates a dense vector of real values that captures the contextual, semantic, and syntactic meaning of the word, and it ensures that similar words have similar weighted values (AHMED et al., 2016).
4.2.2.2 Pretrained Word Embedding Models
Using a pre-trained word embedding model when only a small amount of data is available is highly recommended in the state of the art. GloVe and word2vec are famous word embedding models developed from very large amounts of data. Word embedding models have shown promising results for text classification, especially in the English language.
Word embedding has emerged as a powerful feature vector generation technique compared with others, i.e., TF, TF-IDF, and one-hot encoding. In our research case, using the word embedding technique for classifying Urdu sentences into event classes is potentially preferable. Unfortunately, the Urdu language lacks processing resources; we found only three word embedding models. One word embedding model (Baker, Paul, Andrew Hardie, Tony McEnery, 2003) was developed using three publicly available Urdu datasets: Wikipedia's Urdu text, a corpus of 90 million tokens (Jawaid et al., 2014), and one of 35 million tokens (Baker, Paul, Andrew Hardie, Tony McEnery, 2003). It has 102,214 unique tokens, each represented by 300-dimensional real values. Another model publicly available for research purposes consists of 25,925 unique Urdu words (Abdlrauf, 2017), each with a 400-dimensional vector. A third model, built from web-based text for text classification, consists of 64,653 unique Urdu words with 300 dimensions per word.
Our research did not stop here: to expand its scope and find the most efficient word embedding model for sentence classification, we decided to develop our own custom (domain/data-specific) word embedding models. We developed four word embedding models, each containing 57,251 unique words.
The results of the existing pre-trained word embedding models were reasonable at the initial level but remained low, the highest accuracy being 60.26%. Exploring the contents of these models revealed that many words are irrelevant or borrowed from other languages, i.e., Arabic and Persian. The contents of Wikipedia are also entirely different from those of news websites, which affected the performance of the embedding models, and the low amount of data further reduced the quality of feature vector generation. Moreover, stop words are not eliminated in the pre-trained models and are treated as tokens, while in our dataset all stop words are removed; this also reduces the effective vocabulary of the model when generating feature vectors. We therefore decided to develop custom word embedding models on our pre-processed dataset; to broaden the research task, four different word embedding models were developed. The details of all the word embedding models used are given in Table 4.1 below.
Table 4.1: Pre-trained and Custom Word Embedding Models
Existing pre-trained word embedding models
Sr. No. | Unique Words | Dimension | Window Size
1 (Pillac et al., 2012) | 64653 | 300 | -
2 (Nuij et al., 2014) | 102214 | 100 | -
3 | 53454 | 300 | -
Custom word embedding models
Sr. No. | Unique Words | Dimension | Window Size
1 | 57251 | 50 | 2
2 | 57251 | 100 | 2
3 | 57251 | 100 | 3
4 | 57251 | 350 | 1
4.2.2.3 One Hot Encoding
Text cannot be processed directly by machine learning classifiers; therefore, we need to convert it into real values. We used one-hot encoding to convert text into numeric features. For example, the sentences given in table 4.2 can be converted to numeric feature vectors using one-hot encoding, as shown in table 4.3.
Table 4.2: Event Sentences
Urdu Sentence | English Sentence
علی فٹ بال کھیلتا ہے۔ | Ali plays football.
کرونا وائرس نے لاکھوں لوگوں کی جان لے لی۔ | Corona virus killed millions of people.
Table 4.3: Event Sentences Encoded Using One-Hot Encoding
Sentence | علی | فٹ | بال | کھیلتا | کرونا | وائرس | لاکھوں | لوگوں | جان
1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0
2 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1
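The presence-based encoding of Table 4.3 can be reproduced with a few lines of plain Python; the helper names below are our own.

```python
# One column per vocabulary word: 1 if the word occurs in the sentence, else 0.
sentences = [
    ["علی", "فٹ", "بال", "کھیلتا"],
    ["کرونا", "وائرس", "لاکھوں", "لوگوں", "جان"],
]
vocab = sorted({w for s in sentences for w in s})

def encode(sentence):
    """Binary presence vector over the shared vocabulary."""
    return [1 if w in sentence else 0 for w in vocab]

vectors = [encode(s) for s in sentences]
print(vectors[0])
```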
4.2.2.4 TF-IDF
TF and TF-IDF are feature engineering techniques that transform text into numerical format; TF-IDF is among the most widely used feature vector creation methods for text data. Three deep learning models were evaluated on our corpus. The sequential model with an embedding layer outperformed the pre-trained word embedding models (Haider, 2019) reported in the state of the art (Adeeba, F., Akram, Q., Khalid, H., and Hussain, 2014). A detailed summary of the evaluation results of CNN, RNN, and DNN is given in the following section.
4.3 Deep Learning Models
4.3.1 Deep Neural Network Architecture/Feedforward Neural Network
(DNN)
Deep neural networks are artificial neural networks; the simplest DNN is also known as a feedforward neural network. The DNN architecture used in our research work consists of three layers, i.e., an input layer, a hidden (dense) layer of 150 units, and an output layer of 12 units. The feature vector is given as input to the fully connected dense layer, and the softmax activation function is used in the output layer to classify sentences into the multiple classes.
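A forward pass of such a network can be sketched in NumPy. The random weights, the 300-dimensional input, and the ReLU hidden activation are our own assumptions for illustration (training itself is done by the deep learning framework), but the 150-unit hidden layer and 12-way softmax output mirror the architecture described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.normal(size=300)                # embedded sentence vector (assumed dim)
W1 = rng.normal(size=(150, 300)) * 0.1  # input -> 150 hidden units
W2 = rng.normal(size=(12, 150)) * 0.1   # hidden -> 12 event classes

hidden = np.maximum(0, W1 @ x)          # ReLU activation (assumption)
probs = softmax(W2 @ hidden)            # class probability distribution

print(probs.shape, round(float(probs.sum()), 6))
```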
4.3.2 Recurrent Neural Network (RNN)
The recurrent neural network is evaluated using a long short-term memory (LSTM) classifier. The RNN consists of embedding, dropout, LSTM, and dense layers. A dictionary of the 30000 most frequent unique tokens is built, and the sentences are standardized to the same length using sequence padding. The dimension of the feature vector is set to 250. The RNN showed an overall accuracy of 81%, the second highest in our work.
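The sequence-padding step can be sketched in plain Python. This is a simplified stand-in for library routines such as Keras' pad_sequences (which pads at the front by default); here sequences are post-padded with zeros.

```python
# Standardize token-id sequences to a fixed length (250 in our setup):
# long sequences are truncated, short ones are post-padded with 0.
def pad_sequences(seqs, maxlen, pad_value=0):
    padded = []
    for s in seqs:
        s = s[:maxlen]                              # truncate long sequences
        padded.append(s + [pad_value] * (maxlen - len(s)))
    return padded

batch = pad_sequences([[4, 17, 9], [8, 2]], maxlen=5)
print(batch)  # [[4, 17, 9, 0, 0], [8, 2, 0, 0, 0]]
```

Padding gives every sentence the same shape, which the embedding and LSTM layers require for batched training.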
4.3.3 Convolutional Neural Network (CNN)
CNN is a class of deep neural networks that is highly recommended for image processing (Valueva et al., 2020). It consists of an input (embedding) layer, multiple hidden layers, and an output layer. A series of convolutional layers convolve the input with learned filters. An embedded sequence layer and an average pooling layer (GlobalAveragePooling1D) are also part of the hidden layers. The most common activation function in CNNs is the ReLU layer. The details of the hyperparameters used to train the CNN model for our problem are given in Table 4.6.
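The effect of the GlobalAveragePooling1D layer can be illustrated in NumPy: the pooled output is simply the mean of the token embeddings over the sequence axis (the toy matrix below is an assumption for illustration).

```python
import numpy as np

# GlobalAveragePooling1D, sketched: collapse a (sequence_length, embedding_dim)
# matrix into one embedding_dim vector by averaging over the token axis.
seq_len, emb_dim = 4, 3
embedded = np.arange(seq_len * emb_dim, dtype=float).reshape(seq_len, emb_dim)

pooled = embedded.mean(axis=0)   # average over the 4 tokens
print(pooled)                    # [4.5 5.5 6.5]
```

This pooling gives a fixed-size sentence representation regardless of sentence length, which the final dense layers can then classify.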
4.3.4 Hyperparameters
In this section, all the hyperparameters used in our experiments are given in tabular format. To maintain the conciseness and brevity of the dissertation, only those hyperparameters that achieved the highest accuracy for the DNN, RNN, and CNN models are discussed. The hyperparameters of the DNN that were fine-tuned in our work are given in Table 4.4.
Table 4.4: DNN's Hyperparameters
Parameter            Value
Max_words            5000
Batch Size           128
Embedding_Dim        512
Activation Function  SoftMax
Layers               04
Training/Testing     70%-30%
No. of Epochs        05
Loss Function        Sparse Categorical Cross-Entropy
The RNN model showed accuracies of 80.3% and 81% on the two sets of hyperparameters given in Table 4.5. Similarly, Table 4.6 provides the details of the hyperparameters of the convolutional neural network.
Table 4.5: RNN's Hyperparameters
RNN (LSTM) (80.3%)
Parameter            Value
Max_words            50000
Batch Size           64
Embedding_Dim        100
Activation Function  SoftMax
Recurrent Dropout    0.2
Training/Testing     90%-10%
No. of Epochs        15
Loss Function        Sparse Categorical Cross-Entropy
RNN (LSTM) (81%)
Parameter            Value
Max_words            30000
Batch Size           128
Embedding_Dim        100
Activation Function  SoftMax
Recurrent Dropout    0.2
Training/Testing     80%-20%
No. of Epochs        05
Loss Function        Sparse Categorical Cross-Entropy
Table 4.6: CNN's Hyperparameters
CNN (79.28%)
Parameter            Value
Max_words            20000
Batch Size           128
Embedding_Dim        50
Activation Function  SoftMax
Dense_Node           256
Training/Testing     70%-30%
No. of Epochs        20
Loss Function        Categorical Cross-Entropy
Note: These are the optimal numbers of epochs for our models, i.e., those that yielded the highest results.
4.3.5 Performance Measuring Parameters
The most common performance measuring parameters (Al-Radaideh & Al-Abrat, 2019), i.e., precision, recall, and F1-measure, are used to evaluate the proposed framework. These parameters were selected because of the multiclass classification task and the imbalanced dataset. In the case of an imbalanced dataset, reporting only the accuracy of a system is biased and unreliable. Hence, we report all the other standard metrics to determine the reliability of the proposed system.
Precision = TP / (TP + FP)                               (3)
Recall = TP / (TP + FN)                                  (4)
F1 = 2 * (Precision * Recall) / (Precision + Recall)     (5)
Accuracy = (TP + TN) / (TP + TN + FP + FN)               (6)
Where TP, TN, FP, and FN represent the True Positive, True Negative, False Positive, and False Negative counts, respectively. Precision is the fraction of retrieved instances that are relevant (the TP values), and recall is the fraction of all relevant instances that were retrieved during the experimental work. It is noteworthy that both precision and recall are relative measures of relevance.
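Equations (3)-(6) can be checked numerically; the confusion-matrix counts below are made-up values for illustration only.

```python
# The four metrics above, computed from illustrative confusion-matrix counts.
tp, tn, fp, fn = 80, 90, 20, 10

precision = tp / (tp + fp)                  # Eq. (3): 0.8
recall = tp / (tp + fn)                     # Eq. (4): ~0.889
f1 = 2 * precision * recall / (precision + recall)   # Eq. (5)
accuracy = (tp + tn) / (tp + tn + fp + fn)  # Eq. (6): 0.85

print(round(precision, 3), round(recall, 3), round(f1, 3), accuracy)
```

Note how a classifier can have 85% accuracy while the F1 for the positive class tells a more balanced story, which is why both are reported for our imbalanced dataset.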
4.4 Results
4.4.1 Deep Learning Classifiers
The feature vector can be generated using different techniques. The results of the feature vector generating techniques used in our work, i.e., "multiclass event classification for the Urdu language text", are given in the following subsections.
4.4.1.1 Pre-trained Word Embedding Models
The convolutional neural network model is evaluated on the feature vectors generated by all the pre-trained word embedding models. A summary of all the results generated by the existing pre-trained (Haider, 2019) and custom pre-trained word embedding models is given in Table 4.7. Our custom pre-trained word embedding model, which contains 57251 unique tokens, a larger dimension size of 350, and a window size of 1, showed 38.68% accuracy. The purpose of developing a separate custom pre-trained word embedding model was to obtain a domain-specific model and achieve the highest accuracy. However, the results of both the pre-existing pre-trained word embedding models and the domain-specific custom word embedding models are very low.
Table 4.7: Classification Accuracy of the CNN Model
Sr. No.   Existing pre-trained model's validation accuracy   Custom pre-trained model's validation accuracy
1         58.00                                              36.85
2         60.26                                              38.04
3         56.68                                              37.38
4         -                                                  38.68
4.4.1.2 TF-IDF Feature Vector
The DNN architecture consists of an input layer, a dense layer, and a max pooling layer. The dense layer, also called a fully connected layer, comprises 150 nodes. The SoftMax activation function and the sparse categorical cross-entropy loss are used to compile the model on the dataset.
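The sparse categorical cross-entropy loss used to compile the model can be sketched as follows; the probabilities and labels are toy values, and the point is that this loss accepts integer class labels rather than one-hot targets.

```python
import math

# Sparse categorical cross-entropy: the mean negative log-probability the
# model assigns to the true class, with labels given as plain integers.
def sparse_categorical_crossentropy(y_true, y_prob):
    return -sum(math.log(p[t]) for t, p in zip(y_true, y_prob)) / len(y_true)

probs = [[0.7, 0.2, 0.1],   # model output for sample 1
         [0.1, 0.8, 0.1]]   # model output for sample 2
labels = [0, 1]             # integer labels, not one-hot vectors
loss = sparse_categorical_crossentropy(labels, probs)
print(round(loss, 4))  # ≈ 0.2899
```

With twelve event classes, integer labels avoid materializing a 12-wide one-hot target for every sentence, which is why the sparse variant is used instead of plain categorical cross-entropy.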
25991 instances are used to validate the accuracy of the DNN model. The DNN with fully connected layer architecture showed 84% overall accuracy across all event classes. The details of the performance measuring parameters for each class of events are given in the table below. Law and Order, the 6th type of event in our dataset, consists of 2000 instances that are used for validation. It showed 66% accuracy, which is comparatively low compared with the other types of events, and this affected the overall performance of the DNN model. The main reason behind these results is that law-and-order sentences overlap with sentences about politics; even humans sometimes can hardly distinguish between law-and-order and political statements.
For example,
"حکومت کے وزیر کی غیر ذمہ دارانہ گفتگو خطے کے امن کے لیے خطرہ ہے۔"
“The irresponsible talk of state minister is a threat to peace in the region.”
The performance details of the DNN model, which showed 84% accuracy over the multiple classes of events, are given in Table 4.8. All the other performance measuring parameters, i.e., precision, recall, and F1-score, of each class of events are also given in Table 4.8.
Table 4.8: Performance Measuring Parameters for DNN Model
Class Precision Recall F1-Score Support
1 0.96 0.95 0.96 4604
2 0.91 0.91 0.91 776
3 0.75 0.75 0.75 1697
4 0.78 0.70 0.74 770
5 0.81 0.85 0.83 8424
6 0.71 0.63 0.67 2000
7 1.00 1.00 1.00 817
8 0.92 0.90 0.91 1839
9 0.70 0.70 0.71 2524
10 0.95 0.99 0.97 856
11 0.95 0.99 0.97 741
12 0.82 0.73 0.77 943
Accuracy 0.84 25991
Macro avg 0.84 0.84 0.85 25991
Weighted avg 0.84 0.84 0.84 25991
The expected solution to the problem of sentences overlapping multiple classes is to use a pre-trained word embedding model such as Word2Vec or GloVe (for the English language). However, for the Urdu language, there is no mature (efficient, accurate) pre-trained word embedding model.
The RNN sequential deep learning model architecture is used in our experiments. The recurrent deep learning model architecture consists of the following sequence of layers: an embedding layer with 100 dimensions, SpatialDropout1D, an LSTM layer, and dense layers. The sparse categorical cross-entropy loss function is used to compile the model; multiclass classification is handled by sparse categorical cross-entropy instead of categorical cross-entropy. A SoftMax activation function is used at the dense layer instead of the sigmoid function, since SoftMax produces a probability distribution over multiple classes, while sigmoid is suited to binary classification.
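The contrast between the two activations can be illustrated numerically (the logits below are arbitrary values chosen for illustration):

```python
import math

# Softmax turns a vector of logits into a probability distribution over all
# classes; sigmoid squashes a single logit into one probability (binary case).
def softmax(logits):
    e = [math.exp(z) for z in logits]
    s = sum(e)
    return [v / s for v in e]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

multi = softmax([2.0, 1.0, 0.1])
print(round(sum(multi), 6))    # 1.0 -> one probability per class, summing to 1
print(sigmoid(0.0))            # 0.5 -> a single binary probability
```

Because the twelve event classes are mutually exclusive at the sentence level, a distribution over all classes (softmax) is the appropriate output, not twelve independent sigmoids.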
A bag-of-words of 30000 unique Urdu language words is used to generate the feature vector. The maximum length of the feature vector is 250 tokens.
The overall accuracy of the RNN model is presented in Table 4.9; it achieved 81% validation accuracy on our problem using TF-IDF feature vectors. The other performance evaluation parameters for each class are also given in Table 4.9.
Table 4.9: Performance Measuring Parameters for RNN Model
Class Precision Recall F1-score Support
1 0.95 0.95 0.95 4604
2 0.78 0.77 0.78 776
3 0.70 0.72 0.71 1697
4 0.78 0.64 0.70 770
5 0.78 0.84 0.81 8424
6 0.67 0.57 0.62 2000
7 1.00 1.00 1.00 817
8 0.91 0.87 0.89 1839
9 0.70 0.63 0.66 2524
10 0.93 0.98 0.95 856
11 0.86 0.94 0.90 741
12 0.76 0.67 0.71 943
Accuracy 0.81 25991
macro avg 0.82 0.80 0.81 25991
weighted avg 0.81 0.81 0.81 25991
The accuracy of the RNN model can be viewed in figure 4.3, where the y-axis represents
the accuracy, and the x-axis represents the number of epochs. RNN achieved 81% accuracy
for multiclass event classification.
Figure 4.3: RNN’s Accuracy
Although CNN is highly recommended for image processing, it showed considerable results for multiclass event classification on textual data. The performance measuring parameters of the CNN classifier are given in Table 4.10.
Table 4.10: Performance Measuring Parameters for the CNN Model
Class Precision Recall F1-score Support
1 0.96 0.93 0.95 5661
2 0.81 0.65 0.72 967
3 0.72 0.68 0.70 2115
4 0.78 0.54 0.64 878
5 0.73 0.88 0.80 10030
6 0.64 0.51 0.57 2293
7 0.99 0.99 0.99 970
8 0.91 0.86 0.88 2259
9 0.71 0.61 0.66 3044
10 0.93 0.94 0.93 1031
11 0.91 0.82 0.86 889
12 0.77 0.63 0.70 1052
Accuracy 0.80 31189
macro avg 0.82 0.75 0.78 31189
weighted avg 0.80 0.80 0.80 31189
The class-wise accuracy distribution of the CNN classifier over the twelve classes can be viewed in Figure 4.4. The presence of more than one peak (higher accuracies) in Figure 4.4 shows that the dataset is imbalanced.
Figure 4.4: CNN’s Distribution Accuracy
4.4.1.3 ONE-HOT-ENCODING
The performance of the deep learning classifiers used in our research work on One-Hot-Encoding features is presented in Figure 4.5. The one-hot-encoded feature vectors are given as input to the CNN, DNN, and RNN deep learning classifiers. RNN showed better accuracy than CNN, while DNN outperformed both: RNN and DNN achieved 81% and 84% accuracy, respectively, for multiclass event classification.
Figure 4.5: CNN, RNN, and DNN Accuracy Using One-Hot-Encoding
4.5 Traditional Machine Learning Classifiers
To enlarge the scope of the research and develop a generic, efficient model, some well-known machine learning classifiers are also evaluated for multiclass event classification on the MULLS dataset: k-NN, Decision Tree, Naïve Bayes Multinomial, Random Forest, Logistic Regression, and Support Vector Machine.
All these models are evaluated using TF-IDF and one-hot encoding features as feature vectors. It is observed that the results generated using TF-IDF features are better than those generated using one-hot encoding features. A detailed summary of the results of the above-mentioned machine learning classifiers is given in the following sections.
4.5.1 K-Nearest Neighbour (k-NN)
K-NN performs the classification of a new data point by measuring the similarity distance
between the nearest neighbors. In our experiments, we set k = 5, which measures the similarity distance to the five nearest existing data points (Guo G, Wang H, Bell D, Bi Y, n.d.). Although the performance of the traditional machine learning classifiers is considerable, it must be noted that it is lower than that of the deep learning classifiers. The main factors degrading classifier performance are the imbalanced number of instances and sentence overlapping. The performance of the K-NN machine learning model is given in Table 4.11; it showed 78% overall accuracy across all classes.
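The k-NN decision rule with k = 5 can be sketched in plain Python. The toy binary vectors and class labels below are illustrative assumptions; the actual experiments use TF-IDF vectors over the MULLS dataset.

```python
from collections import Counter

# Bare-bones k-NN: find the k nearest training points by squared Euclidean
# distance and return the majority class among them.
def knn_predict(train_x, train_y, query, k=5):
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_x, train_y)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]   # majority class among the k nearest

train_x = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 1), (0, 1, 1), (0, 0, 0)]
train_y = ["sports", "sports", "sports", "politics", "politics", "politics"]
print(knn_predict(train_x, train_y, (1, 1, 0)))  # "sports"
```

Since k-NN stores the whole training set and votes at query time, its cost grows with corpus size, one reason the deep models scale better on our data.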
Table 4.11: Performance Measuring Parameters for the K-NN Model
Class Precision Recall F1-score Support
1 0.91 0.93 0.92 5661
2 0.62 0.83 0.71 967
3 0.67 0.71 0.69 2115
4 0.64 0.60 0.62 878
5 0.78 0.82 0.80 10030
6 0.66 0.50 0.57 2293
7 0.93 1.00 0.96 970
8 0.91 0.80 0.85 2259
9 0.71 0.62 0.66 3044
10 0.85 0.93 0.89 1031
11 0.72 0.85 0.78 889
12 0.75 0.61 0.67 1052
Accuracy 0.78 31189
Macro avg 0.76 0.77 0.76 31189
Weighted avg 0.78 0.78 0.78 31189
4.5.2 Decision Tree (DT)
Decision Tree (DT) is a type of supervised machine learning algorithm (Zhong, 2016) in which the input data is split according to certain parameters. The Decision Tree showed 73% accuracy; the other performance details for each class are given in Table 4.12.
Table 4.12: Performance Measuring Parameters for The DT Model
Class Precision Recall F1-score Support
1 0.91 0.89 0.90 5661
2 0.83 0.97 0.89 967
3 0.57 0.52 0.54 2115
4 0.58 0.54 0.56 878
5 0.72 0.75 0.73 10030
6 0.44 0.41 0.42 2293
7 0.99 1.00 1.00 970
8 0.79 0.77 0.78 2259
9 0.57 0.55 0.56 3044
10 0.98 0.98 0.93 1031
11 0.86 0.98 0.92 889
12 0.61 0.56 0.58 1031
Accuracy 0.73 31189
Macro avg 0.73 0.74 0.74 31189
Weighted avg 0.73 0.73 0.73 31189
4.5.3 Naïve Bayes Multinomial (NBM)
Naïve Bayes Multinomial is one of the most computationally efficient classifiers for text classification (S., 2018), but it showed only 70% accuracy, which is very low compared with K-NN, DT, and RF. The performance details for all twelve (12) classes are given in Table 4.13.
Table 4.13: Performance Measuring Parameters for the NB Multinomial Model
Class Precision Recall F1-score Support
1 0.94 0.91 0.93 5683
2 0.82 0.34 0.48 956
3 0.66 0.47 0.55 2121
4 0.91 0.20 0.32 919
5 0.56 0.95 0.70 10013
6 0.70 0.22 0.34 2387
7 0.98 0.95 0.97 959
8 0.94 0.75 0.83 2188
9 0.75 0.40 0.52 3031
10 0.96 0.78 0.86 998
11 0.96 0.32 0.48 863
12 0.84 0.25 0.39 1071
Accuracy 0.70 31189
Macro avg 0.84 0.54 0.61 31189
Weighted avg 0.76 0.70 0.67 31189
4.5.4 Logistic Regression (LR)
Linear regression is recommended for predicting continuous outputs rather than for categorical classification (T. Zhang & Oles, 2001), whereas logistic regression is used for multiclass classification tasks. Table 4.14 presents the performance of the Logistic Regression model, i.e., 80% overall accuracy for multiclass event classification.
Table 4.14: Performance Measuring Parameters for The LR Model
Class Precision Recall F1-score Support
1 0.95 0.94 0.94 5661
2 0.83 0.64 0.72 967
3 0.72 0.69 0.70 2115
4 0.77 0.55 0.64 878
5 0.73 0.88 0.80 10030
6 0.64 0.53 0.58 2293
7 1.00 1.00 1.00 970
8 0.91 0.84 0.88 2259
9 0.73 0.62 0.67 3044
10 0.94 0.92 0.93 1031
11 0.90 0.80 0.85 889
12 0.77 0.66 0.71 1052
Accuracy 0.80 31189
Macro avg 0.82 0.76 0.79 31189
Weighted avg 0.80 0.80 0.80 31189
4.5.5 Random Forest (RF)
Random Forest comprises many decision trees (Ali J, Khan R, Ahmad N, 2012). It showed the highest accuracy among all the evaluated machine learning classifiers. A detailed summary of the results is given in Table 4.15.
Table 4.15: Performance Measuring Parameters for The RF Model
Class Precision Recall F1-score Support
1 0.94 0.93 0.94 5661
2 0.94 0.96 0.95 967
3 0.72 0.63 0.67 2115
4 0.80 0.58 0.67 878
5 0.71 0.90 0.79 10030
6 0.67 0.41 0.51 2293
7 1.00 1.00 1.00 970
8 0.93 0.80 0.86 2259
9 0.75 0.58 0.65 3044
10 0.94 0.98 0.96 1031
11 0.96 0.98 0.97 889
12 0.84 0.63 0.72 1052
Accuracy 0.80 31189
Macro avg 0.85 0.78 0.81 31189
Weighted avg 0.81 0.80 0.80 31189
4.5.6 Support Vector Machine (SVM)
The support vector machine (SVM) is one of the most highly recommended models for binary classification and is based on statistical learning theory (Y. Zhang, 2012). Its performance details are given below in Table 4.16.
Table 4.16: Performance Measuring Parameters for SVM Model
Class Precision Recall F1-score Support
1 0.84 0.94 0.89 5683
2 0.72 0.43 0.54 956
3 0.72 0.49 0.58 2121
4 0.73 0.43 0.54 919
5 0.64 0.90 0.75 10013
6 0.74 0.24 0.36 2387
7 0.90 0.99 0.94 959
8 0.86 0.78 0.82 2188
9 0.65 0.47 0.57 3031
10 0.85 0.87 0.82 998
11 0.81 0.62 0.70 863
12 0.77 0.63 0.67 1071
Accuracy 0.73 31189
Macro avg 0.77 0.63 0.67 31189
Weighted avg 0.77 0.73 0.71 31189
A comparative depiction of the results obtained by the traditional machine learning classifiers is given in Figure 4.6. Random Forest showed the highest accuracy among all the machine learning classifiers. Although the machine learning classifiers showed considerable results, they are low compared with the deep learning models.
Figure 4.6: Machine Learning Algorithms' Accuracy Using TF-IDF
(Note that the results reported in the dissertation belong to one (deep learning) dataset, while the results on the other dataset are under consideration.)
Summary
In this chapter, we have discussed the experiments and results in detail. We explored both machine learning and deep learning classifiers, and the best results are reported in this chapter. It can be observed that the deep learning classifiers outperformed the machine learning classifiers. The Deep Neural Network (feedforward) showed the highest accuracy, 84%, using TF-IDF feature vectors.
CHAPTER 5
TEMPORAL ENTITY EXTRACTION
5.1 Proposed Methodology for Temporal Entity Extraction
Extracting and classifying information using a rule-based approach requires expert-level skills and deep knowledge of the language concerned. Language features, i.e., grammatical, morphological, and lexical features, are the basic components of such deep knowledge of a specific language. Rules are designed based on patterns to extract a specific entity (U. P. Singh et al., 2012).
5.2 Rule-based Approach (Regular Expression)
A regular expression (regex) is a string that defines a text matching pattern. These patterns can match strings of numbers or text, for example, "1234" and "banana". Some examples of regular expressions used to extract temporal entities from the Urdu Named Entity Recognition dataset are given here:
• Fully Qualified Date in Urdu
Regular expressions for the other types of temporal entities are given in the appendix at the end of the thesis.
Expression1=("\d+\s+[ جنوری|فروری|مارچ|اپریل|مئی
ایک|دو|تین|چار|پانچ|چھ|سات|آٹھ|نو| ]+s\+[دوہزار]+s\+[جون|جولائي|اگست|ستمبر|اکتوبر|نومبر|دسمبر
دس|گیارہ|بارہ|تیرہ|چودہ|پندرہ|سولہ|سترہ|اٹھارہ|انیس|بیس|اکیس|بائیس|تئیس|چوبیس|پچیس|چھبیس|ستائیس
س|اکتیس|بتیس|تینتیس|چونتیس|پینتیس|چھتیس|سینتیس|اڑتیس|انتالیس|چالیس|اکتالیس|بیالی |اٹھائیس|انتیس|تی
س|تینتالیس|چوالیس|پینتالیس|چھیالیس|سینتالیس|اڑتالیس|انچاس|پچاس|اکاون|باون|تریپن|چون|پچپن|چھپن|
تر|بہتر|تہتر|چوہتر| ستاون|اٹھاون|انسٹھ|ساٹھ|اکسٹھ|باسٹھ|تریسٹھ|چونسٹھ|پینسٹھ|سڑسٹھ|اڑسٹھ|انہتر|ستر|اکہ
پچھتر|چھہتر|ستتر|اٹھتر|اناسی|اسی|اکاسی|بیاسی|تریاسی|چوراسی|پچاسی|چھیاسی|ستاسی|اٹھاسی|نواسی|
("+[نوے|اکانوے|بانوے|ترانوے|چورانوے|پچانوے|چھیانوے|ستانوے|اٹھانوے|ننانوے
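A deliberately simplified version of the fully-qualified-date pattern can be tested with Python's re module. Only the month alternation is reproduced here, with standard spellings that may differ slightly from the expression above; the full day/year alternation is given in the appendix.

```python
import re

# Simplified fully-qualified-date pattern: a numeric day followed by one
# of the twelve Urdu month names (illustrative subset of the full regex).
URDU_MONTHS = "جنوری|فروری|مارچ|اپریل|مئی|جون|جولائی|اگست|ستمبر|اکتوبر|نومبر|دسمبر"
date_pattern = re.compile(r"\d+\s+(?:%s)" % URDU_MONTHS)

text = "وہ 14 اگست کو واپس آئے گا"   # "He will return on 14 August"
matches = date_pattern.findall(text)
print(matches)  # ['14 اگست']
```

In Python 3, regex matching is Unicode-aware by default, so the Urdu month names can be placed directly in the alternation without any special flags.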
Figure 5.1: Temporal Entity Extraction Methodology
5.3 Experimental Setup of Temporal Entity Extraction
We started our experiments on a plain-text Urdu corpus, ignoring the annotation tags. The complex structure and varying formats of Urdu temporal entities led us to use regular expressions. In our exhaustive analysis, we found different writing styles for dates in Urdu language text, i.e., the Urdu Fully Qualified date, the Urdu Deictic date, and the Urdu Anaphoric date.
5.4 Results
All the date extraction results are shown in the tables below, which show considerable results for all types of date.
Table 5.1: All Dates Extraction Results on the original dataset
Type of Date Precision Recall F1-Measure
Numeric Year 0.91 1.00 0.95
Urdu Month and Urdu Year 0.58 1.00 0.77
Urdu Year 1.00 1.00 1.00
Urdu Month and Numeric Year 1.00 1.00 1.00
Numeric Day and Urdu Month 0.95 1.00 0.97
Only Urdu Month 1.00 1.00 1.00
Urdu Day and Month 0.50 1.00 0.67
UFQ Date and Urdu Hybrid FQ Date 0.95 0.95 0.95
Deictic and Anaphoric 1.00 1.00 1.00
Table 5.2: UFQD & UPFQD on Extended Dataset
Date Type      Precision  Recall  F1-Measure  Example
Numeric Year   0.96       0.92    0.94        جون 2019
Numeric Month  1.00       1.00    1.00        پانچ 8
Numeric Day    0.94       1.00    0.97        8 فروری
Urdu FQ Date   1.00       1.00    1.00        تین اگست دو ہزار انیس
Average        0.97       0.98    0.98
Table 5.3: Deictic Date Analysis
Deictic date   Precision  Recall  F1-Measure
Recognition    0.50       1.00    0.66
Retrieval      1.00       1.00    1.00
Figure 5.2: All Dates Extraction Results on Original Dataset
Figure 5.3: F-Measure of UFQD & UPFQD on an extended dataset
Figure 5.4: Deictic date analysis
5.5 Discussion
The influence of the Internet via social media and news websites is remarkable. Nowadays, sharing feelings, thoughts, ideas, events, problems, research work, advertisements, criticism, etc. on the Internet through social media is common practice. It can be observed that the usage of local languages on social networks such as Twitter, Facebook, and WhatsApp is increasing, because people find it easier to convey messages in their local languages. Many tools are available to support multiple languages on the Internet. As a result, a huge bulk of multilingual data is being generated at an exponential speed. Analyzing such material with pre-existing processing tools is inadequate, because these tools are insufficient for, and incompatible with, many languages.
The Urdu language is one of the resource-poor languages that cannot be handled by the existing tools, which are good enough for the English language. It is morphologically rich, has a complex right-to-left writing style, and uses a diacritic writing script. These characteristics distinguish it from other languages. It lacks processing resources: language annotators, Part-of-Speech (PoS) taggers, Word2vec models, and datasets. Only a few datasets exist on the website of the Center for Language Engineering (CLE)9, and they are purchasable. The Urdu language has more than 300 million users who can read, write, and understand it. It is also the national language of Pakistan, the 6th most populous country in the world.
The lack of resources is a major hurdle for research on Urdu language text. Events are an important piece of information related to our lives. During the research period we found in the literature only a few datasets, and these were developed specifically for Named Entity Recognition rather than for event classification. We therefore developed our own dataset for multiclass event classification, collecting more than 0.15 million sentences covering different types of events. To classify multiclass events at the sentence level, we decided to use machine learning and deep learning approaches. Six well-known machine learning classifiers, i.e., SVM, RF, DT, NBM, K-NN, and LR, and three deep learning models, i.e., CNN, RNN, and DNN, are used for multiclass event classification. Different feature vector generating techniques are explored, such as the count vectorizer, TF-IDF, one-hot encoding, and word embedding. Interestingly, TF-IDF outperformed the other techniques: DNN showed 84% accuracy using TF-IDF feature vectors.
In the case of temporal entity extraction, we deeply analyzed the different writing formats of dates in Urdu language text and found more than 20 different date formats. It is observed that people generally do not follow a standard format for temporal entities. Fully qualified TEs can be extracted by writing appropriate regular expressions and their accuracy can be ensured, but in the case of anaphoric and deictic TEs, regex alone is insufficient to resolve the temporal values from the text. Contextual information needs to be analyzed for anaphoric and deictic TEs.
Summary
Temporal entities are necessary to predict the time of occurrence of any event. In this chapter we explored the various types of TEs that exist in Urdu language text. Regular expressions have been used to extract the different TEs from plain text.
9 http://www.cle.org.pk/
CHAPTER 6
CONCLUSION AND FUTURE WORK
6.1 Conclusion
In a comprehensive review of the Urdu literature, we found only a few reference works related to Urdu text processing. The main hurdle in Urdu exploration is the unavailability of processing resources, i.e., event datasets, a closed-domain part-of-speech tagger, lexicons, annotators, and other supporting tools.
As reported in this dissertation, the dataset is imbalanced, and we performed the experiments on the same (imbalanced) dataset. In the case of an imbalanced dataset, accuracy values alone are unreliable, since the results produced by the classifiers are biased. To resolve this issue, we reported the output performance on the basis of other metrics, namely precision, recall, and F-measure.
We have explored many feature vector generating techniques. Different classification algorithms from both traditional machine learning and deep learning approaches are evaluated on these feature vectors. The purpose of performing many experiments on various feature vector generating techniques was to develop the most efficient and generic model of multiclass event classification for Urdu language text.
The word embedding feature generating technique is considered an efficient and powerful technique for text analysis. Word2Vector (W2Vec) feature vectors can be generated by pre-trained word embedding models or by using trainable parameters in the embedding layers of deep neural networks.
In general, word embedding performs well as a feature vector generating technique, and it is one of the most widely used techniques for relatively small datasets. Furthermore, pre-trained word embedding models play a key role in handling large and complex datasets. In our research problem, we explored only three pre-trained word embedding models for the Urdu language, also cited in the dissertation. Unfortunately, those pre-trained word embedding models did not perform well in our case, since they are trained on text extracted from the blogosphere. In contrast to such datasets, our dataset is relatively different, since it is a collection of different events reported and discussed on social media. This is the reason these models showed results with quite low accuracy.
Another argument in support of this conclusion is that only a few pre-trained word embedding models exist for Urdu language text. These models are trained on a considerable number of tokens, but on domain-specific Urdu text. There is a need to develop generic word embedding models for the Urdu language on a large corpus. The choice between single-layer and multilayer CNN and RNN (LSTM) architectures did not affect the performance of the proposed system.
The experimental results clearly show that the one-hot encoding method is better than the word embedding and pre-trained word embedding models. However, TF-IDF outperformed the other feature generating techniques, such as word embedding and one-hot encoding; it showed the highest accuracy, 84%, using the DNN deep learning classifier. The same task using traditional machine learning classifiers showed considerable performance, but lower than the deep learning models. Deep learning algorithms, i.e., CNN, DNN, and RNN, are preferable to traditional machine learning algorithms because, unlike traditional machine learning, deep learning does not need a domain expert to find the relevant features. DNN and RNN outperformed all the other classifiers and showed overall accuracies of 84% and 81%, respectively, for the twelve classes of events. Comparatively, the performance of CNN and RNN is better than that of Naïve Bayes and SVM.
Multiclass event classification at the sentence level was performed on an imbalanced dataset; classes with a low number of instances affect the overall performance of the classifiers. Performance can be improved by balancing the instances of each class. It can be concluded that:
• The lack of resources is the main barrier to research work,
• For the Urdu language, there are only a few pre-trained word embedding models, and those models showed very poor results,
• We also created our own word embedding models, but they also showed very poor results,
• We evaluated six well-known machine learning classifiers and three deep learning classifiers,
• Deep learning classifiers using TF-IDF showed the best results compared with the machine learning classifiers; the DNN showed 84% accuracy,
• There is no specific work in the literature related to Urdu temporal entities,
• Regular expressions showed considerable results for fully qualified dates, while deictic and anaphoric dates require contextual information.
6.2 Future Work
There are many tasks that can be accomplished for Urdu language text in the future. Some of them are mentioned here:
1. In the future we plan to extend our research work to improve the accuracy of the proposed models, to increase the size of the datasets, to use the BERT encoder, and to perform event classification at the document and phrase levels.
2. We also plan to use machine learning and deep learning approaches to extract and classify deictic and anaphoric TEs.
3. To classify events as real-time or retrospective using fuzzy rules.
4. To propose an approach to differentiate temporal entities written in Urdu language text from those in other languages, such as Sindhi, Arabic, and Persian, that have relatively similar writing scripts.
References
A.S. Abrahams. (2002). Developing and Executing Electronic Commerce Applications with Occurrences.
Abdlrauf, H. and M. A. (2017). Deep learning for sentence classification. IEEE Explorer.
Abrahams, A. S., Jiao, J., Wang, G. A., & Fan, W. (2012). Vehicle defect discovery from
social media. Decision Support Systems. https://doi.org/10.1016/j.dss.2012.04.005
Adeeba, F., Akram, Q., Khalid, H., & Hussain, S. (2014). CLE Urdu books N-grams. In Conference on Language and Technology, CLT 14, Karachi, Pakistan.
AHMED, K., ALI, M., KHALID, S., & KAMRAN, M. (2016). Framework for Urdu
News Headlines Classification. Journal of Applied Computer Science &
Mathematics. https://doi.org/10.4316/jacsm.201601002
Ahn, D., Adafre, S. F., & De Rijke, M. (2005). Towards task-based temporal extraction
and recognition. Dagstuhl Seminar Proceedings.
Akhter, M. P., Jiangbin, Z., Naqvi, I. R., Abdelmajeed, M., & Fayyaz, M. (2020).
Exploring deep learning approaches for Urdu text classification in product
manufacturing. Enterprise Information Systems.
https://doi.org/10.1080/17517575.2020.1755455
Al-Dyani, W. Z., Yahya, A. H., & Ahmad, F. K. (2018). Challenges of event detection
from social media streams. International Journal of Engineering & Technology, 7,
72–75.
Al-Garadi, M. A., Hussain, M. R., Khan, N., Murtaza, G., Nweke, H. F., Ali, I., Mujtaba,
G., Chiroma, H., Khattak, H. A., & Gani, A. (2019). Predicting Cyberbullying on
Social Media in the Big Data Era Using Machine Learning Algorithms: Review of
Literature and Open Challenges. IEEE Access.
https://doi.org/10.1109/ACCESS.2019.2918354
Al-Radaideh, Q. A., & Al-Abrat, M. A. (2019). An Arabic text categorization approach
using term weighting and multiple reducts. Soft Computing.
https://doi.org/10.1007/s00500-018-3249-z
Ali, A. R., & Ijaz, M. (2009). Urdu text classification. Proceedings of the 6th
International Conference on Frontiers of Information Technology, FIT ’09.
https://doi.org/10.1145/1838002.1838025
Ali, D., Muhammad, M., Akhtar, N., Salamat, N., Asmat, H., & Firdous, A. (2016).
Gender Prediction for Expert Finding Task. International Journal of Advanced
Computer Science and Applications. https://doi.org/10.14569/ijacsa.2016.070525
Ali, J., Khan, R., Ahmad, N., & Maqsood, I. (2012). Random forests and decision trees. International Journal of Computer Science Issues.
Allen, J. F. (1983). Maintaining Knowledge about Temporal Intervals. Communications
of the ACM. https://doi.org/10.1145/182.358434
Awais, M., & Shoaib, M. (2019). Role of discourse information in Urdu sentiment
classification: A Rule-based Method and Machine-learning Technique. ACM
Transactions on Asian and Low-Resource Language Information Processing.
https://doi.org/10.1145/3300050
Bahir, E., & Peled, A. (2016). Geospatial extreme event establishing using social
network’s text analytics. GeoJournal. https://doi.org/10.1007/s10708-015-9622-x
Baker, P., Hardie, A., McEnery, T., & Jayaram, B. D. (2003). Corpus data for South Asian language processing. In Proceedings of the 10th Annual Workshop for South Asian Language Processing, EACL.
Barthe-Delanoë, A. M., Truptil, S., Bénaben, F., & Pingaud, H. (2014). Event-driven
agility of interoperability during the Run-time of collaborative processes. Decision
Support Systems. https://doi.org/10.1016/j.dss.2013.11.005
Becker, D., & Riaz, K. (2002). A study in Urdu corpus construction.
https://doi.org/10.3115/1118759.1118760
Bittar, A., Amsili, P., Denis, P., & Danlos, L. (2011). French TimeBank: An ISO-TimeML annotated reference corpus. ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
Borsje, J., Hogenboom, F., & Frasincar, F. (2010). Semi-automatic financial events
discovery based on lexico-semantic patterns. International Journal of Web
Engineering and Technology. https://doi.org/10.1504/IJWET.2010.038242
Capet, P., Delavallade, T., Nakamura, T., Sandor, A., Tarsitano, C., & Voyatzi, S. (2008).
A risk assessment system with automatic extraction of event types. IFIP
International Federation for Information Processing. https://doi.org/10.1007/978-0-
387-87685-6_27
Caselli, T., & Sprugnoli, R. (2017). It-TimeML and the Ita-TimeBank: Language Specific
Adaptations for Temporal Annotation. In Handbook of Linguistic Annotation.
https://doi.org/10.1007/978-94-024-0881-2_36
Cavalin Rodrigo Paulo, D. F. and C. da S. M. S. (2016). Classification of Life Events on
Social Media.
Chowdhury, S. R., Imran, M., Asghar, M. R., Amer-Yahia, S., & Castillo, C. (2013).
Tweet4act: Using incident-specific profiles for classifying crisis-related messages.
ISCRAM 2013 Conference Proceedings - 10th International Conference on
Information Systems for Crisis Response and Management.
Conlon, S. J., Abrahams, A. S., & Simmons, L. L. (2015). Terrorism information
extraction from online reports. Journal of Computer Information Systems.
https://doi.org/10.1080/08874417.2015.11645768
Costa, F., & Branco, A. (2012). TimeBankPT: A TimeML annotated corpus of
Portuguese. Proceedings of the 8th International Conference on Language
Resources and Evaluation, LREC 2012.
D’Andrea, E., Ducange, P., Bechini, A., Renda, A., & Marcelloni, F. (2019). Monitoring
the public opinion about the vaccination topic from tweets analysis. Expert Systems
with Applications. https://doi.org/10.1016/j.eswa.2018.09.009
Daud, A., Khan, W., & Che, D. (2017). Urdu language processing: a survey. Artificial
Intelligence Review. https://doi.org/10.1007/s10462-016-9482-x
De Santis, E., Martino, A., & Rizzi, A. (2020). An Infoveillance System for Detecting
and Tracking Relevant Topics from Italian Tweets during the COVID-19 Event.
IEEE Access. https://doi.org/10.1109/ACCESS.2020.3010033
Dou, Wenwen, Xiaoyu Wang, William Ribarsky, and M. Z. (2012). Event detection in
social media data. In IEEE Vis Week Workshop on Interactive Visual Text Analytics-
Task Driven Analytics of Social Media Content.
Ramesh, D., & S. S. K. (2016). Event extraction from natural language text. IJESRT.
Ferro, L., Gerber, L., Mani, I., Sundheim, B., & Wilson, G. (2005). TIDES 2005 standard for the annotation of temporal expressions.
Filannino, M., & Nenadic, G. (2015). Temporal expression extraction with extensive
feature type selection and a posteriori label adjustment. Data and Knowledge
Engineering. https://doi.org/10.1016/j.datak.2015.09.002
Guo, G., Wang, H., Bell, D., Bi, Y., & Greer, K. (n.d.). On the Move to Meaningful Internet Systems. OTM Confederated International Conferences.
Haider, S. (2019). Urdu word embeddings. LREC 2018 - 11th International Conference
on Language Resources and Evaluation.
Hao, T., Pan, X., Gu, Z., Qu, Y., & Weng, H. (2018). A pattern learning-based method
for temporal expression extraction and normalization from multi-lingual
heterogeneous clinical texts. BMC Medical Informatics and Decision Making.
https://doi.org/10.1186/s12911-018-0595-9
Hogenboom, F., Frasincar, F., Kaymak, U., De Jong, F., & Caron, E. (2016). A Survey of
event extraction methods from text for decision support systems. Decision Support
Systems. https://doi.org/10.1016/j.dss.2016.02.006
Huang, P. Y., Liang, J., Lamare, J. B., & Hauptmann, A. G. (2018). Multimodal filtering
of social media for temporal monitoring and event analysis. ICMR 2018 -
Proceedings of the 2018 ACM International Conference on Multimedia Retrieval.
https://doi.org/10.1145/3206025.3206079
Jacobs, G., Lefever, E., & Hoste, V. (2019). Economic Event Detection in Company-
Specific News Text. https://doi.org/10.18653/v1/w18-3101
Jaidka, K., Ahmed, S., Skoric, M., & Hilbert, M. (2019). Predicting elections from social
media: a three-country, three-method comparative study. Asian Journal of
Communication. https://doi.org/10.1080/01292986.2018.1453849
Jawaid, B., Kamran, A., & Bojar, O. (2014). A tagged corpus and a tagger for Urdu.
Proceedings of the 9th International Conference on Language Resources and
Evaluation, LREC 2014.
Jiang, S., Chen, H., Nunamaker, J. F., & Zimbra, D. (2014). Analyzing firm-specific
social media and market: A stakeholder-based event analysis framework. Decision
Support Systems. https://doi.org/10.1016/j.dss.2014.08.001
Jin, B., Zhuo, W., Hu, J., Chen, H., & Yang, Y. (2013). Specifying and detecting spatio-
temporal events in the internet of things. Decision Support Systems.
https://doi.org/10.1016/j.dss.2013.01.027
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics).
Lafferty, J., McCallum, A., & Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning.
Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A convolutional neural
network for modelling sentences. 52nd Annual Meeting of the Association for
Computational Linguistics, ACL 2014 - Proceedings of the Conference.
https://doi.org/10.3115/v1/p14-1062
Kamila, S., Hasanuzzaman, M., Ekbal, A., & Bhattacharyya, P. (2018). Tempo-HindiWordNet: A lexical knowledge-base for temporal information processing. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP).
Kanwal, S., Malik, K., Shahzad, K., Aslam, F., & Nawaz, Z. (2019). Urdu named entity
recognition: Corpus generation and deep learning applications. ACM Transactions
on Asian and Low-Resource Language Information Processing.
https://doi.org/10.1145/3329710
Khan, W., Daud, A., Nasir, J. A., & Amjad, T. (2016). Named entity dataset for Urdu named entity recognition task.
Kong, X., Shi, X., & Yu, P. S. (2011). Multi-label collective classification. Proceedings
of the 11th SIAM International Conference on Data Mining, SDM 2011.
https://doi.org/10.1137/1.9781611972818.53
Konstantinidis, K., Papadopoulos, S., & Kompatsiaris, Y. (2017). Exploring twitter
communication dynamics with evolving community analysis. PeerJ Computer
Science. https://doi.org/10.7717/peerj-cs.107
Li, H., Hu, Y., Gao, G., Shnitko, Y., Meyerzon, D., & D. M. (n.d.). Techniques for extracting authorship dates of documents. U.S. Patent Application 141,935.
Li, X., Zheng, Y., & Y. D. (2014). Discovering evolution of complex event based on correlations between events. IEEE.
Li, Q., Nourbakhsh, A., Shah, S., & Liu, X. (2017). Real-Time novel event detection from
social media. Proceedings - International Conference on Data Engineering.
https://doi.org/10.1109/ICDE.2017.157
Liao, W., & Veeramachaneni, S. (2009). A simple semi-supervised algorithm for named
entity recognition. https://doi.org/10.3115/1621829.1621837
Kowalski, R. M., & Limber, S. P. (n.d.). Psychological, physical, and academic correlates of cyberbullying and traditional bullying. Journal of Adolescent Health, 13–20.
Ling, X., & Weld, D. S. (2010). Temporal information extraction. Proceedings of the
National Conference on Artificial Intelligence.
Liu, G., & Guo, J. (2019). Bidirectional LSTM with attention mechanism and
convolutional layer for text classification. Neurocomputing.
https://doi.org/10.1016/j.neucom.2019.01.078
Llidó, D., Berlanga, R., & Aramburu, M. J. (2001). Extracting temporal references to
assign document event-time periods. Lecture Notes in Computer Science (Including
Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics). https://doi.org/10.1007/3-540-44759-8_8
Lu, Z., Yu, W., Zhang, R., Li, J., & H. W. (2015). Discovering event evolution chain in microblog. In 2015 IEEE International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), and 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS).
Eberhard, D. M., Simons, G. F., & Fennig, C. D. (2019). Ethnologue: Languages of the World. SIL International.
Malik, M. K., & Sarwar, S. M. (2016). Named entity recognition system for postpositional languages: Urdu as a case study. International Journal of Advanced Computer Science and Applications, 141–147.
McMinn, A. J., Moshfeghi, Y., & Jose, J. M. (2013). Building a large-scale corpus for
evaluating event detection on twitter. International Conference on Information and
Knowledge Management, Proceedings. https://doi.org/10.1145/2505515.2505695
Mehmood, K., Essam, D., & Shafi, K. (2019). Sentiment analysis system for Roman
Urdu. Advances in Intelligent Systems and Computing. https://doi.org/10.1007/978-
3-030-01174-1_3
Mohamad, A. Y., Mustapha, S. S., & Razali, M. S. (2010). Automatic Event Detection on
Reuters News.
Nadeau, D., & Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticæ Investigationes, 30(1). https://doi.org/10.1075/li.30.1.03nad
Naughton, M., Stokes, N., & Carthy, J. (2010). Sentence-level event classification in unstructured texts. Information Retrieval. https://doi.org/10.1007/s10791-009-9113-0
Naz, M., Akram, Q. U. A., & Hussain, S. (2013). Binarization and its evaluation for Urdu
Nastalique document images. 2013 16th International Multi Topic Conference,
INMIC 2013. https://doi.org/10.1109/INMIC.2013.6731352
Nuij, W., Milea, V., Hogenboom, F., Frasincar, F., & Kaymak, U. (2014). An automated
framework for incorporating news into stock trading strategies. IEEE Transactions
on Knowledge and Data Engineering. https://doi.org/10.1109/TKDE.2013.133
O’Keeffe, G. S., Clarke-Pearson, K., Mulligan, D. A., Altmann, T. R., Brown, A.,
Christakis, D. A., Falik, H. L., Hill, D. L., Hogan, M. J., Levine, A. E., & Nelson, K.
G. (2011). Clinical report - The impact of social media on children, adolescents, and
families. In Pediatrics. https://doi.org/10.1542/peds.2011-0054
Pal, U., & Sarkar, A. (2003). Recognition of printed Urdu script. Proceedings of the
International Conference on Document Analysis and Recognition, ICDAR.
https://doi.org/10.1109/ICDAR.2003.1227844
Panagiotou, N., Katakis, I., & Gunopulos, D. (2016). Detecting events in online social
networks: Definitions, trends and challenges. Lecture Notes in Computer Science
(Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics). https://doi.org/10.1007/978-3-319-41706-6_2
Parikh, R., & Karlapalem, K. (2013). ET: Events from tweets. In Proceedings of the 22nd International Conference on World Wide Web, 613–620.
Petrovic, S., Osborne, M., & Lavrenko, V. (2010). The Edinburgh Twitter Corpus.
Computational Linguistics.
Pillac, V., Guéret, C., & Medaglia, A. L. (2012). An event-driven optimization
framework for dynamic vehicle routing. Decision Support Systems.
https://doi.org/10.1016/j.dss.2012.06.007
Riaz, K. (2008). Concept search in urdu. International Conference on Information and
Knowledge Management, Proceedings. https://doi.org/10.1145/1458550.1458557
Riaz, K. (2010). Rule-Based Named Entity Recognition in Urdu. Proceedings of the 2010
Named Entities Workshop.
Ritter, A., Wright, E., Casey, W., & Mitchell, T. (2015). Weakly supervised extraction of
computer security events from twitter. WWW 2015 - Proceedings of the 24th
International Conference on World Wide Web.
https://doi.org/10.1145/2736277.2741083
Lavanya, S., Kavipriya, R., Yang, Y., Carbonell, J. Q., Brown, R. D., Archibald, B., & X. L. (2014). A survey on event detection in news streams. 2(5), 33–35.
Xu, S. (2018). Bayesian naïve Bayes classifiers to text classification. Journal of Information Science, 48–59.
Sarker, A., & Gonzalez, G. (2015). Portable automatic text classification for adverse drug
reaction detection via multi-corpus training. Journal of Biomedical Informatics.
https://doi.org/10.1016/j.jbi.2014.11.002
Sharjeel, M., Nawab, R. M. A., & Rayson, P. (2017). COUNTER: corpus of Urdu news
text reuse. Language Resources and Evaluation. https://doi.org/10.1007/s10579-016-
9367-2
Singh, J. P., Dwivedi, Y. K., Rana, N. P., Kumar, A., & Kapoor, K. K. (2019). Event
classification and location prediction from tweets during disasters. Annals of
Operations Research. https://doi.org/10.1007/s10479-017-2522-3
Singh, U. P., Goyal, V., & Lehal, G. S. (2012). Named entity recognition system for
Urdu. 24th International Conference on Computational Linguistics - Proceedings of
COLING 2012: Technical Papers.
Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for
classification tasks. Information Processing and Management.
https://doi.org/10.1016/j.ipm.2009.03.002
Soomro, S. M. G., & T. R. (2019). Current status of Urdu on Twitter. Sukkur IBA Journal of Computing and Mathematical Sciences.
Tomas, K. (2015). Event detection from text data. Computational Intelligence, 1312–164.
Usman, M., Shafique, Z., Ayub, S., & Malik, K. (2016). Urdu Text Classification using
Majority Voting. International Journal of Advanced Computer Science and
Applications. https://doi.org/10.14569/ijacsa.2016.070836
Valueva, M. V., Nagornov, N. N., Lyakhov, P. A., Valuev, G. V., & Chervyakov, N. I.
(2020). Application of the residue number system to reduce hardware costs of the
convolutional neural network implementation. Mathematics and Computers in
Simulation. https://doi.org/10.1016/j.matcom.2020.04.031
Walenz, B., Gandhi, R., Mahoney, W., & Zhu, Q. (2010). Exploring social contexts along
the time dimension: Temporal analysis of named entities. Proceedings - SocialCom
2010: 2nd IEEE International Conference on Social Computing, PASSAT 2010: 2nd
IEEE International Conference on Privacy, Security, Risk and Trust.
https://doi.org/10.1109/SocialCom.2010.80
Wei, C. P., & Lee, Y. H. (2004). Event detection from online news documents for
supporting environmental scanning. Decision Support Systems.
https://doi.org/10.1016/S0167-9236(03)00028-9
Woodward, D. (2001). Extraction and Visualization of Temporal Information and Related
Named Entities from Wikipedia. Springs.
Wu, H. C., Luk, R. W. P., Wong, K. F., & Kwok, K. L. (2008). Interpreting TF-IDF term
weights as making relevance decisions. ACM Transactions on Information Systems.
https://doi.org/10.1145/1361684.1361686
Xu, J. M., Jun, K. S., Zhu, X., & Bellmore, A. (2012). Learning from bullying traces in
social media. NAACL HLT 2012 - 2012 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies,
Proceedings of the Conference.
Y., Z. (2016). The analysis of cases based on decision tree. In 2016 7th IEEE International Conference on Software Engineering and Service Science (ICSESS).
Yaghoobzadeh, Y., Ghassem-Sani, G., Mirroshandel, S. A., & Eshaghzadeh, M. (2012).
ISO-TimeML event extraction in persian text. 24th International Conference on
Computational Linguistics - Proceedings of COLING 2012: Technical Papers.
Yakushiji, A., Tateisi, Y., Miyao, Y., & Tsujii, J. (2001). Event extraction from
biomedical papers using a full parser. Pacific Symposium on Biocomputing. Pacific
Symposium on Biocomputing. https://doi.org/10.1142/9789814447362_0040
Yang, Y., Pierce, T., & Carbonell, J. (1998). Study on retrospective and on-line event
detection. SIGIR Forum (ACM Special Interest Group on Information Retrieval).
https://doi.org/10.1145/290941.290953
Zaraket, F., & Makhlouta, J. (2012). Arabic Temporal Entity Extraction using
Morphological Analysis. 3(1), 121–136.
Zhang, T., & Oles, F. J. (2001). Text Categorization Based on Regularized Linear
Classification Methods. Information Retrieval.
https://doi.org/10.1023/A:1011441423217
Zhang, Y. (2012). Support vector machine classification algorithm and its application.
Communications in Computer and Information Science. https://doi.org/10.1007/978-
3-642-34041-3_27
Zhou, H., Huang, M., Zhang, T., Zhu, X., & Liu, B. (2018). Emotional chatting machine: Emotional conversation generation with internal and external memory. 32nd AAAI Conference on Artificial Intelligence, AAAI 2018.
Zia, T., Akhter, M. P., & Abbas, Q. (2015). Comparative study of feature selection
approaches for Urdu text categorization. Malaysian Journal of Computer Science.
Appendix
A. Regular Expressions for Temporal Entities
1. Numeric day _ Urdu month _ Numeric year
2. Numeric day _ Urdu month
3. Urdu day _ Numeric month _ Urdu year (e.g., چودہ 12 دو ہزار بیس)

More expressions for the different date formats are given below. For readability, the long Urdu alternations are defined once and reused; s denotes the input Urdu text.

months = "جنوری|فروری|مارچ|اپریل|مئی|جون|جولائی|اگست|ستمبر|اکتوبر|نومبر|دسمبر"
# Day words (1st to 31st)
days = "یکم|دو|تین|چار|پانچ|چھ|سات|آٹھ|نو|دس|گیارہ|بارہ|تیرہ|چودہ|پندرہ|سولہ|سترہ|اٹھارہ|انیس|بیس|اکیس|بائیس|تئیس|چوبیس|پچیس|چھبیس|ستائیس|اٹھائیس|انتیس|تیس|اکتیس"
# Number words (1 to 99), used for the year part after دوہزار or انیس سو
units = "ایک|دو|تین|چار|پانچ|چھ|سات|آٹھ|نو|دس|گیارہ|بارہ|تیرہ|چودہ|پندرہ|سولہ|سترہ|اٹھارہ|انیس|بیس|اکیس|بائیس|تئیس|چوبیس|پچیس|چھبیس|ستائیس|اٹھائیس|انتیس|تیس|اکتیس|بتیس|تینتیس|چونتیس|پینتیس|چھتیس|سینتیس|اڑتیس|انتالیس|چالیس|اکتالیس|بیالیس|تینتالیس|چوالیس|پینتالیس|چھیالیس|سینتالیس|اڑتالیس|انچاس|پچاس|اکاون|باون|تریپن|چون|پچپن|چھپن|ستاون|اٹھاون|انسٹھ|ساٹھ|اکسٹھ|باسٹھ|تریسٹھ|چونسٹھ|پینسٹھ|سڑسٹھ|اڑسٹھ|انہتر|ستر|اکہتر|بہتر|تہتر|چوہتر|پچھتر|چھہتر|ستتر|اٹھتر|اناسی|اسی|اکاسی|بیاسی|تریاسی|چوراسی|پچاسی|چھیاسی|ستاسی|اٹھاسی|نواسی|نوے|اکانوے|بانوے|ترانوے|چورانوے|پچانوے|چھیانوے|ستانوے|اٹھانوے|ننانوے"

# Numeric day + Urdu month + numeric year
date1 = re.findall(r"(\d+\s)+(" + months + r")\s+(\d+)", s)

# Numeric day + Urdu month + Urdu year (دوہزار + number word)
expression1 = re.findall(r"\d+\s+(" + months + r")\s+(دوہزار)\s+(" + units + ")", s)

# Urdu day word + numeric part + Urdu year
expression2 = re.findall("(" + days + r")\s+\d+\s+(دوہزار)\s+(" + units + ")", s)

1. Keyword سنہ and a complete Urdu year, e.g. سنہ دوہزارتین
2. Month and a complete Urdu year, e.g. جنوری دوہزارپانچ
Weakness: although this gives the expected output, it has a basic issue: it treats every related piece of the string as a date, which makes the output noisy; it returns dates together with other matched words.
3. Complete Urdu date, e.g. تین مارچ دوہزارچھ
Weakness: it extracts all fully qualified dates accurately, but it produces some noise in the output, i.e., it also returns some related keywords.
4. Month / سال 2001 / سنہ 2001 …

# Keyword سنہ followed by a year written in words (انیس سو … or دوہزار …)
regex4a = re.findall("(سنہ)\s+(انیس سو|دوہزار)\s*(" + units + ")", s)

# Month name followed by a year written in words
regex4b = re.findall("(" + months + r")\s+(دوہزار)\s*(" + units + ")", s)

# Month with an adjacent numeric day or year, in either order
regex4c = re.findall("(" + months + r")\s*\d+|\d+\s*(" + months + ")", s)

# Numeric year introduced by سنہ or سال
regex4d = re.findall(r"سنہ\s+\d+|سال\s+\d{1,4}", s)

The regular expressions above cover Month/Year with the keywords سنہ and سال.

5. Fully qualified Urdu date with the keyword سنہ

# سنہ + Urdu day word + month + Urdu year in words
regex5 = re.findall("(سنہ)\s+(" + days + r")\s+(" + months + r")\s+(دوہزار)\s+(" + units + ")", s)

Table 1: Different formats of extracted dates. Sample strings matched by the five expression groups above include:

جنوری سنہ دوہزارایک، تین جنوری دوہزارایک، سنہ 2012، دو اکتوبر دوہزار چار، یکم مارچ دوہزاردو، سنہ دوہزاردو، دس فروری دوہزار چار، تین نومبر، اپریل دوہزارتین، سنہ دوہزارتین، سال 2025، فروری سنہ دوہزارچار، تین مارچ دوہزارچار، سنہ آٹھ دسمبر دوہزار ایک، دو اپریل، مئی دوہزارپانچ، سات اپریل، جون دوہزارچھ، سنہ دس مارچ 2000، جولائی سنہ دوہزارتین، دوہزارسات، دو مئی دوہزارچھ، دوہزار پانچ

# Urdu day word + month + numeric year
expression3 = re.findall("(" + days + r")\s+(" + months + r")\s+\d+", s)

# Numeric day + month + numeric year (same pattern as date1)
expression4 = re.findall(r"(\d+\s)+(" + months + r")\s+(\d+)", s)

The preceding expressions extract only fully qualified Urdu dates written in Urdu words, e.g. چودہ دسمبر دو ہزار بیس (14/12/2020).

# سنہ + day word + month + Urdu year in words
expression5 = re.findall("(سنہ)\s+(" + days + r")\s+(" + months + r")\s+(دوہزار)\s+(" + units + ")", s)

# Day word + month + year written as دوہزار … or انیس سو …
expression6 = re.findall("(" + days + r")\s+(" + months + r")\s+(دوہزار|انیس سو)\s+(" + units + ")", s)
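The extraction approach of this appendix can be sketched as a small runnable program. The word lists below are abridged for brevity (the full alternations appear in the expressions above), and the sample sentence is only illustrative:

import re

# Abridged alternations; the appendix defines the complete 1–31 and 1–99 lists.
MONTHS = "جنوری|فروری|مارچ|اپریل|مئی|جون|جولائی|اگست|ستمبر|اکتوبر|نومبر|دسمبر"
DAYS = "یکم|دو|تین|چار|پانچ|چودہ"          # day words, abridged
UNITS = "ایک|دو|تین|چار|پانچ|بیس"          # year number words, abridged

# Urdu day word + month + year written in words (دوہزار + number word);
# \s* allows the year to be written as one word, e.g. دوہزاربیس
fully_qualified = re.compile(rf"({DAYS})\s+({MONTHS})\s+(دوہزار)\s*({UNITS})")

# Numeric day + Urdu month + numeric year, e.g. "5 جنوری 2021"
numeric_date = re.compile(rf"(\d{{1,2}})\s+({MONTHS})\s+(\d{{4}})")

def extract_dates(text):
    """Return all date-like spans matched by either pattern."""
    spans = [m.group(0) for m in fully_qualified.finditer(text)]
    spans += [m.group(0) for m in numeric_date.finditer(text)]
    return spans

sample = "اجلاس چودہ دسمبر دوہزار بیس کو ہوا اور اگلا اجلاس 5 جنوری 2021 کو ہو گا"
print(extract_dates(sample))

As noted in the weaknesses above, patterns like these can also fire on partial fragments, so in practice the matched spans are post-filtered.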
B. List of Stop Words in Urdu Language
Source: https://raw.githubusercontent.com/SyedMuhammadMuhsinKarim/Urdu-Stop-Words/main/stop_words.txt
آجاؤ آج آئے آئیں آئی آو آؤ آ
آپکو آپکا آپ آجکل آجاو آجائیے آجائیں
اسکی اسکا اسطرح اس ابھی اب آیا آپکی
انکی انکا ان الگ اطراف اسے اسی اسکے
اونچے اونچی اونچا اور انہیں انھیں انھوں انکے
اہم اگرچہ اگر اکثر اپنے اپنی اپنا اوپر
بظاہر بس بذریعہ باہم باہر بارے بار بائیں
بیشک بیشتر بھی بہت بند بلاشبہ بغیر بعد
تعداد ترین تر تجھے تجھ تب تاہم بےشک
تمھیں تمھارے تمھاری تمھارا تمکو تمام تم تلک
تھی تھا تک تو تمہیں تمہارے تمہاری تمہارا
جاتی جاتا جائیں تیسرے تیسری تیسرا تھے تھیں
جو جبکہ جبھی جبہی جب جانے جانا جاتے
دائیں خود جیسےکہ جیسے جیسی جیساکہ جیسا جہاں
دوسری دوسرا دوران دور دو دفعہ درمیان
دیکھو دیکھا دینے دینی دینا دی دونوں دوسرے
رکھے رکھیں رکھی رکھا رکھ دے دیکھیں دیکھی
سارا ساتھ زیادہ رہے رہیں رہی رہا رہ
سکتیں سکتی سکتا سبہی سبھی سب سارے ساری
صرف صحیح شدہ شبہ شاید سے سوا سکتے
لو لازمی لا لئے غیر غلط طرف طرح
لینی لینا لیا لی لگے لگیں لگی لگا
مجھے مجھکو مجھ لے لیئے لیے لیکن لینے
ملی ملو ملا مل مشتمل مزید مرتبہ مربوط
نھیں نا میں میرے میری میرا مگر ملے
والا و نے نیچے نیچی نیچا نہیں نہ
پاس ویں وہی وہاں وہ والے والی والوں
پھر پورے پوری پورا پڑی پڑے پڑا پر
چاہا پہلےسے پہلے پہلی پہلا پھرے پھری پھرو
چاہیئے چاہی چاہنا چاہتے چاہتیں چاہتی چاہتا
کب کا چکے چکیں چکی چکا چاہے چاہیے
کردو کرتےہو کرتے کرا کر کتنا کبھی
کرنا کررہے کررہیں کررہی کررہا کردی کردیے کردیا
کرواسکتی کرواسکتا کروانے کروانا کرو کرنے کرنی
کرچکیں کرچکی کرچکا کروائے کروائی کروایا کرواسکتے
کس کرے کریں کرسکتیں کرسکتے کرسکتی کرسکتا کرچکے
کچھ کون کوئی کو کمی کم کل کسی
کہنا کہتے کہتی کہتا کہاں کہا کہ
کیا کی کہے کہیں کہوں کہو کہنے کہنی
کیں کیوں کیلیے کیلئے کیسے کیساتھ کیجیے کیجئے
ہاں گے گیا گی گئے گئیں گا کے
ہو ہمیں ہمارے ہماری ہمارا ہم ہر
ہوسکتا ہورہے ہورہی ہورہا ہوا ہوئے
ہوچکا ہونے ہونگے ہونگی ہونگا ہونا ہوسکتے ہوسکتی
ہوگے ہوگیا ہوگی ہوگا ہوگئی ہوچکی
یا ہے ہیں ہی ہوے ہوئے ہوں ہوگئے
یہاں یہ یوں
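A minimal sketch of stop-word filtering with this list. The set below is only a small subset of the full list above (which can also be loaded from the linked stop_words.txt file), and the sample sentence is illustrative:

# Small subset of the Urdu stop-word list in this appendix
URDU_STOP_WORDS = {
    "آج", "اب", "اس", "اور", "بھی", "پر", "سے", "کا", "کی",
    "کے", "کو", "کیا", "میں", "نے", "ہے", "ہیں", "یہ",
}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in URDU_STOP_WORDS]

tokens = "حکومت نے آج نئی پالیسی کا اعلان کیا".split()
print(remove_stop_words(tokens))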
C. PUBLISHED WORK
1. Ali, D., Missen, M. M. S., & Husnain, M. (2021). Multiclass Event Classification from Text. Scientific Programming. https://doi.org/10.1155/2021/6660651
2. Ali, D., Missen, M. M. S., Memon, M. A., Nizamani, M. A., & Shaikh, A. (2020). Extracting Temporal Entity from Urdu Language Text. University of Sindh Journal of Information and Communication Technology, 4(3), 181–188. Retrieved from https://sujo.usindh.edu.pk/index.php/USJICT/article/view/2886
Special Thanks
All prostration is for Allah Subhanahu wa Ta'ala.
I am very grateful to these honourable people, especially my supervisor and the Head of the Department.
Dr. Dost Muhammad Khan, HoD (Assistant Professor) ________________________
Dr. Malik Muhammad Saad Missen (Assistant Professor) ________________________
Dr. Mujtaba Husnain (Assistant Professor) ________________________
Dr. Najia Saher (Assistant Professor) ________________________
Dr. Muhammad Omer (Assistant Professor) ________________________
Dr. Waheed Anwar (Assistant Professor) ________________________
I would specially thank my MS fellows Mr. Zahid Khurshid and Mr. Muzammil
Zubair.