Language Identification System for Code-Mixed Social Media Text Analysis

Hello!


Problems

Multilingual speakers often switch between languages. Mixing multiple languages together (code-mixing) is a popular trend among social media users. This complicates automatic language identification, because the task shifts from the document level to the word level.

Motivation

Enhance social media analysis in language-dense areas.

Challenges

Creation of new lexical and syntactic structures (e.g. code-mixing at the morpheme level).

Interflow of dissimilar grammars when languages are combined.

Classification of words and phrases that have been assimilated into one language from another (e.g. "Bottle").

Classification of words that exist in more than one language (e.g. "Hum").

Defining canonical forms for word normalization.


Conditional Random Fields (CRF)

A class of statistical modeling methods often applied in pattern recognition and machine learning, and used for structured prediction.

A feature function takes as input the sentence, the position of a word, and the labels as defined, and outputs a real-valued number (usually either 0 or 1).

Features are converted to probabilities by assigning them weights, summing the weighted features, exponentiating the sum, and normalizing.
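The CRF description can be made concrete with a small brute-force sketch. The two feature functions, their weights, and the two-label set are hypothetical and chosen only for illustration; a real CRF learns the weights from data and uses dynamic programming instead of enumerating every label sequence.

```python
import itertools
import math

LABELS = ["En", "Hi"]  # hypothetical two-label set for illustration


def f_capitalized_en(sentence, i, prev_label, label):
    """Return 1 if the word starts with an uppercase letter and is labeled En."""
    return 1.0 if sentence[i][0].isupper() and label == "En" else 0.0


def f_same_as_previous(sentence, i, prev_label, label):
    """Return 1 if the label repeats the previous label (a transition feature)."""
    return 1.0 if prev_label == label else 0.0


# Hypothetical weights; in practice these are learned during training.
FEATURES = [(f_capitalized_en, 1.5), (f_same_as_previous, 0.8)]


def score(sentence, labels):
    """Sum the weighted feature values over every position in the sentence."""
    total = 0.0
    for i, label in enumerate(labels):
        prev_label = labels[i - 1] if i > 0 else None
        for func, weight in FEATURES:
            total += weight * func(sentence, i, prev_label, label)
    return total


def label_distribution(sentence):
    """Exponentiate each sequence score and normalize over all label sequences."""
    candidates = list(itertools.product(LABELS, repeat=len(sentence)))
    exp_scores = [math.exp(score(sentence, c)) for c in candidates]
    z = sum(exp_scores)  # normalization constant
    return {c: s / z for c, s in zip(candidates, exp_scores)}


dist = label_distribution(["Topic", "Change", "karo"])
best = max(dist, key=dist.get)
```

Here `dist` holds one probability per candidate label sequence, and `best` is the highest-scoring sequence under these toy weights.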

Dataset format

@buttmona098  @HamzaIdrees  Accha  Topic  Change  karo
Univ          Univ          Hi     En     En      Hi

@imp13196  ATM   me  Cash  nhi  hai
Univ       Acro  Hi  En    Hi   Hi
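A minimal sketch of reading this format, assuming each tweet is stored as a line of whitespace-separated tokens followed by a line with one label per token. The helper name `pair_tokens_with_labels` is hypothetical.

```python
def pair_tokens_with_labels(text_line, label_line):
    """Pair each whitespace-separated token with the label in the same position."""
    tokens = text_line.split()
    labels = label_line.split()
    if len(tokens) != len(labels):
        raise ValueError(
            f"{len(tokens)} tokens but {len(labels)} labels: {text_line!r}"
        )
    return list(zip(tokens, labels))


pairs = pair_tokens_with_labels(
    "@imp13196 ATM me Cash nhi hai",
    "Univ Acro Hi En Hi Hi",
)
```

The length check catches misaligned annotations early, which matters because every later step works on word-level labels.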

CRF++ tool

CRF++ is a simple, customizable, open-source implementation of CRF for segmenting and labeling sequential data. It is designed for generic use and can be applied to a variety of NLP tasks, such as Named Entity Recognition, Information Extraction, and Text Chunking.

% crf_learn template_file train_file model_file

% crf_test -m model_file test_files

https://taku910.github.io/crfpp/#format

Example of template file
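A sketch of how a template file could be generated, following the `%x[row,col]` macro syntax described in the CRF++ documentation (row is the position relative to the current token, col is the absolute column in the input file). The helper `build_template`, the window size, and the number of feature columns are assumptions for illustration, not the template used in this project.

```python
def build_template(feature_columns, window=1):
    """Build CRF++ template text with one unigram line per (column, offset) pair."""
    lines = []
    index = 0
    for col in range(feature_columns):
        for offset in range(-window, window + 1):
            # U = unigram template; the number only makes each template unique.
            lines.append(f"U{index:02d}:%x[{offset},{col}]")
            index += 1
    lines.append("B")  # bigram template over adjacent output labels
    return "\n".join(lines) + "\n"


template = build_template(feature_columns=2, window=1)
```

With two feature columns and a window of one, this yields six unigram lines covering the previous, current, and next token in each column, followed by the single `B` line.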

Example of input file
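A sketch of producing a CRF++ input file, assuming the column format from the CRF++ documentation: one token per line, whitespace-separated columns, the label in the last column, and an empty line between sentences. The single uppercase-initial column and the helper name `to_crfpp_lines` are illustrative assumptions.

```python
def to_crfpp_lines(sentences):
    """Serialize (token, label) sentences into CRF++ column format."""
    lines = []
    for sentence in sentences:
        for token, label in sentence:
            # One illustrative feature column: does the token start uppercase?
            first_upper = "1" if token[:1].isupper() else "0"
            lines.append(f"{token}\t{first_upper}\t{label}")
        lines.append("")  # empty line marks the end of the sentence
    return "\n".join(lines) + "\n"


text = to_crfpp_lines([[("ATM", "Acro"), ("me", "Hi")]])
```

Every line must carry the same number of columns, so additional features are added as further columns before the label.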

Initial approach

Precision-Recall Table:

Label      Precision  Recall  F1-Score  Support
En         0.77       0.81    0.79      2015
Ne         0.39       0.46    0.42      304
Hi         0.87       0.83    0.85      2817
Univ       0.98       0.98    0.98      1011
Mixed      0.00       0.00    0.00      3
Acro       0.81       0.55    0.66      78
Avg/Total  0.83       0.83    0.83      6228

[Figure: confusion matrix for the initial approach]

Features

Affixes (prefix, suffix)

Length of word

First character in uppercase (binary feature)

All characters in uppercase (binary feature)

Contains any symbol (binary feature)

Stop words for English

Previous word / next word

Etc.

Total features: 10
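The listed features could be extracted along these lines. This is a sketch, not the exact set of 10 features used: the affix length, the boundary markers, and the tiny stop-word sample are assumptions, and `word_features` is a hypothetical helper.

```python
import string

# Tiny hypothetical sample; a real system would load a full English stop-word list.
ENGLISH_STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}


def word_features(tokens, i, affix_len=3):
    """Return the per-word features named in the list for the token at index i."""
    word = tokens[i]
    return {
        "prefix": word[:affix_len],
        "suffix": word[-affix_len:],
        "length": len(word),
        "first_upper": word[:1].isupper(),
        "all_upper": word.isupper(),
        "has_symbol": any(ch in string.punctuation for ch in word),
        "is_stop_word": word.lower() in ENGLISH_STOP_WORDS,
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i + 1 < len(tokens) else "<EOS>",
    }


tokens = "@imp13196 ATM me Cash nhi hai".split()
features = word_features(tokens, 1)
```

For the token "ATM", the all-uppercase feature fires, which is the kind of cue that helps separate acronyms from ordinary words.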

Final approach

Precision-Recall Table:

Label      Precision  Recall  F1-Score  Support
En         0.94       0.96    0.95      3381
Ne         0.80       0.59    0.68      238
Hi         0.89       0.89    0.89      1476
Univ       0.99       1.00    0.99      974
Mixed      0.00       0.00    0.00      0
Acro       0.84       0.64    0.73      58
Avg/Total  0.93       0.93    0.93      6127

Confusion Matrix (rows: actual label, columns: predicted label; each row sums to that label's support):

       En    Ne   Hi    Univ  Mixed  Acro
En     3232  17   127   4     0      1
Ne     68    141  28    0     0      1
Hi     138   18   1315  0     0      5
Univ   4     0    0     970   0      0
Mixed  0     0    0     0     0      0
Acro   12    0    4     5     0      37
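The precision-recall figures for the final approach can be recomputed from its confusion matrix. The sketch below assumes rows are actual labels and columns are predicted labels, in the order En, Ne, Hi, Univ, Mixed, Acro; that reading is consistent with the row sums matching the support column.

```python
LABELS = ["En", "Ne", "Hi", "Univ", "Mixed", "Acro"]

# Final-approach confusion matrix: rows = actual, columns = predicted (assumed).
CONFUSION = [
    [3232, 17, 127, 4, 0, 1],
    [68, 141, 28, 0, 0, 1],
    [138, 18, 1315, 0, 0, 5],
    [4, 0, 0, 970, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [12, 0, 4, 5, 0, 37],
]


def per_label_metrics(matrix):
    """Return (precision, recall, f1, support) for each label index."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_positive = matrix[k][k]
        support = sum(matrix[k])                         # actual count of label k
        predicted = sum(matrix[r][k] for r in range(n))  # times k was predicted
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1, support))
    return results


metrics = dict(zip(LABELS, per_label_metrics(CONFUSION)))
correct = sum(CONFUSION[k][k] for k in range(len(LABELS)))
accuracy = correct / sum(sum(row) for row in CONFUSION)
```

Overall accuracy is the diagonal sum divided by the total number of tokens, 5695 / 6127.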

Features (final)

Affixes (prefix, suffix)

Length of word

First character in uppercase (binary feature)

All characters in uppercase (binary feature)

Contains any symbol (binary feature)

Stop words for English

Previous word / next word

List of trained Hindi words (binary feature)

Numerical data feature

Etc.

Total features: 17
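The two features added in the final approach might look like the sketch below. It assumes the Hindi word list is built from training tokens labeled Hi and that the numerical feature checks for digits; both readings, and the helper names, are assumptions rather than the project's actual definitions.

```python
def build_hindi_lexicon(tagged_sentences):
    """Collect lowercased tokens that carry the Hi label in the training data."""
    return {
        token.lower()
        for sentence in tagged_sentences
        for token, label in sentence
        if label == "Hi"
    }


def extra_features(word, hindi_lexicon):
    """Return the lexicon-membership and digit features for one word."""
    return {
        "in_hindi_list": word.lower() in hindi_lexicon,
        "has_digit": any(ch.isdigit() for ch in word),
    }


training = [[("ATM", "Acro"), ("me", "Hi"), ("Cash", "En"), ("nhi", "Hi")]]
lexicon = build_hindi_lexicon(training)
```

Lowercasing both the lexicon and the lookup keeps the feature robust to the inconsistent capitalization common in social media text.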

Hidden Markov Model (HMM)

A probabilistic sequence model: given a sequence of units (words, letters, morphemes, sentences), it computes a probability distribution over possible label sequences and chooses the best label sequence.
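Choosing the best label sequence in an HMM is typically done with Viterbi decoding, sketched below for a first-order model. All probabilities are hypothetical toy values, and the fixed `unknown_p` fallback for unseen words is a simplification, not the smoothing a real tagger would use.

```python
import math


def viterbi(words, labels, start_p, trans_p, emit_p, unknown_p=1e-6):
    """Return the most probable label sequence for words under a first-order HMM."""
    if not words:
        return []
    # best[i][label] = (log-probability of the best path ending here, previous label)
    best = [{}]
    for label in labels:
        emit = math.log(emit_p[label].get(words[0], unknown_p))
        best[0][label] = (math.log(start_p[label]) + emit, None)
    for i in range(1, len(words)):
        best.append({})
        for label in labels:
            emit = math.log(emit_p[label].get(words[i], unknown_p))
            prev = max(
                labels,
                key=lambda p: best[i - 1][p][0] + math.log(trans_p[p][label]),
            )
            score = best[i - 1][prev][0] + math.log(trans_p[prev][label]) + emit
            best[i][label] = (score, prev)
    # Backtrack from the best final label.
    path = [max(labels, key=lambda l: best[-1][l][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


# Hypothetical toy model for illustration only.
labels = ["En", "Hi"]
start_p = {"En": 0.5, "Hi": 0.5}
trans_p = {"En": {"En": 0.7, "Hi": 0.3}, "Hi": {"En": 0.3, "Hi": 0.7}}
emit_p = {
    "En": {"Topic": 0.5, "Change": 0.5},
    "Hi": {"Accha": 0.5, "karo": 0.5},
}
tags = viterbi(["Accha", "Topic", "Change", "karo"], labels, start_p, trans_p, emit_p)
```

Working in log space avoids numeric underflow when the per-word probabilities are multiplied along long sentences.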

TnT tool

TnT (Trigrams'n'Tags) is a very efficient statistical POS tagger that can be trained on different languages and virtually any tagset. The parameter-generation component trains on tagged corpora. The system incorporates several methods for smoothing and for handling unknown words.

http://www.coli.uni-saarland.de/~thorsten/tnt/

% ./tnt-para train_file

% ./tnt train_file test_file

Example of TnT

Example of TnT output file

CRF vs. HMM

CRF: accuracy 93%

HMM: accuracy 67%

Thank You!

Created by: Parth Desai, Shreshta Bhat

Under the supervision of: Harish B., Manish Shrivastava