Language Identification System for Code-Mixed Social Media Text Analysis


Upload: parth-desai

Post on 22-Jan-2018

Category: Social Media

TRANSCRIPT

Page 1: Language Identification System For Code-Mixed Social Media Text Analysis

Hello!

Page 2: Language Identification System For Code-Mixed Social Media Text Analysis

Language Identification System for Code-Mixed Social Media Text Analysis

Page 3: Language Identification System For Code-Mixed Social Media Text Analysis

Problems

Multilingual speakers often switch between languages.

Mixing multiple languages together (code-mixing) is a popular trend among social media users.

This complicates automatic language identification, because the task shifts from the document level to the word level.

Page 4: Language Identification System For Code-Mixed Social Media Text Analysis

Motivation

Improve social media analysis in language-dense (highly multilingual) areas.

Page 5: Language Identification System For Code-Mixed Social Media Text Analysis

Challenges

Creation of new lexical and syntactic structures (e.g. code-mixing at the morpheme level).

Interflow of dissimilar grammars when languages are combined.

Classification of words and phrases that have been assimilated into one language from another (e.g. "Bottle").

Classification of words that exist in more than one language (e.g. "Hum").

Defining canonical forms for word normalization.

Page 6: Language Identification System For Code-Mixed Social Media Text Analysis

Conditional Random Fields (CRF)

A class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction.

A feature function takes as input the sentence, the position of a word, and the labels, and outputs a real-valued number (usually either 0 or 1).

Features are converted to probabilities by assigning them weights, summing the weighted features, exponentiating the sum, and normalizing.
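The weighting, exponentiating, and normalizing steps of a CRF can be illustrated with a toy example. The sketch below is not the model from the slides: the three feature functions, their weights, and the two-label set are made up for illustration, and normalization is done by brute force over every label sequence, which only works for very short sentences.

```python
import itertools
import math

LABELS = ["En", "Hi"]  # reduced label set, for illustration only

# Hypothetical binary feature functions f(sentence, i, prev_label, label).
def f_capitalized_en(sentence, i, prev_label, label):
    return 1.0 if sentence[i][0].isupper() and label == "En" else 0.0

def f_same_as_prev(sentence, i, prev_label, label):
    return 1.0 if prev_label == label else 0.0

def f_ends_in_o_hi(sentence, i, prev_label, label):
    return 1.0 if sentence[i].lower().endswith("o") and label == "Hi" else 0.0

FEATURES = [f_capitalized_en, f_same_as_prev, f_ends_in_o_hi]
WEIGHTS = [1.5, 0.5, 2.0]  # made-up values; a trained CRF would learn these

def score(sentence, labels):
    """Sum of weighted feature values over all positions."""
    total = 0.0
    for i, label in enumerate(labels):
        prev_label = labels[i - 1] if i > 0 else None
        for weight, feature in zip(WEIGHTS, FEATURES):
            total += weight * feature(sentence, i, prev_label, label)
    return total

def label_distribution(sentence):
    """Exponentiate each sequence score and normalize over all sequences."""
    candidates = list(itertools.product(LABELS, repeat=len(sentence)))
    exp_scores = [math.exp(score(sentence, c)) for c in candidates]
    z = sum(exp_scores)  # normalization constant
    return {c: s / z for c, s in zip(candidates, exp_scores)}

dist = label_distribution(["Topic", "Change", "karo"])
best = max(dist, key=dist.get)
print(best, round(dist[best], 3))
```

Real implementations such as CRF++ avoid enumerating all sequences and compute the normalization constant with dynamic programming instead.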

Page 7: Language Identification System For Code-Mixed Social Media Text Analysis

Dataset format

@buttmona098  @HamzaIdrees  Accha  Topic  Change  karo
Univ          Univ          Hi     En     En      Hi

@imp13196  ATM   me  Cash  nhi  hai
Univ       Acro  Hi  En    Hi   Hi
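A minimal sketch of reading this format, assuming each sentence is stored as one line of whitespace-separated tokens followed by one line of language tags. The function name is hypothetical; the example sentence is the second one from the slide.

```python
def pair_tokens_with_tags(token_line, tag_line):
    """Align a line of tokens with a line of language tags."""
    tokens = token_line.split()
    tags = tag_line.split()
    if len(tokens) != len(tags):
        raise ValueError(
            f"{len(tokens)} tokens but {len(tags)} tags: lines are misaligned"
        )
    return list(zip(tokens, tags))

pairs = pair_tokens_with_tags(
    "@imp13196 ATM me Cash nhi hai",
    "Univ Acro Hi En Hi Hi",
)
for token, tag in pairs:
    print(f"{token}\t{tag}")
```

The length check matters for word-level tagging: one missing tag would silently shift every later label by one position.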

Page 8: Language Identification System For Code-Mixed Social Media Text Analysis

CRF++ tool

CRF++ is a simple, customizable, open source implementation of CRFs for segmenting and labeling sequential data. It is designed for general-purpose use and can be applied to a variety of NLP tasks, such as named entity recognition, information extraction, and text chunking.

% crf_learn template_file train_file model_file

% crf_test -m model_file test_files

https://taku910.github.io/crfpp/#format
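CRF++ expects one token per line, with whitespace-separated columns, the answer tag in the last column, and an empty line between sentences. The sketch below writes tagged sentences in that layout. The helper name is hypothetical, only the token and tag columns are emitted (the slides' feature columns are not reproduced), and the tag alignment of the first sentence is inferred from the slide.

```python
def to_crfpp(sentences):
    """Render tagged sentences as CRF++ training text.

    Each sentence is a list of (token, tag) pairs. Every row must have
    the same number of columns, and the tag goes in the last one.
    """
    blocks = []
    for sentence in sentences:
        rows = [f"{token}\t{tag}" for token, tag in sentence]
        blocks.append("\n".join(rows))
    # An empty line marks the boundary between sentences.
    return "\n\n".join(blocks) + "\n"

sentences = [
    # Tag alignment for this sentence is an assumption (inferred from the slide).
    [("@buttmona098", "Univ"), ("@HamzaIdrees", "Univ"), ("Accha", "Hi"),
     ("Topic", "En"), ("Change", "En"), ("karo", "Hi")],
    [("@imp13196", "Univ"), ("ATM", "Acro"), ("me", "Hi"),
     ("Cash", "En"), ("nhi", "Hi"), ("hai", "Hi")],
]
text = to_crfpp(sentences)
print(text)
```

The resulting text can be saved as the train_file argument to crf_learn.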

Page 9: Language Identification System For Code-Mixed Social Media Text Analysis

Example of template file

Page 10: Language Identification System For Code-Mixed Social Media Text Analysis

Example of input file

Page 11: Language Identification System For Code-Mixed Social Media Text Analysis

Initial approach

Precision-Recall Table:

Label      Precision  Recall  F1-Score  Support
En         0.77       0.81    0.79      2015
Ne         0.39       0.46    0.42      304
Hi         0.87       0.83    0.85      2817
Univ       0.98       0.98    0.98      1011
Mixed      0.00       0.00    0.00      3
Acro       0.81       0.55    0.66      78
Avg/Total  0.83       0.83    0.83      6228

Confusion Matrix: (not captured in the transcript)

Page 12: Language Identification System For Code-Mixed Social Media Text Analysis

Features

Affixes (prefix, suffix)
Length of word
First character in uppercase (binary feature)
All characters in uppercase (binary feature)
Contains any symbol (binary feature)
Stop words for English
Previous word / next word
Etc.

Total features: 10

Page 13: Language Identification System For Code-Mixed Social Media Text Analysis

Final approach

Precision-Recall Table:

Label      Precision  Recall  F1-Score  Support
En         0.94       0.96    0.95      3381
Ne         0.80       0.59    0.68      238
Hi         0.89       0.89    0.89      1476
Univ       0.99       1.00    0.99      974
Mixed      0.00       0.00    0.00      0
Acro       0.84       0.64    0.73      58
Avg/Total  0.93       0.93    0.93      6127

Confusion Matrix (rows: true label, columns: predicted label):

        En    Ne   Hi    Univ  Mixed  Acro
En      3232  17   127   4     0      1
Ne      68    141  28    0     0      1
Hi      138   18   1315  0     0      5
Univ    4     0    0     970   0      0
Mixed   0     0    0     0     0      0
Acro    12    0    4     5     0      37
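The final-approach precision, recall, and F1 figures can be recomputed from the confusion matrix. The sketch below assumes rows are true labels and columns are predicted labels, in the order En, Ne, Hi, Univ, Mixed, Acro. That reading is an inference: the slide does not label the axes, but the row sums match the Support column.

```python
LABELS = ["En", "Ne", "Hi", "Univ", "Mixed", "Acro"]  # assumed axis order

# Rows: true label, columns: predicted label (assumption, see above).
MATRIX = [
    [3232, 17, 127, 4, 0, 1],
    [68, 141, 28, 0, 0, 1],
    [138, 18, 1315, 0, 0, 5],
    [4, 0, 0, 970, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [12, 0, 4, 5, 0, 37],
]

def per_label_metrics(matrix, labels):
    """Return {label: (precision, recall, f1, support)}."""
    metrics = {}
    for k, label in enumerate(labels):
        true_positive = matrix[k][k]
        predicted = sum(row[k] for row in matrix)  # column sum
        actual = sum(matrix[k])                    # row sum (support)
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1, actual)
    return metrics

def accuracy(matrix):
    """Fraction of tokens on the diagonal."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

metrics = per_label_metrics(MATRIX, LABELS)
for label in LABELS:
    p, r, f1, support = metrics[label]
    print(f"{label:<6}{p:>6.2f}{r:>6.2f}{f1:>6.2f}{support:>6}")
print(f"accuracy {accuracy(MATRIX):.2f}")
```

Under this reading the overall accuracy is 5695 / 6127, which rounds to 0.93.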

Page 14: Language Identification System For Code-Mixed Social Media Text Analysis

Features (final)

Affixes (prefix, suffix)
Length of word
First character in uppercase (binary feature)
All characters in uppercase (binary feature)
Contains any symbol (binary feature)
Stop words for English
Previous word / next word
List of trained Hindi words (binary feature)
Numerical data feature
Etc.

Total features: 17
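A rough sketch of how the listed word-level features could be extracted. This is not the authors' exact feature set: the affix length, the tiny stop-word and Hindi word lists, the boundary markers, and the reading of "numerical data" as "contains a digit" are all assumptions made for illustration.

```python
import string

# Tiny illustrative lists; real lists would be far larger (assumption).
ENGLISH_STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}
HINDI_WORDS = {"accha", "karo", "nhi", "hai", "me"}

def word_features(tokens, i, affix_len=3):
    """Features for tokens[i]; affix_len is a hypothetical parameter."""
    word = tokens[i]
    lower = word.lower()
    return {
        "prefix": lower[:affix_len],
        "suffix": lower[-affix_len:],
        "length": len(word),
        "first_upper": word[:1].isupper(),
        "all_upper": word.isupper(),
        "has_symbol": any(ch in string.punctuation for ch in word),
        "is_en_stop_word": lower in ENGLISH_STOP_WORDS,
        "in_hindi_list": lower in HINDI_WORDS,
        "has_digit": any(ch.isdigit() for ch in word),
        # Placeholders mark the sentence boundaries.
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = "@imp13196 ATM me Cash nhi hai".split()
for key, value in word_features(tokens, 1).items():
    print(f"{key}: {value}")
```

For a tool like CRF++, each of these values would become one column of the input file.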

Page 15: Language Identification System For Code-Mixed Social Media Text Analysis

Hidden Markov Model (HMM)

Probabilistic sequence model: given a sequence of units (words, letters, morphemes, sentences), it computes a probability distribution over possible sequences of labels and chooses the best label sequence.
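Choosing the best label sequence under an HMM is typically done with the Viterbi algorithm. The sketch below uses a first-order (bigram) HMM with two labels and made-up start, transition, and emission probabilities. None of these numbers come from the slides, and words missing from the emission table are not handled.

```python
import math

STATES = ["En", "Hi"]
# All probabilities below are invented for illustration.
START = {"En": 0.5, "Hi": 0.5}
TRANS = {"En": {"En": 0.7, "Hi": 0.3}, "Hi": {"En": 0.3, "Hi": 0.7}}
EMIT = {
    "En": {"topic": 0.4, "change": 0.4, "karo": 0.1, "accha": 0.1},
    "Hi": {"topic": 0.1, "change": 0.1, "karo": 0.4, "accha": 0.4},
}

def viterbi(words):
    """Return the most probable state sequence for words (log-space)."""
    best = [{s: math.log(START[s]) + math.log(EMIT[s][words[0]])
             for s in STATES}]
    back = []  # back[t][s] = best previous state for state s at step t + 1
    for word in words[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES,
                       key=lambda p: best[-1][p] + math.log(TRANS[p][s]))
            scores[s] = (best[-1][prev] + math.log(TRANS[prev][s])
                         + math.log(EMIT[s][word]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(STATES, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

print(viterbi(["accha", "topic", "change", "karo"]))
```

Unlike the CRF, this model sees only the word identity and the previous label, not arbitrary features such as affixes or capitalization.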

Page 16: Language Identification System For Code-Mixed Social Media Text Analysis

TnT tool

TnT (Trigrams'n'Tags) is a very efficient statistical POS tagger that is trainable on different languages and virtually any tagset. The component for parameter generation trains on tagged corpora. The system incorporates several methods of smoothing and of handling unknown words.

http://www.coli.uni-saarland.de/~thorsten/tnt/

% ./tnt-para train_file

% ./tnt train_file test_file

Page 17: Language Identification System For Code-Mixed Social Media Text Analysis

Example of TnT

Page 18: Language Identification System For Code-Mixed Social Media Text Analysis

Example of TnT output file

Page 19: Language Identification System For Code-Mixed Social Media Text Analysis

CRF vs. HMM

Page 20: Language Identification System For Code-Mixed Social Media Text Analysis

CRF: Accuracy 93%

HMM: Accuracy 67%

Page 21: Language Identification System For Code-Mixed Social Media Text Analysis


Thank You!

Parth Desai

Shreshta Bhat

Harish B. Manish Shrivastava

Created By :

Under the supervision of :