language identification system for code-mixed social media text analysis
Post on 22-Jan-2018
Hello!
Language Identification System for Code-Mixed Social Media Text Analysis
Problems
Multilingual speakers often switch between languages.
Mixing multiple languages together (code mixing) is a popular trend among social media users.
This complicates automatic language identification, which shifts from the document level to the word level.
Motivation
Enhance social media analysis in language-dense areas.
Challenges
Creation of new lexical and syntactic structures (e.g. code mixing at the morpheme level).
Interflow of dissimilar grammars when languages are combined.
Classification of words and phrases that have been assimilated into one language from another (e.g. Bottle).
Classification of words that exist in more than one language (e.g. Hum).
Defining canonical forms for word normalization.
Conditional random fields (CRF)
A class of statistical modeling methods often applied in pattern recognition and machine learning, and used for structured prediction.
A feature function takes as input the sentence, the position of a word, and the labels, and outputs a real-valued number (usually 0 or 1).
Features are converted to probabilities by assigning them weights, summing the weighted features, exponentiating, and normalizing.
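The weighting, exponentiating, and normalizing steps can be sketched in a few lines of Python. This is only an illustration: the three feature functions, their weights, the tiny Hindi word set, and the reduced label set are all made up here, and the brute-force enumeration of label sequences stands in for the dynamic programming a real CRF toolkit would use.

```python
import itertools
import math

LABELS = ["En", "Hi", "Univ"]  # reduced label set, for illustration only
HINDI_WORDS = {"accha", "karo"}  # hypothetical mini lexicon

# Hypothetical feature functions: (sentence, position, prev_label, label) -> 0 or 1.
def f_mention(sentence, i, prev_label, label):
    return 1.0 if sentence[i].startswith("@") and label == "Univ" else 0.0

def f_hindi_word(sentence, i, prev_label, label):
    return 1.0 if sentence[i].lower() in HINDI_WORDS and label == "Hi" else 0.0

def f_same_as_prev(sentence, i, prev_label, label):
    return 1.0 if prev_label == label else 0.0

FEATURES = [f_mention, f_hindi_word, f_same_as_prev]
WEIGHTS = [2.0, 1.5, 0.5]  # made-up values; a trained CRF learns these from data

def score(sentence, labels):
    """Sum of weighted feature values over every position in the sentence."""
    total = 0.0
    for i, label in enumerate(labels):
        prev_label = labels[i - 1] if i > 0 else None
        for weight, feature in zip(WEIGHTS, FEATURES):
            total += weight * feature(sentence, i, prev_label, label)
    return total

def sequence_probabilities(sentence):
    """Exponentiate each candidate's score and normalize over all candidates."""
    candidates = list(itertools.product(LABELS, repeat=len(sentence)))
    exp_scores = [math.exp(score(sentence, c)) for c in candidates]
    z = sum(exp_scores)  # normalization constant
    return {c: s / z for c, s in zip(candidates, exp_scores)}

probs = sequence_probabilities(["@user", "Accha"])
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

With these invented weights, the highest-probability labeling of the two tokens is ("Univ", "Hi"), and the probabilities over all nine candidate sequences sum to 1.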
Dataset format
@buttmona098  @HamzaIdrees  Accha  Topic  Change  karo
Univ          Univ          Hi     En     En      Hi

@imp13196  ATM   me  Cash  nhi  hai
Univ       Acro  Hi  En    Hi   Hi
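Tagged tweets like these need to be turned into the column layout that CRF++ reads: one token per line, whitespace-separated columns with the label last, and a blank line between sentences. A minimal sketch, assuming the tokens and tags are already split and aligned (the function name `to_crfpp_lines` is hypothetical):

```python
def to_crfpp_lines(sentences):
    """Convert (tokens, tags) pairs into CRF++-style lines: 'token<TAB>tag'."""
    lines = []
    for tokens, tags in sentences:
        if len(tokens) != len(tags):
            raise ValueError("each token needs exactly one tag")
        for token, tag in zip(tokens, tags):
            lines.append(f"{token}\t{tag}")
        lines.append("")  # blank line marks the end of a sentence
    return lines

sentences = [
    ("@imp13196 ATM me Cash nhi hai".split(), "Univ Acro Hi En Hi Hi".split()),
]
lines = to_crfpp_lines(sentences)
print("\n".join(lines))
```

Extra feature columns, if any, would go between the token and the tag; the label stays in the last column.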
CRF++ tool
CRF++ is a simple, customizable, open source implementation of CRF for segmenting and labeling sequential data. It is designed for general-purpose use and can be applied to a variety of NLP tasks, such as Named Entity Recognition, Information Extraction, and Text Chunking.
% crf_learn template_file train_file model_file
% crf_test -m model_file test_files
https://taku910.github.io/crfpp/#format
Example of template file
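The template shown on this slide is not part of the transcript. As a stand-in, the sketch below generates a small CRF++ template in the documented `%x[row,col]` macro syntax: unigram templates over a window of neighbouring tokens, plus a single `B` line for label bigrams. The window size, the column choice, and the helper name `build_template` are assumptions for illustration, not the authors' actual template.

```python
def build_template(window=2, columns=(0,), bigram=True):
    """Build CRF++ template lines: one unigram macro per (offset, column)."""
    lines = []
    index = 0
    for col in columns:
        for offset in range(-window, window + 1):
            # %x[offset,col] refers to column `col` of the token `offset`
            # rows away from the current one.
            lines.append(f"U{index:02d}:%x[{offset},{col}]")
            index += 1
    if bigram:
        lines.append("")
        lines.append("B")  # combine previous and current output labels
    return lines

template_lines = build_template(window=1, columns=(0,))
print("\n".join(template_lines))
```

Written to a file, this output could serve as the `template_file` argument of `crf_learn`.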
Example of input file
Initial approach
Precision-Recall Table:
           Precision  Recall  F1-Score  Support
En         0.77       0.81    0.79      2015
Ne         0.39       0.46    0.42      304
Hi         0.87       0.83    0.85      2817
Univ       0.98       0.98    0.98      1011
Mixed      0.00       0.00    0.00      3
Acro       0.81       0.55    0.66      78
Avg/Total  0.83       0.83    0.83      6228
Confusion Matrix: (figure not included in transcript)
Features
Affixes (Prefix, Suffix)
Length of Word
First Character in Uppercase (Binary Feature)
All Characters in Uppercase (Binary Feature)
If contains any symbol (Binary Feature)
Stop-Words for English
Previous Word / Next Word
Etc.
Total Features : 10
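A per-token extractor for the features listed on this slide might look like the sketch below. The affix length, the tiny stop-word set, the boundary markers, and the function name `word_features` are assumptions; the slides do not give the exact definitions, and this sketch covers only the features named above rather than all 10.

```python
import string

# Tiny illustrative stop-word set; a real system would load a full list.
ENGLISH_STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def word_features(tokens, i, affix_len=3):
    """Return a dict of surface features for the token at position i."""
    word = tokens[i]
    return {
        "prefix": word[:affix_len],
        "suffix": word[-affix_len:],
        "length": len(word),
        "first_upper": word[:1].isupper(),
        "all_upper": word.isupper(),
        "has_symbol": any(ch in string.punctuation for ch in word),
        "is_en_stop_word": word.lower() in ENGLISH_STOP_WORDS,
        # Context features, with placeholder markers at sentence boundaries.
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }

tokens = "@imp13196 ATM me Cash nhi hai".split()
features = word_features(tokens, 1)
print(features)
```

Each value would become one extra column in the CRF++ input file, placed before the label column.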
Final approach
Precision-Recall Table:
           Precision  Recall  F1-Score  Support
En         0.94       0.96    0.95      3381
Ne         0.80       0.59    0.68      238
Hi         0.89       0.89    0.89      1476
Univ       0.99       1.00    0.99      974
Mixed      0.00       0.00    0.00      0
Acro       0.84       0.64    0.73      58
Avg/Total  0.93       0.93    0.93      6127
Confusion Matrix (rows: actual, columns: predicted):
       En    Ne   Hi    Univ  Mixed  Acro
En     3232  17   127   4     0      1
Ne     68    141  28    0     0      1
Hi     138   18   1315  0     0      5
Univ   4     0    0     970   0      0
Mixed  0     0    0     0     0      0
Acro   12    0    4     5     0      37
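The final-approach precision, recall, and F1 figures can be recomputed from its confusion matrix. The sketch below assumes that rows are actual labels and columns are predicted labels, in the order En, Ne, Hi, Univ, Mixed, Acro; that reading is an inference, chosen because it reproduces the support counts and scores in the table.

```python
LABELS = ["En", "Ne", "Hi", "Univ", "Mixed", "Acro"]
CONFUSION = [
    [3232, 17, 127, 4, 0, 1],
    [68, 141, 28, 0, 0, 1],
    [138, 18, 1315, 0, 0, 5],
    [4, 0, 0, 970, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [12, 0, 4, 5, 0, 37],
]

def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, f1, support)} from a confusion matrix."""
    metrics = {}
    for k, label in enumerate(labels):
        true_pos = matrix[k][k]
        support = sum(matrix[k])                  # row sum: actual count
        predicted = sum(row[k] for row in matrix)  # column sum: predicted count
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1, support)
    return metrics

def accuracy(matrix):
    """Fraction of tokens on the diagonal (correctly labeled)."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

metrics = per_class_metrics(CONFUSION, LABELS)
for label in LABELS:
    p, r, f1, support = metrics[label]
    print(f"{label:6s} {p:.2f} {r:.2f} {f1:.2f} {support}")
print(f"accuracy {accuracy(CONFUSION):.2f}")
```

For the En class this gives precision 0.94, recall 0.96, and F1 0.95 on 3381 tokens, and the overall accuracy comes out at about 0.93, in line with the reported figures.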
Features (final)
Affixes (Prefix, Suffix)
Length of Word
First Character in Uppercase (Binary Feature)
All Characters in Uppercase (Binary Feature)
If contains any symbol (Binary Feature)
Stop-Words for English
Previous Word / Next Word
List of trained Hindi Words (Binary Feature)
Numerical Data Feature
Etc.
Total Features : 17
Hidden Markov model (HMM)
A probabilistic sequence model: given a sequence of units (words, letters, morphemes, sentences), it computes a probability distribution over possible label sequences and chooses the best one.
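Choosing the best label sequence in an HMM is typically done with the Viterbi algorithm. The sketch below is a first-order (bigram) version in log space with made-up start, transition, and emission probabilities and a fixed small probability for unknown words. It is a toy illustration, not the trigram model with smoothing that a tagger such as TnT uses.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable state sequence for the observations."""
    if not observations:
        return []
    # best[t][s] = (log-probability of the best path ending in s at t, backpointer)
    best = [{}]
    for s in states:
        emit = emit_p[s].get(observations[0], unk)
        best[0][s] = (math.log(start_p[s]) + math.log(emit), None)
    for t in range(1, len(observations)):
        best.append({})
        for s in states:
            emit = math.log(emit_p[s].get(observations[t], unk))
            prev, log_prob = max(
                ((p, best[t - 1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = (log_prob + emit, prev)
    # Follow backpointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return path[::-1]

# Made-up toy parameters for two languages.
states = ["En", "Hi"]
start_p = {"En": 0.5, "Hi": 0.5}
trans_p = {"En": {"En": 0.7, "Hi": 0.3}, "Hi": {"En": 0.3, "Hi": 0.7}}
emit_p = {
    "En": {"topic": 0.5, "change": 0.5},
    "Hi": {"accha": 0.5, "karo": 0.5},
}

tags = viterbi(["accha", "topic", "change", "karo"], states, start_p, trans_p, emit_p)
print(tags)
```

With these toy parameters the decoder labels the four words Hi, En, En, Hi, because any other path has to pay the unknown-word penalty at least once.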
TnT tool
TnT (Trigrams'n'Tags) is a very efficient statistical part-of-speech tagger that can be trained on different languages and virtually any tagset. The parameter generation component trains on tagged corpora. The system incorporates several methods for smoothing and for handling unknown words.
http://www.coli.uni-saarland.de/~thorsten/tnt/
% ./tnt-para train_file
% ./tnt train_file test_file
Example of TnT
Example of TnT output file
CRF vs. HMM
CRF: Accuracy 93%
HMM: Accuracy 67%
Thank You!
Created by: Parth Desai, Shreshta Bhat
Under the supervision of: Harish B., Manish Shrivastava