salient named entity identification system
SALIENT NAMED ENTITIES IDENTIFICATION FROM TWEET
INFORMATION RETRIEVAL AND EXTRACTION APRIL 2015
Course Instructor: Dr. Vasudeva Varma
Mentor: Priya R
Ganesh J
Chiluveru
Sindhura Y. R.
Agenda
Problem Definition
Approach
Dataset Creation
Inter-annotator Agreement
Prediction algorithm
Prediction performance
Future directions
Conclusion
What are named entities ?
Named entities (NE) are phrases that clearly identify one item from a set of other items that have similar attributes*.
They generally fall into 4 categories – Name, Place, Organization and Event.
* - Definition from http://searchbusinessanalytics.techtarget.com/definition/named-entity
Examples of NE’s
NE type        Example
Name           Virat Kohli, Sundar Pichai
Place          Delhi, California
Organization   Indian cricket team, Google
Event          2015 Cricket World Cup
What are Salient NE’s ?
Salient NE’s (SNE)
are more central to the text.
capture the author’s intention.
are relatively important.
Basically, SNE’s are the keywords from a tweet, when these keywords are named entities.
Examples of SNE’s
Vicky went to house-warming ceremony of his professor Raj Reddy at Shankerpally, Telangana.
Person A1 can mark [‘Vicky’, ‘Raj Reddy’] as SNE’s.
Person A2 can mark [‘Raj Reddy’] as SNE’s.
Person A3 can mark [‘Vicky’, ‘Raj Reddy’, ‘Shankerpally’, ‘Telangana’] as SNE’s.
Of course, all are valid annotations.
Thus, SNE annotation is highly subjective in nature.
Problem Definition
Given a tweet, identify SNE’s present in it, if any.
Novel problem.
Applications
User modeling – Understand what user is talking about.
Predicting current trends – Till now done based on hashtags only.
Approach
Create a dataset with manually annotated SNE’s.
Show the inter-annotator agreement in annotating SNE’s.
Pose the problem as a sequence learning problem.
Build the prediction algorithm with a good set of features.
Dataset creation
Consider the CWC15* tweet,
Possible SNE combinations
1. Sangakkara and Sachin Tendulkar
2. Sangakkara
3. Sachin Tendulkar
* - http://en.wikipedia.org/wiki/2015_Cricket_World_Cup
Dataset creation (contd.)
Now consider the entire tweet (which includes image)
The person in the picture is Sangakkara*
* - http://en.wikipedia.org/wiki/Kumar_Sangakkara
Intuition – The tweeter’s picture actually captures their focus.
Dataset creation (contd.)
Idea – If a NE in the tweet is in the image, then it’s a SNE.
Pipeline: Collect tweets → Identify NE’s → Annotate SNE
Dataset creation (contd.)
Collect Tweets
Use Twitter4j API to get all the tweets with the hashtags* corresponding to the Cricket World Cup 2015 quarter-finals.
Obtain tweets which are
in English language only.
not re-tweets.
accompanied by at least one image.
* - #NZvWI,#WIvNZ,#PakvsAus,#AUSvPAK,#INDvsBAN,#BANvsIND,#SAvSL, #SLvSA
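The three filters above could be sketched roughly as follows. This is only an illustration in Python, not the Twitter4j (Java) code used for collection; the dict keys `lang`, `is_retweet` and `images` are hypothetical field names chosen for this sketch.

```python
def keep_tweet(tweet):
    """Return True if a tweet passes the three collection filters.

    The keys 'lang', 'is_retweet' and 'images' are hypothetical names
    for this sketch, not fields of any particular API.
    """
    is_english = tweet.get("lang") == "en"
    is_retweet = tweet.get("is_retweet", False)
    has_image = len(tweet.get("images", [])) >= 1
    return is_english and not is_retweet and has_image


# Toy records (made up for illustration).
tweets = [
    {"text": "Great knock! #SAvSL", "lang": "en", "is_retweet": False, "images": ["a.jpg"]},
    {"text": "RT Great knock! #SAvSL", "lang": "en", "is_retweet": True, "images": ["a.jpg"]},
    {"text": "No picture here #NZvWI", "lang": "en", "is_retweet": False, "images": []},
]
kept = [t for t in tweets if keep_tweet(t)]
```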
Dataset creation (contd.)
Identify NE’s
Used 3 state-of-the-art* Named Entity Recognizers (NER) for tweets “as is”.
Combined all their results to improve recall.
* - Analysis of Named Entity Recognition and Linking for Tweets - Information Processing & Management 51 (2), 32-49, 2014
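Combining the recognizers' results can be read as taking the union of their entity mentions. A minimal sketch, assuming each recognizer's output has already been reduced to a list of mention strings; the function name, the case-insensitive de-duplication and the sample outputs are assumptions for illustration.

```python
def combine_ner_outputs(*outputs):
    """Union entity mentions from several recognizers, keeping first-seen order.

    Mentions that differ only in case are treated as the same entity
    (an assumption of this sketch).
    """
    combined = []
    seen = set()
    for entities in outputs:
        for entity in entities:
            key = entity.lower()
            if key not in seen:
                seen.add(key)
                combined.append(entity)
    return combined


# Hypothetical outputs of three recognizers for the same tweet.
ner_a = ["Sangakkara"]
ner_b = ["Sachin Tendulkar"]
ner_c = ["sangakkara", "Sachin Tendulkar"]
all_entities = combine_ner_outputs(ner_a, ner_b, ner_c)
```

An entity missed by one recognizer is still kept if any other finds it, which is why the union favors recall (at some cost in precision).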
Dataset creation (contd.)
Identify NE’s
NER’s used:
University of Washington – Alan Ritter (https://github.com/aritter/twitter_nlp)
Carnegie Mellon University – ArkTweet (http://www.ark.cs.cmu.edu/TweetNLP/)
Stanford (http://nlp.stanford.edu/software/CRF-NER.shtml)
Dataset creation (contd.)
Annotate SNE’s
Created a web application using GWT* to do the manual annotation with ease.
Removed tweets with criteria (manually):
Pointless
Sarcasm
Duplicate (same text but not a re-tweet)
Images with non-English text.
Advertisements.
* - http://www.gwtproject.org/
Dataset creation (contd.)
Manual annotation interface
Inter-annotator agreement
Compute the inter-annotator agreement score for the annotated dataset.
Created a new dataset for three different domains using keywords - AppleWatch, SAvsNZ and NationalAwards.
Randomly sampled 20 tweets from each domain. (60 in total)
Asked 3 annotators to annotate the 60-tweet corpus.
Inter-annotator agreement (contd.)
Measure / Domain       All    Apple Watch   SA vs NZ   National Awards
Agreement Percentage   0.78   0.73          0.85       0.75
Cohen Kappa*           0.68   0.62          0.57       0.76
Fleiss Kappa*          0.60   0.75          0.40       0.65
* - http://en.wikipedia.org/wiki/Cohen%27s_kappa and http://en.wikipedia.org/wiki/Fleiss'_kappa
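For reference, the two kappa measures can be computed as sketched below. This follows the standard definitions; the toy labels are made up, and the slides do not state at what granularity the labels were compared or how Cohen's kappa was aggregated over the three annotators, so none of that is implied here.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two annotators' label proportions.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings holds one list of labels per item,
    with the same number of annotators for every item."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Proportion of agreeing annotator pairs for this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    observed = agreement_sum / n_items
    expected = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    return (observed - expected) / (1 - expected)


# Toy annotations (illustrative only).
a1 = ["SNE", "SNE", "O", "O"]
a2 = ["SNE", "O", "O", "O"]
three_raters = [["SNE", "SNE", "SNE"], ["SNE", "O", "O"], ["O", "O", "O"], ["O", "O", "O"]]
```

Both functions divide by zero when chance agreement equals 1 (every annotator uses a single label throughout); a real script would need to guard that case.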
Status
Problem Definition
Approach
Dataset Creation
Inter-annotator Agreement
SNE Prediction algorithm
SNE Prediction performance
Future directions
Conclusion
Sequence Labeling Algorithm
Variant of the classification problem. Given a sequence (in NLP, words), assign an appropriate label to each word.
Example, partial parsing (aka chunking):
The    cat    sat    on     the    mat
B-NP   I-NP   B-VP   B-PP   B-NP   I-NP
For a token, the target label also depends on the features of adjacent tokens.
SNE Prediction algorithm
Modeled the SNE identification problem as a sequence learning problem.
Used the Conditional Random Fields (CRF)* algorithm.
Used Alan Ritter’s Twitter NLP toolkit* to tokenize the tweets so as to extract features.
* - http://python-crfsuite.readthedocs.org/en/latest/ and https://github.com/aritter/twitter_nlp
SNE Prediction algorithm
Linguistic Features (from Alan Ritter)
Word – Actual token
POS tag – NNP, VBP, PRP, ...
Chunk POS tag – B-NP, I-NP, B-VP, I-VP, …
Entity tag – B-ENTITY, I-ENTITY and O
Target Label – B-SNE, I-SNE and O-SNE
Example
Word        POS Tag   Chunk POS Tag   Entity tag
Sachin      NNP       B-NP            B-ENTITY
Tendulkar   NNP       I-NP            I-ENTITY
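To make the B-SNE / I-SNE / O-SNE target labels concrete, the sketch below turns a labelled token sequence back into SNE phrases. The function name and the sample labels are illustrative assumptions, not taken from the project code.

```python
def extract_snes(tokens, labels):
    """Group tokens into SNE phrases from B-SNE / I-SNE / O-SNE labels."""
    phrases, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B-SNE":
            # A new SNE starts; close any phrase that was still open.
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif label == "I-SNE" and current:
            current.append(token)
        else:
            # O-SNE, or an I-SNE with no preceding B-SNE (ignored here).
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hypothetical labelling of a short token sequence.
tokens = ["Sangakkara", "and", "Sachin", "Tendulkar"]
labels = ["B-SNE", "O-SNE", "B-SNE", "I-SNE"]
snes = extract_snes(tokens, labels)
```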
SNE Prediction algorithm
CRF Features
Type   Feature                                                            Weight
Word   Lower : Change the case of word to lower case.                     3
Word   Upper : Change the case of word to upper case.                     1
Word   isTitle : Python’s built-in str.istitle method.                    1
Word   isUpper : Is the word in upper case.                               2
Word   isFirstCharHash : True if first character is ‘#’                   3
Word   isFirstCharHashOrAt : True if first character is ‘#’ or ‘@’        4
Word   isFirstCharCaps : True if first character is in uppercase.         3
SNE Prediction algorithm
CRF Features
Type     Feature                                                            Weight
POS      Postag : POS tag returned by Ritter                                4
POS      isStartsWithNN : True if pos tag starts with ‘NN’                  2
POS      isStartsWithNNorPR : True if pos tag starts with ‘NN’ or ‘PR’      1
Chunk    Chunk : Chunk POS tag returned by Ritter                           1
Chunk    isChunkNP : True if chunk pos tag is ‘B-NP’ or ‘I-NP’              3
Entity   Entity : Entity tag returned by Ritter                             4
Entity   isEntity : True if entity is B-ENTITY or I-ENTITY                  1
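The word, POS, chunk and entity features from the two tables could be extracted along the lines below. This is a sketch, not the project's code: the Weight column is left out, and the window is assumed to mean 5 tokens centered on the current one (two on each side), which the slides do not spell out.

```python
def token_features(word, pos, chunk, entity):
    """Features for one token, named after the rows of the feature tables."""
    return {
        "lower": word.lower(),
        "upper": word.upper(),
        "isTitle": word.istitle(),
        "isUpper": word.isupper(),
        "isFirstCharHash": word.startswith("#"),
        "isFirstCharHashOrAt": word.startswith(("#", "@")),
        "isFirstCharCaps": word[:1].isupper(),
        "postag": pos,
        "isStartsWithNN": pos.startswith("NN"),
        "isStartsWithNNorPR": pos.startswith(("NN", "PR")),
        "chunk": chunk,
        "isChunkNP": chunk in ("B-NP", "I-NP"),
        "entity": entity,
        "isEntity": entity in ("B-ENTITY", "I-ENTITY"),
    }


def sentence_features(rows, window=5):
    """One feature dict per token, including neighbors inside the window.

    rows is a list of (word, pos, chunk, entity) tuples; keys are
    prefixed with the neighbor's offset, e.g. '-1:lower' or '+0:postag'.
    """
    half = window // 2
    features = []
    for i in range(len(rows)):
        merged = {}
        for offset in range(-half, half + 1):
            j = i + offset
            if 0 <= j < len(rows):
                for name, value in token_features(*rows[j]).items():
                    merged[f"{offset:+d}:{name}"] = value
        features.append(merged)
    return features


rows = [("Sachin", "NNP", "B-NP", "B-ENTITY"), ("Tendulkar", "NNP", "I-NP", "I-ENTITY")]
feats = sentence_features(rows)
```

The neighbor-prefixed keys are one way to let the label of a token depend on the features of adjacent tokens.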
SNE Prediction algorithm
CRF Features - Word2Vec*
Algorithm that captures the context of words in the form of a vector.
Trained using all the tweets related to World Cup 2015 (about 250,000 tweets).
Example vector for the word ‘Sangakkara’: [-0.014, -0.135, … , -0.068]
* - http://deeplearning4j.org/word2vec.html
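One plausible way to hand such a vector to a CRF is to expose each dimension as a separate real-valued feature. The slides do not say how the vectors were actually combined with the other features, so the sketch below is an assumption; the function name and the toy 3-dimensional vector are made up.

```python
def embedding_features(word, vectors, prefix="w2v"):
    """Turn a word's embedding into one numeric feature per dimension.

    Words without a vector contribute no embedding features.
    """
    vector = vectors.get(word)
    if vector is None:
        return {}
    return {f"{prefix}_{i}": value for i, value in enumerate(vector)}


# Toy lookup table; real vectors would have many more dimensions.
toy_vectors = {"Sangakkara": [0.1, -0.2, 0.3]}
emb = embedding_features("Sangakkara", toy_vectors)
```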
SNE Prediction performance
Dataset Size = 1100
5-fold cross validation
Window size = 5
          Precision   Recall   F1-Score
B-SNE     0.67        0.48     0.56
I-SNE     0.57        0.35     0.43
O-SNE     0.93        0.97     0.95
Overall   0.90        0.91     0.90
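Per-label precision, recall and F1 of this kind can be computed from gold and predicted label sequences as sketched below (toy data, illustrative function name). F1 is the harmonic mean of precision and recall; for example, the B-SNE row is consistent with 2 × 0.67 × 0.48 / (0.67 + 0.48) ≈ 0.56.

```python
def per_label_scores(gold, predicted):
    """Return {label: (precision, recall, f1)} for token-level labels."""
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy gold and predicted labels (illustrative only).
gold = ["B-SNE", "I-SNE", "O-SNE", "O-SNE", "B-SNE", "O-SNE"]
pred = ["B-SNE", "O-SNE", "O-SNE", "O-SNE", "O-SNE", "O-SNE"]
scores = per_label_scores(gold, pred)
```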
Future directions
Improve CRF algorithm with better features.
Implement other classifiers: HMM, neural networks, …
Named Entity Linking
Map the SNE to a Knowledge base (KB) entry. For example, a SNE like ‘Kumar Sangakkara’ must be mapped to http://en.wikipedia.org/wiki/Kumar_Sangakkara
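As a naive illustration of the mapping in the example, the sketch below builds a Wikipedia URL directly from the SNE text. It is only a string transformation (the function name is made up): it neither checks that the page exists nor disambiguates between entities sharing a name, which is the hard part of entity linking.

```python
from urllib.parse import quote


def wikipedia_url(entity):
    """Build a candidate Wikipedia URL by joining the words with underscores."""
    title = "_".join(entity.strip().split())
    return "http://en.wikipedia.org/wiki/" + quote(title)


url = wikipedia_url("Kumar Sangakkara")
```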
Conclusion
SNE identification is a new problem with many potential applications (especially in social network analysis).
Thank you
Resources
Code - https://github.com/ganeshaspiring/ire-seimp
PPT – To be uploaded in Slideshare.