SNOW Workshop, 8th April 2014
Real-time topic detection with bursty ngrams:
RGU participation in SNOW 2014 challenge
Carlos Martin and Ayse Goker (Robert Gordon University)
Outline
• Architecture diagram
• Results
• Future work
Architecture diagram
[Diagram: Crawler, Entities Extractor and Solr components; data flows labelled Tweets (English), Tweets and Tweets (with Entities).]
Architecture diagram
[Diagram: Crawler, Entities Extractor, BNgram, Keyword Extractor, Topic Aggregator, Solr, Topics Combiner, Query Builder and Topic Labeller components; data flows labelled Tweets (English), Tweets, Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls), Merged topics, Topics (+ tweets) and Topics (+ label).]
Entities Extractor
• Extract entities per tweet using Stanford NER (http://nlp.stanford.edu/software/CRF-NER.shtml).
• The 3-class model identifies Person, Location and Organization entities.
• Efficient enough for a real-time system.
Architecture diagram
[Diagram: Crawler, Entities Extractor, BNgram and Solr components; data flows labelled Tweets (English), Tweets, Tweets (with Entities) and Ranked topics.]
BNgram approach
• Detection of bursty ngrams based on the df-idf score.
– Bursty entities, hashtags and urls are also included in the approach.
– Regarding ngrams, only 2- and 3-grams are considered (no unigrams anymore).
• df-idf is a variant of tf-idf: terms that were frequent in previous timeslots are penalized.
• Terms containing hashtags, entities or urls are boosted.
• Two previous timeslots (s=2) were considered in our experiments.
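The slide does not spell out the df-idf formula. The sketch below assumes the formulation usually given for BNgram, (df_i + 1) / (log(mean df over the s previous timeslots + 1) + 1), with a natural logarithm. The `boost` parameter and the example numbers are hypothetical and only illustrate the penalization and boosting described above.

```python
import math

def df_idf(df_current, df_previous, boost=1.0):
    """Burstiness score of a term in the current timeslot (assumed formulation).

    df_current  -- number of tweets containing the term in the current timeslot
    df_previous -- document frequencies of the term in the s previous timeslots
    boost       -- hypothetical multiplier (> 1 for hashtags, entities, urls)
    """
    s = len(df_previous)
    mean_previous = sum(df_previous) / s if s else 0.0
    # Terms that were already frequent in earlier timeslots get a larger denominator.
    return boost * (df_current + 1) / (math.log(mean_previous + 1) + 1)

# Same current frequency, different history (s = 2 previous timeslots).
new_term = df_idf(40, [0, 1])        # barely seen before: high score
steady_term = df_idf(40, [38, 42])   # always frequent: penalized
boosted = df_idf(40, [0, 1], boost=1.5)
```

With s=2, a term that was rare in both previous timeslots scores higher than an equally frequent term that was already common.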
BNgram approach
• A “partial” membership clustering approach is an interesting alternative, as one term can belong to different clusters (for example, the entity “Obama” for the stories “Obama wins in Ohio” and “Obama wins in Illinois”).
• The Apriori clustering algorithm was used in the SNOW challenge experiments.
• It explores maximal associations between terms based on the number of shared tweets.
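As a rough illustration of the idea above, the following sketch runs an Apriori-style level-wise search in which the support of a term set is the number of tweets shared by all its terms, and then keeps only the maximal sets. The function name, the `min_shared` threshold and the toy data are assumptions, not details taken from the system.

```python
from itertools import combinations

def maximal_term_sets(term_tweets, min_shared=2):
    """Apriori-style search for maximal sets of terms sharing >= min_shared tweets.

    term_tweets -- dict mapping each bursty term to the set of tweet ids containing it
    """
    def shared(terms):
        return set.intersection(*(term_tweets[t] for t in terms))

    # Level 1: single terms with enough supporting tweets.
    level = [frozenset([t]) for t in term_tweets if len(term_tweets[t]) >= min_shared]
    frequent = list(level)
    while level:
        # Join step: unite pairs of k-term sets that differ by exactly one term.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1}
        # Keep candidates whose terms co-occur in enough tweets.
        level = [c for c in candidates if len(shared(c)) >= min_shared]
        frequent.extend(level)
    # A set is maximal if no other frequent set strictly contains it.
    return [s for s in frequent if not any(s < other for other in frequent)]

term_tweets = {
    "obama": {1, 2, 3, 4},
    "wins ohio": {1, 2},
    "wins illinois": {3, 4},
}
clusters = maximal_term_sets(term_tweets, min_shared=2)
```

In this toy data the term "obama" ends up in two clusters, one per story, which is the partial-membership behaviour the slide describes.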
BNgram approach
• Output: Clusters of trending terms with tweets from the last timeslot associated with them.
• A tweet should contain a minimum number of cluster terms to be included.
• Clusters are ranked by their bursty scores (maximum df-idf value of topic terms).
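The output step can be sketched as follows. Matching terms by lowercase substring, the `min_terms` value and the data structures are simplifying assumptions made for illustration only.

```python
def build_topics(clusters, term_scores, tweets, min_terms=2):
    """Attach tweets to clusters and rank clusters by their best term score.

    clusters    -- list of sets of terms
    term_scores -- dict term -> df-idf score
    tweets      -- dict tweet id -> text (tweets of the last timeslot)
    """
    topics = []
    for terms in clusters:
        # A tweet is included only if it contains at least min_terms cluster terms.
        matched = [
            tweet_id
            for tweet_id, text in tweets.items()
            if sum(term in text.lower() for term in terms) >= min_terms
        ]
        topics.append({
            "terms": terms,
            "tweets": matched,
            # Bursty score of the cluster: maximum df-idf among its terms.
            "score": max(term_scores[t] for t in terms),
        })
    return sorted(topics, key=lambda topic: topic["score"], reverse=True)

clusters = [{"obama", "wins ohio"}, {"champions league", "real madrid"}]
term_scores = {"obama": 3.0, "wins ohio": 5.0, "champions league": 9.0, "real madrid": 4.0}
tweets = {
    1: "Obama wins Ohio!",
    2: "Real Madrid through in the Champions League",
    3: "Obama speaks tonight",
}
ranked = build_topics(clusters, term_scores, tweets)
```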
Architecture diagram
[Diagram: Crawler, Entities Extractor, BNgram, Keyword Extractor, Topic Aggregator and Solr components; data flows labelled Tweets (English), Tweets, Tweets (with Entities), Ranked topics and Topics (+ keywords, entities, hashtags and urls).]
Keyword Extractor and Topic Aggregator modules
• Topic Aggregator module:
– Aggregate entities, hashtags and urls per topic (coming from topic tweets of the corresponding timeslot), keeping their frequencies.
– Keep those whose frequency is higher than a threshold.
• Keyword Extractor module:
– Extract main keywords (including ngrams) per topic (not extracted by the Topic Aggregator) using bursty terms from the clusters.
– Remove urls, hashtags, user mentions, entities and acronyms.
– Overlaps are also removed.
– Keep df-idf scores as their weights.
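A minimal sketch of both modules, under stated assumptions: tweets are represented as dicts with `entities`, `hashtags` and `urls` lists, the `threshold` value is hypothetical, and "overlaps" is read here as shorter ngrams contained in a longer kept ngram, which is only one possible interpretation.

```python
from collections import Counter

def aggregate_features(topic_tweets, threshold=1):
    """Count entities, hashtags and urls over a topic's tweets; keep frequent ones."""
    aggregated = {}
    for kind in ("entities", "hashtags", "urls"):
        counts = Counter(item for tweet in topic_tweets for item in tweet.get(kind, []))
        # Keep only items whose frequency is higher than the threshold.
        aggregated[kind] = {item: n for item, n in counts.items() if n > threshold}
    return aggregated

def remove_overlaps(keywords):
    """Drop ngrams contained in a longer kept ngram; keep df-idf scores as weights."""
    kept = {}
    for kw in sorted(keywords, key=lambda k: len(k.split()), reverse=True):
        if not any(f" {kw} " in f" {longer} " for longer in kept):
            kept[kw] = keywords[kw]
    return kept

topic_tweets = [
    {"entities": ["Obama"], "hashtags": ["#election"], "urls": []},
    {"entities": ["Obama", "Ohio"], "hashtags": ["#election"]},
    {"entities": ["Obama"], "hashtags": [], "urls": ["http://example.com/a"]},
]
features = aggregate_features(topic_tweets, threshold=1)
keywords = remove_overlaps({"wins in ohio": 4.2, "wins in": 3.1, "polls close": 2.0})
```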
Architecture diagram
[Diagram: Crawler, Entities Extractor, BNgram, Keyword Extractor, Topic Aggregator, Solr and Topics Combiner components; data flows labelled Tweets (English), Tweets, Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls) and Merged topics.]
Topic Combiner module
• Topic Combiner module:
– Merge similar topics from the same timeslot.
– Based on the co-occurrence of keywords (unigrams), entities, hashtags and urls from the compared topics.
– According to preliminary results, the Apriori algorithm makes this module more accurate, as one term could belong to different topics.
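One simple way to merge topics on shared features is sketched below. The slides do not state the exact merging rule, so the `min_shared` count and the greedy single-pass strategy are assumptions chosen for brevity.

```python
def merge_topics(topics, min_shared=2):
    """Greedily merge topics that share at least min_shared features.

    topics -- list of sets of features (keyword unigrams, entities, hashtags, urls)
    """
    merged = []
    for features in topics:
        features = set(features)
        remaining = []
        for other in merged:
            # Absorb an earlier topic when enough features co-occur.
            if len(features & other) >= min_shared:
                features |= other
            else:
                remaining.append(other)
        merged = remaining + [features]
    return merged

topics = [
    {"obama", "ohio", "#election"},
    {"real madrid", "#ucl"},
    {"obama", "#election", "illinois"},
]
combined = merge_topics(topics)
```

Because this is a single greedy pass, it does not guarantee a full transitive closure; a production version would likely iterate until no more merges occur.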
Architecture diagram
[Diagram: Crawler, Entities Extractor, BNgram, Keyword Extractor, Topic Aggregator, Solr, Topics Combiner and Query Builder components; data flows labelled Tweets (English), Tweets, Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls), Merged topics and Topics (+ tweets).]
Query Builder module
• Creation of final Solr queries to retrieve all the tweets related to the topic, also filtering by time (simulating a real-time scenario).
• 3 types of queries:
– Keywords
– Entities and Hashtags
– Urls
• If a topic has both keywords and entities, the keywords closest to the entities are selected.
• Image population: if tweets contain links to images (metadata), the images are added to the topic.
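The three query types and the time filter could be assembled roughly as below, using standard Solr query syntax (`field:(...)` clauses in `q` and a range filter in `fq`). The field names `text`, `entities`, `hashtags`, `urls` and `created_at`, the use of OR between values, and the timestamps are all hypothetical, since the slides do not describe the index schema.

```python
def any_of(field, values):
    """Build a Solr clause matching any of the quoted values in one field."""
    escaped = ['"{}"'.format(v.replace('"', '\\"')) for v in values]
    return "{}:({})".format(field, " OR ".join(escaped))

def build_queries(topic, start, end):
    """Return one Solr parameter dict per query type present in the topic."""
    # Range filter restricting results to the simulated real-time window.
    time_filter = "created_at:[{} TO {}]".format(start, end)
    queries = []
    if topic.get("keywords"):
        queries.append(any_of("text", topic["keywords"]))
    tagged = [any_of(f, topic[f]) for f in ("entities", "hashtags") if topic.get(f)]
    if tagged:
        queries.append(" OR ".join(tagged))
    if topic.get("urls"):
        queries.append(any_of("urls", topic["urls"]))
    return [{"q": q, "fq": time_filter} for q in queries]

topic = {"keywords": ["wins ohio"], "entities": ["Obama"], "hashtags": ["#election"], "urls": []}
queries = build_queries(topic, "2014-02-25T18:00:00Z", "2014-02-25T18:15:00Z")
```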
Query Builder module
• Replies are also considered. Be careful with spam replies.
• Replies are not text-query dependent. More diversity?
• Sentiment analysis, extraction of relevant keywords.
Query Builder module
• Diverse tweets are computed based on cosine similarity.
• This approach could be more or less strict depending on the selected threshold.
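A small sketch of threshold-based diversity selection: a tweet is kept only if its cosine similarity to every tweet already kept stays below the threshold. Whitespace tokenization, raw term counts as vectors and the threshold values are assumptions for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def diverse_tweets(tweets, threshold=0.5):
    """Greedily keep tweets that are dissimilar to all previously kept tweets."""
    selected, vectors = [], []
    for text in tweets:
        vec = Counter(text.lower().split())
        if all(cosine(vec, kept) < threshold for kept in vectors):
            selected.append(text)
            vectors.append(vec)
    return selected

tweets = [
    "Obama wins in Ohio",
    "RT Obama wins in Ohio",
    "Polls close in Illinois tonight",
]
strict = diverse_tweets(tweets, threshold=0.5)
loose = diverse_tweets(tweets, threshold=0.95)
```

A lower threshold is stricter: here it drops the near-duplicate retweet, while the higher threshold keeps all three tweets.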
Architecture diagram
[Diagram: Crawler, Entities Extractor, BNgram, Keyword Extractor, Topic Aggregator, Solr, Topics Combiner, Query Builder and Topic Labeller components; data flows labelled Tweets (English), Tweets, Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls), Merged topics, Topics (+ tweets) and Topics (+ label).]
Topic Labeller module
• BuzzFeed editor-in-chief Ben Smith: “Headlines sure look a lot like tweets these days.” (http://perryhewitt.com/5-lessons-buzzfeed-harvard/)
• For each topic tweet, a score is computed based on the following formula.
where α = 0.8. The tweet with the highest score is selected as the Topic label after cleaning it.
Topic Labeller module
• Example of tweets after cleaning them
• Granularity is still an issue: some topic labels are too general or too specific.
Results - Examples of topics
Future work
• Improve Topic Combiner module – use of similarity measures.
• Further research on the use of replies and diverse tweets per Topic.
• Improve Topic Labeller module – granularity issue.
• Modifications in QueryBuilder module – use of term weights (Solr).