cs533 clustering tweet presentation

Upload: adie-suryadi

Post on 14-Feb-2018

215 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/23/2019 Cs533 Clustering Tweet Presentation

    1/19

    Hakan Szer, Muhammed Yamur ahin, Yaz Salor

    1 / 19

  • 7/23/2019 Cs533 Clustering Tweet Presentation

    2/19

    2 / 19

  • 7/23/2019 Cs533 Clustering Tweet Presentation

    3/19

    Twitter

    Users post millions of messages everyday

    on Twitter

    With different topics

    Hard to find topics that fits your interest

    Twitter has mechanisms built in like TrendingTopics(TT)

    3 / 19

  • 7/23/2019 Cs533 Clustering Tweet Presentation

    4/19

    Trending Topics(TT)

    Works by using frequency of

    o occurence of recent hastags

    o occurence of recent words

    in aspecific period[1]

    Not uses the topic of the entire sentence(meaning)

    Examples: "Election"Obama wins the election.

    Obama rocks, 4 more years.

    4 / 19

  • 7/23/2019 Cs533 Clustering Tweet Presentation

    5/19

    Topical Clustering of Tweets

    Automatically clustering and classifying tweets intodifferent categories

    inspired by the approaches taken by services likeGoogle News[2]

    will provide coherent tweet sets that has classifiers astheir topics

    which will help browsing of tweets

    5 / 19

  • 7/23/2019 Cs533 Clustering Tweet Presentation

    6/19

    6 / 19

  • 7/23/2019 Cs533 Clustering Tweet Presentation

    7/19

    Social Media and Twitter

    Now, Twitter is an important information sourceworldwide

    Companies, politicians, mayors, celebrities, TV

    programs... Consider Melih Gkek

    Lots of mentions about him

    By using our clustering approach he can easily seewhat they are about

    7 / 19

  • 7/23/2019 Cs533 Clustering Tweet Presentation

    8/198 / 19

  • 7/23/2019 Cs533 Clustering Tweet Presentation

    9/19

    Approaches

    Unsupervised Clustering

    Supervised Clustering

    9 / 19

  • 7/23/2019 Cs533 Clustering Tweet Presentation

    10/19

    Unsupervised Clustering

    Unsupervised Clustering approaches are used in theprogram TweetMotif[3]

    o Automatic extraction of topics

    o Grouping tweets in the relevant topico Summarization of topics

    K means clustering algorithm with TF-IDF weighting

    10 / 19

  • 7/23/2019 Cs533 Clustering Tweet Presentation

    11/19

    TweetMotif

    11 / 19

  • 7/23/2019 Cs533 Clustering Tweet Presentation

    12/19

    Supervised Clustering

    According to article "Twitter Content Classification"hastags (#election) are approximate indicators oftopics[4]

    But not all of the tweets have hastags

    Use these hastags as classifiers

    Rocchio Classifier[2]

    o broadly used in document classification

    o quick to train

    o handle feature sparsity more robustly than othermodels(as the words in tweets are very sparse)

    12 / 19

  • 7/23/2019 Cs533 Clustering Tweet Presentation

    13/19

    Supervised vs Unsupervised

    These methods will be tested and investigated deeplyduring our process

    The best one in terms of

    o Performanceo Effectiveness

    o Cost of implementation

    will be used

    Maybe hybrid method?

    13 / 19

  • 7/23/2019 Cs533 Clustering Tweet Presentation

    14/19

    Summarization of Tweets

    Find the most representative tweet in cluster

    Use TF-IDF similarity for this selection

    That will be approximate summary of the cluster

    Google News uses this approach. It displays mostrelevant Story in a cluster of online news articles.

    14 / 19

  • 7/23/2019 Cs533 Clustering Tweet Presentation

    15/19

    Summarization of Tweets

    The algorithm we used to select the top N most representative tweets from a givencluster is as follows[2]:

    -Construct a centroid, V, for the cluster of tweets, C

    -Initialize an empty list for the selected tweets, T

    -Sort all tweets in C according to their TF-IDF similarity with V, where the highestranked ones are the ones that are most similar

    -Loop N times

    -Pick the highest ranked tweet, t, from C, whose TF-IDF similarity against alltweets in T is below some threshold k

    -Add t to T, and remove t from C

    15 / 19

  • 7/23/2019 Cs533 Clustering Tweet Presentation

    16/19

    Google News

    16 / 19

  • 7/23/2019 Cs533 Clustering Tweet Presentation

    17/19

    17 / 19

  • 7/23/2019 Cs533 Clustering Tweet Presentation

    18/19

    Expected Results

    Results can be rated by comparing for each cluster

    o for each tweets in that cluster

    o the summary statements

    giving a rating in terms of relativity for eachcomparison

    18 / 19

  • 7/23/2019 Cs533 Clustering Tweet Presentation

    19/19

    References

    1. Algorithms Behind Trending Topics http://www.ignitesocialmedia.com/twitter-marketing/trending-on-twitter-a-look-at-algorithms-behind-trending-topics/

    2. Topical Clustering of Tweets,K.D. Rosa, R Shah, B. Lin, A. Gershman, R.Frederking, Language Technologies Institute, Carnegie Mellon University

    3. TweetMotif: Exploratory Search and Topic Summarization for Twitter , B.O.

    Connor, M Krieger, D. Ahn

    4. Twitter Content Classificationhttp://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2745/2681

    19 / 19