

EDIUM: Improving Entity Disambiguation via User

Modeling

Author

Organization

Abstract. Entity Disambiguation is the task of associating name mentions in text with the correct referent entities in a knowledge base, with the goal of understanding and extracting useful information from the document. Entity disambiguation has become an important task for harnessing information shared by users on microblogging sites like Twitter. However, noise and lack of context in tweets make disambiguation a difficult task. In this paper, we describe an Entity Disambiguation system, EDIUM, which uses User Interest Models to disambiguate the mentions in a user's tweets. Our system jointly models the user's interest scores and the context disambiguation scores, thus compensating for the sparse context in the tweets of a given user. We evaluated the system's entity linking capabilities on user tweets and showed that an improvement can be achieved by combining the user models and the context-based models.

1 Introduction

Named Entity Disambiguation (NED) is the task of identifying the correct entity reference in a knowledge base (like DBpedia, Freebase or YAGO) for a given mention. On microblogging sites like Twitter, NED is an important task for understanding the user's intent and for topic detection and tracking, search personalization, and recommendations.

In the past, many NED techniques have been proposed. Some utilize the contextual information of an entity, while others use a candidate's popularity for disambiguation. But tweets, being short and noisy, lack sufficient context for these systems to disambiguate precisely. Due to this, the underlying user interests are modeled to disambiguate the entities [1][2]. However, creating user models might require external knowledge (like Wikipedia edits [1]), which is computationally expensive. Also, some users (e.g. news channels) tweet randomly (based on recent events or trending hashtags), while others follow their interests quite passionately. So, using the same configuration (like a sliding window of the same length [2]) for distinct users might not be effective. We tried to address these issues by modeling the user's tweeting behavior over time.

Our proposed system, EDIUM, links the entities in tweets by simultaneously disambiguating the entities and creating the user models. The user's behavior is analyzed with respect to the model built so far, and an appropriate weight is assigned to the user model. The user model then contributes in proportion to this weight while disambiguating the entities in new tweets. This


approach can also be used for modeling users and disambiguating entities in other streaming documents like emails or query logs. The next section describes the EDIUM system in detail.

2 The EDIUM System

EDIUM works by representing a user's interests as a distribution over semantic Wikipedia categories. EDIUM has three sub-systems: the context modeling system (Section 2.1), the user modeling system (Section 2.2) and the disambiguation system (Section 2.3). The user's interests are modeled based on the categories of the user's tweets. These interests, along with the local context, are used by the disambiguation system for linking entities in new tweets. The final results are fed back to the system to improve the user model.

Every tweet has multiple mentions, and each mention can be aligned to multiple entities. Table 1 lists the notations used while describing the system.

Table 1. Notations used for describing the system

Symbol    Description
C_i^j     j-th candidate entity for the i-th mention in a given tweet
c_i^j     Contextual similarity score for C_i^j
Par(C)    Parent of C: the set of categories that are the immediate ancestors of category C
G(C)      Grandparent of C: the set of all categories such that G(C) ⊆ Par(Par(C)) and G(C) ∩ Par(C) = ∅
N_r(C)    Set of categories in the r-th neighborhood of category C
IC_u^i    Set of categories in the i-th interest cluster for user u
ic_u^i    Score for the cluster IC_u^i

2.1 Contextual Modeling System

The Context Model (CM) disambiguates the entities based on the text around them. The similarity between the text around the mention and the text on the Wikipedia page of an entity is computed, and an appropriate disambiguation weight is given to each candidate referent. The candidate referent with the maximum weight is taken as the disambiguated entity for the given mention.

We improved the referents' disambiguation scores by combining the context-based scores with the user-interest-based scores in an appropriate manner. We used existing entity linking systems, DBpedia Spotlight [3] and Wikipedia Miner [4], for linking and disambiguating the entities based on the context.

The final score Score_C(C_i^j) given by the context model is the candidate score normalized over all possible alignments for the given mention:

Score_C(C_i^j) = c_i^j / Σ_k c_i^k   (1)

where the sum runs over all candidate entities C_i^k for mention i.
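Equation 1 is a straightforward normalization over a mention's candidates. A minimal sketch in Python (function and entity names are illustrative, not from the paper):

```python
def normalize_context_scores(candidate_scores):
    """Normalize raw contextual similarity scores c_i^j over all
    candidates of one mention, per Eq. 1."""
    total = sum(candidate_scores.values())
    if total == 0:
        # No contextual evidence: fall back to a uniform distribution.
        n = len(candidate_scores)
        return {c: 1.0 / n for c in candidate_scores}
    return {c: s / total for c, s in candidate_scores.items()}

# Hypothetical raw scores from a context linker for one mention.
raw = {"Apple_Inc.": 0.6, "Apple_(fruit)": 0.2, "Apple_Records": 0.2}
print(normalize_context_scores(raw))  # scores now sum to 1
```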

2.2 User Modeling System

The User Model (UM) captures the user's interests and behavior over time.


UM Creation: We used cluster-weighted models¹ for modeling the user's interests. The following assumptions were made while creating the user models.

– Users only tweet on topics that interest them.
– The amount of interest in a topic is proportional to the information shared by the user on the topic.

Based on these assumptions, we modeled each user as a set of weighted sub-clusters of semantic Wikipedia categories. Each sub-cluster represents the user's interest in a specific topic, and its weight represents the user's overall interest in that topic.

The UM is updated for the user's future tweets based on the categories in the user's current tweet. The tweet categories are extracted using the following steps.

1. The current tweet's entities are discovered via the disambiguation modeling (DM) system (Section 2.3).
2. The entities with sufficiently high confidence are shortlisted to prevent the UM from learning incorrect information about the user. We considered only those entities where the ratio of the score of the second-ranked entity to that of the disambiguated entity is at most δ².
3. Tweet categories for the shortlisted entities are extracted using Wikipedia. The score of each tweet category equals the number of tweet entities inherited by the category.
4. Considering the graph of semantic Wikipedia categories, the tweet categories are smoothed to include the parent categories. Parents are given scores in inverse proportion to their out-degree for each child category: a common parent gets a smaller contribution from the child's score than a rare parent.
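Steps 3 and 4 above can be sketched as follows; the category-graph bookkeeping (the `parents_of` and `parent_out_degree` structures) and all names are assumptions for illustration:

```python
from collections import defaultdict

def tweet_categories(entity_categories, parent_out_degree, parents_of):
    """Score tweet categories (step 3) and smooth the scores onto parent
    categories in inverse proportion to each parent's out-degree (step 4)."""
    scores = defaultdict(float)
    # Step 3: a category's score = number of tweet entities it covers.
    for cats in entity_categories.values():
        for c in cats:
            scores[c] += 1.0
    # Step 4: common parents (high out-degree) get a smaller share of
    # each child's score than rare parents do.
    smoothed = dict(scores)
    for c, s in scores.items():
        for p in parents_of.get(c, []):
            smoothed[p] = smoothed.get(p, 0.0) + s / parent_out_degree[p]
    return smoothed

# Hypothetical one-entity tweet: one direct category, one broad parent.
cats = tweet_categories(
    {"Python_(programming_language)": ["Programming_languages"]},
    {"Computing": 50},
    {"Programming_languages": ["Computing"]},
)
print(cats)  # {'Programming_languages': 1.0, 'Computing': 0.02}
```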

The UM is updated based on the tweet categories. If a category is already present in the model, its score is updated to the sum of the existing score and the tweet category score. Otherwise, the category and its score are added to the model. As new tweet category scores are added to the UM, the model evolves to better represent the newly processed tweets.

To find the topics of interest for the user, each category is mapped to a single interest cluster. We formed clusters based on shared parent and grandparent categories. The score of the k-th interest cluster, ic_u^k, for user u is the sum of the weights of the categories in cluster k.
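The model update and cluster scoring described above amount to accumulating category weights and summing them per cluster; a minimal sketch (all names are illustrative assumptions):

```python
def update_user_model(model, tweet_cat_scores):
    """Merge this tweet's smoothed category scores into the user model:
    existing categories accumulate, new ones are inserted."""
    for cat, score in tweet_cat_scores.items():
        model[cat] = model.get(cat, 0.0) + score
    return model

def cluster_score(model, cluster_categories):
    """ic_u^k: sum of weights of the categories in interest cluster k."""
    return sum(model.get(c, 0.0) for c in cluster_categories)

um = {"Machine_learning": 2.0}
update_user_model(um, {"Machine_learning": 1.0, "Twitter": 0.5})
print(um)                                          # {'Machine_learning': 3.0, 'Twitter': 0.5}
print(cluster_score(um, {"Machine_learning", "Twitter"}))  # 3.5
```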

Twitter users exhibit different interest behaviors, from highly specific to fairly random. This can be seen from the fact that some users tweet based on situations like trending hashtags or popular news, while others tweet only about highly specific products or companies. Disambiguating entities depends heavily on the behavior of the user: while interest models might be useful in the latter case, they might not be as effective in the former. To handle

¹ http://en.wikipedia.org/wiki/Cluster-weighted_modeling
² The δ value depends on the use case and the performance of the underlying CM system. While high values ensure large learning rates, low values safeguard the performance of the UM system.


this issue, we introduce the concept of relatedness between the user's learnt model and the Disambiguation Model (DM).

The similarity between the UM and the DM is defined as the cosine similarity between the tweet category vectors obtained when the DM is used vs. when only the user model is used for disambiguation.

Sim(UM, DM) = cos(Score_α(C_i), Score_U(C_i))   (2)

Similarly, the similarity between the CM and the DM is defined as the cosine similarity between the tweet category vectors obtained when the DM is used vs. when only the CM is used for disambiguation.

Sim(CM, DM) = cos(Score_α(C_i), Score_C(C_i))   (3)

We now define the relatedness R as the ratio of the similarities between UM & DM and CM & DM, weighted in inverse proportion to their contributions during disambiguation:

R = [(1 − α) · Sim(UM, DM)] / [(1 − α) · Sim(UM, DM) + α · Sim(CM, DM)]   (4)
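Equations 2–4 combine two cosine similarities into a single relatedness value; a minimal sketch, assuming the category score vectors are represented as sparse dicts (all names are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors represented as dicts."""
    dot = sum(u.get(k, 0.0) * v[k] for k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def relatedness(alpha, dm_vec, um_vec, cm_vec):
    """R from Eq. 4: the user model's share of the similarity mass,
    weighted by its disambiguation contribution (1 - alpha)."""
    sim_um = cosine(dm_vec, um_vec)   # Eq. 2
    sim_cm = cosine(dm_vec, cm_vec)   # Eq. 3
    denom = (1 - alpha) * sim_um + alpha * sim_cm
    return (1 - alpha) * sim_um / denom if denom else 0.0
```

When the DM's output agrees fully with the UM and not at all with the CM, R is 1 regardless of alpha, signaling a highly consistent user.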

α is a measure of the consistency of the user's behavior with respect to the learnt user model: it tells how consistent the user is about his or her interests (and how stable the user model is). The higher the value of α, the more consistent the user.

We update α after each tweet based on the contribution the user model had in deciding the tweet categories. Since we do not want to decide α based on the user's behavior on a single tweet, we consider the previous n relatedness values when finding the new α: α is the average of the last n relatedness values.

α = (1/n) · Σ_{t=1}^{n} R_t   (5)

where R_t is the relatedness value from the t-th most recent tweet.

To deal with changing user interests, we decrease α by a factor of 0.9 each day. This lowers the dependency of the DM on the UM over time. Also, α is capped at 0.7 to keep the model from learning incorrect user models and making decisions irrespective of the context. This also enables the model to discover new entities even for highly interest-focused Twitter users.
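The α update described above (a rolling average of the last n relatedness values per Eq. 5, a daily 0.9 decay, and a 0.7 cap) can be sketched as follows; the class and method names are illustrative assumptions:

```python
from collections import deque

class AlphaTracker:
    """Track the user-consistency weight alpha as described in Sec. 2.2."""

    def __init__(self, n=20, alpha0=0.001, cap=0.7, daily_decay=0.9):
        self.history = deque(maxlen=n)   # last n relatedness values
        self.alpha = alpha0
        self.cap = cap
        self.daily_decay = daily_decay

    def on_tweet(self, relatedness):
        """Eq. 5: alpha is the average of the last n relatedness values,
        capped at 0.7 so context is never ignored entirely."""
        self.history.append(relatedness)
        self.alpha = min(self.cap, sum(self.history) / len(self.history))
        return self.alpha

    def on_new_day(self):
        """Fade stale interests: shrink alpha by a factor of 0.9 per day."""
        self.alpha *= self.daily_decay
        return self.alpha
```

The `deque(maxlen=n)` automatically discards relatedness values older than the last n tweets, which keeps the update O(n) per tweet.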

The user model is committed to the database³ after each transaction and is used whenever a new tweet from the same user arrives. This lets us track a huge number of users and build a streaming disambiguation system for Twitter streams.

Disambiguation: For each candidate entity C_i^j in a tweet, the final score given by the user model is

Score_U(C_i^j) = Σ_{k=1}^{n_c} Sim(C_i^j, IC_u^k) · Score(IC_u^k)   (6)

where the sum runs over the user's n_c interest clusters.

³ MongoDB is used as the database.


Score(IC_u^i) = ic_u^i / Σ_{k=1}^{n_c} ic_u^k   (7)

Sim(C_i^j, IC_u^i) = |IC_u^i ∩ N_3(C_i^j)| / |IC_u^i ∪ N_3(C_i^j)|   (8)
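Equations 6–8 score a candidate by the Jaccard overlap between its 3-hop category neighborhood and each interest cluster, weighted by the cluster's normalized score. A sketch under those definitions (all names are illustrative):

```python
def jaccard(a, b):
    """Eq. 8: |A ∩ B| / |A ∪ B| for two category sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def user_model_score(neighborhood, clusters, cluster_scores):
    """Eq. 6 with Eq. 7: sum over interest clusters of the Jaccard
    similarity times the cluster's normalized score."""
    total = sum(cluster_scores.values())
    score = 0.0
    for k, cats in clusters.items():
        norm = cluster_scores[k] / total if total else 0.0   # Eq. 7
        score += jaccard(neighborhood, cats) * norm          # Eq. 6
    return score

# Hypothetical candidate whose neighborhood exactly matches one cluster.
s = user_model_score({"AI", "ML"},
                     {"tech": {"AI", "ML"}, "food": {"Fruit"}},
                     {"tech": 3.0, "food": 1.0})
print(s)  # 0.75
```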

The Score_U(C_i^j) is then normalized over all candidates C_i^k of the given mention.

2.3 Entity Disambiguation System (DM)

The DM disambiguates the entities based on the textual context as well as the user's interests. The DM system combines the context-based model's score and the user-based model's score using the parameter α, which reflects the stability of the user with respect to previously tweeted topics. The final score predicted by the DM is

Score_α(C_i^j) = α · Score_U(C_i^j) + (1 − α) · Score_C(C_i^j)   (9)

The model selects the entity that maximizes Score_α(C_i^j) for the given mention i:

Entity_i = arg max_j Score_α(C_i^j)   (10)
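Equations 9–10 reduce to a weighted sum followed by an argmax over the mention's candidates; a minimal sketch (all names are illustrative):

```python
def disambiguate(alpha, user_scores, context_scores):
    """Eq. 9-10: combine Score_U and Score_C with weight alpha, then
    pick the candidate with the maximum combined score."""
    combined = {
        c: alpha * user_scores.get(c, 0.0) + (1 - alpha) * context_scores[c]
        for c in context_scores
    }
    return max(combined, key=combined.get), combined

# Hypothetical case: context favors the fruit, the user model favors
# the company; with alpha = 0.4 the company wins.
entity, scores = disambiguate(
    0.4,
    {"Apple_Inc.": 0.9, "Apple_(fruit)": 0.1},
    {"Apple_Inc.": 0.3, "Apple_(fruit)": 0.7},
)
print(entity)  # Apple_Inc.
```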

3 Results and Discussions

We evaluated the performance of EDIUM on a manually annotated dataset of 100 tweets from 15 different Twitter users. α is initialized to 0.001 for each user because the UM has no prior information about the user. We ran the system with n = 20 and δ = 0.95. As the UM improves with each of the user's tweets, the precision at 1 (P@1) score is calculated at intervals of 20 tweets for each user. The system is evaluated with both DBpedia Spotlight and Wikipedia Miner as the context modeling system. Fig. 1 reports the performance of the system over time when the proposed model is used vs. when just the CM or just the UM (built using previous tweets with the proposed model) is used for disambiguation. We observed that EDIUM started to outperform the CM after ≈ 60 tweets (of each user) were processed by the system. The maximum performance is achieved when the proposed model is used with Wikipedia Miner as the CM system.
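The P@1 metric used above is simply the fraction of mentions whose top-ranked entity matches the gold annotation; a minimal sketch of the evaluation (names are illustrative):

```python
def precision_at_1(predictions, gold):
    """P@1: fraction of mentions whose top prediction equals the
    gold-annotated entity."""
    hits = sum(1 for m, e in predictions.items() if gold.get(m) == e)
    return hits / len(predictions) if predictions else 0.0

# Two hypothetical mentions, one linked correctly.
print(precision_at_1({"m1": "A", "m2": "B"}, {"m1": "A", "m2": "C"}))  # 0.5
```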

Table 2. Average α scores

Method       Avg. α
EDIUM (WM)   0.49
EDIUM (DS)   0.38

In our experiments, EDIUM performed better with Wikipedia Miner (WM) than with DBpedia Spotlight (DS). This is because the system depends on the underlying context models for learning the user interests: context models that are more precise lead to faster and more accurate user models, which in turn significantly help the context model disambiguate the entities.


Fig. 1. P@1 score of EDIUM under different configurations: (a) performance with Wikipedia Miner; (b) performance with DBpedia Spotlight.

Conversely, if the underlying context models have low entity linking and disambiguation accuracies, the user models usually take much longer to learn the user interests (with low α values) and to use them for entity disambiguation. It can be seen that the UM alone can also disambiguate the entities in a user's tweets and achieve significant performance.

4 Conclusion and Future Work

In this paper, we have modeled entity disambiguation based on a user's past interest information. The paper proposed a way to model users' interests using entity linking techniques and to then use these models to improve disambiguation in entity linking systems. The gain in precision is proportional to the accuracy of the underlying entity linking system.

More analysis is required on the user modeling aspect of the system. Currently, a user's past tweets are used for building the user model, and the model's quality depends heavily on the underlying context model. We are including network and demographic information of users to improve user modeling. In the future, this will help us better disambiguate entities by capturing more aspects of user behavior.

References

1. Murnane, E.L., Haslhofer, B., Lagoze, C.: RESLVE: leveraging user interest to improve entity disambiguation on short text. In: Proceedings of the 22nd International Conference on World Wide Web Companion. WWW '13 Companion, Republic and Canton of Geneva, Switzerland, International World Wide Web Conferences Steering Committee (2013) 81–82

2. Shen, W., Wang, J., Luo, P., Wang, M.: Linking named entities in tweets with knowledge base via user interest modeling. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD '13, New York, NY, USA, ACM (2013) 68–76

3. Mendes, P.N., Jakob, M., García-Silva, A., Bizer, C.: DBpedia Spotlight: shedding light on the web of documents. In: Proceedings of the 7th International Conference on Semantic Systems. I-Semantics '11, New York, NY, USA, ACM (2011) 1–8

4. Milne, D., Witten, I.H.: An open-source toolkit for mining Wikipedia. Artif. Intell. 194 (2013) 222–239