Knowledge Enabled Location Prediction of Twitter Users
Master’s Thesis
Revathy Krishnamurthy
Committee
Amit P. Sheth (Advisor)
Krishnaprasad Thirunarayan
Derek Doran
Collaborator
Pavan Kapanipathi
1
Background Knowledge can improve a machine’s ability to interpret text
BUCKEYE STATE
2
BACKGROUND KNOWLEDGE
3
Geographic footprint of a Twitter user
4
News Recommender Systems
Beavercreek preschool to open in 2015
By Sharon D. Boykin
A $5.1 million preschool in Beavercreek City Schools district will help accommodate a growing student population and reduce overcrowding, according to school officials.
Ohio’s health exchange to include more competition
By Randy Tucker
It was just a year ago that the insurance industry fretted over potential losses from the new insurance market created by the Affordable Care Act.
Recommended for you
WHY IS LOCATION IMPORTANT?
• Targeted advertising
• Opinion Analysis
• Disaster Response
• Location Based Services
Other applications
5
Geo-tagged Tweets Profile Information
LOCATION PUBLISHED BY USER
• Less than 4% of tweets contain geo-spatial tags
• Location field in profile is either empty or contains invalid information such as “Justin Bieber’s heart”
7
INFERRING LOCATION OF A TWITTER USER
• Network based: Friends, Followers, Followees
• Content based:
  “Just drove around Golden Gate Park two times trying to get in”
  “Cleveland Browns confuse me. When I give up on them, they actually show up to play.”
8
NETWORK BASED APPROACHES
Friends, Followers, Followees
• Depends on those friends and followers of a user whose locations are known
9
CONTENT BASED APPROACHES
Just drove around Golden Gate Park two times trying to get in
Cleveland Browns confuse me. When I give up on them, they actually show up to play.
• Supervised Approaches
  • Probabilistic Models – (Cheng, Caverlee, and Lee, 2010)
  • Cascading Topic Models – (Eisenstein, O'Connor, Smith, and Xing, 2010)
  • Gaussian Mixture Model – (Chang, Lee, Eltaher, and Lee, 2012)
  • Language Models – (Doran, Gokhale, and Dagnino, 2014)
  • Ensemble of Statistical and Heuristic Classifiers – (Mahmud, Nichols, and Drews, 2014)
10
Geographic location of a user influences the contents of their tweets
Content-based approach
APPROACHES TO LOCATE A TWITTER USER
Reference: Cheng, Caverlee, and Lee, 2010
11
PROBLEM STATEMENT
Predict the location of a Twitter user based on their tweets, by exploiting Wikipedia to create a location-specific knowledge base
13
• Knowledge-enabled approach to predict the location of Twitter users based on the contents of their tweets, without using any training dataset of geo-tagged tweets
• Creation of a location-specific knowledge base extracted from Wikipedia by introducing the concept of Local Entities
• Evaluation of the approach on a publicly available dataset, achieving 55% accuracy and an Average Error Distance of 429 miles
CONTRIBUTIONS
14
KNOWLEDGE-BASE ENABLED APPROACH
San Francisco: Golden Gate Bridge, San Francisco 49ers, San Francisco Chronicle …
Entity                    Count
Golden Gate Bridge        4
San Francisco 49ers       2
San Francisco Chronicle   1
Top-k predictions: San Francisco, Oakland, Palo Alto
15
KNOWLEDGE-BASE ENABLED APPROACH
[Architecture diagram with three components]
• KNOWLEDGE BASE GENERATOR: Internal Links Extraction; city-1, city-2, …, city-k; LocalEntity-1, LocalEntity-2, …, LocalEntity-n; Weighted Local Entities
• USER PROFILE GENERATOR: Entity Recognition and Scoring; Annotated Tweets
• LOCATION PREDICTION: Location Predictor; ranked cities for user
16
LOCAL ENTITIES
[Figure panels: San Francisco, New York City, Houston]
17
• Collaborative encyclopedia
• As of 2014, English Wikipedia has 4.6 million articles, 18 billion page views, and 500 million unique visitors per month.
• Category Structure
  • Used for document clustering, tweet classification, personalization systems etc.
  • At Kno.e.sis, used in applications such as
    • Doozer (Thomas, Mehra, Brooks, and Sheth, 2008)
    • BLOOMS (Jain, Hitzler, Sheth, Verma, and Yeh, 2010)
    • Hierarchical Interest Graph (Kapanipathi, Jain, Venkataramani, and Sheth, 2014)
• Link Structure
  • Used for word sense disambiguation, semantic relatedness between terms etc.
WIKIPEDIA
18
LINK STRUCTURE OF WIKIPEDIA
19
“In general, links should be created to relevant connections to the subject of another article that will help readers understand the article more fully. This can include people, events, and topics that already have an article or that clearly deserve one, so long as the link is relevant to the article in question.”
Source: http://en.wikipedia.org/wiki/Help:Link#Wikilinks
LINK STRUCTURE OF WIKIPEDIA
21
LOCAL ENTITIES
[Figure: Local Entities of San Francisco]
• We consider the internal links of location pages as Local Entities of the city
• While a city page does not contain a link to itself, we also use the city as a local entity
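The extraction of local entities described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes the city page is available as raw wikitext, and the regular expression, the `local_entities` helper, and the sample text are all hypothetical. Real wikitext has many more cases (templates, redirects, titles containing colons) that this sketch ignores.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]].
# A simplification of real wikitext syntax (assumption).
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

def local_entities(city_title, wikitext):
    """Return the internal link targets of a city page, plus the city itself."""
    entities = set()
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        # Skip namespaced links such as File: or Category: (assumed non-entities).
        if ":" in target:
            continue
        entities.add(target)
    # The city page does not link to itself, so add the city explicitly.
    entities.add(city_title)
    return entities

# Hypothetical snippet of a city page.
sample = (
    "The [[Golden Gate Bridge]] and the [[San Francisco 49ers|49ers]] "
    "are covered by the [[San Francisco Chronicle]]. [[File:Example.jpg|thumb]]"
)
entities = local_entities("San Francisco", sample)
print(sorted(entities))
```

The city title is added at the end because, as noted above, a city page does not link to itself.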
22
LOCAL ENTITIES
San Francisco, California – 717 local entities
Fairborn, Ohio – 110 local entities
23
ARE ALL ENTITIES EQUALLY LOCAL?
24
San Francisco Chronicle, San Francisco Examiner, SF Weekly
MSNBC, CNN, BBC, Al Jazeera America
• Pointwise Mutual Information (PMI) – standard measure of association between two variables
• Assumption: the more local an entity is to a city, the higher the statistical dependence between them
• Computed as:
$PMI(c, e) = \log_2 \frac{P(c, e)}{P(c)\,P(e)}$
Association-based Measure
LOCALNESS MEASURE OF ENTITIES
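A small worked example of the PMI formula. The slides do not say how the probabilities are estimated, so this sketch assumes they come from raw occurrence counts over a corpus of pages; the `pmi` helper and all counts are hypothetical.

```python
import math

def pmi(joint_count, city_count, entity_count, total):
    """PMI(c, e) = log2( P(c, e) / (P(c) * P(e)) ), estimated from raw counts."""
    p_joint = joint_count / total
    p_city = city_count / total
    p_entity = entity_count / total
    return math.log2(p_joint / (p_city * p_entity))

# Hypothetical counts: pages mentioning both, the city, the entity, and all pages.
# An entity that appears almost only alongside the city scores high ...
local = pmi(joint_count=8, city_count=16, entity_count=8, total=1024)
# ... while an entity that appears everywhere scores low.
common = pmi(joint_count=8, city_count=16, entity_count=512, total=1024)
print(local, common)
```

With these counts the rare, city-specific entity gets a PMI of 6 and the widespread entity a PMI of 0, which is the behavior the assumption on the slide relies on.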
26
Graph-based Measure
LOCALNESS MEASURE OF ENTITIES
27
“The Boston Red Sox, a founding member of the American League of Major League Baseball in 1901 ...”
Boston Red Sox: “The Boston Red Sox are an American professional baseball team based in Boston, Massachusetts ... They are members of American League (AL).”
Boston
American League
LOCALNESS MEASURE OF ENTITIES
28
Directed Graph of Local Entities of Boston
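• Betweenness Centrality (BC) – Measures the importance of a node relative to the rest of the nodes in the graph
• A high BC score of a vertex in a graph indicates that it lies on a considerable fraction of the shortest paths connecting other vertices
• Computed as:
$C_B(c, e) = \sum_{e_i \neq e \neq e_j} \frac{\sigma_{e_i e_j}(e)}{\sigma_{e_i e_j}}$
where $\sigma_{e_i e_j}$ is the number of shortest paths from $e_i$ to $e_j$ and $\sigma_{e_i e_j}(e)$ is the number of those paths that pass through $e$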
Graph-based Measure
LOCALNESS MEASURE OF ENTITIES
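The betweenness centrality formula can be sketched with Brandes' algorithm on a directed, unweighted graph. This is an illustrative implementation, not necessarily the one used in the thesis; the toy edge list is hypothetical, and the scores are unnormalized, so they are not comparable to the values reported on the slides.

```python
from collections import deque

def betweenness_centrality(graph):
    """Unnormalized betweenness for a directed, unweighted graph (Brandes).

    graph maps each node to an iterable of successor nodes.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    score = {v: 0.0 for v in nodes}

    for source in nodes:
        # Breadth-first search that counts shortest paths from source.
        sigma = {v: 0 for v in nodes}   # number of shortest paths source -> v
        dist = {v: -1 for v in nodes}
        preds = {v: [] for v in nodes}  # predecessors on shortest paths
        sigma[source] = 1
        dist[source] = 0
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, ()):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies in reverse BFS order.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    return score

# Hypothetical link graph among a few pages related to Boston.
toy_graph = {
    "Boston": ["Boston Red Sox", "Fenway Park"],
    "Fenway Park": ["Boston Red Sox"],
    "Boston Red Sox": ["American League", "Boston"],
    "American League": [],
}
scores = betweenness_centrality(toy_graph)
print(scores)
```

In this toy graph, "Boston Red Sox" lies on three shortest paths between other pages while "American League" lies on none, mirroring the ordering of the two scores shown for Boston.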
29
LOCALNESS MEASURE OF ENTITIES
30
Directed Graph of Local Entities of Boston
Boston Red Sox: 0.004540
American League: 0.000046
[Figure: three groups of linked entities]
• Alcatraz Island, Treasure Island, Alameda Island, Financial District, Market Street, Fisherman’s Wharf, San Francisco 49ers, Cow Hollow, Silicon Valley, South Beach, …
• Suspension Bridge, Hyde Street Pier, Irving Morrow, Angelo Rossi, Art Deco, Charles Alton Ellis, Bethlehem Steel, Half Way to Hell Club, International Orange, …
• San Francisco Bay, Golden Gate, San Francisco Chronicle, U.S. Route 101, Marin County, Sausalito, Bay Area, …
Semantic Overlap Measure
LOCALNESS MEASURE OF ENTITIES
31
• Measures the relatedness between concepts with the intuition that related concepts are connected to similar entities
• Jaccard Index: Overlap between two sets
$jaccard(c, e) = \frac{|O(c) \cap O(e)|}{|O(c) \cup O(e)|}$
Semantic Overlap Measure
LOCALNESS MEASURE OF ENTITIES
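A minimal sketch of the Jaccard index, assuming $O(x)$ is the set of entities linked from the page of $x$ (the slides do not define $O$ explicitly). The `jaccard` helper and the two small link sets are hypothetical.

```python
def jaccard(city_links, entity_links):
    """|O(c) ∩ O(e)| / |O(c) ∪ O(e)| for two collections of linked entities."""
    city_links, entity_links = set(city_links), set(entity_links)
    union = city_links | entity_links
    if not union:
        return 0.0  # convention for two empty sets (assumption)
    return len(city_links & entity_links) / len(union)

# Hypothetical link sets for a city page and one of its local entities.
city = {"Golden Gate Bridge", "San Francisco Chronicle", "Marin County", "Alcatraz Island"}
entity = {"San Francisco Chronicle", "Marin County", "Suspension Bridge"}
score = jaccard(city, entity)
print(score)
```

Two of the five distinct entities are shared, so the score is 0.4. Note that the measure is symmetric: entities found only on the city page lower the score just as much as entities found only on the local entity's page.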
32
• Tversky Index: Asymmetric similarity measure between two sets
$ti(c, e) = \frac{|O(c) \cap O(e)|}{|O(c) \cap O(e)| + \alpha\,|O(c) - O(e)| + \beta\,|O(e) - O(c)|}$
• We choose α = 0 and β = 1
• For every entity in the page of a local entity not found in the page of the city, penalize the local entity
Semantic Overlap Measure
LOCALNESS MEASURE OF ENTITIES
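A minimal sketch of the Tversky index with the stated choice α = 0 and β = 1, again assuming $O(x)$ is the set of entities linked from the page of $x$. The `tversky` helper and the placeholder link sets are hypothetical; they only illustrate how a small, focused page and a large, broad page are treated differently.

```python
def tversky(city_links, entity_links, alpha=0.0, beta=1.0):
    """|O(c) ∩ O(e)| / (|O(c) ∩ O(e)| + alpha*|O(c) - O(e)| + beta*|O(e) - O(c)|)."""
    c, e = set(city_links), set(entity_links)
    common = len(c & e)
    denominator = common + alpha * len(c - e) + beta * len(e - c)
    return common / denominator if denominator else 0.0

# Placeholder link sets (hypothetical).
city = {"a", "b", "c", "d", "e", "f"}
neighborhood = {"a", "b"}                           # small page, fully inside the city
broad_region = {"a", "b"} | {f"x{i}" for i in range(8)}  # large page, mostly unrelated

focused = tversky(city, neighborhood)
broad = tversky(city, broad_region)
print(focused, broad)
```

With α = 0, entities that appear only on the city page cost nothing, so the small page scores 1.0. With β = 1, each of the eight unrelated entities on the broad page is penalized, giving 0.2.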
33
KNOWLEDGE-BASE OF LOCAL ENTITIES
Local Entities of San Francisco (Localness measure: Tversky Index)
34
CREATION OF USER PROFILE
Step 1: Entity Linking
Just drove around Golden Gate Park trying to get in.
We use Zemanta for Entity Linking
Step 2: Entity Scoring
Entity                    Count
Golden Gate Bridge        4
San Francisco 49ers       2
San Francisco Chronicle   1
User Profile for user $u$ defined as: $P_u = \{(e, s) \mid e \in W, s \in \mathbb{R}\}$
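The entity scoring step can be sketched as a simple count of entity mentions. This assumes entity linking has already been done (the Zemanta call is not reproduced here) and that the score $s$ is the mention count, as the Entity/Count table suggests. The `build_user_profile` helper and the annotated tweets are hypothetical.

```python
from collections import Counter

def build_user_profile(annotated_tweets):
    """Map each linked entity to a score; here the score is its mention count."""
    profile = Counter()
    for entities in annotated_tweets:
        profile.update(entities)
    return dict(profile)

# Each inner list holds the entities an entity linker returned for one tweet
# (hypothetical annotations).
annotated_tweets = [
    ["Golden Gate Bridge"],
    ["Golden Gate Bridge", "San Francisco 49ers"],
    ["Golden Gate Bridge", "San Francisco Chronicle"],
    ["San Francisco 49ers", "Golden Gate Bridge"],
]
profile = build_user_profile(annotated_tweets)
print(profile)
```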
37
LOCATION PREDICTION
• Compute an aggregate score for each city whose local entities are found in a user’s tweets
$locScore(c, u) = \sum_{j=1}^{|I_{cu}|} locl(c, e_j) \times s_{e_j}$
where $I_{cu}$ are the local entities of city $c$ found in the tweets of user $u$, $e_j \in I_{cu}$, $locl(c, e_j)$ is the localness score of entity $e_j$ with respect to city $c$, and $s_{e_j}$ is the score of $e_j$ in the user profile
• Rank cities by $locScore(c, u)$ in descending order to predict the top-k locations of a user
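The scoring and ranking steps above can be sketched as follows. The helper names `loc_score` and `predict_top_k` are hypothetical, and the knowledge base is only a partial excerpt with one made-up entry, so the resulting totals are illustrative and will not match full-profile scores.

```python
def loc_score(profile, localness):
    """locScore(c, u): sum of localness(c, e) * s_e over matched local entities."""
    return sum(weight * profile[entity]
               for entity, weight in localness.items()
               if entity in profile)

def predict_top_k(profile, knowledge_base, k=3):
    """Rank cities that share at least one local entity with the user profile."""
    scores = {}
    for city, localness in knowledge_base.items():
        if any(entity in profile for entity in localness):
            scores[city] = loc_score(profile, localness)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

# Partial, illustrative excerpt of localness scores per city.
knowledge_base = {
    "San Francisco": {"Nob Hill": 0.48214, "SF Weekly": 0.1875, "Golden Gate Park": 0.16783},
    "Detroit, MI": {"Detroit Historical Museum": 0.4838, "Detroit Red Wings": 0.0232},
    "Houston, TX": {"Space Center Houston": 0.5},  # hypothetical entry with no matches
}
# Entity mention counts from a user's tweets (illustrative).
profile = {"Nob Hill": 3, "SF Weekly": 1, "Golden Gate Park": 1,
           "Detroit Historical Museum": 1, "Detroit Red Wings": 4}

ranking = predict_top_k(profile, knowledge_base)
print(ranking)
```

Cities with no matching local entity are left out of the ranking rather than scored as zero, matching the wording "each city whose local entities are found in a user's tweets".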
39
LOCATION PREDICTION
User Profile (entity mentions with counts), grouped by candidate city, with the resulting location score:
• San Francisco (14.5531): San Francisco International Airport (6), San Francisco (4), Nob Hill (3), San Francisco Museum of Modern Art (1), Beach Blanket Babylon (2), San Francisco Municipal Railway (4), Golden Gate Park (1), San Francisco Bay Area (1), SF Weekly (1)
• Oakland, CA (10.7584): Fox Oakland Theatre (2), Berkeley (1), Green Day (1), Oakland (9), San Francisco Bay Area (1)
• Detroit, MI (8.0600): The White Stripes (1), Detroit Metropolitan Wayne County Airport (1), Detroit Historical Museum (1), Detroit Red Wings (4), General Motors (1)
• Palo Alto, CA (6.9175): Palo Alto (6), SAP AG (8), Facebook (3), PARC (company) (2), Dell (1), Google (1)
Knowledgebase (localness scores):
• San Francisco: Nob Hill 0.48214, SF Weekly 0.1875, Golden Gate Park 0.16783, San Francisco International Airport 0.06818, …
• Oakland, CA: Fox Oakland Theatre 0.09375, SF Bay Area 0.12972, Green Day 0.02066, …
• Detroit, MI: Detroit Historical Museum 0.4838, General Motors 0.05538, Detroit Red Wings 0.0232, …
• Palo Alto, CA: PARC (company) 0.03726, Google 0.04678, Facebook 0.05810
41
IMPLEMENTATION
Knowledge base
• All cities of the United States with population > 5000, as published in the 2012 census estimates
• 4,661 cities and 500,714 local entities
Baseline
• Considers all local entities to be equally local to the city
• Location prediction based only on frequency of entities
42
• Published by Cheng, Caverlee, and Lee, 2010.
• Contains 5119 active users from the continental United States with approximately 1000 tweets per user.
• User’s location listed in the form of latitude and longitude.
Test Dataset
EVALUATION
43
• Error Distance
$ErrorDist(u) = distance(loc_{act}(u), loc_{est}(u))$
Distance between actual location of the user and the estimated location
• Average Error Distance
$AED(U) = \frac{\sum_{u \in U} ErrorDist(u)}{|U|}$
Average of error distance of all users in the test dataset
• Accuracy
$ACC(U) = \frac{|\{u \mid u \in U \wedge ErrorDist(u) \leq 100\}|}{|U|}$
Percentage of users predicted within 100 miles of their actual location
Evaluation Metrics
EVALUATION
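The three metrics can be sketched as follows. The slides do not specify the distance function, so this sketch assumes great-circle (haversine) distance in miles; the helper names, the Earth-radius constant, and the two test users with approximate coordinates are all illustrative.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius (assumed constant)

def haversine_miles(a, b):
    """Great-circle distance in miles between two (latitude, longitude) pairs."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))

def evaluate(actual, predicted, threshold=100.0):
    """Return (AED in miles, ACC in percent) over all users in `actual`."""
    errors = [haversine_miles(actual[u], predicted[u]) for u in actual]
    aed = sum(errors) / len(errors)
    acc = 100.0 * sum(e <= threshold for e in errors) / len(errors)
    return aed, acc

# Two hypothetical users with approximate city coordinates.
actual = {"u1": (37.7749, -122.4194),    # San Francisco
          "u2": (42.3314, -83.0458)}     # Detroit
predicted = {"u1": (37.8044, -122.2712),  # Oakland: within 100 miles
             "u2": (37.7749, -122.4194)}  # San Francisco: far from Detroit
aed, acc = evaluate(actual, predicted)
print(round(aed, 1), acc)
```

One of the two users is placed within 100 miles, so ACC is 50%, while the single large miss dominates the AED. This is why the two metrics are reported together.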
44
Location Prediction Results
EVALUATION
Localness Measure   ACC (%)   AED (in Miles)   ACC@2   ACC@3   ACC@5
Baseline            25.21     632.56           38.01   42.78   47.95
PMI                 38.48     599.40           49.85   56.06   64.15
BC                  47.91     478.14           57.39   62.18   66.98
Jaccard Index       53.21     433.62           67.41   73.56   78.84
Tversky Index       54.48     429.00           68.72   74.68   79.99
45
EVALUATION
• PMI is not normalized, hence it is sensitive to the number of occurrences of local entities in the Wikipedia corpus
  • E.g. PMI of local entities of Glen Rock, New Jersey is higher than those of San Francisco
46
EVALUATION
• Betweenness Centrality (BC) does a good job of assigning low scores to common entities
  • E.g. community college, National Weather Service, start up company etc.
• Fails for entities with some relevance to the city but no distinguishing factor
  • E.g. IBM with respect to Endicott, New York
47
LOCALNESS MEASURE OF ENTITIES
48
EVALUATION
• Jaccard Index underperforms for local entities whose pages link to fewer entities than the city page
  • E.g. Eureka Valley and California with respect to San Francisco.
49
EVALUATION
[Figure: overlap of the linked entities of San Francisco with those of California and of Eureka Valley; values shown: 0.03005 and 0.07092]
50
EVALUATION
• Tversky Index is the best performing localness measure
• Overcomes the disadvantage of Jaccard Index
  • For example: We are able to assign higher localness to Eureka Valley (0.7096) than California (0.1270) with respect to San Francisco
51
Top-k Accuracy
EVALUATION
52
Top-k Average Error Distance
EVALUATION
53
Distribution of all users in the dataset
Distribution of accurately predicted users
Distribution of users
54
Comparison with Existing Approaches
EVALUATION
Method                               ACC (%)   AED (in miles)
Cheng, Caverlee, and Lee, 2010       51.00     535.56
Chang, Lee, Eltaher, and Lee, 2012   49.9      509.3
Wikipedia based Approach             54.48     429.00
55
Impact of Local Entities
EVALUATION
56
Top 100 Cities
EVALUATION
• 2172 users from the dataset are from the top-100 most populated cities of the United States
• 60% of users predicted within 100 miles of their actual location
• 54% of users predicted exactly at the city level
57
CONCLUSION
• Presented a crowd-sourced, knowledge-based approach that does not require geo-tagged tweets as a training dataset to predict the location of a user
• Introduced the concept of Local Entities and preprocessed the Wikipedia Hyperlink Graph to extract local entities for each city
• Investigated relatedness measures to establish the degree of association between a local entity and a city
• Evaluated the proposed approach against a benchmark dataset published by Cheng et al. For 5119 users, we are able to predict the location of 55% of users within 100 miles, with an average error distance of 429 miles
58
FUTURE WORK
• Compute the confidence score of the prediction based on top-k cities and count of local entities in tweets
• Investigate other localness measures for scoring local entities
• Consider semantic types and categories of local entities, and weight the contribution based on types
• Explore other knowledge bases such as Wikitravel and GeoNames
59
ACKNOWLEDGEMENTS
THANK YOU!
Amit P. Sheth Krishnaprasad Thirunarayan
Derek Doran
60