tweetskb: a public and large-scale rdf corpus of annotated...
TRANSCRIPT
TweetsKB: A Public and Large-Scale RDFCorpus of Annotated Tweets
Pavlos Fafalios1 Vasileios Iosifidis1 Eirini Ntoutsi1
Stefan Dietze1
1L3S Research CenterLeibniz Universitat Hannover
Hannover, Germany
June 2018
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 1/25
Micro-blogging for everyday life
I Micro-blogging services are extremely popular these days.
I Combination of blogging and instant messaging.
I Huge amount of data.
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 2/25
TwitterI The most well known micro-blogging platform.I Twitter’s generated data has been used in a wide area of
research:I data science, sociology, psychology and so on and so forth.
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 3/25
Figure: Number of tweets per month
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 4/25
Entity Relatedness ?
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 5/25
Opinions Overtime ?
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 6/25
Journalists manually analysed 14,000 tweets1 !
1https://www.nytimes.com/interactive/2016/12/06/upshot/how-to-know-what-donald-trump-really-cares-about-look-at-who-hes-insulting.html
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 7/25
Lack of large scale public social archives
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 8/25
TweetsKB Overview
I Public RDF corpus of anonymized tweets.
I More than 1.5 billion English tweets.
I Spanning period: Jan-2013 till Nov-2017.
I Includes entities and sentiment annotations.I Ideal for:
I time-aware and entity-centric exploration.I data integration by existing knowledge base (like DBpedia).I multi-aspect and entity-centric analysis.
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 9/25
Generating TweetsKB: FilteringI 6 billion tweets (Jan-13 - Nov-17).I Retweets, non-English tweets and spam2 tweets were removed.I We employ some of tweets’ metadata.
45,647
59,194
5,514,931
30,530,907
29,959,886
34,980,039
39,936,540
29,352,003
36,067,474
21,583,787
33,675,162
26,677,285
35,601,761
32,012,415
31,378,550
33,232,571
33,247,115
30,923,457
34,516,026
34,131,838
30,392,248
30,860,493
30,812,913
30,570,787
30,734,059
27,671,321
30,245,433
28,641,364
28,601,539
28,133,412
29,800,237
30,555,993
27,709,765
28,003,261
28,585,095
28,331,869
27,508,112
25,411,034
26,855,489
26,286,412
26,565,430
26,573,886
27,058,115
24,079,006
24,200,624
24,350,615
16,577,648
23,007,020
24,741,745
22,436,542
23,285,384
21,163,478
22,357,313
20,978,546
21,910,272
21,974,665
20,246,552
19,978,891
19,473,362
2013-01
2013-02
2013-03
2013-04
2013-05
2013-06
2013-07
2013-08
2013-09
2013-10
2013-11
2013-12
2014-01
2014-02
2014-03
2014-04
2014-05
2014-06
2014-07
2014-08
2014-09
2014-10
2014-11
2014-12
2015-01
2015-02
2015-03
2015-04
2015-05
2015-06
2015-07
2015-08
2015-09
2015-10
2015-11
2015-12
2016-01
2016-02
2016-03
2016-04
2016-05
2016-06
2016-07
2016-08
2016-09
2016-10
2016-11
2016-12
2017-01
2017-02
2017-03
2017-04
2017-05
2017-06
2017-07
2017-08
2017-09
2017-10
2017-11
Figure: TweetsKB: Number of tweets per month
2for spam detection HSpam14 dataset was used [Sedhai and Sun, 2015]Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 10/25
Entity Linking
I Yahoo’s FEL tool [Blanco et al., 2015].
I FEL has been specifically designed for short texts.
I 1.4 million distinct entities.I Evaluated on NEEL challenge dataset [Rizzo et al., 2016]:
I Dataset consists of 9,289 English tweets from 2011, 2013,2014 and 2015.
I Precisions = 85% , Recall = 39% and F1 = 54%
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 11/25
Sentiment Annotation
I For sentiment analysis we have used SentiStrength [Thelwallet al., 2012].
I SentiStrength assigns both positive and negative scores toshort texts [5, -5].
I Evaluated on TSentiment15 dataset [Iosifidis and Ntoutsi,2017]:
I Dataset consists of 2.5 million English tweets from 2015.I Accuracy = 91% , F1 = 80%
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 12/25
TweetKB’s overall statistics
I TweetsKB is available as N3 files through Zenodo repo (DOI:10.5281/zenodo.573852)3
I It’s also registered at datahub.ckan.io4
Number of tweets 1,560,096,518
Number of distinct users 125,104,569
Number of distinct hashtags 40,815,854
Number of distinct user mentions 81,238,852
Number of distinct entities 1,428,236
Number of tweets with sentiment 772,044,599
Number of RDF triples 48,207,277,042
3https://zenodo.org/record/5738524https://datahub.ckan.io/dataset/tweetskb
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 13/25
Computational Cost
I Cluster: CPUs 504, RAM 6,784 GB, Disk 2.1 PB
I Sentiment annotation: 6M tweets per minute
I Entity Extraction: 4.8M tweets per minute
I Triplification: 14M tweets per minute
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 14/25
RDF Schema Our schema follows established vocabularies such asSIOC, schema.org, and Onyx
schema:ShareAction
rdfs:Literal
rdfs:Literal nee:Entity
rdfs:Literalnee:confidence
nee:detectedAs
nee:hasMatchedURI
rdfs:Literal
sioc: http://rdfs.org/sioc/ns#sioc_t: http://rdfs.org/sioc/types#dc: http://purl.org/dc/terms/rdfs: http://www.w3.org/2000/01/rdf-schema#nee: http://www.ics.forth.gr/isl/oae/core#schema: http://schema.org/onyx: http://www.gsi.dit.upm.es/ontologies/onyx/nswna: http://www.gsi.dit.upm.es/ontologies/wnaffect/ns#
dc:created
sioc:Post
rdfs:Resource
sioc:id
rdfs:Literal
schema:mentions
sioc:UserAccountschema:mentions sioc:name
sioc_t:Tag rdfs:Literalrdfs:label
sioc:UserAccountsioc:has_creator
rdfs:Literal
sioc:id
rdfs:Literal
schema:mentionsonyx:hasEmotionSet
onyx:EmotionSet
onyx:Emotion
wna:positive-emotionwna:negative-emotion
onyx:hasEmotion
onyx:hasEmotionCategory
onyx:hasEmotionIntensity
rdfs:Literal
schema:InteractionCounter
schema:interactionStatistic
schema:LikeAction
schema:interactionType
schema:userInteractionCount
Figure: Employed RDF/S model
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 15/25
Tweets in our RDF/S model are associated with six types ofelements:
I metadata: sioc : Post → tweet, sioc : UserAccount → user
I entity mentions: for each entity, we store surface form, URI,confidence score (NEE).
I user mentions: sioc : UserAccount → user
I hashtag mentions: sioc t : Tag
I sentiment scores: onyx : EmotionSet
I interaction statistics:schema : LikeAction → favorite count,schema : ShareAction → retweet count,
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 16/25
Example
nee:Entitynee:confidence
nee:detectedAs
nee:hasMatchedURI
dc:created
sioc:Postsioc:id
schema:mentions
sioc:UserAccount
sioc:name
sioc_t:Tag
rdfs:label
sioc:has_creatorsioc:id
rdf:type
_:ent1
dbr:Roger_Federer
-1.54
“Federer”
rdf:type
_:usr8
rdf:type
“livetennis”
_:tag1 “usopen”
schema:mentions
schema:mentions
06.01.2012 06:40
“9565121266”
_:usr1“2356912”
12
0.75 0.0
_:stat1
schema:interactionStatistic
schema:LikeAction
schema:interactionType
schema:userInteractionCount
_:ems1
onyx:hasEmotionSet
_:em1 _:em2
onyx:hasEmotion
wna:positive-emotion wna:negative-emotion
_:tweet1
rdf:type
onyx:hasEmotionIntensity
onyx:hasEmotionCategory
Figure: Tweet example
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 17/25
Analysis of the top 100K entities and hashtags
Figure: Distribution of top-100,000 entities.
Figure: Distribution of top-100,000 hashtags.Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 18/25
Analysis of the top 100K entities
DBpedia typeNumber of
distinct entities
http://dbpedia.org/ontology/Person 21,139 (21.1%)http://dbpedia.org/ontology/Organisation 14,815 (14.8%)http://dbpedia.org/ontology/Location 8,215 (8,2%)http://dbpedia.org/ontology/Athlete 5,192 (5.2%)http://dbpedia.org/ontology/Artist 3,737 (3.7%)http://dbpedia.org/ontology/City 2,563 (2.6%)http://dbpedia.org/ontology/Event 510 (0.5%)http://dbpedia.org/ontology/Politician 208 (0.2%)
Table: Overview of popular entity types of the top-100,000 entities.
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 19/25
Data Exploration and IntegrationSPARQL queries can be used to integrate information fromexternal knowledge bases such as DBpedia.
SELECT DISTINCT ?tweetID ?sentNegScore ?retweetCount ?politician ?birthPlace WHERE {
SERVICE <http://dbpedia.org/sparql> {
?politician dc:subject dbc:German_politicians ; dbo:birthPlace ?birthPlace }
?tweet a sioc:Post ; dc:created ?date ; sioc:id ?tweetID FILTER(year(?date) = 2016) .
?tweet schema:mentions ?entity . ?entity a nee:Entity ; nee:hasMatchedURI ?politician .
?tweet schema:interactionStatistic ?stat . ?stat schema:interactionType schema:ShareAction .
?stat schema:userInteractionCount ?retweetCount FILTER(?retweetCount > 100) .
?tweet onyx:hasEmotionSet ?emotSet . ?emotSet onyx:hasEmotion ?emot .
?emot onyx:hasEmotionCategory wna:negative-emotion ;
onyx:hasEmotionIntensity ?sentNegScore FILTER (?sentNegScore >= 0.75) }
Figure: SPARQL query for retrieving popular tweets in 2016 mentioning German politicians with strong negativesentiment.
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 20/25
Temporal Entity AnalyticsGiven an entity and a time period one can make temporal queriessuch as:SELECT ?month xsd:double(?cEnt)/xsd:double(?cAll)
WHERE {
{ SELECT (month(?date) AS ?month) (count(?tweet) AS ?cAll) WHERE {
?tweet a sioc:Post ; dc:created ?date FILTER(year(?date) = 2015)
} GROUP BY month(?date) }
{ SELECT (month(?date) AS ?month) (count(?tweet) AS ?cEnt) WHERE {
?tweet a sioc:Post ; dc:created ?date FILTER(year(?date) = 2015) .
?tweet schema:mentions ?entity .
?entity a nee:Entity ; nee:hasMatchedURI dbr:Alexis_Tsipras
} GROUP BY month(?date) }
} ORDER BY ?month
Figure: SPARQL query for retrieving the monthly popularity of Alexis Tsipras (Greek prime minister) in 2015
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 21/25
Entity RecommendationsEntity co-occurancies indicate entity relatedness.
July’15 Oct’15Apr’15
Greece
Vladimir Putin
Russia
Yanis Varoufakis
Sanctions (law)
Referendum
Reuters
Greek withdrawal from the eurozone
Syriza
Athens
European migrant crisis
Lesbos
Debt relief
The New York Times
Austerity
Greek War of Independ.
Pragmatism
Greek gov. debt crisis
Intern. Monetary Fund
Figure: Co-occuring entities of Tsipras (Greek prime minister) duringApr’15 - Dec’15
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 22/25
Sustainability and Maintenance
I Archive is deposed on Zenodo and registered atdatahub.ckan.io
I Integrated into AFEL5 and ALEXANDRIA6 projects.
I Endpoint7 which currently facilitates 5% of the whole corpus.
I Periodically update TweetsKB (every 6 months) with recentinformation.
5http://afel-project.eu/6http://www.alexandria-project.eu7http://l3s.de/tweetsKB/endpoint/
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 23/25
Conclusions
I Largest annotated RDF corpus of tweets.
I Advanced entity-centric exploration.
I Data integration with other knowledge bases (at queryexecution time).
I Analytics for different disciplines and research problems.
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 24/25
Thanks
Questions?Contact: {fafalios, iosifidis}@L3S.de
Vasileios Iosifidis | TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets | 25/25