On Benchmarking Online Social Media Analytical Queries
Slides for GRADES 2013 (Graph Data-management Experiences & Systems), a workshop affiliated with SIGMOD 2013: http://event.cwi.nl/grades2013/
Weining Qian
with Haixin Ma, Fan Xia, Jinxian Wei, Chengcheng Yu, and Aoying Zhou
http://database.ecnu.edu.cn/
6/23/2013 GRADES 2013 @ NY, USA 2
Outline
• Motivation
• BSMA: Benchmark for Social Media Analytical query processing
  – Data set
  – Queries
  – Measurements
• Preliminary results
• Discussion/on-going work
Motivation
• Social media has become a major source for sensing the world
  – Emergent event monitoring, political election/stock market prediction, product surveys, etc.
• Social media = social network + media
  – Social network: large-scale static/dynamic networks
  – Media: content with timestamps
• Both collective behavior analysis and personalized data analysis have many applications
  – Various kinds of queries
Motivation
• Many "big data" management/mining systems exist (and maybe more are coming)
  – Parallel RDBMS, NoSQL/NewSQL systems (Hadoop-related ones, Cassandra, etc.)
• Which system/technology is most suitable for a given problem?
  – A benchmark is needed
Social media data
Schema
BSMA
• Queries (to be extended/revised)
• Data set (crawled from Sina Weibo)
• Data generator (under development)
• BSMA performance testing tool (based on YCSB)
Data acquisition
• Crawled from Sina Weibo ("Chinese Twitter")
Haixin Ma, Weining Qian, Fan Xia, Xiaofeng He, Jun Xu, Aoying Zhou: Towards modeling popularity of microblogs. Frontiers of Computer Science 7(2): 171-184 (2013)
Data set
• Followship network
  – Seed users: 11 lawyers and opinion leaders and 21 researchers
  – 2nd level users from seeds: 120,000+ users
  – 3rd level users from seeds: 1.7+ million users
  – 4th level users from seeds: 18+ million users (incomplete)
• More than 1 billion following relationships
  – Tweets from 1.7+ million users
  – From Aug. 2009 to Jun. 2012
  – 480+ million tweets (about 51.11% of them are retweeted tweets, and others are original tweets)
Queries
• Queries on social networks
  – E.g. list the common followees of users A and B
• Queries on hotspots
  – Hotspots may be: users, tweets, topics, etc.
  – E.g. list the tweets with the highest #retweets
• Queries on timelines
  – E.g. list the 10 most recent tweets posted by A's followees
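The three query classes above can be illustrated with a minimal in-memory sketch. The data layout here (a `follows` map and a list of tweet tuples) and all function names are assumptions made for illustration only; they are not the BSMA schema or the benchmark's actual query definitions.

```python
from collections import Counter

# Hypothetical toy data, not the BSMA schema.
follows = {                      # user -> set of followees
    "A": {"C", "D", "E"},
    "B": {"D", "E", "F"},
}
# (tweet_id, author, timestamp, retweet_of); retweet_of is None for originals
tweets = [
    (1, "C", 100, None),
    (2, "D", 110, None),
    (3, "E", 120, 1),
    (4, "F", 130, 1),
    (5, "D", 140, 2),
]

def common_followees(a, b):
    """Social network query: followees shared by users a and b."""
    return follows.get(a, set()) & follows.get(b, set())

def most_retweeted(k):
    """Hotspot query: the k original tweets with the most retweets."""
    counts = Counter(rt for _, _, _, rt in tweets if rt is not None)
    return counts.most_common(k)

def recent_timeline(user, k=10):
    """Timeline query: ids of the k most recent tweets by user's followees."""
    followees = follows.get(user, set())
    posts = [t for t in tweets if t[1] in followees]
    posts.sort(key=lambda t: t[2], reverse=True)
    return [t[0] for t in posts[:k]]
```

On real data each of these becomes a join or aggregation over very large tables, which is what the benchmark is meant to stress.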
Query example (Q12)
[Figure: query plan for Q12, built from three joins (⨝)]
Rank the tweets appearing in A's followees’ timelines according to the number of retweets.
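A rough sketch of Q12 as a chain of three joins follows. It assumes that a user's timeline consists of the tweets posted by that user's followees, so that "A's followees' timelines" covers the followees of A's followees. The operands of the joins in the original plan are not recoverable from this transcript, so the table names and join order below are hypothetical.

```python
from collections import Counter

# Hypothetical toy tables, not the BSMA schema.
follows = {"A": {"B", "C"}, "B": {"D"}, "C": {"D", "E"}}   # user -> followees
tweets = {1: "D", 2: "D", 3: "E", 4: "A"}                  # tweet_id -> author
retweets = [(10, 1), (11, 1), (12, 3), (13, 4)]            # (retweet_id, original_id)

def q12(user):
    # Join 1 (follows x follows): authors visible in the followees' timelines.
    authors = set()
    for followee in follows.get(user, set()):
        authors |= follows.get(followee, set())
    # Join 2 (x tweets): tweets written by those authors.
    candidates = [tid for tid, author in tweets.items() if author in authors]
    # Join 3 (x retweets): count retweets per original tweet.
    counts = Counter(original for _, original in retweets)
    # Rank by retweet count (descending), breaking ties by tweet id.
    ranked = sorted(candidates, key=lambda tid: (-counts[tid], tid))
    return [(tid, counts[tid]) for tid in ranked]
```

For the toy data, `q12("A")` ranks tweet 1 first (two retweets), then tweet 3 (one retweet), then tweet 2 (none).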
BSMA performance testing tool based on YCSB
• YCSB: Yahoo! Cloud Serving Benchmark
  – http://wiki.github.com/brianfrankcooper/YCSB/
• BSMA modifications
  – Query argument and parameter generation
    • User IDs, top-k, timespan, etc.
  – Query wrappers
  – https://github.com/xiafan68/BSMA
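The query argument generation mentioned above might look roughly like the following sketch. YCSB itself is written in Java; this Python version only illustrates the idea, and the function name, parameter names, and value ranges are all assumptions rather than the BSMA tool's actual interface.

```python
import random

def make_query_args(user_ids, rng, max_k=100,
                    t_min=0, t_max=1_000_000, max_span=86_400):
    """Draw one set of query arguments (hypothetical parameter names).

    user_ids: candidate user IDs to query for
    rng:      a random.Random instance, seeded for reproducible workloads
    """
    start = rng.randint(t_min, t_max - max_span)   # timespan start
    span = rng.randint(1, max_span)                # timespan length
    return {
        "user_id": rng.choice(user_ids),
        "k": rng.randint(1, max_k),                # top-k size
        "start": start,
        "end": start + span,
    }
```

Seeding the generator lets different systems be tested against the same sequence of query arguments.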
Measurements
• Throughput
  – The highest throughput of the system under different settings of number of threads
• Latency
  – The (average) latency of the system under the setting with the 2nd highest throughput
• Scalability
  – The slope of the throughput/latency plot
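As a sketch of how these three measurements could be derived from a set of runs, assume each run is recorded as a (threads, throughput, average latency) tuple. The slide does not say how the slope is computed; the least-squares slope of latency against throughput used here is an assumption, as are the function name and the input format.

```python
def summarize(runs):
    """runs: list of (threads, throughput_ops_per_s, avg_latency_ms) tuples."""
    if len(runs) < 2:
        raise ValueError("need at least two thread settings")
    ranked = sorted(runs, key=lambda r: r[1], reverse=True)
    peak_throughput = ranked[0][1]        # highest throughput over all settings
    latency_at_second = ranked[1][2]      # latency at the 2nd highest throughput
    # Assumed definition of scalability: least-squares slope of
    # latency (y) as a function of throughput (x).
    xs = [r[1] for r in runs]
    ys = [r[2] for r in runs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else float("nan")
    return peak_throughput, latency_at_second, slope
```

Under this definition, a smaller slope means latency grows more slowly as throughput increases.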
WISE 2012 Challenge Performance Track
• A preliminary version of BSMA was used in the WISE 2012 Challenge Performance Track
• 4 teams
  – A special purpose (in-memory) system
  – An HBase-based system with a secondary index
  – A SQLite-based system with many optimizations
  – A special purpose system with B+-tree optimizations for different kinds of queries
Results
Find the set of people who share the same followee with the specified user.
Difficulties
• Joins of very large tables
• Skewness of the data distribution
  – Power-law distribution
• Preserving the orders in results
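To see why skew makes the joins harder, consider a self-join of the followship table on the followee column, as in a "people who share a followee" query. A followee with d followers contributes d·(d−1) ordered pairs, so a single hub can dominate the join output. The follower counts below are made up for illustration and are not taken from the BSMA data set.

```python
def self_join_pairs(follower_counts):
    """Ordered (follower, follower) pairs produced by joining the
    followship table with itself on the followee column."""
    return sum(d * (d - 1) for d in follower_counts)

# Two hypothetical networks with the same number of edges (1000).
uniform = [10] * 100          # every followee has 10 followers
skewed = [901] + [1] * 99     # one hub plus 99 followees with 1 follower each

uniform_pairs = self_join_pairs(uniform)   # 100 * 10 * 9
skewed_pairs = self_join_pairs(skewed)     # 901 * 900
```

With the same edge count, the skewed network yields 810,900 pairs against 9,000 for the uniform one, a 90-fold difference caused by a single hub.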
Future work
• Data generator
  – More than a social network generator
  – Simulate user activities
    • Followship network
    • Tweeting and retweeting actions
    • Timeline
    • Topics
Future work
• Queries related to content of tweets
  – Queries with keyword search
  – Real-life data set needed
• More queries
• Performance testing of more systems
  – RDBMS, graph database, etc.
More on BSMA
• Original WISE 2012 Challenge page
  – http://www.wise2012.cs.ucy.ac.cy/challenge.html
• WISE 2012 Challenge follow-up information
  – https://wnqian.wordpress.com/research/wise2012challenge/
• BSMA performance testing tool
  – https://github.com/xiafan68/BSMA
• Suggestions or comments are welcome!
  – Mailto: [email protected]
Thanks!
http://database.ecnu.edu.cn/