Crowd Crawling: Towards Collaborative Data Collection for Large-scale Online Social Networks

Cong Ding, Yang Chen*, and Xiaoming Fu

University of Göttingen, *Duke University

Significance of social network data crawling

•Understanding user behaviors

•Improving SNS architectures

•Handling privacy/security issues

•and so on...

Current data collection methods (1)

•ISP-based measurement [Schneider IMC’09]

Only ISP companies can do that

•Cooperate with SNS companies [Yang IMC’11]

Most research groups do not have the chance

•Crawl data by a single group (and share it with others) [Gjoka INFOCOM’10]

Suffers from request rate limiting

Shortcomings of crawling by a single group

•Wastes computing and network resources

•Introduces overhead to service providers (and may lead to stricter rate limiting)

•Lack of ground truth for the research community

A new thought

Why not collect data collaboratively?

System overview

Coordinator

Crawlers

System design

•Fetching UIDs (BFS, etc.)

•Handling crawling failure (timeout)

•Bypassing request rate limiting (massive IP addresses)

•Data fidelity (redundant crawling)
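The first two design points can be illustrated with a minimal sketch of a coordinator, assuming a FIFO frontier for BFS order and a per-task deadline for failure handling. The class name `Coordinator`, its methods, and the `timeout_s` parameter are hypothetical and are not taken from the prototype; rate-limit handling and redundant crawling are left out.

```python
import time
from collections import deque


class Coordinator:
    """Hypothetical coordinator: hands UIDs to crawlers, grows the frontier in BFS order."""

    def __init__(self, seed_uids, timeout_s=60.0):
        self.frontier = deque(seed_uids)  # UIDs waiting to be assigned (FIFO gives BFS order)
        self.seen = set(seed_uids)        # every UID ever queued, to avoid duplicates
        self.in_flight = {}               # uid -> (crawler_id, deadline)
        self.done = {}                    # uid -> crawled record
        self.timeout_s = timeout_s        # assumed per-task timeout, in seconds

    def _requeue_expired(self, now):
        # A crawler that misses its deadline is treated as failed; its UID is queued again.
        expired = [uid for uid, (_, deadline) in self.in_flight.items() if deadline <= now]
        for uid in expired:
            del self.in_flight[uid]
            self.frontier.append(uid)

    def next_task(self, crawler_id, now=None):
        """Return the next UID for this crawler, or None if nothing is pending."""
        now = time.monotonic() if now is None else now
        self._requeue_expired(now)
        while self.frontier:
            uid = self.frontier.popleft()
            if uid in self.done:
                continue  # a late result arrived after the UID was requeued
            self.in_flight[uid] = (crawler_id, now + self.timeout_s)
            return uid
        return None

    def submit(self, uid, record, neighbor_uids):
        """Store a crawled record and enqueue newly discovered UIDs."""
        if uid in self.done:
            return  # ignore duplicate submissions
        self.in_flight.pop(uid, None)
        self.done[uid] = record
        for neighbor in neighbor_uids:
            if neighbor not in self.seen:
                self.seen.add(neighbor)
                self.frontier.append(neighbor)
```

In this sketch a crawler would loop over `next_task`, fetch the user's data, and call `submit` with the UIDs found in that user's social connections. Passing `now` explicitly is only a convenience for testing.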

Implementation

•A proof-of-concept prototype (without the data fidelity part) to crawl Weibo

•472 PlanetLab servers as crawlers

Evaluation

•In 24 hours, we crawled 2.22M users’ data from Weibo, including user profiles, all posts, and all social connections

•Comparison:

•Fu et al. (PLOS ONE 2013) got 30K users’ data in 6 days

•Guo et al. (PAM 2013) got 1M users’ data in 1 month

            Crowd Crawling   Fu et al.   Guo et al.
#UIDs/day   2.22M            5K          33K
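The per-day figures in the comparison follow from the totals and durations quoted above. A small worked calculation, assuming a 30-day month for Guo et al. (the slides say only "1 month"); the function name `uids_per_day` is illustrative:

```python
def uids_per_day(total_uids, days):
    """Average number of user IDs crawled per day."""
    return total_uids / days


crowd = uids_per_day(2_220_000, 1)   # Crowd Crawling: 2.22M users in 24 hours
fu = uids_per_day(30_000, 6)         # Fu et al.: 30K users in 6 days
guo = uids_per_day(1_000_000, 30)    # Guo et al.: 1M users in 1 month (assumed 30 days)

print(f"Fu et al.:  {fu:,.0f} UIDs/day, Crowd Crawling is about {crowd / fu:.0f}x faster")
print(f"Guo et al.: {guo:,.0f} UIDs/day, Crowd Crawling is about {crowd / guo:.0f}x faster")
```

Under these assumptions the rates come out to 5K and roughly 33K UIDs per day, matching the table.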


Conclusion and Discussion

•Data sharing may violate some providers’ terms of service

  o Twitter does not allow sharing data (even for research)

  o Weibo allows sharing data among researchers

•Unlimited data sharing might cause ethical issues

  o The data should be anonymized

•We will publish the data crawled in the evaluation
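One possible first step toward the anonymization mentioned above is to replace UIDs with keyed pseudonyms and drop directly identifying fields before sharing. The sketch below is an assumption for illustration, not the authors' procedure; the record layout (`uid`, `followees`, `screen_name`, `profile_url`) and all function names are hypothetical. Pseudonymizing identifiers alone may not be enough, since graph structure and post content can still allow re-identification.

```python
import hashlib
import hmac
import secrets


def make_pseudonymizer(key=None):
    """Return a function mapping a UID to a stable pseudonym under a secret key."""
    key = secrets.token_bytes(32) if key is None else key  # keep the key private

    def pseudonymize(uid):
        digest = hmac.new(key, str(uid).encode("utf-8"), hashlib.sha256).hexdigest()
        return digest[:16]  # truncated for readability; an assumption, not a requirement

    return pseudonymize


def anonymize_record(record, pseudonymize, drop_fields=("screen_name", "profile_url")):
    """Copy a record, dropping identifying fields and pseudonymizing all UIDs."""
    cleaned = {k: v for k, v in record.items() if k not in drop_fields}
    cleaned["uid"] = pseudonymize(record["uid"])
    cleaned["followees"] = [pseudonymize(u) for u in record.get("followees", [])]
    return cleaned
```

Using the same key for the whole data set keeps social connections consistent: a UID maps to the same pseudonym wherever it appears.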
