![Page 1: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/1.jpg)
Information Retrieval Methods for Social Media
600.466
![Page 2: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/2.jpg)
Social Media?
![Page 3: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/3.jpg)
![Page 4: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/4.jpg)
Properties of Social Media : Scale
• Twitter (Chirp 2010)
  – More than 100M user accounts
  – More than 600M search queries a day
  – 55M tweets a day
• Facebook
  – More than 400M active users
  – More than 25 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month
![Page 5: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/5.jpg)
Properties of Social Media: Immediacy
• Need to share breaking news
• Search: Content vs. Peer recommendation
![Page 6: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/6.jpg)
Properties of Social Media: Duplication
• Duplication of content
  – Blogs: Copy-Paste
  – Twitter: “Re-tweet”
  – Groups: Cross-posting
  – Email: Signature lines, Inline Replies
![Page 7: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/7.jpg)
Properties of Social Media: Semi-structuredness
• Informal but structured
  – Informal != low quality (e.g. Wikipedia)
• Structure
  – Metadata
  – Connectivity
![Page 8: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/8.jpg)
Suggested Reading
“Towards a PeopleWeb”, Raghu Ramakrishnan & Andrew Tomkins, IEEE Computer, Aug 2007
“Important properties of users and objects will move from being tied to individual Web sites to being globally available. The conjunction of a global object model with portable user context will lead to a richer content structure and introduce significant shifts in online communities and information discovery.”
![Page 9: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/9.jpg)
Properties of Social Media
• Scale
• Immediacy
• Heterogeneity
• Duplication
• Semi-structuredness
![Page 10: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/10.jpg)
Properties of Social Media Graphs
• Small-world property
  – Six degrees of separation
  – Facebook: 5.73 (Bunyan, 2009)
  – MS Messenger: ~7 (Leskovec & Horvitz, 2007)
• Mathematically
  – Low average path length
  – High clustering coefficient
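The two quantities named above can be computed directly on a small graph. The following is a minimal sketch, assuming an undirected graph stored as a dict of neighbor sets; the function names and the toy graph are illustrative and not taken from the slides.

```python
from collections import deque
from itertools import combinations


def average_path_length(adj):
    """Mean BFS distance over all ordered pairs of distinct, reachable nodes."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def clustering_coefficient(adj):
    """Average, over nodes, of (links among neighbors) / (possible links)."""
    scores = []
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            scores.append(0.0)  # fewer than two neighbors: no triangle possible
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        scores.append(2.0 * links / (k * (k - 1)))
    return sum(scores) / len(scores)


# Toy graph (hypothetical): a triangle a-b-c with a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(average_path_length(toy), clustering_coefficient(toy))
```

A small-world graph keeps the first number low while the second stays high; on real social graphs the all-pairs BFS above would be replaced by sampling.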
![Page 11: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/11.jpg)
Network Evolution and Path Size
![Page 12: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/12.jpg)
Properties of Social Media Graphs
• Power law degree distribution (asymptotically): $P(d) \propto d^{-\gamma}$, with $2 < \gamma < 3$
• Property of most real world networks
• Existence of “hubs”
• Scale free networks
![Page 13: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/13.jpg)
Probabilistic Modeling of Networks
• Erdos-Renyi Model
  – Choose a pair of nodes uniformly at random and add an edge.
  – G(n, p)
  – Not Scale Free (small avg. path but low clustering coefficient)
  – Scale Free networks don’t evolve by chance
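To make the G(n, p) notation concrete, here is a minimal sketch of a generator in which every pair of nodes is linked independently with probability p. The function names, the seed parameter, and the sample sizes are assumptions made for illustration.

```python
import random
from collections import Counter


def gnp_graph(n, p, seed=0):
    """Undirected G(n, p): each of the n*(n-1)/2 pairs gets an edge with probability p."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def degree_histogram(adj):
    """Map each degree value to the number of nodes having that degree."""
    return Counter(len(nbrs) for nbrs in adj.values())


g = gnp_graph(200, 0.05)
for degree, count in sorted(degree_histogram(g).items()):
    print(degree, count)
```

In this model degrees follow a binomial distribution and cluster around n·p, so the histogram shows no heavy tail and no hubs, which is the contrast with scale-free networks drawn on the slide.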
![Page 14: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/14.jpg)
Probabilistic Modeling of Networks
• Preferential Attachment (Barabasi and Albert, 99)
  – Rich become richer
  – Stochastic process: Using Polya’s urn
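The "rich become richer" process can be sketched with an urn that holds one ball per edge endpoint, so drawing a ball uniformly picks a node with probability proportional to its degree. This is a simplified illustration, not the exact construction from the cited paper; the parameter m (edges added per new node) and the seed clique are assumptions.

```python
import random
from collections import Counter


def preferential_attachment(n, m=2, seed=0):
    """Grow a graph to n nodes; each new node links to m distinct existing
    nodes chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Seed graph: a clique on m + 1 nodes, so every node starts with degree m.
    edges = [(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)]
    # The urn holds each node once per incident edge.
    urn = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(urn))
        for t in targets:
            edges.append((new, t))
            urn.extend((new, t))
    return edges


edges = preferential_attachment(1000, m=2)
degree = Counter(node for edge in edges for node in edge)
print("max degree:", max(degree.values()), "min degree:", min(degree.values()))
```

Early nodes accumulate far more links than late ones, which produces the hubs that a uniform random model lacks.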
![Page 15: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/15.jpg)
Why model?
• Study network evolution, degeneration
  – Develop algorithms
• Detect communities
• Who are the movers and shakers?
• Detect diffusion of ideas across networks
• Detect anomalies
![Page 16: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/16.jpg)
Crawling Social Networks
• HTML (Slashdot)
• RSS/Atom feeds (blogs)
• API driven (Twitter, Facebook, …)
  – Data liberation
![Page 17: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/17.jpg)
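import java.util.List;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;

// Authenticate, then print the authenticated user's friends timeline.
Twitter twitter = new TwitterFactory().getInstance(twitterID, twitterPassword);
List<Status> statuses = twitter.getFriendsTimeline();
System.out.println("Showing friends timeline.");
for (Status status : statuses) {
    System.out.println(status.getUser().getName() + ":" + status.getText());
}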
http://twitter4j.org
![Page 18: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/18.jpg)
Storage and Indexing
• Graph stores can be designed more efficiently than traditional RDBMSs or flat files (document IR)
• A family of “triple stores” or graph databases (#NoSQL movement)
  – Neo4J
  – CouchDB
  – Hypertable
  – …
![Page 19: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/19.jpg)
Data is becoming more and more connected
(Eifrem, OSCON 2009)
![Page 20: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/20.jpg)
Social Media Graphs
(Eifrem, OSCON 2009)
![Page 21: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/21.jpg)
Social Media Graphs : Representation
• Nodes
• Relationship between nodes
• Properties on both
• Storing in flat files vs. graph databases
• Neo4J, disk based solution – works well for sizes up to a few billion (Single JVM)
![Page 22: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/22.jpg)
Processing Large Scale Graph Data
• Better representation
• Parallel computation
  – MapReduce
  – BSP
![Page 23: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/23.jpg)
Parallelism via Map-Reduce
• A paradigm that views input as (key, value) pairs, with algorithms processing these pairs in one of two stages
  – Map: Perform operations on individual pairs
  – Reduce: Combine all pairs with the same key
  – Functional programming origins
  – Abstracts away system specific issues
  – Manipulate large quantities of data
![Page 24: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/24.jpg)
Parallelism via Map-Reduce
• Input is a sequence of key value pairs (records)
• Processing of any record is independent of the others
• Need to recast algorithms and sometimes data to fit to this model
  – Think of structured data (Graphs!)
![Page 25: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/25.jpg)
![Page 26: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/26.jpg)
Example: Building inverted indexes
Input: Collection of documents
Output: For each word, find all documents with the word

def mapper(filename, content):
    # Emit one (word, filename) pair per word occurrence.
    for word in content.split():
        yield word, filename

def reducer(key, values):
    # Deduplicate the filenames collected for this word.
    return key, sorted(set(values))
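The inverted-index mapper and reducer can be exercised on a single machine with a small driver that stands in for the framework's shuffle phase. This is a sketch only: `run_mapreduce` and the sample documents are hypothetical, and a real MapReduce runtime would distribute the map and reduce calls across workers.

```python
from collections import defaultdict


def mapper(filename, content):
    # Emit one (word, filename) pair per word occurrence.
    for word in content.split():
        yield word, filename


def reducer(key, values):
    # Deduplicate the filenames collected for this word.
    return key, sorted(set(values))


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for a MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(k, vs) for k, vs in groups.items())


docs = [
    ("d1.txt", "social media search"),
    ("d2.txt", "graph search"),
    ("d3.txt", "social graph graph"),
]
index = run_mapreduce(docs, mapper, reducer)
print(index["graph"])
```

The grouping step in the middle is the only place where data from different records meets, which is why each mapper call can run independently.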
![Page 27: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/27.jpg)
Map-Reducing graph data
• Note: By design the mappers cannot communicate with each other.
• The graph representation should be such that all information (e.g. neighborhood) needed for processing a node is locally available.
• The adjacency list representation is perfectly suited.
• Key: vertex in the graph
• Value: neighbors of the vertex and their associated values
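As an illustration of the key and value layout described above, the sketch below turns a directed edge list into one record per vertex. The record shape (a dict with "rank" and "neighbors" fields) and the function name are assumptions chosen for this example.

```python
from collections import defaultdict


def to_adjacency_records(edges, initial_rank=1.0):
    """Turn a directed edge list into (vertex, value) records, where the value
    carries the vertex's own state and its out-neighbors."""
    neighbors = defaultdict(list)
    vertices = set()
    for src, dst in edges:
        neighbors[src].append(dst)
        vertices.update((src, dst))
    return [
        (v, {"rank": initial_rank, "neighbors": sorted(neighbors[v])})
        for v in sorted(vertices)
    ]


records = to_adjacency_records([("a", "b"), ("a", "c"), ("b", "c")])
for key, value in records:
    print(key, value)
```

Each record is self-contained, so a mapper that receives it needs no access to any other part of the graph.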
![Page 28: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/28.jpg)
Computing PageRank (MapReduce)
• PageRank update with dampening parameter α, where P is the transition probability matrix
• One map-reduce per iteration
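The update formula itself is not present in this transcript. One commonly used form is given here as an assumption; the slide's own notation may differ. It takes $r^{(t)}$ as the vector of ranks after iteration $t$, $n$ as the number of vertices, and $P_{ij}$ as the probability of stepping from vertex $i$ to vertex $j$:

```latex
r^{(t+1)} = \alpha \, P^{\top} r^{(t)} + \frac{1 - \alpha}{n} \, \mathbf{1}
```

Each application of this update corresponds to one map-reduce pass over the adjacency records.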
![Page 29: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/29.jpg)
MapReduce: PageRank Iteration
Map(Key k, Value v) {
    r_old = k.rank;
    r = 0;
    foreach node n in v.getNeighbors() {
        r += p(n, k) * r_old + dampening_factor;
    }
    v.rank = r;
    Emit(k, v);
}
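One common way to cast a PageRank iteration as a map and reduce pair is sketched below. It is an illustration under stated assumptions rather than the slide's method: transition probabilities are taken to be uniform over out-links, `ALPHA = 0.85` is an assumed value, every vertex is assumed to have at least one out-link, and all names are hypothetical.

```python
from collections import defaultdict

ALPHA = 0.85  # assumed value for the dampening parameter


def pagerank_map(vertex, value):
    """Pass the graph structure along, then send an equal share of rank to each neighbor."""
    yield vertex, ("structure", value["neighbors"])
    if value["neighbors"]:
        share = value["rank"] / len(value["neighbors"])
        for n in value["neighbors"]:
            yield n, ("rank", share)


def pagerank_reduce(vertex, messages, num_vertices):
    """Sum incoming shares and apply the dampened update."""
    neighbors, incoming = [], 0.0
    for kind, payload in messages:
        if kind == "structure":
            neighbors = payload
        else:
            incoming += payload
    rank = ALPHA * incoming + (1.0 - ALPHA) / num_vertices
    return vertex, {"rank": rank, "neighbors": neighbors}


def pagerank_iteration(records):
    """One map-reduce pass: map every record, group by key, reduce every group."""
    groups = defaultdict(list)
    for vertex, value in records:
        for key, message in pagerank_map(vertex, value):
            groups[key].append(message)
    n = len(records)
    return [pagerank_reduce(v, msgs, n) for v, msgs in sorted(groups.items())]


# Toy graph (hypothetical): a -> b, a -> c, b -> c, c -> a.
records = [
    ("a", {"rank": 1 / 3, "neighbors": ["b", "c"]}),
    ("b", {"rank": 1 / 3, "neighbors": ["c"]}),
    ("c", {"rank": 1 / 3, "neighbors": ["a"]}),
]
for _ in range(20):
    records = pagerank_iteration(records)
ranks = {v: value["rank"] for v, value in records}
print(ranks)
```

The mapper has to re-emit the adjacency list on every pass so the next iteration still knows the graph, which is the data-locality cost of running an iterative graph algorithm on this model.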
![Page 30: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/30.jpg)
Processing Large Scale Graph Data
• MapReduce is not the best model for large scale graph processing
  – Simple graph concepts (PageRank, BFS, …) are not easy to program
  – MapReduce does not preserve data locality in consecutive operations
![Page 31: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/31.jpg)
A New Paradigm to Process Large Scale Graph Data
• Bulk Synchronous Parallel
• Developed in the 80s by Leslie Valiant
• Introduced by Google for graph computation: “Pregel: a system for large-scale graph processing” (Malewicz et al, PODC 2009)
![Page 32: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/32.jpg)
Bulk Synchronous Parallel
• Sequence of steps – “SuperSteps”
• Each SuperStep S
  – Execute a user defined Compute() function on every vertex in parallel
  – Input to Compute(): All messages from SuperStep S – 1
  – Output of Compute(): Messages to other vertices
1B vertices, 80B edges, 2000 workers; Bellman-Ford: 200s (Malewicz et al, PODC 2009)
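The superstep loop can be simulated in a single process to show how messages from one step become the input of the next. The sketch below runs a Bellman-Ford style shortest-path computation; it assumes non-negative edge weights and a graph given as a dict of (neighbor, weight) lists, and the function name is hypothetical rather than part of any Pregel API.

```python
import math
from collections import defaultdict


def bsp_shortest_paths(graph, source):
    """Return (distances, number of supersteps) for single-source shortest paths."""
    dist = {v: math.inf for v in graph}
    inbox = {source: [0]}  # messages visible at the start of superstep 0
    supersteps = 0
    while inbox:  # halt once no vertex has pending messages
        outbox = defaultdict(list)
        # Compute(): conceptually runs on every vertex with messages, in parallel.
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < dist[vertex]:
                dist[vertex] = best
                for neighbor, weight in graph[vertex]:
                    outbox[neighbor].append(best + weight)
        inbox = outbox  # barrier: sent messages are delivered in the next superstep
        supersteps += 1
    return dist, supersteps


graph = {
    "s": [("a", 1), ("b", 4)],
    "a": [("b", 2), ("c", 6)],
    "b": [("c", 3)],
    "c": [],
}
dist, supersteps = bsp_shortest_paths(graph, "s")
print(dist, supersteps)
```

A vertex only does work when it receives a message that improves its distance, so the computation quiets down on its own once distances stop changing.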
![Page 33: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/33.jpg)
Why “SuperStep”?
• Internally consists of three stages
![Page 34: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/34.jpg)
Computing PageRank (BSP version)
Compute() {
    r_old = r;
    r = 0;
    for each incoming message m {
        r += m.p * r_old + dampening_factor;
    }
    if (abs(r - r_old) < epsilon) done();
}
![Page 35: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/35.jpg)
Suggested Reading
• “Truly, Madly, Deeply Parallel”, Robert Matthews, New Scientist, Feb 1996
• “Pregel: a system for large-scale graph processing” (Malewicz et al, PODC 2009)
![Page 36: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/36.jpg)
Social Network Analysis
• Retrieving information from structure
• Example: Community Discovery
• Many practical applications
• One approach: “Edge Betweenness”
  – betweenness(e) = # triangles(e)/max(e)
  – iteratively prune edges with low betweenness
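The pruning idea can be sketched as follows. The reading of max(e) as the largest number of triangles the edge could take part in, min(deg(u), deg(v)) - 1, is an assumption, as is the choice to score edges with no possible triangle as 0; all function names are hypothetical.

```python
def edge_scores(adj):
    """Score each undirected edge by triangles through it over the maximum possible."""
    scores = {}
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                possible = min(len(adj[u]), len(adj[v])) - 1
                triangles = len(adj[u] & adj[v])  # common neighbors close a triangle
                scores[(u, v)] = triangles / possible if possible > 0 else 0.0
    return scores


def prune_lowest(adj):
    """Return a copy of the graph with all minimum-score edges removed."""
    scores = edge_scores(adj)
    pruned = {u: set(nbrs) for u, nbrs in adj.items()}
    if not scores:
        return pruned
    low = min(scores.values())
    for (u, v), score in scores.items():
        if score == low:
            pruned[u].discard(v)
            pruned[v].discard(u)
    return pruned


def components(adj):
    """Connected components via depth-first search."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            u = stack.pop()
            if u not in group:
                group.add(u)
                stack.extend(adj[u] - group)
        seen |= group
        result.append(group)
    return result


# Toy graph (hypothetical): two triangles joined by the single edge c-d.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
communities = components(prune_lowest(adj))
print(communities)
```

Edges inside a tightly knit group sit in many triangles and score high, while an edge bridging two groups sits in few, so one round of pruning is enough to separate the two triangles here.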
![Page 37: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/37.jpg)
Book
Networks, Crowds, and Markets: Reasoning About a Highly Connected World
By David Easley and Jon Kleinberg
http://www.cs.cornell.edu/home/kleinber/networks-book/
![Page 38: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/38.jpg)
Recap …
• Properties of social media
  – Scale
  – Immediacy
  – Heterogeneity
  – Duplication
  – Semi-structuredness
• Properties of social media graphs
  – Small-worldness
  – Scale free property
  – Evolution models
• Crawling
  – API driven
• Indexing & Retrieval
  – Graph databases
• Processing large scale social networks
  – MapReduce
  – Bulk Synchronous Parallel
• IR from structure
  – Social Network Analysis
![Page 39: Social Media? Properties of Social Media : Scale Twitter (Chirp 2010) – More than 100M user accounts – more than 600M search queries a day – 55M tweets](https://reader035.vdocuments.us/reader035/viewer/2022062511/551aa14455034656628b463a/html5/thumbnails/39.jpg)
Social Media Related Projects