![Page 1: Community Structure and Information Flow in Usenet: Improving Analysis with a Thread Ownership Model Mary McGlohon, Carnegie Mellon* Matthew Hurst, Microsoft](https://reader035.vdocuments.us/reader035/viewer/2022081603/56649f1d5503460f94c342a1/html5/thumbnails/1.jpg)
Community Structure and Information Flow in Usenet: Improving Analysis with a Thread Ownership Model
Mary McGlohon, Carnegie Mellon*
Matthew Hurst, Microsoft
*work completed at Microsoft
Motivation
• Comparing communities of online social networks may lend insight into how groups form and thrive
• We would also like to understand how information diffuses between groups
[Figure: Collaborations at Santa Fe Institute (Girvan & Newman)]
Why Usenet?
• We delve into these questions by analyzing data from Usenet
• Public
• Can be analyzed over a long time period
• Has pre-defined, hierarchical community structure
• Two main goals:
▫ Compare different group activity (size, reciprocity)
▫ Observe diffusion between groups
Data
• Posts from 200 politically-oriented newsgroups (bulletin boards)
▫ “polit” in name
• January 2004-June 2008
• Several countries, states/provinces, and topics
• 19.6 million unique articles, 6.2 million cross-posted
[Figure: a parent post and its replies]
Cross-posting
• A large percentage of articles are cross-posted to multiple groups.
• Somebody reading one group may “reply-to-all”, such that all groups see it.
Major issue: many articles are cross-posted to multiple groups. Where is conversation truly occurring?
[Figure: example thread whose posts go to {alt.politics, us.politics}, {alt.politics, us.politics, pa.politics}, {alt.politics, us.politics, pa.politics}, and {alt.politics, us.politics}]
Outline
• Motivation
• Data description
• Structural Analysis
▫ Size
▫ Reciprocity
▫ Similarity
• Ownership model
▫ Effects of Cross-posting
▫ Information Flow based on Ownership
▫ Similarity
Structural Analysis
• We hope to compare the structure of communities by answering the following questions:
• How do edges form?
• How does the reciprocity of groups compare?
• How can we measure similarity?
Sizes of groups
• How do edges form?
• To answer, we build a network of authors for each group
• If a1 has replied to a2 at any point, there is an edge from a1 to a2
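The per-group author network can be sketched as below. This is a minimal illustration, not the authors' code. The post record layout (keys `author`, `parent`, `groups`) is hypothetical, and two choices are assumptions the slides do not state: a cross-posted reply adds the edge to every group it was posted to, and self-replies are ignored.

```python
from collections import defaultdict

def build_reply_graphs(posts):
    """Return {group: set of (replier, parent_author) directed edges}.

    `posts` maps a post id to a dict with hypothetical keys
    'author', 'parent' (a post id or None), and 'groups'.
    """
    graphs = defaultdict(set)
    for post in posts.values():
        parent = posts.get(post["parent"])
        if parent is None:
            continue  # thread starter, or parent missing from the data
        if post["author"] == parent["author"]:
            continue  # assumption: self-replies do not form edges
        for group in post["groups"]:
            # assumption: the edge counts in every group the reply reached
            graphs[group].add((post["author"], parent["author"]))
    return dict(graphs)

# Made-up two-post thread: a1 replies to a2 and cross-posts the reply.
posts = {
    "m1": {"author": "a2", "parent": None, "groups": ["alt.politics"]},
    "m2": {"author": "a1", "parent": "m1",
           "groups": ["alt.politics", "us.politics"]},
}
graphs = build_reply_graphs(posts)
```

Storing edges in a set means repeated replies between the same pair collapse into one edge, which matches "has replied at any point".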
Sizes of groups
•Power law-like relationship between number of authors and number of edges.
•Similar to densification law [Leskovec+05], only with individual networks instead of snapshots of a network over time.
[Figure: log(Number of edges) vs. log(Number of authors), with alt.politics and tw.bbs.politics labeled; a second panel plots log(edges) vs. log(nodes) with points marked t=2004 and t=2008]
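One simple way to check for a power law-like relationship is to fit a straight line to log(edges) against log(authors); the slope estimates the exponent. The sketch below uses ordinary least squares on made-up data. The slides do not say how the relationship was fitted, so both the method and the numbers here are assumptions.

```python
import math

def fit_power_law_exponent(sizes):
    """Least-squares slope and intercept of log(edges) vs. log(authors).

    `sizes` is a list of (number_of_authors, number_of_edges) pairs,
    one pair per newsgroup.
    """
    xs = [math.log(n) for n, _ in sizes]
    ys = [math.log(e) for _, e in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic groups generated from edges = 2 * authors**1.5 (not real data).
sizes = [(n, 2 * n ** 1.5) for n in (10, 100, 1000, 10000)]
slope, intercept = fit_power_law_exponent(sizes)
```

On this synthetic input the fitted slope recovers the exponent 1.5 up to floating-point error; real group sizes would scatter around the line.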
Reciprocity
• Which groups have highest reciprocity?
• Reciprocity: fraction of reply-edges that are mutual
• Top 10 were European newsgroups (up to 0.58):
▫ hun.politika
▫ relcom.politics
▫ hsv.politics
▫ italia.modena.politica
▫ se.politik
▫ it.discussioni.leggende.metropolitane
▫ ukr.politics
▫ yu.forum.politika
▫ ni.politics
▫ swnet.politik
• Lowest reciprocity occurred in tw.bbs.* (<0.1)
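Reciprocity as defined here can be computed directly from a group's directed reply edges. A minimal sketch, assuming edges are (replier, parent_author) pairs with no self-loops; the author names are made up.

```python
def reciprocity(edges):
    """Fraction of directed edges (a, b) whose reverse (b, a) also exists."""
    edge_set = set(edges)
    if not edge_set:
        return 0.0  # assumption: an empty network has reciprocity 0
    mutual = sum(1 for a, b in edge_set if (b, a) in edge_set)
    return mutual / len(edge_set)

# a1 and a2 reply to each other; the other two edges are one-way.
edges = {("a1", "a2"), ("a2", "a1"), ("a3", "a1"), ("a1", "a4")}
score = reciprocity(edges)
```

Here two of the four edges are mutual, so the score is 0.5.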
Similarity
• How can we measure similarity between groups?
• Use Jaccard coefficient for cross-posts:
Similarity(g1, g2) = (# shared articles (cross-posts) between the 2 groups) / (total number of distinct articles in the 2 groups)
• Can do the same with shared authors
• Highest similarity ~0.54 (bc.politics and ont.politics)
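The Jaccard coefficient is the size of the intersection divided by the size of the union. A minimal sketch on made-up article ids; the same function works for sets of authors.

```python
def jaccard(items_a, items_b):
    """Jaccard coefficient: |A intersect B| / |A union B|."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty groups have similarity 0
    return len(a & b) / len(union)

# Hypothetical article ids; m3 and m4 are cross-posted to both groups.
group_1 = {"m1", "m2", "m3", "m4"}
group_2 = {"m3", "m4", "m5"}
similarity = jaccard(group_1, group_2)
```

Two shared articles out of five distinct ones give a similarity of 0.4.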
Similarity
• Each group is a node
• Edge drawn if similarity > 0.1 (thick edge > 0.2)
• Groups form clusters: parties, US regional, countries, alt.politics subgroups
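The thresholded similarity graph can be sketched as follows. The thresholds 0.1 and 0.2 come from the slide; the group names, article ids, and the "thin"/"thick" labels are made up for illustration.

```python
from itertools import combinations

def similarity_edges(articles_by_group, threshold=0.1, thick=0.2):
    """List (g1, g2, similarity, style) for group pairs above `threshold`."""
    edges = []
    for g1, g2 in combinations(sorted(articles_by_group), 2):
        a, b = articles_by_group[g1], articles_by_group[g2]
        union = a | b
        sim = len(a & b) / len(union) if union else 0.0
        if sim > threshold:
            edges.append((g1, g2, sim, "thick" if sim > thick else "thin"))
    return edges

# Hypothetical groups with integer article ids.
articles_by_group = {
    "group.a": {1, 2, 3, 4},
    "group.b": {3, 4, 5},
    "group.c": {5, 6, 7, 8, 9, 10, 11},
}
edges = similarity_edges(articles_by_group)
```

In this toy input, group.a and group.b get a thick edge (0.4), group.b and group.c get a thin edge (about 0.11), and group.a and group.c share nothing, so no edge is drawn.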
[Figures: similarity-graph clusters for Parties/topics, US States, English-speaking countries, and alt.politics.*]
Outline
• Motivation
• Data Description
• Structural Analysis
▫ Size
▫ Reciprocity
▫ Similarity
• Ownership model
▫ Information Flow based on Ownership
▫ Similarity
Problem: Excessive cross-posting
• We just saw that there is significant overlap between groups in terms of articles
• However, cross-posting often occurs between unrelated groups (“edges below threshold”)
• We would like to find out in which group the activity is truly occurring
• How can we trace this?
Solution: Thread Ownership
• Answer: Assign “ownership” based on the authors of the posts
• First, assign authors to groups based on devotion
▫ Devotion(a,g): what percentage of an author a’s posts are exclusively posted to a given group g
• For each post, normalize devotion among groups where the post occurs.
▫ Group with highest devotion score for the author has more “ownership” of a post
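One way to write the two steps as formulas is given below. The symbols are assumptions introduced for illustration and do not appear in the slides: n_excl(a, g) is the number of posts by author a sent only to group g, n(a) is the total number of posts by a, a_p is the author of post p, and G(p) is the set of groups p was posted to.

```latex
\[
\mathrm{Devotion}(a, g) = \frac{n_{\mathrm{excl}}(a, g)}{n(a)},
\qquad
\mathrm{Ownership}(p, g) =
  \frac{\mathrm{Devotion}(a_p, g)}
       {\sum_{g' \in G(p)} \mathrm{Devotion}(a_p, g')}
  \quad \text{for } g \in G(p).
\]
```

Because n(a) cancels in the ratio, ownership can be computed from the raw exclusive-post counts. The slides do not say what happens when the author has no exclusive posts in any group of G(p), which would make the denominator zero.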
Example: Thread Ownership
• Suppose in the data authors have the following numbers of non-cross-posts in each group:

            alt.politics   us.politics   pa.politics
Author 1    6              4             0
Author 2    0              1             3
Author 3    0              1             2

• Then, they form a thread:
▫ Post to {alt.politics, us.politics}: ownership {0.6, 0.4}
▫ Post to {alt.politics, us.politics}: ownership {0, 1}
▫ Post to {alt.politics, us.politics, pa.politics}: ownership {0, 0.25, 0.75}
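The ownership values in this example can be reproduced by normalizing each author's non-cross-post counts over the groups a post was sent to. This is a minimal sketch: the counts are copied from the example's table, while the function name and the even split for an author with no exclusive posts are assumptions.

```python
# Non-cross-post counts per author, copied from the table in this example.
EXCLUSIVE_POSTS = {
    "Author 1": {"alt.politics": 6, "us.politics": 4, "pa.politics": 0},
    "Author 2": {"alt.politics": 0, "us.politics": 1, "pa.politics": 3},
    "Author 3": {"alt.politics": 0, "us.politics": 1, "pa.politics": 2},
}

def ownership(author, groups, counts=EXCLUSIVE_POSTS):
    """Normalize the author's exclusive-post counts over the post's groups."""
    raw = [counts[author].get(g, 0) for g in groups]
    total = sum(raw)
    if total == 0:
        # Assumption (not from the slides): split evenly when the author
        # has no exclusive posts in any of these groups.
        return {g: 1 / len(groups) for g in groups}
    return {g: r / total for g, r in zip(groups, raw)}

two_groups = ["alt.politics", "us.politics"]
all_groups = ["alt.politics", "us.politics", "pa.politics"]
post_1 = ownership("Author 1", two_groups)
post_2 = ownership("Author 3", two_groups)
post_3 = ownership("Author 2", all_groups)
```

Author 1 on two groups gives {0.6, 0.4}, and Author 2 on all three gives {0, 0.25, 0.75}. The {0, 1} vector is consistent with either Author 2 or Author 3 posting to the two-group set, so attributing it to Author 3 here is a guess.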
Real thread
• Subject: “Kiss the National Parks Good-Bye”
• Initially cross-posted to several groups (including talk.politics.misc), 38 groups in total
• Ownership concentrated in seattle.politics and or.politics
Applications of thread ownership
• Ownership model aids in analyzing threads
▫ Influence between groups: How are threads discovered and posted to new groups?
▫ Similarity of groups: How can ownership help us more precisely state when two groups are similar?
Information flow between groups
• How are threads discovered and posted to new groups?
• Idea: Extend ownership to influence
• How often does an author in group 1 respond to a post they found in group 2?
▫ Author finds parent post p_p by browsing group g_p
▫ Author writes child post p_c to group g_c
▫ Then, we say g_p influences g_c
Influence(g_p, g_c) = Devotion(a, g_p) * Devotion(a, g_c)
• This helps pinpoint when an author decides to cross-post late in the thread
[Figure: a post to {alt.politics, us.politics} followed by a reply to {alt.politics, us.politics, pa.politics}]
Example: Ownership-based influence
• Author 2 sees parent post
• Replies, adding pa.politics.
• Since Author 2 is not devoted to alt.politics, he was most likely influenced by us.politics
• Influence(us.politics, pa.politics) = 1 * 0.75 = 0.75
[Figure: parent post to {alt.politics, us.politics}; reply to {alt.politics, us.politics, pa.politics}]

            alt.politics   us.politics   pa.politics
Author 1    6              4             0
Author 2    0              1             3
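The arithmetic in this example suggests that the first devotion term is normalized over the parent post's groups and the second over the reply's groups. The sketch below follows that reading, which is an inference from the numbers 1 * 0.75 rather than something the slides state. Scoring only the groups newly added by the reply is also an assumption, as are the function names.

```python
def normalized_devotion(counts, groups):
    """Author's exclusive-post counts normalized over the given groups."""
    total = sum(counts.get(g, 0) for g in groups)
    return {g: (counts.get(g, 0) / total if total else 0.0) for g in groups}

def influence(counts, parent_groups, child_groups):
    """Influence(g_p, g_c) for every group g_c newly added by the reply."""
    dev_parent = normalized_devotion(counts, parent_groups)
    dev_child = normalized_devotion(counts, child_groups)
    added = [g for g in child_groups if g not in parent_groups]
    return {
        (gp, gc): dev_parent[gp] * dev_child[gc]
        for gp in parent_groups
        for gc in added
    }

# Author 2's non-cross-post counts, copied from the example's table.
author_2 = {"alt.politics": 0, "us.politics": 1, "pa.politics": 3}
scores = influence(
    author_2,
    parent_groups=["alt.politics", "us.politics"],
    child_groups=["alt.politics", "us.politics", "pa.politics"],
)
```

This gives 0.75 for (us.politics, pa.politics) and 0 for (alt.politics, pa.politics), in line with the conclusion that Author 2 was most likely influenced by us.politics.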
Who influences whom?
• Information often diffuses from major to minor groups
Ownership-based Similarity
• Q: How can ownership help us more precisely state when two groups are similar?
• A: Use “shared ownership” instead of shared posts
[Figure: clusters labeled Western states and Eastern states]
Applications and future work
• Potential Applications
▫ Link prediction
▫ Information retrieval and relevance
▫ Ownership for email lists
• Future Work
▫ Using comparative measures to predict whether group will continue
Related work: Discussion Groups
• Backstrom, L.; Kumar, R.; Marlow, C.; Novak, J.; and Tomkins, A. 2008. Preferential behavior in online groups. WSDM ’08
• Gomez, V.; Kaltenbrunner, A.; and Lopez, V. 2008. Statistical analysis of the social network and discussion threads in Slashdot. WWW ’08
• Mishne, G., and Glance, N. 2006. Leave a reply: An analysis of weblog comments. WWE ’06
• Turner, T. C.; Smith, M. A.; Fisher, D.; and Welser, H. T. 2005. Picturing Usenet: Mapping computer-mediated collective action. Journal of Computer-Mediated Communication 10(4).
• Viegas, F. B., and Smith, M. 2004. Newsgroup crowds and authorlines: visualizing the activity of individuals in conversational cyberspaces. HICSS 2004
Related work: Information Diffusion
• Kossinets, G.; Kleinberg, J.; and Watts, D. 2008. The structure of information pathways in a social communication network. KDD ’08
• Leskovec, J.; Kleinberg, J.; and Faloutsos, C. 2005. Graphs over time: densification laws, shrinking diameters and possible explanations. KDD ’05
• Liben-Nowell, D., and Kleinberg, J. 2008. Tracing the flow of information on a global scale using Internet chain-letter data. PNAS 105(12):4633–4638.
Conclusions
• Case study of nearly 200 newsgroups, including 19 million unique posts
• Demonstrated “densification” law as it applies to different groups
• Compared groups in terms of reciprocity and shared posts/authors
• Proposed thread ownership model to cut down on “noise” from cross-posts
• Applied ownership to diffusion between groups and to group similarity
Contact info
• Mary McGlohon
• www.cs.cmu.edu/~mmcgloho
• Matthew Hurst
• datamining.typepad.com
• Special thanks to Christos Faloutsos, Michael Gamon, Kathy Gill, Christian Konig, Alexei Maykov, Purna Sarkar, Hassan Sayyadi, Marc Smith