Statistical Models of (Social) Networks
Andrew McCallum
Computer Science Department
University of Massachusetts Amherst
Joint work with
Xuerui Wang, Natasha Mohanty, Andres Corrada
Workplace effectiveness ~ Ability to leverage network of acquaintances
But filling Contacts DB by hand is tedious, and incomplete.
[Figure: Email Inbox + WWW → Contacts DB, filled automatically]
Managing and Understanding Connections of People in our Email World
System Overview
[Figure: system pipeline with components Email, Person Name Extraction, Name Coreference, Homepage Retrieval, Contact Info and Person Name Extraction, Keyword Extraction, Social Network Analysis; other labels: CRF, WWW, names]
An Example
To: “Andrew McCallum” [email protected]
Subject ...
First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
JobTitle: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governor’s Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413) 545-1323
Links: Fernando Pereira, Sam Roweis,…
Key Words: Information extraction, social network,…
Search for new people
Summary of Results
Contact info and name extraction performance (25 fields)
       Token Acc   Field Prec   Field Recall   Field F1
CRF    94.50       85.73        76.33          80.76
Example keywords extracted
Person               Keywords
William Cohen        Logic programming, Text categorization, Data integration, Rule learning
Daphne Koller        Bayesian networks, Relational models, Probabilistic models, Hidden variables
Deborah McGuinness   Semantic web, Description logics, Knowledge representation, Ontologies
Tom Mitchell         Machine learning, Cognitive states, Learning apprentice, Artificial intelligence
1. Expert Finding: When solving some task, find friends-of-friends with relevant expertise. Avoid “stove-piping” in large org’s by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job. (Hiring aid!)
2. Social Network Analysis: Understand the social structure of your organization. Suggest structural changes for improved efficiency.
Outline
• Social Network Analysis with (Language) Attributes
– Roles and Topics (Author-Recipient-Topic Model)
– Groups and Topics (Group-Topic Model)
• Demo: Rexa, a Web portal for researchers
Clustering words into topics with Latent Dirichlet Allocation
[Blei, Ng, Jordan 2003]
Generative Process:
For each document:
  Sample a distribution over topics
  For each word in doc:
    Sample a topic, z
    Sample a word from the topic, w
Example:
  Distribution over topics: 70% Iraq war, 30% US election
  Topic: Iraq war
  Word: “bombing”
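The generative process on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the two toy topics, their word probabilities, and the helper names (`sample_dirichlet`, `generate_document`) are assumptions made up for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Draw an index according to a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(topics, alpha, n_words, rng):
    """topics: one {word: probability} dict per topic (toy values, assumed)."""
    theta = sample_dirichlet(alpha, rng)      # per-document distribution over topics
    doc = []
    for _ in range(n_words):
        z = sample_index(theta, rng)          # sample a topic for this word
        vocab = list(topics[z])
        w = rng.choices(vocab, weights=[topics[z][v] for v in vocab], k=1)[0]
        doc.append((z, w))                    # keep the topic next to the word
    return theta, doc

# Hypothetical toy topics standing in for "Iraq war" and "US election".
TOPICS = [
    {"bombing": 0.5, "troops": 0.3, "baghdad": 0.2},
    {"ballot": 0.4, "campaign": 0.4, "senate": 0.2},
]

rng = random.Random(0)
theta, doc = generate_document(TOPICS, alpha=[1.0, 1.0], n_words=8, rng=rng)
print(theta, doc)
```

Learning runs this story in reverse: given only the words, infer the per-document topic mixtures and the per-topic word distributions.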
Example topics induced from a large collection of text
Topic 1: STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
Topic 2: MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
Topic 3: WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
Topic 4: DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
Topic 5: FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
Topic 6: SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
Topic 7: BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
Topic 8: JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
[Tenenbaum et al]
From LDA to Author-Recipient-Topic (ART)
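The figure for this slide did not survive extraction. As a rough sketch of the step from LDA to ART, the code below assumes the usual ART formulation: each word in a message picks one recipient uniformly from the message's recipients, and the topic mixture is indexed by the (author, recipient) pair rather than by the document. The people, topics, and probabilities are hypothetical toy values.

```python
import random

def generate_message(author, recipients, theta, phi, n_words, rng):
    """theta[(author, recipient)]: distribution over topics for that pair.
    phi[z]: {word: probability} for topic z. Both are toy inputs (assumed)."""
    words = []
    for _ in range(n_words):
        x = rng.choice(recipients)            # recipient chosen uniformly for this word
        topic_probs = theta[(author, x)]      # mixture specific to the author-recipient pair
        z = rng.choices(range(len(topic_probs)), weights=topic_probs, k=1)[0]
        vocab = list(phi[z])
        w = rng.choices(vocab, weights=[phi[z][v] for v in vocab], k=1)[0]
        words.append((x, z, w))
    return words

# Hypothetical people and topics, for illustration only.
THETA = {
    ("alice", "bob"): [0.9, 0.1],      # alice writes to bob mostly about topic 0
    ("alice", "carol"): [0.2, 0.8],    # and to carol mostly about topic 1
}
PHI = [
    {"contract": 0.6, "draft": 0.4},
    {"dinner": 0.5, "weekend": 0.5},
]

rng = random.Random(1)
message = generate_message("alice", ["bob", "carol"], THETA, PHI, n_words=6, rng=rng)
print(message)
```

Because the mixture depends on who is being addressed, the same author can show different topical behavior toward different recipients, which is what makes role discovery possible.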
Inference and Estimation
Gibbs Sampling:
- Easy to implement
- Reasonably fast
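To make "easy to implement" concrete, here is a minimal collapsed Gibbs sampler for plain LDA. It is a sketch under assumptions, not the ART sampler from the talk: ART would additionally resample a recipient for each word. The hyperparameter values and the toy corpus of integer word ids are made up.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=50, seed=0):
    """Collapsed Gibbs sampling for plain LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                 # topic counts per document
    topic_word = [[0] * vocab_size for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics                               # token counts per topic
    assign = []

    def update(d, w, z, delta):
        doc_topic[d][z] += delta
        topic_word[z][w] += delta
        topic_total[z] += delta

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in doc]
        for w, z in zip(doc, zs):
            update(d, w, z, +1)
        assign.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assign[d][i], -1)   # remove the token's current assignment
                # Conditional for each topic: (doc-topic count) x (topic-word probability).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights, k=1)[0]
                assign[d][i] = z
                update(d, w, z, +1)
    return assign, doc_topic, topic_word

# Toy corpus (assumed): word ids 0-2 and 3-5 tend to occur in different documents.
DOCS = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 2, 0], [3, 5, 4, 4]]
assign, doc_topic, topic_word = gibbs_lda(DOCS, n_topics=2, vocab_size=6)
print(doc_topic)
```

The whole sampler is count bookkeeping plus one weighted draw per token, which is why it is both simple and reasonably fast.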
Enron Email Corpus
• 250k email messages
• 23k people
Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: [email protected]
To: [email protected]
Subject: Enron/TransAlta Contract dated Jan 1, 2001
Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.
DP
Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas
[email protected]
Topics, and prominent senders / receivers discovered by ART
(Topic names assigned by hand)
Topics, and prominent senders / receivers discovered by ART
Beck = “Chief Operations Officer”
Dasovich = “Government Relations Executive”
Shapiro = “Vice President of Regulatory Affairs”
Steffes = “Vice President of Government Affairs”
Comparing Role Discovery
connection strength (A,B) =
  Traditional SNA: distribution over recipients
  Author-Topic: distribution over authored topics
  ART: distribution over authored topics
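One way to turn these per-person distributions into a connection strength is a divergence between the two people's distributions: a small divergence suggests similar roles. The sketch below uses the Jensen-Shannon divergence. The choice of measure and the toy distributions are assumptions for illustration; the slides do not say which measure was used.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and 0 when p equals q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical distributions over three topics for three people.
person_a = [0.7, 0.2, 0.1]
person_b = [0.6, 0.3, 0.1]
person_c = [0.1, 0.1, 0.8]

close = js_divergence(person_a, person_b)   # similar topic usage
far = js_divergence(person_a, person_c)     # different topic usage
print(close, far)
```

The same function applies whether the distributions are over recipients (the SNA view) or over topics (the Author-Topic and ART views); only the input changes.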
Comparing Role Discovery: Tracy Geaconne and Dan McCarty
  Traditional SNA: Similar roles
  Author-Topic: Different roles
  ART: Different roles
Geaconne = “Secretary”
McCarty = “Vice President”
Comparing Role Discovery: Tracy Geaconne and Rod Hayslett
  Traditional SNA: Different roles
  Author-Topic: Very similar
  ART: Not very similar
Geaconne = “Secretary”
Hayslett = “Vice President & CTO”
Comparing Role Discovery: Lynn Blair and Kimberly Watson
  Traditional SNA: Different roles
  Author-Topic: Very different
  ART: Very similar
Blair = “Gas pipeline logistics”
Watson = “Pipeline facilities planning”
McCallum Email Corpus 2004
• January - October 2004
• 23k email messages
• 825 people
From: [email protected]
Subject: NIPS and ....
Date: June 14, 2004 2:27:41 PM EDT
To: [email protected]
There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for:
NIPS registration receipt.
CALO registration receipt.
Thanks,
Kate
McCallum Email Blockstructure
Four most prominent topics in discussions with ____?
Two most prominent topics in discussions with ____?
Words        Prob
love         0.030514
house        0.015402
[illegible]  0.013659
time         0.012351
great        0.011334
hope         0.011043
dinner       0.00959
saturday     0.009154
left         0.009154
ll           0.009009
[illegible]  0.008282
visit        0.008137
evening      0.008137
stay         0.007847
bring        0.007701
weekend      0.007411
road         0.00712
sunday       0.006829
kids         0.006539
flight       0.006539

Words        Prob
today        0.051152
tomorrow     0.045393
time         0.041289
ll           0.039145
meeting      0.033877
week         0.025484
talk         0.024626
meet         0.023279
morning      0.022789
monday       0.020767
back         0.019358
call         0.016418
free         0.015621
home         0.013967
won          0.013783
day          0.01311
hope         0.012987
leave        0.012987
office       0.012742
tuesday      0.012558
Pairs with highest rank difference between ART & SNA
5 other professors
3 other ML researchers
Role-Author-Recipient-Topic Models
Results with RART: People in “Role #3” in Academic Email
• olc: lead Linux sysadmin
• gauthier: sysadmin for CIIR group
• irsystem: mailing list for CIIR sysadmins
• system: mailing list for dept. sysadmins
• allan: Prof., chair of “computing committee”
• valerie: second Linux sysadmin
• tech: mailing list for dept. hardware
• steve: head of dept. I.T. support
Roles for allan (James Allan)
• Role #3: I.T. support
• Role #2: Natural Language researcher
Roles for pereira (Fernando Pereira)
• Role #2: Natural Language researcher
• Role #4: SRI CALO project participant
• Role #6: Grant proposal writer
• Role #10: Grant proposal coordinator
• Role #8: Guests at McCallum’s house
ART: Roles but not Groups
Enron TransWestern Division
  Traditional SNA: Block structured
  Author-Topic: Not
  ART: Not
Outline
• Social Network Analysis with (Language) Attributes
– Roles and Topics (Author-Recipient-Topic Model)
– Groups and Topics (Group-Topic Model)
• Demo: Rexa, a Web portal for researchers
Groups and Topics
• Input:
– Observed relations between people
– Attributes on those relations (text, or categorical)
• Output:
– Attributes clustered into “topics”
– Groups of people, varying depending on topic
Discovering Groups from Observed Set of Relations
Admiration relations among six high school students.
Student Roster
Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration
Acad(A, B)  Acad(C, B)
Acad(A, D)  Acad(C, D)
Acad(B, E)  Acad(D, E)
Acad(B, F)  Acad(D, F)
Acad(E, A)  Acad(F, A)
Acad(E, C)  Acad(F, C)
Adjacency Matrix Representing Relations
[Figure: 6×6 adjacency matrices over students A to F. In roster order (A B C D E F) the group labels are G1 G2 G1 G2 G3 G3; reordered by group (A C B D E F) the labels are G1 G1 G2 G2 G3 G3.]
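The block structure on this slide can be reproduced directly from the listed relations. The sketch below builds the adjacency matrix from the Academic Admiration pairs and reorders rows and columns by the group labels given on the slide; the helper name `adjacency` is made up for the example.

```python
STUDENTS = ["A", "B", "C", "D", "E", "F"]

# Academic admiration relations as listed on the slide: Acad(X, Y) is the pair (X, Y).
ACAD = [("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"),
        ("B", "E"), ("D", "E"), ("B", "F"), ("D", "F"),
        ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C")]

# Group labels from the slide, in roster order: G1 G2 G1 G2 G3 G3.
GROUPS = {"A": "G1", "B": "G2", "C": "G1", "D": "G2", "E": "G3", "F": "G3"}

def adjacency(order, relations):
    """Return a 0/1 matrix whose rows and columns follow the given ordering."""
    index = {s: i for i, s in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for src, dst in relations:
        matrix[index[src]][index[dst]] = 1
    return matrix

by_group = sorted(STUDENTS, key=lambda s: (GROUPS[s], s))   # A C B D E F
matrix = adjacency(by_group, ACAD)

print("  " + " ".join(by_group))
for name, row in zip(by_group, matrix):
    print(name, " ".join(str(v) for v in row))
```

In roster order the ones look scattered; after sorting by group they form solid blocks (G1 admires G2, G2 admires G3, G3 admires G1), which is the structure a group model tries to recover without being told the labels.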
Group Model: Partitioning Entities into Groups
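Stochastic Blockstructures for Relations [Nowicki, Snijders 2001]
S: number of entities
G: number of groups
[Graphical model: group assignment g (plate S; Multinomial, with Dirichlet prior α); group-pair parameter γ (plate G²; Beta prior β); relation v (plate S²; Binomial)]
Enhanced with arbitrary number of groups in [Kemp, Griffiths, Tenenbaum 2004]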
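Read as a generative story, the blockstructure model draws a group for each entity, a relation probability for each pair of groups, and then each relation from the probability of its group pair. The sketch below follows that reading with symmetric priors and a single trial per relation (a Bernoulli draw, the one-trial case of the Binomial). The function name and parameter values are assumptions for illustration.

```python
import random

def sample_blockmodel(n_entities, n_groups, alpha, beta, rng):
    # Group proportions ~ Dirichlet(alpha, ..., alpha), via normalized Gamma draws.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_groups)]
    total = sum(draws)
    proportions = [d / total for d in draws]

    # g: one group per entity, drawn from the multinomial over groups.
    groups = rng.choices(range(n_groups), weights=proportions, k=n_entities)

    # gamma: one relation probability per ordered pair of groups, from Beta(beta, beta).
    gamma = [[rng.betavariate(beta, beta) for _ in range(n_groups)]
             for _ in range(n_groups)]

    # v: each relation is present with the probability of its group pair.
    # Self-relations are sampled too, purely to keep the sketch short.
    relations = [[1 if rng.random() < gamma[groups[i]][groups[j]] else 0
                  for j in range(n_entities)]
                 for i in range(n_entities)]
    return groups, gamma, relations

rng = random.Random(2)
groups, gamma, relations = sample_blockmodel(
    n_entities=6, n_groups=3, alpha=1.0, beta=0.5, rng=rng)
print(groups)
for row in relations:
    print(row)
```

Inference inverts this: given only the observed relations, it recovers likely group assignments and group-pair probabilities.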
Two Relations with Different Attributes
[Figure: two adjacency matrices. Academic admiration, ordered A C B D E F, with groups G1 G1 G2 G2 G3 G3. Social admiration, ordered A C E B D F, with groups G1 G1 G1 G2 G2 G2.]
Student Roster
Adams, Bennett, Carter, Davis, Edwards, Frederking
Academic Admiration
Acad(A, B)  Acad(C, B)
Acad(A, D)  Acad(C, D)
Acad(B, E)  Acad(D, E)
Acad(B, F)  Acad(D, F)
Acad(E, A)  Acad(F, A)
Acad(E, C)  Acad(F, C)
Social Admiration
Soci(A, B)  Soci(A, D)  Soci(A, F)
Soci(B, A)  Soci(B, C)  Soci(B, E)
Soci(C, B)  Soci(C, D)  Soci(C, F)
Soci(D, A)  Soci(D, C)  Soci(D, E)
Soci(E, B)  Soci(E, D)  Soci(E, F)
Soci(F, A)  Soci(F, C)  Soci(F, E)
Goal: Model relations and their (textual) attributes simultaneously to obtain better groups and more meaningful topics.
budget, funding, annual, cash
document, corrections, review, annual
The Group-Topic Model: Discovering Groups and Topics Simultaneously
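[Graphical model. Topic side: topic-word distribution φ (plate T; Dirichlet prior η); topic t per event (Uniform); words w (plate N_b; Multinomial); events in plate B. Group side: group assignment g (plate S, inside plate T; Multinomial, with Dirichlet prior α); group-pair parameter γ (plate G²; Beta prior β); relation v (plate S²; Binomial).]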
Inference and Estimation
Gibbs Sampling:
- Many r.v.s can be integrated out
- Easy to implement
- Reasonably fast
We assume the relationship is symmetric.
Dataset #1: U.S. Senate
• 16 years of voting records in the US Senate (1989 – 2005)
• a Senator may respond Yea or Nay to a resolution
• 3423 resolutions with text attributes (index terms)
• 191 Senators in total across 16 years
S.543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991)
Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No: 102-242.
Index terms: Banks and banking, Accounting, Administrative fees, Cost control, Credit, Deposit insurance, Depressed areas, and 110 other terms
Votes:
Adams (D-WA), Nay
Akaka (D-HI), Yea
Bentsen (D-TX), Yea
Biden (D-DE), Yea
Bond (R-MO), Yea
Bradley (D-NJ), Nay
Conrad (D-ND), Nay
……
Topics Discovered (U.S. Senate)

Mixture of Unigrams
Education     Energy      Military Misc.   Economic
education     energy      government       federal
school        power       military         labor
aid           water       foreign          insurance
children      nuclear     tax              aid
drug          gas         congress         tax
students      petrol      aid              business
elementary    research    law              employee
prevention    pollution   policy           care

Group-Topic Model
Education + Domestic   Foreign        Economic    Social Security + Medicare
education              foreign        labor       social
school                 trade          insurance   security
federal                chemicals      tax         insurance
aid                    tariff         congress    medical
government             congress       income      care
tax                    drugs          minimum     medicare
energy                 communicable   wage        disability
research               diseases       business    assistance
Groups Discovered (US Senate)
Groups from topic Education + Domestic
Senators Who Change Coalition the Most, Depending on Topic
e.g. Senator Shelby (D-AL) votes
  with the Republicans on Economic,
  with the Democrats on Education + Domestic,
  with a small group of maverick Republicans on Social Security + Medicare.
Dataset #2: The UN General Assembly
• Voting records of the UN General Assembly (1990 - 2003)
• A country may choose to vote Yes, No or Abstain
• 931 resolutions with text attributes (titles)
• 192 countries in total
• Also experiments later with resolutions from 1960-2003
Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting
The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions:
In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and 126 other countries.
Against: Israel, Marshall Islands, United States.
Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.
Topics Discovered (UN)

Mixture of Unigrams
Everything Nuclear   Human Rights   Security in Middle East
nuclear              rights         occupied
weapons              human          israel
use                  palestine      syria
implementation       situation      security
countries            israel         calls

Group-Topic Model
Nuclear Non-proliferation   Nuclear Arms Race   Human Rights
nuclear                     nuclear             rights
states                      arms                human
united                      prevention          palestine
weapons                     race                occupied
nations                     space               israel
Groups Discovered (UN)
The countries listed for each group are ordered by their 2005 GDP (PPP), and only 5 countries are shown for groups that have more than 5 members.
Do We Get Better Groups with the GT Model?
Baseline Model:
1. Cluster bills into topics using mixture of unigrams;
2. Apply group model on topic-specific subsets of bills.
GT Model:
1. Jointly cluster topics and groups at the same time using the GT model.
Agreement Index (AI) measures group cohesion. Higher is better.

Datasets   Avg. AI for Baseline   Avg. AI for GT   p-value
Senate     0.8198                 0.8294           <.01
UN         0.8548                 0.8664           <.01
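The slides do not define the Agreement Index. One common definition from roll-call analysis, assumed here, scores a group's vote on a single resolution as 1 when every member votes the same way and 0 when the votes split evenly across Yes, No and Abstain. The function name is made up, and the example reuses the tally of the UN vote quoted earlier in the talk (145 in favour, 3 against, 6 abstentions), applied to the whole assembly rather than to one discovered group.

```python
def agreement_index(yes, no, abstain=0):
    """Cohesion of one group on one vote, under the assumed definition:
    (largest bloc - half of everyone else) / total votes cast."""
    total = yes + no + abstain
    if total == 0:
        raise ValueError("no votes cast")
    largest = max(yes, no, abstain)
    return (largest - 0.5 * (total - largest)) / total

unanimous = agreement_index(10, 0, 0)      # everyone agrees
even_split = agreement_index(4, 4, 4)      # three equal blocs
un_example = agreement_index(145, 3, 6)    # tally of the UN vote quoted earlier

print(unanimous, even_split, round(un_example, 4))
```

For the Senate data, where only Yea and Nay occur, the abstain count is simply left at zero. Averaging this score over groups and resolutions gives a single cohesion number per model, which is the kind of quantity compared in the table.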
Groups and Topics, Trends over Time (UN)
Outline
• Social Network Analysis with (Language) Attributes
– Roles and Topics (Author-Recipient-Topic Model)
– Groups and Topics (Group-Topic Model)
• Demo: Rexa, a Web portal for researchers
Previous Systems
[Figure: entity Research Paper, relation Cites]
More Entities and Relations
[Figure: Previous Systems cover Research Paper and Cites; added entities and relations: Person, University, Venue, Grant, Groups, Expertise]
Outline
• Examples of IE and Data Mining.
• Brief introduction of Conditional Random Fields
• Joint inference: Motivation and examples
– Joint Labeling of Cascaded Sequences (Belief Propagation)
– Joint Labeling for Transfer Learning (Piecewise Training & BP)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution (Graph Partitioning)
– Joint Segmentation and Co-ref (Sparse BP)
• Joint Topic Discovery and Social Network Analysis
– Roles and Topics (Author-Recipient-Topic Model)
– Groups and Topics (Group-Topic Model)
• Demo: Rexa, a Web portal for researchers
End of Talk
Summary
• Traditionally, SNA examines links, but not the language content on those links.
• Presented ART, a Bayesian network for messages sent in a social network: captures topics and role-similarity.
• RART explicitly represents roles.
• Additional work
– Group-Topic model discovers groups and clusters attributes of relations. [Wang, Mohanty, McCallum, LinkKDD 2005]