![Page 1: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/1.jpg)
Extracting hidden information from knowledge networks
Sergei Maslov
Brookhaven National Laboratory, New York, USA
![Page 2: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/2.jpg)
Hanse Institute for Advanced Study, March 2002
Outline of the talk
What is a knowledge network and how is it different from an ordinary graph or network?
Knowledge networks on the internet: matching products to customers
Knowledge networks in biology: large ensembles of interacting biomolecules
Empirical study of correlations in the network of interacting proteins
Collaborators: Y.-C. Zhang and K. Sneppen
![Page 3: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/3.jpg)
Networks in complex systems
• A network is the backbone of a complex system
• It answers the question: who interacts with whom?
• Examples:
– Internet and WWW
– Interacting biomolecules (metabolic, physical, regulatory)
– Food webs in ecosystems
– Economics: customers and products; Social: people and their choice of partners
![Page 4: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/4.jpg)
Predicting tastes of customers based on their opinions on products
• Each of us has personal tastes
• These tastes are sometimes unknown even to ourselves (hidden wants)
• Information is contained in our opinions on products
• Matchmaking: customers with similar tastes can be used to predict future opinions
• The Internet makes it possible to do this on a large scale
![Page 5: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/5.jpg)
Types of networks
[Figure: a plain network linking readers to books, next to a knowledge (opinion) network in which vertices carry the readers' tastes and the books' features and each edge carries an opinion.]
![Page 6: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/6.jpg)
Storing opinions
X X X 2 9 ? ?
X X X ? 8 ? 8
X X X ? ? 1 ?
2 ? ? X X X X
9 8 ? X X X X
? ? 1 X X X X
? 8 ? X X X X
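Matrix of opinions $\Omega_{IJ}$ (X: pair not allowed in the bipartite graph, ?: unknown opinion)
[Figure: network of opinions, the same data drawn as weighted edges between readers and books.]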
![Page 7: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/7.jpg)
Using correlations to reconstruct customers' tastes
• Similar opinions ⇒ similar tastes
• Simplest model:
– Readers: M-dimensional vector of tastes $\mathbf{r}_I$
– Books: M-dimensional vector of features $\mathbf{b}_J$
– Opinions: scalar product $\Omega_{IJ} = \mathbf{r}_I \cdot \mathbf{b}_J$
[Figure: network of opinions between customers and books.]
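As a minimal sketch of the simplest model (plain Python; the Gaussian draws, the sizes and the names `opinion`, `readers`, `books` are assumptions made for illustration, not part of the talk), each opinion is the scalar product of a reader's taste vector and a book's feature vector:

```python
import random

def opinion(tastes, features):
    """Opinion of one reader on one book: scalar product of the two M-vectors."""
    return sum(t * f for t, f in zip(tastes, features))

rng = random.Random(1)
M = 3  # number of hidden taste/feature dimensions (illustrative value)

# Hidden vectors: 3 readers and 4 books, components drawn from a unit Gaussian.
readers = [[rng.gauss(0, 1) for _ in range(M)] for _ in range(3)]
books = [[rng.gauss(0, 1) for _ in range(M)] for _ in range(4)]

# Full reader-by-book table of opinions; in practice only a few entries are known.
omega = [[opinion(r, b) for b in books] for r in readers]
```

The matchmaker only ever sees a sparse subset of `omega`; the vectors themselves stay hidden.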
![Page 8: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/8.jpg)
Loop correlation
[Figure: a loop in the network of customers and books made of L known opinions and one unknown opinion.]
• Predictive power $\sim 1/M^{(L-1)/2}$
• One needs many loops to completely freeze the mutual orientation of the vectors
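A quick Monte Carlo check of how the predictive power of a single loop falls with M can be sketched as follows. It assumes independent unit-variance Gaussian components on every vertex, and the function name and parameters are hypothetical. The function returns the average product of the sign predicted from the L known opinions and the sign of the unknown one.

```python
import random

def loop_sign_agreement(M, L, trials=20000, seed=0):
    """Average of sign(product of L known opinions) * sign(unknown opinion)
    over random loops with L + 1 vertices carrying M-dimensional vectors."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        vecs = [[rng.gauss(0, 1) for _ in range(M)] for _ in range(L + 1)]

        def omega(i, j):
            return sum(a * b for a, b in zip(vecs[i], vecs[j]))

        predicted = 1
        for i in range(L):  # the L known opinions along the loop
            predicted *= 1 if omega(i, i + 1) > 0 else -1
        actual = 1 if omega(L, 0) > 0 else -1  # the unknown opinion closing the loop
        total += predicted * actual
    return total / trials

one_dim = loop_sign_agreement(M=1, L=3)
four_dim = loop_sign_agreement(M=4, L=3)
```

For M = 1 every opinion is a product of two numbers, so the signs around a closed loop always multiply to +1 and the prediction is perfect. For larger M the agreement is only partial, which is why many loops are needed.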
![Page 9: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/9.jpg)
Field Theory Approach
• If all components of vectors are Gaussian and uncorrelated:
• Generating functional is: $\det(1 + i\,\ldots)^{-M/2}$
• All irreducible correlations are proportional to $M$
• All loop correlations $\langle \Omega_{12}\,\Omega_{23}\,\Omega_{34} \cdots \Omega_{L1} \rangle = M$
• Since each $\Omega_{IJ} \sim \sqrt{M}$, the sign correlation scales as $M^{-(L-1)/2}$
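The loop result can be sketched in a few lines. This assumes unit-variance components (a normalization the slide does not state) and writes the vector on vertex $k$ as $\mathbf{v}_k$ with components $v_k^{\alpha}$, so that $\Omega_{k,k+1} = \sum_{\alpha} v_k^{\alpha} v_{k+1}^{\alpha}$. Because vectors on different vertices are independent, the average factorizes vertex by vertex:

```latex
\begin{aligned}
\langle \Omega_{12}\,\Omega_{23}\cdots\Omega_{L1} \rangle
  &= \sum_{\alpha_1,\dots,\alpha_L}
     \langle v_1^{\alpha_L} v_1^{\alpha_1} \rangle\,
     \langle v_2^{\alpha_1} v_2^{\alpha_2} \rangle \cdots
     \langle v_L^{\alpha_{L-1}} v_L^{\alpha_L} \rangle \\
  &= \sum_{\alpha_1,\dots,\alpha_L}
     \delta_{\alpha_L \alpha_1}\,\delta_{\alpha_1 \alpha_2} \cdots
     \delta_{\alpha_{L-1} \alpha_L}
   = M, \\
\langle \Omega_{IJ}^2 \rangle
  &= \sum_{\alpha,\beta}
     \langle v_I^{\alpha} v_I^{\beta} \rangle\,
     \langle v_J^{\alpha} v_J^{\beta} \rangle
   = M .
\end{aligned}
```

Each factor therefore has typical size $\sqrt{M}$, and a loop of $n$ edges has a normalized correlation of order $M^{1-n/2}$. With $n = L + 1$ (the $L$ known opinions plus the unknown one that closes the loop) this gives $M^{-(L-1)/2}$.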
![Page 10: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/10.jpg)
Main parameter: density of edges
• The larger the density of edges $p$, the easier the prediction
• At $p_1 \approx 1/N$ ($N = N_{readers} + N_{books}$) macroscopic prediction becomes possible. Nodes are connected but the vectors $\mathbf{r}_I$, $\mathbf{b}_J$ are not fixed: ordinary percolation threshold
• At $p_2 \approx 2M/N > p_1$ all tastes and features ($\mathbf{r}_I$ and $\mathbf{b}_J$) can be uniquely reconstructed: rigidity percolation threshold
![Page 11: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/11.jpg)
Spectral properties of $\Omega$
• For $M < N$ the matrix $\Omega_{IJ}$ has $N-M$ zero eigenvalues and $M$ positive ones: $\Omega = R R^{+}$.
• Using SVD one can “diagonalize” $R = U D V^{+}$ such that the matrices $V$ and $U$ are orthogonal, $V^{+} V = 1$, $U U^{+} = 1$, and $D$ is diagonal. Then $\Omega = U D^{2} U^{+}$
• The amount of information contained in $\Omega$: $NM - M(M-1)/2 \ll N(N-1)/2$, the number of off-diagonal elements
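These spectral statements are easy to check numerically. The sketch below uses NumPy with arbitrary illustrative sizes and a random Gaussian matrix R of hidden vectors (all of these are assumptions). It uses the thin SVD, in which U has M orthonormal columns rather than being a square matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 3                      # illustrative sizes with M < N
R = rng.normal(size=(N, M))      # row I is the hidden M-vector of vertex I
omega = R @ R.T                  # matrix of all scalar products

# Spectrum: M positive eigenvalues, the remaining N - M vanish up to rounding.
eigenvalues = np.linalg.eigvalsh(omega)
n_positive = int(np.sum(eigenvalues > 1e-9))

# Thin SVD of R reproduces omega as U D^2 U^T.
U, d, Vt = np.linalg.svd(R, full_matrices=False)
omega_from_svd = (U * d**2) @ U.T

# Information count from the slide versus the number of off-diagonal elements.
free_parameters = N * M - M * (M - 1) // 2
off_diagonal = N * (N - 1) // 2
```

The gap between `free_parameters` and `off_diagonal` grows quickly with N at fixed M, which is what makes prediction from a sparse set of known opinions possible.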
![Page 12: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/12.jpg)
Practical recursive algorithm for predicting unknown opinions
1. Start with $\Omega_0$, in which all unknown elements are filled with $\langle\Omega\rangle$ (zero in our case)
2. Diagonalize it and keep only the M largest eigenvalues and their eigenvectors
3. In the resulting truncated matrix $\Omega'_0$, replace all known elements with their exact values and repeat from step 2 with the updated matrix
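The three steps can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the speaker's implementation: a symmetric N×N matrix without bipartite structure, noiseless opinions, a fixed number of iterations instead of a convergence test, and hypothetical names (`reconstruct`, `mask`).

```python
import numpy as np

def reconstruct(observed, mask, M, n_iter=300):
    """Iteratively fill in the unknown elements of a rank-M symmetric matrix.

    observed: N x N array, only trusted where mask is True
    mask:     symmetric boolean array, True for known opinions
    """
    omega = np.where(mask, observed, 0.0)              # step 1: unknowns set to zero
    for _ in range(n_iter):
        vals, vecs = np.linalg.eigh(omega)             # eigenvalues in ascending order
        top_vals, top_vecs = vals[-M:], vecs[:, -M:]   # step 2: keep the M largest
        truncated = (top_vecs * top_vals) @ top_vecs.T
        omega = np.where(mask, observed, truncated)    # step 3: restore known values
    return omega

# Synthetic test case: hidden vectors, then hide about 40% of the off-diagonal pairs.
rng = np.random.default_rng(0)
N, M, p = 60, 2, 0.6
R = rng.normal(size=(N, M))
omega_true = R @ R.T
upper = np.triu(rng.random((N, N)) < p, k=1)
mask = upper | upper.T                                 # diagonal is treated as unknown
estimate = reconstruct(omega_true * mask, mask, M)

unknown_error = np.abs(estimate - omega_true)[~mask].mean()
```

Here the density p = 0.6 is far above $2M/N \approx 0.07$, so the unknown elements, including the diagonal, should be recovered to good accuracy.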
![Page 13: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/13.jpg)
Convergence of the algorithm
• Above $p_2$ the algorithm converges exponentially to the exact values of the unknown elements
• The rate of convergence scales as $(p - p_2)^2$
![Page 14: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/14.jpg)
Reality check: sources of errors
• Customers are not rational! $\Omega_{IJ} = \mathbf{r}_I \cdot \mathbf{b}_J + \epsilon_{IJ}$ (idiosyncrasy)
• Opinions are delivered to the matchmaker through a narrow channel:
– Binary channel $S_{IJ} = \mathrm{sign}(\Omega_{IJ})$: 1 or 0 (liked or not)
– Experience rated on a scale of 1 to 5, or 1 to 10 at best
• If the number of edges K and the size N are large while M is small, these errors can be reduced
![Page 15: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/15.jpg)
How to determine M?
• In real systems M is not fixed: there are always finer and finer details of tastes
• Given the number of known opinions K, one should choose $M_{eff} \sim K/(N_{readers}+N_{books})$ so that systems are below the second transition $p_2$
• Tastes should be determined hierarchically
![Page 16: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/16.jpg)
Avoid overfitting
• Divide known votes into training and test sets
• Select $M_{eff}$ so as to avoid overfitting!
[Figure: two panels, "Reasonable fit" and "Overfit".]
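A minimal sketch of this procedure in plain Python is shown below. The helper names, the 20% test fraction, and the `test_error` callback (which would fit the model on the training votes with a given number of dimensions and return the error on the test votes) are all hypothetical.

```python
import random

def split_votes(votes, test_fraction=0.2, seed=0):
    """Randomly divide the known votes into a training set and a test set."""
    shuffled = list(votes)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]

def select_meff(candidates, test_error):
    """Pick the candidate number of dimensions with the smallest test error.

    test_error is a caller-supplied function m -> error on the held-out votes.
    """
    return min(candidates, key=test_error)

# Votes as (reader, book, rating) triples; the values are made up for illustration.
votes = [(reader, book, (reader + book) % 5 + 1) for reader in range(5) for book in range(2)]
train, test = split_votes(votes)
```

Increasing M always lowers the error on the training votes, so only the error on the held-out votes can signal where the fit starts to chase idiosyncrasies.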
![Page 17: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/17.jpg)
Knowledge networks in biology
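• Interacting biomolecules: key and lock principle
• Matrix of interactions (binding energies) $\Omega_{IJ} = \mathbf{k}_I \cdot \mathbf{l}_J + \mathbf{l}_I \cdot \mathbf{k}_J$
• Matchmaker (bioinformatics researcher) tries to guess yet unknown interactions based on the pattern of known ones
• Many experiments measure $S_{IJ} = \theta(\Omega_{IJ} - \Omega_{th})$
[Figure: two molecules with keys k(1), k(2) and locks l(1), l(2).]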
![Page 18: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/18.jpg)
Real systems
• Internet commerce: the dataset of opinions on movies collected by Compaq Systems Research Center:
– 72916 users entered a total of 2811983 numeric ratings (* to *****) for 1628 different movies: $M_{eff} \sim 40$
– Default set for collaborative filtering research
• Biology: table of interactions between yeast proteins from the Ito et al. high-throughput two-hybrid experiment
– 6000 proteins (~3300 have at least one interaction partner) and 4400 known interactions
– Binary (interact or not)
– $M_{eff} \sim 1$: too small!
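The two estimates of the effective dimension are consistent with dividing the number of known opinions K by the number of vertices, as a short calculation shows. Using only the roughly 3300 proteins that have at least one partner as the vertex count for the yeast data is an assumption, and the function name is hypothetical.

```python
def effective_dimension(known_opinions, vertices):
    """Rough M_eff: number of known opinions per vertex."""
    return known_opinions / vertices

# Movie ratings: vertices are users plus movies.
movies_meff = effective_dimension(2811983, 72916 + 1628)   # about 37.7

# Yeast two-hybrid data: vertices are the proteins with at least one partner.
proteins_meff = effective_dimension(4400, 3300)            # about 1.3
```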
![Page 19: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/19.jpg)
Yeast Protein Interaction Network
• Data from T. Ito et al., PNAS (2001)
• Full set contains 4549 interactions among 3278 yeast proteins
• Shown here are only nuclear proteins interacting with at least one other nuclear protein
![Page 20: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/20.jpg)
Correlations in connectivities
• Basic design principles of the network can be revealed by comparing the frequency of a pattern in real and random networks
• $P(k_0,k_1)$: probability that nodes with connectivities $k_0$ and $k_1$ directly interact
• It should be normalized by $P_r(k_0,k_1)$, the same property in a randomized network such that:
– Each node has the same number of neighbors (connectivity)
– These neighbors are randomly selected
– The whole ensemble of random networks can be generated
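The slide does not say how the randomized ensemble is generated. One common way to satisfy these conditions is repeated edge swapping, sketched here in plain Python. The function name, the attempt cap and the toy graph are assumptions. Each swap replaces the edges a-b and c-d with a-d and c-b, which keeps every node's connectivity unchanged.

```python
import random

def rewire(edges, n_swaps, seed=0):
    """Degree-preserving randomization of an undirected simple graph by edge swaps."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:          # try both orientations of the second edge
            c, d = d, c
        if len({a, b, c, d}) < 4:       # a shared node would create a self-loop
            continue
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue                    # swap would create a double edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges

# Toy graph: a ring of 10 nodes with two chords.
edges = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5), (2, 7)]
randomized = rewire(edges, n_swaps=50)
```

Calling `rewire` with different seeds gives different members of the ensemble over which the random-network statistics can be averaged.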
![Page 21: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/21.jpg)
Correlation profile of the protein interaction network
$P(k_0,k_1)/P_r(k_0,k_1)$
$Z(k_0,k_1) = (P(k_0,k_1) - P_r(k_0,k_1))/\sigma_r(k_0,k_1)$
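Both quantities can be computed from edge counts, as in the sketch below. It assumes that the normalizer in Z is the standard deviation of the count over the randomized ensemble, and it works with counts rather than probabilities, which gives the same ratio because every randomized network has the same number of edges. The function names and sample numbers are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def degree_pair_counts(edges):
    """Number of edges joining nodes of connectivities (k0, k1), with k0 <= k1."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    counts = Counter()
    for a, b in edges:
        counts[tuple(sorted((degree[a], degree[b])))] += 1
    return counts

def correlation_profile(observed, randomized):
    """Ratio and Z-score of one (k0, k1) count against an ensemble of counts.

    Assumes the ensemble mean and spread are both non-zero.
    """
    mu = mean(randomized)
    sigma = pstdev(randomized)
    return observed / mu, (observed - mu) / sigma

star_counts = degree_pair_counts([(0, 1), (0, 2), (0, 3)])
ratio, z = correlation_profile(observed=12, randomized=[8, 10, 12, 10])
```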
![Page 22: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/22.jpg)
Correlation profile of the internet
![Page 23: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/23.jpg)
What might it mean?
• Hubs avoid each other (as in the internet: R. Pastor-Satorras et al., Phys. Rev. Lett. (2001))
• Hubs prefer to connect to terminal ends (nodes with low connectivity)
• Specificity: the network is organized in modules clustered around individual hubs
• Stability: the number of second-nearest neighbors is suppressed, so it is harder to propagate deleterious perturbations
![Page 24: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/24.jpg)
Conclusion
• Studies of networks are similar to paleontology: learning about an organism from its backbone
• You can learn a lot about a complex system from its network! But not everything…
![Page 25: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/25.jpg)
THE END
![Page 26: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/26.jpg)
Entropy of unknown opinions
[Figure: entropy of unknown opinions plotted against the density of known opinions p, with the thresholds p1 and p2 marked.]
![Page 27: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/27.jpg)
How to determine p2?
• K known elements of an $N \times N$ matrix $\Omega_{IJ} = \mathbf{r}_I \cdot \mathbf{b}_J$ ($N = N_r + N_b$)
• Approximately $N \times M$ degrees of freedom (minus $M(M-1)/2$ gauge parameters)
• For $K > MN$ all missing elements can be reconstructed ⇒ $p_2 = K_2/(N(N-1)/2) \approx 2M/N$
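The counting argument translates directly into numbers. The sketch below also includes the ordinary percolation threshold of about 1/N quoted earlier in the talk. The function name and the example sizes are assumptions.

```python
def percolation_thresholds(n_nodes, m_dims):
    """Return (p1, K2, p2): the ordinary threshold, the number of known opinions
    needed for full reconstruction, and the corresponding edge density."""
    pairs = n_nodes * (n_nodes - 1) / 2   # number of off-diagonal elements
    p1 = 1 / n_nodes                      # ordinary percolation threshold
    k2 = m_dims * n_nodes                 # about N * M degrees of freedom
    p2 = k2 / pairs                       # rigidity threshold, close to 2M/N
    return p1, k2, p2

p1, k2, p2 = percolation_thresholds(n_nodes=1001, m_dims=10)
```

For these sizes `k2` is 10010 known opinions out of 500500 possible pairs, a density of 2%, which is twenty times the ordinary percolation threshold.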
![Page 28: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/28.jpg)
What is a knowledge network?
• Undirected graph with N vertices and K edges
• Each vertex has a (hidden) M-dimensional vector of tastes/features
• Each edge carries a scalar product (opinion) of the vectors on the vertices it connects
• The centralized matchmaker tries to guess the vectors (tastes) based on their scalar products (opinions) and to predict unknown opinions
![Page 29: Extracting hidden information from knowledge networks Sergei Maslov Brookhaven National Laboratory, New York, USA](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d255503460f949fc4b9/html5/thumbnails/29.jpg)
Versions of knowledge networks
• Regular graph: every link is allowed. Example: recommending people to other people according to their areas of interest
• Bipartite graphs. Example: customers to products
• Non-reciprocal opinions: each vertex has two vectors $\mathbf{d}_I$, $\mathbf{q}_I$ so that $\Omega_{IJ} = \mathbf{d}_I \cdot \mathbf{q}_J$. Example: a real matchmaker recommending men to women.