Title: Extracting hidden information from knowledge networks
1. Extracting hidden information from knowledge networks
- Sergei Maslov
- Brookhaven National Laboratory, New York, USA
2. Outline of the talk
- What is a knowledge network, and how is it different from an ordinary graph or network?
- Knowledge networks on the internet: matching products to customers
- Knowledge networks in biology: large ensembles of interacting biomolecules
- Empirical study of correlations in the network of interacting proteins
- Collaborators: Y.-C. Zhang and K. Sneppen
3. Networks in complex systems
- A network is the backbone of a complex system
- It answers the question: who interacts with whom?
- Examples:
- Internet and WWW
- Interacting biomolecules (metabolic, physical, regulatory)
- Food webs in ecosystems
- Economics: customers and products
- Social: people and their choice of partners
4. Predicting tastes of customers based on their opinions on products
- Each of us has personal tastes
- These tastes are sometimes unknown even to ourselves (hidden wants)
- Information is contained in our opinions on products
- Matchmaking customers with similar tastes can be used to predict future opinions
- The internet makes it possible to do this on a large scale
5. Types of networks
Plain network
Knowledge or opinion network
[Figure: examples of the two network types; one set of nodes is labeled "readers"]
6. Storing opinions
Network of opinions: [figure]
Matrix of opinions $\Omega_{IJ}$:
X X X 2 9 ? ?
X X X ? 8 ? 8
X X X ? ? 1 ?
2 ? ? X X X X
9 8 ? X X X X
? ? 1 X X X X
? 8 ? X X X X
7. Using correlations to reconstruct customers' tastes
- Similar opinions ⇒ similar tastes
- Simplest model:
- Readers → M-dimensional vector of tastes $\mathbf{r}_I$
- Books → M-dimensional vector of features $\mathbf{b}_J$
- Opinions → scalar product
- $\Omega_{IJ} = \mathbf{r}_I \cdot \mathbf{b}_J$
[Figure: customers 1-3 linked to books 1-4, with the known opinions (2, 9, 8, 8, 1) written on the edges]
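The simplest model above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `make_opinions`, the Gaussian choice for the hidden vectors, and the `density` parameter (probability that an opinion is known) are hypothetical and not taken from the slides.

```python
import numpy as np

def make_opinions(n_readers, n_books, m, density, seed=0):
    """Toy knowledge network: omega[i, j] = r_i . b_j, revealed with probability `density`."""
    rng = np.random.default_rng(seed)
    readers = rng.standard_normal((n_readers, m))   # hidden tastes r_I (assumed Gaussian)
    books = rng.standard_normal((n_books, m))       # hidden features b_J (assumed Gaussian)
    omega = readers @ books.T                       # every opinion as a scalar product
    known = rng.random((n_readers, n_books)) < density  # mask of opinions actually voiced
    return readers, books, omega, known

readers, books, omega, known = make_opinions(3, 4, m=2, density=0.5)
# NaN marks an opinion the matchmaker has not been told.
print(np.where(known, np.round(omega, 2), np.nan))
```

The matchmaker only ever sees the entries where `known` is true; `readers` and `books` stay hidden.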
8. Loop correlation
- Predictive power: $1/M^{(L-1)/2}$
- One needs many loops to completely freeze the mutual orientation of the vectors
9. Field Theory Approach
- If all components of the vectors are Gaussian and uncorrelated:
- The generating functional is $\det(1 + i\,?)^{-M/2}$
- All irreducible correlations are proportional to M
- All loop correlations: $\langle \Omega_{12}\,\Omega_{23}\,\Omega_{34} \cdots \Omega_{L1} \rangle = M$
- Since each $\Omega_{IJ} \sim \sqrt{M}$, the sign correlation scales as $M^{-(L-1)/2}$
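The loop result can be checked by hand. The derivation below is a sketch that assumes every vector component is an independent Gaussian with zero mean and unit variance (the slides only say "Gaussian and uncorrelated"); the generic vectors $\mathbf{v}_n$ and the component indices $a_n$ are notation introduced here.

```latex
\[
\Omega_{n,n+1} = \sum_{a_n=1}^{M} v_n^{a_n} v_{n+1}^{a_n},
\qquad
\langle v_n^{a} v_{n'}^{b} \rangle = \delta_{nn'}\,\delta^{ab},
\qquad
\mathbf{v}_{L+1} \equiv \mathbf{v}_1,\quad a_0 \equiv a_L .
\]
\[
\langle \Omega_{12}\,\Omega_{23} \cdots \Omega_{L1} \rangle
= \sum_{a_1,\dots,a_L} \prod_{n=1}^{L} \langle v_n^{a_{n-1}} v_n^{a_n} \rangle
= \sum_{a_1,\dots,a_L} \prod_{n=1}^{L} \delta^{a_{n-1} a_n}
= M .
\]
\[
\langle \Omega_{12}^2 \rangle
= \sum_{a,b} \langle v_1^{a} v_1^{b} \rangle \langle v_2^{a} v_2^{b} \rangle
= M
\quad\Rightarrow\quad
\Omega_{IJ} \sim \sqrt{M} .
\]
```

The chain of Kronecker deltas forces all component indices to coincide, which leaves a single free sum over M values.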
10. Main parameter: density of edges
- The larger the density of edges p, the easier the prediction
- At $p_1 \approx 1/N$ ($N = N_{readers} + N_{books}$), macroscopic prediction becomes possible. Nodes are connected, but the vectors $\mathbf{r}_I$, $\mathbf{b}_J$ are not fixed (ordinary percolation threshold)
- At $p_2 \approx 2M/N > p_1$, all tastes and features ($\mathbf{r}_I$ and $\mathbf{b}_J$) can be uniquely reconstructed (rigidity percolation threshold)
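As a rough illustration of the two thresholds, the helper below places a dataset in one of the three regimes. It is a sketch, not a definitive criterion: the function name `density_regime` is hypothetical, the density is normalized by N(N-1)/2 as in the later slide on determining p2, and the thresholds are used as the approximate values 1/N and 2M/N.

```python
def density_regime(n_known, n_readers, n_books, m):
    """Classify the density of known opinions against p1 ~ 1/N and p2 ~ 2M/N."""
    n = n_readers + n_books
    p = n_known / (n * (n - 1) / 2)   # fraction of all vertex pairs with a known opinion
    p1 = 1 / n                        # approximate ordinary percolation threshold
    p2 = 2 * m / n                    # approximate rigidity percolation threshold
    if p < p1:
        regime = "below p1"           # no macroscopic prediction
    elif p < p2:
        regime = "between p1 and p2"  # prediction possible, vectors not fixed
    else:
        regime = "above p2"           # unique reconstruction expected
    return p, p1, p2, regime

print(density_regime(n_known=100, n_readers=50, n_books=50, m=2))
```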
11. Spectral properties of $\Omega$
- For $M < N$ the matrix $\Omega_{IJ}$ has $N-M$ zero eigenvalues and $M$ positive ones: $\Omega = R \cdot R^{\dagger}$.
- Using SVD one can diagonalize $R = U \cdot D \cdot V$ such that the matrices $V$ and $U$ are orthogonal, $V^{\dagger} \cdot V = 1$, $U^{\dagger} \cdot U = 1$, and $D$ is diagonal. Then $\Omega = U \cdot D^2 \cdot U^{\dagger}$
- The amount of information contained in $\Omega$: $NM - M(M-1)/2 \ll N(N-1)/2$, the number of off-diagonal elements
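These spectral statements are easy to verify numerically. The snippet below is a minimal check on a random example, assuming real Gaussian vectors (so the dagger is a plain transpose); the sizes `N = 8`, `M = 3` and the tolerance `1e-9` are arbitrary choices made here.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 8, 3
R = rng.standard_normal((N, M))        # row I holds the hidden vector of vertex I
omega = R @ R.T                        # Omega = R R^T, symmetric N x N

# Thin SVD: R = U diag(d) Vt, with orthonormal columns in U and orthogonal Vt.
U, d, Vt = np.linalg.svd(R, full_matrices=False)
eigvals = np.linalg.eigvalsh(omega)    # eigenvalues in ascending order

n_positive = int(np.sum(eigvals > 1e-9))
n_parameters = N * M - M * (M - 1) // 2   # independent numbers needed to specify Omega
n_offdiag = N * (N - 1) // 2              # off-diagonal elements of Omega
print(n_positive, N - n_positive, n_parameters, n_offdiag)
```

The positive eigenvalues of `omega` coincide with the squared singular values `d**2`, which is the statement that Omega equals U D² U^T.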
12. Practical recursive algorithm for predicting unknown opinions
- Start with $\Omega_0$, where all unknown elements are filled with $\langle\Omega\rangle$ (zero in our case)
- Diagonalize and keep only the M largest eigenvalues and eigenvectors
- In the resulting truncated matrix, replace all known elements with their exact values and go to step 1
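The three steps above can be sketched as follows. This is one possible reading, not the authors' code: it uses the symmetric (non-bipartite) case where Omega = R R^T, treats the diagonal as unknown, and runs a fixed number of iterations. The function name `reconstruct` and the sizes in the demo are assumptions.

```python
import numpy as np

def reconstruct(omega_known, known, m, n_iter=300):
    """Iterate rank-m truncation followed by re-imposing the known entries."""
    est = np.where(known, omega_known, 0.0)       # step 1: unknowns start at <Omega> = 0
    for _ in range(n_iter):
        vals, vecs = np.linalg.eigh(est)          # step 2: diagonalize (ascending eigenvalues)
        top = slice(-m, None)                     # indices of the m largest eigenvalues
        est = (vecs[:, top] * vals[top]) @ vecs[:, top].T
        est = np.where(known, omega_known, est)   # step 3: restore exact known values
    return est

rng = np.random.default_rng(0)
N, M, p = 40, 2, 0.7
R = rng.standard_normal((N, M))
omega = R @ R.T                                   # ground truth, used only for scoring
upper = np.triu(rng.random((N, N)) < p, k=1)
known = upper | upper.T                           # symmetric mask, diagonal left unknown

est = reconstruct(omega, known, M)                # only entries under `known` are read
rms_error = np.sqrt(np.mean((est - omega)[~known] ** 2))
rms_baseline = np.sqrt(np.mean(omega[~known] ** 2))   # error of the zero-filled guess
print(rms_error, rms_baseline)
```

The demo density is chosen well above the approximate threshold 2M/N = 0.1, where the slides say the iteration converges.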
13. Convergence of the algorithm
- Above $p_2$ the algorithm converges exponentially to the exact values of the unknown elements
- The rate of convergence scales as $(p-p_2)^2$
14. Reality check: sources of errors
- Customers are not rational! $\Omega_{IJ} = \mathbf{r}_I \cdot \mathbf{b}_J + \epsilon_{IJ}$ (idiosyncrasy)
- Opinions are delivered to the matchmaker through a narrow channel:
- Binary channel: $S_{IJ} = \mathrm{sign}(\Omega_{IJ})$, 1 or 0 (liked or not)
- Experience rated on a scale of 1 to 5, or 1 to 10 at best
- If the number of edges K and the size N are large while M is small, these errors can be reduced
15. How to determine M?
- In real systems M is not fixed: there are always finer and finer details of tastes
- Given the number of known opinions K, one should choose $M_{eff} \sim K/(N_{readers} + N_{books})$ so that the system is below the second transition $p_2$ ⇒ tastes should be determined hierarchically
16. Avoid overfitting
- Divide known votes into training and test sets
- Select $M_{eff}$ so as to avoid overfitting!
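One way to act on this advice is to hold out part of the known votes and pick the rank that predicts them best. The sketch below assumes a rectangular readers-by-books matrix, a truncated-SVD fill-in loop as the predictor, and synthetic noisy data; the names `low_rank_fill` and `choose_m`, the 20% test fraction, and the noise level are all hypothetical choices.

```python
import numpy as np

def low_rank_fill(values, known, m, n_iter=100):
    """Fill unknown entries by repeated rank-m SVD truncation, keeping known entries fixed."""
    est = np.where(known, values, 0.0)
    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(est, full_matrices=False)
        est = (u[:, :m] * s[:m]) @ vt[:m]          # best rank-m approximation
        est = np.where(known, values, est)         # re-impose the training votes
    return est

def choose_m(ratings, known, candidates, test_fraction=0.2, seed=0):
    """Return the candidate rank with the lowest RMS error on held-out votes."""
    rng = np.random.default_rng(seed)
    is_test = known & (rng.random(known.shape) < test_fraction)
    is_train = known & ~is_test
    errors = {}
    for m in candidates:
        est = low_rank_fill(ratings, is_train, m)  # test votes are hidden from the fit
        errors[m] = float(np.sqrt(np.mean((est - ratings)[is_test] ** 2)))
    return min(errors, key=errors.get), errors

# Synthetic votes: rank-3 signal plus idiosyncratic noise, half of the entries known.
rng = np.random.default_rng(2)
readers = rng.standard_normal((60, 3))
books = rng.standard_normal((50, 3))
ratings = readers @ books.T + 0.3 * rng.standard_normal((60, 50))
known = rng.random((60, 50)) < 0.5

candidates = list(range(1, 9))
best_m, errors = choose_m(ratings, known, candidates)
print(best_m, errors)
```

A rank that is too large fits the training votes more closely but tends to predict the held-out votes worse, which is the overfitting the slide warns about.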
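17. Knowledge networks in biology
- Interacting biomolecules: key-and-lock principle
- Matrix of interactions (binding energies): $\Omega_{IJ} = \mathbf{k}_I \cdot \mathbf{l}_J + \mathbf{l}_I \cdot \mathbf{k}_J$
- The matchmaker (a bioinformatics researcher) tries to guess yet-unknown interactions based on the pattern of known ones
- Many experiments measure $S_{IJ} = \theta(\Omega_{IJ} - \Omega_{th})$
[Figure: two molecules labeled with keys k(1), k(2) and locks l(1), l(2)]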
18. Real systems
- Internet commerce: the dataset of opinions on movies collected by the Compaq Systems Research Center
- 72916 users entered a total of 2811983 numeric ratings ( to ) for 1628 different movies: $M_{eff} \approx 40$
- Default set for collaborative filtering research
- Biology: table of interactions between yeast proteins from the high-throughput two-hybrid experiment of Ito et al.
- 6000 proteins (3300 have at least one interaction partner) and 4400 known interactions
- Binary (interact or not)
- $M_{eff} \approx 1$: too small!
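The quoted values of M_eff can be reproduced to order of magnitude from the dataset sizes, assuming the estimate is the number of known opinions K divided by the number of nodes N. The helper name `m_eff` is hypothetical, and for yeast it is not stated whether N counts all 6000 proteins or only the 3300 with a partner, so both are shown.

```python
def m_eff(n_known, n_nodes):
    """Order-of-magnitude estimate of the number of resolvable taste dimensions, K / N."""
    return n_known / n_nodes

movies = m_eff(2811983, 72916 + 1628)   # ratings / (users + movies)
yeast_all = m_eff(4400, 6000)           # interactions / all proteins
yeast_connected = m_eff(4400, 3300)     # interactions / proteins with a partner

print(f"{movies:.1f} {yeast_all:.2f} {yeast_connected:.2f}")  # prints: 37.7 0.73 1.33
```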
19. Yeast Protein Interaction Network
- Data from T. Ito et al., PNAS (2001)
- Full set contains 4549 interactions among 3278 yeast proteins
- Only nuclear proteins interacting with at least one other nuclear protein are shown here
20. Correlations in connectivities
- Basic design principles of the network can be revealed by comparing the frequency of a pattern in real and random networks
- $P(k_0,k_1)$: probability that nodes with connectivities $k_0$ and $k_1$ directly interact
- Should be normalized by $P_r(k_0,k_1)$: the same property in a randomized network such that:
- Each node has the same number of neighbors (connectivity)
- These neighbors are randomly selected
- The whole ensemble of random networks can be generated
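The slides do not say how the randomized ensemble is generated. One common approach, sketched here as an assumption, is repeated edge swapping: pick two edges, exchange their endpoints, and reject any swap that would create a self-loop or a duplicate edge. The function name `rewire` and the small example graph are hypothetical.

```python
import random
from collections import Counter

def rewire(edges, n_swaps, seed=0):
    """Degree-preserving randomization of an undirected simple graph by edge swaps."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:
            c, d = d, c                      # pick one of the two possible pairings
        if a == d or c == b:
            continue                         # swap would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue                         # swap would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)  # every node keeps its number of neighbors
        done += 1
    return edges

def degrees(edges):
    """Number of neighbors of each node."""
    return Counter(node for edge in edges for node in edge)

original = [(0, 1), (0, 2), (0, 3), (4, 5), (4, 6), (5, 6), (1, 7), (2, 8)]
randomized = rewire(original, n_swaps=20)
print(randomized)
```

Calling `rewire` with different seeds yields different members of the ensemble, all with the same connectivity for every node.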
21. Correlation profile of the protein interaction network
$P(k_0,k_1)/P_r(k_0,k_1)$
$Z(k_0,k_1) = (P(k_0,k_1) - P_r(k_0,k_1))/\sigma_r(k_0,k_1)$
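Both quantities can be computed from edge counts between connectivity classes. The sketch below assumes the counts have already been tabulated as arrays indexed by (k0, k1) bins, one array for the real network and one per randomized network; the function name `correlation_profile` and the toy numbers are made up for illustration. Because real and randomized networks have the same number of edges, ratios of counts equal ratios of probabilities.

```python
import numpy as np

def correlation_profile(real_counts, random_counts):
    """Return P/P_r and Z = (P - P_r) / sigma_r for each pair of connectivity bins."""
    real_counts = np.asarray(real_counts, dtype=float)
    random_counts = np.asarray(random_counts, dtype=float)  # shape: (n_networks, bins, bins)
    mean_r = random_counts.mean(axis=0)     # ensemble average, plays the role of P_r
    std_r = random_counts.std(axis=0)       # ensemble spread, plays the role of sigma_r
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(mean_r > 0, real_counts / mean_r, np.nan)
        z = np.where(std_r > 0, (real_counts - mean_r) / std_r, np.nan)
    return ratio, z

# Toy example with two connectivity bins (low, high) and two randomized networks.
real = [[2, 10], [10, 1]]
ensemble = [[[4, 8], [8, 3]], [[6, 6], [6, 5]]]
ratio, z = correlation_profile(real, ensemble)
print(ratio)
print(z)
```

In this toy table the high-high entry has a ratio below 1 and a negative Z, the kind of signature described in the slides as hubs avoiding each other.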
22. Correlation profile of the internet
23. What might it mean?
- Hubs avoid each other (as in the internet: R. Pastor-Satorras et al., Phys. Rev. Lett. (2001))
- Hubs prefer to connect to terminal ends (low-connectivity nodes)
- Specificity: the network is organized in modules clustered around individual hubs
- Stability: the number of second-nearest neighbors is suppressed ⇒ harder to propagate deleterious perturbations
24. Conclusion
- Studies of networks are similar to paleontology: learning about an organism from its backbone
- You can learn a lot about a complex system from its network! But not everything
25. THE END
26. Entropy of unknown opinions
[Figure: entropy versus density of known opinions p; axis marks p1, p2, 0, 1]
27. How to determine $p_2$?
- K known elements of an $N \times N$ matrix $\Omega_{IJ} = \mathbf{r}_I \cdot \mathbf{b}_J$ ($N = N_r + N_b$)
- Approximately $N \times M$ degrees of freedom (minus $M(M-1)/2$ gauge parameters)
- For $K > MN$ all missing elements can be reconstructed ⇒ $p_2 = K_2/(N(N-1)/2) \approx 2M/N$
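The counting argument can be written out step by step. This is a sketch under the assumption that $M \ll N$, with $K_2$ denoting the number of known elements at the threshold; the numbers in the last line are a made-up example, not data from the talk.

```latex
\[
K_2 \simeq NM - \frac{M(M-1)}{2} \approx NM \qquad (M \ll N),
\]
\[
p_2 = \frac{K_2}{N(N-1)/2} \approx \frac{2NM}{N(N-1)} = \frac{2M}{N-1} \approx \frac{2M}{N}.
\]
\[
\text{Example: } N = 1000,\; M = 10 \;\Rightarrow\; p_2 \approx \frac{2 \cdot 10}{1000} = 0.02 .
\]
```

In this example, knowing about 2% of all possible opinions would already be enough to fix the remaining ones.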
28. What is a knowledge network?
- Undirected graph with N vertices and K edges
- Each vertex has a (hidden) M-dimensional vector of tastes/features
- Each edge carries a scalar product (opinion) of the vectors on the vertices it connects
- The centralized matchmaker tries to guess the vectors (tastes) from their scalar products (opinions) and to predict unknown opinions
29. Versions of knowledge networks
- Regular graph: every link is allowed. Example: recommending people to other people according to their areas of interest
- Bipartite graph. Example: customers to products
- Non-reciprocal opinions: each vertex has two vectors $\mathbf{d}_I$, $\mathbf{q}_I$, so that $\Omega_{IJ} = \mathbf{d}_I \cdot \mathbf{q}_J$. Example: a real matchmaker recommending men to women.