Detecting topological patterns in protein networks

Provided by: sergei9
1
Lecture 4
2
  • Modules/communities in networks

3
What is a module?
  • Nodes in a given module (or community, group, or
    functional unit) tend to connect with other nodes
    in the same module
  • Biology: proteins of the same function (e.g. DNA
    repair) or sub-cellular localization (e.g.
    nucleus)
  • WWW: websites on a common topic (e.g. physics)
    or from the same organization (e.g. EPFL)
  • Internet: autonomous systems/routers grouped by
    geography (e.g. Switzerland) or domain (e.g.
    educational or military)

4
Sometimes easy to discover
5
Sometimes hard
6
Hierarchical clustering
  • Calculate the similarity weight $W_{ij}$ for all
    pairs of vertices (e.g. the number of independent
    paths from i to j)
  • Start with all n vertices disconnected
  • Add edges between pairs one by one in order of
    decreasing weight
  • Result: nested components, where one can take a
    slice at any level of the tree
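
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weights are passed in as a dictionary, the example weights are made up, and the function name `clusters_at_level` is hypothetical. Edges are added in order of decreasing weight with a union-find structure, and stopping at a threshold corresponds to taking a slice of the tree at that level.

```python
def clusters_at_level(n, weights, threshold):
    """Add edges in order of decreasing weight, stopping below `threshold`.

    weights: dict {(i, j): W_ij} for vertices 0..n-1.
    Returns the connected components (clusters) at that level of the tree.
    """
    parent = list(range(n))  # start with all n vertices disconnected

    def find(x):
        # Follow parent pointers to the representative of x's component.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        if w < threshold:
            break
        parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


# Made-up similarity weights: two tight groups joined by one weak link.
weights = {(0, 1): 5, (1, 2): 4, (0, 2): 4, (3, 4): 5, (2, 3): 1}
print(clusters_at_level(5, weights, threshold=4))
print(clusters_at_level(5, weights, threshold=1))
```

Lowering the threshold merges the nested components step by step, which is the tree one slices in hierarchical clustering.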

7
Girvan-Newman (2002) betweenness clustering
  • Betweenness of an edge i -- j is the number of
    shortest paths going through this edge
  • Algorithm:
  • compute the betweenness of all edges
  • remove the edge with the highest betweenness
  • recalculate betweenness and repeat
  • Caveats:
  • betweenness needs to be recalculated at each step
  • very expensive: all-pairs shortest paths take $O(N^3)$
  • may need to repeat up to N times
  • does not scale to more than a few hundred nodes,
    even with the fastest algorithms
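
One step of this procedure can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes an unweighted, undirected graph stored as a dictionary of neighbour sets, it counts shortest paths with breadth-first search (the Brandes approach rather than a generic all-pairs routine), and the example graph is made up. The edge removed is the one with the highest betweenness, as in the Girvan-Newman algorithm.

```python
from collections import deque


def edge_betweenness(adj):
    """Number of shortest paths through each edge (split equally among ties)."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths to every node.
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path counts on edges.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += c
                delta[v] += c
    # Every pair of endpoints was counted once from each side.
    return {e: b / 2.0 for e, b in bet.items()}


def girvan_newman_step(adj):
    """Remove the edge with the highest betweenness and return it."""
    bet = edge_betweenness(adj)
    edge = max(bet, key=bet.get)
    u, v = tuple(edge)
    adj[u].discard(v)
    adj[v].discard(u)
    return edge


# Made-up example: two triangles joined by the bridge 2 -- 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
bridge_betweenness = edge_betweenness(adj)[frozenset((2, 3))]
removed = girvan_newman_step(adj)
```

Repeating `girvan_newman_step` until the graph falls apart gives the full algorithm, and the recomputation inside every step is exactly the cost listed under the caveats.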

8
  • Using random walks/diffusion to discover modules
    in networks

K. Eriksen, I. Simonsen, S. Maslov, K. Sneppen,
PRL 90, 148701 (2003)
9
Why diffusion?
  • Any dynamical process would equilibrate faster on
    modules and slower between modules
  • Thus its slow modes reveal modules
  • Diffusion is the simplest dynamical process
    (people also use others like Ising/Potts model,
    etc.)

10
Random walkers on a network
  • Study the behavior of many VIRTUAL random walkers
    on a network
  • At each time step each random walker steps to a
    randomly selected neighbor
  • They equilibrate to a steady state $n_i \sim k_i$
    (in solid state physics, $n_i$ = const)
  • Slow modes of equilibration to the steady state
    allow one to detect modules in a network
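
A minimal sketch of these walkers, assuming an undirected network stored as a dictionary of neighbour sets (the graph below is made up, and the function name `diffuse` is hypothetical). Instead of simulating individual walkers it evolves their average density, which is enough to see the steady state become proportional to the degree.

```python
def diffuse(adj, n, steps):
    """Evolve walker densities: each node sends equal shares to its neighbours."""
    for _ in range(steps):
        new = {v: 0.0 for v in adj}
        for j, nj in n.items():
            share = nj / len(adj[j])  # walkers at j split evenly among k_j links
            for i in adj[j]:
                new[i] += share
        n = new
    return n


# Two triangles joined by one link; all walkers start on node 0.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
start = {v: 0.0 for v in adj}
start[0] = 1.0

steady = diffuse(adj, start, steps=200)
total_degree = sum(len(nbrs) for nbrs in adj.values())
for v in sorted(adj):
    # The two columns should agree: n_i is proportional to k_i.
    print(v, round(steady[v], 6), round(len(adj[v]) / total_degree, 6))
```

In this example the slowest mode to decay is the imbalance between the two triangles, which is the sense in which slow modes point to modules.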

11
Matrix formalism
12
Eigenvectors of the transfer matrix $T_{ij}$
13
Similarity transformation
  • Matrix $T_{ij}$ is asymmetric → it could in principle
    have complex eigenvalues/eigenvectors
  • Luckily, $S_{ij} = 1/(\sqrt{K_i}\sqrt{K_j})$ has the same
    eigenvalues, with eigenvectors $v_i/\sqrt{K_i}$
  • This is known as a similarity transformation
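
The relation between the two matrices can be checked numerically on a small example. The sketch below rests on an assumption, since the slide defining the transfer matrix did not survive in this transcript: it takes $T_{ij} = A_{ij}/K_j$ for an adjacency matrix A, with $S_{ij} = A_{ij}/\sqrt{K_i K_j}$, which equals $1/(\sqrt{K_i}\sqrt{K_j})$ on linked pairs. The four-node graph is made up.

```python
import math

# Adjacency matrix of a small undirected graph (a triangle plus one pendant node).
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
n = len(A)
K = [sum(row) for row in A]  # node degrees

# Assumed transfer matrix: a walker at j moves to each neighbour with probability 1/K_j.
T = [[A[i][j] / K[j] for j in range(n)] for i in range(n)]
# Symmetric partner obtained by the similarity transformation.
S = [[A[i][j] / math.sqrt(K[i] * K[j]) for j in range(n)] for i in range(n)]


def matvec(M, x):
    """Multiply matrix M by vector x."""
    return [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]


# The steady state v_i = K_i is an eigenvector of T with eigenvalue 1 ...
v = [float(k) for k in K]
# ... and v_i / sqrt(K_i) is the corresponding eigenvector of S.
u = [v[i] / math.sqrt(K[i]) for i in range(n)]

print(matvec(T, v), v)
print(matvec(S, u), u)
```

Because S is symmetric, its eigenvalues are guaranteed to be real, and they coincide with those of T.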

14
Density of states $\rho(\lambda)$
  • filled circles: real AS-network
  • empty squares: degree-preserving randomized
    version

15
Participation ratio $PR(\lambda) = 1/\sum_i (v^{(\lambda)}_i)^4$
[Plot: participation ratio (vertical axis, 0 to 250) versus eigenvalue $\lambda$ (horizontal axis, -1 to 1)]
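As a small illustration of what the participation ratio measures, here is a sketch that assumes the usual definition, one over the sum of fourth powers of the components of a normalized eigenvector. The function name and the example vectors are made up.

```python
import math


def participation_ratio(v):
    """1 / sum_i v_i^4 for the vector v normalized to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return 1.0 / sum((x / norm) ** 4 for x in v)


# A mode spread evenly over 10 nodes, and a mode sitting on only 2 of them.
spread_out = [1.0] * 10
localized = [1.0, -1.0] + [0.0] * 8

print(participation_ratio(spread_out))
print(participation_ratio(localized))
```

With this definition the ratio roughly counts the nodes taking part in a mode, so a module of a few hundred nodes shows up as a slow mode with a participation ratio of that order.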
16
US Military
17
2   0.9626   RU RU RU RU CA RU RU   |   ?? ?? US US US US ?? (US Department of Defence)
3   0.9561   ?? FR FR FR ?? FR ??   |   RU RU RU ?? ?? RU ??
4   0.9523   US ?? US ?? ?? ?? ?? (US Navy)   |   NZ NZ NZ NZ NZ NZ NZ
5   0.9474   KR KR KR KR KR ?? KR   |   UA UA UA UA UA UA UA
18
Hacked Ford AS
20
  • Using random walks/diffusion to rank information
    networks

e.g. Google's PageRank made it $160 billion
21
Information networks
  • $3 \times 10^5$ Phys Rev articles connected by $3 \times 10^6$
    citation links
  • $10^{10}$ webpages in the world
  • To find relevant information one needs to
    efficiently search and rank!

22
Ranking webpages
  • Assign an importance factor $G_i$ to every webpage
  • Given a keyword (say, "jaguar"), find all the pages
    that have it in their text and display them in
    order of descending $G_i$
  • One solution still used in scientific publishing
    is $G_i = K_{in}(i)$ (the number of incoming links), but:
  • Too democratic: it doesn't take into account the
    importance of the nodes sending links
  • Easy to trick and artificially boost the ranking
    (for the WWW)

23
How Google works
  • Google's recipe (circa 1998) is to simulate the
    behavior of many virtual random surfers
  • PageRank $G_i$ ~ the number of virtual hits the
    page gets. It is also the steady-state number
    of random surfers at a given page
  • Popular pages send more surfers your way →
    PageRank is $K_{in}$ weighted by the popularity of the
    webpage sending each hyperlink
  • Surfers get bored following links → with
    probability $\alpha = 0.15$ a surfer jumps to a randomly
    selected page (not following any hyperlinks)
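
The random-surfer recipe can be sketched as a simple iteration. This is an illustration under stated assumptions, not Google's implementation: the link structure is made up, ranks are normalized so that the average rank is 1, and surfers on a page without outgoing links are assumed to jump to a random page.

```python
def pagerank(out_links, alpha=0.15, iterations=100):
    """Steady-state density of random surfers; the average rank is 1.

    out_links: dict {page: list of pages it links to}.
    """
    pages = list(out_links)
    n = len(pages)
    G = {p: 1.0 for p in pages}
    for _ in range(iterations):
        total = sum(G.values())
        # Assumption: surfers stuck on pages without out-links jump at random.
        dangling = sum(G[p] for p in pages if not out_links[p])
        base = (alpha * total + (1 - alpha) * dangling) / n
        new = {p: base for p in pages}
        for p in pages:
            for q in out_links[p]:
                # With probability 1 - alpha follow one of p's hyperlinks.
                new[q] += (1 - alpha) * G[p] / len(out_links[p])
        G = new
    return G


# Made-up web of four pages; nobody links to "d".
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

A page that nobody links to ends up with rank $\alpha$, supplied entirely by random jumps, while pages linked from popular pages collect the most surfers.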

24
  • How communities in the WWW influence Google
    ranking

H. Xie, K.-K. Yan, SM, cond-mat/0409087, physics/0510107,
Physica A 373 (2007) 831-836
25
How do WWW communities influence their average Gi?
  • Pages in a web-community preferentially link to
    each other. Examples:
  • Pages from the same organization (e.g. EPFL)
  • Pages devoted to a common topic (e.g. physics)
  • Pages in the same geographical location (e.g.
    Switzerland)
  • Naïve argument: communities trap random surfers
    so that they spend more time inside → they should
    increase the average Google ranking of the community

26
Test of a naïve argument
[Plot: $\log_{10}(\langle G \rangle_c)$ versus the number of intra-community links, for Community 1 and Community 2]
  • The naïve argument is wrong: it could go either way

27
[Diagram labels: $E_{ww}$, $E_{cc}$]
28
  • $G_c$: average Google rank of pages in the
    community; $G_w \approx 1$ in the outside world
  • $E_{cw} G_c/\langle K_{out} \rangle_c$: current from C to W
  • It must be equal to $E_{wc} G_w/\langle K_{out} \rangle_w$:
    current from W to C
  • Thus $G_c$ depends on the ratio between $E_{cw}$ and
    $E_{wc}$, the numbers of edges (hyperlinks) between
    the community and the world

29
Balancing currents for nonzero $\alpha$
  • $J_{cw} = (1-\alpha) E_{cw} G_c/\langle K_{out} \rangle_c + \alpha G_c N_c$:
    current from C to W
  • It must be equal to
    $J_{wc} = (1-\alpha) E_{wc} G_w/\langle K_{out} \rangle_w + \alpha G_w N_w (N_c/N_w)$:
    current from W to C
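
Setting the two currents equal and solving for the community rank gives a one-line formula. The sketch below simply transcribes that balance; the function name `community_rank` and the numbers in the examples are made up for illustration.

```python
def community_rank(E_cw, E_wc, K_out_c, K_out_w, N_c, alpha=0.15, G_w=1.0):
    """Average rank G_c of a community from the balance of currents J_cw = J_wc.

    E_cw, E_wc: numbers of links from community to world and from world to community.
    K_out_c, K_out_w: average out-degrees inside the community and in the world.
    N_c: number of pages in the community.
    """
    # Current into the community: link-following plus random jumps.
    inflow = (1 - alpha) * E_wc * G_w / K_out_w + alpha * G_w * N_c
    # Current out of the community per unit of G_c.
    outflow_per_rank = (1 - alpha) * E_cw / K_out_c + alpha * N_c
    return inflow / outflow_per_rank


# Twice as many links coming in as going out, no random jumps.
print(community_rank(E_cw=10, E_wc=20, K_out_c=5, K_out_w=5, N_c=100, alpha=0.0))
# A community with no links to or from the world.
print(community_rank(E_cw=0, E_wc=0, K_out_c=5, K_out_w=5, N_c=100))
# Many incoming links, none going out.
print(community_rank(E_cw=0, E_wc=500, K_out_c=5, K_out_w=5, N_c=100))
```

For $\alpha = 0$ the result reduces to the ratio of incoming to outgoing links, weighted by the average out-degrees, while the random-jump terms pull the rank of poorly connected communities toward 1.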

30
What are the consequences?
  • For very isolated communities ($E_{cw}/E^{(r)}_{cw} < \alpha$ and
    $E_{wc}/E^{(r)}_{wc} < \alpha$) one has $G_c = 1$. Their Google rank is
    decoupled from the outside world!
  • Overall range: $\alpha < G_c < 1/\alpha$
31
WWW - the empirical data
  • We have data for 10 US universities (plus all UK
    and Australian universities)
  • Looked closely at UCLA and Long Island University
    (LIU)
  • UCLA has different departments
  • LIU has 4 campuses

32
$\alpha = 0.15$
33
$\alpha = 0.001$
Abnormally high PageRank
34
Top PageRank LIU websites for $\alpha = 0.001$ don't make
sense
  • #1 www.cwpost.liu.edu/cwis/cwp/edu/edleader/higher_ed/hear.html
  • #5 /higher_ed/index.html
  • #9 /higher_ed/courses.html

[Diagram labels: strongly connected component, World]
37
What about citation networks?
  • Better to use $\alpha = 0.5$ instead of $\alpha = 0.15$: people don't
    click through papers as easily as through
    webpages
  • Time arrow: papers only cite older papers. Small
    values of $\alpha$ give older papers an unfair advantage
  • New algorithm: CiteRank (as in PageRank). Random
    walkers start from recent papers, ~ $\exp(-t/\tau_d)$
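
A rough sketch of the idea follows. The slide gives only the starting distribution of walkers; the rest is an assumption made for illustration: at each step a walker follows a randomly chosen citation with probability $1 - \alpha$ and otherwise stops, and a paper is ranked by the total traffic passing through it. The papers, ages, and the value of `tau` are made up.

```python
import math


def citerank(cites, age, alpha=0.5, tau=2.0, steps=50):
    """Total traffic of walkers that start on recent papers and follow citations.

    cites: dict {paper: list of older papers it cites}.
    age:   dict {paper: age t}; a walker starts at a paper with weight exp(-t / tau).
    """
    papers = list(cites)
    start = {p: math.exp(-age[p] / tau) for p in papers}
    norm = sum(start.values())
    walkers = {p: s / norm for p, s in start.items()}
    traffic = dict(walkers)
    for _ in range(steps):
        nxt = {p: 0.0 for p in papers}
        for p, w in walkers.items():
            for q in cites[p]:
                # Assumption: with probability 1 - alpha follow one citation of p.
                nxt[q] += (1 - alpha) * w / len(cites[p])
        walkers = nxt
        for p in papers:
            traffic[p] += walkers[p]
    return traffic


# Made-up example: three papers, ages in years.
cites = {"new": ["mid", "old"], "mid": ["old"], "old": []}
age = {"new": 0.0, "mid": 3.0, "old": 10.0}
traffic = citerank(cites, age)
for paper in sorted(traffic, key=traffic.get, reverse=True):
    print(paper, round(traffic[paper], 4))
```

Starting the walkers on recent papers is what offsets the advantage that older papers would otherwise get from the time arrow.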

39
Summary
  • Diffusion and modules (communities) in a network
    affect each other
  • In the "hardware" part of the Internet
    (autonomous systems or routers), diffusion allows
    one to detect modules
  • In the "software" part:
  • a diffusion-like process is used for ranking
    (Google's PageRank)
  • WWW communities affect this ranking in a
    non-trivial way

40
THE END
42
Part 2: Opinion networks
"Extracting Hidden Information from Knowledge
Networks", S. Maslov and Y.-C. Zhang, Phys. Rev.
Lett. (2001).
"Exploring an opinion network for taste prediction:
an empirical study", M. Blattner, Y.-C. Zhang, and
S. Maslov, in preparation.
43
Predicting customers' tastes from their opinions
on products
  • Each of us has personal tastes
  • Information about them is contained in our
    opinions on products
  • Matchmaking: opinions of customers with tastes
    similar to mine could be used to forecast my
    opinions on untested products
  • The Internet allows one to do it on a large scale
    (see amazon.com and many others)

44
Opinion networks
[Figure: two examples of opinion networks. Opinions of movie-goers on movies: customers linked to movies. WWW: webpages linked to other webpages, with each link labeled as an opinion.]
45
Storing opinions
[Figure: opinions of customers on movies stored as a matrix of opinions $\Omega_{IJ}$ and as a network of opinions]
46
Using correlations to reconstruct customers'
tastes
  • Similar opinions → similar tastes
  • Simplest model:
  • Movie-goers → M-dimensional vector of tastes $T_I$
  • Movies → M-dimensional vector of features $F_J$
  • Opinions → scalar product
  • $\Omega_{IJ} = T_I \cdot F_J$

[Figure: network of opinions linking customers to movies]
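The simplest model above is easy to write down explicitly. The sketch below uses made-up taste and feature vectors with M = 2; the function name `opinion` is hypothetical.

```python
def opinion(taste, features):
    """Opinion of a customer on a movie: scalar product of taste and feature vectors."""
    return sum(t * f for t, f in zip(taste, features))


# Made-up example with M = 2 (say, "action" and "romance").
tastes = [
    [1.0, 0.0],  # customer 0 only cares about action
    [1.0, 0.0],  # customer 1 has the same tastes as customer 0
    [0.0, 1.0],  # customer 2 only cares about romance
]
features = [
    [2.0, 1.0],  # movie 0: mostly action
    [0.0, 3.0],  # movie 1: pure romance
]

omega = [[opinion(T, F) for F in features] for T in tastes]
for row in omega:
    print(row)
```

Customers 0 and 1 share a taste vector and therefore give identical rows of opinions, which is the correlation a matchmaker can exploit.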
47
Loop correlation
  • Predictive power ~ $1/M^{(L-1)/2}$
  • One needs many loops to best reconstruct
    unknown opinions

L = 5 known opinions: the predictive power for an
unknown opinion is $1/M^2$
[Figure label: an unknown opinion]
48
Main parameter: density of edges
  • The larger the density of edges p, the easier
    the prediction
  • At $p_1 \approx 1/N$ ($N = N_{customers} + N_{movies}$) macroscopic
    prediction becomes possible. Nodes are connected
    but vectors $T_I$ and $F_J$ are not fixed: ordinary
    percolation threshold
  • At $p_2 \approx 2M/N > p_1$ all tastes and features ($T_I$
    and $F_J$) can be uniquely reconstructed: rigidity
    percolation threshold

49
Real empirical data (EachMovie dataset) on
opinions of customers on movies: 5-star ratings
of 1600 movies by 73000 users, 1.6 million
opinions!
50
Spectral properties of $\Omega$
  • For $M < N$ the matrix $\Omega_{IJ}$ has N-M zero eigenvalues
    and M positive ones: $\Omega = R^\dagger \cdot R$
  • Using SVD one can diagonalize $R = U \cdot D \cdot V$
    such that matrices V and U are orthogonal,
    $V \cdot V^\dagger = 1$, $U^\dagger \cdot U = 1$, and D is diagonal.
    Then $\Omega = V^\dagger \cdot D^2 \cdot V$
  • The amount of information contained in $\Omega$:
    $NM - M(M-1)/2 \ll N(N-1)/2$, the number of off-diagonal
    elements

51
Recursive algorithm for the prediction of unknown
opinions
  • Start with $\Omega_0$, where all unknown elements are
    filled with $\langle \Omega \rangle$ (zero in our case)
  • Diagonalize and keep only the M largest eigenvalues
    and eigenvectors
  • In the resulting truncated matrix $\Omega_0$ replace
    all known elements with their exact values and go
    to step 1
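
The loop above can be sketched in plain Python. This is an illustration under assumptions rather than the authors' implementation: the opinion matrix is treated as a symmetric N×N matrix, unknown elements (including the diagonal) start at zero, and the diagonalization uses a simple Jacobi rotation method written out here so that the example has no dependencies. All function names and the example data are made up.

```python
import math


def jacobi_eigh(A, sweeps=50):
    """Eigenvalues and eigenvectors (columns of V) of a symmetric matrix."""
    n = len(A)
    A = [row[:] for row in A]
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < 1e-20:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                a, d = A[p][q], A[q][q] - A[p][p]
                if abs(a) < 1e-15:
                    continue
                # Rotation angle (|theta| <= pi/4) that zeroes A[p][q].
                theta = 0.5 * (math.atan2(2 * a, d) if d >= 0 else math.atan2(-2 * a, -d))
                c, s = math.cos(theta), math.sin(theta)
                for mat in (A, V):  # rotate columns p and q of A and of V
                    for k in range(n):
                        x, y = mat[k][p], mat[k][q]
                        mat[k][p], mat[k][q] = c * x - s * y, s * x + c * y
                for k in range(n):  # rotate rows p and q of A
                    x, y = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * x - s * y, s * x + c * y
    return [A[i][i] for i in range(n)], V


def predict_opinions(known, n, M, iterations=200):
    """Iteratively fill in unknown elements of a symmetric n x n opinion matrix.

    known: dict {(i, j): value} of known elements (stored symmetrically).
    """
    omega = [[known.get((i, j), known.get((j, i), 0.0)) for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        vals, V = jacobi_eigh(omega)
        top = sorted(range(n), key=lambda m: vals[m], reverse=True)[:M]
        # Keep only the M largest eigenvalues and their eigenvectors.
        omega = [[sum(vals[m] * V[i][m] * V[j][m] for m in top) for j in range(n)]
                 for i in range(n)]
        # Restore the exact values of all known elements.
        for (i, j), x in known.items():
            omega[i][j] = omega[j][i] = x
    return omega


# Made-up test case with M = 1: hidden numbers r_i, opinions r_i * r_j.
r = [1.0, 2.0, -1.0, 0.5, 1.5]
known = {(i, j): r[i] * r[j] for i in range(5) for j in range(i + 1, 5) if (i, j) != (0, 4)}
predicted = predict_opinions(known, n=5, M=1)
print("predicted:", predicted[0][4], "hidden value:", r[0] * r[4])
```

The printed prediction can be compared with the hidden value to see how well the iteration recovers an element that was never supplied.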

52
Convergence of the algorithm
  • Above $p_2$ the algorithm converges exponentially
    to the exact values of unknown elements
  • The rate of convergence scales as $(p - p_2)^2$

53
Reality check: sources of errors
  • Customers are not rational!
    $\Omega_{IJ} = r_I \cdot b_J + \epsilon_{IJ}$ (idiosyncrasy)
  • Opinions are delivered to the matchmaker through
    a narrow channel:
  • Binary channel: $S_{IJ} = \mathrm{sign}(\Omega_{IJ})$, 1 or 0 (liked or
    not)
  • Experience rated on a scale of 1 to 5, or 1 to 10 at
    best
  • If the number of edges K and the size N are large,
    while M is small, these errors could be reduced

54
How to determine M?
  • In real systems M is not fixed: there are always
    finer and finer details of tastes
  • Given the number of known opinions K, one should
    choose $M_{eff} \sim K/(N_{readers} + N_{books})$ so that systems
    are below the second transition $p_2$ → tastes
    should be determined hierarchically

55
Avoid overfitting
  • Divide known votes into training and test sets
  • Select $M_{eff}$ so as to avoid overfitting!
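
One way to organize such a check is sketched below. Everything here is hypothetical scaffolding: `fit` stands for any procedure that takes the training votes and a value of M and returns a prediction function, and the helper names and the dummy model in the example are made up.

```python
import random


def split_votes(known, test_fraction=0.2, seed=0):
    """Randomly split known votes {(customer, movie): rating} into train and test sets."""
    items = sorted(known.items())
    random.Random(seed).shuffle(items)
    n_test = max(1, int(len(items) * test_fraction))
    return dict(items[n_test:]), dict(items[:n_test])


def held_out_error(predict, test):
    """Root-mean-square error of predict(customer, movie) on the held-out votes."""
    sq = [(predict(i, j) - x) ** 2 for (i, j), x in test.items()]
    return (sum(sq) / len(sq)) ** 0.5


def select_M(known, candidates, fit, **split_args):
    """Pick the M whose model, fitted on training votes only, has the lowest test error."""
    train, test = split_votes(known, **split_args)
    return min(candidates, key=lambda M: held_out_error(fit(train, M), test))


# Dummy stand-in for a real model: it ignores the data and always predicts M.
def dummy_fit(train, M):
    return lambda customer, movie: float(M)


votes = {(i, j): 2.0 for i in range(5) for j in range(4)}
best_M = select_M(votes, candidates=[1, 2, 3], fit=dummy_fit)
print(best_M)
```

With a real model, too large an M would fit the training votes well but predict the held-out votes poorly, and the minimum of the test error is what picks $M_{eff}$.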

56
Knowledge networks in biology
  • Interacting biomolecules: key and lock principle
  • Matrix of interactions (binding energies):
    $\Omega_{IJ} = k_I \cdot l_J + l_I \cdot k_J$
  • Matchmaker (bioinformatics researcher) tries to
    guess yet unknown interactions based on the
    pattern of known ones
  • Many experiments measure $S_{IJ} = \theta(\Omega_{IJ} - \Omega_{th})$

[Diagram labels: k(1), k(2), l(2), l(1)]
57
THE END